Data Mining - Courses - Institute of Computer Science

HW6. Regression methods (02.04)

You can now submit to Kaggle (NB! there are new links)!

EX1. Dataset HW6_ex1.txt contains two features: x and y. Plot the functions y = x, y = 1.5x and y = 2x over the data. Implement RMSE (root mean squared error) and calculate it for each of the functions. Describe in a few sentences (or with an illustration) what RMSE measures. Which of the models describes the data best?
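As a rough illustration of the RMSE calculation, here is a minimal Python sketch. RMSE is the square root of the mean of the squared differences between observed and predicted values. The function name rmse and the toy numbers are assumptions made for this example; they are not taken from HW6_ex1.txt, whose format is not described here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values for illustration only; NOT taken from HW6_ex1.txt.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 3.0, 4.5, 6.0]

# Evaluate the three candidate lines y = x, y = 1.5x and y = 2x.
for slope in (1.0, 1.5, 2.0):
    predictions = [slope * x for x in xs]
    print(f"y = {slope}x  RMSE = {rmse(ys, predictions):.3f}")
```

With real data, xs and ys would be replaced by the two columns read from the file; the line with the smallest RMSE fits the data best in the least-squares sense.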

EX2. Use the red-wine dataset to build a regression model relating the quality score of red wines to the other measured features. There is some additional information about the features in this supplementary file. You can use standard pre-implemented tools such as lm() in R for model fitting. Build the model iteratively by adding one feature at a time, retaining features based on the p-values of their regression coefficients (e.g. using a p-value threshold of 0.05). Report the model building process and intermediate results. Interpret the included variables and the corresponding coefficients of the final model. Report the adjusted R-squared and the RMSE.

EX3. Use the iris dataset to test for a difference of means with a t-test. First present your hypothesis (what you are going to check with the t-test). To formulate the hypothesis, choose a feature and two different flower types (e.g. Virginica and Setosa; you can choose the pair yourself). Then check with a pre-implemented t-test method whether the means of that feature differ between the two flower types.

Next, recreate this calculation by implementing the two-sample t-test yourself. To do that, first calculate the t-statistic (you can find the formula online or in the lecture slides). Then use the distribution function of the t-distribution to convert the t-statistic to a p-value. Do the means of the two classes differ, according to the p-value? Interpret and explain all steps and answers.

Present a visual illustration that would support your conclusion (plot the distributions for example).

Hint: Make sure you get the same results with both methods.
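The manual two-sample t-test described above could be sketched roughly as follows in Python. This sketch assumes Welch's unequal-variance version of the test, which may differ from the formula in the lecture slides, and it approximates the t-distribution's distribution function by integrating the density numerically with Simpson's rule instead of calling a statistics library. All function names and the toy samples are made up for this example and are not iris measurements.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # sample variance of group a divided by n_a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=10000):
    """Two-sided p-value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)  # clamp tiny negative rounding errors


# Toy samples for illustration only; NOT iris measurements.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {dof:.2f}, p = {two_sided_p(t_stat, dof):.4f}")
```

When comparing against a pre-implemented t-test, check which variant it uses by default (equal or unequal variances), since the two variants give different degrees of freedom and therefore slightly different p-values.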

Extra knowledge and intuition: You can watch this video. It does not explain the t-test, but it might help you get a general idea of the logic behind this kind of testing.

EX4. In this task we are using diamonds data from the package ggplot2 (data(diamonds)). Build regression models predicting price from the rest of the features, where

A) model 1 has all the features
B) model 2 has all the features + 'carat' and 'depth' of degree 2
C) model 3 has all the features + 3rd degree polynomials of 'carat' and 'depth' (i.e. carat^3, carat^2, carat, depth^3,...)
D) model 4 has all the features + 3rd degree polynomials of 'carat' and 'depth' + 'x','y','z' of degree 2

In R you can use poly(x,d) to evaluate a polynomial of degree d, e.g. lm(price ~ poly(x,3) + ..., data=diamonds).

Use the regular 80% train / 20% test split. Measure the RMSE of all the models on the train and test sets and plot a graph where the x-axis lists the models sorted by complexity (A, B, C, D) and the y-axis shows RMSE (one line for train RMSE, another for test RMSE). What do you observe? Can you diagnose under- or overfitting problems?
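The split-and-compare procedure above can be sketched in plain Python on synthetic data. This is only an illustration of the workflow: it uses one made-up feature and polynomial degrees 1 to 3 instead of the diamonds models A to D, and it fits by solving the least-squares normal equations directly rather than calling lm(). All names and the generated data are assumptions made for this example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a*w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X'X) w = X'y."""
    rows = [[x ** k for k in range(degree + 1)] for x in xs]
    size = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve(xtx, xty)


def predict(w, x):
    return sum(coef * x ** k for k, coef in enumerate(w))


def rmse(ys, preds):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


# Synthetic data for illustration only; NOT the diamonds dataset.
random.seed(0)
xs = [random.uniform(0, 3) for _ in range(200)]
ys = [2 + 0.5 * x ** 2 + random.gauss(0, 0.3) for x in xs]

# Shuffle the row indices, then take 80% for training and 20% for testing.
indices = list(range(len(xs)))
random.shuffle(indices)
cut = int(0.8 * len(indices))
train, test = indices[:cut], indices[cut:]

results = {}
for degree in (1, 2, 3):
    w = fit_poly([xs[i] for i in train], [ys[i] for i in train], degree)
    results[degree] = tuple(
        rmse([ys[i] for i in part], [predict(w, xs[i]) for i in part])
        for part in (train, test)
    )
    print(f"degree {degree}: train RMSE = {results[degree][0]:.3f}, "
          f"test RMSE = {results[degree][1]:.3f}")
```

Because each higher-degree model contains the lower-degree one, the training RMSE cannot increase with complexity; the test RMSE is the one to watch for signs of overfitting.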

EX5. We have prepared a small Kaggle competition in which your task is to apply the regression techniques you have learnt to predict the numerical value of the target variable in the test set. The competition is to be done individually. As you make submissions, you will immediately see how they score (in terms of RMSE) on the public leaderboard. Everyone who makes their first submission before the next practice session will be awarded one point. A more detailed description is available on the competition page. You can join it here.

To earn points for this task, write here the name under which you made the submission (hopefully you will use your own name), as well as the result and the method you used (which method, whether you did feature selection, how you selected the model to submit, etc.; whatever seems reasonable to mention).

EX6 (bonus, until 23rd of April). For the above competition, students who finish in the top 25% will get 4 extra points, 2nd quartile - 2 extra points, 3rd - 1 point. Important! To avoid overfitting to the public data, Kaggle performs the public evaluation on only 50% of the test set. The predictions on the remaining 50% will be used in the final evaluation after the competition closes.

EX7 (bonus, 1p). Try ridge and lasso regression for model 4 from EX4 and add the resulting training and test RMSE to the plot generated in EX4. Explain in general terms what ridge and lasso regression do. Then interpret the RMSE outcomes.
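For reference, the usual textbook objectives for the two methods can be written as below. The notation is an assumption made for this note and may differ from the lecture slides: n observations, p features, coefficients \beta_j, and a penalty strength \lambda \ge 0 chosen by the user (typically by cross-validation). By the standard convention the intercept \beta_0 is not penalized.

```latex
\hat{\beta}^{\mathrm{ridge}}
  = \arg\min_{\beta}\; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
    + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\mathrm{lasso}}
  = \arg\min_{\beta}\; \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

With \lambda = 0 both reduce to ordinary least squares. Because the penalties depend on the scale of each feature, features are usually standardized before fitting.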

Data Mining 2016/17 spring
