Homework 7 (HW07) - Machine learning III
Exercise 1 (EX1) (2 points) Understanding receiver operating characteristic (ROC)
The purpose of this exercise is to deepen your understanding of ROC curves and the points on such a curve. For that we are going to use synthetic data that you can access from the train.csv, val.csv and test.csv files. You can visualise the data using:
library(ggplot2)
ggplot(train, aes(x = V1, y = V2, color = y)) +
  geom_point(alpha = 0.5, size = 2) +
  theme_bw()
Train a good old random forest classifier using the rf method from caret (the function make.names is meant to generate proper categorical output labels):
library(caret)
set.seed(10) # important for correct behaviour!
ctrl <- trainControl(method = "none", number = 1, repeats = 1, classProbs = TRUE)
rf_fit <- train(make.names(y) ~ ., data = train, method = 'rf', trControl = ctrl)
(a) Use the validation data, the trained model and the function predict to extract probability scores for each validation example. Note that the output of the predict function will be a data.frame with two columns. These columns represent the probability scores for each validation example to be of one class or the other. These probabilities sum up to 1, thus we can use only one of them. Run the code below.
probs <- predict(rf_fit, newdata = val, type = 'prob')[,2]
Note that type = 'prob' means that we want probability scores as the output.
(b) Get all possible cutoffs (thresholds) from the extracted probabilities. How can you interpret these values?
cutoffs <- sort(unique(probs))
(c) Compute the true positive rate (TPR) and the false positive rate (FPR) for all possible cutoffs using the validation data:
tpr <- sapply(cutoffs, function(cut) sum((probs >= cut) == TRUE & val$y == 1) / sum(val$y == 1))
fpr <- ## FILL IN ##
stats <- data.frame(cutoffs, tpr, fpr)
Print out the 6th row of the stats data.frame. Give an intuitive explanation of each number in this row.
(d) Use ggplot to make a figure where TPR is on the y-axis and FPR is on the x-axis. Using geom_point we will visualise all the TPR and FPR combinations we have computed as points on this figure.
roc_plot <- ggplot(stats, aes(x = fpr, y = tpr)) + geom_point() + theme_bw()
Now, instead of geom_point, use geom_path to obtain a line that connects all these points. This is the ROC curve for this classifier on the validation data.
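A minimal sketch of that change (the same data, with only the geom swapped):

# rebuild the plot so that consecutive cutoff points are connected by a line
roc_plot <- ggplot(stats, aes(x = fpr, y = tpr)) + geom_path() + theme_bw()
roc_plot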
Is the performance of the RF classifier better than a random guess? How would the ROC curve look for a random classifier?
(e) By now it should be clear that there are multiple ways in which a classifier could divide the probability scores into two classes (by picking a different cutoff). It is the right time to check which cutoff was used by our random forest classifier. For that, create a data.frame where one column contains the probability estimates and another the predicted classes:
rf_dt <- data.frame(class = predict(rf_fit, newdata = val, type = 'raw'), probs = probs)
Sort rf_dt by the probs column and report the cutoff used by the random forest to differentiate between 1s and 0s. What is the accuracy of our classifier given this default cutoff on the validation data?
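A minimal sketch of how this could be inspected (note that make.names turned the labels 0/1 into X0/X1, so the true labels are converted the same way before comparing):

# order by probability score to see where the predicted class switches
rf_dt[order(rf_dt$probs), ]

# accuracy with the default cutoff: fraction of predicted classes matching the true labels
mean(rf_dt$class == make.names(val$y))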
(f) Highlight this cutoff point on the ROC curve by running the following code (note that you need to fill in the relevant parts for it to work):
roc_plot <- roc_plot +
  geom_point(data = stats[which(stats$cutoffs == ?), ], aes(x = fpr, y = tpr), colour = "red", size = 3) +
  geom_label(data = stats[which(stats$cutoffs == ?), ], label = '?', nudge_y = 0.05)
roc_plot
(g) For each of the cutoffs in stats compute the number of false positives (FP) and false negatives (FN) on the validation data and also the potential accuracy of the classifier:
fp_count <- sapply(cutoffs, function(cut) sum((probs >= cut) == TRUE & val$y == 0))
fn_count <- sapply(cutoffs, function(cut) sum((probs >= cut) == FALSE & val$y == 1))
acc <- sapply(cutoffs, function(cut) (sum((probs >= cut) == FALSE & val$y == 0) + sum((probs >= cut) == TRUE & val$y == 1)) / length(probs))
Add them to stats:
stats <- data.frame(stats, fp_count, fn_count, acc)
(h) Now, think about any real-world classification problem (disease diagnosis, bird detection in an image, etc.); each of these problems has some cost associated with a misclassified example. Depending on the particular domain, the costs of FPs and FNs are different. In this part of the exercise, you need to define a new column cost in the stats data.frame according to three scenarios:
- Cost of FP equals to cost of FN
- Cost of FP is 3 times larger than cost of FN
- Cost of FN is 3 times larger than cost of FP
For each scenario find and report the optimal cutoff and the corresponding TPR, FPR, number of FPs and FNs, and accuracy of the classifier. For each scenario also add a corresponding point to the ROC curve (colour the points differently).
Here is an example code for the first scenario:
stats$cost <- stats$fp_count + stats$fn_count
stats[which.min(stats$cost), ]
roc_plot <- roc_plot +
  geom_point(data = stats[which.min(stats$cost), ], aes(x = fpr, y = tpr), colour = "blue", size = 3) +
  geom_label(data = stats[which.min(stats$cost), ], label = stats$cutoffs[which.min(stats$cost)], nudge_y = 0.05)
roc_plot
Report the final ROC curve with four points on it (one for the default cutoff and the other three related to the three scenarios above). Which of the points on the ROC curve (cutoffs) corresponds to the maximal possible accuracy of our random forest classifier? Store this cutoff.
(i) You are almost there; only one more thing is left (stop complaining, this is a 2-point exercise, after all!). Now, use the test data to test our random forest one more time (use the code above). Extract both the probability scores and the predicted classes (as we have done before).
Compute the accuracy of the random forest classifier based on the default cutoff and on the cutoff from task (h). Compare them and thoroughly explain the reasons for the difference that you observe.
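A minimal sketch of this comparison, assuming the test set is loaded as test and the cutoff stored in task (h) is in a variable called best_cutoff (both names are only illustrative):

# probability scores and predicted classes on the test set
test_probs <- predict(rf_fit, newdata = test, type = 'prob')[, 2]
test_class <- predict(rf_fit, newdata = test, type = 'raw')

# accuracy with the default cutoff vs. the cutoff chosen in task (h)
acc_default <- mean(test_class == make.names(test$y))
acc_tuned <- mean((test_probs >= best_cutoff) == (test$y == 1))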
Exercise 2 (EX2) (1 point) Evaluating the goodness of regression line
Dataset ex2 contains one feature (feature1) and a numeric label (label). Plot the data and overlay 3 functions (with different colors), where y stands for the predicted label:
- f1: y = 500
- f2: y = 40 * feature1 - 300
- f3: y = feature1^2 + 2 * feature1
To easily add function lines to a ggplot plot, write each function first as an R function, e.g.
f1 <- function(x) { return(500) }
And then you can add this line to the plot:
stat_function(fun=f1, colour="red")
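For example, the three lines could be overlaid in a single plot like this (a sketch, assuming the dataset is loaded as a data.frame called ex2):

library(ggplot2)

f1 <- function(x) { return(500) }
f2 <- function(x) { return(40 * x - 300) }
f3 <- function(x) { return(x^2 + 2 * x) }

ggplot(ex2, aes(x = feature1, y = label)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = f1, colour = "red") +
  stat_function(fun = f2, colour = "blue") +
  stat_function(fun = f3, colour = "darkgreen") +
  theme_bw()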
Now answer the following questions:
(a) Which function seems to be the best fit for the data visually?
(b) Calculate the MSE (mean square error) and RMSE (root mean square error - just take sqrt(MSE)) for the three functions and our data. Program the functions for calculating MSE and RMSE yourself! Describe in a few sentences (or with an illustration) what MSE and RMSE measure.
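As a reminder of the definitions: for n data points with true labels y_i and predictions ŷ_i, MSE = (1/n) * Σ_i (y_i - ŷ_i)^2 and RMSE = sqrt(MSE).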
(c) Show the values of MSE and RMSE for each function.
(d) Which function is the best according to MSE and RMSE? Is it the same as what you found to be the best visually?
(e) Why do you think RMSE is often preferred to MSE?
Exercise 3 (EX3) (1 point) Stepwise learning of multi-variate linear regression models
The goal of this task is to study the options for feature selection in linear regression with the help of the R commands lm and step.
For this we will use the Auto-MPG dataset given here as CSV (this is a cleaned version of the dataset here at UCI repository).
We will solve the regression task of predicting how many miles a car can drive with a gallon of petrol (feature mpg), given all other features.
This is similar to the example in the lecture slides but has more cars in it and there are some differences in features as well.
(a) Read in the data, remove the name column and separate the data into training data d_train and test data d_test of equal sizes;
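A minimal sketch of such a split (the file name and the 50/50 random sampling are assumptions; adapt them to your setup):

auto = read.csv("auto_mpg.csv")   # assumed file name for the provided CSV
auto$name = NULL                  # drop the car name column

set.seed(10)                      # for a reproducible split
train_idx = sample(nrow(auto), floor(nrow(auto) / 2))
d_train = auto[train_idx, ]
d_test = auto[-train_idx, ]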
(b) define the following function to be used in evaluation:
MSE = function(model,data) mean((data$mpg-predict(model,newdata=data))^2)
(c) learn the empty linear model (without any features) and the full linear model (using all features)
empty_model = lm(formula = mpg ~ 1, data = d_train)
full_model = lm(formula = mpg ~ ., data = d_train)
(d) create a function to report the performance of a model:
add_to_report = function(report, model, d_train, d_test) {
  formula_string = as.character(formula(model)[3])
  n_features = (length(strsplit(formula_string, split = " ")[[1]]) + 1) / 2
  df = data.frame(formula = formula_string, n_features = n_features,
                  train_mse = MSE(model, d_train), test_mse = MSE(model, d_test))
  report = rbind(report, df)
  return(report)
}
Create an empty report and add the empty and full models to it:
report = data.frame()
report = add_to_report(report, empty_model, d_train, d_test)
report = add_to_report(report, full_model, d_train, d_test)
print(report)
(e) start eliminating features from the full model until AIC (Akaike information criterion [not covered in the lectures]) stops improving.
This can be done using the function step, which can be called as follows:
model = step(full_model,direction='backward',steps=n_steps)
where n_steps is the number of steps you want to be made (i.e. the number of features you want to be eliminated).
Please include the results after each step in the report table (i.e. after step 1, after step 2, etc., until convergence). This can be done by calling step multiple times with steps equal to 1, 2, ...
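One way to do this is a loop that keeps increasing steps until the selected model stops changing (a sketch with assumed variable names; trace = 0 merely silences step's printout):

prev_rhs = ""
k = 1
repeat {
  model_k = step(full_model, direction = 'backward', steps = k, trace = 0)
  rhs = paste(deparse(formula(model_k)[[3]]), collapse = " ")
  if (identical(rhs, prev_rhs)) break  # AIC stopped improving, no further feature was removed
  report = add_to_report(report, model_k, d_train, d_test)
  prev_rhs = rhs
  k = k + 1
}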
(f) start adding features to the empty model until AIC stops improving, again using the function step:
model = step(empty_model,direction='forward',scope=formula(full_model),steps=n_steps)
Again evaluate the model after each step and include it in the report (the same loop idea as in (e) can be reused, with direction='forward' and the scope argument shown above).
(g) show the final report table and discuss the results. Which method was better, eliminating or adding, (e) or (f)? As you saw, sometimes a model with fewer features can work better than the full model. Why? How can that be?
(h) As a comparison, pick the features that have p-values (Pr(>|t|)) below 0.001 and build a new linear model only on those, with lm. Is it better or worse than the models obtained with stepwise modelling? What if you change the p-value threshold from 0.001 to 0.01 or 0.1?
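A sketch of how such a model could be built (assuming all remaining features are numeric, so the coefficient names in the summary match the column names):

# features whose coefficients in the full model have p-values below the threshold
p_threshold = 0.001
coefs = summary(full_model)$coefficients
selected = setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < p_threshold], "(Intercept)")

pval_model = lm(reformulate(selected, response = "mpg"), data = d_train)
report = add_to_report(report, pval_model, d_train, d_test)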
(i) Repeat all of the above analysis by splitting the data so that the training set has only 20 instances. Please do not show all the calculations here in the report, just show the final results table. Discuss the results. What is the winning method now? Which model is best? Are the results different from the case with a larger training set, and why?
(j) Choose the best of the models that you have learned so far in this exercise and extract its coefficients with the command coef(model) (note that you need to repeat the command that you applied to learn this model, because the model itself was not stored during reporting). Do the signs (positive/negative) of the coefficients make sense, thinking of the meaning of the features in this task?
Bonus exercise 1 (BEX1) (1 point) Art of machine learning
Last week, for EX3, we used a grid of points to visualise the classifier's predictions. We used class predictions to colour points into two colours according to the predicted class. Now that we know how to extract probability scores from trained models, we can attempt to visualise them to gain more insight. Use the test data from last week's EX3 (100 data points) to make a contour plot. For that, use the library plotly. Try out different styles of layout, colour schemes, and classifiers (at least 4, but the more the merrier). Combine several plots into one using the function subplot from the plotly package. Try to make your visualisation as unique as possible, to convince us that it is worth a bonus point (leave this exercise for individual work). Interpret the obtained visualisation(s). Is there anything new you learned about the tested classifiers?