Homework 7 (HW07) - Machine learning III
Exercise 1 (EX1) (2 points) Understanding receiver operating characteristic (ROC)
The purpose of this exercise is to deepen your understanding of ROC curves and the points on such a curve. For that we are going to use synthetic data that you can access from the train.csv, val.csv and test.csv files. You can visualise the data using:
library(ggplot2)
ggplot(train, aes(x = V1, y = V2, color = y)) +
  geom_point(alpha = 0.5, size = 2) +
  theme_bw()
Train a good old random forest classifier using the rf method from caret (the function make.names is meant to generate proper categorical output labels):
library(caret)
set.seed(10) # important for correct behaviour!
ctrl <- trainControl(method = "none", number = 1, repeats = 1, classProbs = TRUE)
rf_fit <- train(make.names(y) ~ ., data = train, method = 'rf', trControl = ctrl)
(a) Use the validation data, the trained model and the function predict to extract probability scores for each validation example. Note that the output of the predict function will be a data.frame with two columns. These columns represent the probability scores for each validation example to be of one class or the other. These probabilities sum up to 1, thus we can use only one of them. Run the code below.
probs <- predict(rf_fit, newdata = val, type = 'prob')[,2]
Note that type = 'prob' means that we want probability scores as the output.
(b) Get all possible cutoffs (thresholds) from the extracted probabilities. How can you interpret these values?
cutoffs <- sort(unique(probs))
(c) Compute the true positive rate (TPR) and the false positive rate (FPR) for all possible cutoffs using the validation data:
tpr <- sapply(cutoffs, function(cut) sum((probs >= cut) == TRUE & val$y == 1) / sum(val$y == 1))
fpr <- ## FILL IN ##
stats <- data.frame(cutoffs, tpr, fpr)
Print out the 6th row of the stats data.frame. Give an intuitive explanation of each number in this row.
(d) Use ggplot to make a figure where TPR is on the y-axis and FPR is on the x-axis. Using geom_point we will visualise all the TPR and FPR combinations we have computed as points on this figure.
roc_plot <- ggplot(stats, aes(x = fpr, y = tpr)) + geom_point() + theme_bw()
Now, instead of geom_point, use geom_path to obtain a line that connects all these points. This is the ROC curve for this classifier on the validation data.
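A minimal sketch of that change (the same data, with only the geom swapped):

# rebuild the plot so that consecutive cutoff points are connected by a line
roc_plot <- ggplot(stats, aes(x = fpr, y = tpr)) + geom_path() + theme_bw()
roc_plot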
Is the performance of the RF classifier better than a random guess? How would the ROC curve look for a random classifier?
(e) By now it should be clear that there are multiple ways in which a classifier could divide the probability scores into two classes (by picking a different cutoff). It is the right time to check which cutoff was used by our random forest classifier. For that, create a data.frame where one column contains the probability estimates and another the predicted classes:
rf_dt <- data.frame(class = predict(rf_fit, newdata = val, type = 'raw'), probs = probs)
Sort rf_dt by the probs column and report the cutoff used by the random forest to differentiate between 1s and 0s. What is the accuracy of our classifier given this default cutoff on the validation data?
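A minimal sketch of how this could be inspected (note that make.names turned the labels 0/1 into X0/X1, so the true labels are converted the same way before comparing):

# order by probability score to see where the predicted class switches
rf_dt[order(rf_dt$probs), ]

# accuracy with the default cutoff: fraction of predicted classes matching the true labels
mean(rf_dt$class == make.names(val$y))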
(f) Highlight this cutoff point on the ROC curve by running the following code (note that you need to fill in the relevant parts for it to work):
roc_plot <- roc_plot +
  geom_point(data = stats[which(stats$cutoffs == ?), ], aes(x = fpr, y = tpr), colour = "red", size = 3) +
  geom_label(data = stats[which(stats$cutoffs == ?), ], label = '?', nudge_y = 0.05)
roc_plot
(g) For each of the cutoffs in stats compute the number of false positives (FP) and false negatives (FN) on the validation data and also the potential accuracy of the classifier:
fp_count <- sapply(cutoffs, function(cut) sum((probs >= cut) == TRUE & val$y == 0))
fn_count <- sapply(cutoffs, function(cut) sum((probs >= cut) == FALSE & val$y == 1))
acc <- sapply(cutoffs, function(cut) (sum((probs >= cut) == FALSE & val$y == 0) + sum((probs >= cut) == TRUE & val$y == 1)) / length(probs))
Add them to stats:
stats <- data.frame(stats, fp_count, fn_count, acc)
(h) Now, think about any real-world classification problem (disease diagnosis, bird detection in an image, etc.); each of these problems has some cost associated with a misclassified example. Depending on the particular domain, the costs of FPs and FNs are different. In this part of the exercise, you need to define a new column cost in the stats data.frame according to three scenarios:
- Cost of FP equals to cost of FN
- Cost of FP is 3 times larger than cost of FN
- Cost of FN is 3 times larger than cost of FP
For each scenario find and report the optimal cutoff and the corresponding TPR, FPR, number of FPs and FNs, and accuracy of the classifier. For each scenario also add a corresponding point to the ROC curve (colour the points differently).
Here is an example code for the first scenario:
stats$cost <- stats$fp_count + stats$fn_count
stats[which.min(stats$cost), ]
roc_plot <- roc_plot +
  geom_point(data = stats[which.min(stats$cost), ], aes(x = fpr, y = tpr), colour = "blue", size = 3) +
  geom_label(data = stats[which.min(stats$cost), ], label = stats$cutoffs[which.min(stats$cost)], nudge_y = 0.05)
roc_plot
Report the final ROC curve with four points on it (one for the default cutoff and the other three related to the three scenarios above). Which of the points on the ROC curve (cutoffs) corresponds to the maximal possible accuracy of our random forest classifier? Store this cutoff.
(i) You are almost there; only one more thing is left (stop complaining, this is a 2-point exercise, after all!). Now, use the test data to test our random forest one more time (use the code above). Extract both the probability scores and the predicted classes (as we have done before).
Compute the accuracy of the random forest classifier based on the default cutoff and on the cutoff from task (h). Compare them and thoroughly explain the reasons for the difference that you observe.
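A minimal sketch of this comparison, assuming the test set is loaded as test and the cutoff stored in task (h) is in a variable called best_cutoff (both names are only illustrative):

# probability scores and predicted classes on the test set
test_probs <- predict(rf_fit, newdata = test, type = 'prob')[, 2]
test_class <- predict(rf_fit, newdata = test, type = 'raw')

# accuracy with the default cutoff vs. the cutoff chosen in task (h)
acc_default <- mean(test_class == make.names(test$y))
acc_tuned <- mean((test_probs >= best_cutoff) == (test$y == 1))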
Exercise 2 (EX2) (1 point) Evaluating the goodness of regression line
Dataset ex2 contains one feature (feature1) and a numeric label (label). Plot the data and overlay 3 functions (with different colors), where y stands for the predicted label:
- f1: y = 500
- f2: y = 40 * feature1 - 300
- f3: y = feature1^2 + 2 * feature1
To easily add function lines to a ggplot plot, write each function first as an R function, e.g.
f1 <- function(x) { return(500) }
And then you can add this line to the plot:
stat_function(fun=f1, colour="red")
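For example, the three lines could be overlaid in a single plot like this (a sketch, assuming the dataset is loaded as a data.frame called ex2):

library(ggplot2)

f1 <- function(x) { return(500) }
f2 <- function(x) { return(40 * x - 300) }
f3 <- function(x) { return(x^2 + 2 * x) }

ggplot(ex2, aes(x = feature1, y = label)) +
  geom_point(alpha = 0.5) +
  stat_function(fun = f1, colour = "red") +
  stat_function(fun = f2, colour = "blue") +
  stat_function(fun = f3, colour = "darkgreen") +
  theme_bw()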
Now answer the following questions:
(a) Which function seems to be the best fit for the data visually?
(b) Calculate the MSE (mean square error) and RMSE (root mean square error - just take sqrt(MSE)) for the three functions and our data. Program the functions for calculating MSE and RMSE yourself! Describe in a few sentences (or with an illustration) what MSE and RMSE measure.
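As a reminder of the definitions: for n data points with true labels y_i and predictions ŷ_i, MSE = (1/n) * Σ_i (y_i - ŷ_i)^2 and RMSE = sqrt(MSE).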
(c) Show the values of MSE and RMSE for each function.
(d) Which function is the best according to MSE and RMSE? Is it the same as what you found to be the best visually?
(e) Why do you think RMSE is often preferred to MSE?
Exercise 3 (EX3) (1 point) Stepwise learning of multi-variate linear regression models
The goal of this task is to study the options for feature selection in linear regression with the help of the R commands lm and step.
For this we will use the Auto-MPG dataset given here as CSV (this is a cleaned version of the dataset here at UCI repository).
We will solve the regression task of predicting how many miles a car can drive with a gallon of petrol (feature mpg), given all other features.
This is similar to the example in the lecture slides but has more cars in it and there are some differences in features as well.
(a) Read in the data, remove the name column and separate the data into training data d_train and test data d_test of equal sizes;
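A minimal sketch of such a split (the file name and the 50/50 random sampling are assumptions; adapt them to your setup):

auto = read.csv("auto_mpg.csv")   # assumed file name for the provided CSV
auto$name = NULL                  # drop the car name column

set.seed(10)                      # for a reproducible split
train_idx = sample(nrow(auto), floor(nrow(auto) / 2))
d_train = auto[train_idx, ]
d_test = auto[-train_idx, ]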
(b) define the following function to be used in evaluation:
MSE = function(model,data) mean((data$mpg-predict(model,newdata=data))^2)
(c) learn the empty linear model (without any features) and the full linear model (using all features)
empty_model = lm(formula = mpg ~ 1, data = d_train)
full_model = lm(formula = mpg ~ ., data = d_train)
(d) create a function to report the performance of a model:
add_to_report = function(report, model, d_train, d_test) {
  formula_string = as.character(formula(model)[3])
  n_features = (length(strsplit(formula_string, split = " ")[[1]]) + 1) / 2
  df = data.frame(formula = formula_string, n_features = n_features,
                  train_mse = MSE(model, d_train), test_mse = MSE(model, d_test))
  report = rbind(report, df)
  return(report)
}
Create an empty report and add the empty and full models to it:
report = data.frame()
report = add_to_report(report, empty_model, d_train, d_test)
report = add_to_report(report, full_model, d_train, d_test)
print(report)
(e) start eliminating features from the full model until AIC (Akaike information criterion [not covered in the lectures]) stops improving.
This can be done using the function step, which can be called as follows:
model = step(full_model,direction='backward',steps=n_steps)
where n_steps is the number of steps you want to be made (i.e. the number of features you want to be eliminated).
Please include the results after each step in the report table (i.e. after step 1, after step 2, etc., until convergence). This can be done by calling step multiple times with steps equal to 1, 2, ...
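One way to do this is a loop that keeps increasing steps until the selected model stops changing (a sketch with assumed variable names; trace = 0 merely silences step's printout):

prev_rhs = ""
k = 1
repeat {
  model_k = step(full_model, direction = 'backward', steps = k, trace = 0)
  rhs = paste(deparse(formula(model_k)[[3]]), collapse = " ")
  if (identical(rhs, prev_rhs)) break  # AIC stopped improving, no further feature was removed
  report = add_to_report(report, model_k, d_train, d_test)
  prev_rhs = rhs
  k = k + 1
}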
(f) start adding features to the empty model until AIC stops improving, again using the function step:
model = step(empty_model,direction='forward',scope=formula(full_model),steps=n_steps)
Again evaluate the model after each step and include it in the report (the same loop idea as in (e) can be reused, with direction='forward' and the scope argument shown above).
(g) show the final report table and discuss the results. Which method was better, eliminating or adding, (e) or (f)? As you saw, sometimes a model with fewer features can work better than the full model. Why? How can that be?
(h) As a comparison, pick the features that have p-values (Pr(>|t|)) below 0.001 and build a new linear model only on those, with lm. Is it better or worse than the models obtained with stepwise modelling? What if you change the p-value threshold from 0.001 to 0.01 or 0.1?
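A sketch of how such a model could be built (assuming all remaining features are numeric, so the coefficient names in the summary match the column names):

# features whose coefficients in the full model have p-values below the threshold
p_threshold = 0.001
coefs = summary(full_model)$coefficients
selected = setdiff(rownames(coefs)[coefs[, "Pr(>|t|)"] < p_threshold], "(Intercept)")

pval_model = lm(reformulate(selected, response = "mpg"), data = d_train)
report = add_to_report(report, pval_model, d_train, d_test)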
(i) Repeat all of the above analysis by splitting the data so that the training set has only 20 instances. Please do not show all the calculations here in the report, just show the final results table. Discuss the results. What is the winning method now? Which model is best? Are the results different from the case with a larger training set, and why?
(j) Choose the best of the models that you have learned so far in this exercise and extract its coefficients with the command coef(model) (note that you need to repeat the command that you applied to learn this model, because the model itself was not stored during reporting). Do the signs (positive/negative) of the coefficients make sense, thinking of the meaning of the features in this task?
Bonus exercise 1 (BEX1) (1 point) Art of machine learning
Last week, for EX3, we used a grid of points to visualise the classifier's predictions. We used class predictions to colour points into two colours according to the predicted class. Now that we know how to extract probability scores from trained models, we can attempt to visualise them to gain more insight. Use the test data from last week's EX3 (100 data points) to make a contour plot. For that, use the library plotly. Try out different styles of layout, colour schemes, and classifiers (at least 4, but the more the merrier). Combine several plots into one using the function subplot from the plotly package. Try to make your visualisation as unique as possible, to convince us that it is worth a bonus point (leave this exercise for individual work). Interpret the obtained visualisation(s). Is there anything new you learned about the tested classifiers?