Homework 6 (HW06) - Machine learning II
Exercise 1 (EX1) (1 point)
There is information about the predictions of two classifiers, A and B, in ex1.csv (the first two columns), together with the true labels (third column). The labels indicate whether a patient had a disease or not: label 0 stands for no disease and label 1 stands for disease. Build a confusion matrix for both classifiers by substituting the ?? in the table given below with the appropriate numbers. Add TP, FP, TN, FN in the correct parentheses.
ACTUAL\PREDICTED | DISEASE (1) | NO DISEASE (0) | TOTAL |
---|---|---|---|
DISEASE (1) | ?? (?) | ?? (?) | ?? |
NO DISEASE (0) | ?? (?) | ?? (?) | ?? |
TOTAL | ?? | ?? | ?? |
For both matrices, calculate (show the calculation; it would be good to write the formulas using TP, FP, etc. notation) and interpret (the definition of interpretation is given below) the following measures:
- accuracy
- precision
- recall
- F-measure (since direct interpretation is difficult for this one, just try to give an intuitive understanding here)
By interpretation we mean that you should come up with an easily understandable sentence that explains the measure and the result in a clear way (one that would be understandable to everyone). For example, if we got an accuracy of 0.87 on this data, it could be interpreted in the following way: 87% of the patients were diagnosed correctly.
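For reference, a minimal R sketch of how these counts and measures can be computed from the raw predictions; the positional column indexing is an assumption based on the file layout described above:

```r
ex1 <- read.csv("ex1.csv")
pred  <- ex1[[1]]  # predictions of classifier A (use ex1[[2]] for B)
truth <- ex1[[3]]  # true labels

TP <- sum(pred == 1 & truth == 1)  # sick patients flagged as sick
FP <- sum(pred == 1 & truth == 0)  # healthy patients flagged as sick
TN <- sum(pred == 0 & truth == 0)  # healthy patients flagged as healthy
FN <- sum(pred == 0 & truth == 1)  # sick patients flagged as healthy

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f_measure <- 2 * precision * recall / (precision + recall)
```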
Write down for each measure which classifier (A or B) it prefers. Do all measures agree on the best classifier, or are there differences (if yes, where)? Which classifier do you think is better and why?
Exercise 2 (EX2) (1 point)
In this exercise we are going to use the bank marketing dataset downloaded from the UCI website. The data is about direct marketing campaigns of a Portuguese bank. The marketing campaigns were based on phone calls; often, more than one call to the same client was required in order to determine whether the product (a bank loan) would be taken ('1') or not ('0') by the given client. Bank executives have asked you to build a classifier that would help them predict the outcome of the campaign for new clients.
You are given training data (training.csv) and two test datasets (testing_1.csv and testing_2.csv). The problem is that the training data is heavily imbalanced, with 95% of the instances reporting a negative outcome of the marketing campaign (bad PR). You need to take this fact into account.
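It is worth confirming the imbalance yourself before modelling; a quick sketch (the label column name y is an assumption, taken from the training call below):

```r
training <- read.csv("training.csv")
prop.table(table(training$y))  # should show roughly 95% of '0' labels
```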
First, use code from the previous homework to build a RandomForest classifier (`method = 'rf'`):

```r
ctrl <- trainControl(method = "none", number = 1, repeats = 1)
rf_fit <- train(as.factor(y) ~ ., data = training, method = 'rf', trControl = ctrl)
```
Then, use the function `predict` to test your model on both testing datasets, and use `confusionMatrix()` to interpret the results. How would you assess the performance of the model?
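A minimal sketch of this evaluation step, assuming the test file is loaded into testing_1 and the label column is again named y:

```r
library(caret)
testing_1 <- read.csv("testing_1.csv")

pred_1 <- predict(rf_fit, newdata = testing_1)
# positive = "1" tells caret to treat 'loan taken' as the positive class
confusionMatrix(pred_1, as.factor(testing_1$y), positive = "1")
```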
Then, use the `ROSE` package to implement over-, under-, and a combination of over- and under-sampling approaches on our imbalanced data. For this, use the function `ovun.sample` and set the `method` parameter accordingly. For oversampling and undersampling, generate a dataset that contains an equal number of positive and negative examples by tuning the `N` parameter. For the combination of the two, there is no need to strive for a perfectly equal number of positive and negative examples; just use `p = 0.5` and `N = 1000`. Briefly describe in the report the idea behind each of these sampling approaches.
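A sketch of the three calls; the concrete values of `N` for balancing are assumptions derived from the "equal number of positive and negative examples" requirement above:

```r
library(ROSE)

# Oversampling: replicate minority ('1') rows until the classes balance;
# N = twice the majority-class count yields a 50/50 split.
n_neg <- sum(training$y == 0)
train_over <- ovun.sample(y ~ ., data = training, method = "over",
                          N = 2 * n_neg)$data

# Undersampling: drop majority ('0') rows; N = twice the minority-class count.
n_pos <- sum(training$y == 1)
train_under <- ovun.sample(y ~ ., data = training, method = "under",
                           N = 2 * n_pos)$data

# Combination: sample in both directions towards p = 0.5 positives, N = 1000 rows.
train_both <- ovun.sample(y ~ ., data = training, method = "both",
                          p = 0.5, N = 1000)$data
```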
Now, train a RandomForest classifier on each of these three datasets and test each model on both testing datasets. Interpret the results using confusion matrices and the same measures as in EX1. Was there any improvement in performance compared to the original attempt to build a classifier on the initial data? Which sampling method worked best?
Exercise 3 (EX3) (2 points)
In this task you are going to generate your own data and build several classifiers for it. The aim is to investigate what the models "look like". We will end up with something like this.
a. First, generate the training data. The data will have 2 features (x-coordinate and y-coordinate) and a binary label. Generate 2 datasets: one with 20 datapoints and another with 100 datapoints. To generate a single datapoint, choose the x-coordinate uniformly at random (as a real number) from (0, 10), and do the same for the y-coordinate. As you have already understood, this data can be interpreted as 2D data. You have to assign the labels so that inside the circle defined by the formula (x - 5)^2 + (y - 5)^2 = 3^2 the points are positive and outside it the points are negative (note that here y stands for the feature, not the class label). By plugging the datapoint into the formula you can generate the label: class 1 if it is inside the circle, class 0 if outside.
Hint: use the `runif` command to generate random numbers.
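A possible sketch of the generation step (the helper name make_data and the exact label encoding are illustrative assumptions):

```r
# Label a point 1 if it falls inside the circle of radius 3 around (5, 5)
make_data <- function(n) {
  x <- runif(n, 0, 10)
  y <- runif(n, 0, 10)
  label <- factor(ifelse((x - 5)^2 + (y - 5)^2 <= 3^2, 1, 0))
  data.frame(x = x, y = y, label = label)
}

train20  <- make_data(20)
train100 <- make_data(100)
```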
b. Visualize all training data you generated (2 plots). The plots should look similar to this.
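One way to draw such a plot in base R, reusing the hypothetical train20 object from the sketch above (same call for train100):

```r
plot(train20$x, train20$y,
     col = ifelse(train20$label == 1, "red", "blue"),
     pch = 19, xlab = "x", ylab = "y", main = "Training data, n = 20")
```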
c. Now train classifiers on the data using 4 different classification methods (4 classification methods x 2 datasets = 8 classifiers); a sketch with possible caret method names is given after the note below.
- Decision tree
- Random forest
- SVM I (linear)
- SVM II (radial kernel)
Note: you can use other classification methods if you want to try out something else (but make sure there are 4 of them).
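One possible way to fit all four with caret; rpart, rf, svmLinear and svmRadial are the standard caret method identifiers for the list above, while the loop itself is just a sketch:

```r
library(caret)

methods <- c("rpart", "rf", "svmLinear", "svmRadial")
ctrl <- trainControl(method = "none")

# tuneLength = 1 so that exactly one candidate model is fitted per method
# when resampling is switched off
fits20 <- lapply(methods, function(m)
  train(label ~ ., data = train20, method = m,
        trControl = ctrl, tuneLength = 1))
names(fits20) <- methods
# repeat with train100 for the other four classifiers
```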
d. The aim is now to see which of the classification methods were able to learn the correct shape (a circle) from the training data and predict class 1 for the points inside the circle and class 0 for the other points (remember, the classifier has no information about the circle formula; it learns only from the information given to it by the training data). To do that, generate testing data from the same range, but this time as a grid with a small step (for example 0.25, or even smaller). For every testing point, predict the label with all 8 classifiers and plot the results. An example of a perfectly recognized circular shape on testing data (points generated to be on a grid) is shown below.
Hint: use the `seq` command to generate values between 0 and 10 with a step of, for example, 0.25 (you need to specify the correct parameters). Then use `expand.grid` to make all combinations of these points and end up with a grid dataset.
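A sketch of the grid-based testing; the step size and the plotting choices are assumptions, and fits20 refers to the hypothetical list from the earlier sketch:

```r
xs <- seq(0, 10, by = 0.25)
grid <- expand.grid(x = xs, y = xs)  # every (x, y) combination on the grid

# e.g. predictions of the random forest trained on the 20-point dataset
grid_pred <- predict(fits20[["rf"]], newdata = grid)
plot(grid$x, grid$y,
     col = ifelse(grid_pred == 1, "red", "blue"),
     pch = 19, cex = 0.4, xlab = "x", ylab = "y")
```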
e. Interpret the results. Which classifiers were able to recognize the circular shape? How did the different learning algorithms perform; are the results what you would have expected? Was the original training data size important, and how did it influence the results?