HW09 (10.04) - Machine Learning II
1. Read the article by Domingos: "A Few Useful Things to Know About Machine Learning" (Communications of the ACM, Vol. 55, No. 10, pages 78-87, doi: 10.1145/2347736.2347755, via the ACM Digital Library, https://courses.cs.ut.ee/MTAT.03.183/2012_fall/uploads/Main/domingos.pdf). Make a list of its key messages, each with a supporting 1-2 sentence example or clarification of that message (i.e., a short summary of the article).
2. Draw the ROC curves and calculate the ROC AUC for 4 classifiers based on the following data: Attach:roc_data.zip. The file data.class contains the true classes, and roc1.txt etc. give the orders in which the different classifiers would classify examples as positive (so some are true positives, some false positives; after a certain cutoff there remain false negatives and true negatives).
- Hint: you can take the top k% of the ordered observations, "classify" them as positive, and calculate the corresponding TPR and FPR. Repeat this for k from 0 to 100 with some step; these pairs of (TPR, FPR) values will produce a ROC curve. k is known as the cutoff point.
- You can read a nice explanation of ROC curves from here.
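The sweep described in the hint can be sketched as follows. This is a minimal NumPy sketch, assuming the true labels are 0/1 and the ordering is a list of example indices in the order the classifier would call them positive (the toy `y_true`/`order` values are illustrative, not from the actual data files):

```python
import numpy as np

def roc_from_order(y_true, order):
    """Sweep the cutoff k over the classifier's ordering and collect
    the resulting (FPR, TPR) points; AUC via the trapezoidal rule."""
    y = np.asarray(y_true)[np.asarray(order)]  # labels in classification order
    n_pos, n_neg = y.sum(), len(y) - y.sum()
    # prepend the k = 0% point (nothing classified positive yet)
    tpr = np.concatenate(([0.0], np.cumsum(y) / n_pos))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / n_neg))
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # area under curve
    return fpr, tpr, auc

# toy check: a perfect ordering ranks all positives first, so AUC = 1.0
y_true = [1, 1, 0, 0]
order = [0, 1, 2, 3]
fpr, tpr, auc = roc_from_order(y_true, order)
print(auc)  # 1.0
```

Plotting `tpr` against `fpr` (e.g. with matplotlib) then gives the ROC curve for that classifier.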
3. Characterize the behavior of the 4 classifiers in task 2. Also, provide the "best" cutoff for each of the classifiers.
- Hint: If the cutoff k = 0%, everything is simply predicted as the negative class, so each truly positive label is (falsely!) classified as negative, i.e. TPR = 0%, and each truly negative label is also classified as negative, i.e. FPR = 0%. As we increase k, we recover more positive samples, i.e. TPR increases, but so does FPR (luckily, usually at a lower rate!). In the extreme case when k = 100%, everything is predicted positive, so TPR = 100%, but FPR will also be 100%.
- Obviously, we would prefer a classifier with the highest possible TPR and the lowest possible FPR. You can take the difference between them (TPR - FPR), called Youden's index, to find the "best" cutoff point.
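Once you have the (FPR, TPR) points per cutoff, Youden's index reduces to one line. A minimal sketch (the `fpr`/`tpr` values below are made-up illustrative numbers, not from the actual classifiers):

```python
import numpy as np

def best_cutoff_youden(fpr, tpr):
    """Return the index of the cutoff maximizing Youden's index J = TPR - FPR."""
    j = np.asarray(tpr) - np.asarray(fpr)
    return int(np.argmax(j)), float(j.max())

# illustrative ROC points at cutoffs k = 0%, 25%, 50%, 75%, 100%
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.6, 0.8, 0.9, 1.0]
idx, j = best_cutoff_youden(fpr, tpr)
print(idx, j)  # 1 0.5
```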
4. Use the data about housing (http://archive.ics.uci.edu/ml/datasets/Housing) and predict the last column by regression analysis; report the RMSE score.
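The fit-and-score step can be sketched as below. This is a minimal ordinary-least-squares sketch using a synthetic stand-in matrix of the same shape (506 rows, 13 attributes) instead of the downloaded housing file, so the printed RMSE is not the score you should report:

```python
import numpy as np

# synthetic stand-in for the housing table: X = attribute columns, y = last column
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
y = X @ rng.normal(size=13) + rng.normal(scale=0.5, size=506)

# simple 70/30 train/test split
n_train = int(0.7 * len(y))
X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

# ordinary least squares with an intercept column
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
pred = np.column_stack([np.ones(len(X_te)), X_te]) @ coef
rmse = np.sqrt(np.mean((y_te - pred) ** 2))
print(f"RMSE: {rmse:.3f}")
```

With the real data, replace the synthetic block by loading the file (e.g. with `np.loadtxt` or pandas) and splitting off the last column as `y`.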
5. Estimate every variable one by one using all the other attributes in this data set, and report the RMSE score for each. What are the most important predictors, and what are the most correlated ones? Are some variables "easier" to predict than others? If so, why?
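The one-by-one loop can be sketched like this, again with a small synthetic matrix standing in for the housing data (column 3 is constructed to depend strongly on column 0, so it comes out "easy" to predict):

```python
import numpy as np

def rmse_per_column(data):
    """For each column j, fit OLS on the remaining columns and record RMSE."""
    scores = {}
    n = len(data)
    for j in range(data.shape[1]):
        X = np.delete(data, j, axis=1)          # all other attributes
        y = data[:, j]                          # the target column
        A = np.column_stack([np.ones(n), X])    # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        scores[j] = np.sqrt(np.mean((y - A @ coef) ** 2))  # in-sample RMSE
    return scores

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4))
data[:, 3] = data[:, 0] + 0.1 * rng.normal(size=100)  # col 3 ~ col 0: easy
scores = rmse_per_column(data)
print(scores)
```

Note that raw RMSE values are only comparable across columns after accounting for each column's scale, which is worth mentioning when you discuss which variables are "easier" to predict.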
6. (Bonus 2p) Continue with the ROC task. Assume a different cost is assigned to each type of mistake, e.g. 20€ for missing a case (false negative) and 15€ for a false alarm (false positive), or vice versa. For each of the 4 classifiers, calculate from its ROC curve the optimal cutoff that minimizes the total cost. Then construct four such cost pairs, one per classifier, such that under each pair exactly that classifier provides the best classification.
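The cost minimization can be sketched as follows: at each cutoff, the expected cost is the FN cost times the number of missed positives plus the FP cost times the number of false alarms. The ROC points and class counts below are made-up illustrative values, not from the actual classifiers:

```python
import numpy as np

def min_cost_cutoff(fpr, tpr, n_pos, n_neg, cost_fn, cost_fp):
    """Total cost per cutoff = cost_fn * (missed positives) + cost_fp * (false alarms)."""
    fnr = 1 - np.asarray(tpr)                 # fraction of positives missed
    cost = cost_fn * fnr * n_pos + cost_fp * np.asarray(fpr) * n_neg
    k = int(np.argmin(cost))
    return k, float(cost[k])

# illustrative ROC points for one classifier, with 50 positives and 50 negatives
fpr = np.array([0.0, 0.1, 0.3, 1.0])
tpr = np.array([0.0, 0.6, 0.9, 1.0])
k, c = min_cost_cutoff(fpr, tpr, 50, 50, cost_fn=20, cost_fp=15)
print(k, c)  # 2 325.0
```

Re-running this for each classifier and each candidate cost pair lets you check which cost pair makes which classifier the cheapest choice.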