HW8. Machine Learning II (16.04)
Use the following data in the next three tasks to predict the "play" column.
Outlook   Temp  Humidity  Windy  play
Sunny     Hot   High      FALSE  No
Sunny     Hot   High      TRUE   No
Overcast  Hot   High      FALSE  Yes
Rainy     Mild  High      FALSE  Yes
Rainy     Cool  Normal    FALSE  Yes
Rainy     Cool  Normal    TRUE   No
Overcast  Cool  Normal    TRUE   Yes
Sunny     Mild  High      FALSE  No
Sunny     Cool  Normal    FALSE  Yes
Rainy     Mild  Normal    FALSE  Yes
Sunny     Mild  Normal    TRUE   Yes
Overcast  Mild  High      TRUE   Yes
Overcast  Hot   Normal    FALSE  Yes
Rainy     Mild  High      TRUE   No
Classify the example below by manually simulating the algorithms in tasks 1 to 3:
Sunny Cool High TRUE ???
1. Build a decision tree simulating ID3 with Information Gain. Classify the example.
2. Build a Naïve Bayes classifier. Classify the example.
3. Classify the example using the k-nearest neighbour method with k = 1, 3, and 5. As a distance measure you can, for example, count how many variables differ (Hamming distance); optionally, let ordinal values such as hot/mild/cool differ by a graded distance (e.g. hot-cool = 2, hot-mild = 1), which gives a Manhattan distance.
4. Use the "Bank Marketing" data set. Apply a decision tree and a Naïve Bayes classifier to this data. Perform 5-fold cross-validation (withholding 20% of the data each time) and report the quality of the resulting classifiers.
5. Apply the Random Forest method to the same data set. Report the quality in the same way as in task 4. Draw an ROC curve for the RF classifier. On what basis are the predictions scored and ordered to obtain the ROC curve?
6. (Bonus 1p) Extract the most relevant features from the Bank Marketing data by inspecting the random forest, decision tree, and Naïve Bayes classifiers. Justify and interpret these features.
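
As a way to check the hand calculation in task 1, the following is a minimal Python sketch of the entropy and information gain computation that ID3 uses to choose the root split on the weather data above. It is not part of the assignment itself: the names ROWS, ATTRIBUTES, entropy, information_gain, gains, and best_attribute are illustrative assumptions, and the sketch covers only the first split, not the full tree.

```python
from collections import Counter
from math import log2

# Attribute names in column order (illustrative constant, not from the handout).
ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Windy"]

# Rows copied from the weather table; the last field is the "play" label.
ROWS = [
    ("Sunny", "Hot", "High", "FALSE", "No"),
    ("Sunny", "Hot", "High", "TRUE", "No"),
    ("Overcast", "Hot", "High", "FALSE", "Yes"),
    ("Rainy", "Mild", "High", "FALSE", "Yes"),
    ("Rainy", "Cool", "Normal", "FALSE", "Yes"),
    ("Rainy", "Cool", "Normal", "TRUE", "No"),
    ("Overcast", "Cool", "Normal", "TRUE", "Yes"),
    ("Sunny", "Mild", "High", "FALSE", "No"),
    ("Sunny", "Cool", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "Normal", "FALSE", "Yes"),
    ("Sunny", "Mild", "Normal", "TRUE", "Yes"),
    ("Overcast", "Mild", "High", "TRUE", "Yes"),
    ("Overcast", "Hot", "Normal", "FALSE", "Yes"),
    ("Rainy", "Mild", "High", "TRUE", "No"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    # Group the class labels by the value of the chosen attribute.
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Information gain of every attribute at the root of the tree.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name:9s} gain = {gain:.3f}")
    print("Root split:", best_attribute)
```

To continue the ID3 simulation under the same assumptions, filter ROWS to the subset for each value of the chosen attribute and call information_gain again on the remaining attributes of that subset.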