HW 10 (22.11) Machine Learning I

1. This small example data set records who developed sunburn and who did not. The attributes and their value ranges are declared first.

@RELATION 'sunburn'
@ATTRIBUTE 'hair'   {blonde, brown, red}
@ATTRIBUTE 'height' {short, average, tall}
@ATTRIBUTE 'weight'   {light, average, heavy}
@ATTRIBUTE 'lotion' {yes,no}
@ATTRIBUTE 'burned' {burned, none}

@DATA
blonde,average,light,no,burned
blonde,tall,average,yes,none
brown,short,average,yes,none
blonde,short,average,no,burned
red,average,heavy,no,burned
brown,tall,heavy,no,none
brown,average,heavy,no,none
blonde,short,light,yes,none

Build a decision tree using the ID3 algorithm (manual simulation). Most importantly, calculate the information gain at the root node for each candidate attribute: hair colour, height, weight, and whether the person uses sun lotion to prevent sunburn.
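To cross-check the manual calculation, the root-node information gain can be computed with a few lines of Python. This is only a sketch: the function names `entropy` and `information_gain` are chosen here for illustration, and entropy is assumed to be measured in bits (log base 2), as is usual for ID3.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["hair", "height", "weight", "lotion"]

# The eight rows of the @DATA section; the last field is the class 'burned'.
ROWS = [
    ("blonde", "average", "light", "no", "burned"),
    ("blonde", "tall", "average", "yes", "none"),
    ("brown", "short", "average", "yes", "none"),
    ("blonde", "short", "average", "no", "burned"),
    ("red", "average", "heavy", "no", "burned"),
    ("brown", "tall", "heavy", "no", "none"),
    ("brown", "average", "heavy", "no", "none"),
    ("blonde", "short", "light", "yes", "none"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Class entropy minus the weighted class entropy after splitting on one column."""
    base = entropy([row[-1] for row in rows])
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:7s} gain = {gain:.4f}")
```

ID3 places the attribute with the largest gain at the root and then repeats the same calculation on each resulting subset.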

2. Install WEKA from http://www.cs.waikato.ac.nz/~ml/weka on your own computer, or run it on any other machine available to you. Describe the main functionality of the software. Run some tests.

3. Load the Titanic survival data set Attach:titanic.arff.txt into WEKA (WEKA reads the .arff format).

Run the decision tree algorithm (called J48)
Read and interpret the learned tree (e.g. re-draw it by hand)
Characterise the TP, FP, TN, FN rates, precision, recall, ... on this data.
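The rates asked for above all follow from the four cells of the confusion matrix that WEKA prints. A minimal sketch of the definitions is given here; the counts passed in are placeholders for illustration only, not results from titanic.arff, and the function name `classification_metrics` is made up for this example.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common rates from the four cells of a binary confusion matrix."""
    return {
        "tp_rate": tp / (tp + fn),      # also called recall or sensitivity
        "fp_rate": fp / (fp + tn),
        "tn_rate": tn / (tn + fp),      # also called specificity
        "fn_rate": fn / (fn + tp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Placeholder counts; replace them with the values from WEKA's confusion matrix.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name:9s} = {value:.3f}")
```

Which class counts as "positive" is a choice: swapping the roles of the two classes swaps TP with TN and FP with FN.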

How much can be learned on this data?

4. Netflix ran a $1M challenge for the best possible machine learning algorithm. The test set was used to measure the goodness of the current best method (and to call the competition to an end once the first team beat the state-of-the-art method by more than 10%). That is, everyone could evaluate their best algorithm against this test data and see their current standing in the rankings. But the final evaluation was done on a third data set that was completely hidden from all contestants until the competition had ended. Why was that? Explain the reasons for using this third data set for evaluation.
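One way to build intuition for this question is a toy simulation, sketched below under strong simplifying assumptions: every "model" is a pure random guesser on a binary task (so its true accuracy is 0.5), and the sizes `N_MODELS`, `N_PUBLIC` and `N_HIDDEN` are arbitrary. None of this reflects the actual Netflix data or its scoring metric.

```python
import random

random.seed(0)
N_MODELS, N_PUBLIC, N_HIDDEN = 500, 1000, 1000


def accuracy_of_random_guesser(n_examples):
    """Accuracy of a model that guesses a binary label uniformly at random."""
    return sum(random.random() < 0.5 for _ in range(n_examples)) / n_examples


# Every submission is scored on the same public test set (the leaderboard).
public_scores = [accuracy_of_random_guesser(N_PUBLIC) for _ in range(N_MODELS)]

# Pick the leaderboard winner, then score that model once on unseen data.
best = max(range(N_MODELS), key=public_scores.__getitem__)
public_best = public_scores[best]
hidden_best = accuracy_of_random_guesser(N_HIDDEN)

print(f"leaderboard score of the winner: {public_best:.3f}")
print(f"score of the winner on hidden data: {hidden_best:.3f}")
```

Although no model is better than chance, the maximum over many leaderboard scores lies above 0.5, while the score on the hidden data is an unbiased estimate for the selected model.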

5. Think of ONE business- or science-related problem where you would like to apply a machine learning method directly. Explain the need and the value. Estimate the business value or profit of a good classifier and the misclassification costs of type I and type II errors.
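The estimate can be organised as an expected value per classified case. The sketch below assumes a binary problem in which a type I error is a false positive and a type II error is a false negative; the function name and all numbers are hypothetical and only show the arithmetic.

```python
def expected_profit_per_case(p_positive, tp_rate, fp_rate,
                             value_tp, cost_fp, cost_fn):
    """Expected profit per classified case for a binary classifier.

    value_tp: gain from a correctly detected positive
    cost_fp:  cost of a type I error (false positive)
    cost_fn:  cost of a type II error (false negative)
    """
    gain = p_positive * tp_rate * value_tp
    type_2_loss = p_positive * (1 - tp_rate) * cost_fn
    type_1_loss = (1 - p_positive) * fp_rate * cost_fp
    return gain - type_1_loss - type_2_loss


# Hypothetical figures: 10% positives, 80% of them detected, 5% false alarms.
profit = expected_profit_per_case(
    p_positive=0.10, tp_rate=0.80, fp_rate=0.05,
    value_tp=100.0, cost_fp=10.0, cost_fn=50.0,
)
print(f"expected profit per case: {profit:.2f}")
```

Multiplying the result by the expected number of cases per year gives a rough annual figure to compare against the cost of building and running the classifier.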

6. Bonus (2p). Read the article by Domingos: A few useful things to know about machine learning. Communications of the ACM, Vol. 55 No. 10, Pages 78-87, doi: 10.1145/2347736.2347755 (via ACM Digital Library, Attach:domingos.pdf). Make a list of the key messages, each with a supporting 1-2 sentence example or clarification (a kind of condensed summary of the article).

Data Mining 2012/13 fall
