Institute of Computer Science
Data Mining (MTAT.03.183)

Data Mining 2012/13 fall


HW 10 (22.11) Machine Learning I

1. This is a small example data set recording who developed sunburn and who did not. The attributes and their value ranges are declared first.

@RELATION 'sunburn'
@ATTRIBUTE 'hair'   {blonde, brown, red}
@ATTRIBUTE 'height' {short, average, tall}
@ATTRIBUTE 'weight'   {light, average, heavy}
@ATTRIBUTE 'lotion' {yes,no}
@ATTRIBUTE 'burned' {burned, none}

@DATA
blonde,	average,light,	no, burned
blonde,	tall,	average,yes, none
brown,	short,	average,yes, none
blonde,	short,	average,no, burned
red,	average,heavy,	no, burned
brown,	tall,	heavy,	no, none
brown,	average,heavy,	no, none
blonde,	short,	light,	yes, none

Build a decision tree using the ID3 algorithm (manual simulation). Most importantly, calculate the information gain at the root node for each candidate attribute: the person's hair color, height, weight, and whether he/she uses sun lotion to prevent sunburn.
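For checking the hand calculation, the root-node quantities can be sketched in Python. This is a minimal sketch, assuming the usual ID3 definitions: entropy H(S) = -sum p_c log2 p_c over the classes c, and Gain(S, A) = H(S) - sum over values v of A of (|S_v| / |S|) H(S_v). The names `ROWS`, `ATTRIBUTES`, `entropy` and `information_gain` are illustrative and not part of the assignment.

```python
from collections import Counter
from math import log2

# The eight instances from the @DATA section; the last field is the class 'burned'.
ATTRIBUTES = ["hair", "height", "weight", "lotion"]
ROWS = [
    ("blonde", "average", "light", "no", "burned"),
    ("blonde", "tall", "average", "yes", "none"),
    ("brown", "short", "average", "yes", "none"),
    ("blonde", "short", "average", "no", "burned"),
    ("red", "average", "heavy", "no", "burned"),
    ("brown", "tall", "heavy", "no", "none"),
    ("brown", "average", "heavy", "no", "none"),
    ("blonde", "short", "light", "yes", "none"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# Gain of every candidate attribute at the root, printed from highest to lowest.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:7s} {gain:.4f}")
```

ID3 would place the attribute with the highest gain at the root and then repeat the same computation on each resulting subset of rows.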

2. Install WEKA from http://www.cs.waikato.ac.nz/~ml/weka or run it on a machine of your own choice. Describe the main functionality of the software. Run some tests.

3. Load the Titanic survival data set (Attach:titanic.arff.txt) into WEKA (WEKA reads the .arff format).

  • Run the decision tree algorithm (called J48)
  • Read and interpret the learned tree (e.g. re-draw it by hand)
  • Characterise the TP, FP, TN, FN rates, precision, recall, ... on this data.

How much can be learned from this data?
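The rates named in the list above all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `classification_metrics` and the example counts are made up for illustration and are not results from the Titanic data. It assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common rates from confusion-matrix counts (all denominators assumed non-zero)."""
    return {
        "tp_rate": tp / (tp + fn),  # recall / sensitivity
        "fn_rate": fn / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "tn_rate": tn / (fp + tn),  # specificity
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts, chosen only to demonstrate the formulas.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name:10s} {value:.3f}")
```

The counts to plug in can be read from the confusion matrix that WEKA prints with its classifier output.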

4. Netflix ran a $1M challenge for the best possible machine learning algorithm. A test set was used to measure the quality of the current best method (and to end the competition once the first team beat the state-of-the-art method by more than 10%). That is, everyone could evaluate their best algorithm against this test data and see their current standing in the rankings. But the final evaluation was performed on a third data set that was completely hidden from all contestants until the competition had ended. Why was that? Explain the reasons for using this third data set for evaluation.
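One effect relevant to this question can be explored with a toy simulation. This is a rough sketch under strong assumptions, not a model of the actual Netflix setup: every "model" is a seeded coin flipper with no real skill, the labels are random bits, and the names `predict` and `simulate` and all set sizes are hypothetical. The best of many such models is selected by its score on a small leaderboard set and then scored once on a large hidden set.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def predict(model_id, split, n):
    """A 'model' with no skill: deterministic pseudo-random guesses per model and split."""
    rng = random.Random(f"model-{model_id}-{split}")
    return [rng.randint(0, 1) for _ in range(n)]


def simulate(n_models=500, n_leaderboard=100, n_hidden=10_000, seed=0):
    rng = random.Random(seed)
    leaderboard_labels = [rng.randint(0, 1) for _ in range(n_leaderboard)]
    hidden_labels = [rng.randint(0, 1) for _ in range(n_hidden)]

    # Select the model with the best score on the repeatedly used leaderboard set.
    scores = {
        model_id: accuracy(predict(model_id, "leaderboard", n_leaderboard), leaderboard_labels)
        for model_id in range(n_models)
    }
    winner = max(scores, key=scores.get)

    # Score the selected model once on data that played no role in the selection.
    hidden_score = accuracy(predict(winner, "hidden", n_hidden), hidden_labels)
    return scores[winner], hidden_score


leaderboard_score, hidden_score = simulate()
print(f"leaderboard accuracy of selected model: {leaderboard_score:.3f}")
print(f"hidden-set accuracy of selected model:  {hidden_score:.3f}")
```

Comparing the two printed numbers, and varying `n_models` or `n_leaderboard`, may help in formulating the explanation asked for above.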

5. Think of ONE business- or science-related problem to which you would like to directly apply a machine learning method. Explain the need and the value. Estimate the business value or profit of a good classifier and the misclassification costs of type I and type II errors.

6. Bonus (2p): Read the article by Domingos, "A few useful things to know about machine learning", Communications of the ACM, Vol. 55, No. 10, pp. 78-87, doi: 10.1145/2347736.2347755 (via the ACM Digital Library, Attach:domingos.pdf). Make a list of its key messages, each with a supporting 1-2 sentence example or clarification (a kind of condensed summary of the article).

The proprietary copyrights of educational materials belong to the University of Tartu. The use of educational materials is permitted for the purposes and under the conditions provided for in the copyright law for the free use of a work. When using educational materials, the user is obligated to give credit to the author of the educational materials.
The use of educational materials for other purposes is allowed only with the prior written consent of the University of Tartu.