Data Mining (MTAT.03.183), 2015/16 spring


...

HW08 (03.04) - Machine Learning: getting started

1. Read http://www.r2d3.us/visual-intro-to-machine-learning-part-1/. What is the quality of the classifier? Can you tell when it works well and when it does not?

2. Use the small example data set below and build a decision tree manually, explaining all steps and choices (a sketch of the information-gain computation follows the table).

      Outlook   Temp  Humidity  Windy  Play
  1   Sunny     Hot   High      FALSE  No
  2   Sunny     Hot   High      TRUE   No
  3   Overcast  Hot   High      FALSE  Yes
  4   Rainy     Mild  High      FALSE  Yes
  5   Rainy     Cool  Normal    FALSE  Yes
  6   Rainy     Cool  Normal    TRUE   No
  7   Overcast  Cool  Normal    TRUE   Yes
  8   Sunny     Mild  High      FALSE  No
  9   Sunny     Cool  Normal    FALSE  Yes
 10   Rainy     Mild  Normal    FALSE  Yes
 11   Sunny     Mild  Normal    TRUE   Yes
 12   Overcast  Mild  High      TRUE   Yes
 13   Overcast  Hot   Normal    FALSE  Yes
 14   Rainy     Mild  High      TRUE   No
 15   Overcast  Cool  High      FALSE  No

Given overcast, mild weather with high humidity and strong wind - should one play tennis or not?
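
A minimal sketch of the information-gain computation behind the manual tree construction, assuming the table above is stored in a data frame named weather (a hypothetical name) with columns Outlook, Temp, Humidity, Windy, Play:

entropy <- function(y) {
  p <- table(y) / length(y)                        # class proportions
  -sum(ifelse(p > 0, p * log2(p), 0))              # Shannon entropy in bits
}

info_gain <- function(attr, y) {
  parts <- split(y, attr)                          # partition the Play labels by attribute value
  entropy(y) - sum(sapply(parts, function(s) length(s) / length(y) * entropy(s)))
}

# Gain of each candidate attribute; the attribute with the highest gain is the root split
sapply(weather[c("Outlook", "Temp", "Humidity", "Windy")], info_gain, y = weather$Play)

The same computation is repeated on each resulting subset to choose the next splits.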

3. Use the Cars data set and apply decision trees for classification. Describe the resulting tree (you can use R, Weka, or Python, for example). Compare the decision-tree approach to the association rules derived from the same data.

  • To make your life easier, we recommend removing the observations of the two infrequent classes, good and v-good. You can get the resulting data set here
  • In R, you can use the rpart library to build the trees, rpart.plot to visualize them, and e1071 to tune the model parameters (see the sketch after this list)
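  • A minimal sketch of this workflow; the file name cars.csv and the target column name class are assumptions, adjust them to the actual data:
library(rpart)
library(rpart.plot)

cars <- read.csv("cars.csv", stringsAsFactors = TRUE)    # filtered Cars data (hypothetical file name)
fit  <- rpart(class ~ ., data = cars, method = "class")  # classification tree for the target column
rpart.plot(fit)                                          # visualize the tree
printcp(fit)                                             # tree size vs. cross-validated error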

4. Use the same Cars data set. Apply decision trees and a Naive Bayes classifier to the same data. Can you confirm that one method is better than the other in some respect? Perform 10-fold cross-validation and report the average performance over the folds as confusion matrices together with accuracy, precision, and recall (a cross-validation sketch is given after the code notes below).

  • In R, you can obtain the confusion matrix for a single fold as follows:
model = rpart(class ~ ., ...)                        # fit a decision tree on the training fold
pred_bin = predict(model, test, type="class")        # predicted class labels for the test fold
table(predicted = pred_bin, actual = test$class)     # confusion matrix: predictions vs. actual classes
  • For more advanced evaluation (optional), you can try:
library(ROCR)
pred_probs = predict(model, test, type="prob")       # class probabilities instead of hard labels
pred = prediction(pred_probs[,2], test$class)        # column 2 = probability of the second class level
accuracy = performance(pred, measure = "acc") # get accuracy 
plot(accuracy) # how accuracy depends on the probability cutoff to separate class1 from class2
accuracy@y.values[[1]][max(which(accuracy@x.values[[1]] >= 0.5))] # accuracy when cutoff=0.5
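  • A minimal 10-fold cross-validation sketch for comparing the two classifiers, not a complete solution; it reuses the cars data frame and class column assumed above:
library(rpart)
library(e1071)                                       # provides naiveBayes()

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(cars)))  # random fold assignment

acc <- matrix(NA, nrow = 10, ncol = 2, dimnames = list(NULL, c("rpart", "naiveBayes")))
for (k in 1:10) {
  train <- cars[folds != k, ]
  test  <- cars[folds == k, ]
  tree  <- rpart(class ~ ., data = train, method = "class")
  nb    <- naiveBayes(class ~ ., data = train)
  acc[k, "rpart"]      <- mean(predict(tree, test, type = "class") == test$class)
  acc[k, "naiveBayes"] <- mean(predict(nb, test) == test$class)
}
colMeans(acc)                                        # average accuracy of each classifier over the folds
  • Per-fold precision and recall for each class can be read off the confusion matrix produced by table(), as in the snippet above.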

5. Use the Titanic data set. Compare the classifiers learned from the Titanic data - decision trees, Bayes rules, and association rules - and try to characterise the rules observed in the data with each approach. How can the results be interpreted against each other?

  • For the association rules, you can split the data set once, derive from the "training set" the top-k rules whose rhs contains the target class, and apply them to the "test set" to predict the class (a sketch follows below)
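  • A minimal sketch of the association-rule part, assuming the Titanic data is available as a data frame titanic in which all columns, including the target Survived, are factors (these names are assumptions):
library(arules)

trans <- as(titanic, "transactions")                 # one item per attribute=value pair
rules <- apriori(trans,
                 parameter  = list(supp = 0.01, conf = 0.6),
                 appearance = list(rhs = c("Survived=Yes", "Survived=No"), default = "lhs"))

top <- sort(rules, by = "confidence")                # rules that predict the class, best first
inspect(top[seq_len(min(10, length(top)))])          # inspect the top-10 rules
  • To classify a test observation, one simple strategy is to take the rhs of the highest-confidence rule whose lhs matches it.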

6. (Bonus 1p) How can you detect and avoid overfitting? What is a good (optimal?) size for the decision tree classifiers? Use the Cars data from above, and for comparison use one of the two data sets Mushroom or Connect 4. As an example, consult Figure 1.5 from Christopher Bishop, "Pattern Recognition and Machine Learning" (a pruning sketch is given below).
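
  • A minimal sketch, in R, of how tree size can be examined and controlled; it reuses the cars assumptions from above, and the same idea applies to the Mushroom or Connect 4 data:
library(rpart)

big <- rpart(class ~ ., data = cars, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))    # deliberately overgrown tree

plotcp(big)                                                    # cross-validated error vs. tree complexity
printcp(big)

best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned  <- prune(big, cp = best_cp)                            # prune back to the lowest-xerror subtree
  • The xerror column typically falls and then flattens or rises again as the tree grows, which is the same underfitting/overfitting trade-off illustrated in Bishop's Figure 1.5.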
