Data Mining - Courses - Institute of Computer Science

HW4. Visualisation, FIM and Association rules (12.03)

1. Watch Tamara Munzner's video presentation "Keynote on Visualization Principles" and read the slides: http://www.cs.ubc.ca/~tmm/talks/vizbi11/vizbi11.pdf. Summarize the key take-home messages from her presentation.

2. Fetch the UK child height/weight measurement data (NHS web and Data Download; a local copy is available too, just in case).

Study a sufficiently large random subset of the measurements. Pay attention to, and comment on, how you made the random subset. What is it representative of? Be ready to use external scripts (subsetting the data with some tools/scripts before reading it into R/Python) rather than reading all the data in at once.
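One possible way to subset the file before loading it into R or Python is reservoir sampling, which streams through the rows once and keeps a uniform random sample of fixed size without holding the whole file in memory. The sketch below assumes the download is a CSV file with a header row; the function names and any file paths you pass in are hypothetical.

```python
import csv
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


def sample_csv(in_path, out_path, k, seed=0):
    """Write the header plus k randomly chosen data rows to out_path.

    Assumes in_path is a CSV file whose first row is a header.
    """
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        sample = reservoir_sample(reader, k, seed)
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sample)
```

A call such as `sample_csv("measurements.csv", "subset.csv", 50000)` (file names are placeholders) would produce a smaller file to read in afterwards. Note that the sampled rows are not kept in their original order.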

Make two nice, interesting and informative plots, for example scatterplots comparing age, gender, height, weight, BMI and school deprivation data (choose some). Incorporate categorical features (or even continuous ones) by playing with the color of the points.

Hint: You may want to develop ideas first on smaller subsets of the data. The goal is to find some interesting visualisations and provide a brief interpretation! You may also try to find out how much data is still OK to plot. Also think about how you incorporate the plot into the report: make sure the plot file is not too large (sometimes all points are drawn separately, in layers, and the plot becomes so large that it is impossible to load).

3. Here is a small set of transactions over 8 possible items. Identify all frequent itemsets that have a support of 5 or more (i.e. that occur in at least 5 of the 15 transactions). To do so, follow and describe the Apriori principle. You have to be quite clever in keeping track of all possible frequent itemsets. Feel free to create some small scripts or helpers for tabulation; on paper you also need to organise the data well.

B D F H
C D F G
A D F G
A B C D H
A C F G
D H
A B E F
A D F G H
A C D F G
D F G H
A C D E
B E F H
D F G
C F G H
A C D F H
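
The level-wise search suggested by the Apriori principle could be sketched in Python roughly as follows. This is only an illustration of the bookkeeping, not a required solution format: the names `support_count` and `apriori` are made up for this sketch, and support is taken as an absolute transaction count, as in the task.

```python
from itertools import combinations

# The 15 transactions listed above, one set of items per transaction.
TRANSACTIONS = [set(line.split()) for line in [
    "B D F H", "C D F G", "A D F G", "A B C D H", "A C F G",
    "D H", "A B E F", "A D F G H", "A C D F G", "D F G H",
    "A C D E", "B E F H", "D F G", "C F G H", "A C D F H",
]]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def apriori(transactions, min_support):
    """Return {frequent itemset: support count}, built level by level."""
    items = sorted({item for t in transactions for item in t})
    # Level 1: frequent single items.
    current = {}
    for item in items:
        candidate = frozenset([item])
        count = support_count(candidate, transactions)
        if count >= min_support:
            current[candidate] = count
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that reach the support threshold.
        current = {}
        for c in candidates:
            count = support_count(c, transactions)
            if count >= min_support:
                current[c] = count
    return frequent


if __name__ == "__main__":
    result = apriori(TRANSACTIONS, min_support=5)
    for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
        print("".join(sorted(itemset)), result[itemset])
```

The prune step is where the principle pays off: a candidate is counted against the data only if all of its smaller subsets were already found frequent.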

4. Pick some of the largest frequent itemsets from above and generate association rules from them. Which rules have the highest confidence and lift values? Interpret and comment on the results.
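
As a hedged sketch of how rules and their measures could be tabulated: for a rule X -> Y derived from an itemset Z = X ∪ Y, confidence is count(Z) / count(X), and lift is confidence divided by the relative support of Y. The helper names `count` and `rules_from` below are hypothetical, and the code assumes the chosen itemset occurs in at least one transaction.

```python
from itertools import combinations

# The 15 transactions from task 3, one set of items per transaction.
TRANSACTIONS = [set(line.split()) for line in [
    "B D F H", "C D F G", "A D F G", "A B C D H", "A C F G",
    "D H", "A B E F", "A D F G H", "A C D F G", "D F G H",
    "A C D E", "B E F H", "D F G", "C F G H", "A C D F H",
]]


def count(itemset):
    """Number of transactions that contain all items of itemset."""
    items = set(itemset)
    return sum(1 for t in TRANSACTIONS if items <= t)


def rules_from(itemset):
    """All rules lhs -> rhs that split itemset into two non-empty parts.

    Returns tuples (lhs, rhs, confidence, lift), best confidence first.
    """
    itemset = frozenset(itemset)
    n = len(TRANSACTIONS)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            rhs = itemset - set(lhs)
            confidence = count(itemset) / count(lhs)
            lift = confidence / (count(rhs) / n)
            rules.append((lhs, tuple(sorted(rhs)), confidence, lift))
    return sorted(rules, key=lambda r: (r[2], r[3]), reverse=True)


if __name__ == "__main__":
    for lhs, rhs, confidence, lift in rules_from("DFG"):
        print(f"{lhs} -> {rhs}  conf={confidence:.2f}  lift={lift:.2f}")
```

A lift above 1 means the right-hand side occurs more often together with the left-hand side than it does overall; a lift close to 1 suggests the rule says little beyond the base rate, even if its confidence is high.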

5. Get the Titanic survival data from https://courses.cs.ut.ee/MTAT.03.183/2014_spring/uploads/Main/titanic.txt

Find interesting associations in that data. Report clearly the most "interesting" rules discovered from the Titanic data, and how you came up with them in R, Python or some command-line package for association rules (see Wikipedia). If you selected the most interesting rules by some measure (support, confidence, lift), interpret the meaning of that measure for those rules.

Bonuses: select one or the other, not both!

6. (Bonus 2p) Perform more analysis of the height and weight data. Make a compelling story, e.g. comparing deprived regions against better-off regions, or using any other criterion that you can observe from the data. The idea is to show some very clear differences between children depending on their background. For example, use the distance to the nearest school or the school deprivation information (you would need to find out what these mean).

You can derive some new combined features to plot, e.g. height*weight, height/weight, height*height*weight, Body Surface Area, etc. Plot them against each other and against BMI, height or weight. Try to identify interesting, meaningful trends or examples, and provide some interpretation. Goal: try to separate girls and boys by their age, height and weight data. Is it easier at younger (Reception class) or older ages? Note: the data is large enough to also plot children of the same age only. Fixing one parameter may help to see the other variables better.
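
A small sketch of such derived features, assuming height is given in centimetres and weight in kilograms (check the units in the actual data). Body Surface Area is computed here with the Mosteller formula, which is only one of several published approximations; the function names are hypothetical.

```python
import math


def bmi(height_cm, weight_kg):
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)


def bsa_mosteller(height_cm, weight_kg):
    """Body surface area in m^2, Mosteller approximation."""
    return math.sqrt(height_cm * weight_kg / 3600.0)


def derived_features(height_cm, weight_kg):
    """Combined features suggested in the task, collected in one dict."""
    return {
        "height*weight": height_cm * weight_kg,
        "height/weight": height_cm / weight_kg,
        "height*height*weight": height_cm * height_cm * weight_kg,
        "bmi": bmi(height_cm, weight_kg),
        "bsa": bsa_mosteller(height_cm, weight_kg),
    }
```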

7. (Bonus 2p) Calculate approximate growth curves for the top-1%, top-10%, 50% (median), low-10% and low-1% children (for height, weight, height*weight). Compare girls and boys, and perhaps also different regions or deprivation scores. (Note: efficiency is not the primary concern here; be clever and use some general programming skills.)
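
One simple way to approximate such curves is to group the measurements by age and take percentiles within each group: top-1% corresponds to the 99th percentile, low-1% to the 1st. The sketch below uses the nearest-rank percentile definition and assumes each record is a dict with an `"age"` key already binned as you see fit (e.g. age in months); the key names and function names are hypothetical.

```python
import math
from collections import defaultdict


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list (0 < p <= 100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def growth_curves(records, value_key, percentiles=(1, 10, 50, 90, 99)):
    """Return {percentile: [(age, value), ...]} with ages in ascending order.

    records is an iterable of dicts that hold an "age" key and value_key.
    """
    by_age = defaultdict(list)
    for record in records:
        by_age[record["age"]].append(record[value_key])
    curves = {p: [] for p in percentiles}
    for age in sorted(by_age):
        values = sorted(by_age[age])
        for p in percentiles:
            curves[p].append((age, percentile(values, p)))
    return curves
```

Running this separately on the girls' and the boys' records gives two sets of curves that can be drawn in one plot. Age groups with few children give noisy extreme percentiles, so it may help to check the group sizes first.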

Note: discussing and sharing information on the meaning of the attributes through the mailing list or Piazza is OK.

Titanic helper with R:

# Install R packages arules and arulesViz 
install.packages("arules")
install.packages("arulesViz")

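# Make a note of where your data lies ...
# stringsAsFactors = TRUE keeps the columns as factors, which apriori() expects
# (since R 4.0 read.table() no longer converts strings to factors by default)
titanic <- read.table("data/titanic.txt", sep = ",", header = TRUE, stringsAsFactors = TRUE)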

# observe the data
# first 6 observations
head(titanic)
# types of features
str(titanic)
# dimensionality of the data
dim(titanic)

#load package for frequent set mining
library(arules)

#help with apriori
?apriori

#run apriori algorithm with default settings
rules = apriori(titanic)

#inspection of the result
inspect(rules)

#now let us assume, we want to see only those rules that have rhs as survived:
rules = apriori(titanic,appearance = list(rhs=c("Survived=No", "Survived=Yes"),default="lhs"))
inspect(rules)

#let us relax the default settings for the rules we are looking for
rules = apriori(titanic,parameter = list(minlen=2, supp=0.05, conf=0.8),appearance = list(rhs=c("Survived=No", "Survived=Yes"),default="lhs"))

#visualization
library(arulesViz)
plot(rules, method="graph", control=list(type="items"))

Data Mining, spring 2016/17
