Data Mining - Courses - Institute of Computer Science

HW06 (20.03) - Association rules, 2x2 tables, interestingness...

1. Construct an FP-tree using the same data set as last week (use the support count threshold smin = 2). Explain all the steps of the tree construction and draw the resulting tree. Based on this tree, answer the question: how many transactions contain {E,F}, and how many contain {C,H}?

B C A F H
F E C H
E D B
A C H F 
E F A
D H B
E C F B D 
A H C E 
G A E
B H E
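
The first pass of FP-tree construction can be sketched in base R as below. This is only a minimal sketch, not the required solution: it counts item supports, drops items below smin, and reorders each transaction by descending support, which is the order in which items are inserted into the tree. The alphabetical tie-break and the helper name count_containing are assumptions made for illustration; the helper is a brute-force scan that can be used to cross-check counts read off the tree.

```r
# Transactions copied from the assignment, one string per transaction.
raw <- c("B C A F H", "F E C H", "E D B", "A C H F", "E F A",
         "D H B", "E C F B D", "A H C E", "G A E", "B H E")
transactions <- strsplit(raw, " ", fixed = TRUE)

smin <- 2  # support count threshold from the task

# Pass 1: support count of every single item.
item_counts <- table(unlist(transactions))

# Keep the frequent items, sorted by descending count.
# Ties are broken alphabetically (an assumption; any fixed order works).
frequent <- item_counts[item_counts >= smin]
item_order <- names(frequent)[order(-as.integer(frequent), names(frequent))]

# Input for pass 2: every transaction filtered to frequent items and
# rewritten in the global item order, ready to be inserted into the tree.
ordered_transactions <- lapply(transactions, function(tx) item_order[item_order %in% tx])

# Hypothetical helper: brute-force count of transactions containing an itemset,
# useful for verifying the numbers obtained from the FP-tree.
count_containing <- function(itemset) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}
```

Calling count_containing(c("E", "F")) or count_containing(c("C", "H")) gives a reference value to compare against the answer derived from the tree.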

2. Evaluate various interestingness measures for association rules. Randomly generate a broad range of 2x2 contingency tables (f11, f10, f01, f00) for N=10,000 items. Sample the space so that each cell individually, and each pair or triple of cells, is larger than the "others" in some of the tables. In this way, sample at least 10,000 different contingency tables. Calculate 5 different scores on those tables (feel free to select which ones) and, for each measure, report the top 10 2x2 tables that are the "best" according to that measure. Use one row per table for the 4 numbers and, if useful, also the marginal sums and N.
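
One possible way to set this up in base R is sketched below. The sampling scheme is an assumption, not something the task prescribes: cell probabilities are drawn from normalised Gamma(0.5) variables (a Dirichlet sample), which tends to produce tables where one, two or three cells dominate, and each table is then a multinomial draw of N items. The five measures (support, confidence, lift, Jaccard, phi) and the helper names measures and top10 are likewise just one illustrative choice.

```r
set.seed(1)       # arbitrary seed, only for reproducibility
N <- 10000        # items per table, as in the task
n_draws <- 12000  # draw extra tables so that enough remain after de-duplication

# Cell probabilities: normalised Gamma(0.5) draws, skewed towards dominant cells.
w <- matrix(rgamma(4 * n_draws, shape = 0.5), ncol = 4)
p <- w / rowSums(w)

# One multinomial draw of N items per probability vector, one table per row.
cells <- t(apply(p, 1, function(pr) as.vector(rmultinom(1, N, pr))))
colnames(cells) <- c("f11", "f10", "f01", "f00")
tables <- unique(as.data.frame(cells))

# Five example measures for the rule A -> B; undefined values come out as NaN.
measures <- function(d) {
  f11 <- as.numeric(d$f11); f10 <- as.numeric(d$f10)
  f01 <- as.numeric(d$f01); f00 <- as.numeric(d$f00)
  n  <- f11 + f10 + f01 + f00
  r1 <- f11 + f10  # transactions containing A
  c1 <- f11 + f01  # transactions containing B
  data.frame(
    support    = f11 / n,
    confidence = f11 / r1,
    lift       = n * f11 / (r1 * c1),
    jaccard    = f11 / (f11 + f10 + f01),
    phi        = (f11 * f00 - f10 * f01) / sqrt(r1 * c1 * (n - r1) * (n - c1))
  )
}

# The 10 tables with the highest value of one measure (NaN rows are dropped).
top10 <- function(d, scores, measure) {
  idx <- order(scores[[measure]], decreasing = TRUE, na.last = NA)
  head(cbind(d[idx, ], score = scores[[measure]][idx]), 10)
}

scores <- measures(tables)
top10(tables, scores, "lift")
```

Replacing "lift" with any other column name of scores gives the ranking for that measure; marginal sums can be added as extra columns if they help with interpretation.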

3. Compare interestingness measures starting from various fixed examples of (f11, f10, f01, f00) and experimenting with each of the four values - by increasing or decreasing it, one at a time.

E.g. starting from (250, 250, 250, 250) or (200, 200, 200, 400), put f11 on the X axis, varying it from 0 to 1000, and the respective interestingness measure on the Y axis. Likewise, you can do the same for the other fields f10, f01, f00 - varying them around the fixed value by adding (-100, -99, ..., -1, 0, +1, +2, ..., +100), or from 0 to 1000, while keeping the other values intact (N will change, obviously). Try plotting one score per scatterplot while varying each of the fields fij independently of the others. Explore 4-5 different measures.
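
The one-cell-at-a-time experiment could look roughly like this in base R. It is a sketch under assumptions: lift is used as the example measure, and the helper names vary_cell and plot_cells are made up for illustration. Any other measure with the same four arguments can be passed in instead.

```r
# Example measure: lift of the rule A -> B from the four cell counts.
lift <- function(f11, f10, f01, f00) {
  n <- f11 + f10 + f01 + f00
  n * f11 / ((f11 + f10) * (f11 + f01))
}

# Evaluate `measure` while one cell takes each of `values`
# and the other three cells stay at their base value.
vary_cell <- function(base, cell, values, measure) {
  vapply(values, function(v) {
    x <- base
    x[[cell]] <- v
    measure(x[["f11"]], x[["f10"]], x[["f01"]], x[["f00"]])
  }, numeric(1))
}

# One scatterplot per cell, all for the same measure.
plot_cells <- function(base, values, measure, name) {
  op <- par(mfrow = c(2, 2))
  on.exit(par(op))
  for (cell in names(base)) {
    plot(values, vary_cell(base, cell, values, measure), pch = ".",
         xlab = cell, ylab = name, main = paste(name, "vs", cell))
  }
}

base <- c(f11 = 250, f10 = 250, f01 = 250, f00 = 250)

# Only draw when run interactively; remove the guard to plot from a script.
if (interactive()) plot_cells(base, 0:1000, lift, "lift")
```

Changing base to c(f11 = 200, f10 = 200, f01 = 200, f00 = 400) reproduces the second starting point mentioned above.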

4. Install the R packages arules and arulesViz:

install.packages("arules")
install.packages("arulesViz")

Get the Titanic survival data from https://courses.cs.ut.ee/MTAT.03.183/2014_spring/uploads/Main/titanic.txt

Make sure to explore all these commands, vary the parameters, and read the manual. Try to vary them so that the outputs are nice and interpretable. See also 6. and 7.

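# Make a note of where your data lies and adjust the path if needed.
# stringsAsFactors = TRUE: since R 4.0 read.table() keeps strings as character,
# while arules builds its items from factor (or logical) columns.
titanic <- read.table("data/titanic.txt", sep = ",", header = TRUE, stringsAsFactors = TRUE)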

#observe the data
#first 6 observations
head(titanic)
#types of the features
str(titanic)
#dimensions of the data
dim(titanic)

#load package for frequent set mining
library(arules)

#help with apriori
?apriori

#run apriori algorithm with default settings
rules <- apriori(titanic)

#inspection of the result
inspect(rules)

#now let us assume we want to see only those rules that have Survived on the rhs:
rules <- apriori(titanic, appearance = list(rhs = c("Survived=No", "Survived=Yes"), default = "lhs"))
inspect(rules)

#let us relax the default settings: lower the support threshold from the default 0.1 to 0.05
#(conf=0.8 is the default); minlen=2 drops rules with an empty lhs
rules <- apriori(titanic, parameter = list(minlen = 2, supp = 0.05, conf = 0.8), appearance = list(rhs = c("Survived=No", "Survived=Yes"), default = "lhs"))

#visualization
library(arulesViz)
plot(rules, method="graph", control=list(type="items"))
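
The support, confidence and lift that inspect(rules) reports for a rule can be reproduced by hand from a 2x2 table, which ties this task back to tasks 2 and 3. The sketch below uses base R on a made-up toy data frame: only the Survived column and its Yes/No values come from the assignment, while the Sex column, all rows, and the helper name rule_stats are hypothetical.

```r
# Hypothetical toy data, NOT the real Titanic file.
toy <- data.frame(
  Sex      = c("Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male"),
  Survived = c("Yes",    "Yes",    "No",     "No",   "No",   "Yes",  "No",   "No")
)

# 2x2 table and three quality measures for a rule lhs -> rhs,
# where lhs and rhs are logical vectors with one entry per row.
rule_stats <- function(lhs, rhs) {
  f11 <- sum(lhs & rhs)    # lhs and rhs both hold
  f10 <- sum(lhs & !rhs)   # lhs holds, rhs does not
  f01 <- sum(!lhs & rhs)   # rhs holds, lhs does not
  f00 <- sum(!lhs & !rhs)  # neither holds
  n <- f11 + f10 + f01 + f00
  c(f11 = f11, f10 = f10, f01 = f01, f00 = f00,
    support    = f11 / n,
    confidence = f11 / (f11 + f10),
    lift       = n * f11 / ((f11 + f10) * (f11 + f01)))
}

# Rule {Sex=Female} => {Survived=Yes} on the toy data.
stats <- rule_stats(toy$Sex == "Female", toy$Survived == "Yes")
print(round(stats, 3))
```

Applying the same function to columns of the real titanic data frame should give values that can be compared with the corresponding row printed by inspect(rules).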

5. Report clearly the most "interesting" rules discovered from Titanic data, and how you came up with those in R.
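
One way to narrow the rule set down before reporting is sketched below. It assumes library(arules) is loaded and that rules is the object returned by apriori() in task 4; the helper name rank_rules and the choice of lift as the default ranking measure are assumptions for illustration, not a prescribed method.

```r
# Assumes library(arules) has been loaded and `rules` comes from apriori().
rank_rules <- function(rules, by = "lift", n = 10) {
  # Drop rules for which a more general rule has the same or higher confidence.
  rules <- rules[!is.redundant(rules)]
  # Print the n best remaining rules under the chosen quality measure.
  inspect(head(sort(rules, by = by), n))
  invisible(rules)
}

# Usage (not run here, needs arules and the rules from task 4):
# kept <- rank_rules(rules, by = "lift")
# kept <- rank_rules(rules, by = "confidence", n = 5)
```

Further measures beyond support, confidence and lift can be computed with interestMeasure() from arules and compared in the same way.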

6. (Bonus, 2p) Continue exploring various interestingness measures - how best to describe them, perhaps using the scatterplots that measure the effect of each field in the 2x2 tables (e.g. what would symmetry look like, or other properties).
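
As an illustration of one such property, symmetry can be probed numerically: a measure is symmetric if the rules A -> B and B -> A get the same score, i.e. if swapping f10 and f01 leaves the value unchanged. The base R sketch below checks this on random tables; it is an empirical check rather than a proof, and the helper name is_symmetric and the sampling range are assumptions.

```r
confidence <- function(f11, f10, f01, f00) f11 / (f11 + f10)

lift <- function(f11, f10, f01, f00) {
  n <- f11 + f10 + f01 + f00
  n * f11 / ((f11 + f10) * (f11 + f01))
}

# Empirical symmetry check: compare measure(A -> B) with measure(B -> A),
# which corresponds to swapping f10 and f01, on random positive tables.
is_symmetric <- function(measure, n_trials = 1000, max_count = 1000) {
  cells <- matrix(sample.int(max_count, 4 * n_trials, replace = TRUE), ncol = 4)
  storage.mode(cells) <- "double"
  original <- measure(cells[, 1], cells[, 2], cells[, 3], cells[, 4])
  swapped  <- measure(cells[, 1], cells[, 3], cells[, 2], cells[, 4])
  isTRUE(all.equal(original, swapped))
}

set.seed(42)  # arbitrary seed, only for reproducibility
lift_symmetric <- is_symmetric(lift)
confidence_symmetric <- is_symmetric(confidence)
```

Here lift passes the check and confidence does not, which matches their definitions: the lift formula treats the two marginals identically, while confidence only uses the marginal of the left-hand side.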

Data Mining, spring 2015/16
