HW 2 (due March 1st): Frequent itemsets, association rules, Apriori, FP-Tree
For exercises 1, 2, 3 and 4 it is not necessary to load the pharmacy data into R; you can solve them manually with pen and paper.
Consider the following example of drug purchases from a pharmacy:
CustomerID  TransactionID  BasketContent
1           1234           {Aspirin, Panadol}
1           4234           {Aspirin, Sudafed}
2           9373           {Tylenol, Cepacol}
2           9843           {Aspirin, Vitamin C, Sudafed}
3           2941           {Tylenol, Cepacol}
3           2753           {Aspirin, Cepacol}
4           9643           {Aspirin, Vitamin C}
4           9691           {Aspirin, Ibuprofen, Panadol}
5           5313           {Panadol, Vitamin C}
5           1003           {Tylenol, Cepacol, Ibuprofen}
6           5636           {Tylenol, Panadol, Cepacol}
6           3478           {Panadol, Sudafed, Ibuprofen}
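Exercises 1–4 are meant to be solved by hand, but if you would like to double-check your answers afterwards, the baskets can be encoded with the arules package that is installed in exercise 5. This is only an optional sketch; the names baskets and trans are illustrative, not part of the assignment.

library(arules)
# one character vector per transaction, in the order of the table above
baskets <- list(
  c("Aspirin", "Panadol"),
  c("Aspirin", "Sudafed"),
  c("Tylenol", "Cepacol"),
  c("Aspirin", "Vitamin C", "Sudafed"),
  c("Tylenol", "Cepacol"),
  c("Aspirin", "Cepacol"),
  c("Aspirin", "Vitamin C"),
  c("Aspirin", "Ibuprofen", "Panadol"),
  c("Panadol", "Vitamin C"),
  c("Tylenol", "Cepacol", "Ibuprofen"),
  c("Tylenol", "Panadol", "Cepacol"),
  c("Panadol", "Sudafed", "Ibuprofen")
)
trans <- as(baskets, "transactions")
# absolute support counts of the individual items
itemFrequency(trans, type = "absolute")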
1. a. For the data in the table above, compute the support and the support count for the itemsets {Aspirin}, {Tylenol, Cepacol} and {Aspirin, Ibuprofen, Panadol}.
b. Compute the confidence for the following association rules: {Aspirin, Vitamin C} → {Sudafed}, {Aspirin} → {Vitamin C}, {Vitamin C} → {Aspirin}. Why are the results for the last two rules different?
c. List all the frequent itemsets if the support count threshold is smin = 3.
d. What does the anti-monotonicity property of support imply? Give an example using the above data set.
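As a reminder for items a, b and d: for a data set of N transactions, the support count of an itemset X is the number of transactions that contain X, its support is supp(X) = count(X) / N, and the confidence of a rule X → Y is conf(X → Y) = supp(X ∪ Y) / supp(X). Anti-monotonicity of support means that X ⊆ Y implies supp(X) ≥ supp(Y), i.e. an itemset can never be more frequent than any of its subsets.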
2. Apply the Apriori algorithm to the drug data set from exercise 1 with support count threshold smin = 4. Show the candidate and frequent itemsets for each iteration and enumerate all the final frequent itemsets. Also indicate the association rules that could be generated from these itemsets and highlight the strongest ones.
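Again purely optional: with the trans object from the sketch under the data table, the result of exercise 2 can be cross-checked in R. A support count of smin = 4 out of 12 transactions corresponds to a relative support of 4/12:

# enumerate all frequent itemsets with support count >= 4
# (0.33 ≈ 4/12; any threshold in the interval (3/12, 4/12] gives the same result)
freq <- apriori(trans, parameter = list(supp = 0.33, target = "frequent itemsets"))
inspect(freq)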
3. Construct an FP-tree from the same data set (use the same support count threshold smin = 4). Explain all the steps of the tree construction and draw the resulting tree. Based on this tree, answer the following questions: how many transactions contain {Aspirin}, and how many contain {Cepacol}? How many transactions were made in total?
4. Simulate frequent pattern enumeration based on the FP-tree constructed in the previous exercise. Report all the frequent patterns.
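The arules package does not implement FP-growth itself, but eclat() finds exactly the same frequent itemsets, so it can serve as another optional sanity check for the patterns you enumerate from your FP-tree:

# same frequent itemsets as FP-growth would report (0.33 ≈ smin = 4 out of 12)
patterns <- eclat(trans, parameter = list(supp = 0.33))
inspect(patterns)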
5. Install the R packages arules and arulesViz:
install.packages("arules") install.packages("arulesViz")
Get the Titanic survival data from https://courses.cs.ut.ee/MTAT.03.183/2014_spring/uploads/Main/titanic.txt
Make sure to explore all of the commands below, vary the parameters, and read the manual. Try to tune them so that the outputs are nicely interpretable. See also exercises 6 and 7.
# make a note of where your data lies ...
titanic <- read.table("data/titanic.txt", sep = ",", header = TRUE)

# observe the data
# first 6 observations
head(titanic)
# types of features
str(titanic)
# dimensionality of the data
dim(titanic)

# load the package for frequent itemset mining
library(arules)
# help with apriori
?apriori

# run the apriori algorithm with default settings
rules = apriori(titanic)
# inspect the result
inspect(rules)

# now let us assume we only want to see rules that have Survived on the rhs:
rules = apriori(titanic,
                appearance = list(rhs = c("Survived=No", "Survived=Yes"), default = "lhs"))
inspect(rules)

# let us relax the default settings for the rules we are looking for
rules = apriori(titanic,
                parameter = list(minlen = 2, supp = 0.05, conf = 0.8),
                appearance = list(rhs = c("Survived=No", "Survived=Yes"), default = "lhs"))

# visualization
library(arulesViz)
plot(rules, method = "graph", control = list(type = "items"))
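When you get to exercise 6, one common (though not obligatory) way to look for "interesting" rules is to rank the rules mined by the script above by a quality measure such as lift; a minimal sketch (the name rules.sorted is just illustrative):

# rank the rules by lift; the strongest associations come first
rules.sorted <- sort(rules, by = "lift")
inspect(rules.sorted)
# quality measures (support, confidence, lift) of all rules
quality(rules)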
6. (1p) Report clearly the most "interesting" rules discovered from the Titanic data, and explain how you came up with them in R.
7. (1p) Some rules do not provide extra knowledge because other rules already contain the same information. For example, if there is the rule {Class='2nd', Age='Child'} → {Survived='Yes'}, then the rule {Class='2nd', Age='Child', Sex='Female'} → {Survived='Yes'} is not very informative. Such rules are called 'redundant'. Come up with a definition of rule redundancy. Using the script and the data from exercise 5, tune the default parameters so that there are redundant rules in the output. Then add a "filter" that outputs only the non-redundant rules.
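Once you have formulated your own definition and filter, recent versions of arules also ship a helper that flags redundant rules, which you can use to cross-check your result (a sketch only, assuming the rules object produced by the script above):

# flags rules for which a more general rule with at least the same confidence exists
redundant <- is.redundant(rules)
# keep only the non-redundant rules
inspect(rules[!redundant])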