Homework 4 (HW04) - Frequent pattern mining
We have added a comment to EX2 part (2e) (shown in italics) to help you.
We have clarified the text of EX4 part (4b) by replacing "support" with "support count".
In (4c) and (4d) we now emphasise that you should sort all rules; this may have been slightly ambiguous earlier.
Exercise 1 (EX1) (1 point)
The goal in this exercise is to study the frequencies of itemsets in a dataset with transactions covering 8 items. Please first run the following commands to build the dataset:
# required only once, please comment this out after the first time
install.packages("arules")
library(arules)
data = list(
  c('B','D','F','H'),
  c('C','D','F','G'),
  c('A','D','F','G'),
  c('A','B','C','D','H'),
  c('A','C','F','G'),
  c('D','H'),
  c('A','B','E','F'),
  c('A','D','F','G','H'),
  c('A','C','D','F','G'),
  c('D','F','G','H'),
  c('A','C','D','E'),
  c('B','E','F','H'),
  c('D','F','G'),
  c('C','F','G','H'),
  c('A','C','D','F','H')
)
data = as(data, "transactions")
Next, use the command inspect(data) to take a look at the dataset, but please do not include the result in the report PDF.
By performing all counting manually (without programming), please answer the following questions:
- (1a) Calculate the support and support count of patterns {D}, {D,F} and {D,F,G}
- (1b) Report the row indices (identifiers) of transactions which include the pattern {D,F,G}
- (1c) Explain what anti-monotonicity of support means, in the example of these patterns {D}, {D,F} and {D,F,G}
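If you would like an additional, optional way to verify your manual counts from (1a), arules can also compute the support of specific itemsets directly. A minimal sketch, assuming the data object built above (the patterns variable name is mine, not part of the exercise):
patterns = as(list(c('D'), c('D','F'), c('D','F','G')), "itemMatrix")
support(patterns, data)                  # relative support of the three patterns
support(patterns, data) * length(data)   # support counts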
Now let us check the results using the following code, which finds all itemsets with support count at least 5:
library(dplyr)
find_freq_itemsets = function(data, min_support_count) {
  # convert support count into support
  min_support = min_support_count / length(data)
  # find itemsets with support >= min_support
  itemsets = eclat(data, parameter=list(support=min_support))
  # convert to data.frame, it is easier to manipulate
  itemsets = as(itemsets, "data.frame")
  # items are factors, convert them to strings
  itemsets$items = as.character(itemsets$items)
  # sort by length of string, and among equal-length strings sort alphabetically
  itemsets = itemsets %>% arrange(nchar(items), items)
}
itemsets = find_freq_itemsets(data, 5)
Next, use the command print(itemsets) to take a look at the result, but please do not report it in the PDF.
Please, answer the following questions:
- (1d) How many itemsets could be generated in total from 8 items?
- (1e) What percentage of these itemsets have positive support (occur at least once in the data)? Use find_freq_itemsets(data,1) to find it out.
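For (1e), one simple way to get the numerator is to count the rows of that output; a minimal sketch (the n_positive name is mine):
n_positive = nrow(find_freq_itemsets(data, 1))
n_positive  # divide this by your answer to (1d) to obtain the percentage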
Let us now consider the task of finding all frequent patterns of size 3 with minimum support count 5.
- (1f) A naive method would have to look through all possible subsets of size 3. How many subsets of size 3 out of 8 items are there altogether? Calculate this with the function choose(8,3).
- (1g) Apriori builds candidate 3-sets from frequent 2-sets. Look at the output of find_freq_itemsets(data,5) and manually (without programming) find and report all 3-sets that can be obtained as a union of two frequent 2-sets (please do not discard any of the resulting sets yet, this will be done in the next steps).
- (1h) Study the 3-sets reported in (1g) and discard all the 3-sets for which some subset of size 2 is not frequent. Report the remaining candidate 3-sets.
- (1i) Instead of counting the frequencies of all candidate 3-sets, just report all the frequent 3-sets from the output of find_freq_itemsets(data,5).
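If you prefer not to pick the 3-sets out of the printout by eye for (1i), they can also be extracted programmatically. A minimal sketch, assuming the find_freq_itemsets output format above, where item strings look like "{D,F,G}" (the itemsets5 name is mine):
itemsets5 = find_freq_itemsets(data, 5)
# keep only itemsets of size 3 (the string splits into 3 parts at the commas)
itemsets5[lengths(strsplit(itemsets5$items, ",")) == 3, ]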
Exercise 2 (EX2) (1 point)
The goal in this exercise is to study the association rules in the same dataset as built in EX1.
Look at the output of find_freq_itemsets(data,5) and manually (without programming) answer the following questions:
- (2a) Create and report all possible association rules where the union of the antecedent (left-hand-side) and the consequent (right-hand-side) is equal to the set {D,F,G}.
- (2b) Organise the rules from (2a) into a lattice (please see the lecture slides about this). No need to make a visualisation, just list the rules in each layer separately.
- (2c) Calculate the support, confidence and lift of all the rules from (2a), and report them by layers as in (2b); an optional check for one example rule is sketched after this list.
- (2d) Find and report all rules from (2a) that have confidence at least 0.5.
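For (2c), one example rule, {D,F} => {G}, can optionally be checked in code. A minimal sketch, assuming the support() helper from arules (the variable names are mine, not part of the exercise):
lhs_items  = as(list(c('D','F')), "itemMatrix")
rhs_items  = as(list(c('G')), "itemMatrix")
both_items = as(list(c('D','F','G')), "itemMatrix")
supp = support(both_items, data)          # support of {D,F,G}
conf = supp / support(lhs_items, data)    # confidence = supp(lhs and rhs) / supp(lhs)
lift = conf / support(rhs_items, data)    # lift = confidence / supp(rhs)
c(support = supp, confidence = conf, lift = lift)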
Now try out the command apriori as follows:
- (2e) Apply the command rules = apriori(data, parameter=list(support=5/length(data), conf=0.5)). See the results using inspect(rules) (do not print the output into the report PDF) and report the rules with 3 items. Note that the apriori command considers only rules with 1 item in the consequent. Is this result in agreement with what you obtained in (2d)?
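If inspect(rules) shows too many rules, the output can optionally be restricted to the rules with 3 items. A minimal sketch, assuming the arules size() method, which counts the items of the antecedent and consequent together:
inspect(rules[size(rules) == 3])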
Exercise 3 (EX3) (1 point)
Please download the Titanic survival data titanic.csv from here in CSV format. Read in the file and explore the data using the following code:
titanic = read.csv('titanic.csv')
head(titanic)
str(titanic)
table(titanic$Class)
table(titanic$Sex)
table(titanic$Age)
table(titanic$Survived)
Next load the arules library with library(arules) and study the output of apriori(titanic).
Please answer the following questions:
- (3a) How many rules did the apriori algorithm find? (note that apriori was using the default parameters min support 0.1 and min confidence 0.8)
- (3b) Consider all rules with confidence equal to 1.0. Which of these is the most interesting? One rule among them can explain all the others; which one? (Since that particular rule has confidence 1.0, all the other rules considered here also have confidence 1.0.) How would you explain these rules?
- (3c) Consider the two rules with the highest lift value (find them manually, use the code from EX4 to sort the rules by lift, or see the optional sketch after this list). These two rules have the same lift, the same support, and the same confidence. Why? Hint: the reason is related to what you discovered in (3b).
- (3d) What is the most interesting rule in these results, other than the ones discussed in (3b) and (3c)?
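Optional helpers for (3b) and (3c). A minimal sketch, assuming the arules subset(), sort() and head() methods for rules (the rules3 name is mine):
rules3 = apriori(titanic)
inspect(subset(rules3, confidence >= 1))          # rules with confidence 1.0, for (3b)
inspect(head(sort(rules3, by = "lift"), n = 2))   # the two rules with the highest lift, for (3c)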
Exercise 4 (EX4) (1 point)
Consider the same Titanic dataset as in EX3. Please run the apriori algorithm again, but this time with very low min support and min confidence, and sort by lift:
library(dplyr)  # for %>% and arrange(); already loaded if you completed EX1 in the same session
rules = apriori(titanic, parameter=list(supp=0.000001, conf=0.000001))
rules = as(rules, "data.frame")
rules = rules %>% arrange(-lift)
head(rules, n = 10)  # remember you can show as many rows as you want by changing n
- (4a) Discuss what you can learn from the 3 rules with the highest lift.
- (4b) Calculate the support count of the antecedent (left-hand-side) in the rules of (4a) by dividing the count (last column) by the confidence (3rd column); see also the optional sketch after this list. Which of these rules do you find the most interesting?
- (4c) Sort all rules by confidence. What can you learn from the 9 rules with confidence 1.0 and lift greater than 3?
- (4d) Sort all rules by support. What can you learn from the 4 rules with support greater than 0.7?
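Optional helpers for (4b)-(4d). A minimal sketch, assuming the rules data frame built above (the top3 name is mine; newer arules versions also add a coverage column, which is the relative support of the antecedent):
top3 = head(rules %>% arrange(-lift), n = 3)
top3$count / top3$confidence                      # antecedent support count, for (4b)
rules %>% arrange(-confidence) %>% head(n = 15)   # sorted by confidence, for (4c); adjust n as needed
rules %>% arrange(-support) %>% head(n = 10)      # sorted by support, for (4d); adjust n as needed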