HW5. FIM and Association rules II (19.03)
1. Use the example data below and simulate the FP-tree growth algorithm. Report all itemsets with a support count of 3 or more. Report the most frequent 3-element itemsets and some of the "most interesting" rules.
C D E A D B C E B C D E A C D C D C E B C E A C B C E
2. Make a vertical layout of the same data for manual inspection. Show how the vertical layout can be used to calculate the 2x2 contingency table for rules such as B C -> E, E C -> B and C -> A. For these rules, calculate the lift, odds ratio and correlation measures.
3. Generate 10,000 random 2x2 contingency tables in which each of the four cells is at least 50 and all four values add up to 10,000 (i.e. each is an example 2x2 table for a database of 10,000 transactions). Generate the values so that the following cases are all covered: only one of the four values is "small" (between 50 and 99), any combination of two is "small", any combination of three is "small", and all four are "large". Try to cover the range of all possible 2x2 values quite evenly. Describe your idea and procedure clearly, and visualise the generated data. Note that the fourth value is dependent (10,000 - sum_of_three), so you need to visualise only three values and can leave, for example, f11 unvisualised.
4. Based on this synthetic data, identify three best examples using the lift, odds ratio and correlation measures. Explain them: what would the corresponding rules mean in supermarket shopping basket analysis? Provide a scenario and an example based on it.
5. Use the "Titanic" data set and find the rules with the highest odds ratio. Explain the odds ratio and the rule you found, and compare it to lift. Are there examples where lift and odds ratio give different "best" rules?
6. (Bonus 1p) Describe the mutual information measure. Report the rules with the highest mutual information from the synthetic data (task 3) and from the Titanic data set. Which association rules would have the highest mutual information?
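
As a reference for tasks 2 to 6, the following is a minimal Python sketch of how a 2x2 contingency table can be read off a vertical layout (item -> set of transaction ids) and how the four measures are commonly defined on it. It is one possible formulation, not a prescribed solution. The items p, q, r, the transaction ids and the function names are hypothetical and are not taken from the example data above. The table is written as (f11, f10, f01, f00) for a rule X -> Y, where f11 counts transactions containing both X and Y, f10 those with X but not Y, f01 those with Y but not X, and f00 those with neither. "Correlation" is taken here to mean the phi coefficient, which is an assumption.

```python
import math

def contingency(tids_x, tids_y, n):
    """2x2 counts (f11, f10, f01, f00) for X -> Y from two tid-sets; n = number of transactions."""
    f11 = len(tids_x & tids_y)          # both X and Y
    f10 = len(tids_x) - f11             # X without Y
    f01 = len(tids_y) - f11             # Y without X
    f00 = n - f11 - f10 - f01           # neither
    return f11, f10, f01, f00

def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    # Undefined when f10 * f01 == 0; returning infinity is a choice made for this sketch.
    return math.inf if f10 * f01 == 0 else (f11 * f00) / (f10 * f01)

def phi(f11, f10, f01, f00):
    # Phi coefficient, assumed here as the "correlation" measure.
    num = f11 * f00 - f10 * f01
    den = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den

def mutual_information(f11, f10, f01, f00):
    # Mutual information in bits between the indicators "contains X" and "contains Y".
    n = f11 + f10 + f01 + f00
    rows = (f11 + f10, f01 + f00)
    cols = (f11 + f01, f10 + f00)
    cells = ((f11, f10), (f01, f00))
    mi = 0.0
    for i in range(2):
        for j in range(2):
            f = cells[i][j]
            if f > 0:                   # a zero cell contributes nothing
                mi += (f / n) * math.log2(f * n / (rows[i] * cols[j]))
    return mi

# Hypothetical vertical layout over 10 transactions (ids 1..10).
vertical = {
    "p": {1, 2, 3, 4, 5, 7},
    "q": {1, 2, 3, 4, 5, 8},
    "r": {1, 2, 3, 6},
}
n_transactions = 10

# The tid-set of an itemset is the intersection of its items' tid-sets.
tids_pq = vertical["p"] & vertical["q"]
table = contingency(tids_pq, vertical["r"], n_transactions)   # rule p q -> r

print(table)
print(lift(*table), odds_ratio(*table), phi(*table), mutual_information(*table))
```

For the hypothetical rule p q -> r the table is (3, 2, 1, 4), which gives a lift of 1.5 and an odds ratio of 6.0. The same functions can be applied unchanged to each generated table from task 3, where every cell is at least 50 and the odds ratio is therefore always finite.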