HW 3 (due March 8th) Association rule interestingness
1. Look at the data about patients' disease diagnoses and assigned medical treatments (drugs) - Attach:disease_treatment.txt. Identify rules that show which treatments are typically used for which diseases. Look at rules with a disease as one of the items on the left-hand side of the rule. Output 10 interesting rules, each about a different disease.
2. Now look at rules where the left-hand side contains only medications - at what confidence could you "predict" the diagnosis?
3. Try out some alternative "goodness measures" in the above tasks. Can you identify "more interesting" patterns in the data?
4. Describe 6 other scoring functions, both symmetric and non-symmetric ones. Identify the properties that such functions should possess.
5. Randomly generate 10,000 2x2 contingency tables representing various association rules for a hypothetical database of 100,000 transactions. Make sure to "implant" rules of various frequencies and confidence levels. Try to cover the entire 4-dimensional space by creating all kinds of value combinations (small, mid, high values) for each of the 2x2 cells. E.g., once you have 4 distinct values that add up to 100,000, you can make 24 different "tables" by assigning them to the cells in any of the 24 permutation orders. Make sure all tables are unique. Output the 5 tables with the highest support, the 5 with the highest confidence, and the 5 with the highest support+confidence. Make sure you are able to complete this task - see next week's assignment.
6. (bonus 2p) Compare the speed of apriori and FP-tree on some larger data sets. Can you identify cases where one of the tools is superior on certain large data sets?
7. (bonus 2p) You are welcome to test association rule mining tools other than R arules. For example, use a standalone tool or algorithms from other packages (e.g. Weka). Document what you used and how you used it. It is sufficient to do this for at least one of the tasks 1.-3. or 6. Also report the "speed" of these tools.
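
The permutation idea in task 5 can be sketched roughly as below. This is only an illustration, not a required approach: the choice of Python, the cell order (f11, f10, f01, f00) for a rule X -> Y, the log-scale sampling of cell weights, and all function names are assumptions made for this sketch.

```python
import itertools
import random

N = 100_000  # transactions in the hypothetical database


def random_cells(rng, total=N):
    """Draw four non-negative cell counts that add up to `total`.

    Weights are sampled on a log scale (an assumption of this sketch) so
    that small, mid and high cell values all occur.
    """
    weights = [10 ** rng.uniform(0, 4) for _ in range(4)]
    scale = total / sum(weights)
    cells = [int(w * scale) for w in weights[:3]]
    cells.append(total - sum(cells))  # last cell absorbs the rounding error
    return tuple(cells)


def generate_tables(n_tables=10_000, seed=0):
    """Collect unique tables, adding every cell permutation of each draw."""
    rng = random.Random(seed)
    seen, tables = set(), []
    while len(tables) < n_tables:
        for table in itertools.permutations(random_cells(rng)):
            if table not in seen and len(tables) < n_tables:
                seen.add(table)
                tables.append(table)
    return tables


def support(table):
    """Fraction of transactions containing both X and Y (cell f11)."""
    return table[0] / N


def confidence(table):
    """f11 / (f11 + f10); defined as 0.0 here when X never occurs."""
    f11, f10, _, _ = table
    return f11 / (f11 + f10) if f11 + f10 else 0.0


tables = generate_tables()
top_support = sorted(tables, key=support, reverse=True)[:5]
top_confidence = sorted(tables, key=confidence, reverse=True)[:5]
top_combined = sorted(tables, key=lambda t: support(t) + confidence(t),
                      reverse=True)[:5]

for name, top in [("support", top_support),
                  ("confidence", top_confidence),
                  ("support+confidence", top_combined)]:
    print(name, top)
```

Keeping each table as a plain 4-tuple makes the uniqueness check a simple set lookup, and any other scoring function from tasks 3 and 4 can be passed as the sort key in the same way.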