HW 4 (due March 15th) Association rule interestingness, elements of descriptive analysis.
1. Using an approach similar to last week's, generate 10,000 random 2x2 contingency tables. Make sure they cover a broad range of table types, representing the possible "universe" of 2x2 tables for a database of 100,000 elements in total. Generate only tables in which no cell is smaller than 100 (the baseline). Also require that any two tables differ from each other by more than 100 in total, summed over the absolute differences of the four cells. For example, if one table is (30,000 20,000 35,000 15,000), then (30,002 19,998 35,005 14,995) differs from it by only 2+2+5+5=14, which is too close. To guarantee that this restriction holds, you can either be clever in the generation or continuously test each new table against the previously generated ones (slower). What are now the top-5 minimum and maximum values for the different goodness measures (use 5-6 different measures)?
2. Calculate goodness scores using up to 5 different symmetric measures. Pick two of them and plot them against each other on an X-Y dot plot. Visually identify the cases where the two measures disagree, i.e. where a table scores extremely on one measure but not on the other. Describe such "extreme" cases and, based on them, explain how the two measures differ from each other.
3. The same as above, but for asymmetric measures. Output the best-scoring tables together with their "mirror" counterparts. What happens to the score when you look at the "mirror" version (A->B vs B->A)?
4. Based on the experiments of Task 3, select one of the asymmetric goodness measures. On an X-Y plot, compare its score for the original contingency table (X) with its score for the "mirror" table (Y). Characterise some of the extreme cases based on the plot.
5. Plot the distributions of two goodness measures over all 10,000 contingency tables: one measure from Task 2 and the other from Task 3. You can use histograms, density plots, or simply sort the values and plot them on an X-Y plot.
6. (Bonus, 2p) Apply the new knowledge from this week's tasks to the data set of diagnoses and drugs from last week. Which of the methods would be "best" for that data? Can you identify rules that you could not find last week?
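
Hint for Tasks 1-4 (a minimal sketch, not a required solution): one possible way to be "clever in generation" is to draw every cell as a multiple of 100. Because the four cells always sum to 100,000, two distinct tables on this grid must differ in at least two cells by at least 100 each, so they differ by at least 200 in total. The Python sketch below assumes the cell order (a, b, c, d) = (A and B, A without B, B without A, neither). This layout, the function names, and the choice of lift and confidence as example measures are illustrative assumptions, not part of the assignment.

```python
import random

N = 100_000        # total database size (from Task 1)
STEP = 100         # grid step, also the minimum cell value (assumed design choice)
UNITS = N // STEP  # number of grid units to split over the four cells


def generate_tables(n_tables=10_000, seed=0):
    """Sample distinct 2x2 tables (a, b, c, d) whose cells are multiples of STEP."""
    rng = random.Random(seed)
    tables = set()
    while len(tables) < n_tables:
        # Three distinct cut points split UNITS into four positive parts.
        c1, c2, c3 = sorted(rng.sample(range(1, UNITS), 3))
        parts = (c1, c2 - c1, c3 - c2, UNITS - c3)
        tables.add(tuple(p * STEP for p in parts))
    return sorted(tables)


def l1_distance(t, u):
    """Sum of absolute cell differences, as in the 2+2+5+5=14 example."""
    return sum(abs(x - y) for x, y in zip(t, u))


def mirror(t):
    """Swap the roles of A and B (A->B becomes B->A)."""
    a, b, c, d = t
    return (a, c, b, d)


def lift(t):
    """Symmetric measure: P(A and B) / (P(A) * P(B))."""
    a, b, c, d = t
    return a * (a + b + c + d) / ((a + b) * (a + c))


def confidence(t):
    """Asymmetric measure: P(B | A) for the rule A->B."""
    a, b, c, d = t
    return a / (a + b)


if __name__ == "__main__":
    tables = generate_tables()
    # Top-5 maximum and minimum tables for one measure.
    ranked = sorted(tables, key=lift)
    print("lowest lift:", ranked[:5])
    print("highest lift:", ranked[-5:])
    # Original vs. mirror score for an asymmetric measure.
    t = ranked[-1]
    print(confidence(t), confidence(mirror(t)))
```

Note that drawing the cut points uniformly may under-represent extreme tables (for example, one very large cell and three cells near the baseline), so you may need to mix in other sampling schemes to obtain the "broad range" asked for in Task 1. The slower alternative is to draw tables freely and reject any table whose `l1_distance` to an already accepted table is 100 or less.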