HW 03 - 27.09.2012

Please review (or learn) the basics of probability and conditional probability. Example materials:

Please find enclosed a data set with 1000 cases of 5 throws of dice (values 1..6). File: 02_dice.txt

Your task is to study the dice T1..T5. Are they "loaded"? Are there any "dependencies" between them?

1. Calculate the probability distribution of the outcomes for each die (i.e. count the frequency of each outcome).
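
A minimal sketch of the counting in task 1. It assumes the throws have already been read into a list of rows, one row per case with the five outcomes T1..T5 in order; the layout of 02_dice.txt is not described here, so loading is left out. The name die_distributions and the tiny sample are made up for illustration and are not the real data.

```python
from collections import Counter

def die_distributions(rows, faces=range(1, 7)):
    """Return one {face: relative frequency} dict per die (column)."""
    n = len(rows)
    result = []
    # zip(*rows) transposes the data: one tuple of outcomes per die.
    for column in zip(*rows):
        counts = Counter(column)
        result.append({face: counts[face] / n for face in faces})
    return result

# Made-up sample (NOT the real data): 4 cases of 5 throws.
sample = [
    (1, 2, 3, 4, 5),
    (1, 3, 3, 6, 5),
    (2, 2, 3, 4, 1),
    (1, 6, 3, 4, 5),
]
dists = die_distributions(sample)
for name, dist in zip(("T1", "T2", "T3", "T4", "T5"), dists):
    print(name, dist)
```

For a fair die each face should come out close to 1/6; with 1000 cases some deviation is expected from chance alone, so a large and consistent gap is what hints at a loaded die.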

2. Calculate the conditional probabilities of the last two dice given the outcomes of T1, T2, T3, i.e. the frequencies of each outcome given the outcomes of other dice (T1,T2).

P(T4|T1), P(T5|T1), P(T1|T4), P(T1|T5)
P( T4 | T1,T2 ), P( T5 | T1,T2 )
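
The conditional probabilities listed above can be estimated as relative frequencies within each group of rows that share the conditioning values. The sketch below assumes the same row layout as in task 1 (one tuple of five outcomes per case); the function name conditional, the 0-based column indices and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def conditional(rows, target, given):
    """Estimate P(T_target | T_given...) by relative frequency.

    target: 0-based column index of the die of interest.
    given:  tuple of 0-based column indices to condition on.
    Returns {given_values: {target_value: probability}}.
    """
    joint = Counter()     # counts of (given values, target value)
    marginal = Counter()  # counts of given values alone
    for row in rows:
        g = tuple(row[i] for i in given)
        joint[(g, row[target])] += 1
        marginal[g] += 1
    table = {}
    for (g, t), count in joint.items():
        table.setdefault(g, {})[t] = count / marginal[g]
    return table

# Made-up sample (NOT the real data): 4 cases of 5 throws.
sample = [
    (1, 2, 3, 4, 5),
    (1, 3, 3, 6, 5),
    (2, 2, 3, 4, 1),
    (1, 6, 3, 4, 5),
]
p_t4_given_t1 = conditional(sample, target=3, given=(0,))
p_t5_given_t1_t2 = conditional(sample, target=4, given=(0, 1))
print(p_t4_given_t1)
print(p_t5_given_t1_t2)
```

Note that conditioning on two dice splits 1000 cases over 36 combinations, roughly 28 cases each, so the estimates for P( T4 | T1,T2 ) are much noisier than those for P(T4|T1).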

3. Simulate FP-growth on a data set of the following transactions (order items by frequency). Calculate (manually) all frequent itemsets with support of at least 2.

 1: {a, d, e} 
 2: {b, c, d} 
 3: {a, c, e} 
 4: {a, c, d, e} 
 5: {a, e} 
 6: {a, c, d} 
 7: {b, c} 
 8: {a, c, d, e} 
 9: {b, c, e} 
 10: {a, d, e}
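
To cross-check a manual run, the sketch below counts item frequencies, reorders each transaction by descending frequency (the preprocessing step before building the FP-tree), and then enumerates frequent itemsets by brute force. The brute-force part is not FP-growth; it only serves to verify the hand-computed result. Breaking frequency ties alphabetically is an assumption, and the helper names are made up.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"a", "d", "e"}, {"b", "c", "d"}, {"a", "c", "e"}, {"a", "c", "d", "e"},
    {"a", "e"}, {"a", "c", "d"}, {"b", "c"}, {"a", "c", "d", "e"},
    {"b", "c", "e"}, {"a", "d", "e"},
]
MIN_SUPPORT = 2  # absolute support: number of transactions

# Support of each single item.
item_counts = Counter(item for t in transactions for item in t)

def ordered(transaction):
    """Sort items by descending frequency; ties alphabetical (assumed rule)."""
    return sorted(transaction, key=lambda i: (-item_counts[i], i))

def frequent_itemsets(transactions, min_support):
    """Brute force over all item combinations (NOT FP-growth)."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result

frequent = frequent_itemsets(transactions, MIN_SUPPORT)
for number, t in enumerate(transactions, start=1):
    print(number, ordered(t))
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, support)
```

Brute force is only feasible here because there are five distinct items; FP-growth avoids generating candidates by mining the compressed tree instead.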

4. Generate 1000 "random" 2x2 contingency tables for 1000 elements (distributed into f11, f10, f01, f00). Try to choose the randomness so that the cells are not too evenly distributed and are likely to contain some more extreme values. Calculate the Piatetsky-Shapiro, correlation and J-measure values. Identify the best 2x2 tables according to your data.
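
One possible way to approach task 4, as a sketch under stated assumptions. The skewed randomness is obtained by drawing cell probabilities from a Dirichlet distribution with a small parameter (alpha = 0.5 is an arbitrary choice) and then distributing the 1000 elements accordingly. The measure definitions follow the commonly used forms for a rule A -> B: PS = P(A,B) - P(A)P(B), correlation as the phi coefficient, and the J-measure with natural logarithms. The course may use slightly different conventions (e.g. log base 2), so check them against the lecture slides.

```python
import math
import random

def random_table(n=1000, alpha=0.5, rng=random):
    """Draw (f11, f10, f01, f00) summing to n.

    Cell probabilities are a Dirichlet(alpha, ..., alpha) draw, built from
    normalised gamma variates; alpha < 1 favours uneven tables.
    """
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(4)]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    cells = rng.choices(range(4), weights=probs, k=n)
    return tuple(cells.count(i) for i in range(4))

def piatetsky_shapiro(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return f11 / n - ((f11 + f10) / n) * ((f11 + f01) / n)

def correlation(f11, f10, f01, f00):
    """Phi coefficient; returns 0.0 when a row or column total is zero."""
    denom = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f10 * f01) / denom if denom else 0.0

def j_measure(f11, f10, f01, f00):
    """J-measure of A -> B with natural log; empty cells contribute 0."""
    n = f11 + f10 + f01 + f00
    p_b = (f11 + f01) / n
    total = 0.0
    if f11:
        total += (f11 / n) * math.log((f11 / (f11 + f10)) / p_b)
    if f10:
        total += (f10 / n) * math.log((f10 / (f11 + f10)) / (1 - p_b))
    return total

rng = random.Random(2012)  # fixed seed so the run is repeatable
tables = [random_table(rng=rng) for _ in range(1000)]
for measure in (piatetsky_shapiro, correlation, j_measure):
    best = max(tables, key=lambda t: measure(*t))
    print(measure.__name__, best, round(measure(*best), 4))
```

Here "best" is taken to mean the table with the highest value of each measure; ranking the top few tables per measure instead of a single maximum shows more clearly where the measures disagree.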

5. Plot the values of the above three measures against each other (3 pairwise comparisons) and try to characterise verbally how and why the measures differ from each other.

6. (Bonus 1p) Eliminate from the above 1000 tables those with support less than 1%, 5%, 10%, 20%, 50%. How do the comparisons of the measures from task 5 change?
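
A small sketch of the filtering step in task 6. It assumes that the support of a table is f11 / N, i.e. the fraction of elements in which both attributes are present; if the course defines support differently, adjust the support function. The four tables are made up for illustration.

```python
def support(table):
    """Support taken as f11 / N (an assumption, see the note above)."""
    f11, f10, f01, f00 = table
    return f11 / (f11 + f10 + f01 + f00)

def filter_by_support(tables, min_support):
    """Keep only tables whose support is at least min_support."""
    return [t for t in tables if support(t) >= min_support]

# Made-up tables (f11, f10, f01, f00), each summing to 1000.
tables = [
    (5, 495, 400, 100),
    (60, 440, 300, 200),
    (300, 200, 100, 400),
    (600, 100, 100, 200),
]
kept = {thr: filter_by_support(tables, thr) for thr in (0.01, 0.05, 0.10, 0.20, 0.50)}
for threshold, remaining in kept.items():
    print(f"{threshold:.0%}: {len(remaining)} tables remain")
```

The measures can then be recomputed and replotted on each filtered set to see how the pairwise comparisons change.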

7. (Bonus 1p) Listen to the presentation by Peter Donnelly. http://www.ted.com/talks/peter_donnelly_shows_how_stats_fool_juries.html "Extract" from it, in a formal way, the examples of statistical argumentation (coin tosses, HIV, and cot death).

Data Mining 2012/13 fall
