HW 04 (11.10) Clustering
1. Perform a "single link" clustering of the 2-D data from slide 28 (listed below). Use Euclidean distance as the distance measure. Draw a dendrogram (tree) in which the height of each node is the distance at which the two clusters were merged. Hint: first plot the points in 2-D, then simulate the procedure by hand. (Solutions on paper are OK :-)
     X  Y
A    2  4
B    7  3
C    3  5
D    5  3
E    7  4
F    6  8
G    6  5
H    8  4
I    2  5
J    3  7
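To check a hand-drawn dendrogram, the manual simulation can be mirrored in a few lines of Python. This is only a minimal sketch, not a required part of the solution: the names POINTS, link_distance and single_link are made up for this illustration, and ties between equally close pairs are broken arbitrarily, so the merge order among ties may differ from yours while the merge heights should agree.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+

# The ten points from task 1.
POINTS = {
    "A": (2, 4), "B": (7, 3), "C": (3, 5), "D": (5, 3), "E": (7, 4),
    "F": (6, 8), "G": (6, 5), "H": (8, 4), "I": (2, 5), "J": (3, 7),
}

def link_distance(a, b, points):
    """Single link: the smallest distance between any member of a and any member of b."""
    return min(dist(points[p], points[q]) for p in a for q in b)

def single_link(points):
    """Return the merges as (cluster_1, cluster_2, height) tuples, in merge order."""
    clusters = [(name,) for name in sorted(points)]  # start with singletons
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (the first such pair found on ties).
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: link_distance(pair[0], pair[1], points))
        merges.append((a, b, link_distance(a, b, points)))
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges

for a, b, height in single_link(POINTS):
    print("".join(a), "+", "".join(b), f"at height {height:.3f}")
```

Each printed line corresponds to one internal node of the dendrogram; the height is where that node should be drawn.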
2. A problem with association analysis is that it often produces far too many association rules. Propose a distance measure for comparing association rules, and envision a hierarchical clustering procedure built on it. How would you present such a clustering result to end users (e.g. make one "PowerPoint slide" with a sketch)? Discuss the strengths and weaknesses of your solution.
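As a possible starting point only, one simple candidate is the Jaccard distance between the item sets of two rules. The sketch below assumes a rule is represented as a pair (antecedent, consequent) of item sets; that representation and the function name rule_distance are assumptions made for illustration. Note that this measure ignores the direction of a rule as well as its support and confidence, which is exactly the kind of weakness the task asks you to discuss.

```python
def rule_distance(rule_1, rule_2):
    """Jaccard distance between the items mentioned in two association rules.

    Each rule is an (antecedent, consequent) pair of item sets (an assumed
    representation). The result is 0.0 for identical item sets and 1.0 for
    rules that share no items.
    """
    items_1 = set(rule_1[0]) | set(rule_1[1])
    items_2 = set(rule_2[0]) | set(rule_2[1])
    union = items_1 | items_2
    if not union:  # two empty rules: treat them as identical
        return 0.0
    return 1.0 - len(items_1 & items_2) / len(union)

# Hypothetical example rules: {bread, butter} -> {milk} and {bread} -> {milk}
r1 = ({"bread", "butter"}, {"milk"})
r2 = ({"bread"}, {"milk"})
print(rule_distance(r1, r2))
```

A pairwise distance of this kind is all a hierarchical clustering procedure needs as input.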
3. Compare the UPGMA and WPGMA hierarchical clustering methods. In which situations would you recommend one over the other?
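As a reminder of where the two methods differ, the sketch below shows the standard distance update applied after clusters A and B are merged, giving the distance from the new cluster to some other cluster C. The function and parameter names are chosen for this illustration only.

```python
def upgma_update(d_ac, d_bc, size_a, size_b):
    """UPGMA: average weighted by cluster sizes, so every original point counts equally."""
    return (size_a * d_ac + size_b * d_bc) / (size_a + size_b)

def wpgma_update(d_ac, d_bc):
    """WPGMA: plain average of the two distances, so both merged clusters count equally."""
    return (d_ac + d_bc) / 2

# A has 3 members at distance 2.0 from C; B has 1 member at distance 8.0 from C.
print(upgma_update(2.0, 8.0, size_a=3, size_b=1))
print(wpgma_update(2.0, 8.0))
```

When the merged clusters have equal sizes the two updates coincide; the difference only appears for unbalanced merges.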
4. Listen to the presentation by Tamara Munzner, "Keynote on Visualization Principles": http://vizbi.org/Videos/26205288 (use the PDF slide deck from there as well: http://bit.ly/nCJM5U). Which aspects were most interesting or striking to you?
5. Revisit task 2 in light of the lessons from Tamara's presentation. Improve your "PowerPoint" visualisation, using some ideas from the presentation to "add spice".
6. Bonus (2p): Read about document similarity measures: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf Summarise the chapter's contents and main structure on one page.
7. Bonus (2p): Use MeV, R, or any other tool that offers hierarchical clustering to cluster some data of interest to you. If nothing else, use the 2x2 contingency table data generated last week; in that case, also add one or more columns of "interestingness" scores to those data.