Data Mining - Courses - Institute of Computer Science

HW 8 (due April 12th) Clustering ...

1. Use the following data:

Perform hierarchical clustering using Euclidean distance and single link clustering (minimum distance), simulating this on paper. Note: you do not necessarily need to calculate all distances since you can simulate it visually. Sketch manually a dendrogram explaining the clustering.

2. Simulate the complete linkage (maximal distance) version of the hierarchical clustering on above data. Sketch a dendrogram. Compare to previous clustering.

On both tasks use some approximation and judgement. Focus on principles, not the precision of floating point calculations... Likewise, since there are many equal distances, use your own consideration, which one to "choose first".

3. Use the same data, perform the simulation of the K-means clustering into 3 clusters. Use first three points (1,2,3) as the starting seeds.

4. Look at the image in the "ImageShuffle" tool DM2015 image with an image with reshuffled rows. Respective RGB values are in http://biit.cs.ut.ee/imgshuffle/data/DM2015/DM2015.jpg.txt

Sort the row id-s by the sum of R, G, and B channels and the total brightness. Submit respective row id-s to the above "ImageShuffle" tool to see the respective "poor reconstruction" of the original image.

5. Use some (any) implementation of K-means and cluster the above data. Fetch the row-id-s from the clusters in some order, and paste in above web tool - show the respective "emerging image". Vary the K - 10, 20, 40. Are these any better?

6. (bonus 2p) Use some software for hierarchical clustering of the above data and get the row-id's in that clustering order. You can use any software you find - R, python, WEKA, MeV, perl, EPCLUST, ...

Data Mining 2014/15 spring

HW 8 (due April 12th) Clustering ...