HW09 (due Nov 27th) Clustering
Exercise 1 (EX1) (1 point)
Consider the following toy dataset consisting of 9 instances with 2 numeric features x and y:
ID | x | y |
P1 | 0 | 1 |
P2 | 1 | 1 |
P3 | 4 | 0 |
P4 | 2 | 3 |
P5 | 0 | 5 |
P6 | 1 | 6 |
P7 | 3 | 6 |
P8 | 7 | 4 |
P9 | 6 | 8 |
(1a) Perform agglomerative hierarchical clustering of the toy data above. Calculate distances between instances using the Euclidean distance in the 2-dimensional space determined by features x and y. For calculating distances between clusters, use single linkage. Do not write a program to do all this; instead, do it manually. This means that you should make a plot of the data and determine the shortest distances visually rather than with the computer. If in doubt, it is fine to consult the computer for the precise values of individual distances. Each time you merge two clusters (even if each consists of only a single instance), please report the following:
- List of instance IDs in one cluster
- List of instance IDs in the other cluster
- Distance between these clusters
Sometimes there are ties, i.e. alternative merges with the same distance; in such cases just decide randomly which of the alternatives to choose. Draw a dendrogram of the clustering. It is fine to solve this task entirely with pen and paper and include a photo or scan of it in the submitted report. Hint: paper with a grid will make your life easier.
(1b) Repeat the same as in (1a), but now calculate distances between clusters with complete linkage instead of single linkage.
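The task itself should be solved by hand, but if you want to double-check individual distance values or the final dendrograms, here is a minimal R sketch (the data frame simply re-enters the toy table above):

```r
# Toy data from the table above
pts <- data.frame(x = c(0, 1, 4, 2, 0, 1, 3, 7, 6),
                  y = c(1, 1, 0, 3, 5, 6, 6, 4, 8),
                  row.names = paste0("P", 1:9))

d <- dist(pts)                          # pairwise Euclidean distances
round(as.matrix(d), 2)                  # exact values, for checking doubtful cases

plot(hclust(d, method = "single"))      # dendrogram to verify (1a)
plot(hclust(d, method = "complete"))    # dendrogram to verify (1b)
```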
Exercise 2 (EX2) (1 point)
Explore the visualization of the K-means algorithm at https://www.naftaliharris.com/blog/visualizing-k-means-clustering/. On that website, run K-means on the Uniform dataset, initializing with K=4 randomly chosen cluster centers. Note that you must click the buttons ‘Update Centroids’ and ‘Reassign Points’ multiple times, until convergence, i.e. until there are no changes in the contents of the clusters. Next, try to cluster the Smiley Face dataset on the same website by choosing the initial cluster centers manually. Is it possible to choose the centers so that one cluster contains only the ‘left eye’ of the smiley? Try different values of K and different ways of initializing the cluster centers. You do not have to report anything from this website; we will just discuss the results in the practice session.
Next, perform K-means on the toy data from the previous exercise EX1. Do it manually, as you did in EX1. Use the Euclidean distance measure and K=4, initializing the centers at the instances with IDs P1, P2, P3 and P4. As the algorithm proceeds, the cluster centers can move to points that do not coincide with any of the original data points; please use the letters A, B, C, ... as identifiers for these new points, in order to simplify reporting the results of the algorithm. After each step of assigning instances to clusters, please report for each cluster:
- The ID of the cluster center
- The list of instance IDs that have been assigned to this cluster
After each step of calculating a new center please report for each cluster:
- The ID of the old center of the cluster
- The ID of the new center of the cluster
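For cross-checking your manual run afterwards, here is a minimal sketch using base R's kmeans with the centers initialized at P1-P4. The "Lloyd" algorithm performs exactly the assign-then-update iterations you carry out by hand:

```r
# Toy data from EX1
pts <- data.frame(x = c(0, 1, 4, 2, 0, 1, 3, 7, 6),
                  y = c(1, 1, 0, 3, 5, 6, 6, 4, 8),
                  row.names = paste0("P", 1:9))

# K = 4, centers initialized at P1, P2, P3, P4
km <- kmeans(pts, centers = pts[1:4, ], algorithm = "Lloyd", iter.max = 100)
km$cluster   # final cluster assignment of each instance
km$centers   # final cluster centers (may not coincide with any data point)
```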
Exercise 3 (EX3) (2 points)
In this task you will again work with the MNIST dataset; this time you are given a very small training set (50 images) and a much bigger test set (1950 images). Here is a ZIP file with .RDS files (use readRDS to read them).
1. Use the Kmeans function from the amap package (set method to "correlation" and iter.max to at least 100) to cluster all 1950 + 50 = 2000 images (choose k = 10, matching the number of expected classes).
2. Look at each of the clusters and assign to each test image the class that corresponds to the most frequent label among the training labels inside its cluster. This way, predict labels for all test images.
3. Compare the predicted labels with the true test labels and compute the accuracy of this approach (store it somewhere). A sketch of steps 1-3 is given after this list.
4. Now use the function prcomp (or any other implementation of PCA) and perform PCA on this data.
5. Repeat steps 1-3, but now instead of starting from the dataset with its original features, start from the dataset with the first 3 PCA features (PC1, PC2, PC3). Repeat steps 1-3 again with different numbers of PCs: the first 4 PCs, the first 5 PCs, and so on up to 10, then try 20, 30 and 40. Once you are done with this, you should have a stored classification accuracy for each number of PCs that you tried.
6. Visualize all 1950 + 50 instances in a 2-dimensional plot with PC1 on the x-axis and PC2 on the y-axis. Use colour to indicate the class of each instance. Report whether or not the classes are well separable using PC1 and PC2.
7. Finally, train a KNN classifier (with K = 5) on the 50 training images (no cross-validation or parameter tuning is needed), test it on the test images, and record its accuracy.
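A minimal sketch of steps 1-3, assuming the ZIP unpacks to the hypothetical file names below (adjust them to whatever the actual .RDS files are called):

```r
library(amap)

# Hypothetical file names -- check the actual names inside the ZIP
train_x <- readRDS("train_images.rds")   # 50 x 784 pixel matrix
train_y <- readRDS("train_labels.rds")   # 50 digit labels
test_x  <- readRDS("test_images.rds")    # 1950 x 784 pixel matrix
test_y  <- readRDS("test_labels.rds")    # 1950 digit labels

all_x    <- rbind(train_x, test_x)       # all 2000 images, clustered together
is_train <- rep(c(TRUE, FALSE), c(nrow(train_x), nrow(test_x)))

# Cluster, label each cluster by majority vote over its training images,
# then score the induced predictions on the test images
cluster_accuracy <- function(x, k = 10) {
  km <- Kmeans(x, centers = k, method = "correlation", iter.max = 100)
  cl_label <- sapply(seq_len(k), function(cl) {
    labs <- train_y[km$cluster[is_train] == cl]
    if (length(labs) == 0) NA else names(which.max(table(labs)))
  })
  pred <- cl_label[km$cluster[!is_train]]
  # a cluster with no training images (NA label) counts as a wrong prediction
  mean(!is.na(pred) & pred == as.character(test_y))
}

set.seed(1)
acc_original <- cluster_accuracy(all_x)  # accuracy on the original 784 features
```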
Make a final figure showing the classification accuracy you achieved using different numbers of PCs (number of PCs on the x-axis, accuracy on the y-axis). In this figure you should also present the accuracy achieved by clustering the original dataset and the accuracy of KNN (e.g. use geom_hline with different colours). Thoroughly interpret your results. Which method is preferred when there is a lack of labelled data?
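A sketch of the remaining steps, continuing from the objects defined in the previous sketch (PCA features, the PC1/PC2 scatter plot, a KNN baseline via class::knn, and the final accuracy figure):

```r
library(class)     # for knn
library(ggplot2)

# Continues from the previous sketch: all_x, train_x, test_x, train_y, test_y,
# cluster_accuracy() and acc_original are assumed to exist
pca <- prcomp(all_x)

# Accuracy of clustering on the first p principal components, for each tried p
n_pcs   <- c(3:10, 20, 30, 40)
acc_pca <- sapply(n_pcs, function(p) cluster_accuracy(pca$x[, 1:p, drop = FALSE]))

# PC1 vs PC2, coloured by class
plot_df <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                      class = factor(c(as.character(train_y), as.character(test_y))))
ggplot(plot_df, aes(PC1, PC2, colour = class)) + geom_point()

# KNN baseline trained on the 50 labelled images only
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
acc_knn  <- mean(knn_pred == test_y)

# Final figure: accuracy vs number of PCs, with the two baselines as hlines
ggplot(data.frame(pcs = n_pcs, acc = acc_pca), aes(pcs, acc)) +
  geom_line() + geom_point() +
  geom_hline(yintercept = acc_original, colour = "red") +  # clustering, original features
  geom_hline(yintercept = acc_knn, colour = "blue") +      # KNN baseline
  labs(x = "Number of PCs", y = "Classification accuracy")
```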