Homework 5 (HW05) - Machine learning I
Exercise 1 (EX1) (1 point)
Explain the K-Nearest Neighbour (K-NN) algorithm on the 2D data given below using Euclidean distance. First visualise the dataset. Then explain how to classify the instance with coordinates (3, 5) when K = 1, K = 5 and K = 10. Do you get the same prediction? What happens when K grows? Note: you don't have to calculate the distances by hand if the visualisation is clear enough to identify the K closest neighbours. Pay attention, though, that the axes are on the same scale (don't stretch the plane, or the distances will change).
ID | x_coord | y_coord | class |
---|---|---|---|
1 | 3 | 9 | 1 |
2 | 2 | 4 | 1 |
3 | 3 | 3 | 1 |
4 | 1 | 6 | 1 |
5 | 4 | 1 | 0 |
6 | 9 | 3 | 0 |
7 | 5 | 6 | 0 |
8 | 6 | 4 | 0 |
9 | 6 | 2 | 0 |
10 | 3 | 7 | 0 |
Hints for plotting the data in ggplot:
- make the class attribute into a factor
- you can use geom_text() to show the instance IDs
- you can use coord_fixed() to avoid the scaling
- you can try what this does (you can also do the same for y by changing x to y): scale_x_continuous(minor_breaks = seq(0, 10, 1), breaks = seq(0, 10, 1))
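Putting these hints together, a plotting sketch might look like this (the data frame below just re-enters the table above; the aesthetics and nudge value are arbitrary choices you can adjust):

```r
library(ggplot2)

# re-enter the table above as a data frame
df <- data.frame(
  ID      = 1:10,
  x_coord = c(3, 2, 3, 1, 4, 9, 5, 6, 6, 3),
  y_coord = c(9, 4, 3, 6, 1, 3, 6, 4, 2, 7),
  class   = factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0))  # class as a factor
)

ggplot(df, aes(x = x_coord, y = y_coord, colour = class)) +
  geom_point() +
  geom_text(aes(label = ID), nudge_y = 0.3) +   # show instance IDs next to the points
  coord_fixed() +                               # keep both axes on the same scale
  scale_x_continuous(minor_breaks = seq(0, 10, 1), breaks = seq(0, 10, 1)) +
  scale_y_continuous(minor_breaks = seq(0, 10, 1), breaks = seq(0, 10, 1))
```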
Exercise 2 (EX2) (1 point)
In this exercise you will familiarise yourself with the MNIST dataset. Do it by following these instructions and answering the given questions.
a. Load the MNIST dataset from the RDS file from HERE using the readRDS() function. What is this dataset about? Get the structure of the dataset using str(). How many digits are there from each class (and which classes)? What is the dimensionality of each image?
mnist <- readRDS(file)
b. Visualise a few examples of each class. Code for visualising one image is given below. Do the labels of the images correspond to what is on the image?
# first we need to define colors
colors <- c('black', 'white')
cus_col <- colorRampPalette(colors = colors)
# play around with the index of the image that you want to visualise
index <- 1
img <- array(mnist$x[index, ], dim = c(28, 28))
img <- img[, 28:1]
image(img, col = cus_col(256))
The code for getting the label of the image:
head(mnist$y)
print(paste("Correct label of the first image is:", mnist$y[1]))
c. How can we visualise these labels in a bit more compact way? Use ggplot2 library and geom_bar to visualise class distribution.
d. In principle, we would also like to somehow visualise all of the data, but as it is 784-dimensional, there is no straightforward way to visualise all of it "as-is". Nonetheless, we should try to come up with some visualisations, as those are often key to forming intuition about the data. Indeed, by visualising the data you essentially "plug it" directly into your brain, the best pattern-analysis machine we have so far.
Let us, for example, see how the 400th attribute (a pixel somewhere in the center of the image) is distributed for different digits.
pixel <- data.frame('value' = mnist$x[, 400])
ggplot(pixel, aes(x = value)) + geom_histogram(binwidth = 0.1) + theme_bw()
Now use the facet_wrap() function to make separate plots for different digits (hint: check out the ggplot2 cheatsheet here). Explain what you observe!
Exercise 3 (EX3) (1 point)
Continuation of EX2.
Although in practice you typically want to use as much data as possible to train your models, large datasets are not very suitable for experimenting and visualisation - it is just too annoying to wait minutes for each step to finish. Thus, the 60000 examples we had in EX2 are a bit too much for our purposes, so we will keep just 4000. For that we are going to use the R function sample(). We should also split our data into training and test sets in order to be able to estimate the true performance of our model later. Use the first 3000 images as the training set and the last 1000 as the test set.
a. Split the data into training and testing data by filling in the ? parts in the following code:
# You can use set.seed() in order to reproduce stochastic results
set.seed(1111)
new_indx <- sample(c(1:nrow(mnist$x)), size = 4000, replace = FALSE)
sample_img <- mnist$x[new_indx, ]
sample_labels <- mnist$y[new_indx]
train_img <- sample_img[?, ]
train_labels <- sample_labels[?]
test_img <- sample_img[?, ]
test_labels <- sample_labels[?]
str(train_img)  # Make sure you have 3000 rows here
str(test_img)   # Make sure you have 1000 rows here
b. Next we define the distance function between two images to be the Euclidean distance. Fill in the code below so that the function calculates and returns the Euclidean distance between img1 and img2.
dist <- function(img1, img2) {
  # TODO
  # calculate the Euclidean distance between img1 and img2
}
You can play with the defined function below. Check that the distance between images of the same digit is typically smaller than the distance between images of different digits.
print(paste("Distance between images of class", train_labels[2], "and", train_labels[8],
            "is", dist(train_img[2, ], train_img[8, ])))
print(paste("Distance between images of class", train_labels[2], "and", train_labels[4],
            "is", dist(train_img[2, ], train_img[4, ])))
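As a sanity check for whatever implementation you write, Euclidean distance on plain numeric vectors behaves like this (a sketch using a separate helper name, euclid, so it does not clash with the dist() you are asked to write):

```r
# a minimal Euclidean distance between two numeric vectors
euclid <- function(a, b) sqrt(sum((a - b)^2))

euclid(c(0, 0), c(3, 4))  # a 3-4-5 triangle: distance 5
```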
c. Now we implement the actual algorithm of predicting one instance's label with Nearest Neighbour (NN). It consists of three steps:
- Compute distances to all points in the dataset
- Find the closest point, and
- Report the corresponding label.
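On toy 2D data, independent of the MNIST variables, the three steps above can be sketched like this (euclid is a hypothetical stand-in for the distance function you wrote in part b):

```r
euclid <- function(a, b) sqrt(sum((a - b)^2))

points <- rbind(c(0, 0), c(5, 5), c(2, 2))  # three "training" points
labels <- c("A", "B", "A")                  # their labels
query  <- c(1, 1)                           # point to classify

d <- apply(points, 1, function(p) euclid(query, p))  # step 1: all distances
i <- which.min(d)                                    # step 2: index of the closest point
labels[i]                                            # step 3: its label, here "A"
```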
We shall do it step by step: your job is to fill in the TODO parts in the code and answer the questions. First we pick a sample image that we will be classifying. Let us say this is an unknown image sent by our friend:
unknown_img <- test_img[1, ]
true_label <- test_labels[1]
c.1. Compute all distances from the `unknown_img` to the images in the dataset. (No questions here to answer).
- we shall use the function apply(), which applies a given function over the rows or columns of a matrix
- here we use apply to iterate through the images by rows (the MARGIN argument '1') and
- for each image compute the distance to the unknown image
all_distances <- apply(train_img, 1, function(img) dist(unknown_img, img))
head(all_distances)
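As a reminder of how apply() iterates over rows, here is a tiny standalone example:

```r
m <- matrix(1:6, nrow = 2)  # rows are (1, 3, 5) and (2, 4, 6)
apply(m, 1, sum)            # per-row sums: 9 and 12
```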
c.2. Now let's find out which image is closest to our `unknown_img`. Fill in the code.
closest_index <- # TODO
c.3. Almost done: now report the label at closest_index in train_labels by filling in the code.
predicted_label <- # TODO
Compare it to the true label of the first image in the test labels. Is it the same?
(predicted_label == true_label)
print(paste("Predicted class for the first image is", predicted_label, "and the true label is", true_label))
Use code from the previous exercise to plot the first example and visually confirm its label.
d. Now let's make a function out of the code we have already written; fill in the empty parts using the code you wrote above.
classify <- function(unknown_img) {
  all_distances <- # ADD YOUR CODE THAT COMPUTES ALL DISTANCES
  return()  # REPORT THE LABEL OF THE CLOSEST NEIGHBOUR
}
Test it to verify that it works.
print(paste("Predicted class for the first image is", classify(unknown_img),"and the true label is", true_label))
e. One very popular variation of Nearest Neighbour is K-nearest neighbour. In this algorithm the label for a new instance is chosen by majority vote among its k nearest neighbours.
The actual algorithm is not very different from vanilla nearest neighbour:
- Compute distances to all points in the dataset
- Find the k closest points
- Report the most popular label among these k.
Implement the KNN algorithm, reusing code from the dist() and classify() functions.
classify_knn <- function(unknown_img, k = 5) {
  # This step we already know from the previous exercises
  all_distances <- # CTRL C + CTRL V FROM ABOVE

  # We need to get the indexes of the k smallest distances
  # (hint: use functions order() and head())
  knn <- # FILL THIS IN

  # you can print potential predictions
  # print(train_labels[knn])
  # print(names(sort(table(train_labels[knn]), decreasing = TRUE)))

  # A very small step is left: return the most frequently predicted label
  return(names(sort(table(train_labels[knn]), decreasing = TRUE))[1])
}
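The majority-vote idiom used in the return line can be sketched on a plain label vector:

```r
votes <- c("3", "8", "3", "3", "8")              # labels of the k nearest neighbours
names(sort(table(votes), decreasing = TRUE))[1]  # most frequent label: "3"
```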
Test this version of KNN and experiment with different values of k.
print(paste("Predicted class for the first image is", classify_knn(unknown_img, k = 100),"and the true label is", true_label))
Exercise 4 (EX4) (1 point)
In the previous exercise we saw that K Nearest Neighbor indeed works! How about applying it to all 1000 test images and then estimating its effectiveness?
a. Classify all test images and store the results in a separate variable `test_predicted`, choosing `k` = 5. It might be a bit slow...
test_predicted <- "... your code here ..."
head(test_predicted)
How many instances from the test set did the classifier predict correctly?
Now we will use accuracy (namely, the proportion of correctly predicted classes) to estimate the performance of our nearest neighbour classifier. For that we need to divide the number of correctly predicted images by the total number of images. Report the accuracy of your KNN classifier on the training data and the testing data separately. Note: you have to figure out yourself how to calculate it on the training data.
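On toy vectors, accuracy is simply the mean of the element-wise comparison (a sketch, not the answer for the MNIST variables):

```r
truth     <- c(1, 7, 3, 3, 8)
predicted <- c(1, 7, 5, 3, 8)
mean(predicted == truth)  # 4 correct out of 5, i.e. 0.8
```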
knn_accuracy <- # COMPUTE ACCURACY
print(paste("Final accuracy of our nearest neighbor classifier is", knn_accuracy, "- not bad!"))
Let's examine some of the misclassified examples. You can play around with the index of the misclassified instance to visually examine some of the difficult cases.
# Set the index of the misclassified instance you want to examine
index <- 12
miss_ind <- which(test_predicted != test_labels)[index]  # remember the function `which` in R?
colors <- c('black', 'white')
cus_col <- colorRampPalette(colors = colors)
img <- array(test_img[miss_ind, ], dim = c(28, 28))
img <- img[, 28:1]
image(img, col = cus_col(256))
print(paste("This image of class", test_labels[miss_ind], "was incorrectly predicted as", test_predicted[miss_ind]))
b. Now let us use the caret package to train and test the k Nearest Neighbor classifier, avoiding the need to implement it ourselves.
Most of the classifiers that we are going to use in this course are implemented in caret; for example, the KNN that we implemented in this and the previous lesson is available as knn. All you need to do is use method = "knn" in caret's train() function.
library(caret)

# We will discuss this line in more detail later;
# we need it now because without it caret tries to be very smart
# and training takes too much time...
ctrl <- trainControl(method = "none", number = 1, repeats = 1)

# we should use train_labels as factors, otherwise caret thinks that
# this is a regression problem
(knn_fit <- train(y = as.factor(train_labels), x = data.frame(train_img),
                  method = "knn", trControl = ctrl,
                  tuneGrid = data.frame(k = 5)))
Use the learned nearest neighbor classifier (model) to predict the test images:
test_predicted <- predict(knn_fit, data.frame(test_img))
print(paste("Accuracy of caret nearest neighbor classifier is",
            sum(test_predicted == test_labels) / length(test_labels)))
A useful way to study classification results is by examining the confusion matrix, which counts pairs (true_class, predicted_class):
confusionMatrix(test_predicted, as.factor(test_labels))
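The idea behind a confusion matrix can be seen with base R's table() on toy vectors; caret's confusionMatrix() builds such a table and adds summary statistics on top:

```r
truth     <- factor(c("3", "3", "8", "8", "8"))
predicted <- factor(c("3", "8", "8", "8", "3"), levels = levels(truth))
table(predicted, truth)  # rows: predicted class, columns: true class
```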
Make an attempt to interpret the output of the confusion matrix. Then train 2 K-NN classifiers with different Ks. Report the accuracies of both models on the training and testing data.
c. What about other classifiers, like decision trees? They are also in caret (here is a full list of models: https://topepo.github.io/caret/available-models.html).
Find and train the C5.0 algorithm (or another type of decision tree algorithm, e.g. the CART algorithm in rpart) on the training data and test it on the test images. Report the final accuracy on the training and testing data and the corresponding confusion matrix. If you want, you can also play with the parameters and experiment with different settings.
Hint: If you use rpart, then you also have to change this part of the code: tuneGrid = data.frame(k = 5). You have to use the parameter cp instead of k. To understand what it means, read about it in ?rpart.control. If the accuracy is initially bad, try changing the parameter value. If you want to use C5.0, you need to use tuneGrid = data.frame(winnow = c(TRUE), trials = 2, model = "tree"). You can change the values of these parameters as well.
d. Summarise the results by nicely showing all accuracies you calculated (on both training and testing data). Comment on the results. Which model worked the best?