Introduction

Recommender systems are one of the most lucrative examples of applied machine learning. Such techniques as cross-sell and up-sell are widely used by many e-commerce companies regardless of their products. This is a large subfield by itself, and today we will just approach the top of the iceberg.

Our dataset about movies taken from: https://grouplens.org/datasets/movielens/.

Install the package recommenderlab that we use for the analysis.

Libraries

library(data.table)
library(tidyr)
library(dplyr)
library(ggplot2)

library(anytime) #install.packages("anytime")
library(recommenderlab) #install.packages("recommenderlab")
library(stringr) #install.packages("stringr")

Loading the data

movies <- <YOUR CODE> # movies.csv
ratings <- <YOUR CODE> # ratings.csv

Again, first step is simple descriptive stats and vizualizations:

First look at the data:

View(movies)
View(ratings)

Let’s look at how much unique movies and users we have:

length(unique(movies$movieId))
## [1] 9125
length(unique(ratings$userId))
## [1] 671

And amount of the movies that were rated:

length(unique(ratings$movieId))
## [1] 9066

As you can see, not all movies were rated.

Let’s plot the rating to see the distribution:

<YOUR CODE>

Films are mostly rated with mark 3, 4 or 5.

Now, lets look at average rating that each user gives:

ratings %>% 
  group_by(userId) %>%
  summarise(avg_rating=mean(rating, na.rm=T)) %>%
  
  <YOUR CODE> # create a plot

And also we can check the average rating per each movie:

ratings %>% 
  <YOUR CODE>

You probably paid attention that we have a timestamps. They represent the exact time when user put the rating to the movie.

Let’s convert the timestamps to dates for 10 first movies and see, how their rating changed during time.

#anytime
ratings <- ratings %>%
  mutate(date = as.Date(anytime(timestamp)))

Next we filter out unnecessary data:

fst_ten_movies <- movies[1:10,]

rating_fst_ten_movies <- ratings %>%
  filter(movieId %in% fst_ten_movies$movieId)

Now we have to decide what how should we see the data. Should it be just years, or should we also include a month?

Let’s see the minnimum and maximum dates:

min(rating_fst_ten_movies$date)
## [1] "1996-03-30"
max(rating_fst_ten_movies$date)
## [1] "2016-10-06"

The period is almost 10 years. Let’s build a plot based which will show us how the rating of the movies changed during the years.

First we have to calculate average rating for each year.

fst_ten <- left_join(rating_fst_ten_movies, fst_ten_movies, by="movieId") %>%
  mutate(year = format(date,"%Y"))


fst_ten <- fst_ten %>%
  <YOUR CODE> %>% # calculate cumulative rating number of rows and average rating
fst_ten  %>%
  ggplot(aes(x=year, y=avRating, group=movieId, color=as.factor(movieId))) +
    geom_line()

Now, lets replace ratings for the same user and movie with an average rating:

dim(ratings)

ratings <- ratings %>%
  <YOUR CODE> # calculate mean rating

dim(ratings)
ratings <- ratings

As we won’t need timestamps later, we just removed them from our data.

Genre of the movie

Next, let’s try first to find movies that are similar based on their genres. As we have our genres stored in format like “genre1|genre2” we will gave to separate them.

splitting_genres = strsplit(movies$genres, "|", fixed=TRUE) # split vector genres by '|'
genres <- unique(unlist(splitting_genres)) # collect all possible genres into one vector

Now that we extracted genres, we will create matrix with zeros, where rows are movies and columns - genres.

movie_genres_dummy <- as.data.frame(matrix(0, ncol=length(genres), nrow=nrow(movies)))
colnames(movie_genres_dummy) <- genres

We will fill the matrix by putting 1 if the genre is present for this movie, or leaving 0 otherwise.

# may take some time
for (movie_id in 1:length(splitting_genres)) {
  movie_genres_dummy[movie_id, splitting_genres[[movie_id]]] <- 1
}

head(movie_genres_dummy)
##   Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 1         1         1        1      1       1       0     0      0     0
## 2         1         0        1      0       1       0     0      0     0
## 3         0         0        0      1       0       1     0      0     0
## 4         0         0        0      1       0       1     1      0     0
## 5         0         0        0      1       0       0     0      0     0
## 6         0         0        0      0       0       0     0      1     1
##   Thriller Horror Mystery Sci-Fi Documentary IMAX War Musical Western
## 1        0      0       0      0           0    0   0       0       0
## 2        0      0       0      0           0    0   0       0       0
## 3        0      0       0      0           0    0   0       0       0
## 4        0      0       0      0           0    0   0       0       0
## 5        0      0       0      0           0    0   0       0       0
## 6        1      0       0      0           0    0   0       0       0
##   Film-Noir (no genres listed)
## 1         0                  0
## 2         0                  0
## 3         0                  0
## 4         0                  0
## 5         0                  0
## 6         0                  0

We want to find similar movies. It can be used in such a way that if one client watches the movies, we recommend the similar by the amount of genres. The most straightforward way is to calculate the distance (inverse of similarity). Note that we have a large amount of movies and calcualting the distance for the whole matrix is computationally very expensive.

# distance for 1st and 2nd movie
head(movie_genres_dummy,2)
##   Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 1         1         1        1      1       1       0     0      0     0
## 2         1         0        1      0       1       0     0      0     0
##   Thriller Horror Mystery Sci-Fi Documentary IMAX War Musical Western
## 1        0      0       0      0           0    0   0       0       0
## 2        0      0       0      0           0    0   0       0       0
##   Film-Noir (no genres listed)
## 1         0                  0
## 2         0                  0
dist(movie_genres_dummy[1,], movie_genres_dummy[2,], method = 'binary')
##     2
## 1 0.4
rbind(movie_genres_dummy[1,], movie_genres_dummy[5,])
##   Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 1         1         1        1      1       1       0     0      0     0
## 5         0         0        0      1       0       0     0      0     0
##   Thriller Horror Mystery Sci-Fi Documentary IMAX War Musical Western
## 1        0      0       0      0           0    0   0       0       0
## 5        0      0       0      0           0    0   0       0       0
##   Film-Noir (no genres listed)
## 1         0                  0
## 5         0                  0
dist(movie_genres_dummy[1,], movie_genres_dummy[5,], method = 'binary')
##     5
## 1 0.8

We can build distance matrix for first 10 movies:

mx_d <- dist(movie_genres_dummy[1:10,],  method = 'binary')
mx_d
##            1         2         3         4         5         6         7
## 2  0.4000000                                                            
## 3  0.8333333 1.0000000                                                  
## 4  0.8571429 1.0000000 0.3333333                                        
## 5  0.8000000 1.0000000 0.5000000 0.6666667                              
## 6  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000                    
## 7  0.8333333 1.0000000 0.0000000 0.3333333 0.5000000 1.0000000          
## 8  0.6000000 0.3333333 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 9  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000
## 10 0.8571429 0.8000000 1.0000000 1.0000000 1.0000000 0.5000000 1.0000000
##            8         9
## 2                     
## 3                     
## 4                     
## 5                     
## 6                     
## 7                     
## 8                     
## 9  1.0000000          
## 10 0.7500000 0.6666667

Let’s find movies with the smallest distance:

min(dist(movie_genres_dummy[1:10,],  method = 'binary'))
## [1] 0

And examine movies with the smallest distance:

rbind(movie_genres_dummy[3,], movie_genres_dummy[7,]) %>%
  cbind(movies[c(3,7),]$title)
##   Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 3         0         0        0      1       0       1     0      0     0
## 7         0         0        0      1       0       1     0      0     0
##   Thriller Horror Mystery Sci-Fi Documentary IMAX War Musical Western
## 3        0      0       0      0           0    0   0       0       0
## 7        0      0       0      0           0    0   0       0       0
##   Film-Noir (no genres listed) movies[c(3, 7), ]$title
## 3         0                  0 Grumpier Old Men (1995)
## 7         0                  0          Sabrina (1995)

Ratings of the movie

Next, in order to recomend something that user would watch, we want to take into account ratings. For that task we will take advantage of recommenderlab, which requires matrix as an input.

ratings_spread <- <YOUR CODE> # columns - movies, rows-users
ratings_spread <- spread(ratings, key=movieId, value=rating) # columns - movies, rows-users
rating_matrix <- as.matrix(ratings_spread[,-1]) # exclude column with user ids
dimnames(rating_matrix) <- list(paste("u", unique(ratings$userId), sep=""),
                                paste("m", unique(ratings$movieId), sep=""))

Next, we create an objet suitable for the package input:

rating_matrix_lab <- as(rating_matrix, "realRatingMatrix")
getRatingMatrix(rating_matrix_lab)
# can be translated to the list
#as(rating_matrix_lab, "list")
image(rating_matrix_lab) # too big to see

# subset of the data
image(rating_matrix_lab[1:20,1:20])

It is recommended to briefly take a look at this tutorial: https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf

There are a lot of different recommender systems, listed by function:

recommenderRegistry$get_entry_names()
##  [1] "ALS_realRatingMatrix"            "ALS_implicit_realRatingMatrix"  
##  [3] "ALS_implicit_binaryRatingMatrix" "AR_binaryRatingMatrix"          
##  [5] "IBCF_binaryRatingMatrix"         "IBCF_realRatingMatrix"          
##  [7] "POPULAR_binaryRatingMatrix"      "POPULAR_realRatingMatrix"       
##  [9] "RANDOM_realRatingMatrix"         "RANDOM_binaryRatingMatrix"      
## [11] "RERECOMMEND_realRatingMatrix"    "SVD_realRatingMatrix"           
## [13] "SVDF_realRatingMatrix"           "UBCF_binaryRatingMatrix"        
## [15] "UBCF_realRatingMatrix"
recommenderRegistry$get_entry("POPULAR", dataType="realRatingMatrix")
## Recommender method: POPULAR for realRatingMatrix
## Description: Recommender based on item popularity.
## Reference: NA
## Parameters:
##   normalize    aggregationRatings aggregationPopularity
## 1  "center" new("standardGeneric" new("standardGeneric"
recommenderRegistry$get_entry("IBCF", dataType="realRatingMatrix")
## Recommender method: IBCF for realRatingMatrix
## Description: Recommender based on item-based collaborative filtering.
## Reference: NA
## Parameters:
##    k   method normalize normalize_sim_matrix alpha na_as_zero
## 1 30 "Cosine"  "center"                FALSE   0.5      FALSE

Let’s try our first model, which is recommendation based on popularity. The method computes average rating for each item based on available ratings and predicts each unknown rating as average for the item.

model <- Recommender(rating_matrix_lab, method = "POPULAR")
recom <- predict(model, rating_matrix_lab[1:4], n=10)
as(recom, "list")
## $u1
##  [1] "m2734" "m2263" "m335"  "m4019" "m2005" "m152"  "m292"  "m231" 
##  [9] "m2108" "m2726"
## 
## $u2
##  [1] "m2734"  "m335"   "m2005"  "m2108"  "m2726"  "m9"     "m1096" 
##  [8] "m2759"  "m4967"  "m48304"
## 
## $u3
##  [1] "m335"   "m2005"  "m292"   "m2108"  "m2726"  "m9"     "m2759" 
##  [8] "m48304" "m3007"  "m2501" 
## 
## $u4
##  [1] "m2734"  "m4019"  "m152"   "m292"   "m2108"  "m9"     "m1096" 
##  [8] "m4967"  "m48304" "m3007"

We can also predict ratings:

prediction <- predict(model, rating_matrix_lab[1:5], type="ratings")
as(prediction, "matrix")[,1:5]
##         m31    m1029    m1061    m1129    m1172
## u1 2.775976 2.394019 2.128042 1.301898 2.227099
## u2 3.712818 3.330861 3.064885 2.238740 3.163941
## u3 3.794603 3.412646 3.146670 2.320525 3.245727
## u4 4.574015 4.192058 3.926082 3.099937 4.025138
## u5 4.135976 3.754019       NA 2.661898 3.587099

Evaluation of the recommender system

We have to create evaluation scheme.

set.seed(5864)
eval_scheme <- evaluationScheme(rating_matrix_lab, method="split", train=0.8, given=-5)
#5 ratings of 20% of users (per user) are excluded for testing

model_popular <- Recommender(getData(eval_scheme, "train"), "POPULAR")

prediction_popular <- predict(model_popular, getData(eval_scheme, "known"), type="ratings")

We can check visually predictions for 50 users and 50 movies:

image(prediction_popular[1:50,1:50])

Now we can obtain statistical measures:

rmse_popular <- calcPredictionAccuracy(prediction_popular, 
                                       getData(eval_scheme, "unknown"))[1]
rmse_popular
##     RMSE 
## 0.935082

If you recall, RMSE (root mean square error) was also used in regression problems. It does not tell us much alone, but we can compare models:

model_ubcf <- Recommender(getData(eval_scheme, "train"), 
                          "UBCF",
                          param=list(
                            normalize = "center", 
                            method="Cosine", 
                            nn=50))

Note that we use the same scheme.

prediction_ubcf <- predict(model_ubcf, getData(eval_scheme, "known"), type="ratings")
rmse_ubcf <- calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown"))[1]
rmse_ubcf
##      RMSE 
## 0.9682466
rbind(calcPredictionAccuracy(prediction_popular, getData(eval_scheme, "unknown")),
      calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown")))
##           RMSE       MSE       MAE
## [1,] 0.9350820 0.8743784 0.7000101
## [2,] 0.9682466 0.9375014 0.7423649