Introduction
Recommender systems are one of the most lucrative examples of applied machine learning. Such techniques as cross-sell and up-sell are widely used by many e-commerce companies regardless of their products. This is a large subfield by itself, and today we will just approach the top of the iceberg.
We will download the dataset about movie recommendations: https://grouplens.org/datasets/movielens/. Please, download movies.csv
and ratings.csv
. Then install the package recommenderlab
that we use for the analysis.
#install.packages("recommenderlab")
library(recommenderlab)
library(data.table)
library(tidyr)
library(dplyr)
library(stringr)
library(ggplot2)
# load the data
movies <- fread(file.choose()) # load movies
ratings <- fread(file.choose()) # load ratings
Again, first step is simple descriptive stats and vizualizations:
head(movies)
## movieId title
## 1: 1 Toy Story (1995)
## 2: 2 Jumanji (1995)
## 3: 3 Grumpier Old Men (1995)
## 4: 4 Waiting to Exhale (1995)
## 5: 5 Father of the Bride Part II (1995)
## 6: 6 Heat (1995)
## genres
## 1: Adventure|Animation|Children|Comedy|Fantasy
## 2: Adventure|Children|Fantasy
## 3: Comedy|Romance
## 4: Comedy|Drama|Romance
## 5: Comedy
## 6: Action|Crime|Thriller
head(ratings)
## userId movieId rating timestamp
## 1: 1 31 2.5 1260759144
## 2: 1 1029 3.0 1260759179
## 3: 1 1061 3.0 1260759182
## 4: 1 1129 2.0 1260759185
## 5: 1 1172 4.0 1260759205
## 6: 1 1263 2.0 1260759151
length(unique(movies$movieId)) # unique movies in db
## [1] 9125
length(unique(ratings$userId)) # 671 users
## [1] 671
length(unique(ratings$movieId)) # 9066 movies rated
## [1] 9066
ggplot(ratings, aes(x=rating)) +
geom_bar(stat='count', fill="#7ba367") +
theme_bw()
ratings %>%
group_by(userId) %>%
summarise(avg_rating=mean(rating, na.rm=T)) %>%
ggplot(aes(x=avg_rating)) +
geom_histogram(fill="#7ba367", color='white', bins = 30) + theme_bw() +
scale_x_continuous("average rating per user")
ratings %>%
group_by(movieId) %>%
summarise(avg_rating=mean(rating, na.rm=T)) %>%
ggplot(aes(x=avg_rating)) +
geom_histogram(fill="#7ba367", color='white', bins = 30) + theme_bw() +
scale_x_continuous("average rating per movie")
Task 1. Please, suggest other things to investigate to get to know the data? Add your R code.
You probably paid attention that we have a timestamps. In case the same user rated the same movie several times, we aggregate the ratings:
# user-ratings as a matrix
dim(ratings)
## [1] 100004 4
ratings <- ratings %>%
group_by(userId, movieId) %>%
summarise(rating=mean(rating)) # in case the same user rated the same movie multiple times
dim(ratings)
## [1] 100004 3
Task 2. Was it the case? How many users rated the same movies several times?
Genre of the movie
Next, let’s try first to find movies that were similar based on their genres. As turned out, it is not as straightforward, as it seems :(. This is one possible solution, and not the most elegant one. You are free to use your own logic here.
splitting_genres = strsplit(movies$genres, "|", fixed=TRUE) # split vector genres by '|'
genres <- unique(unlist(splitting_genres)) # collect all possible genres into one vector
# create matrix with zeros, where rows are movies and columns - genres
movie_genres_dummy <- as.data.frame(matrix(0, ncol=length(genres), nrow=nrow(movies)))
colnames(movie_genres_dummy) <- genres # assign names of genres to columns
# fill 1 if the genre is present for this movie, and 0 otherwise
# warning: may take longer time
for (movie_id in 1:length(splitting_genres)) {
movie_genres_dummy[movie_id, splitting_genres[[movie_id]]] <- 1
}
head(movie_genres_dummy)
## Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 1 1 1 1 1 1 0 0 0 0
## 2 1 0 1 0 1 0 0 0 0
## 3 0 0 0 1 0 1 0 0 0
## 4 0 0 0 1 0 1 1 0 0
## 5 0 0 0 1 0 0 0 0 0
## 6 0 0 0 0 0 0 0 1 1
## Thriller Horror Mystery Sci-Fi Documentary IMAX War Musical Western
## 1 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 1 0 0 0 0 0 0 0 0
## Film-Noir (no genres listed)
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
Task 3. Explain the matrix that we have. What does it show?
We want to find similar movies. It can be used in such a wy that if one client watches the movies, to recommend the similar in amount of genres. The most straightforward way is to calculate the distance (inverse of similarity). Note that we have a large amount of movies and calcualting the distance for the whole matrix is computationally very expensive.
#similarity between movies
dist(movie_genres_dummy[1,], movie_genres_dummy[2,], method = 'binary')
## 2
## 1 0.4
dist(movie_genres_dummy[1,], movie_genres_dummy[5,], method = 'binary')
## 5
## 1 0.8
dist(movie_genres_dummy[1:10,], method = 'binary')
## 1 2 3 4 5 6 7
## 2 0.4000000
## 3 0.8333333 1.0000000
## 4 0.8571429 1.0000000 0.3333333
## 5 0.8000000 1.0000000 0.5000000 0.6666667
## 6 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 7 0.8333333 1.0000000 0.0000000 0.3333333 0.5000000 1.0000000
## 8 0.6000000 0.3333333 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## 9 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000
## 10 0.8571429 0.8000000 1.0000000 1.0000000 1.0000000 0.5000000 1.0000000
## 8 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 1.0000000
## 10 0.7500000 0.6666667
min(dist(movie_genres_dummy[1:10,], method = 'binary'))
## [1] 0
movie_genres_dummy[3,]
## Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 3 0 0 0 1 0 1 0 0 0
## Thriller Horror Mystery Sci-Fi Documentary IMAX War Musical Western
## 3 0 0 0 0 0 0 0 0 0
## Film-Noir (no genres listed)
## 3 0 0
movie_genres_dummy[7,]
## Adventure Animation Children Comedy Fantasy Romance Drama Action Crime
## 7 0 0 0 1 0 1 0 0 0
## Thriller Horror Mystery Sci-Fi Documentary IMAX War Musical Western
## 7 0 0 0 0 0 0 0 0 0
## Film-Noir (no genres listed)
## 7 0 0
movies[c(3,7),]$title
## [1] "Grumpier Old Men (1995)" "Sabrina (1995)"
Task 4. Try different examples and find what are the names of similar movies. Does it make sense?
Ratings of the movie
Next, we want to take into account ratings. For that task we will take advantage of recommenderlab
, which requires matrix as an input.
ratings_spread <- spread(ratings, key=movieId, value=rating) # columns - movies, rows-users
rating_matrix <- as.matrix(ratings_spread[,-1]) # exclude column with user ids
dimnames(rating_matrix) <- list(paste("u", unique(ratings$userId), sep=""),
paste("m", unique(ratings$movieId), sep=""))
Next, we create an objet suitable for the package input:
rating_matrix_lab <- as(rating_matrix, "realRatingMatrix")
getRatingMatrix(rating_matrix_lab)
# can be translated to the list
#as(rating_matrix_lab, "list")
image(rating_matrix_lab) # too big to see
# subset of the data
image(rating_matrix_lab[1:20,1:20])
It is recommended to briefly take a look at this tutorial: https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
There are a lot of different recommender systems, listed by function
recommenderRegistry$get_entry_names()
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "ALS_implicit_binaryRatingMatrix" "AR_binaryRatingMatrix"
## [5] "IBCF_binaryRatingMatrix" "IBCF_realRatingMatrix"
## [7] "POPULAR_binaryRatingMatrix" "POPULAR_realRatingMatrix"
## [9] "RANDOM_realRatingMatrix" "RANDOM_binaryRatingMatrix"
## [11] "RERECOMMEND_realRatingMatrix" "SVD_realRatingMatrix"
## [13] "SVDF_realRatingMatrix" "UBCF_binaryRatingMatrix"
## [15] "UBCF_realRatingMatrix"
recommenderRegistry$get_entry("POPULAR", dataType="realRatingMatrix")
## Recommender method: POPULAR for realRatingMatrix
## Description: Recommender based on item popularity.
## Reference: NA
## Parameters:
## normalize aggregationRatings aggregationPopularity
## 1 "center" new("standardGeneric" new("standardGeneric"
recommenderRegistry$get_entry("IBCF", dataType="realRatingMatrix")
## Recommender method: IBCF for realRatingMatrix
## Description: Recommender based on item-based collaborative filtering.
## Reference: NA
## Parameters:
## k method normalize normalize_sim_matrix alpha na_as_zero
## 1 30 "Cosine" "center" FALSE 0.5 FALSE
Task 5. Take a look at other methods. Discuss during the class, which ones would make sense for this movie recommender. Why?
Let’s try our first model, which is recommendation based on popularity. The method computes average rating for each item based on available ratings and predicts each unknown rating as average for the item.
model <- Recommender(rating_matrix_lab, method = "POPULAR")
recom <- predict(model, rating_matrix_lab[1:4], n=10)
as(recom, "list")
## $u1
## [1] "m2734" "m2263" "m335" "m4019" "m2005" "m152" "m292" "m231"
## [9] "m2108" "m2726"
##
## $u2
## [1] "m2734" "m335" "m2005" "m2108" "m2726" "m9" "m1096"
## [8] "m2759" "m4967" "m48304"
##
## $u3
## [1] "m335" "m2005" "m292" "m2108" "m2726" "m9" "m2759"
## [8] "m48304" "m3007" "m2501"
##
## $u4
## [1] "m2734" "m4019" "m152" "m292" "m2108" "m9" "m1096"
## [8] "m4967" "m48304" "m3007"
Task 6. Find the titles and genres of the movies that are recommended.
We can also predict ratings:
prediction <- predict(model, rating_matrix_lab[1:5], type="ratings")
as(prediction, "matrix")[,1:5]
## m31 m1029 m1061 m1129 m1172
## u1 2.775976 2.394019 2.128042 1.301898 2.227099
## u2 3.712818 3.330861 3.064885 2.238740 3.163941
## u3 3.794603 3.412646 3.146670 2.320525 3.245727
## u4 4.574015 4.192058 3.926082 3.099937 4.025138
## u5 4.135976 3.754019 NA 2.661898 3.587099
Evaluation of the recommender system
set.seed(5864)
eval_scheme <- evaluationScheme(rating_matrix_lab, method="split", train=0.8, given=-5)
#5 ratings of 20% of users (per user) are excluded for testing
model_popular <- Recommender(getData(eval_scheme, "train"), "POPULAR")
prediction_popular <- predict(model_popular, getData(eval_scheme, "known"), type="ratings")
# check visually predictions for 50 users , 50 movies
image(prediction_popular[1:50,1:50])
rmse_popular <- calcPredictionAccuracy(prediction_popular, getData(eval_scheme, "unknown"))[1]
rmse_popular
## RMSE
## 0.935082
If you recall, RMSE (root mean square error) was also used in regression problems. It does not tell us much alone, but we can compare models:
# learn about input parameters via help
model_ubcf <- Recommender(getData(eval_scheme, "train"), "UBCF",
param=list(normalize = "center", method="Cosine", nn=50))
Note that we use the same scheme
prediction_ubcf <- predict(model_ubcf, getData(eval_scheme, "known"), type="ratings")
rmse_ubcf <- calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown"))[1]
rmse_ubcf
## RMSE
## 0.9682466
rbind(calcPredictionAccuracy(prediction_popular, getData(eval_scheme, "unknown")),
calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown")))
## RMSE MSE MAE
## [1,] 0.9350820 0.8743784 0.7000101
## [2,] 0.9682466 0.9375014 0.7423649
Task 7. Which model is better? Why? Concentrate on discussing the code and the results. Compare the results for different seeds. Are these results consistent?