---
title: "Business Data Analytics. Practice Session"
subtitle: Recommender systems
author: "University of Tartu"
output:
  prettydoc::html_pretty:
    highlight: github
    theme: cayman
  html_document: default
  html_notebook: default
  github_document: default
editor_options:
  chunk_output_type: console
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
```{r setup, echo=FALSE}
library(knitr)
```
## Introduction
Recommender systems are one of the most lucrative applications of machine learning. Techniques such as cross-selling and up-selling are widely used by e-commerce companies regardless of their products. This is a large subfield in itself, and today we will only scratch the tip of the iceberg.
Our movie dataset is taken from:
https://grouplens.org/datasets/movielens/.
Install the package ```recommenderlab``` that we use for the analysis.
## Libraries
```{r message=FALSE}
library(data.table)
library(tidyr)
library(dplyr)
library(ggplot2)
library(anytime) #install.packages("anytime")
library(recommenderlab) #install.packages("recommenderlab")
library(stringr) #install.packages("stringr")
```
## Loading the data
```{r, eval=FALSE}
movies <- # movies.csv
ratings <- # ratings.csv
```
```{r, include=FALSE}
movies <- fread(file.choose()) # movies.csv
ratings <- fread(file.choose()) # ratings.csv
```
Again, the first step is simple descriptive statistics and visualizations.
First, a look at the data:
```{r}
View(movies)
```
```{r}
View(ratings)
```
Let's look at how many unique movies and users we have:
```{r}
length(unique(movies$movieId))
length(unique(ratings$userId))
```
And the number of movies that were actually rated:
```{r}
length(unique(ratings$movieId))
```
As you can see, not all movies were rated.
Let's plot the ratings to see their distribution:
```{r, eval=FALSE}
# plot the distribution of ratings
```
```{r, include=FALSE}
ggplot(ratings, aes(x = rating)) +
  geom_bar(stat = 'count', fill = "#7ba367") +
  theme_bw()
```
Films are mostly rated 3, 4, or 5.
Now, let's look at the average rating that each user gives:
```{r}
ratings %>%
  group_by(userId) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE)) %>%
  ggplot(aes(x = avg_rating)) +
  geom_histogram(fill = "#7ba367", color = 'white', bins = 30) +
  theme_bw() +
  scale_x_continuous("average rating per user")
```
We can also check the average rating for each movie:
```{r, eval=FALSE}
ratings %>%
```
```{r, include=FALSE}
ratings %>%
  group_by(movieId) %>%
  summarise(avg_rating = mean(rating, na.rm = TRUE)) %>%
  ggplot(aes(x = avg_rating)) +
  geom_histogram(fill = "#7ba367", color = 'white', bins = 30) +
  theme_bw() +
  scale_x_continuous("average rating per movie")
```
You probably noticed that the data contains timestamps. They represent the exact time when a user rated a movie.
Let's convert the timestamps to dates for the first 10 movies and see how their ratings changed over time.
```{r}
#anytime
ratings <- ratings %>%
  mutate(date = as.Date(anytime(timestamp)))
```
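As a sanity check, the same conversion can be done in base R alone: MovieLens timestamps are Unix epoch seconds. The timestamp below is just an assumed example value.

```{r}
# Base-R alternative to anytime(): treat the timestamp as seconds since 1970-01-01
ts <- 964982703  # an example timestamp (assumed value)
as.Date(as.POSIXct(ts, origin = "1970-01-01", tz = "UTC"))
```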
Next we select the first 10 movies:
```{r}
fst_ten_movies <- movies[1:10,]
rating_fst_ten_movies <- ratings %>%
  filter(movieId %in% fst_ten_movies$movieId)
```
Now we have to decide at what granularity we want to look at the data. Should it be just years, or should we also include months?
Let's see the minimum and maximum dates:
```{r}
min(rating_fst_ten_movies$date)
max(rating_fst_ten_movies$date)
```
The period is almost 10 years. Let's build a plot which will show us how the rating of the movies changed during these 10 years.
First we have to calculate average rating for each year.
```{r, eval=FALSE}
fst_ten <- left_join(rating_fst_ten_movies, fst_ten_movies, by = "movieId") %>%
  mutate(year = format(date, "%Y"))
fst_ten <- fst_ten %>%
  # calculate cumulative rating, number of rows and average rating
```
```{r, include=FALSE}
fst_ten <- left_join(rating_fst_ten_movies, fst_ten_movies, by = "movieId") %>%
  mutate(year = format(date, "%Y"))
fst_ten <- fst_ten %>%
  group_by(movieId, year) %>%
  arrange(year) %>%
  mutate(cumulativeRating = cumsum(rating),
         n = row_number(),
         avRating = cumulativeRating / n) %>%
  summarise(avRating = last(avRating))
```
```{r}
fst_ten %>%
  ggplot(aes(x = year, y = avRating, group = movieId, color = as.factor(movieId))) +
  geom_line()
```
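The cumulative average computed above is just a running sum divided by a running count; in miniature, on a toy rating sequence:

```{r}
# Running average: cumulative sum of ratings divided by how many have arrived so far
r <- c(4, 5, 3, 4)
cum_avg <- cumsum(r) / seq_along(r)
cum_avg
```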
Now, let's replace ratings for the same user and movie with an average rating:
```{r, eval=FALSE}
dim(ratings)
ratings <- ratings %>%
# calculate mean rating
dim(ratings)
ratings <- ratings
```
```{r, include=FALSE}
dim(ratings)
ratings <- ratings %>%
  group_by(userId, movieId) %>%
  summarise(rating = mean(rating))
dim(ratings)
ratings <- ratings
```
As we won't need the timestamps later, we simply dropped them: the `summarise` call keeps only the grouping columns and the new rating.
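The deduplication step above can be sketched in base R as well, on a toy table with hypothetical ids:

```{r}
# Average duplicate (user, movie) ratings: one row remains per combination
dup <- data.frame(userId = c(1, 1, 2), movieId = c(7, 7, 7), rating = c(3, 5, 4))
agg <- aggregate(rating ~ userId + movieId, data = dup, FUN = mean)
agg
```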
## Genre of the movie
Next, let's first try to find movies that are similar based on their genres. As our genres are stored in a format like "genre1|genre2", we will have to separate them.
```{r}
splitting_genres = strsplit(movies$genres, "|", fixed=TRUE) # split vector genres by '|'
genres <- unique(unlist(splitting_genres)) # collect all possible genres into one vector
```
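To see what `strsplit` produces here, a toy example with two assumed genre strings:

```{r}
g <- c("Adventure|Animation|Children", "Comedy|Romance")
split_g <- strsplit(g, "|", fixed = TRUE)  # one character vector per movie
split_g
unique(unlist(split_g))                    # all distinct genres across movies
```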
Now that we have extracted the genres, we will create a matrix of zeros, where rows are movies and columns are genres.
```{r}
movie_genres_dummy <- as.data.frame(matrix(0, ncol=length(genres), nrow=nrow(movies)))
colnames(movie_genres_dummy) <- genres
```
We fill the matrix by putting a 1 if the genre is present for a given movie, leaving 0 otherwise.
```{r}
# may take some time
for (movie_id in seq_along(splitting_genres)) {
  movie_genres_dummy[movie_id, splitting_genres[[movie_id]]] <- 1
}
head(movie_genres_dummy)
```
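The same one-hot encoding in miniature, for two hypothetical movies:

```{r}
toy_genres <- list(c("Comedy", "Drama"), "Drama")
all_g <- unique(unlist(toy_genres))
dummy <- matrix(0, nrow = length(toy_genres), ncol = length(all_g),
                dimnames = list(NULL, all_g))
for (i in seq_along(toy_genres)) dummy[i, toy_genres[[i]]] <- 1
dummy
```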
We want to find similar movies. For instance, if a client watches a movie, we can recommend movies with similar sets of genres. The most straightforward way is to calculate the distance (the inverse of similarity). Note that we have a large number of movies, and calculating the distance for the whole matrix is computationally very expensive.
```{r}
# distance for 1st and 2nd movie
head(movie_genres_dummy,2)
dist(rbind(movie_genres_dummy[1,], movie_genres_dummy[2,]), method = 'binary')
```
```{r}
rbind(movie_genres_dummy[1,], movie_genres_dummy[5,])
dist(rbind(movie_genres_dummy[1,], movie_genres_dummy[5,]), method = 'binary')
```
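What `method = 'binary'` computes can be checked by hand on two toy 0/1 vectors: the distance is the number of positions where exactly one vector has a 1, divided by the number of positions where at least one does (the asymmetric binary, i.e. Jaccard, distance).

```{r}
a <- c(1, 1, 0, 1, 0)
b <- c(1, 0, 0, 1, 1)
manual  <- sum(xor(a, b)) / sum(a | b)               # mismatching 1s / positions with any 1
builtin <- as.numeric(dist(rbind(a, b), method = "binary"))
c(manual = manual, builtin = builtin)
```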
We can build a distance matrix for the first 10 movies:
```{r}
mx_d <- dist(movie_genres_dummy[1:10,], method = 'binary')
mx_d
```
Let's find the smallest distance among them:
```{r}
min(dist(movie_genres_dummy[1:10,], method = 'binary'))
```
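`min()` only returns the value. To find which pair attains it, one option (sketched here on a toy matrix rather than the full one) is to convert the `dist` object to a square matrix and use `which()` with `arr.ind`:

```{r}
m <- as.matrix(dist(rbind(c(1, 0, 1), c(1, 1, 1), c(0, 0, 1)), method = "binary"))
diag(m) <- NA                                   # ignore zero self-distances
idx <- which(m == min(m, na.rm = TRUE), arr.ind = TRUE)
idx                                             # each minimal pair appears twice (symmetric matrix)
```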
And examine movies with the smallest distance:
```{r}
rbind(movie_genres_dummy[3,], movie_genres_dummy[7,]) %>%
  cbind(movies[c(3,7),]$title)
```
## Ratings of the movie
Next, in order to recommend something that a user would watch, we want to take ratings into account. For that task we will take advantage of ```recommenderlab```, which requires a matrix as input.
```{r, eval=FALSE}
ratings_spread <- # columns - movies, rows-users
```
```{r, include=FALSE}
ratings_spread <- spread(ratings, key=movieId, value=rating) # columns - movies, rows-users
```
```{r}
rating_matrix <- as.matrix(ratings_spread[,-1]) # exclude the column with user ids
dimnames(rating_matrix) <- list(paste0("u", ratings_spread$userId),        # row order follows ratings_spread
                                paste0("m", colnames(ratings_spread)[-1])) # column order follows spread's columns
```
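The same long-to-wide reshape can be illustrated in base R with `tapply()` on a toy ratings table (hypothetical ids):

```{r}
toy  <- data.frame(userId = c(1, 1, 2), movieId = c(10, 20, 10), rating = c(4, 3, 5))
wide <- with(toy, tapply(rating, list(userId, movieId), mean))
wide   # rows = users, columns = movies, NA where no rating exists
```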
Next, we create an object suitable as input for the package:
```{r}
rating_matrix_lab <- as(rating_matrix, "realRatingMatrix")
```
```{r, results='hide'}
getRatingMatrix(rating_matrix_lab)
```
```{r}
# can be translated to the list
#as(rating_matrix_lab, "list")
image(rating_matrix_lab) # too big to see
```
```{r}
# subset of the data
image(rating_matrix_lab[1:20,1:20])
```
It is recommended to briefly take a look at this tutorial: https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
There are many different recommender algorithms; they can be listed with:
```{r}
recommenderRegistry$get_entry_names()
```
```{r}
recommenderRegistry$get_entry("POPULAR", dataType="realRatingMatrix")
recommenderRegistry$get_entry("IBCF", dataType="realRatingMatrix")
```
Let's try our first model: recommendation based on popularity. This method computes the average rating for each item from the available ratings and predicts each unknown rating as the average for that item.
```{r}
model <- Recommender(rating_matrix_lab, method = "POPULAR")
recom <- predict(model, rating_matrix_lab[1:4], n=10)
as(recom, "list")
```
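The idea behind POPULAR can be shown in miniature: predict every missing rating with the item's average. This toy sketch skips the per-user normalization that recommenderlab applies on top.

```{r}
R <- matrix(c(5, NA, 3,
              4,  2, NA), nrow = 2, byrow = TRUE)   # 2 users x 3 items, NA = unrated
item_means <- colMeans(R, na.rm = TRUE)             # average rating per item
filled <- R
for (j in seq_len(ncol(R))) filled[is.na(filled[, j]), j] <- item_means[j]
filled
```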
We can also predict ratings:
```{r}
prediction <- predict(model, rating_matrix_lab[1:5], type="ratings")
as(prediction, "matrix")[,1:5]
```
## Evaluation of the recommender system
We have to create an evaluation scheme.
```{r}
set.seed(5864)
eval_scheme <- evaluationScheme(rating_matrix_lab, method="split", train=0.8, given=-5)
#5 ratings of 20% of users (per user) are excluded for testing
model_popular <- Recommender(getData(eval_scheme, "train"), "POPULAR")
prediction_popular <- predict(model_popular, getData(eval_scheme, "known"), type="ratings")
```
We can visually inspect the predictions for 50 users and 50 movies:
```{r}
image(prediction_popular[1:50,1:50])
```
Now we can obtain statistical measures:
```{r}
rmse_popular <- calcPredictionAccuracy(prediction_popular,
                                       getData(eval_scheme, "unknown"))[1]
rmse_popular
```
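The reported number is simply the root of the mean squared difference between predicted and held-out ratings; by hand, on toy values:

```{r}
pred   <- c(3.5, 4.0, 2.0)   # predicted ratings (toy values)
actual <- c(4.0, 4.0, 3.0)   # held-out true ratings (toy values)
rmse <- sqrt(mean((pred - actual)^2))
rmse
```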
If you recall, RMSE (root mean squared error) was also used in regression problems. On its own it does not tell us much, but we can **compare** models:
```{r}
model_ubcf <- Recommender(getData(eval_scheme, "train"),
                          "UBCF",
                          param = list(normalize = "center",
                                       method = "Cosine",
                                       nn = 50))
```
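The `method = "Cosine"` parameter means user similarity is measured by the cosine of the angle between their rating vectors; in miniature, for two toy users over the same three movies:

```{r}
u1 <- c(5, 3, 4)
u2 <- c(4, 2, 5)
cos_sim <- sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))  # dot product / product of norms
cos_sim
```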
Note that we use the same scheme.
```{r}
prediction_ubcf <- predict(model_ubcf, getData(eval_scheme, "known"), type="ratings")
```
```{r}
rmse_ubcf <- calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown"))[1]
rmse_ubcf
rbind(calcPredictionAccuracy(prediction_popular, getData(eval_scheme, "unknown")),
      calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown")))
```