Task 1. Suggest other things to investigate to get to know the data. Add your R code.

You have probably noticed that the data include timestamps. In case the same user rated the same movie several times, we aggregate the ratings by taking their mean:
```{r}
# aggregate to one rating per (user, movie) pair
dim(ratings)
ratings <- ratings %>%
  group_by(userId, movieId) %>%
  summarise(rating = mean(rating)) %>% # mean rating if the same user rated the same movie multiple times
  ungroup() # drop the grouping that summarise() leaves behind
dim(ratings)
```
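To see what the aggregation does, here is a minimal sketch on a made-up data frame (`toy_ratings` and its values are hypothetical, not taken from the data set). User 1 rated movie 10 twice, so the two rows collapse into one row holding the mean.

```{r}
library(dplyr)

# made-up ratings: user 1 rated movie 10 twice (4 and 5)
toy_ratings <- data.frame(
  userId  = c(1, 1, 1, 2),
  movieId = c(10, 10, 20, 10),
  rating  = c(4, 5, 3, 2)
)

toy_agg <- toy_ratings %>%
  group_by(userId, movieId) %>%
  summarise(rating = mean(rating)) %>% # one row per (user, movie) pair
  ungroup()

nrow(toy_ratings) # rows before aggregation
nrow(toy_agg)     # rows after aggregation
toy_agg
```

Comparing the number of rows before and after, as `dim(ratings)` does above, shows whether any pair was rated more than once.
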
Task 2. Did any user rate the same movie more than once? How many users did so?

## Genre of the movie
Next, let's first try to find movies that are similar based on their genres. As it turns out, this is not as straightforward as it seems :(. This is one *possible* solution, and not the most elegant one. You are free to use your own logic here.
```{r}
splitting_genres <- strsplit(movies$genres, "|", fixed=TRUE) # split the genres string of each movie by '|'
genres <- unique(unlist(splitting_genres)) # collect all possible genres into one vector
# create a matrix of zeros, where rows are movies and columns are genres
movie_genres_dummy <- as.data.frame(matrix(0, ncol=length(genres), nrow=nrow(movies)))
colnames(movie_genres_dummy) <- genres # assign names of genres to columns
# fill in 1 if the genre is present for this movie, and leave 0 otherwise
# warning: this may take a while
for (movie_row in seq_along(splitting_genres)) { # movie_row is a row index, not a movieId
  movie_genres_dummy[movie_row, splitting_genres[[movie_row]]] <- 1
}
head(movie_genres_dummy)
```
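The loop fills the data frame one row at a time, which is why it can be slow. As an alternative, here is a hedged sketch of a vectorised version on three made-up movies (`toy_movies` is hypothetical). It builds the same kind of 0/1 table with `vapply` and `%in%`.

```{r}
# made-up movies, using the same "|"-separated genre format
toy_movies <- data.frame(
  title  = c("Movie A", "Movie B", "Movie C"),
  genres = c("Comedy|Romance", "Action", "Action|Comedy"),
  stringsAsFactors = FALSE
)

toy_split  <- strsplit(toy_movies$genres, "|", fixed = TRUE)
toy_genres <- sort(unique(unlist(toy_split)))

# for each movie, flag which of the known genres it lists;
# vapply returns genres x movies, so transpose to movies x genres
toy_dummy <- t(vapply(toy_split,
                      function(g) as.numeric(toy_genres %in% g),
                      numeric(length(toy_genres))))
dimnames(toy_dummy) <- list(toy_movies$title, toy_genres)
toy_dummy
```

The same idea should carry over to `movies$genres`, but this sketch has not been timed on the full data.
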
Task 3. Explain the matrix `movie_genres_dummy`. What does it show?

We want to find similar movies. This can be used as follows: if a client watches a movie, we recommend movies with a similar set of genres. The most straightforward way is to calculate the distance (the inverse of similarity) between movies. Note that we have a large number of movies, and calculating the distance for the whole matrix is computationally very expensive.
```{r}
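# distance between movies (0 = identical genres, 1 = no genre in common)
dist(movie_genres_dummy[c(1, 2), ], method = 'binary') # movies 1 and 2
dist(movie_genres_dummy[c(1, 5), ], method = 'binary') # movies 1 and 5
dist(movie_genres_dummy[1:10, ], method = 'binary') # all pairs among the first 10 movies
min(dist(movie_genres_dummy[1:10, ], method = 'binary')) # smallest pairwise distance
# inspect the genres and titles of one pair of movies
movie_genres_dummy[3, ]
movie_genres_dummy[7, ]
movies[c(3, 7), ]$title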
```
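To make the `'binary'` distance concrete, here is a small worked example on two made-up genre vectors (`toy_pair` is hypothetical). Among the genres that at least one of the two movies has, the distance is the share that only one of them has, i.e. one minus the Jaccard similarity. The sketch calls `stats::dist` explicitly so that it does not depend on which packages are attached.

```{r}
# two made-up movies described by four genres
toy_pair <- rbind(
  movie_a = c(Action = 1, Comedy = 1, Drama = 0, Romance = 0),
  movie_b = c(Action = 1, Comedy = 0, Drama = 1, Romance = 0)
)

# distance as computed by R
d_builtin <- as.numeric(stats::dist(toy_pair, method = "binary"))

# the same by hand: 1 - |shared genres| / |genres present in either movie|
shared <- sum(toy_pair["movie_a", ] == 1 & toy_pair["movie_b", ] == 1) # Action
either <- sum(toy_pair["movie_a", ] == 1 | toy_pair["movie_b", ] == 1) # Action, Comedy, Drama
d_manual <- 1 - shared / either

c(builtin = d_builtin, manual = d_manual) # both equal 2/3

# why the full matrix is expensive: the number of pairs grows quadratically
choose(10000, 2) # pairs among a hypothetical 10,000 movies
```
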
Task 4. Try different examples and find the titles of similar movies. Do the results make sense?

## Ratings of the movie
Next, we want to take ratings into account. For this task we will take advantage of `recommenderlab`, which requires a matrix as input.
```{r}
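ratings_spread <- spread(ungroup(ratings), key=movieId, value=rating) # columns - movies, rows - users
rating_matrix <- as.matrix(ratings_spread[,-1]) # exclude column with user ids
# take the labels from ratings_spread itself, so that they match the actual row and column order
# (unique(ratings$movieId) is in order of first appearance, which need not match the column order)
dimnames(rating_matrix) <- list(paste("u", ratings_spread$userId, sep=""),
                                paste("m", colnames(ratings_spread)[-1], sep=""))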
```
Next, we create an object suitable as input for the package:
```{r}
rating_matrix_lab <- as(rating_matrix, "realRatingMatrix")
```
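As a small illustration of this conversion, the sketch below builds a `realRatingMatrix` from a made-up 3 x 3 matrix (`toy_rm` is hypothetical). `NA` marks a missing rating, and only the observed ratings are stored.

```{r}
library(recommenderlab)

# made-up ratings: rows are users, columns are movies, NA = not rated
toy_rm <- matrix(c( 5, NA,  3,
                   NA,  4, NA,
                    2, NA, NA),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("u", 1:3), paste0("m", 1:3)))

toy_rm_lab <- as(toy_rm, "realRatingMatrix")

nratings(toy_rm_lab)           # number of observed ratings
rowCounts(toy_rm_lab)          # observed ratings per user
as(toy_rm_lab, "data.frame")   # back to long format: user, item, rating
```
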
```{r, results='hide'}
getRatingMatrix(rating_matrix_lab)
```
```{r}
# can also be converted to a list
#as(rating_matrix_lab, "list")
image(rating_matrix_lab) # too big to see
```
```{r}
# subset of the data
image(rating_matrix_lab[1:20,1:20])
```
It is recommended to take a brief look at this tutorial: https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf

There are many different recommender methods. The following function lists them:
```{r}
recommenderRegistry$get_entry_names()
```
```{r}
recommenderRegistry$get_entry("POPULAR", dataType="realRatingMatrix")
recommenderRegistry$get_entry("IBCF", dataType="realRatingMatrix")
```
Task 5. Take a look at the other methods. Discuss in class which ones would make sense for this movie recommender. Why?

Let's try our first model, which makes recommendations based on popularity. The method computes the average rating for each item from the available ratings and predicts each unknown rating as the average for that item.
```{r}
model <- Recommender(rating_matrix_lab, method = "POPULAR")
recom <- predict(model, rating_matrix_lab[1:4], n=10)
as(recom, "list")
```
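The idea of predicting an unknown rating by the item average can be sketched by hand in base R on a made-up matrix (`toy_pop` is hypothetical). This is a simplified illustration of the description above, not `recommenderlab`'s implementation, which may additionally normalize the ratings per user.

```{r}
# made-up ratings: rows are users, columns are movies, NA = not rated
toy_pop <- matrix(c( 5, NA,  3,
                     4,  2, NA,
                    NA,  4,  1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("u", 1:3), paste0("m", 1:3)))

# average rating of each movie over the users who rated it
item_means <- colMeans(toy_pop, na.rm = TRUE)
item_means

# fill every missing rating with the mean of its column (movie)
toy_pop_pred <- toy_pop
is_missing <- is.na(toy_pop)
toy_pop_pred[is_missing] <- item_means[col(toy_pop)[is_missing]]
toy_pop_pred
```
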
Task 6. Find the titles and genres of the movies that are recommended.

We can also predict ratings:
```{r}
prediction <- predict(model, rating_matrix_lab[1:5], type="ratings")
as(prediction, "matrix")[,1:5]
```
## Evaluation of the recommender system
```{r}
set.seed(5864)
eval_scheme <- evaluationScheme(rating_matrix_lab, method="split", train=0.8, given=-5)
# 20% of the users form the test set; given=-5 ("all but 5") withholds 5 ratings per test user for testing
model_popular <- Recommender(getData(eval_scheme, "train"), "POPULAR")
prediction_popular <- predict(model_popular, getData(eval_scheme, "known"), type="ratings")
# visually check the predictions for 50 users and 50 movies
image(prediction_popular[1:50,1:50])
rmse_popular <- calcPredictionAccuracy(prediction_popular, getData(eval_scheme, "unknown"))[1]
rmse_popular
```
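What `given=-5` means for a single test user can be sketched roughly as follows. The item names are made up, and `recommenderlab` performs the actual split internally, so this only illustrates the idea behind the "known" and "unknown" parts.

```{r}
set.seed(1)

# made-up list of the 12 movies that one test user has rated
toy_rated <- paste0("m", 1:12)

# withhold 5 randomly chosen ratings; these play the role of "unknown"
toy_unknown <- sample(toy_rated, 5)
# everything else is shown to the recommender; this plays the role of "known"
toy_known <- setdiff(toy_rated, toy_unknown)

length(toy_known)   # all but 5
length(toy_unknown) # 5
```

The model predicts ratings from the "known" part, and the predictions are then compared with the withheld "unknown" ratings.
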
As you may recall, RMSE (root mean square error) was also used in regression problems. On its own it does not tell us much, but we can use it to **compare** models:
```{r}
# learn about the input parameters via help
model_ubcf <- Recommender(getData(eval_scheme, "train"), "UBCF",
                          param=list(normalize = "center", method="Cosine", nn=50))
```
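The parameters above ask for centered ratings and cosine similarity between users. The sketch below computes this by hand for two made-up users (`toy_u1`, `toy_u2`), using only the movies both have rated. It is meant to convey the idea. `recommenderlab` may treat missing values differently, so the numbers need not match its internals.

```{r}
# made-up ratings of two users on four movies, NA = not rated
toy_u1 <- c(m1 = 5, m2 = 3,  m3 = NA, m4 = 1)
toy_u2 <- c(m1 = 4, m2 = NA, m3 = 2,  m4 = 2)

# movies rated by both users
co_rated <- !is.na(toy_u1) & !is.na(toy_u2)

# center each user's ratings by that user's own mean rating
c1 <- toy_u1[co_rated] - mean(toy_u1, na.rm = TRUE)
c2 <- toy_u2[co_rated] - mean(toy_u2, na.rm = TRUE)

# cosine similarity: dot product divided by the product of the vector lengths
toy_cos <- sum(c1 * c2) / (sqrt(sum(c1^2)) * sqrt(sum(c2^2)))
toy_cos
```

In user-based collaborative filtering, the `nn` most similar users are then used to predict the ratings of the target user.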
Note that we use the same evaluation scheme:
```{r}
prediction_ubcf <- predict(model_ubcf, getData(eval_scheme, "known"), type="ratings")
```
```{r}
rmse_ubcf <- calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown"))[1]
rmse_ubcf
rbind(calcPredictionAccuracy(prediction_popular, getData(eval_scheme, "unknown")),
calcPredictionAccuracy(prediction_ubcf, getData(eval_scheme, "unknown")))
```
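The accuracy measures in the table above can be reproduced by hand. Here is a minimal sketch on made-up numbers (`toy_actual` and `toy_predicted` are hypothetical), which shows how RMSE, MSE and MAE are computed from the prediction errors.

```{r}
# made-up withheld ratings and the corresponding predictions
toy_actual    <- c(4, 3, 5, 2)
toy_predicted <- c(3.5, 3, 4, 3)

toy_errors <- toy_predicted - toy_actual

toy_mse  <- mean(toy_errors^2)    # mean squared error
toy_rmse <- sqrt(toy_mse)         # root mean squared error
toy_mae  <- mean(abs(toy_errors)) # mean absolute error

c(RMSE = toy_rmse, MSE = toy_mse, MAE = toy_mae)
```

All three measures are errors on the rating scale (or its square), so smaller values mean predictions closer to the withheld ratings.
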
Task 7. Which model is better? Why? Concentrate on discussing the code and the results. Compare the results for different seeds. Are these results consistent?