Note that we want to train a **generalized model**: we train the model on customers for whom we know both the first and the following year, but we want to apply it to different customers, namely those for whom we know only the first year and nothing about the second (because it is still in the future). For example, we could apply this model to data from 2017 to predict 2018.

```{r}
# Collect yearly features per customer. You have already seen this; the only difference is the grouping.
dt_year <- dt_raw %>%
  mutate(year = as.integer(substr(transaction_date, start = 1, stop = 4))) %>%
  group_by(customer_id) %>%
  mutate(min_year = min(year),
         max_year = max(year),
         years_active = max_year - min_year) %>% # first and last active year per customer
  ungroup() %>%
  group_by(customer_id, year) %>%
  summarise(transaction_per_customer = n(), # number of transactions per customer and year
            amount_per_customer = sum(amount),
            amount_per_transaction = amount_per_customer / transaction_per_customer,
            min_year = first(min_year),
            years_active = first(years_active))
# summarise() drops only the last grouping level, so dt_year is still grouped by customer_id
dt_year
```
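The next chunk numbers each customer's years with `row_number()`, which only works per customer because `summarise()` removes just the last grouping level and leaves the result grouped by `customer_id`. A minimal sketch of that behaviour, assuming a small made-up table (`toy_raw` and `toy_year` are hypothetical names, not part of the data set):

```{r}
library(dplyr)

# hypothetical toy transactions: customer 1 buys in 2016 and 2017, customer 2 only in 2016
toy_raw <- data.frame(
  customer_id = c(1, 1, 1, 2),
  year = c(2016L, 2016L, 2017L, 2016L),
  amount = c(10, 20, 5, 7)
)

toy_year <- toy_raw %>%
  group_by(customer_id, year) %>%
  summarise(transaction_per_customer = n(),
            amount_per_customer = sum(amount))

group_vars(toy_year) # the grouping that is still in place after summarise()

# row_number() therefore restarts at 1 for every customer
toy_year %>% mutate(year_number = row_number())
```

If the result were ungrouped, `row_number()` would simply count all rows of the table, so it is worth checking `group_vars()` whenever this pattern is reused.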
```{r}
# keep only customers who were active in more than one calendar year
dt_prep <- filter(dt_year, years_active != 0) %>%
  mutate(year_number = row_number()) %>% # dt_year is still grouped by customer_id, so this numbers each customer's years 1, 2, ...
  filter(year_number %in% c(1, 2)) %>% # keep each customer's first and second active year
  select(customer_id, transaction_per_customer, amount_per_customer, amount_per_transaction, year_number) # features of interest
# tidyr's spread() cannot spread several value columns at once, so we use dcast() from data.table,
# which works like spread() but handles several columns simultaneously
dt_feat_table <- dcast(setDT(dt_prep), customer_id ~ year_number,
                       value.var = c('transaction_per_customer', 'amount_per_customer', 'amount_per_transaction'))
# discard all second-year features (we could not know them in advance),
# but keep our y: we want to predict next year's amount per customer
dt_feat_table <- select(dt_feat_table, -contains("_2"), 'amount_per_customer_2')
```
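To see what the `dcast()` call does to the column names, here is a small sketch on a made-up long table (`toy_long` and `toy_wide` are hypothetical names, and the numbers are invented for illustration):

```{r}
library(data.table)

# hypothetical long table: one row per customer and year number
toy_long <- data.table(
  customer_id = c(1, 1, 2, 2),
  year_number = c(1, 2, 1, 2),
  amount_per_customer = c(30, 5, 7, 12),
  transaction_per_customer = c(2, 1, 1, 3)
)

# every value.var column gets one output column per year_number
toy_wide <- dcast(toy_long, customer_id ~ year_number,
                  value.var = c("amount_per_customer", "transaction_per_customer"))
names(toy_wide)
toy_wide
```

Each new column is named after its value column plus the year number as a suffix, which is why second-year features can later be removed by matching `"_2"`.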
Let's join our preprocessed transaction data with the survey data:
```{r}
dt_feat_table <- left_join(dt_feat_table, customer_survey, by='customer_id')
dt_feat_table$gender <- as.factor(dt_feat_table$gender)
dt_feat_table$discount_proposed <- as.factor(dt_feat_table$discount_proposed)
# categorical features must be converted to factors; otherwise
# lm() would treat numerically coded categories as ordinary numbers
```
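Why the factor conversion matters can be seen from the design matrix that `lm()` builds internally. The sketch below uses a made-up three-level category stored as numeric codes (`toy_code` and `toy_factor` are hypothetical; how the survey columns are actually coded is not shown here):

```{r}
# hypothetical three-level category stored as numeric codes
toy_code <- c(1, 2, 3, 1, 2, 3)

# as a number: a single slope column, which treats the codes as evenly spaced quantities
colnames(model.matrix(~ toy_code))

# as a factor: one indicator column per level except the reference level
toy_factor <- as.factor(toy_code)
colnames(model.matrix(~ toy_factor))
```

With the factor version, each level gets its own coefficient relative to the reference level instead of being forced onto one straight line.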
Now we split our data into two sets: training and test. We **train the model** on the training set, but to **validate the model** we need data it has not seen during training, which is the test set:
```{r}
set.seed(385) # fix the seed for reproducibility
# a common choice is 80% of the rows for training and 20% for testing
train_idx <- sample(nrow(dt_feat_table), round(nrow(dt_feat_table) * 0.8), replace = FALSE)
train <- dt_feat_table[train_idx,]
test <- dt_feat_table[-train_idx,]
```
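A quick sanity check for this kind of split is that the two sets have the expected sizes and share no rows. A minimal sketch on a made-up ten-row table (`toy_data` and the other `toy_` names are hypothetical):

```{r}
set.seed(385)
toy_data <- data.frame(id = 1:10, y = (1:10) * 2)

toy_train_idx <- sample(nrow(toy_data), round(nrow(toy_data) * 0.8), replace = FALSE)
toy_train <- toy_data[toy_train_idx, ]
toy_test <- toy_data[-toy_train_idx, ]

nrow(toy_train) # 8 rows for training
nrow(toy_test)  # 2 rows for testing
length(intersect(toy_train$id, toy_test$id)) # 0: no row appears in both sets
```

Negative indexing with `-toy_train_idx` selects exactly the rows that were not sampled, so the two sets always partition the original table.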
Now we can use multiple variables to train our model:
```{r}
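# alternative: use all remaining columns as predictors
# model_1 <- lm(data = train[, -1], amount_per_customer_2 ~ .)
start <- Sys.time()
# train[, -1] drops customer_id, which is an identifier rather than a feature
model_1 <- lm(data = train[, -1],
              amount_per_customer_2 ~ amount_per_customer_1 + transaction_per_customer_1 +
                amount_per_transaction_1 + gender + age + discount_proposed + clicks_in_eshop)
lm3_time <- Sys.time() - start # time needed to fit the model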
summary(model_1)
```
```{r}
test$prediction_linear <- predict(model_1, newdata = test[, -1]) # predictions on the test set

# helper that reports several error measures for a vector of predictions
prediction_quality <- function(predictions, real_values) {
  mse <- mean((real_values - predictions)^2)
  rmse <- sqrt(mse)
  mae <- mean(abs(real_values - predictions))
  print(paste("mean squared error is ", mse))
  print(paste("root mean squared error is ", rmse))
  print(paste("mean absolute error is ", mae))
}
prediction_quality(test$prediction_linear, test$amount_per_customer_2)
print(lm3_time) # time taken to fit model_1
```
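For reference, the three error measures reported by `prediction_quality()` can be written as follows. The notation is introduced here for illustration: $y_i$ stands for `real_values`, $\hat{y}_i$ for `predictions`, and $n$ for the number of test customers.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left\lvert y_i - \hat{y}_i \right\rvert
$$

RMSE and MAE are expressed in the same units as `amount_per_customer_2`, which makes them easier to interpret than MSE. Because the errors are squared before averaging, RMSE weights large misses more heavily than MAE does.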