---
title: "Business Data Analytics. Practice Session"
subtitle: Customer segmentation
author: University of Tartu
output:
  prettydoc::html_pretty: null
  highlight: github
  html_document: default
  html_notebook: default
  github_document: default
  theme: cayman
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```

```{r setup, echo=FALSE}
library(knitr)
```

Today's practice session is about customer segmentation, which has become an essential part of marketing. We will start with the RFM model and the heuristic approach, and later demonstrate automatic segment discovery via k-means and hierarchical clustering.

## Packages

For this practice session you will need the following packages:

```{r}
library(dplyr)
library(ggplot2)
library(gridExtra)
library(data.table)
library(tm) # package for text mining
library(wordcloud) # for word visualization
library(ape) # dendrogram plotting
library(ggdendro) # dendrogram plotting
```

If some packages cannot be found, please install them with the following command: `install.packages("name-of-the-package")`

## RFM

Let's first load a dataset that contains information about clients and their orders, along with the order details:

```{r read_table, cache=TRUE}
orders <- read.table(file.choose(), header=TRUE, sep=',') # orders_rfm.csv
```

The next logical step is to investigate what kind of data you have.

```{r}
View(orders)
```

Let's perform **RFM analysis**, which means calculating three key measures:

1. **R** - recency score
2. **F** - frequency score
3. **M** - monetary score

In our case, recency is expressed as the number of days since the last order, frequency is defined as the number of purchased items, and the monetary score is the total amount spent during the defined period. There are many variations of these definitions.
For example, when necessary, you may want to aggregate the recency component on a yearly basis rather than using days; frequency and monetary scores can be expressed as the percentage of one period relative to another, etc. Moreover, it is important to define the **period** under investigation.

First, let's pick the date of interest:

```{r reporting_date}
reporting_date <- as.Date('2017-03-10', format='%Y-%m-%d')
reporting_date
```

Next, we have to change the type of the date column in our table from factor (or character, depending on your R version) to date:

```{r}
str(orders)
orders$order_date <- as.Date(orders$order_date, format='%Y-%m-%d')
str(orders)
```

For more details about date formats, refer to: https://www.statmethods.net/input/dates.html

Since we are interested only in orders placed on or before the reporting date, we filter out the later ones:

```{r}
orders <- filter(orders, order_date <= reporting_date)
```

As we discussed previously, the descriptive part helps to get a sense of the data:

```{r}
length(unique(orders$client_id)) # Number of distinct client ids
table(orders$product) # Times each product was bought
```
Note. Some code chunks start with a comment naming a package (for example `#dplyr`), so that you can follow where the functions originate from.
We will calculate the frequency, recency and monetary values in the following way:

```{r}
#dplyr
frm_tbl_initial <- orders %>%
  group_by(client_id) %>%
  summarise(order_frequency = n(), # number of purchased items (rows) per client
            order_recency = min(reporting_date - order_date), # days since last order
            order_monetary = sum(money_spent)) # total amount spent
head(frm_tbl_initial)
```

Order recency is a `difftime` object (measured in days), which can cause errors later. We need to transform it into a numeric value:

```{r}
class(frm_tbl_initial$order_recency) # checks type of the variable
frm_tbl_initial$order_recency <- as.numeric(frm_tbl_initial$order_recency)
class(frm_tbl_initial$order_recency)
```

Next, we investigate the distribution of the values in our RFM calculations:

```{r}
#ggplot2
ggplot(frm_tbl_initial, aes(x=order_recency)) +
  geom_histogram(fill='#8b3840', color='grey60', binwidth = 1) +
  theme_bw()

ggplot(frm_tbl_initial, aes(x=order_frequency)) +
  geom_histogram(fill='#8b3840', color='grey60', binwidth = 1) +
  theme_bw()
```
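To see what the recency conversion does in isolation, here is a minimal sketch on made-up dates (the three order dates are hypothetical and not taken from `orders_rfm.csv`). It only illustrates that subtracting dates yields a `difftime` object and that `as.numeric()` turns it into a plain number of days.

```r
# Hypothetical order dates for a single client (illustration only)
reporting_date <- as.Date("2017-03-10")
order_dates <- as.Date(c("2017-03-01", "2017-02-10", "2017-03-09"))

# Subtracting Date objects gives a difftime measured in days
days_since_order <- reporting_date - order_dates
class(days_since_order)

# The smallest difference corresponds to the most recent order
recency <- min(days_since_order)

# Convert to a plain number; naming the units makes the intent explicit
recency_days <- as.numeric(recency, units = "days")
class(recency_days)
recency_days
```

Stating `units = "days"` is optional here, since date differences are already stored in days, but it guards against surprises if the same code is later applied to date-time values.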
Task: Write the code for a histogram of the monetary values.
```{r echo=FALSE}
ggplot(frm_tbl_initial, aes(x=order_monetary)) +
  geom_histogram(fill='#8b3840', color='grey60') +
  theme_bw()
```

Now we need to define the limits of our RFM values and divide these intervals into bins. There are, again, many ways to proceed. We will use our **domain knowledge** and define the cut points manually. Alternatively, bins can be defined using quantiles, intervals of equal width, or intervals that each hold an equal number of observations.
Note. ```cut_interval``` makes n groups with equal range, ```cut_number``` makes n groups with (approximately) equal numbers of observations. The ```cut``` function lets you specify the cut points manually.
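The difference between the three binning functions is easiest to see on a small vector. The sketch below uses eight made-up monetary values, and the manual break points (20 and 100) and the labels are arbitrary assumptions chosen for illustration, not the cut points used in this session.

```r
library(ggplot2) # provides cut_interval() and cut_number()

# Made-up monetary values for eight hypothetical clients
money <- c(5, 12, 18, 25, 40, 75, 150, 400)

# Three bins of equal width across the observed range (5 to 400)
by_range <- cut_interval(money, n = 3)

# Four bins holding (approximately) the same number of observations
by_count <- cut_number(money, n = 4)

# Manually chosen cut points; the breaks and labels are illustrative only
by_hand <- cut(money,
               breaks = c(0, 20, 100, Inf),
               labels = c("low", "medium", "high"))

# Compare how the same values are distributed across the bins
table(by_range)
table(by_count)
table(by_hand)
```

With skewed data such as spending, equal-width bins tend to put most clients into the first bin, while equal-count bins spread them evenly; manual cut points let you align the bins with thresholds that are meaningful for the business.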