Today we will continue discussing customer lifecycle management. In particular we concentrate on the retention problem. Business data analytics can help you identify who is about to churn by training classification model.

Libraries

During the lab session we will need the following libraries:

library('tidyverse')
library('data.table')

library("party") # Decision Trees plotting

library("rpart") # Another Decision Trees library
library("rpart.plot") # Decision Trees plotting


library("randomForest") # Random Forest

Step 1: Data loading

Let’s load the data:

dt <- read.csv(file.choose()) # churn.csv
View(dt)

And check the data type of each column

sapply(dt, class)
##       customerID           gender    SeniorCitizen          Partner 
##         "factor"         "factor"        "integer"         "factor" 
##       Dependents           tenure     PhoneService    MultipleLines 
##         "factor"        "integer"         "factor"         "factor" 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##         "factor"         "factor"         "factor"         "factor" 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##         "factor"         "factor"         "factor"         "factor" 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##         "factor"         "factor"        "numeric"        "numeric" 
##            Churn 
##         "factor"

Step 2: Data cleaning and transformation

Next step is to clean the data. It is necessary to take care of unknown values before performing the prediction task since we cant take into the account instances where values are missing.

Let’s check the columns with NA values

sapply(dt, function(x) sum(is.na(x)))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0

Remove the rows with NA:

dt <- <YOUR CODE>

Next we check the unique calues for each column

func <- function(x) {
  if(class(x) == "factor")
  {
    unique(x) 
  } 
  else
  {
    paste(min(x),"-", max(x))
  }
}
sapply(dt[,-1], func)
## $gender
## [1] Female Male  
## Levels: Female Male
## 
## $SeniorCitizen
## [1] "0 - 1"
## 
## $Partner
## [1] Yes No 
## Levels: No Yes
## 
## $Dependents
## [1] No  Yes
## Levels: No Yes
## 
## $tenure
## [1] "1 - 72"
## 
## $PhoneService
## [1] No  Yes
## Levels: No Yes
## 
## $MultipleLines
## [1] No phone service No               Yes             
## Levels: No No phone service Yes
## 
## $InternetService
## [1] DSL         Fiber optic No         
## Levels: DSL Fiber optic No
## 
## $OnlineSecurity
## [1] No                  Yes                 No internet service
## Levels: No No internet service Yes
## 
## $OnlineBackup
## [1] Yes                 No                  No internet service
## Levels: No No internet service Yes
## 
## $DeviceProtection
## [1] No                  Yes                 No internet service
## Levels: No No internet service Yes
## 
## $TechSupport
## [1] No                  Yes                 No internet service
## Levels: No No internet service Yes
## 
## $StreamingTV
## [1] No                  Yes                 No internet service
## Levels: No No internet service Yes
## 
## $StreamingMovies
## [1] No                  Yes                 No internet service
## Levels: No No internet service Yes
## 
## $Contract
## [1] Month-to-month One year       Two year      
## Levels: Month-to-month One year Two year
## 
## $PaperlessBilling
## [1] Yes No 
## Levels: No Yes
## 
## $PaymentMethod
## [1] Electronic check          Mailed check             
## [3] Bank transfer (automatic) Credit card (automatic)  
## 4 Levels: Bank transfer (automatic) ... Mailed check
## 
## $MonthlyCharges
## [1] "18.25 - 118.75"
## 
## $TotalCharges
## [1] "18.8 - 8684.8"
## 
## $Churn
## [1] No  Yes
## Levels: No Yes

Let us also remove the “customerId” variable as these values are all unique and won’t contribute in the model.

<YOUR CODE>

Next, we need to change “No internet service” to “No” for the features “OnlineSecurity”, #“OnlineBackup”, “DeviceProtection”, “TechSupport”, “streamingTV”, “streamingMovies” since “No internet service” and “No” bear the same meaning.

for(i in c(10:15)) {
  print(names(dt[i]))
  dt[i][dt[i]=="No internet service"] <- "No"
  dt[,i] <- factor(dt[,i])
}
## [1] "OnlineBackup"
## [1] "DeviceProtection"
## [1] "TechSupport"
## [1] "StreamingTV"
## [1] "StreamingMovies"
## [1] "Contract"

And we should also change “No phone service” to “No” for MultipleLines

<YOUR CODE>

As a next step in data preparation because some models will not accept numeric values, we need to change them into Factors. So for columns (or variables) which have a few (or small number of) values we will transform those columns (having integer values) into factors. For those columns which have higher number of unique values, we will transform them into bins.

unique(dt$SeniorCitizen)
## [1] 0 1

Just two values. We can turn this column to factor:

dt$SeniorCitizen <- <YOUR CODE>

Next column at which we should look is “tenure”. Lets plot it on the graph instead of just looking on the ranges:

dt %>%
  <YOUR CODE>

Above we can see the distribution and range of the values however, we will use summary command to select intervals which would be more representative (discussed in Lab 3 also)

(Refer to Lab 3 for help)

Let’s create groups based on next values:

summary(dt$tenure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   29.00   32.42   55.00   72.00
dt <- mutate(dt, tenure = <YOUR CODE>)
unique(dt$tenure)

Now only two columns with type “numeric” left. We can either split the data into groups the save way we did with “tenure” or just ignore them while creating the model. However, ignoring features may decreese quality of the model. That is why we will split thouse columns as well.

But befor we start, lets plot the values for features “MonthlyCharges” and “TotalCharges”:

dt %>%
  <YOUR CODE>

dt %>%
  <YOUR CODE>

The plots looks similar. Let’s check correlation between those features:

cor(dt$MonthlyCharges, dt$TotalCharges)
## [1] 0.6510648

So, by looking at correlation we can say that those features are dependent on each other. This gives us possability to keep only one of them, as having both of the features won’t increase quality of our model that much.

dt$TotalCharges <- NULL

Now let’s split “MonthlyCharges”:

summary(dt$MonthlyCharges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.25   35.59   70.35   64.80   89.86  118.75
dt <- mutate(dt, MonthlyCharges = <YOUR CODE>)
unique(dt$MonthlyCharges)

Now that we prepared the data, let’s look at what we have:

str(dt)
## 'data.frame':    7032 obs. of  19 variables:
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : Factor w/ 5 levels "(0,9]","(9,29]",..: 1 4 1 4 1 1 2 2 2 5 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
##  $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ TechSupport     : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ StreamingTV     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : Factor w/ 5 levels "(17,36]","(36,60]",..: 1 2 2 2 3 5 4 1 5 2 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

And also don’t forget to check prepared data for NA values:

sapply(dt, function(x) sum(is.na(x)))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges            Churn 
##                0                0                0

Step 3: Preparing test and training data

Lets create training and test data. We will split the dataset like so:

. 70% - train data . 30% - test data

set.seed(999) 
sample <- <YOUR CODE>

train <- <YOUR CODE>
test  <- <YOUR CODE>

dim(train); dim(test)

First Model: Generalized linear model (logistic regression)

The first method that we will use is logistic regression:

LogModel <- glm(Churn ~ ., family=binomial, data=train)
summary(LogModel)
## 
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0351  -0.6853  -0.2958   0.6705   3.1482  
## 
## Coefficients: (1 not defined because of singularities)
##                                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)                          -0.30773    0.21345  -1.442 0.149399
## genderMale                            0.01540    0.07765   0.198 0.842833
## SeniorCitizen1                        0.27290    0.10043   2.717 0.006582
## PartnerYes                           -0.03172    0.09236  -0.344 0.731216
## DependentsYes                        -0.04873    0.10707  -0.455 0.649014
## tenure(9,29]                         -0.95065    0.10444  -9.103  < 2e-16
## tenure(29,33]                        -1.22222    0.20915  -5.844 5.11e-09
## tenure(33,55]                        -1.34941    0.13885  -9.719  < 2e-16
## tenure(55,72]                        -1.80027    0.18426  -9.770  < 2e-16
## PhoneServiceYes                      -0.13673    0.24444  -0.559 0.575922
## MultipleLinesYes                      0.30409    0.10249   2.967 0.003006
## InternetServiceFiber optic            1.33093    0.24271   5.484 4.17e-08
## InternetServiceNo                    -1.18393    0.31587  -3.748 0.000178
## OnlineSecurityNo internet service          NA         NA      NA       NA
## OnlineSecurityYes                    -0.33229    0.10476  -3.172 0.001515
## OnlineBackupYes                      -0.10641    0.09465  -1.124 0.260866
## DeviceProtectionYes                   0.07706    0.09775   0.788 0.430502
## TechSupportYes                       -0.31329    0.10480  -2.989 0.002795
## StreamingTVYes                        0.28290    0.12044   2.349 0.018828
## StreamingMoviesYes                    0.38586    0.12205   3.161 0.001570
## ContractOne year                     -0.61354    0.12602  -4.869 1.12e-06
## ContractTwo year                     -1.47177    0.20654  -7.126 1.03e-12
## PaperlessBillingYes                   0.36369    0.08912   4.081 4.48e-05
## PaymentMethodCredit card (automatic) -0.01932    0.13359  -0.145 0.885015
## PaymentMethodElectronic check         0.37206    0.11250   3.307 0.000943
## PaymentMethodMailed check            -0.06360    0.13862  -0.459 0.646360
## MonthlyCharges(36,60]                -0.19350    0.28210  -0.686 0.492764
## MonthlyCharges(60,71]                -0.74656    0.39609  -1.885 0.059451
## MonthlyCharges(71,90]                -0.83641    0.46257  -1.808 0.070578
## MonthlyCharges(90,119]               -0.94729    0.57981  -1.634 0.102303
##                                         
## (Intercept)                             
## genderMale                              
## SeniorCitizen1                       ** 
## PartnerYes                              
## DependentsYes                           
## tenure(9,29]                         ***
## tenure(29,33]                        ***
## tenure(33,55]                        ***
## tenure(55,72]                        ***
## PhoneServiceYes                         
## MultipleLinesYes                     ** 
## InternetServiceFiber optic           ***
## InternetServiceNo                    ***
## OnlineSecurityNo internet service       
## OnlineSecurityYes                    ** 
## OnlineBackupYes                         
## DeviceProtectionYes                     
## TechSupportYes                       ** 
## StreamingTVYes                       *  
## StreamingMoviesYes                   ** 
## ContractOne year                     ***
## ContractTwo year                     ***
## PaperlessBillingYes                  ***
## PaymentMethodCredit card (automatic)    
## PaymentMethodElectronic check        ***
## PaymentMethodMailed check               
## MonthlyCharges(36,60]                   
## MonthlyCharges(60,71]                .  
## MonthlyCharges(71,90]                .  
## MonthlyCharges(90,119]                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5681.1  on 4921  degrees of freedom
## Residual deviance: 4098.9  on 4893  degrees of freedom
## AIC: 4156.9
## 
## Number of Fisher Scoring iterations: 6

Now let’s predict the scores:

test$predict_glm <- predict(LogModel, newdata = test[, -19], type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

As result, we recived a scores. We have to round them to 1 or 0 depending on the threshold:

ggplot(data = test, aes(x=predict_glm, fill=Churn)) + geom_density(alpha=0.3) + theme_bw()

test$predict_glm <- ifelse(test$predict_glm >0.3, 1, 0)

Now we will evaluate the model:

Accuracy:

misClasificError <- mean(test$predict_glm != as.numeric(test$Churn)-1)
print(paste('Logistic Regression Accuracy',1-misClasificError))
## [1] "Logistic Regression Accuracy 0.759241706161137"

Second Model: Decision Tree

Logistic regression is a simple and transparent model. Let’s try something more powerful:

Way 1: with library “party”

tree <- ctree(Churn~Contract+tenure+PaperlessBilling, train)
plot(tree, type='simple')

Make prediction on Test dataset

pred_tree <- predict(tree, test)

Accuracy of Decision tree model on Test dataset:

misClasificError <- mean(pred_tree != test$Churn)
print(paste('Decision Tree Accuracy',1-misClasificError))
## [1] "Decision Tree Accuracy 0.774881516587678"

Way2: with library “rpart”

tree2 <- rpart(data=train, Churn ~ Contract+tenure+PaperlessBilling, method='class')
rpart.plot(tree2)

Third Model: Random Forrest

Create model with training data and see the confusion matrix

rfModel <- randomForest(Churn ~., data = train)
print(rfModel)
## 
## Call:
##  randomForest(formula = Churn ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 20.85%
## Confusion matrix:
##       No Yes class.error
## No  3270 353  0.09743307
## Yes  673 626  0.51809084

Predict churn on the test dataset with the created model

pred_rf <- predict(rfModel, test)

Accuracy:

misClasificError <- mean(pred_rf != test$Churn)
print(paste('Random Forest Accuracy',1-misClasificError))
## [1] "Random Forest Accuracy 0.792417061611374"

Plot model to check error dynamics for the different number of trees

layout(matrix(c(1,2),nrow=1), width=c(4,1)) 
par(mar=c(5,4,4,0)) #No margin on the right side
plot(rfModel)
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1), axes=F, xlab="", ylab="")
legend("top", colnames(rfModel$err.rate),col=1:4,cex=0.8,fill=1:4)

varImpPlot(rfModel, sort=T, n.var = 10, main = 'Top 10 Feature Importance')