Today we will continue discussing customer lifecycle management. In particular we concentrate on the retention problem. Business data analytics can help you identify who is about to churn by training classification model.
Libraries
During the lab session we will need the following libraries:
library('tidyverse')
library('data.table')
library("party") # Decision Trees plotting
library("rpart") # Another Decision Trees library
library("rpart.plot") # Decision Trees plotting
library("randomForest") # Random Forest
Step 1: Data loading
Let’s load the data:
dt <- read.csv(file.choose()) # churn.csv
View(dt)
And check the data type of each column
sapply(dt, class)
## customerID gender SeniorCitizen Partner
## "factor" "factor" "integer" "factor"
## Dependents tenure PhoneService MultipleLines
## "factor" "integer" "factor" "factor"
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## "factor" "factor" "factor" "factor"
## TechSupport StreamingTV StreamingMovies Contract
## "factor" "factor" "factor" "factor"
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## "factor" "factor" "numeric" "numeric"
## Churn
## "factor"
Step 2: Data cleaning and transformation
Next step is to clean the data. It is necessary to take care of unknown values before performing the prediction task since we cant take into the account instances where values are missing.
Let’s check the columns with NA values
sapply(dt, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
Remove the rows with NA:
dt <- <YOUR CODE>
Next we check the unique calues for each column
func <- function(x) {
if(class(x) == "factor")
{
unique(x)
}
else
{
paste(min(x),"-", max(x))
}
}
sapply(dt[,-1], func)
## $gender
## [1] Female Male
## Levels: Female Male
##
## $SeniorCitizen
## [1] "0 - 1"
##
## $Partner
## [1] Yes No
## Levels: No Yes
##
## $Dependents
## [1] No Yes
## Levels: No Yes
##
## $tenure
## [1] "1 - 72"
##
## $PhoneService
## [1] No Yes
## Levels: No Yes
##
## $MultipleLines
## [1] No phone service No Yes
## Levels: No No phone service Yes
##
## $InternetService
## [1] DSL Fiber optic No
## Levels: DSL Fiber optic No
##
## $OnlineSecurity
## [1] No Yes No internet service
## Levels: No No internet service Yes
##
## $OnlineBackup
## [1] Yes No No internet service
## Levels: No No internet service Yes
##
## $DeviceProtection
## [1] No Yes No internet service
## Levels: No No internet service Yes
##
## $TechSupport
## [1] No Yes No internet service
## Levels: No No internet service Yes
##
## $StreamingTV
## [1] No Yes No internet service
## Levels: No No internet service Yes
##
## $StreamingMovies
## [1] No Yes No internet service
## Levels: No No internet service Yes
##
## $Contract
## [1] Month-to-month One year Two year
## Levels: Month-to-month One year Two year
##
## $PaperlessBilling
## [1] Yes No
## Levels: No Yes
##
## $PaymentMethod
## [1] Electronic check Mailed check
## [3] Bank transfer (automatic) Credit card (automatic)
## 4 Levels: Bank transfer (automatic) ... Mailed check
##
## $MonthlyCharges
## [1] "18.25 - 118.75"
##
## $TotalCharges
## [1] "18.8 - 8684.8"
##
## $Churn
## [1] No Yes
## Levels: No Yes
Let us also remove the “customerId” variable as these values are all unique and won’t contribute in the model.
<YOUR CODE>
Next, we need to change “No internet service” to “No” for the features “OnlineSecurity”, #“OnlineBackup”, “DeviceProtection”, “TechSupport”, “streamingTV”, “streamingMovies” since “No internet service” and “No” bear the same meaning.
for(i in c(10:15)) {
print(names(dt[i]))
dt[i][dt[i]=="No internet service"] <- "No"
dt[,i] <- factor(dt[,i])
}
## [1] "OnlineBackup"
## [1] "DeviceProtection"
## [1] "TechSupport"
## [1] "StreamingTV"
## [1] "StreamingMovies"
## [1] "Contract"
And we should also change “No phone service” to “No” for MultipleLines
<YOUR CODE>
As a next step in data preparation because some models will not accept numeric values, we need to change them into Factors. So for columns (or variables) which have a few (or small number of) values we will transform those columns (having integer values) into factors. For those columns which have higher number of unique values, we will transform them into bins.
unique(dt$SeniorCitizen)
## [1] 0 1
Just two values. We can turn this column to factor:
dt$SeniorCitizen <- <YOUR CODE>
Next column at which we should look is “tenure”. Lets plot it on the graph instead of just looking on the ranges:
dt %>%
<YOUR CODE>
Above we can see the distribution and range of the values however, we will use summary command to select intervals which would be more representative (discussed in Lab 3 also)
(Refer to Lab 3 for help)
Let’s create groups based on next values:
summary(dt$tenure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 29.00 32.42 55.00 72.00
dt <- mutate(dt, tenure = <YOUR CODE>)
unique(dt$tenure)
Now only two columns with type “numeric” left. We can either split the data into groups the save way we did with “tenure” or just ignore them while creating the model. However, ignoring features may decreese quality of the model. That is why we will split thouse columns as well.
But befor we start, lets plot the values for features “MonthlyCharges” and “TotalCharges”:
dt %>%
<YOUR CODE>
dt %>%
<YOUR CODE>
The plots looks similar. Let’s check correlation between those features:
cor(dt$MonthlyCharges, dt$TotalCharges)
## [1] 0.6510648
So, by looking at correlation we can say that those features are dependent on each other. This gives us possability to keep only one of them, as having both of the features won’t increase quality of our model that much.
dt$TotalCharges <- NULL
Now let’s split “MonthlyCharges”:
summary(dt$MonthlyCharges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.25 35.59 70.35 64.80 89.86 118.75
dt <- mutate(dt, MonthlyCharges = <YOUR CODE>)
unique(dt$MonthlyCharges)
Now that we prepared the data, let’s look at what we have:
str(dt)
## 'data.frame': 7032 obs. of 19 variables:
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : Factor w/ 5 levels "(0,9]","(9,29]",..: 1 4 1 4 1 1 2 2 2 5 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 1 1 2 ...
## $ DeviceProtection: Factor w/ 2 levels "No","Yes": 1 2 1 2 1 2 1 1 2 1 ...
## $ TechSupport : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ StreamingTV : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
## $ StreamingMovies : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 2 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : Factor w/ 5 levels "(17,36]","(36,60]",..: 1 2 2 2 3 5 4 1 5 2 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
And also don’t forget to check prepared data for NA values:
sapply(dt, function(x) sum(is.na(x)))
## gender SeniorCitizen Partner Dependents
## 0 0 0 0
## tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges Churn
## 0 0 0
Step 3: Preparing test and training data
Lets create training and test data. We will split the dataset like so:
. 70% - train data . 30% - test data
set.seed(999)
sample <- <YOUR CODE>
train <- <YOUR CODE>
test <- <YOUR CODE>
dim(train); dim(test)
First Model: Generalized linear model (logistic regression)
The first method that we will use is logistic regression:
LogModel <- glm(Churn ~ ., family=binomial, data=train)
summary(LogModel)
##
## Call:
## glm(formula = Churn ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0351 -0.6853 -0.2958 0.6705 3.1482
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.30773 0.21345 -1.442 0.149399
## genderMale 0.01540 0.07765 0.198 0.842833
## SeniorCitizen1 0.27290 0.10043 2.717 0.006582
## PartnerYes -0.03172 0.09236 -0.344 0.731216
## DependentsYes -0.04873 0.10707 -0.455 0.649014
## tenure(9,29] -0.95065 0.10444 -9.103 < 2e-16
## tenure(29,33] -1.22222 0.20915 -5.844 5.11e-09
## tenure(33,55] -1.34941 0.13885 -9.719 < 2e-16
## tenure(55,72] -1.80027 0.18426 -9.770 < 2e-16
## PhoneServiceYes -0.13673 0.24444 -0.559 0.575922
## MultipleLinesYes 0.30409 0.10249 2.967 0.003006
## InternetServiceFiber optic 1.33093 0.24271 5.484 4.17e-08
## InternetServiceNo -1.18393 0.31587 -3.748 0.000178
## OnlineSecurityNo internet service NA NA NA NA
## OnlineSecurityYes -0.33229 0.10476 -3.172 0.001515
## OnlineBackupYes -0.10641 0.09465 -1.124 0.260866
## DeviceProtectionYes 0.07706 0.09775 0.788 0.430502
## TechSupportYes -0.31329 0.10480 -2.989 0.002795
## StreamingTVYes 0.28290 0.12044 2.349 0.018828
## StreamingMoviesYes 0.38586 0.12205 3.161 0.001570
## ContractOne year -0.61354 0.12602 -4.869 1.12e-06
## ContractTwo year -1.47177 0.20654 -7.126 1.03e-12
## PaperlessBillingYes 0.36369 0.08912 4.081 4.48e-05
## PaymentMethodCredit card (automatic) -0.01932 0.13359 -0.145 0.885015
## PaymentMethodElectronic check 0.37206 0.11250 3.307 0.000943
## PaymentMethodMailed check -0.06360 0.13862 -0.459 0.646360
## MonthlyCharges(36,60] -0.19350 0.28210 -0.686 0.492764
## MonthlyCharges(60,71] -0.74656 0.39609 -1.885 0.059451
## MonthlyCharges(71,90] -0.83641 0.46257 -1.808 0.070578
## MonthlyCharges(90,119] -0.94729 0.57981 -1.634 0.102303
##
## (Intercept)
## genderMale
## SeniorCitizen1 **
## PartnerYes
## DependentsYes
## tenure(9,29] ***
## tenure(29,33] ***
## tenure(33,55] ***
## tenure(55,72] ***
## PhoneServiceYes
## MultipleLinesYes **
## InternetServiceFiber optic ***
## InternetServiceNo ***
## OnlineSecurityNo internet service
## OnlineSecurityYes **
## OnlineBackupYes
## DeviceProtectionYes
## TechSupportYes **
## StreamingTVYes *
## StreamingMoviesYes **
## ContractOne year ***
## ContractTwo year ***
## PaperlessBillingYes ***
## PaymentMethodCredit card (automatic)
## PaymentMethodElectronic check ***
## PaymentMethodMailed check
## MonthlyCharges(36,60]
## MonthlyCharges(60,71] .
## MonthlyCharges(71,90] .
## MonthlyCharges(90,119]
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5681.1 on 4921 degrees of freedom
## Residual deviance: 4098.9 on 4893 degrees of freedom
## AIC: 4156.9
##
## Number of Fisher Scoring iterations: 6
Now let’s predict the scores:
test$predict_glm <- predict(LogModel, newdata = test[, -19], type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
As result, we recived a scores. We have to round them to 1 or 0 depending on the threshold:
ggplot(data = test, aes(x=predict_glm, fill=Churn)) + geom_density(alpha=0.3) + theme_bw()
test$predict_glm <- ifelse(test$predict_glm >0.3, 1, 0)
Now we will evaluate the model:
Accuracy:
misClasificError <- mean(test$predict_glm != as.numeric(test$Churn)-1)
print(paste('Logistic Regression Accuracy',1-misClasificError))
## [1] "Logistic Regression Accuracy 0.759241706161137"
Second Model: Decision Tree
Logistic regression is a simple and transparent model. Let’s try something more powerful:
Way 1: with library “party”
tree <- ctree(Churn~Contract+tenure+PaperlessBilling, train)
plot(tree, type='simple')
Make prediction on Test dataset
pred_tree <- predict(tree, test)
Accuracy of Decision tree model on Test dataset:
misClasificError <- mean(pred_tree != test$Churn)
print(paste('Decision Tree Accuracy',1-misClasificError))
## [1] "Decision Tree Accuracy 0.774881516587678"
Way2: with library “rpart”
tree2 <- rpart(data=train, Churn ~ Contract+tenure+PaperlessBilling, method='class')
rpart.plot(tree2)
Third Model: Random Forrest
Create model with training data and see the confusion matrix
rfModel <- randomForest(Churn ~., data = train)
print(rfModel)
##
## Call:
## randomForest(formula = Churn ~ ., data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 20.85%
## Confusion matrix:
## No Yes class.error
## No 3270 353 0.09743307
## Yes 673 626 0.51809084
Predict churn on the test dataset with the created model
pred_rf <- predict(rfModel, test)
Accuracy:
misClasificError <- mean(pred_rf != test$Churn)
print(paste('Random Forest Accuracy',1-misClasificError))
## [1] "Random Forest Accuracy 0.792417061611374"
Plot model to check error dynamics for the different number of trees
layout(matrix(c(1,2),nrow=1), width=c(4,1))
par(mar=c(5,4,4,0)) #No margin on the right side
plot(rfModel)
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1), axes=F, xlab="", ylab="")
legend("top", colnames(rfModel$err.rate),col=1:4,cex=0.8,fill=1:4)
varImpPlot(rfModel, sort=T, n.var = 10, main = 'Top 10 Feature Importance')