Tuesday, 27 October 2015

Machine Learning Case Study - Churn Analytics

In this tutorial you will learn how to build a churn model using the R programming language. It walks through the major steps of a data science/machine learning pipeline: data set cleaning, feature extraction, feature enrichment, and model building and evaluation. Cross-validation won't be covered in this tutorial; I will discuss it in more detail in another case study.

Overview

The goal of churn analytics is to understand the primary drivers to churn and predict churn. Churn can have a very specific meaning depending upon the industry or even the organization we are talking about, but in general it is related to the extension of the contract between a service provider and a subscriber.

Businesses need to have an effective strategy for managing customer churn because it costs more to attract new customers than to retain existing ones. Customer churn can take different forms, such as switching to a competitor’s service, reducing the time spent using the service, reducing the number of services used, or switching to a lower-cost service. Companies in the retail, media, telecommunication, and banking industries use churn modeling to create better products, services, and experiences that lead to a higher customer retention rate. Churn models enable companies to predict which customers are most likely to churn, and to understand the factors that cause churn to occur.

In this data set, the target variable is the last column, Status, which records whether a user churned. True stands for churned customers and False for active ones. We have a total of 4293 active and 707 churned customers in the data set.
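These counts can be verified with a simple frequency table. The sketch below uses a synthetic Status column with the stated counts, standing in for the full data set that we load later in the tutorial:

```r
# Verify the class balance: 4293 active (False) vs 707 churned (True)
status <- factor(c(rep("False", 4293), rep("True", 707)))
table(status)                          # False: 4293, True: 707
round(prop.table(table(status)), 3)    # churn rate is roughly 14%
```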

Let's have a look at the column definition; this is a good starting point to understand a data-set:

Column Data Type

Column Name                      Type
state                            Discrete
account length                   Continuous
area code                        Continuous
phone number                     Discrete
international plan               Discrete
voice mail plan                  Discrete
number vmail messages            Continuous
total day minutes                Continuous
total day calls                  Continuous
total day charge                 Continuous
total eve minutes                Continuous
total eve calls                  Continuous
total eve charge                 Continuous
total night minutes              Continuous
total night calls                Continuous
total night charge               Continuous
total intl minutes               Continuous
total intl calls                 Continuous
total intl charge                Continuous
number customer service calls    Continuous
Status                           Discrete


As seen in this table, rows 8 to 19 of the data set contain telecom service usage metrics. They cover attributes such as the total number of calls, total charge, and total minutes used across different slices of the data. The slices are the time of day (day, evening, or night) and the usage type, such as international calls. Row 20 holds the number of customer service calls made, and row 21 is the status of the subscriber, which is our target variable.

Data Cleaning Steps

The data set is not clean, so we will perform some basic cleaning using the Unix sed utility.

# Remove the white spaces from the file
sed -i 's/\s//g' churn.all

# Replace "False." with "False" and "True." with "True"
# (the dot must be escaped because it is a regex metacharacter)
sed -i 's/False\./False/g' churn.all
sed -i 's/True\./True/g' churn.all


# Add the header line
sed -i '1s/^/state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,number customer service calls,Status\n/' churn.all
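Before running these commands on the full file, the substitutions can be sanity-checked on a throwaway sample (a sketch; sample.all is a temporary file, not part of the tutorial's data, and GNU sed is assumed for the -i flag):

```shell
# Simulate one raw line with stray spaces and a trailing "False.",
# then apply the same substitutions used above
printf 'KS,128, 415, 382-4657,no,yes,25,110,45.07,1,False.\n' > sample.all
sed -i 's/\s//g' sample.all
sed -i 's/False\./False/g' sample.all
sed -i 's/True\./True/g' sample.all
cat sample.all    # the line now ends in "False" with no spaces
```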


Exploring Data set using R

#Read the csv into a data frame
churn_data<-read.csv("churn.all",header=T)

# View the data frame created
View(churn_data)

# Summaries of all, churn-only, and active-only customers
summary_all<-summary(churn_data)
summary_churn<-summary(subset(churn_data,Status=='True'))
summary_active<-summary(subset(churn_data,Status=='False'))

Note: the pattern to look for in these summaries is a substantial difference between the churn and active groups, especially in the mean, median, and the 1st and 3rd quartiles.
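The kind of group difference to look for can be illustrated with tapply on a single feature. This is a sketch on synthetic data; in the tutorial it would be run on churn_data with a real column such as total.day.minutes:

```r
# Synthetic example: compare the mean of one feature across the two groups
set.seed(42)
demo <- data.frame(total.day.minutes = c(rnorm(50, mean = 175, sd = 30),
                                         rnorm(50, mean = 210, sd = 30)),
                   Status = rep(c("False", "True"), each = 50))
tapply(demo$total.day.minutes, demo$Status, mean)   # churn group mean is higher
```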

# Check the correlation between numerical variables
cor_data<-churn_data
cor_data$Status<-NULL
cor_data$voice.mail.plan<-NULL
cor_data$international.plan <-NULL
cor_data$phone.number<-NULL
cor_data$state<-NULL

# Calculate the correlation, which returns a correlation matrix
correlation_all<-cor(cor_data)
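Rather than scanning the matrix by eye, the heavily correlated pairs can be listed programmatically. This is a sketch: synthetic data stands in for cor_data, and the 0.9 threshold is an assumption:

```r
# Demo data with one derived (hence perfectly correlated) column
set.seed(1)
demo <- data.frame(total.day.minutes = runif(100, 0, 350))
demo$total.day.charge <- demo$total.day.minutes * 0.17   # charge = rate * minutes
demo$total.day.calls  <- rpois(100, 100)

cm  <- cor(demo)
idx <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           cor  = cm[idx])   # lists the minutes/charge pair
```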

Feature Engineering

Looking at the correlation_all object, we can see that four pairs of columns are heavily correlated: each total charge column is essentially a linear function of the corresponding total minutes column. We therefore drop one column from each pair:

churn_data$total.day.charge<-NULL
churn_data$total.eve.charge<-NULL
churn_data$total.night.charge<-NULL
churn_data$total.intl.charge<-NULL

We will also remove the features phone number and state. We need to remove these columns from all the files. Another way of removing correlation is to perform dimensionality reduction such as PCA (preferred when the feature set is large).

churn_data$state<-NULL
churn_data$phone.number<-NULL
write.csv(churn_data,file="churn_data_clean.all.csv",row.names = F)
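The PCA alternative mentioned above could look like this. It is a sketch on synthetic numeric data; on the real data it would be applied to the numeric columns of churn_data:

```r
# PCA on data with one redundant column
set.seed(7)
num <- data.frame(minutes = rnorm(100, 200, 40),
                  calls   = rnorm(100, 100, 10))
num$charge <- num$minutes * 0.17          # redundant: a linear function of minutes
pc <- prcomp(num, center = TRUE, scale. = TRUE)
summary(pc)   # the redundant direction explains almost no variance
```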

Feature Enrichment

Features like total day calls and total eve calls measure frequency of usage, whereas features such as total day minutes and total eve minutes measure volume of usage. Another interesting feature to look at is the average minutes per call, obtained by dividing the total minutes by the total calls. For example, average minutes per day call = total day minutes / total day calls, and similarly, average minutes per eve call = total eve minutes / total eve calls.

churn_data$avg.minute.day <- churn_data$total.day.minutes/churn_data$total.day.calls
churn_data$avg.minute.eve <- churn_data$total.eve.minutes/churn_data$total.eve.calls

churn_data$avg.minute.night <- churn_data$total.night.minutes/churn_data$total.night.calls
churn_data$avg.minute.intl <- churn_data$total.intl.minutes/churn_data$total.intl.calls
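One caveat with these ratios: a customer with zero calls in a slice produces NaN (0/0) or Inf, which will break most modeling functions. A minimal guard is sketched below (replacing non-finite values with 0 is an assumption; NA is another option):

```r
total.minutes <- c(120.5, 0, 88.2)
total.calls   <- c(40, 0, 0)          # two customers with zero calls
avg <- total.minutes / total.calls    # 3.0125, NaN, Inf
avg[!is.finite(avg)] <- 0             # replace non-finite ratios with 0
avg                                   # 3.0125, 0, 0
```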

# Splitting train and test data using caret
library(caret)
set.seed(111)
trainIndex <- createDataPartition(churn_data$Status, p = 0.7, list = FALSE, times = 1)

train <- churn_data[trainIndex, ]
test <- churn_data[-trainIndex, ]

# See the distribution of churn and active accounts across the train and test sets
table(train$Status)
table(test$Status)
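createDataPartition performs a stratified split, so both sets should preserve the roughly 14% churn rate of the full data. The same idea can be checked with a base-R sketch (synthetic Status column; caret is not needed here):

```r
# Stratified 70/30 split: sample 70% of the indices within each class
set.seed(111)
status <- factor(c(rep("False", 4293), rep("True", 707)))
idx <- unlist(lapply(split(seq_along(status), status),
                     function(i) sample(i, floor(0.7 * length(i)))))
round(prop.table(table(status[idx])), 3)    # train proportions
round(prop.table(table(status[-idx])), 3)   # test proportions, roughly equal
```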

# Save the train & test sets as csv files
write.csv(train,file="churn_data_clean.all.csv",row.names = F)
write.csv(test,file="churn_data_clean_test.all.csv",row.names = F)


Now that we have the training and test data sets, we can build our churn model using Apache Mahout, R, or Python. Here I will show the R version only; all the code (R, Mahout, and Python) is available on GitHub.

CHURN MODEL using R

churnTrain<-read.csv("churn_data_clean.all.csv")
churnTest<-read.csv("churn_data_clean_test.all.csv")

# Make sure the target is a factor so randomForest performs classification
# (read.csv no longer converts strings to factors by default in R >= 4.0)
churnTrain$Status<-as.factor(churnTrain$Status)
churnTest$Status<-as.factor(churnTest$Status)

#Using RandomForest
library(randomForest)
churnRF<-randomForest(Status~.,data=churnTrain,ntree=100,proximity=TRUE)

#Confusion matrix on the training set (predict() with no newdata returns out-of-bag predictions)
table(predict(churnRF),churnTrain$Status)

Output:
       False True
  False  2980  166
  True     26  329
> (2980+329)/nrow(churnTrain)
[1] 0.9451585


#Accuracy on Test set
churnPred<-predict(churnRF,newdata=churnTest)
table(churnPred, churnTest$Status)

Output:
  False True
    False  1274   63
    True     13  149
> (1274+149)/nrow(churnTest)
[1] 0.9492995
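The test-set accuracy can also be computed directly from the confusion-matrix counts shown above:

```r
# Counts from the test-set confusion matrix (columns are actual classes)
cm <- matrix(c(1274, 13, 63, 149), nrow = 2,
             dimnames = list(predicted = c("False", "True"),
                             actual    = c("False", "True")))
sum(diag(cm)) / sum(cm)   # 0.9492995
```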

So the churn model using random forest has an accuracy of 94.9% on the test set.
