Tuesday, 27 October 2015

Machine Learning Case Study - Churn Analytics

In this tutorial you will learn how to build a churn model using the R programming language. It walks through the major steps of a data science/machine learning pipeline: data set cleaning, feature extraction, feature enrichment, and model building and evaluation. Cross-validation won't be covered in this tutorial; I will discuss it in more detail in another case study.

Overview

The goal of churn analytics is to understand the primary drivers to churn and predict churn. Churn can have a very specific meaning depending upon the industry or even the organization we are talking about, but in general it is related to the extension of the contract between a service provider and a subscriber.

Businesses need to have an effective strategy for managing customer churn because it costs more to attract new customers than to retain existing ones. Customer churn can take different forms, such as switching to a competitor’s service, reducing the time spent using the service, reducing the number of services used, or switching to a lower-cost service. Companies in the retail, media, telecommunication, and banking industries use churn modeling to create better products, services, and experiences that lead to a higher customer retention rate. Churn models enable companies to predict which customers are most likely to churn, and to understand the factors that cause churn to occur.

In this data set, the target variable is the last column, Status, which records whether a user churned. True stands for churned customers and False for active ones. We have a total of 4293 active and 707 churned customers in the data set.
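These counts can be verified with a simple frequency table. The sketch below uses a synthetic Status column with the stated counts, standing in for the full data set that we load later in the tutorial:

```r
# Verify the class balance: 4293 active (False) vs 707 churned (True)
status <- factor(c(rep("False", 4293), rep("True", 707)))
table(status)                          # False: 4293, True: 707
round(prop.table(table(status)), 3)    # churn rate is roughly 14%
```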

Let's have a look at the column definition; this is a good starting point to understand a data-set:

Column Data Type

Column Name                      Type
state                            Discrete
account length                   Continuous
area code                        Continuous
phone number                     Discrete
international plan               Discrete
voice mail plan                  Discrete
number vmail messages            Continuous
total day minutes                Continuous
total day calls                  Continuous
total day charge                 Continuous
total eve minutes                Continuous
total eve calls                  Continuous
total eve charge                 Continuous
total night minutes              Continuous
total night calls                Continuous
total night charge               Continuous
total intl minutes               Continuous
total intl calls                 Continuous
total intl charge                Continuous
number customer service calls    Continuous
Status                           Discrete


As seen in this table, rows 8 to 19 of the data set contain telecom service usage metrics. They cover attributes such as the total number of calls, total charge, and total minutes used across different slices of the data. The slices are the time of day (day, evening, or night) and the usage type, such as international calls. Row 20 holds the number of customer service calls made, and row 21 is the status of the subscriber, which is our target variable.

Data Cleaning Steps

The data set is not clean, so we will perform some basic cleaning using the Unix sed utility.

# Remove the white spaces from the file
sed -i 's/\s//g' churn.all

# Replace "False." with "False" and "True." with "True"
# (the dot must be escaped because it is a regex metacharacter)
sed -i 's/False\./False/g' churn.all
sed -i 's/True\./True/g' churn.all


# Add the header line
sed -i '1s/^/state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,number customer service calls,Status\n/' churn.all
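Before running these commands on the full file, the substitutions can be sanity-checked on a throwaway sample (a sketch; sample.all is a temporary file, not part of the tutorial's data, and GNU sed is assumed for the -i flag):

```shell
# Simulate one raw line with stray spaces and a trailing "False.",
# then apply the same substitutions used above
printf 'KS,128, 415, 382-4657,no,yes,25,110,45.07,1,False.\n' > sample.all
sed -i 's/\s//g' sample.all
sed -i 's/False\./False/g' sample.all
sed -i 's/True\./True/g' sample.all
cat sample.all    # the line now ends in "False" with no spaces
```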


Exploring Data set using R

#Read the csv into a data frame
churn_data<-read.csv("churn.all",header=T)

# View the data frame created
View(churn_data)

# Summaries of all, churn-only, and active-only customers
summary_all<-summary(churn_data)
summary_churn<-summary(subset(churn_data,Status=='True'))
summary_active<-summary(subset(churn_data,Status=='False'))

Note: the pattern to look for in these summaries is a substantial difference between the churn and active groups, especially in the mean, median, and the 1st and 3rd quartiles.
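The kind of group difference to look for can be illustrated with tapply on a single feature. This is a sketch on synthetic data; in the tutorial it would be run on churn_data with a real column such as total.day.minutes:

```r
# Synthetic example: compare the mean of one feature across the two groups
set.seed(42)
demo <- data.frame(total.day.minutes = c(rnorm(50, mean = 175, sd = 30),
                                         rnorm(50, mean = 210, sd = 30)),
                   Status = rep(c("False", "True"), each = 50))
tapply(demo$total.day.minutes, demo$Status, mean)   # churn group mean is higher
```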

# Check the correlation between numerical variables
cor_data<-churn_data
cor_data$Status<-NULL
cor_data$voice.mail.plan<-NULL
cor_data$international.plan <-NULL
cor_data$phone.number<-NULL
cor_data$state<-NULL

# Calculate the correlation, which returns a correlation matrix
correlation_all<-cor(cor_data)
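Rather than scanning the matrix by eye, the heavily correlated pairs can be listed programmatically. This is a sketch: synthetic data stands in for cor_data, and the 0.9 threshold is an assumption:

```r
# Demo data with one derived (hence perfectly correlated) column
set.seed(1)
demo <- data.frame(total.day.minutes = runif(100, 0, 350))
demo$total.day.charge <- demo$total.day.minutes * 0.17   # charge = rate * minutes
demo$total.day.calls  <- rpois(100, 100)

cm  <- cor(demo)
idx <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           cor  = cm[idx])   # lists the minutes/charge pair
```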

Feature Engineering

Looking at the correlation_all object, we can see that four pairs of columns are heavily correlated: each total charge column is essentially a linear function of the corresponding total minutes column. We therefore drop one column from each pair:

churn_data$total.day.charge<-NULL
churn_data$total.eve.charge<-NULL
churn_data$total.night.charge<-NULL
churn_data$total.intl.charge<-NULL

We will also remove the features phone number and state. We need to remove these columns from all the files. Another way of removing correlation is to perform dimensionality reduction such as PCA (preferred when the feature set is large).

churn_data$state<-NULL
churn_data$phone.number<-NULL
write.csv(churn_data,file="churn_data_clean.all.csv",row.names = F)
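The PCA alternative mentioned above could look like this. It is a sketch on synthetic numeric data; on the real data it would be applied to the numeric columns of churn_data:

```r
# PCA on data with one redundant column
set.seed(7)
num <- data.frame(minutes = rnorm(100, 200, 40),
                  calls   = rnorm(100, 100, 10))
num$charge <- num$minutes * 0.17          # redundant: a linear function of minutes
pc <- prcomp(num, center = TRUE, scale. = TRUE)
summary(pc)   # the redundant direction explains almost no variance
```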

Feature Enrichment

Features like total day calls and total eve calls measure frequency of usage, whereas features such as total day minutes and total eve minutes measure volume of usage. Another interesting feature to look at is the average minutes per call, obtained by dividing the total minutes by the total calls. For example, average minutes per day call = total day minutes / total day calls, and similarly, average minutes per eve call = total eve minutes / total eve calls.

churn_data$avg.minute.day <- churn_data$total.day.minutes/churn_data$total.day.calls
churn_data$avg.minute.eve <- churn_data$total.eve.minutes/churn_data$total.eve.calls

churn_data$avg.minute.night <- churn_data$total.night.minutes/churn_data$total.night.calls
churn_data$avg.minute.intl <- churn_data$total.intl.minutes/churn_data$total.intl.calls
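One caveat with these ratios: a customer with zero calls in a slice produces NaN (0/0) or Inf, which will break most modeling functions. A minimal guard is sketched below (replacing non-finite values with 0 is an assumption; NA is another option):

```r
total.minutes <- c(120.5, 0, 88.2)
total.calls   <- c(40, 0, 0)          # two customers with zero calls
avg <- total.minutes / total.calls    # 3.0125, NaN, Inf
avg[!is.finite(avg)] <- 0             # replace non-finite ratios with 0
avg                                   # 3.0125, 0, 0
```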

# Splitting train and test data using caret
library(caret)
set.seed(111)
trainIndex <- createDataPartition(churn_data$Status, p = 0.7, list = FALSE, times = 1)

train <- churn_data[trainIndex, ]
test <- churn_data[-trainIndex, ]

# See the distribution of churn and active accounts across the train and test sets
table(train$Status)
table(test$Status)
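createDataPartition performs a stratified split, so both sets should preserve the roughly 14% churn rate of the full data. The same idea can be checked with a base-R sketch (synthetic Status column; caret is not needed here):

```r
# Stratified 70/30 split: sample 70% of the indices within each class
set.seed(111)
status <- factor(c(rep("False", 4293), rep("True", 707)))
idx <- unlist(lapply(split(seq_along(status), status),
                     function(i) sample(i, floor(0.7 * length(i)))))
round(prop.table(table(status[idx])), 3)    # train proportions
round(prop.table(table(status[-idx])), 3)   # test proportions, roughly equal
```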

# Save the train & test sets as csv files
write.csv(train,file="churn_data_clean.all.csv",row.names = F)
write.csv(test,file="churn_data_clean_test.all.csv",row.names = F)


Now that we have the training and test data sets, we can build our churn model using Apache Mahout, R, or Python. Here I will show the R version only; all the code (R, Mahout, and Python) is available on GitHub.

CHURN MODEL using R

churnTrain<-read.csv("churn_data_clean.all.csv")
churnTest<-read.csv("churn_data_clean_test.all.csv")

# Make sure the target is a factor so randomForest performs classification
# (read.csv no longer converts strings to factors by default in R >= 4.0)
churnTrain$Status<-as.factor(churnTrain$Status)
churnTest$Status<-as.factor(churnTest$Status)

#Using RandomForest
library(randomForest)
churnRF<-randomForest(Status~.,data=churnTrain,ntree=100,proximity=TRUE)

#Confusion matrix on the training set (predict() with no newdata returns out-of-bag predictions)
table(predict(churnRF),churnTrain$Status)

Output:
       False True
  False  2980  166
  True     26  329
> (2980+329)/nrow(churnTrain)
[1] 0.9451585


#Accuracy on Test set
churnPred<-predict(churnRF,newdata=churnTest)
table(churnPred, churnTest$Status)

Output:
  False True
    False  1274   63
    True     13  149
> (1274+149)/nrow(churnTest)
[1] 0.9492995
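The test-set accuracy can also be computed directly from the confusion-matrix counts shown above:

```r
# Counts from the test-set confusion matrix (columns are actual classes)
cm <- matrix(c(1274, 13, 63, 149), nrow = 2,
             dimnames = list(predicted = c("False", "True"),
                             actual    = c("False", "True")))
sum(diag(cm)) / sum(cm)   # 0.9492995
```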

So the churn model using random forest has an accuracy of 94.9% on the test set.
