Big Data Analytics and Machine Learning

Friday, 6 November 2015

Deep Learning - Image Classification and Similar Image Retrieval

In this tutorial i will show you how to build a deep learning network for image recognition CIFAR-10 data set. The CIFAR data-set represents real-world data that is already formatted and labeled, so we can focus on building our network today instead of cleaning the data.

We will be using deep visual features to match the product images to each other. In order to do that, we need to load in our pre-trained ImageNet neural network model to be used as a feature extractor, and extract features from the images in the data set.

Deep Learning

Deep learning (deep machine learning, or deep structured learning, or hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers with complex structures or otherwise, composed of multiple non-linear transformations.
More details follow :
https://en.wikipedia.org/wiki/Deep_learning
http://deeplearning.stanford.edu/tutorial/

This tutorial will teach you the following:

Loading the Data
Training our Model
Evaluating the results of the model
Look at ways to improving the performance of our model

All Codes available at github.

Note: Few things in this tutorial are taken from the course that i have done recently on Coursera on Machine Learning Foundation.

We've download a subset of the CIFAR-10 data set for you. We load the data into an SFame, which is a powerful and scalable data structure that is used by many of the models in GraphLab Create.

Loading Image train and test data
image_train = graphlab.SFrame('http://s3.amazonaws.com/dato-datasets/coursera/deep_learning/image_train_data')
image_test = graphlab.SFrame('http://s3.amazonaws.com/dato-datasets/coursera/deep_learning/image_test_data')

Resizing image size
graphlab.image_analysis.resize(image_train['image'], 128, 128, 3).show()

Build Logistic Classifier

Now use the logistic_classifier provided by GraphLab.
raw_pixel_model = graphlab.logistic_classifier.create(image_train, features=['image_array'], target='label')

Model Validation and Evaluation

In order to ensure that the deep learning model is actually learning how to recognize the data, instead of memorizing features, we want to validate it with a data set it hasn't seen before.

graphlab.image_analysis.resize(image_test[0:1]['image'], 128, 128, 3).show()

Let us check the label of this image.
image_test[0:1]['label']

Output:
dtype: str
Rows: 1
['cat']

Now let see what our model predict.
raw_pixel_model.predict(image_test[0:1])

Output:
dtype: str
Rows: 1
['dog']

Ooph !! it a wrong prediction..

Evaluating the Logistic model
raw_pixel_model.evaluate(image_test)

Output:
'accuracy': 0.46225

Build a Neural Network Classifier

Now use the neural network classifier provided by GraphLab and let us create a neural network for our data set with same feature. The create method picks a default network architecture for you based on the data set.
deep_model = graphlab.neuralnet_classifier.create(image_train, features=['image'], target='label')

Prediction with Neural Network Model
Lets use the same test data that we used for our previous model.
graphlab.image_analysis.resize(image_test[0:1]['image'], 128, 128, 3).show()
deep_model.predict(image_test[0:1])

Output:
dtype: str
Rows: 1
['cat']

Wow great !! it predicted correctly..Hooray ~~

Evaluating the Neural Model
deep_model.evaluate(image_test)

Output:
'accuracy': 0.597000002861023

Surely the accuracy level is better than Logistic classifier model.

But can we improve it further???

Improving the model further with Deep Features

Let us try to rebuild the model using deep features of the imagesand evauate our model.

Rebuilding our model
deep_features_model = graphlab.logistic_classifier.create(image_train, features=['deep_features'], target='label')

Evaluating the deep feature model
deep_features_model.evaluate(image_test)

Output:
'accuracy': 0.784

Its awesome!! You try withe the neural classifier with deep feature and check whether its better than this or not.

Similar Image Retrieval

It time to dig deeper. Using the deep learning knowledge that we have gather so far,lets use the images having deep feature to find similar images.
One real world use-case for this would be searching similar items at shopping cart using images rather than text word in search bar. Similar images along with meta text search wold make you shopping site awesome and your customer will love it.
Have go hero. I will host a similar service in AWS soon for a demo.

Finding Similar Images with nearest neighbor model
nearest_neighbors_model = graphlab.nearest_neighbors.create(image_train, features=['deep_features'],label='id')

def get_nearest_neighbors(image):
ans = nearest_neighbors_model.query(image)
return image_train.filter_by(ans['reference_label'],'id')

Similar Images: Cat Photo
cat = image_train[18:19]
graphlab.image_analysis.resize(cat['image'],128,128,3).show()

Picture size is not good(small pixel)

Finding similar images of cats
graphlab.image_analysis.resize(get_nearest_neighbors(cat)['image'],128,128,3).show()

Hope you enjoy this tutorial :)

Wednesday, 4 November 2015

Machine Learning Case Study- Restaurant review Analysis

In this tutorial you will learn how to classify textual data using GraphLab. We will be using a restaurant review data set for this purpose.In this tuorial along with classification you will also learn some basic feature engineering(bag of words, tf-idf) stuff that are essential part of any text analytics machine learning project.

All codes available at github.

Let's have a quick view of the Data set

I will be using iPython notebook for this exercise.
Note: You can use linux command from ipython notebook using a ! before the command.

!head -n 2 /home/kuntal/data/yelp/yelp_training_set_review.json

SFrame (Scalable Dataframe) -Powerful unstructured data processing library.

reviews = gl.SFrame.read_csv('/home/kuntal/data/yelp/yelp_training_set_review.json', header=False)
reviews[0]

Lets unpack to extract structure the column "X1".
reviews=reviews.unpack('X1','')
reviews.head(4)

Votes are still crammed in a dictionary. Let's unpack it too.
reviews = reviews.unpack('votes', '')

Data Visualization
reviews.show()

Feature Engineering

Represent datetime (Date formatting)
reviews['date'] = reviews['date'].str_to_datetime(str_format='%Y-%m-%d')

Munging votes and adding a new column
reviews['total_votes'] = reviews['funny'] + reviews['cool'] + reviews['useful']

Filter rows to remove reviews with no votes
reviews = reviews[reviews['total_votes'] > 0]

Classification task

Predict which reviews will be voted "funny," based on review text.First, the labels. Reviews with at least one vote for "funny" is funny.

reviews['funny'] = reviews['funny'] > 0
reviews = reviews[['text','funny']]

Creating bag-of-words representation of text
word_delims = ["\r", "\v", "\n", "\f", "\t", " ",
'~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=',
',', '.', ';', ':', '\"', '?', '|', '\\', '/',
'<', '>', '(', ')', '[', ']', '{', '}']

reviews['bow'] = gl.text_analytics.count_words(reviews['text'], delimiters=word_delims)

Creating tf-idf representation of the bag of words
reviews['tf_idf'] = gl.text_analytics.tf_idf(reviews['bow'])
reviews['tf_idf'] = reviews['tf_idf'].apply(lambda x: x['docs'])

Creating a train-test split
train_sf, test_sf = reviews.random_split(0.8)
Note: It returns immediately because SFrame operations are lazily evaluated.

Training classifiers on bow and tf-idf
Dictionaries are automatically interpreted as sparse features.We will be using GraphLab's in built "Logistic regression" module to create our classification models using different feature.

# Model-1 with feature 'bow
m1 = gl.logistic_classifier.create(train_sf,
'funny',
features=['bow'],
validation_set=None,
feature_rescaling=False)

# Model-2 with feature tf-idf
m2 = gl.logistic_classifier.create(train_sf,
'funny',
features=['tf_idf'],
validation_set=None,
feature_rescaling=False)

Evaluating on validation set and comparing the models performance
m1_res = m1.evaluate(test_sf)
m2_res = m2.evaluate(test_sf)

Baseline accuracy (what if we classify everything as the majority class)
float(test_sf['funny'].sum())/test_sf.num_rows()

Output:
0.4800796812749004

Percentage of not funny reviews
1.0 - float(test_sf['funny'].sum())/test_sf.num_rows()

Output:
0.5199203187250996

Saturday, 31 October 2015

Machine Learning Case Study - Housing Price Prediction

In this tutorial we will be using supervised machine learning technique 'Linear Regression' to predict the housing price.
We will be using a very power and scalable machine learning framework 'GraphLab' to do this case study.

Note:
Graphlab is free for academic and personal use. For details about GraphLab Create,please visit.

Coding Time

Let's start your python editor,i'm using ipython notebook.

# Fire up graphlab create
import graphlab

Load some house sales data

Data set is from house sales in King County, the region where the city of Seattle, WA is located.This data set is taken from a machine learning course that i have done recently.

sales = graphlab.SFrame.read_csv('home_data.csv')

# Check some sample data
sales.head(3)

Exploring the data for housing sales

The house price is correlated with the number of square feet of living space.
Uncomment ,if you want to view the plot in same ipython notebook

#graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")

Creating a simple regression model of sqft_living to price.

Let's create a simple linear regression model with one feature(sqft_living). This is also called univariate model as it is using only feature/independent variable.

Split data into training and testing

We use seed=0 so that everyone following this tutorial gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).

train_data,test_data = sales.random_split(.8,seed=0)

Building the regression model using only sqft_living as a feature

sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'],validation_set=None)

Evaluating the simple model
Now its time to evaluate our simple model

print test_data['price'].mean()
print sqft_model.evaluate(test_data)

RMSE of about $255,170! Not good, can we create a better model??

Plotting with Matplotlib

Let's us check what our predictions look like,Matplotlib is a Python plotting library that is also useful for plotting.

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(test_data['sqft_living'],test_data['price'],'.',
test_data['sqft_living'],sqft_model.predict(test_data),'-')

Above: blue dots are original data, green line is the prediction from the simple regression.
We can also view the learned regression coefficients using following command:
sqft_model.get('coefficients')

Explore other features in the data

To build a more elaborate model, we will explore using more features.

my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

We can have some visualization for better understanding as well:

sales[my_features].show()
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')

** Pull the bar at the bottom to view more of the data.
98039 is the most expensive zip code.

Build a regression model with more features

my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features,validation_set=None)
print my_features

Comparing the results of the simple model with adding more features

print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)

The RMSE goes down from $255,170 to $179,508 with more features.

Applying the learned models to predict prices of a house

The first house we will use is considered an "average" house in Seattle.

house1 = sales[sales['id']==5309101200]

#Check the price of house
print house1['price']

Output:
620000

Now check how the sqft model and my_feature model predict

print sqft_model.predict(graphlab.SFrame(house1))
Output:
629584.819

print my_features_model.predict(house1)
Output:
730345.745

In this case, the model with more features provides a worse prediction than the simpler model with only 1 feature. However, on average, the model with more features might be better.

All codes available at github.

Thursday, 29 October 2015

Machine Learning case study - Predicting Income Level with R

In this real world data analytics case study tutorial, we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data.

The dataset used for the analysis is an extraction from the 1994 census data by Barry Becker and donated to the public site http://archive.ics.uci.edu/ml/datasets/Census+Income. This dataset is popularly called the “Adult” data set.

In this tutorial you will learn the following:

Description about the data
Acquiring and Reading the data
Exploring the independent variables
Building the prediction model
Validating the prediction model

Note: I will be using caret package to do model building and cross validation in this tutorial, you can use any other library of your choice. caret is very powerful and handy library while doing Machine Learning or Analytics task using R. For more info :

https://cran.r-project.org/web/packages/caret/vignettes/caret.pdf

Description about the data:

As mentioned earlier, the data set is from http://archive.ics.uci.edu/ml/datasets/Census+Income.

Specifically the predictor variables (also called independent variables features) from the Census data and the dependent variable which is the level of income (either “greater than USD 50000” or “less than USD 50000”).

Dependent Variable

The dependent variable is “incomelevel”, representing the level of income. A value of “<=50K” indicates “less than or equal to USD 50000” and “>50K” indicates “greater than USD 50000”.

Independent Variable

Below are the independent variables (features or predictors) from the Census Data.

Variable Name	Description	Type	Possible Values
Age	Age of the individual	Continuous	Numeric
Workclass	Class of Work	Categorical	Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt	Final Weight Determined by Census Org	Continuous	Numeric
Education	Education of the individual	Ordered Factor	Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
Education-num	Number of years of education	Continuous	Numeric
Marital-status	Marital status of the individual	Categorical	Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
Occupation	Occupation of the individual	Categorical	Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
Relationship	Present relationship	Categorical	Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race	Race of the individual	Categorical	White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex	Sex of the individual	Categorical	Female, Male
Capital-gain	Capital gain made by the individual	Continuous	Numeric
Capital-loss	Capital loss made by the individual	Continuous	Numeric
Hours-per-week	Average number of hours spent by the individual on work	Continuous	Numeric
Native-country	Native country of orgin	Categorical	United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Load and Read the Data

Training data and test data are both separately available at the UCI source .

As the training data file does not contain the variable names, the variable names are explicitly specified while reading the data set. While reading the data, extra spaces are stripped.

colNames = c ("age", "workclass", "fnlwgt", "education", 
              "educationnum", "maritalstatus", "occupation",
              "relationship", "race", "sex", "capitalgain",
              "capitalloss", "hoursperweek", "nativecountry",
              "incomelevel")

train = read.table ("adult.data", header = FALSE, sep = ",",
                       strip.white = TRUE, col.names = colNames,
                        na.strings = "?", stringsAsFactors = TRUE)

Dataset is read and stored as train data frame of 32561 rows and 15 columns. A high level summary of the data is below. All the variables have been read in their expected classes.

str (train)

## 'data.frame':    32561 obs. of  15 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass    : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
##  $ fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ educationnum : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
##  $ occupation   : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
##  $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
##  $ race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
##  $ incomelevel  : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

Cleaning the Data

The training data set is cleaned for missing or invalid data.

About 8% (2399/30162) of the dataset has NAs in them. It is observed that in most of the missing dataset, the ‘workclass’ variable and ‘occupation’ variable are missing data together. And the remaining have ‘nativecountry’ variable missing. We could, handle the missing values by imputing the data. However, since ‘workclass’, ‘occupation’ and ‘nativecountry’ could potentially be very good predictors of income, imputing may simply skew the model.

Also, since most of the missing data 2066/2399 (~86%) rows pertain to the “<=50K” incomelevel and the dataset is predominently of “<=50K” incomelevel, there will not be much information loss for predictive model building if we removed the NAs data set.

table (complete.cases (train))

## 
## FALSE  TRUE 
##  2399 30162

# Summarize all data sets with NAs only
summary  (train [!complete.cases(train),])

Data distribution

# Distribution of the income level factor in the entire training data set.
table (train$incomelevel)

## 
## <=50K  >50K 
## 24720  7841

Data sets with NAs are removed below:

cleanTrain = train [!is.na (train$workclass) & !is.na (train$occupation), ]
cleanTrain = cleanTrain [!is.na (cleanTrain$nativecountry), ]

The ‘fnlwgt’ final weight estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. This variable is removed from the training data set due to it’s diminished impact on income level.

cleanTrain$fnlwgt = NULL

The cleaned data set is now cleanTrain.

Exploring the Data

Each of the variable is explored for quirks, distribution, variance and predictability.

Explore the Continuous Variables

Since the model of choice here is Boosting, which is non-parametric (does not follow any statistical distribution), we will not be transforming variables to address skewness. We will however try to understand the data to determine each variable’s predictability.

Explore the Age variable

The Age variable has a wide range and variability. The distribution and mean are quite different for income level <=50K and >50K, implying that ‘age’ will be a good predictor of ‘incomelevel’.

summary (cleanTrain$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   28.00   37.00   38.44   47.00   90.00

boxplot (age ~ incomelevel, data = cleanTrain, 
         main = "Age distribution for different income levels",
         xlab = "Income Levels", ylab = "Age", col = "salmon")

incomeBelow50K = (cleanTrain$incomelevel == "<=50K")
xlimit = c (min (cleanTrain$age), max (cleanTrain$age))
ylimit = c (0, 1600)

hist1 = qplot (age, data = cleanTrain[incomeBelow50K,], margins = TRUE, 
           binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)

hist2 = qplot (age, data = cleanTrain[!incomeBelow50K,], margins = TRUE, 
           binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)

grid.arrange (hist1, hist2, nrow = 2)

Exploring the Years of Education Variable

The Years of Education variable has good variability. The statistics are quite different for income level <=50K and >50K, implying that ‘educationnum’ will be a good predictor of ‘incomelevel’.

summary (cleanTrain$educationnum)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    9.00   10.00   10.12   13.00   16.00

boxplot (educationnum ~ incomelevel, data = cleanTrain, 
         main = "Years of Education distribution for different income levels",
         xlab = "Income Levels", ylab = "Years of Education", col = "blue")

Explore the correlation between continuous variables

The below shows that there is no correlation between the continuous variables and that they are independent of each other.

corMat = cor (cleanTrain[, c("age", "educationnum", "capitalgain", "capitalloss", "hoursperweek")])


diag (corMat) = 0 #Remove self correlations
corMat

##                     age educationnum capitalgain capitalloss hoursperweek
## age          0.00000000   0.04352609  0.08015423  0.06016548   0.10159876
## educationnum 0.04352609   0.00000000  0.12441600  0.07964641   0.15252207
## capitalgain  0.08015423   0.12441600  0.00000000 -0.03222933   0.08043180
## capitalloss  0.06016548   0.07964641 -0.03222933  0.00000000   0.05241705
## hoursperweek 0.10159876   0.15252207  0.08043180  0.05241705   0.00000000

Explore Categorical Variables

Exploring the Sex variable

Mostly the sex variable is not a good predictor, and so is the case for the incomelevel prediction too. This variable will not be used for the model.

table (cleanTrain[,c("sex", "incomelevel")])

##         incomelevel
## sex      <=50K  >50K
##   Female  8670  1112
##   Male   13984  6396

Exploring the workclass & occupation variables

The variables workclass & occupation show good predictability of the incomelevel variable.

qplot (incomelevel, data = cleanTrain, fill = workclass) + facet_grid (. ~ workclass)

qplot (incomelevel, data = cleanTrain, fill = occupation) + facet_grid (. ~ occupation)

Exploring the education variables

The education variable however needs to be reordered and marked an ordinal variable (ordered factor variable). The new ordinal variable also shows good predictability of incomelevel.

# Modify the levels to be ordinal
cleanTrain$education = ordered (cleanTrain$education,
    levels (cleanTrain$education) [c(14, 4:7, 1:3, 12, 15, 8:9, 16, 10, 13, 11)])

print (levels (cleanTrain$education))

##  [1] "Preschool"    "1st-4th"      "5th-6th"      "7th-8th"     
##  [5] "9th"          "10th"         "11th"         "12th"        
##  [9] "HS-grad"      "Prof-school"  "Assoc-acdm"   "Assoc-voc"   
## [13] "Some-college" "Bachelors"    "Masters"      "Doctorate"

qplot (incomelevel, data = cleanTrain, fill = education) + facet_grid (. ~ education)

Building the Prediction Model

Finally, down to building the prediction model, we will be using all the independent variables except the Sex variable to build a model that predicts the income level of an individual to be greater than USD 50000 or less than USD 50000 using Census data.

Since, Census data is typically of weak predictors, Boosting algorithm is used for this classification modelling.

I have used Cross Validation (CV) where the training data is partitioned a specific number of times and separate boosted models are built on each. The resulting models are ensembled to arrive at final model. This helps avoid overfitting the model to the training data.

library(caret)

set.seed (32323)
trCtrl = trainControl (method = "cv", number = 10)

boostFit = train (incomelevel ~ age + workclass + education + educationnum +
                      maritalstatus + occupation + relationship +
                      race + capitalgain + capitalloss + hoursperweek +
                      nativecountry, trControl = trCtrl, 
                  method = "gbm", data = cleanTrain, verbose = FALSE)

The confusion matrix below shows an in-sample overall accuracy of ~86%, sensitivity of ~88% and specificity of ~79%.

This implies that 86% of times, the model has classified the income level correctly, 88% of the times, the income level being less than or equal to USD 50000 in classified correctly and 79% of the times, the income level being greater than USD 50000 is classified correctly.

confusionMatrix (cleanTrain$incomelevel, predict (boostFit, cCleanTrain))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K  >50K
##      <=50K 21415  1239
##      >50K   2900  4608
##                                           
##                Accuracy : 0.8628          
##                  95% CI : (0.8588, 0.8666)
##     No Information Rate : 0.8061          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6037          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8807          
##             Specificity : 0.7881          
##          Pos Pred Value : 0.9453          
##          Neg Pred Value : 0.6137          
##              Prevalence : 0.8061          
##          Detection Rate : 0.7100          
##    Detection Prevalence : 0.7511          
##       Balanced Accuracy : 0.8344          
##                                           
##        'Positive' Class : <=50K           
##

Validating the Prediction Model

The created prediction model is applied on the test data to validate the true performance. The test data is cleaned similar to the training data before applying the model.

## 
## FALSE 
## 15060

The cleaning is not shown to keep the tutorial concise. The cleaned test dataset has 15060 rows and 14 columns with no missing data.

The prediction model is applied on the test data. From the confusion matrix below the performance measures are out-of-sample overall accuracy of ~86%, sensitivity of ~88% and specificity of ~78%, which is quite similar to the in-sample performance.

cleanTest$predicted = predict (boostFit, cleanTest)
confusionMatrix (cleanTest$incomelevel, cleanTest$predicted)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K  >50K
##      <=50K 10731   629
##      >50K   1446  2254
##                                           
##                Accuracy : 0.8622          
##                  95% CI : (0.8566, 0.8677)
##     No Information Rate : 0.8086          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5984          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8813          
##             Specificity : 0.7818          
##          Pos Pred Value : 0.9446          
##          Neg Pred Value : 0.6092          
##              Prevalence : 0.8086          
##          Detection Rate : 0.7125          
##    Detection Prevalence : 0.7543          
##       Balanced Accuracy : 0.8315          
##                                           
##        'Positive' Class : <=50K           
##

Note:

During a data analytics work, it is very important to understand how the built model has performed with respect to a baseline model. This helps the analyst understand if there is really any value that the new model adds.

The baseline accuracy (here, accuracy of selection by random chance as there is no prior model) is 75% for income less than USD 50000 (sensitivity) and 25% for income more than USD 50000 (specificity) with an overall accuracy of 68% (Refer the skewed number of data sets for both the incomelevels in the cleaned test data).

The prediction model built using the boosting algorithm can predict a less than USD 50000 income level with 88% accuracy (sensitivity) and a more than USD 50000 income level with 78% accuracy (specificity) and a overall accuracy of 86%.

So the prediction model does perform better than the baseline model.

All code available at github.