In this real world data analytics case study tutorial, we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data.
The dataset used for the analysis is an extraction from the 1994 census data by Barry Becker and donated to the public site
http://archive.ics.uci.edu/ml/datasets/Census+Income. This dataset is popularly called the
“Adult” data set.
In this tutorial you will learn the following:
- Description about the data
- Acquiring and Reading the data
- Exploring the independent variables
- Building the prediction model
- Validating the prediction model
Note: I will be using caret package to do model building and cross validation in this tutorial, you can use any other library of your choice. caret is very powerful and handy library while doing Machine Learning or Analytics task using R. For more info :
Description about the data:
Specifically the predictor variables (also called independent variables features) from the Census data and the dependent variable which is the level of income (either “greater than USD 50000” or “less than USD 50000”).
Dependent Variable
The dependent variable is “incomelevel”, representing the level of income. A value of “<=50K” indicates “less than or equal to USD 50000” and “>50K” indicates “greater than USD 50000”.
Independent Variable
Below are the independent variables (features or predictors) from the Census Data.
Variable Name
|
Description
|
Type
|
Possible Values
|
Age
|
Age of the individual
|
Continuous
|
Numeric
|
Workclass
|
Class of Work
|
Categorical
|
Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
|
fnlwgt
|
Final Weight Determined by Census Org
|
Continuous
|
Numeric
|
Education
|
Education of the individual
|
Ordered Factor
|
Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
|
Education-num
|
Number of years of education
|
Continuous
|
Numeric
|
Marital-status
|
Marital status of the individual
|
Categorical
|
Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
|
Occupation
|
Occupation of the individual
|
Categorical
|
Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
|
Relationship
|
Present relationship
|
Categorical
|
Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
|
Race
|
Race of the individual
|
Categorical
|
White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
|
Sex
|
Sex of the individual
|
Categorical
|
Female, Male
|
Capital-gain
|
Capital gain made by the individual
|
Continuous
|
Numeric
|
Capital-loss
|
Capital loss made by the individual
|
Continuous
|
Numeric
|
Hours-per-week
|
Average number of hours spent by the individual on work
|
Continuous
|
Numeric
|
Native-country
|
Native country of orgin
|
Categorical
|
United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
|
Load and Read the Data
Training data and test data are both separately available at the UCI source .
As the training data file does not contain the variable names, the variable names are explicitly specified while reading the data set. While reading the data, extra spaces are stripped.
colNames = c ("age", "workclass", "fnlwgt", "education",
"educationnum", "maritalstatus", "occupation",
"relationship", "race", "sex", "capitalgain",
"capitalloss", "hoursperweek", "nativecountry",
"incomelevel")
train = read.table ("adult.data", header = FALSE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
Dataset is read and stored as train data frame of 32561 rows and 15 columns. A high level summary of the data is below. All the variables have been read in their expected classes.
str (train)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
## $ maritalstatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
## $ occupation : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 4 10 4 ...
## $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
## $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
## $ nativecountry: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 5 39 23 39 39 39 ...
## $ incomelevel : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
Cleaning the Data
The training data set is cleaned for missing or invalid data.
About 8% (2399/30162) of the dataset has NAs in them. It is observed that in most of the missing dataset, the ‘workclass’ variable and ‘occupation’ variable are missing data together. And the remaining have ‘nativecountry’ variable missing. We could, handle the missing values by imputing the data. However, since ‘workclass’, ‘occupation’ and ‘nativecountry’ could potentially be very good predictors of income, imputing may simply skew the model.
Also, since most of the missing data 2066/2399 (~86%) rows pertain to the “<=50K” incomelevel and the dataset is predominently of “<=50K” incomelevel, there will not be much information loss for predictive model building if we removed the NAs data set.
table (complete.cases (train))
##
## FALSE TRUE
## 2399 30162
summary (train [!complete.cases(train),])
Data distribution
table (train$incomelevel)
##
## <=50K >50K
## 24720 7841
Data sets with NAs are removed below:
cleanTrain = train [!is.na (train$workclass) & !is.na (train$occupation), ]
cleanTrain = cleanTrain [!is.na (cleanTrain$nativecountry), ]
The ‘fnlwgt’ final weight estimate refers to population totals derived from CPS by creating “weighted tallies” of any specified socio-economic characteristics of the population. This variable is removed from the training data set due to it’s diminished impact on income level.
cleanTrain$fnlwgt = NULL
The cleaned data set is now cleanTrain.
Exploring the Data
Each of the variable is explored for quirks, distribution, variance and predictability.
Explore the Continuous Variables
Since the model of choice here is Boosting, which is non-parametric (does not follow any statistical distribution), we will not be transforming variables to address skewness. We will however try to understand the data to determine each variable’s predictability.
Explore the Age variable
The Age variable has a wide range and variability. The distribution and mean are quite different for income level <=50K and >50K, implying that ‘age’ will be a good predictor of ‘incomelevel’.
summary (cleanTrain$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 28.00 37.00 38.44 47.00 90.00
boxplot (age ~ incomelevel, data = cleanTrain,
main = "Age distribution for different income levels",
xlab = "Income Levels", ylab = "Age", col = "salmon")
incomeBelow50K = (cleanTrain$incomelevel == "<=50K")
xlimit = c (min (cleanTrain$age), max (cleanTrain$age))
ylimit = c (0, 1600)
hist1 = qplot (age, data = cleanTrain[incomeBelow50K,], margins = TRUE,
binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)
hist2 = qplot (age, data = cleanTrain[!incomeBelow50K,], margins = TRUE,
binwidth = 2, xlim = xlimit, ylim = ylimit, colour = incomelevel)
grid.arrange (hist1, hist2, nrow = 2)
Exploring the Years of Education Variable
The Years of Education variable has good variability. The statistics are quite different for income level <=50K and >50K, implying that ‘educationnum’ will be a good predictor of ‘incomelevel’.
summary (cleanTrain$educationnum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 9.00 10.00 10.12 13.00 16.00
boxplot (educationnum ~ incomelevel, data = cleanTrain,
main = "Years of Education distribution for different income levels",
xlab = "Income Levels", ylab = "Years of Education", col = "blue")
Explore the correlation between continuous variables
The below shows that there is no correlation between the continuous variables and that they are independent of each other.
corMat = cor (cleanTrain[, c("age", "educationnum", "capitalgain", "capitalloss", "hoursperweek")])
diag (corMat) = 0
corMat
## age educationnum capitalgain capitalloss hoursperweek
## age 0.00000000 0.04352609 0.08015423 0.06016548 0.10159876
## educationnum 0.04352609 0.00000000 0.12441600 0.07964641 0.15252207
## capitalgain 0.08015423 0.12441600 0.00000000 -0.03222933 0.08043180
## capitalloss 0.06016548 0.07964641 -0.03222933 0.00000000 0.05241705
## hoursperweek 0.10159876 0.15252207 0.08043180 0.05241705 0.00000000
Explore Categorical Variables
Exploring the Sex variable
Mostly the sex variable is not a good predictor, and so is the case for the incomelevel prediction too. This variable will not be used for the model.
table (cleanTrain[,c("sex", "incomelevel")])
## incomelevel
## sex <=50K >50K
## Female 8670 1112
## Male 13984 6396
Exploring the workclass & occupation variables
The variables workclass & occupation show good predictability of the incomelevel variable.
qplot (incomelevel, data = cleanTrain, fill = workclass) + facet_grid (. ~ workclass)
qplot (incomelevel, data = cleanTrain, fill = occupation) + facet_grid (. ~ occupation)
Exploring the education variables
The education variable however needs to be reordered and marked an ordinal variable (ordered factor variable). The new ordinal variable also shows good predictability of incomelevel.
cleanTrain$education = ordered (cleanTrain$education,
levels (cleanTrain$education) [c(14, 4:7, 1:3, 12, 15, 8:9, 16, 10, 13, 11)])
print (levels (cleanTrain$education))
## [1] "Preschool" "1st-4th" "5th-6th" "7th-8th"
## [5] "9th" "10th" "11th" "12th"
## [9] "HS-grad" "Prof-school" "Assoc-acdm" "Assoc-voc"
## [13] "Some-college" "Bachelors" "Masters" "Doctorate"
qplot (incomelevel, data = cleanTrain, fill = education) + facet_grid (. ~ education)
Building the Prediction Model
Finally, down to building the prediction model, we will be using all the independent variables except the Sex variable to build a model that predicts the income level of an individual to be greater than USD 50000 or less than USD 50000 using Census data.
Since, Census data is typically of weak predictors, Boosting algorithm is used for this classification modelling.
I have used Cross Validation (CV) where the training data is partitioned a specific number of times and separate boosted models are built on each. The resulting models are ensembled to arrive at final model. This helps avoid overfitting the model to the training data.
library(caret)
set.seed (32323)
trCtrl = trainControl (method = "cv", number = 10)
boostFit = train (incomelevel ~ age + workclass + education + educationnum +
maritalstatus + occupation + relationship +
race + capitalgain + capitalloss + hoursperweek +
nativecountry, trControl = trCtrl,
method = "gbm", data = cleanTrain, verbose = FALSE)
The confusion matrix below shows an in-sample overall accuracy of ~86%, sensitivity of ~88% and specificity of ~79%.
This implies that 86% of times, the model has classified the income level correctly, 88% of the times, the income level being less than or equal to USD 50000 in classified correctly and 79% of the times, the income level being greater than USD 50000 is classified correctly.
confusionMatrix (cleanTrain$incomelevel, predict (boostFit, cCleanTrain))
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 21415 1239
## >50K 2900 4608
##
## Accuracy : 0.8628
## 95% CI : (0.8588, 0.8666)
## No Information Rate : 0.8061
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6037
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8807
## Specificity : 0.7881
## Pos Pred Value : 0.9453
## Neg Pred Value : 0.6137
## Prevalence : 0.8061
## Detection Rate : 0.7100
## Detection Prevalence : 0.7511
## Balanced Accuracy : 0.8344
##
## 'Positive' Class : <=50K
##
Validating the Prediction Model
The created prediction model is applied on the test data to validate the true performance. The test data is cleaned similar to the training data before applying the model.
##
## FALSE
## 15060
The cleaning is not shown to keep the tutorial concise. The cleaned test dataset has 15060 rows and 14 columns with no missing data.
The prediction model is applied on the test data. From the confusion matrix below the performance measures are out-of-sample overall accuracy of ~86%, sensitivity of ~88% and specificity of ~78%, which is quite similar to the in-sample performance.
cleanTest$predicted = predict (boostFit, cleanTest)
confusionMatrix (cleanTest$incomelevel, cleanTest$predicted)
## Confusion Matrix and Statistics
##
## Reference
## Prediction <=50K >50K
## <=50K 10731 629
## >50K 1446 2254
##
## Accuracy : 0.8622
## 95% CI : (0.8566, 0.8677)
## No Information Rate : 0.8086
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5984
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8813
## Specificity : 0.7818
## Pos Pred Value : 0.9446
## Neg Pred Value : 0.6092
## Prevalence : 0.8086
## Detection Rate : 0.7125
## Detection Prevalence : 0.7543
## Balanced Accuracy : 0.8315
##
## 'Positive' Class : <=50K
##
Note:
During a data analytics work, it is very important to understand how the built model has performed with respect to a baseline model. This helps the analyst understand if there is really any value that the new model adds.
The baseline accuracy (here, accuracy of selection by random chance as there is no prior model) is 75% for income less than USD 50000 (sensitivity) and 25% for income more than USD 50000 (specificity) with an overall accuracy of 68% (Refer the skewed number of data sets for both the incomelevels in the cleaned test data).
The prediction model built using the boosting algorithm can predict a less than USD 50000 income level with 88% accuracy (sensitivity) and a more than USD 50000 income level with 78% accuracy (specificity) and a overall accuracy of 86%.
So the prediction model does perform better than the baseline model.
All code available at github.