Big Data Analytics and Machine Learning: Machine Learning Case Study

In this tutorial we will use a supervised machine learning technique, linear regression, to predict house prices.
We will use GraphLab Create, a powerful and scalable machine learning framework, for this case study.

Note:
GraphLab Create is free for academic and personal use. For details, please visit the GraphLab Create website.

Coding Time

Start your Python editor. I'm using an IPython notebook.

# Fire up graphlab create
import graphlab

Load some house sales data

The data set contains house sales in King County, the region where the city of Seattle, WA is located. It is taken from a machine learning course that I completed recently.

sales = graphlab.SFrame.read_csv('home_data.csv')

# Check some sample data
sales.head(3)

Exploring the data for housing sales

The house price is correlated with the number of square feet of living space, so we start with a scatter plot of the two.
Uncomment the first line below if you want to view the plot inside the same IPython notebook.

#graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")

Creating a simple regression model of sqft_living to price.

Let's create a simple linear regression model with one feature (sqft_living). This is also called a univariate model, because it uses only one feature (independent variable).
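
To make the idea of a one-feature model concrete, here is a minimal sketch of a least-squares fit in plain Python. It is only an illustration of the technique, not how GraphLab Create fits the model internally. The helper name fit_simple_regression and the toy numbers are made up for this example.

```python
def fit_simple_regression(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data (hypothetical): living area in square feet and price in dollars
toy_sqft = [1000, 1500, 2000, 2500]
toy_price = [200000, 300000, 400000, 500000]
intercept, slope = fit_simple_regression(toy_sqft, toy_price)
```

With this toy data the points lie exactly on a line, so the fit recovers a slope of 200 dollars per square foot and an intercept of 0.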

Split data into training and testing

We use seed=0 so that everyone following this tutorial gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).

train_data, test_data = sales.random_split(.8, seed=0)
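
If you are wondering why a fixed seed gives everyone the same split, the sketch below shows the idea in plain Python. It is an illustration only and I am not claiming this is how random_split works internally. The function name seeded_split is hypothetical.

```python
import random

def seeded_split(rows, fraction, seed):
    """Shuffle a copy of rows with a fixed seed, then cut it at `fraction`."""
    rng = random.Random(seed)      # private generator, global state untouched
    shuffled = list(rows)          # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    cut = int(round(fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train_a, test_a = seeded_split(rows, 0.8, seed=0)
train_b, test_b = seeded_split(rows, 0.8, seed=0)
# Same seed, same input: train_a equals train_b and test_a equals test_b
```

Because the generator is seeded, running the split twice gives identical training and test sets, which is what makes the results in a tutorial reproducible.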

Building the regression model using only sqft_living as a feature

sqft_model = graphlab.linear_regression.create(train_data, target='price',
                                               features=['sqft_living'],
                                               validation_set=None)

Evaluating the simple model
Now it's time to evaluate our simple model on the test data.

print test_data['price'].mean()
print sqft_model.evaluate(test_data)

The RMSE (root mean squared error) is about $255,170. That is not good. Can we create a better model?
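
For readers who have not met RMSE before, here is a small sketch of how it is computed: take the prediction errors, square them, average them, and take the square root. The function name rmse and the toy prices are made up for illustration. The number that evaluate() reports comes from GraphLab Create itself, not from this code.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of numbers."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Toy prices in dollars (hypothetical)
toy_actual = [300000, 500000, 400000]
toy_predicted = [310000, 480000, 400000]
toy_rmse = rmse(toy_actual, toy_predicted)
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.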

Plotting with Matplotlib

Let's check what our predictions look like. Matplotlib is a Python plotting library that we can use for this.

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')

Above: the blue dots are the original data, and the green line is the prediction from the simple regression.
We can also view the learned regression coefficients using the following command:
sqft_model.get('coefficients')
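
The coefficients of a one-feature model are just an intercept and a slope, and a prediction is the intercept plus the slope times the feature value. The sketch below shows that arithmetic. The two coefficient values are hypothetical and are NOT the ones learned by sqft_model. The function name predict_price is also made up.

```python
def predict_price(intercept, slope, sqft_living):
    """Prediction of a one-feature linear model: intercept + slope * x."""
    return intercept + slope * sqft_living

example_intercept = -40000.0   # hypothetical value, in dollars
example_slope = 280.0          # hypothetical value, in dollars per square foot
predicted = predict_price(example_intercept, example_slope, 2000)
```

With these made-up coefficients, a 2,000 square foot house would be predicted at $520,000.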

Explore other features in the data

To build a more elaborate model, we will explore using more features.

my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

We can also visualize these features to understand them better:

sales[my_features].show()
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')

Note: Pull the bar at the bottom of the plot to view more of the data.
98039 is the most expensive zip code.

Build a regression model with more features

my_features_model = graphlab.linear_regression.create(train_data, target='price',
                                                      features=my_features,
                                                      validation_set=None)
print my_features

Comparing the results of the simple model with the model using more features

print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)

The RMSE goes down from $255,170 to $179,508 with more features.
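
To put that improvement in perspective, here is the arithmetic as a short sketch, using the two RMSE values reported above. The variable names are made up.

```python
rmse_simple = 255170.0    # RMSE of the sqft_living-only model, from above
rmse_more = 179508.0      # RMSE of the model with more features, from above

reduction = rmse_simple - rmse_more            # absolute drop in dollars
relative_reduction = reduction / rmse_simple   # drop as a fraction of the original
```

The drop is $75,662, which is a little under 30% of the simple model's RMSE.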

Applying the learned models to predict prices of a house

The first house we will use is considered an "average" house in Seattle.

house1 = sales[sales['id']==5309101200]

# Check the price of the house
print house1['price']

Output:
620000

Now check what the sqft model and the my_features model predict for this house:

print sqft_model.predict(graphlab.SFrame(house1))
Output:
629584.819

print my_features_model.predict(house1)
Output:
730345.745

In this case, the model with more features gives a worse prediction than the simpler model with only one feature. However, on average the model with more features is better, as its lower RMSE on the test data shows.
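
The gap between the two predictions for this house is easy to check by hand. The sketch below uses the actual price and the two predicted prices printed above. The variable names are made up.

```python
actual_price = 620000.0
sqft_prediction = 629584.819       # from sqft_model, printed above
features_prediction = 730345.745   # from my_features_model, printed above

# Absolute error of each prediction for this one house, in dollars
sqft_error = abs(sqft_prediction - actual_price)
features_error = abs(features_prediction - actual_price)
```

The simple model misses by about $9,585, while the model with more features misses by about $110,346. A single house can go either way, which is why we compare models on the whole test set rather than on one example.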

All the code is available on GitHub.


Saturday, 31 October 2015

Machine Learning Case Study - Housing Price Prediction