In this tutorial we will use the supervised machine learning technique 'Linear Regression' to predict housing prices.
We will use 'GraphLab', a powerful and scalable machine learning framework, for this case study.
Note:
GraphLab is free for academic and personal use. For details about GraphLab Create, please visit.
Coding Time
Let's start your Python editor; I'm using IPython Notebook.
# Fire up GraphLab Create
import graphlab
Load some house sales data
The data set is from house sales in King County, the region where the city of Seattle, WA is located. It is taken from a machine learning course that I took recently.
sales = graphlab.SFrame.read_csv('home_data.csv')
# Check some sample data
sales.head(3)
Exploring the data for housing sales
The house price is correlated with the number of square feet of living space.
# Uncomment the next line if you want to view the plot inside the IPython notebook
#graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")
Creating a simple regression model of sqft_living to price.
Let's create a simple linear regression model with one feature (sqft_living). This is also called a univariate model, as it uses only one feature (independent variable).
Split data into training and testing
We use seed=0 so that everyone following this tutorial gets the same results. In practice, you may set a random seed (or let GraphLab Create pick a random seed for you).
train_data,test_data = sales.random_split(.8,seed=0)
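To see why a fixed seed makes the split reproducible, here is a minimal sketch in plain Python. The helper `random_split` below is hypothetical and only illustrates the idea; it is not GraphLab's implementation, and it assumes each row is assigned independently with probability `fraction`.

```python
import random

def random_split(rows, fraction, seed=None):
    """Hypothetical helper: send each row to the first part with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed gives the same sequence of draws every run
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the sales table
train_a, test_a = random_split(rows, 0.8, seed=0)
train_b, test_b = random_split(rows, 0.8, seed=0)

# Same seed, same split; the 80/20 proportion is only approximate
print(train_a == train_b, len(train_a), len(test_a))
```

Because the assignment is random per row, the training part holds roughly, not exactly, 80% of the rows.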
Building the regression model using only sqft_living as a feature
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'],validation_set=None)
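For intuition, a one-feature linear regression can be fitted with the ordinary least squares closed form. The sketch below uses plain Python and made-up toy numbers (not the King County data); `fit_simple_regression` is a hypothetical helper, and GraphLab's own solver and default settings may differ from this textbook formula.

```python
def fit_simple_regression(x, y):
    """Hypothetical helper: least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data lying exactly on price = 50000 + 200 * sqft (illustrative values only)
sqft = [1000, 1500, 2000, 2500]
price = [250000, 350000, 450000, 550000]

intercept, slope = fit_simple_regression(sqft, price)
print(intercept, slope)
```

The slope plays the role of the sqft_living coefficient: the predicted change in price for one additional square foot.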
Evaluating the simple model
Now it's time to evaluate our simple model.
print test_data['price'].mean()
print sqft_model.evaluate(test_data)
The RMSE is about $255,170. That is not good; can we create a better model?
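RMSE (root mean squared error) is the square root of the average squared difference between actual and predicted prices. A minimal sketch in plain Python, using made-up prices rather than the tutorial's data (the `rmse` helper is hypothetical):

```python
import math

def rmse(actual, predicted):
    """Hypothetical helper: root mean squared error of two equal-length lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only
actual = [300000, 500000, 700000]
predicted = [310000, 480000, 720000]

error = rmse(actual, predicted)
print(error)
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.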
Plotting with Matplotlib
Let's check what our predictions look like. Matplotlib is a Python plotting library that we can use for this.
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')
Above: blue dots are original data, green line is the prediction from the simple regression.
We can also view the learned regression coefficients using the following command:
sqft_model.get('coefficients')
Explore other features in the data
To build a more elaborate model, we will explore using more features.
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
We can also visualize these features for a better understanding:
sales[my_features].show()
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')
Note: Pull the bar at the bottom to view more of the data.
98039 is the most expensive zip code.
Build a regression model with more features
my_features_model = graphlab.linear_regression.create(train_data,target='price',features=my_features,validation_set=None)
print my_features
Comparing the results of the simple model with the model using more features
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
The RMSE goes down from $255,170 to $179,508 with more features.
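To put the improvement in perspective, the two RMSE values quoted above can be compared directly. This is plain arithmetic on the reported numbers, not a new result:

```python
rmse_simple = 255170.0  # sqft_model, as reported above
rmse_more = 179508.0    # my_features_model, as reported above

reduction = rmse_simple - rmse_more
percent = 100 * reduction / rmse_simple

print("RMSE reduction: $%.0f (%.1f%%)" % (reduction, percent))
```

The error drops by $75,662, which is about 29.7% of the simple model's RMSE.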
Applying the learned models to predict prices of a house
The first house we will use is considered an "average" house in Seattle.
house1 = sales[sales['id']==5309101200]
# Check the price of the house
print house1['price']
Output:
620000
Now check what the sqft model and the my_features model predict:
print sqft_model.predict(graphlab.SFrame(house1))
Output:
629584.819
print my_features_model.predict(house1)
Output:
730345.745
In this case, the model with more features gives a worse prediction than the simpler model with only one feature. On average, however, the model with more features does better, as its lower RMSE on the test data shows.
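The gap for this one house can be quantified from the outputs above. A small sketch in plain Python that only reuses the printed numbers (the dictionary keys are just labels for the two models):

```python
actual_price = 620000.0
predictions = {
    "sqft_model": 629584.819,
    "my_features_model": 730345.745,
}

# Absolute error of each model's prediction for this single house
abs_errors = {name: abs(pred - actual_price) for name, pred in predictions.items()}
closest = min(abs_errors, key=abs_errors.get)

for name, err in abs_errors.items():
    print("%s is off by $%.2f" % (name, err))
print("Closest prediction:", closest)
```

The simple model misses by about $9,585 and the larger model by about $110,346. A single house says little about overall quality, which is why the comparison on the full test set matters more.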