In this tutorial I will show how to perform model validation, parameter tuning, and test-set evaluation. This is one of the most important aspects of a machine learning pipeline: it is where you check how well your model generalizes to unseen data. All of the code is available on GitHub.
A recap on Scikit-learn's estimator interface
Scikit-learn strives to have a uniform interface across all methods, and we’ll see examples of these below. Given a scikit-learn estimator object named model, the following methods are available:
Available in all Estimators
model.fit() : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
Available in supervised estimators
model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
model.predict_proba() : for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
model.score() : for classification or regression problems, most estimators implement a score method. Scores are typically between 0 and 1, with a larger score indicating a better fit.
Available in unsupervised estimators
model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
model.fit_transform() : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
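To make the recap concrete, here is a minimal sketch of that uniform interface in action. The choice of logistic regression on the iris data is mine, purely for illustration; any classifier exposes the same methods:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
iris = datasets.load_iris()
X, y = iris.data, iris.target
model = LogisticRegression(max_iter=1000)
model.fit(X, y)                     # fit training data
print(model.predict(X[:3]))         # predicted labels for three observations
print(model.predict_proba(X[:3]))   # per-class probabilities for the same rows
print(model.score(X, y))            # mean accuracy on the given data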
Measuring Performance
An important piece of machine learning is model validation: that is, determining how well your model will generalize from the training data to future unlabeled data. Let's look at an example using the nearest neighbor classifier. This is a very simple classifier: it simply stores all training data, and for any unknown quantity, returns the label of the closest training point.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
y_pred = clf.predict(X)
print(np.all(y == y_pred))
from sklearn import metrics
print(metrics.confusion_matrix(y, y_pred))
print('\n---------------Classification Report------------------\n')
print(metrics.classification_report(y, y_pred))
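The perfect score above is misleading: we evaluated the classifier on the very points it memorized. As a sanity check, here is a minimal sketch that holds out a test set instead (the default 25% split is an arbitrary choice on my part):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on data the model has never seen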
Supervised Learning In-Depth: SVMs and Random Forests
There are many machine learning algorithms available; here we'll go into brief detail on two of the most common and interesting ones: Support Vector Machines (SVMs) and Random Forests.
Support Vector Machine [SVM] :
#First we need to create a dataset:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50);
from sklearn.svm import SVC # "Support Vector Classifier"
clf = SVC(kernel='linear')
clf.fit(X, y)
#The above version uses a linear kernel; it is also possible to use radial basis function kernels as well as others.
clf = SVC(kernel='rbf')
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none', edgecolors='black');
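As a quick, rough comparison of the two kernels (my addition, and training accuracy only, so treat it as a sketch rather than a benchmark):
for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel).fit(X, y)
    # training accuracy and number of support vectors for each kernel
    print(kernel, clf.score(X, y), len(clf.support_vectors_))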
Random Forests :
Random forests are an example of an ensemble learner built on decision trees. For this reason we'll first discuss decision trees themselves:
Decision Trees :
Here we'll explore a class of algorithms based on Decision trees. Decision trees at their root are extremely intuitive. They encode a series of binary choices in a process that parallels how a person might classify things themselves, but using an information criterion to decide which question is most fruitful at each step.
One problem with decision trees is that they can end up over-fitting the data. They are such flexible models that, given a large depth, they can quickly memorize the inputs, which doesn't generalize well to previously unseen data. One way to get around this is to use many slightly different decision trees in concert, as shown below. This approach is known as a Random Forest, and it is one of the more common techniques of ensemble learning (i.e. combining the results from several estimators).
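Before moving to forests, here is a minimal sketch of a single decision tree on the blobs data from above; the max_depth value is an arbitrary choice of mine to show the knob that controls over-fitting:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y)
print(tree.score(X, y))  # training accuracy; a deep enough tree can hit 1.0 by memorizing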
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn import metrics
#Loading data
digits = load_digits()
digits.keys()
X = digits.data
y = digits.target
# Split the dataset
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
print('-------------Support Vector Machine-------------')
# SVM results
from sklearn.svm import SVC
for kernel in ['rbf', 'linear']:
    clf = SVC(kernel=kernel).fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    print('\n')
    print("SVC: kernel = {0}".format(kernel))
    print(metrics.f1_score(ytest, ypred, average='weighted'))
    plt.figure()
    plt.imshow(metrics.confusion_matrix(ypred, ytest),
               interpolation='nearest', cmap=plt.cm.binary)
    plt.colorbar()
    plt.xlabel("true label")
    plt.ylabel("predicted label")
    plt.title("SVC: kernel = {0}".format(kernel))
print('\n\n-------------Random Forest-------------')
# random forest results
from sklearn.ensemble import RandomForestClassifier
for max_depth in [3, 5, 10]:
    clf = RandomForestClassifier(max_depth=max_depth).fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    print('\n')
    print("RF: max_depth = {0}".format(max_depth))
    print(metrics.f1_score(ytest, ypred, average='weighted'))
    plt.figure()
    plt.imshow(metrics.confusion_matrix(ypred, ytest),
               interpolation='nearest', cmap=plt.cm.binary)
    plt.colorbar()
    plt.xlabel("true label")
    plt.ylabel("predicted label")
    plt.title("RF: max_depth = {0}".format(max_depth))
Output:
-------------Support Vector Machine-------------
SVC: kernel = rbf
0.541483398619
SVC: kernel = linear
0.97112374636
-------------Random Forest-------------
RF: max_depth = 3
0.791069561537
RF: max_depth = 5
0.87446124846
RF: max_depth = 10
0.939770211401
Validation and Tuning
Exploring Validation Metrics
How can you evaluate the performance of a model? The simplest way might be to count the number of matches and mismatches. But this is not always sufficient.
The Problem with Simple Validation
The problem here is that we might not care how well we can classify the background, but might instead be concerned with successfully pulling out an uncontaminated set of foreground sources. We can get at this by computing statistics such as the precision, the recall, and the F1 score:
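Before the real example, here is a hand-rolled sketch of what these metrics mean in terms of confusion-matrix counts (the toy labels are mine): precision asks "of the points we flagged, how many are real?", recall asks "of the real points, how many did we flag?", and F1 is their harmonic mean:
import numpy as np
from sklearn import metrics
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # 1 = foreground, 0 = background
y_hat = np.array([0, 0, 1, 1, 1, 0, 1, 0])
tp = np.sum((y_hat == 1) & (y_true == 1))  # true positives
fp = np.sum((y_hat == 1) & (y_true == 0))  # false positives
fn = np.sum((y_hat == 0) & (y_true == 1))  # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
# the same numbers from scikit-learn's implementations
print(metrics.precision_score(y_true, y_hat),
      metrics.recall_score(y_true, y_hat),
      metrics.f1_score(y_true, y_hat))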
# Generate an un-balanced 2D dataset
np.random.seed(0)
X = np.vstack([np.random.normal(0, 1, (950, 2)),
               np.random.normal(-1.8, 0.8, (50, 2))])
y = np.hstack([np.zeros(950), np.ones(50)])
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='none', cmap=plt.cm.Accent);
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print "accuracy:", metrics.accuracy_score(y_test, y_pred)
print "precision:", metrics.precision_score(y_test, y_pred)
print "recall:", metrics.recall_score(y_test, y_pred)
print "f1 score:", metrics.f1_score(y_test, y_pred)
print '\n --------------Classification Report-----------------\n'
print metrics.classification_report(y_test, y_pred, target_names=['background', 'foreground'])
Output:
--------------Classification Report-----------------
             precision    recall  f1-score   support
 background       0.97      0.99      0.98       234
 foreground       0.83      0.62      0.71        16
avg / total       0.97      0.97      0.97       250
Cross-Validation
Using the simple train/test split as above can be useful, but there is a disadvantage: your fit is ignoring a portion of your dataset. One way to address this is to use cross-validation. scikit-learn has a K-fold cross-validation scheme built-in:
from sklearn.model_selection import cross_val_score
# Let's do a 2-fold cross-validation of the SVC estimator
print(cross_val_score(SVC(), X, y, cv=2, scoring='precision'))
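Two folds is the bare minimum; in practice, something like five folds plus the mean and spread of the scores gives a steadier estimate. The fold count below is a conventional choice of mine, not from the original run:
scores = cross_val_score(SVC(), X, y, cv=5, scoring='precision')
print(scores.mean(), scores.std())  # average precision across folds, with its spread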
Grid Search
Scikit-learn has a grid search tool built-in, which is used as follows. Note that GridSearchCV has a fit method: it is a meta-estimator, an estimator over estimators!
from sklearn.model_selection import GridSearchCV
clf = SVC()
Crange = np.logspace(-2, 2, 40)
grid = GridSearchCV(clf, param_grid={'C': Crange},
                    scoring='precision', cv=5)
grid.fit(X, y)
print "best parameter choice:", grid.best_params_
scores = [g[1] for g in grid.grid_scores_]
plt.semilogx(Crange, scores);
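One caveat worth adding (my note, not part of the original post): the grid above was tuned on all of the data, so its best score is optimistically biased. A cleaner sketch tunes on a training split only and then scores the winning model on untouched test data:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
grid = GridSearchCV(SVC(), param_grid={'C': Crange},
                    scoring='precision', cv=5)
grid.fit(X_tr, y_tr)  # tuning only ever sees the training split
print(grid.best_params_)
print(grid.best_estimator_.score(X_te, y_te))  # honest held-out accuracy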
One popular and effective way to address over-fitting is to use ensemble methods. sklearn.ensemble.RandomForestRegressor uses multiple randomized decision trees and averages their results. The ensemble of estimators can often do better than any individual estimator for data that is over-fit. Let's repeat the grid-search experiment using the Random Forest regressor with 10 trees (note that the make_data helper used here is defined in the learning-curves section below).
What is the best max_depth for this model? Does the accuracy improve?
from sklearn.ensemble import RandomForestRegressor
X, y = make_data(500, error=1)
clf = RandomForestRegressor(n_estimators=10)
max_depth = np.arange(1, 10)
grid = GridSearchCV(clf, param_grid={'max_depth': max_depth},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(X, y)
scores = grid.cv_results_['mean_test_score']
plt.plot(max_depth, scores);
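To answer the question directly, we can read the winning depth and its cross-validated error off the fitted grid; a small sketch (the negation undoes the neg_mean_squared_error sign convention):
print(grid.best_params_)                           # best max_depth found
print(-grid.cv_results_['mean_test_score'].max())  # its mean CV squared error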
Bias, Variance, and Learning Curves
The exact turning point of the tradeoff between bias and variance is highly dependent on the number of training points used. Here I will illustrate the use of learning curves, which display this property. The idea is to plot the mean squared error for the training and test sets as a function of the number of training points.
def test_func(x, err=0.5):
    y = 10 - 1. / (x + 0.1)
    if err > 0:
        y = np.random.normal(y, err)
    return y
def make_data(N=40, error=1.0, random_seed=1):
    # randomly sample the data
    np.random.seed(random_seed)
    X = np.random.random(N)[:, np.newaxis]
    y = test_func(X.ravel(), error)
    return X, y
X, y = make_data(40, error=1)
plt.scatter(X.ravel(), y);
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
Linear Regression - MSE
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_test = model.predict(X_test)
plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print "mean squared error:", metrics.mean_squared_error(model.predict(X), y)
Polynomial Regression
class PolynomialRegression(LinearRegression):
    """Simple polynomial regression for 1D data"""
    def __init__(self, degree=1, **kwargs):
        self.degree = degree
        LinearRegression.__init__(self, **kwargs)

    def fit(self, X, y):
        if X.shape[1] != 1:
            raise ValueError("Only 1D data valid here")
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.fit(self, Xp, y)

    def predict(self, X):
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.predict(self, Xp)
model = PolynomialRegression(degree=2)
model.fit(X, y)
y_test = model.predict(X_test)
plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print "mean squared error:", metrics.mean_squared_error(model.predict(X), y)
#Try with degree 3
X, y = make_data(200, error=1.0)
degree = 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
N_range = np.linspace(15, X_train.shape[0], 20).astype(int)
def plot_learning_curve(degree=3):
    training_error = []
    test_error = []
    mse = metrics.mean_squared_error
    for N in N_range:
        XN = X_train[:N]
        yN = y_train[:N]
        model = PolynomialRegression(degree).fit(XN, yN)
        training_error.append(mse(model.predict(XN), yN))
        test_error.append(mse(model.predict(X_test), y_test))
    plt.plot(N_range, training_error, label='training')
    plt.plot(N_range, test_error, label='test')
    plt.plot(N_range, np.ones_like(N_range), ':k')
    plt.legend()
    plt.title('degree = {0}'.format(degree))
    plt.xlabel('num. training points')
    plt.ylabel('MSE')
plot_learning_curve(3)
This shows a typical learning curve: for very few training points, there is a large separation between the training and test error, which indicates over-fitting. For a large number of training points, the training and testing errors converge; if they converge to a high error, that indicates under-fitting with the given model. Let's try with degree 2.
plot_learning_curve(2)
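For completeness (my addition, not in the original post), scikit-learn ships a learning_curve helper that automates the loop above using cross-validation; a minimal sketch with the polynomial model:
from sklearn.model_selection import learning_curve
sizes, train_scores, test_scores = learning_curve(
    PolynomialRegression(degree=2), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='neg_mean_squared_error')
plt.plot(sizes, -train_scores.mean(axis=1), label='training')  # undo the sign convention
plt.plot(sizes, -test_scores.mean(axis=1), label='test')
plt.legend()
plt.xlabel('num. training points')
plt.ylabel('MSE')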