In this tutorial I will show how to perform model validation, parameter tuning, and test-set evaluation. This is one of the most important aspects of a machine learning pipeline: it is where you check how well your model generalizes to unseen data. All of the code is available on GitHub.
A recap on Scikit-learn's estimator interface
Scikit-learn strives to have a uniform interface across all methods, and we’ll see examples of these below. Given a scikit-learn estimator object named model, the following methods are available:
Available in all Estimators
model.fit() : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
Available in supervised estimators
model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
model.predict_proba() : for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
model.score() : for classification or regression problems, most estimators implement a score method. Scores are typically between 0 and 1, with a larger score indicating a better fit.
Available in unsupervised estimators
model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
model.fit_transform() : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
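To make the recap concrete, here is a minimal sketch of that uniform interface in action. The choice of logistic regression on the iris data is mine, purely for illustration; any classifier exposes the same methods:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
iris = datasets.load_iris()
X, y = iris.data, iris.target
model = LogisticRegression(max_iter=1000)
model.fit(X, y)                     # fit training data
print(model.predict(X[:3]))         # predicted labels for three observations
print(model.predict_proba(X[:3]))   # per-class probabilities for the same rows
print(model.score(X, y))            # mean accuracy on the given data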
Measuring Performance
An important piece of machine learning is model validation: that is, determining how well your model will generalize from the training data to future unlabeled data. Let's look at an example using the nearest neighbor classifier. This is a very simple classifier: it simply stores all training data, and for any unknown quantity, returns the label of the closest training point.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
import numpy as np
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)
y_pred = clf.predict(X)
print(np.all(y == y_pred))
from sklearn import metrics
print(metrics.confusion_matrix(y, y_pred))
print('\n---------------Classification Report------------------\n')
print(metrics.classification_report(y, y_pred))
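The perfect score above is misleading: we evaluated the classifier on the very points it memorized. As a sanity check, here is a minimal sketch that holds out a test set instead (the default 25% split is an arbitrary choice on my part):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on data the model has never seen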
Supervised Learning In-Depth: SVMs and Random Forests
There are many machine learning algorithms available; here we'll go into brief detail on two of the most common and interesting ones: Support Vector Machines (SVMs) and Random Forests.
Support Vector Machine [SVM] :
#First we need to create a dataset:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50);
from sklearn.svm import SVC # "Support Vector Classifier"
clf = SVC(kernel='linear')
clf.fit(X, y)
#The above version uses a linear kernel; it is also possible to use radial basis function kernels as well as others.
clf = SVC(kernel='rbf')
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1],
            s=200, facecolors='none', edgecolors='black');
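As a quick, rough comparison of the two kernels (my addition, and training accuracy only, so treat it as a sketch rather than a benchmark):
for kernel in ['linear', 'rbf']:
    clf = SVC(kernel=kernel).fit(X, y)
    # training accuracy and number of support vectors for each kernel
    print(kernel, clf.score(X, y), len(clf.support_vectors_))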
Random Forests :
Random forests are an example of an ensemble learner built on decision trees. For this reason we'll first discuss decision trees themselves:
Decision Trees :
Here we'll explore a class of algorithms based on Decision trees. Decision trees at their root are extremely intuitive. They encode a series of binary choices in a process that parallels how a person might classify things themselves, but using an information criterion to decide which question is most fruitful at each step.
One problem with decision trees is that they can end up over-fitting the data. They are such flexible models that, given a large depth, they can quickly memorize the inputs, which doesn't generalize well to previously unseen data. One way to get around this is to use many slightly different decision trees in concert, as shown below. This approach is known as a Random Forest, and it is one of the more common techniques of ensemble learning (i.e. combining the results from several estimators).
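Before moving to forests, here is a minimal sketch of a single decision tree on the blobs data from above; the max_depth value is an arbitrary choice of mine to show the knob that controls over-fitting:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X, y)
print(tree.score(X, y))  # training accuracy; a deep enough tree can hit 1.0 by memorizing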
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn import metrics
#Loading data
digits = load_digits()
digits.keys()
X = digits.data
y = digits.target
# Split the dataset
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
print('-------------Support Vector Machine-------------')
# SVM results
from sklearn.svm import SVC
for kernel in ['rbf', 'linear']:
    clf = SVC(kernel=kernel).fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    print('\n')
    print("SVC: kernel = {0}".format(kernel))
    print(metrics.f1_score(ytest, ypred, average='weighted'))
    plt.figure()
    plt.imshow(metrics.confusion_matrix(ypred, ytest),
               interpolation='nearest', cmap=plt.cm.binary)
    plt.colorbar()
    plt.xlabel("true label")
    plt.ylabel("predicted label")
    plt.title("SVC: kernel = {0}".format(kernel))
print('\n\n-------------Random Forest-------------')
# random forest results
from sklearn.ensemble import RandomForestClassifier
for max_depth in [3, 5, 10]:
    clf = RandomForestClassifier(max_depth=max_depth).fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    print('\n')
    print("RF: max_depth = {0}".format(max_depth))
    print(metrics.f1_score(ytest, ypred, average='weighted'))
    plt.figure()
    plt.imshow(metrics.confusion_matrix(ypred, ytest),
               interpolation='nearest', cmap=plt.cm.binary)
    plt.colorbar()
    plt.xlabel("true label")
    plt.ylabel("predicted label")
    plt.title("RF: max_depth = {0}".format(max_depth))
Output:
-------------Support Vector Machine-------------
SVC: kernel = rbf
0.541483398619
SVC: kernel = linear
0.97112374636
-------------Random Forest-------------
RF: max_depth = 3
0.791069561537
RF: max_depth = 5
0.87446124846
RF: max_depth = 10
0.939770211401
Validation and Tuning
Exploring Validation Metrics
How can you evaluate the performance of a model? The simplest way might be to count the number of matches and mismatches. But this is not always sufficient.
The Problem with Simple Validation
The problem here is that we might not care how well we can classify the background, but might instead be concerned with successfully pulling out an uncontaminated set of foreground sources. We can get at this by computing statistics such as the precision, the recall, and the F1 score:
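Before the real example, here is a hand-rolled sketch of what these metrics mean in terms of confusion-matrix counts (the toy labels are mine): precision asks "of the points we flagged, how many are real?", recall asks "of the real points, how many did we flag?", and F1 is their harmonic mean:
import numpy as np
from sklearn import metrics
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # 1 = foreground, 0 = background
y_hat = np.array([0, 0, 1, 1, 1, 0, 1, 0])
tp = np.sum((y_hat == 1) & (y_true == 1))  # true positives
fp = np.sum((y_hat == 1) & (y_true == 0))  # false positives
fn = np.sum((y_hat == 0) & (y_true == 1))  # false negatives
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
# the same numbers from scikit-learn's implementations
print(metrics.precision_score(y_true, y_hat),
      metrics.recall_score(y_true, y_hat),
      metrics.f1_score(y_true, y_hat))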
# Generate an un-balanced 2D dataset
np.random.seed(0)
X = np.vstack([np.random.normal(0, 1, (950, 2)),
               np.random.normal(-1.8, 0.8, (50, 2))])
y = np.hstack([np.zeros(950), np.ones(50)])
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='none', cmap=plt.cm.Accent);
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)
print "accuracy:", metrics.accuracy_score(y_test, y_pred)
print "precision:", metrics.precision_score(y_test, y_pred)
print "recall:", metrics.recall_score(y_test, y_pred)
print "f1 score:", metrics.f1_score(y_test, y_pred)
print '\n --------------Classification Report-----------------\n'
print metrics.classification_report(y_test, y_pred, target_names=['background', 'foreground'])
Output:
--------------Classification Report-----------------
             precision    recall  f1-score   support
 background       0.97      0.99      0.98       234
 foreground       0.83      0.62      0.71        16
avg / total       0.97      0.97      0.97       250
Cross-Validation
Using the simple train/test split as above can be useful, but there is a disadvantage: your fit is ignoring a portion of your dataset. One way to address this is to use cross-validation. scikit-learn has a K-fold cross-validation scheme built-in:
from sklearn.model_selection import cross_val_score
# Let's do a 2-fold cross-validation of the SVC estimator
print(cross_val_score(SVC(), X, y, cv=2, scoring='precision'))
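Two folds is the bare minimum; in practice, something like five folds plus the mean and spread of the scores gives a steadier estimate. The fold count below is a conventional choice of mine, not from the original run:
scores = cross_val_score(SVC(), X, y, cv=5, scoring='precision')
print(scores.mean(), scores.std())  # average precision across folds, with its spread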
Grid Search
Scikit-learn has a grid search tool built-in, which is used as follows. Note that GridSearchCV has a fit method: it is a meta-estimator, an estimator over estimators!
from sklearn.model_selection import GridSearchCV
clf = SVC()
Crange = np.logspace(-2, 2, 40)
grid = GridSearchCV(clf, param_grid={'C': Crange},
                    scoring='precision', cv=5)
grid.fit(X, y)
print "best parameter choice:", grid.best_params_
scores = [g[1] for g in grid.grid_scores_]
plt.semilogx(Crange, scores);
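One caveat worth adding (my note, not part of the original post): the grid above was tuned on all of the data, so its best score is optimistically biased. A cleaner sketch tunes on a training split only and then scores the winning model on untouched test data:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
grid = GridSearchCV(SVC(), param_grid={'C': Crange},
                    scoring='precision', cv=5)
grid.fit(X_tr, y_tr)  # tuning only ever sees the training split
print(grid.best_params_)
print(grid.best_estimator_.score(X_te, y_te))  # honest held-out accuracy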
One popular and effective way to address over-fitting is to use ensemble methods. sklearn.ensemble.RandomForestRegressor uses multiple randomized decision trees and averages their results. The ensemble of estimators can often do better than any individual estimator for data that is over-fit. Let's repeat the grid-search experiment using the Random Forest regressor with 10 trees (note that the make_data helper used here is defined in the learning-curves section below).
What is the best max_depth for this model? Does the accuracy improve?
from sklearn.ensemble import RandomForestRegressor
X, y = make_data(500, error=1)
clf = RandomForestRegressor(n_estimators=10)
max_depth = np.arange(1, 10)
grid = GridSearchCV(clf, param_grid={'max_depth': max_depth},
                    scoring='neg_mean_squared_error', cv=5)
grid.fit(X, y)
scores = grid.cv_results_['mean_test_score']
plt.plot(max_depth, scores);
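To answer the question directly, we can read the winning depth and its cross-validated error off the fitted grid; a small sketch (the negation undoes the neg_mean_squared_error sign convention):
print(grid.best_params_)                           # best max_depth found
print(-grid.cv_results_['mean_test_score'].max())  # its mean CV squared error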
Bias, Variance, and Learning Curves
The exact turning point of the tradeoff between bias and variance is highly dependent on the number of training points used. Here I will illustrate the use of learning curves, which display this property. The idea is to plot the mean squared error for the training and test sets as a function of the number of training points.
def test_func(x, err=0.5):
    y = 10 - 1. / (x + 0.1)
    if err > 0:
        y = np.random.normal(y, err)
    return y
def make_data(N=40, error=1.0, random_seed=1):
    # randomly sample the data
    np.random.seed(random_seed)
    X = np.random.random(N)[:, np.newaxis]
    y = test_func(X.ravel(), error)
    return X, y
X, y = make_data(40, error=1)
plt.scatter(X.ravel(), y);
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
Linear Regression - MSE
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_test = model.predict(X_test)
plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print "mean squared error:", metrics.mean_squared_error(model.predict(X), y)
Polynomial Regression
class PolynomialRegression(LinearRegression):
    """Simple polynomial regression for 1D data"""
    def __init__(self, degree=1, **kwargs):
        self.degree = degree
        LinearRegression.__init__(self, **kwargs)

    def fit(self, X, y):
        if X.shape[1] != 1:
            raise ValueError("Only 1D data valid here")
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.fit(self, Xp, y)

    def predict(self, X):
        Xp = X ** (1 + np.arange(self.degree))
        return LinearRegression.predict(self, Xp)
model = PolynomialRegression(degree=2)
model.fit(X, y)
y_test = model.predict(X_test)
plt.scatter(X.ravel(), y)
plt.plot(X_test.ravel(), y_test)
print "mean squared error:", metrics.mean_squared_error(model.predict(X), y)
#Try with degree 3
X, y = make_data(200, error=1.0)
degree = 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
N_range = np.linspace(15, X_train.shape[0], 20).astype(int)
def plot_learning_curve(degree=3):
    training_error = []
    test_error = []
    mse = metrics.mean_squared_error
    for N in N_range:
        XN = X_train[:N]
        yN = y_train[:N]
        model = PolynomialRegression(degree).fit(XN, yN)
        training_error.append(mse(model.predict(XN), yN))
        test_error.append(mse(model.predict(X_test), y_test))
    plt.plot(N_range, training_error, label='training')
    plt.plot(N_range, test_error, label='test')
    plt.plot(N_range, np.ones_like(N_range), ':k')
    plt.legend()
    plt.title('degree = {0}'.format(degree))
    plt.xlabel('num. training points')
    plt.ylabel('MSE')
plot_learning_curve(3)
This shows a typical learning curve: for very few training points, there is a large separation between the training and test error, which indicates over-fitting. For a large number of training points, the training and testing errors converge; if they converge to a high error, that indicates under-fitting with the given model. Let's try with degree 2.
plot_learning_curve(2)
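For completeness (my addition, not in the original post), scikit-learn ships a learning_curve helper that automates the loop above using cross-validation; a minimal sketch with the polynomial model:
from sklearn.model_selection import learning_curve
sizes, train_scores, test_scores = learning_curve(
    PolynomialRegression(degree=2), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='neg_mean_squared_error')
plt.plot(sizes, -train_scores.mean(axis=1), label='training')  # undo the sign convention
plt.plot(sizes, -test_scores.mean(axis=1), label='test')
plt.legend()
plt.xlabel('num. training points')
plt.ylabel('MSE')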