Big Data Analytics and Machine Learning: Machine Learning with Python

In this tutorial I will show you how to perform various machine learning tasks using Python. We will use scikit-learn, a popular machine learning framework for Python.

Preliminaries

Checking the installation
You can run the following code to check the versions of the packages on your system:
import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

What is Machine Learning?

Machine Learning is about building programs with tunable parameters that are adjusted automatically so as to improve their behavior by adapting to previously seen data.

Machine Learning can be considered a subfield of Artificial Intelligence, since its algorithms can be seen as building blocks that make computers behave more intelligently by generalizing from data rather than just storing and retrieving data items as a database system would.

Representation of Data in Scikit-learn

Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented in order to be understood by the computer.
Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either NumPy arrays or, in some cases, scipy.sparse matrices. The shape of the array is expected to be [n_samples, n_features]: one row per sample and one column per feature.
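As a small illustration of the [n_samples, n_features] layout, the sketch below builds a toy array by hand. The measurements are made-up values chosen for illustration, not rows taken from any dataset.

```python
import numpy as np

# Three hypothetical samples (rows), each with four features (columns).
X = np.array([
    [5.0, 3.5, 1.5, 0.2],
    [6.0, 3.0, 4.5, 1.5],
    [6.5, 3.0, 5.5, 2.0],
])

n_samples, n_features = X.shape
print(n_samples, n_features)

# A single sample still has to be two-dimensional: one row, n_features columns.
one_sample = X[0].reshape(1, -1)
print(one_sample.shape)
```

The reshape step is why the prediction calls later in this tutorial pass a nested list such as [[3, 5, 4, 2]] rather than a flat list.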

Loading the Data with Scikit-Learn

Scikit-learn provides a straightforward set of data-loading functions. We will look at examples of loading the Iris and Digits datasets.

Features in the Iris data-set:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
Target classes to predict:
Iris Setosa
Iris Versicolour
Iris Virginica

Code:
# Load the Iris data
from sklearn.datasets import load_iris
iris = load_iris()

print(iris.keys())
n_samples, n_features = iris.data.shape
print(n_samples, n_features)
print(iris.data[0])
print(iris.data.shape)
print(iris.target.shape)

print(iris.target_names)

# Load the Digits data
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.keys())

n_samples, n_features = digits.data.shape
print(n_samples, n_features)

print(digits.data.shape)
print(digits.images.shape)

# Visualize the digit images
import matplotlib.pyplot as plt
# IPython/Jupyter magic; drop this line when running as a plain script
%matplotlib inline

# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')

    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

Supervised Learning

In supervised learning, we have a dataset consisting of both features and labels. The task is to construct an estimator that can predict the label of an object given its features. A relatively simple example is predicting the species of an iris from a set of measurements of its flower.

Supervised learning is further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous.

Classification- K Nearest Neighbors

K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.

Let's try it out on our iris classification problem:

from sklearn import neighbors, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Use only the single nearest neighbor to decide the class
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
print(iris.target_names[knn.predict([[3, 5, 4, 2]])])

Output: ['virginica']
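To make the nearest-neighbor rule described above concrete, here is a minimal hand-written sketch in plain Python. It is not how scikit-learn implements KNeighborsClassifier; the helper name knn_predict and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Return the majority label among the k training points nearest to query."""
    # Pair every training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Keep the labels of the k closest points and take a majority vote.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two features per sample, two classes.
train_X = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]]
train_y = ["a", "a", "b", "b"]

print(knn_predict(train_X, train_y, [1.1, 1.0], k=1))
print(knn_predict(train_X, train_y, [3.8, 4.0], k=3))
```

With k=1 the prediction is simply the label of the single closest point; larger values of k smooth the decision by letting several neighbors vote.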

Classification- Support Vector Machines

Support Vector Machines (SVMs) are a powerful family of supervised learning algorithms used for classification or regression. SVMs are discriminative classifiers: that is, they draw a boundary between clusters of data.

Let's try it out on our iris classification problem:

from sklearn import svm, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

unknown_iris = [[3, 5, 4, 2]]

# The kernel could be 'rbf', 'linear', or another supported kernel
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

print(iris.target_names[clf.predict(unknown_iris)])

Output: ['versicolor']
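For two classes and a linear kernel, the boundary mentioned above is a hyperplane w . x + b = 0, and classifying a point amounts to checking which side of it the point falls on. The sketch below shows only that final step; the weights and bias are hypothetical values picked by hand, not the result of fitting an SVM, and the three-class Iris problem needs more than one such boundary.

```python
def linear_decision(weights, bias, point):
    """Signed score of a point relative to the boundary w . x + b = 0."""
    return sum(w * x for w, x in zip(weights, point)) + bias

# Hypothetical boundary x1 + x2 - 5 = 0; not fitted to any data.
weights, bias = [1.0, 1.0], -5.0

for point in ([1.0, 1.5], [4.0, 3.0]):
    score = linear_decision(weights, bias, point)
    side = "positive" if score > 0 else "negative"
    print(point, score, side)
```

What training an SVM adds is the choice of weights and bias: it picks the boundary that separates the classes with the widest possible margin.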

Regression- Linear

The simplest possible regression setting is linear regression:

import numpy as np
import matplotlib.pyplot as plt
# IPython/Jupyter magic; drop this line when running as a plain script
%matplotlib inline

# Create some simple data
np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.normal(size=20)

# Fit a linear regression to it
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
# coef_ is an array with one entry per feature, so take its first element
print("Model coefficient: %.5f, and intercept: %.5f" % (model.coef_[0], model.intercept_))

# Plot the data and the model prediction
X_test = np.linspace(0, 1, 100)[:, np.newaxis]
y_test = model.predict(X_test)
plt.plot(X.squeeze(), y, 'o')
plt.plot(X_test.squeeze(), y_test)
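For a single feature, the least-squares line can also be computed by hand, which may help show what the fitted coefficient and intercept mean. This is a sketch under simple assumptions: the helper fit_line is hypothetical, and the data is noise-free and generated from y = 3x + 2, so the fit recovers those two values.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noise-free data generated from y = 3x + 2.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [3 * x + 2 for x in xs]

slope, intercept = fit_line(xs, ys)
print("slope: %.5f, intercept: %.5f" % (slope, intercept))
```

With noisy data, as in the scikit-learn example, the estimates land near the true slope and intercept rather than exactly on them.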

All the code is available on GitHub.

Saturday, 24 October 2015

Machine Learning with Python - Supervised Learning
