Friday 6 November 2015

Deep Learning - Image Classification and Similar Image Retrieval

In this tutorial I will show you how to build a deep learning network for image recognition on the CIFAR-10 data set. The CIFAR-10 data set represents real-world data that is already formatted and labeled, so today we can focus on building our network instead of cleaning the data.

We will also be using deep visual features to match images to each other. To do that, a pre-trained ImageNet neural network model is loaded, used as a feature extractor, and applied to every image in the data set.
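The data set we load below already ships with these extracted features in a 'deep_features' column, so you will not need to run the extraction yourself. Purely as a sketch of what that step looks like (the model path here is a placeholder, and I am assuming the loaded GraphLab model exposes an extract_features method):

# Sketch only: not required for this tutorial, since 'deep_features' already
# exists in the downloaded data. Replace the placeholder path with a real
# pre-trained ImageNet model location.
pretrained_model = graphlab.load_model('path/to/pretrained_imagenet_model')
image_train['deep_features'] = pretrained_model.extract_features(image_train)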

Deep Learning

Deep learning (also called deep machine learning, deep structured learning, hierarchical learning, or simply DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
More details:
https://en.wikipedia.org/wiki/Deep_learning
http://deeplearning.stanford.edu/tutorial/


This tutorial will teach you the following:


  • Loading the data
  • Training our model
  • Evaluating the results of the model
  • Looking at ways to improve the performance of our model
All code is available on GitHub.

Note: A few things in this tutorial are taken from the Machine Learning Foundations course that I recently completed on Coursera.

We've downloaded a subset of the CIFAR-10 data set for you. We load the data into an SFrame, a powerful and scalable data structure that is used by many of the models in GraphLab Create.

Loading the image train and test data
import graphlab

image_train = graphlab.SFrame('http://s3.amazonaws.com/dato-datasets/coursera/deep_learning/image_train_data')
image_test = graphlab.SFrame('http://s3.amazonaws.com/dato-datasets/coursera/deep_learning/image_test_data')
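Before building any model, it helps to peek at what we just loaded. A quick look (the column names listed in the comment are the ones used later in this tutorial):

image_train.column_names()              # expect 'id', 'image', 'label', 'deep_features', 'image_array'
image_train.head(5)                     # first few rows
image_train['label'].sketch_summary()   # rough distribution of the class labels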



Resizing the images for display
graphlab.image_analysis.resize(image_train['image'], 128, 128, 3).show()

Build Logistic Classifier

Now let's train a logistic classifier on the raw pixel values (the 'image_array' column) using GraphLab's logistic_classifier.
raw_pixel_model = graphlab.logistic_classifier.create(image_train, features=['image_array'], target='label')

Model Validation and Evaluation

To ensure that the model is actually learning how to recognize images, instead of memorizing the training data, we want to validate it on a data set it hasn't seen before.

graphlab.image_analysis.resize(image_test[0:1]['image'], 128, 128, 3).show()

Let us check the label of this image.
image_test[0:1]['label']

Output:
dtype: str
Rows: 1
['cat']

Now let's see what our model predicts.
raw_pixel_model.predict(image_test[0:1])

Output:
dtype: str
Rows: 1
['dog']

Oops!! It's a wrong prediction...


Evaluating the Logistic model
raw_pixel_model.evaluate(image_test)

Output:
'accuracy': 0.46225
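Accuracy of about 0.46 on raw pixels is not great. Beyond the single accuracy number, evaluate() also reports a per-class breakdown; a small sketch, assuming the standard GraphLab Create classifier evaluation output:

results = raw_pixel_model.evaluate(image_test)
results['accuracy']           # overall fraction of correct predictions
results['confusion_matrix']   # which labels get confused with which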


Build a Neural Network Classifier

Now let us use the neural network classifier provided by GraphLab to create a neural network for our data set, this time using the 'image' column directly as the feature. The create method picks a default network architecture for you based on the data set.
deep_model = graphlab.neuralnet_classifier.create(image_train, features=['image'], target='label')

Prediction with Neural Network Model
Let's use the same test image that we used for our previous model.
graphlab.image_analysis.resize(image_test[0:1]['image'], 128, 128, 3).show()
deep_model.predict(image_test[0:1])

Output:
dtype: str
Rows: 1
['cat']

Wow, great!! It predicted correctly... Hooray!!

Evaluating the Neural Model
deep_model.evaluate(image_test)

Output:
'accuracy': 0.597000002861023

The accuracy is clearly better than that of the logistic classifier model.

But can we improve it further???

Improving the model further with Deep Features

Let us try to rebuild the model using the deep features of the images and evaluate it.

Rebuilding our model
deep_features_model = graphlab.logistic_classifier.create(image_train, features=['deep_features'], target='label')

Evaluating the deep feature model
deep_features_model.evaluate(image_test)

Output:
'accuracy': 0.784

That's awesome!! As an exercise, try the neural network classifier with the deep features and check whether it does even better.
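As a quick sanity check, we can also push the earlier cat test image through the deep-feature model (the same test row used above; a single prediction can of course still be wrong even with higher overall accuracy):

deep_features_model.predict(image_test[0:1])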


Similar Image Retrieval

It's time to dig deeper. Using the deep learning knowledge we have gathered so far, let's use the deep features of the images to find similar images.
One real-world use case for this is searching for similar items on a shopping site using images rather than text in the search bar. Similar-image search combined with meta-text search would make your shopping site awesome, and your customers will love it.
Have a go, hero. I will host a similar service on AWS soon for a demo.

Finding Similar Images with nearest neighbor model
nearest_neighbors_model = graphlab.nearest_neighbors.create(image_train, features=['deep_features'], label='id')

def get_nearest_neighbors(image):
    # Query the model for the training rows nearest to the given image (by deep features)
    ans = nearest_neighbors_model.query(image)
    # 'reference_label' holds the 'id' of each matched training image;
    # pull those rows back out of the training set
    return image_train.filter_by(ans['reference_label'], 'id')

Similar Images: Cat Photo
cat = image_train[18:19]
graphlab.image_analysis.resize(cat['image'],128,128,3).show()


The picture quality is not great (the original CIFAR-10 images are only 32x32 pixels, so they look blurry when resized up).


Finding similar images of cats
graphlab.image_analysis.resize(get_nearest_neighbors(cat)['image'],128,128,3).show()
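You can repeat this for any other training image; a small sketch (the row index 26 below is arbitrary, purely for illustration, so pick any row you like):

another_image = image_train[26:27]   # arbitrary row, just for illustration
graphlab.image_analysis.resize(another_image['image'], 128, 128, 3).show()
graphlab.image_analysis.resize(get_nearest_neighbors(another_image)['image'], 128, 128, 3).show()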



Hope you enjoyed this tutorial :)

Wednesday 4 November 2015

Machine Learning Case Study - Restaurant Review Analysis

In this tutorial you will learn how to classify textual data using GraphLab. We will be using a restaurant review data set (Yelp reviews) for this purpose. Along with classification, you will also learn some basic feature engineering (bag of words, tf-idf), an essential part of any text analytics machine learning project.

All code is available on GitHub.

Let's have a quick view of the data set

I will be using an IPython notebook for this exercise.
Note: You can run a Linux command from an IPython notebook by prefixing it with a ! character.

!head -n 2 /home/kuntal/data/yelp/yelp_training_set_review.json


SFrame (Scalable DataFrame) - a powerful library for processing unstructured data.

import graphlab as gl

reviews = gl.SFrame.read_csv('/home/kuntal/data/yelp/yelp_training_set_review.json', header=False)
reviews[0]




Let's unpack the column "X1" to extract its structure.
reviews=reviews.unpack('X1','')
reviews.head(4)




Votes are still crammed in a dictionary. Let's unpack it too.
reviews = reviews.unpack('votes', '')
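At this point the review fields and the individual vote counts should all be plain top-level columns. A quick check (the names in the comment are the ones used in the rest of this post):

reviews.column_names()   # expect 'text', 'date', 'funny', 'cool', 'useful', among others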


Data Visualization
reviews.show()



Feature Engineering

Parsing the date column into datetime values (date formatting)
reviews['date'] = reviews['date'].str_to_datetime(str_format='%Y-%m-%d')

Munging votes and adding a new column
reviews['total_votes'] = reviews['funny'] + reviews['cool'] + reviews['useful']

Filter rows to remove reviews with no votes
reviews = reviews[reviews['total_votes'] > 0]
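After this filter only reviews that received at least one vote remain; a quick check of how many rows are left:

reviews.num_rows()   # number of reviews with at least one funny/cool/useful vote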


Classification task

The task: predict which reviews will be voted "funny", based on the review text. First, the labels: a review with at least one "funny" vote is labeled funny.

reviews['funny'] = reviews['funny'] > 0
reviews = reviews[['text','funny']]

Creating bag-of-words representation of text
word_delims = ["\r", "\v", "\n", "\f", "\t", " ", 
               '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=', 
               ',', '.', ';', ':', '\"', '?', '|', '\\', '/', 
               '<', '>', '(', ')', '[', ']', '{', '}']

reviews['bow'] = gl.text_analytics.count_words(reviews['text'], delimiters=word_delims)

Creating tf-idf representation of the bag of words
reviews['tf_idf'] = gl.text_analytics.tf_idf(reviews['bow'])
reviews['tf_idf'] = reviews['tf_idf'].apply(lambda x: x['docs'])
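To see what these representations look like, peek at a single review (a quick sketch; the exact words and counts will of course depend on the review):

reviews[0]['bow']      # dict mapping each word in the review to its count
reviews[0]['tf_idf']   # the same words, re-weighted by tf-idf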

Creating a train-test split
train_sf, test_sf = reviews.random_split(0.8)
Note: It returns immediately because SFrame operations are lazily evaluated.
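If you want the split to be reproducible across runs, random_split also accepts a seed (optional; shown here only as a sketch):

train_sf, test_sf = reviews.random_split(0.8, seed=1)   # fixed seed for a repeatable split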

Training classifiers on bow and tf-idf
Dictionaries are automatically interpreted as sparse features. We will be using GraphLab's built-in logistic regression module to create our classification models, one per feature representation.

# Model-1 with feature 'bow'
m1 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['bow'], 
                                   validation_set=None, 
                                   feature_rescaling=False)

# Model-2 with feature tf-idf
m2 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['tf_idf'], 
                                   validation_set=None, 
                                   feature_rescaling=False)


Evaluating on the held-out test set and comparing the models' performance
m1_res = m1.evaluate(test_sf)
m2_res = m2.evaluate(test_sf)
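The evaluate() calls return dictionaries like the ones we saw in the image tutorial, so to compare the two models directly, just pull out the accuracy numbers:

print('bow accuracy:    %s' % m1_res['accuracy'])
print('tf-idf accuracy: %s' % m2_res['accuracy'])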



Baseline accuracy (what would we get if we classified everything as the majority class?). First, the percentage of funny reviews:
float(test_sf['funny'].sum())/test_sf.num_rows()

Output:
0.4800796812749004

Percentage of not-funny reviews (the majority class, so this is the baseline accuracy to beat)
1.0 - float(test_sf['funny'].sum())/test_sf.num_rows()

Output:
0.5199203187250996