Machine Learning Case Study: Restaurant Review Analysis

In this tutorial you will learn how to classify textual data using GraphLab, working with a restaurant review data set. Along with classification, you will also learn some basic feature engineering techniques (bag of words, tf-idf) that are an essential part of any text analytics machine learning project.

All the code is available on GitHub.

Let's take a quick look at the data set

I will be using an IPython notebook for this exercise.
Note: You can run Linux commands from an IPython notebook by putting a ! before the command.

!head -n 2 /home/kuntal/data/yelp/yelp_training_set_review.json

SFrame (Scalable DataFrame): a powerful library for processing unstructured data.

import graphlab as gl
reviews = gl.SFrame.read_csv('/home/kuntal/data/yelp/yelp_training_set_review.json', header=False)

Let's unpack the column "X1" to extract its structure.
reviews = reviews.unpack('X1', '')

Votes are still crammed in a dictionary. Let's unpack it too.
reviews = reviews.unpack('votes', '')
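
To make the idea of unpacking concrete, here is a minimal plain-Python sketch of what flattening a dictionary column amounts to. It is not GraphLab's implementation; the helper name unpack_column and the sample rows are made up for illustration.

```python
# Plain-Python illustration of "unpacking" a dictionary column.
# The rows below are invented; they only mimic the shape of the review data.
rows = [
    {"text": "Great tacos", "votes": {"funny": 1, "useful": 2, "cool": 0}},
    {"text": "Slow service", "votes": {"funny": 0, "useful": 1, "cool": 0}},
]

def unpack_column(rows, column, prefix=""):
    """Replace a dict-valued column with one top-level column per key."""
    unpacked = []
    for row in rows:
        # Keep every column except the one being unpacked.
        flat = {k: v for k, v in row.items() if k != column}
        # Promote each key of the dictionary to its own column.
        for key, value in row[column].items():
            flat[prefix + key] = value
        unpacked.append(flat)
    return unpacked

flat_rows = unpack_column(rows, "votes")
print(flat_rows[0])
```

With an empty prefix, the new columns are named after the dictionary keys, which is why 'funny', 'cool' and 'useful' can be used directly as column names later on.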

Data Visualization

Feature Engineering

Representing dates as datetime (date formatting)
reviews['date'] = reviews['date'].str_to_datetime(str_format='%Y-%m-%d')
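
The format string '%Y-%m-%d' means a four-digit year, a two-digit month and a two-digit day separated by hyphens. As a rough illustration, the same pattern can be tried with Python's standard datetime module; the sample date strings are invented, and GraphLab's own parser may differ in details for other format flags.

```python
from datetime import datetime

# Invented sample values in the same year-month-day layout.
date_strings = ["2012-06-14", "2011-01-05"]

# Parse each string into a datetime object using the same pattern.
parsed = [datetime.strptime(s, "%Y-%m-%d") for s in date_strings]

print(parsed[0].year, parsed[0].month, parsed[0].day)
```

Once the values are real datetime objects rather than strings, they can be compared and sorted chronologically.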

Munging votes and adding a new column
reviews['total_votes'] = reviews['funny'] + reviews['cool'] + reviews['useful']

Filter rows to remove reviews with no votes
reviews = reviews[reviews['total_votes'] > 0]

Classification task

Predict which reviews will be voted "funny" based on the review text. First, the labels: a review with at least one "funny" vote is labeled funny.

reviews['funny'] = reviews['funny'] > 0
reviews = reviews[['text','funny']]

Creating a bag-of-words representation of the text
word_delims = ["\r", "\v", "\n", "\f", "\t", " ", 
               '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=', 
               ',', '.', ';', ':', '\"', '?', '|', '\\', '/', 
               '<', '>', '(', ')', '[', ']', '{', '}']

reviews['bow'] = gl.text_analytics.count_words(reviews['text'], delimiters=word_delims)
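
For intuition, a bag of words is just a dictionary mapping each word to the number of times it occurs in a document. The sketch below builds one in plain Python. It is an illustration rather than GraphLab's implementation: the function name count_words_sketch, the shortened delimiter list and the lowercasing step are assumptions of this example.

```python
import re
from collections import Counter

# A shortened delimiter list for illustration; word_delims above is longer.
delims = [" ", "\n", "\t", ",", ".", "!", "?", ";", ":"]

def count_words_sketch(text, delimiters, to_lower=True):
    """Split text on any of the delimiters and count the resulting tokens."""
    if to_lower:
        text = text.lower()
    # Build one regular expression that matches any single delimiter.
    pattern = "|".join(re.escape(d) for d in delimiters)
    # Consecutive delimiters produce empty strings, which are dropped.
    tokens = [tok for tok in re.split(pattern, text) if tok]
    return dict(Counter(tokens))

bow = count_words_sketch("Good food, good prices. Really good!", delims)
print(bow)
```

Word order is discarded entirely; only the counts survive, which is what makes the representation easy to feed into a linear classifier.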

Creating a tf-idf representation of the bag of words
reviews['tf_idf'] = gl.text_analytics.tf_idf(reviews['bow'])
reviews['tf_idf'] = reviews['tf_idf'].apply(lambda x: x['docs'])
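
tf-idf reweights the raw counts so that words appearing in many documents count for less. The plain Python 3 sketch below uses one common weighting, count * log(N / document frequency). This formula is an assumption of the example, and GraphLab's exact weighting may differ; the function name and sample bags are invented.

```python
import math

def tf_idf_sketch(bags):
    """bags: list of {word: count} dicts. Returns a list of {word: weight} dicts."""
    n_docs = len(bags)
    # Document frequency: in how many documents does each word occur?
    doc_freq = {}
    for bag in bags:
        for word in bag:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Weight each count by the log of the inverse document frequency.
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in bag.items()}
        for bag in bags
    ]

bags = [{"good": 2, "food": 1}, {"bad": 1, "food": 1}]
weights = tf_idf_sketch(bags)
print(weights)
```

In this toy corpus "food" occurs in every document, so its weight drops to zero, while "good" and "bad" keep a positive weight because each is specific to one document.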

Creating a train-test split
train_sf, test_sf = reviews.random_split(0.8)
Note: The call returns immediately because SFrame operations are lazily evaluated.
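
A random split of this kind assigns each row to the first part with the given probability, so the 80/20 proportion holds only approximately. Here is a minimal plain-Python sketch of that idea; the function name random_split_sketch is hypothetical, and this is not a description of how SFrame implements it.

```python
import random

def random_split_sketch(rows, fraction, seed=None):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

train, test = random_split_sketch(list(range(1000)), 0.8, seed=42)
print(len(train), len(test))
```

Fixing the seed is worth doing whenever two models are compared, so that both are trained and tested on exactly the same rows.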

Training classifiers on bow and tf-idf
Dictionaries are automatically interpreted as sparse features. We will use GraphLab's built-in logistic regression module to create our classification models, one for each feature.

# Model 1, using the 'bow' feature
m1 = gl.logistic_classifier.create(train_sf, target='funny', features=['bow'])

# Model 2, using the 'tf_idf' feature
m2 = gl.logistic_classifier.create(train_sf, target='funny', features=['tf_idf'])
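
To show what treating a dictionary as sparse features means, here is a small plain-Python logistic regression trained with stochastic gradient descent, where every word gets its own weight and absent words contribute nothing. This is a teaching sketch, not GraphLab's solver; the function names, learning rate, epoch count and the four toy examples are all assumptions.

```python
import math

def predict_proba(weights, bias, features):
    """Probability of the positive class for one {word: value} dictionary."""
    z = bias + sum(weights.get(word, 0.0) * value for word, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_sketch(examples, epochs=200, lr=0.1):
    """examples: list of (features_dict, label) pairs with label 0 or 1."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for features, label in examples:
            # Gradient of the log loss with respect to the linear score.
            error = predict_proba(weights, bias, features) - label
            bias -= lr * error
            # Only the words present in this example are updated.
            for word, value in features.items():
                weights[word] = weights.get(word, 0.0) - lr * error * value
    return weights, bias

# Invented toy data: word counts paired with a funny (1) / not funny (0) label.
examples = [
    ({"hilarious": 1, "waiter": 1}, 1),
    ({"lol": 2, "menu": 1}, 1),
    ({"menu": 1, "parking": 1}, 0),
    ({"waiter": 1, "slow": 1}, 0),
]
weights, bias = train_logistic_sketch(examples)
print(sorted(weights.items(), key=lambda kv: kv[1]))
```

Sorting the learned weights is a quick way to see which words push a review towards "funny" and which push it away.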

Evaluating on the held-out test set and comparing the models' performance
m1_res = m1.evaluate(test_sf)
m2_res = m2.evaluate(test_sf)
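
Evaluation boils down to comparing predicted labels with true labels. The plain Python 3 sketch below computes accuracy and the four confusion counts by hand; the function name and the label lists are invented, and the structure of the result returned by GraphLab's evaluate is not assumed here.

```python
def evaluate_sketch(y_true, y_pred):
    """Return accuracy and confusion counts for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred:
            counts["tp" if truth else "fp"] += 1
        else:
            counts["fn" if truth else "tn"] += 1
    accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
    return accuracy, counts

y_true = [1, 0, 0, 1, 0]  # invented ground-truth labels
y_pred = [1, 0, 1, 0, 0]  # invented model predictions
accuracy, counts = evaluate_sketch(y_true, y_pred)
print(accuracy, counts)
```

Looking at the confusion counts alongside accuracy matters when one class dominates, because a model can score well on accuracy while missing most of the rarer class.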

Baseline accuracy (what if we classify everything as the majority class?)


Fraction of reviews that are not funny
1.0 - float(test_sf['funny'].sum())/test_sf.num_rows()
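
The same baseline can be written so that it does not assume in advance which class is the majority. A minimal plain Python 3 sketch, using an invented list of labels and a hypothetical helper name:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

labels = [0, 0, 0, 1, 0, 1, 0, 0]  # invented: 1 = funny, 0 = not funny
label, baseline = majority_baseline_accuracy(labels)
print(label, baseline)
```

A classifier is only useful if its accuracy on the test set clearly exceeds this number.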


