Wednesday, 4 November 2015

Machine Learning Case Study - Restaurant Review Analysis

In this tutorial you will learn how to classify textual data using GraphLab. We will use a restaurant review data set for this purpose. Along with classification, you will also learn some basic feature engineering (bag of words, tf-idf) techniques that are an essential part of any text analytics machine learning project.

All the code is available on GitHub.

Let's have a quick look at the data set.

I will be using an IPython notebook for this exercise.
Note: You can run Linux commands from an IPython notebook by prefixing them with a !.

!head -n 2 /home/kuntal/data/yelp/yelp_training_set_review.json


SFrame (Scalable DataFrame) - a powerful library for processing large, unstructured data.

import graphlab as gl

reviews = gl.SFrame.read_csv('/home/kuntal/data/yelp/yelp_training_set_review.json', header=False)
reviews[0]




Let's unpack the column "X1" to extract its structure.
reviews=reviews.unpack('X1','')
reviews.head(4)




The votes are still crammed into a dictionary. Let's unpack it too.
reviews = reviews.unpack('votes', '')


Data Visualization
reviews.show()



Feature Engineering

Represent the date column as a datetime (date formatting)
reviews['date'] = reviews['date'].str_to_datetime(str_format='%Y-%m-%d')
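For readers without GraphLab, here is a sketch of what str_to_datetime does for this format, using only Python's standard library:

```python
from datetime import datetime

# Plain-Python equivalent of parsing a '%Y-%m-%d' date string.
parsed = datetime.strptime('2011-02-25', '%Y-%m-%d')
print(parsed.year, parsed.month, parsed.day)  # 2011 2 25
```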

Combining the vote counts into a new total_votes column
reviews['total_votes'] = reviews['funny'] + reviews['cool'] + reviews['useful']

Filter rows to remove reviews with no votes
reviews = reviews[reviews['total_votes'] > 0]


Classification task

Predict which reviews will be voted "funny", based on the review text. First, the labels: a review with at least one "funny" vote is labeled funny.

reviews['funny'] = reviews['funny'] > 0
reviews = reviews[['text','funny']]

Creating bag-of-words representation of text
word_delims = ["\r", "\v", "\n", "\f", "\t", " ", 
               '~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=', 
               ',', '.', ';', ':', '\"', '?', '|', '\\', '/', 
               '<', '>', '(', ')', '[', ']', '{', '}']

reviews['bow'] = gl.text_analytics.count_words(reviews['text'], delimiters=word_delims)
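To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of the kind of dictionary count_words produces (GraphLab's exact tokenization rules may differ):

```python
from collections import Counter
import re

def count_words(text, delimiters):
    # Split the text on any run of delimiter characters and count the
    # remaining lowercase tokens: a dictionary-of-counts "bag of words".
    pattern = '[' + re.escape(''.join(delimiters)) + ']+'
    tokens = [t.lower() for t in re.split(pattern, text) if t]
    return dict(Counter(tokens))

bow = count_words("The food was great, GREAT service!", [' ', ',', '!'])
print(bow)  # {'the': 1, 'food': 1, 'was': 1, 'great': 2, 'service': 1}
```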

Creating tf-idf representation of the bag of words
reviews['tf_idf'] = gl.text_analytics.tf_idf(reviews['bow'])
reviews['tf_idf'] = reviews['tf_idf'].apply(lambda x: x['docs'])
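The idea behind tf-idf is to down-weight words that occur in many documents. The following sketch implements one common variant, tf * log(N / df); GraphLab's exact smoothing may differ:

```python
import math

def tf_idf(docs):
    # docs: a list of bag-of-words dicts, one per document.
    # Count the document frequency df of each word, then weigh each
    # term count by log(N / df): common words get weights near zero.
    n = len(docs)
    df = {}
    for bow in docs:
        for word in bow:
            df[word] = df.get(word, 0) + 1
    return [{w: tf * math.log(n / df[w]) for w, tf in bow.items()}
            for bow in docs]

docs = [{'great': 2, 'food': 1}, {'great': 1, 'slow': 1}]
weights = tf_idf(docs)
# 'great' appears in both documents, so its idf is log(2/2) = 0.
```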

Creating a train-test split
train_sf, test_sf = reviews.random_split(0.8)
Note: It returns immediately because SFrame operations are lazily evaluated.
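Conceptually, random_split assigns each row to the training set with the given probability. A plain-Python sketch of the idea (not GraphLab's implementation):

```python
import random

def random_split(rows, fraction, seed=42):
    # Send each row to the training set with probability `fraction`,
    # so the split sizes are approximately, not exactly, 80/20.
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < fraction else test).append(row)
    return train, test

train, test = random_split(list(range(1000)), 0.8)
print(len(train), len(test))
```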

Training classifiers on the bow and tf-idf features
Dictionaries are automatically interpreted as sparse features. We will use GraphLab's built-in logistic regression module to create a classification model for each feature.

# Model-1 with feature 'bow'
m1 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['bow'], 
                                   validation_set=None, 
                                   feature_rescaling=False)

# Model-2 with feature tf-idf
m2 = gl.logistic_classifier.create(train_sf, 
                                   'funny', 
                                   features=['tf_idf'], 
                                   validation_set=None, 
                                   feature_rescaling=False)
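If GraphLab is not available, the same dictionaries-as-sparse-features workflow can be sketched with scikit-learn: DictVectorizer plays the role of the automatic sparse encoding, and the tiny bag-of-words rows below are made up purely for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical bag-of-words rows and funny (1) / not-funny (0) labels.
X = [{'hilarious': 2, 'waiter': 1},
     {'bland': 1, 'service': 1},
     {'hilarious': 1, 'joke': 1},
     {'slow': 1, 'service': 2}]
y = [1, 0, 1, 0]

# DictVectorizer turns the dictionaries into a sparse matrix, mirroring
# how GraphLab treats dict columns as sparse features.
model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(X, y)
pred = model.predict([{'hilarious': 3}])
print(pred)
```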


Evaluating on the test set and comparing the models' performance
m1_res = m1.evaluate(test_sf)
m2_res = m2.evaluate(test_sf)



Baseline accuracy: what would we get by classifying everything as the majority class? First, the fraction of funny reviews:
float(test_sf['funny'].sum())/test_sf.num_rows()

Output:
0.4800796812749004

The majority class is "not funny", so the baseline accuracy is the fraction of not-funny reviews:
1.0 - float(test_sf['funny'].sum())/test_sf.num_rows()

Output:
0.5199203187250996
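The baseline computation above can be wrapped into a small pure-Python helper (not part of GraphLab): always predicting the most frequent class scores max(p, 1 - p), where p is the fraction of positive labels.

```python
def majority_baseline_accuracy(labels):
    # Accuracy achieved by always predicting the most frequent class.
    frac_positive = sum(labels) / len(labels)
    return max(frac_positive, 1.0 - frac_positive)

# With 40% positive labels, always predicting the negative class
# scores 60% accuracy.
acc = majority_baseline_accuracy([1, 0, 0, 1, 0] * 20)
print(acc)  # 0.6
```

Any trained model should beat this number; otherwise it has learned nothing useful from the text.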
