In this tutorial you will learn how to classify textual data using GraphLab, working with a restaurant review data set. Along with classification, you will also learn some basic feature engineering techniques (bag of words, tf-idf) that are an essential part of any text analytics machine learning project.
All code is available on GitHub.
Let's have a quick look at the data set. I will be using an IPython notebook for this exercise. Note: you can run a Linux command from an IPython notebook by prefixing it with a !.
!head -n 2 /home/kuntal/data/yelp/yelp_training_set_review.json
SFrame (Scalable DataFrame) is GraphLab's powerful library for processing large, unstructured data sets.
import graphlab as gl
reviews = gl.SFrame.read_csv('/home/kuntal/data/yelp/yelp_training_set_review.json', header=False)
reviews[0]
Let's unpack the column "X1" to extract its structure.
reviews=reviews.unpack('X1','')
reviews.head(4)
Votes are still crammed in a dictionary. Let's unpack it too.
reviews = reviews.unpack('votes', '')
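Conceptually, unpack promotes each key of a nested dictionary to its own top-level column. A rough pure-Python sketch of the idea (illustrative only, not GraphLab's implementation; the sample rows are made up):

```python
# Two hypothetical review rows with a nested "votes" dictionary.
rows = [
    {"text": "Great food!", "votes": {"funny": 1, "cool": 0, "useful": 2}},
    {"text": "Too slow.",   "votes": {"funny": 0, "cool": 1, "useful": 0}},
]

# "Unpack": drop the nested dict and merge its keys into the row itself.
unpacked = [
    {**{k: v for k, v in row.items() if k != "votes"}, **row["votes"]}
    for row in rows
]
print(unpacked[0])  # {'text': 'Great food!', 'funny': 1, 'cool': 0, 'useful': 2}
```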
Data Visualization
reviews.show()
Feature Engineering
Represent the date as a datetime (date formatting):
reviews['date'] = reviews['date'].str_to_datetime(str_format='%Y-%m-%d')
Munging votes and adding a new column
reviews['total_votes'] = reviews['funny'] + reviews['cool'] + reviews['useful']
Filter rows to remove reviews with no votes
reviews = reviews[reviews['total_votes'] > 0]
Classification task
Predict which reviews will be voted "funny," based on the review text. First, the labels: a review with at least one "funny" vote is labeled funny.
reviews['funny'] = reviews['funny'] > 0
reviews = reviews[['text','funny']]
Creating bag-of-words representation of text
word_delims = ["\r", "\v", "\n", "\f", "\t", " ",
'~', '`', '!', '@', '#', '$', '%', '^', '&', '*', '-', '_', '+', '=',
',', '.', ';', ':', '\"', '?', '|', '\\', '/',
'<', '>', '(', ')', '[', ']', '{', '}']
reviews['bow'] = gl.text_analytics.count_words(reviews['text'], delimiters=word_delims)
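The bag-of-words representation simply maps each review to a word-count dictionary. A minimal sketch of the same idea in plain Python (count_words above does this at scale; here \W+ stands in for the explicit delimiter list):

```python
import re
from collections import Counter

def count_words(text):
    # Split on runs of non-word characters and drop empty tokens.
    tokens = [t for t in re.split(r"\W+", text.lower()) if t]
    return dict(Counter(tokens))

bow = count_words("Funny, funny place. Great fries!")
print(bow)  # {'funny': 2, 'place': 1, 'great': 1, 'fries': 1}
```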
Creating tf-idf representation of the bag of words
reviews['tf_idf'] = gl.text_analytics.tf_idf(reviews['bow'])
reviews['tf_idf'] = reviews['tf_idf'].apply(lambda x: x['docs'])
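Tf-idf reweights the raw counts so that words appearing in many reviews count for less. A hedged sketch of one common formulation, tf(w) * log(N / df(w)), on a tiny made-up corpus (GraphLab's text_analytics.tf_idf computes this at scale over the whole SArray):

```python
import math

# Three toy documents, already in bag-of-words form.
corpus = [
    {"good": 2, "food": 1},
    {"good": 1, "slow": 1},
    {"slow": 2, "service": 1},
]
n_docs = len(corpus)

# Document frequency: in how many documents each word appears.
df = {}
for doc in corpus:
    for word in doc:
        df[word] = df.get(word, 0) + 1

def tf_idf(doc):
    # Rare words get boosted, common words get damped.
    return {w: tf * math.log(n_docs / df[w]) for w, tf in doc.items()}

scores = tf_idf(corpus[0])
# "food" appears in only one document, "good" in two,
# so "food" ends up with the higher score despite the lower raw count.
```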
Creating a train-test split
train_sf, test_sf = reviews.random_split(0.8)
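What random_split(0.8) does, conceptually: each row independently lands on the training side with probability 0.8, so the split sizes are close to 80/20 but not exact. A small sketch (my own illustration, not GraphLab internals):

```python
import random

def random_split(rows, fraction, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row is assigned independently with the given probability.
        (train if rng.random() < fraction else test).append(row)
    return train, test

train, test = random_split(list(range(1000)), 0.8)
# len(train) will be close to 800, but rarely exactly 800.
```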
Note: It returns immediately because SFrame operations are lazily evaluated.
Training classifiers on bow and tf-idf
Dictionaries are automatically interpreted as sparse features. We will use GraphLab's built-in logistic regression module to create a classification model for each feature set.
# Model 1: bag-of-words features
m1 = gl.logistic_classifier.create(train_sf,
'funny',
features=['bow'],
validation_set=None,
feature_rescaling=False)
# Model 2: tf-idf features
m2 = gl.logistic_classifier.create(train_sf,
'funny',
features=['tf_idf'],
validation_set=None,
feature_rescaling=False)
Evaluating on the test set and comparing the models' performance
m1_res = m1.evaluate(test_sf)
m2_res = m2.evaluate(test_sf)
Baseline accuracy (what if we classified everything as the majority class?). First, the fraction of funny reviews:
float(test_sf['funny'].sum())/test_sf.num_rows()
Output:
0.4800796812749004
Fraction of not-funny reviews (the majority class, and hence the baseline accuracy to beat):
1.0 - float(test_sf['funny'].sum())/test_sf.num_rows()
Output:
0.5199203187250996
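The same arithmetic on a toy label list, to make the baseline explicit (the labels here are hypothetical, not from the Yelp data):

```python
labels = [True, False, False, False]  # hypothetical 'funny' labels

# Fraction of funny reviews, and the majority-class baseline accuracy.
frac_funny = sum(labels) / len(labels)
baseline_accuracy = max(frac_funny, 1.0 - frac_funny)
print(frac_funny, baseline_accuracy)  # 0.25 0.75
```

A model is only interesting if it beats this majority-class accuracy (0.52 on the test set above).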