Sunday, 15 February 2015

Building a Recommender system with Apache Mahout

Recently I was playing with Apache Mahout to build a recommender system. I wanted to first test state-of-the-art collaborative filtering algorithms before building a customized solution (potentially on top of those algorithms). Here's the basic idea behind a recommendation system using Apache Mahout:

Collaborative Filtering

It is a technique for producing recommendations based solely on the users' preferences for products (instead of including product features and/or user properties). Collaborative filtering can be user-based or item-based.


  • User-based recommendation - promotes products to the user that are bought by users who are similar to him/her.



User-based Recommendation: recommend products to a user based on what similar users have bought



  • Item-based recommendation- proposes products that are similar to the ones the user already buys.


Item-based Recommendation: recommend products to a user that are similar to the ones he/she already bought



User-Item Preferences and Similarity

So what does similar mean in this context? In collaborative filtering, similarity between users (for user-based recommendations) or items (for item-based recommendations) is computed based on the user-item preferences only. We use how often a user bought a product as a proxy for the user's preference.

Based on these user-item preferences we can use, for example, the Euclidean distance or the Pearson correlation to determine the similarity between users or between items (products).

  1. Based on the Euclidean distance, two users are similar if the distance between their preference vectors projected into a Cartesian coordinate system is small. 
  2. The Pearson correlation (computed on demeaned user-item preferences) coincides with the cosine of the angle between the preference vectors. That is, two users are similar if the angle between their preference vectors is small or, formulated in terms of correlation, if they rate the same products high and other products low. 
  3. The Tanimoto similarity between two users is computed as the number of products the two users have in common divided by the total number of products they bought (or clicked, or viewed) overall.
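To make the three measures concrete, here is a small self-contained sketch (toy data, plain Java, not Mahout code) that computes a Euclidean-distance-based similarity, the Pearson correlation, and the Tanimoto coefficient on hand-made preference vectors:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy illustration of the three similarity measures on made-up preferences.
public class SimilarityDemo {

    // Euclidean-based similarity: 1 / (1 + distance), so a smaller distance means a higher similarity.
    static double euclideanSimilarity(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    // Pearson correlation: the cosine of the angle between the demeaned preference vectors.
    static double pearson(double[] a, double[] b) {
        double meanA = Arrays.stream(a).average().orElse(0);
        double meanB = Arrays.stream(b).average().orElse(0);
        double num = 0, denA = 0, denB = 0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            num += da * db;
            denA += da * da;
            denB += db * db;
        }
        return num / Math.sqrt(denA * denB);
    }

    // Tanimoto: |products in common| / |products bought by either user|.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Long> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        double[] u1 = {5, 4, 1};
        double[] u2 = {4, 5, 2};
        System.out.println("Euclidean similarity: " + euclideanSimilarity(u1, u2));
        System.out.println("Pearson correlation:  " + pearson(u1, u2));
        Set<Long> items1 = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        Set<Long> items2 = new HashSet<>(Arrays.asList(2L, 3L, 4L));
        System.out.println("Tanimoto similarity:  " + tanimoto(items1, items2));
    }
}
```

Note how the two rating vectors point in roughly the same direction, so the Pearson correlation is high, while the Tanimoto coefficient only counts overlapping items (two in common out of four overall, giving 0.5).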

Now let's implement the above ideas. Coding time:

Let's start by building a simple recommendation engine based on the MovieLens data.

To see a recommender engine in action, you can download one of the MovieLens ratings data sets (I will use the 100k ratings set). Unzip the archive somewhere. The file that will interest you is u.data. Its format (tab-separated) is as follows:

userId | movieId | rating | timestamp

I have modified the file for Mahout Taste's FileDataModel into the following simple format:

userId,movieId,rating

Sample data:

196,242,3
186,302,3
22,377,1
244,51,2
166,346,1
298,474,4
115,265,2
253,465,5
305,451,3
.....
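The conversion from u.data to this CSV format is a one-liner per row: keep the first three tab-separated fields and drop the timestamp. A minimal sketch (demonstrated on in-memory lines; in practice you would stream u.data through it and write rating.csv):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical one-off converter: keeps the first three tab-separated fields
// of each u.data line (userId, movieId, rating) and joins them with commas.
public class UDataToCsv {

    static String convertLine(String line) {
        String[] fields = line.split("\t");
        return fields[0] + "," + fields[1] + "," + fields[2];
    }

    public static void main(String[] args) {
        List<String> uData = Arrays.asList(
                "196\t242\t3\t881250949",
                "186\t302\t3\t891717742");
        List<String> csv = uData.stream()
                .map(UDataToCsv::convertLine)
                .collect(Collectors.toList());
        csv.forEach(System.out::println); // prints 196,242,3 and 186,302,3
    }
}
```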

Let's build a classic user-based recommender using the Pearson correlation similarity and a nearest-10-users neighborhood, with the code below:

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommenderPlaying {

    public static void main(String[] args) throws TasteException, IOException {

        // the user id for which recommendations are to be generated
        int userId = 6;

        // the number of recommendations to be generated
        int noOfRecommendations = 5;

        // load the dataset using FileDataModel
        DataModel model = new FileDataModel(new File("/home/kuntal/knowledge/IDE/workspace/MahoutTest/data/rating.csv"));

        // use the Pearson correlation similarity
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // NearestNUserNeighborhood is preferred when we need control over the exact number of neighbors
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // initialize the recommender
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // call the recommend method to generate recommendations
        List<RecommendedItem> recommendations = recommender.recommend(userId, noOfRecommendations);

        for (RecommendedItem recommendedItem : recommendations) {
            System.out.println("Recommended Movie Id: " + recommendedItem.getItemID()
                    + "  .Strength of Preference: " + recommendedItem.getValue());
        }
    }
}

Output:
Recommended Movie Id: 878  .Strength of Preference: 4.464102
Recommended Movie Id: 300  .Strength of Preference: 4.2047677
Recommended Movie Id: 322  .Strength of Preference: 4.0203676
Recommended Movie Id: 313  .Strength of Preference: 4.008741
Recommended Movie Id: 689  .Strength of Preference: 4.0
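Under the hood, a user-based recommender estimates a candidate item's preference roughly as a similarity-weighted average of the neighbors' ratings for that item. A stripped-down sketch of that idea on made-up numbers (not Mahout's actual internals):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the user-based estimate: a similarity-weighted average of
// the neighbors' ratings for the candidate item.
public class UserBasedEstimate {

    static double estimate(Map<Integer, Double> neighborSimilarity,
                           Map<Integer, Double> neighborRating) {
        double weighted = 0.0, totalWeight = 0.0;
        for (Map.Entry<Integer, Double> e : neighborSimilarity.entrySet()) {
            Double rating = neighborRating.get(e.getKey());
            if (rating == null) continue; // this neighbor never rated the item
            weighted += e.getValue() * rating;
            totalWeight += e.getValue();
        }
        return weighted / totalWeight;
    }

    public static void main(String[] args) {
        Map<Integer, Double> similarity = new HashMap<>();
        similarity.put(42, 0.9); // very similar neighbor
        similarity.put(7, 0.1);  // barely similar neighbor
        Map<Integer, Double> rating = new HashMap<>();
        rating.put(42, 5.0);
        rating.put(7, 2.0);
        // the estimate is dominated by the highly similar neighbor's rating
        System.out.println(UserBasedEstimate.estimate(similarity, rating));
    }
}
```

This is why the "Strength of Preference" values above live on the same 1-5 scale as the input ratings: they are weighted combinations of neighbors' actual ratings.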




Let's build a classic item-based recommender using the Pearson correlation similarity with the code below:
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommenderPlaying {

    public static void main(String[] args) throws TasteException, IOException {

        // the user id for which recommendations are to be generated
        int userId = 308;

        // the number of recommendations to be generated
        int noOfRecommendations = 3;

        // data model created to read the input file
        FileDataModel dataModel = new FileDataModel(new File("/home/kuntal/knowledge/IDE/workspace/MahoutTest/data/rating.csv"));

        // the similarity algorithm; note that no neighborhood is needed for item-based recommenders
        ItemSimilarity itemSimilarity = new PearsonCorrelationSimilarity(dataModel);

        // initialize the recommender
        ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity);

        // call the recommend method to generate recommendations
        List<RecommendedItem> recommendations = recommender.recommend(userId, noOfRecommendations);

        for (RecommendedItem recommendedItem : recommendations) {
            System.out.println("Recommended Movie Id: " + recommendedItem.getItemID()
                    + "  .Strength of Preference: " + recommendedItem.getValue());
        }
    }
}


Output:
Recommended Movie Id: 245  .Strength of Preference: 5.0
Recommended Movie Id: 34  .Strength of Preference: 5.0
Recommended Movie Id: 35  .Strength of Preference: 5.0

Evaluation of the Algorithms:

In my opinion the most valuable part of the whole process is evaluating your algorithm/model. To quickly tell whether your intuition in choosing a particular algorithm was good, or to see the impact (good or bad) of your own customized algorithm, you need a way to evaluate and compare the algorithms on the data.

You can easily do that with Mahout's RecommenderEvaluator interface. Two implementations of that interface are provided: AverageAbsoluteDifferenceRecommenderEvaluator and RMSRecommenderEvaluator. The first computes the average absolute difference between predicted and actual ratings; the second is the classic RMSE (a.k.a. RMSD).

One way to check whether the recommender returns good results is a hold-out test. We partition our dataset into two sets: a training set consisting of 90% of the data and a test set consisting of 10%. Then we train our recommender using the training set and look at how well it predicts the unknown interactions in the test set.

import java.io.File;
import java.io.IOException;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class EvaluationUserExample {

    public static void main(String[] args) throws IOException, TasteException {

        RecommenderBuilder builder = new RecommenderBuilder() {

            public Recommender buildRecommender(DataModel model) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                // users whose similarity exceeds the 0.1 threshold are taken as neighbors
                UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
                Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
                return new CachingRecommender(recommender);
            }
        };

        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

        DataModel model = new FileDataModel(new File("/home/kuntal/knowledge/IDE/workspace/MahoutTest/data/rating.csv"));

        /* 0.9 is the fraction of each user's preferences used for training;
           the rest are compared against the estimated preference values.
           1.0 is the fraction of users to include in the evaluation (here, all users). */
        double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);

        System.out.println("Result: " + score);
    }
}



Output:
Result: 0.8018675119933131

Note: if you run this test multiple times, you will get different results, because the split into training set and test set is done randomly.
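The score itself is easy to interpret: it is just the mean of |predicted rating - actual rating| over the held-out ratings, so a result of 0.80 means the recommender is off by about 0.8 stars on average. A plain-Java illustration on invented numbers:

```java
// Plain-Java illustration of the average-absolute-difference score:
// the mean of |predicted - actual| over held-out ratings. Values are invented.
public class MaeDemo {

    static double meanAbsoluteError(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {4.5, 3.0, 2.0};
        double[] actual    = {5.0, 3.0, 3.0};
        // (0.5 + 0.0 + 1.0) / 3 = 0.5
        System.out.println("Result: " + meanAbsoluteError(predicted, actual));
    }
}
```

Lower is better, and 0.0 would mean perfect predictions; the RMSRecommenderEvaluator differs only in squaring the differences before averaging, which penalizes large errors more.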

All code is available on GitHub.
