Data & The Art of Beer Recommendation 

I have been wanting to write this blog for some time now and this is in continuation to an earlier blog that I published some time back where I used Trifacta to wrangle the beer reviews data and make it ready for some Predictive fun !!

The Dataset : Crowd sourced data of beer reviews from the website, where beer aficionados from all over the world have rated and critiqued beers. There are close to 1.6 million reviews from 1999 to 2012 spanning almost 66,000 different beers globally.

 Task : Recommend some awesome beers for Mr Data Wrangler

Platform : R  (Hosted on Amazon EC2 m4.2xlarge )

 So let the fun begin :)

1. Set the environment and load the required libraries
2. Calculate the weighted average review score and filter to select only the beers that fall under the category of American Double / Imperial IPA
beermerge.df$weightedOverallReview = 0.6* beermerge.df$review_overall + 0.1* beermerge.df$review_appearance + 0.1* beermerge.df$review_aroma + 0.1* beermerge.df$review_taste + 0.1* beermerge.df$review_palate
my_styles = c("American Double / Imperial IPA")
beermerge.df = beermerge.df[beermerge.df$beer_style %in% my_styles,]
3. Create a lookup table that will contain Beer Names and Beer ID.
beerLookup = beermerge.df[,c(10,12)]
beerLookup = beerLookup[duplicated(beerLookup)==FALSE,]
beerLookup$beer_name = as.character(beerLookup$beer_name)
4. Next we create a subset of beermerge.df dataframe where we only select BeerId, review_profilename, and weightedOverallReview. I am also renaming the user osusullins to Data Wrangler . osusullns has reviewed about 5-6 beers and his rankings are very similar to how I would have ranked those same beers.So moving forward, I shall assume that whatever recommendations we will derive for osusullins would apply to Data Wrangler as well.
beerRatings.df = beermerge.df[,c(6,12,14)]
beerRatings.df$review_profilename = as.character(beerRatings.df$review_profilename)
beerRatings.df[beerRatings.df$review_profilename == "osusullins",1] = "Data Wrangler"
Some of the users have reviewed the same beer twice. In those cases, we will choose the maximum rating which the user gave. We could also take an average of multiple ratings.
dt.df = data.table(beerRatings.df)
dt.df = dt.df[,Max:=max(weightedOverallReview), by = list(review_profilename,beer_beerid)]
dt.df = dt.df[,-3,with = FALSE]
df =
Next we create another lookup table for the review_profilename and replace the profile names with their corresponding ID’s. We are doing this to keep the data-frame compact and clean.
review_profilename = unique(df$review_profilename)
profile_id = seq(1:length(review_profilename))
profileLookup = cbind(,
profileLookup$review_profilename = as.character(profileLookup$review_profilename)
newdf = merge(profileLookup,df,by = "review_profilename")
newdf = newdf[,-1]
newdf[,c(1)] = as.integer(newdf[,c(1)])
newdf[,c(2)] = as.integer(newdf[,c(2)])
newdf = newdf[duplicated(newdf)==FALSE,]
5. Next we filter to only keep the beers that have been reviwed by atleast 5 other people. We are doing this to avoid any beers that could be a hit or miss since we dont have enough information about them. There are close to 9500 distinct beers that fall under American Double / Imperial IPA category. By setting this filter of >= 5 reviews, we remove about 340 beers from the pool.
beerCount = data.frame(table(newdf$beer_beerid))
beerCount = beerCount[beerCount$Freq >= 5,]
newdf = newdf[newdf$beer_beerid %in% beerCount$Var1,]
In order to build the recommender system, we will use a technique called as Collaborative Filtering. The underlying concept behind this technique is as follows:
Assume Person X likes Hoppy Beers and has tasted Beer 1, Beer 2, Beer 3 given them raving reviews Person Y also likes Hoppy Beers and has given good reviews to Beer 1, Beer 4, Beer 5, Beer 6
Person X also seems to like Dark Porters namely Beer 7, Beer 8. There is a high likelihood that Person Y will also share similar opinions on Beer 7 & Beer 8 than some other random person.
You can predict and recommend items to users based on preference similarities.
There are two types of collaborative filtering: user-based and item-based.
: Item Based Collaborative Filtering takes the similarities between items’ consumption history.
: User Based Collaborative Filtering considers similarities between user consumption history.
The following code block:
  1. Builds a n*p matrix (n = number of users & p = number of items). Its is a very sparse matrix where a number of cells have NA because that particular user has not yet reviewed that particular beer.
  2. Creates a p*p placeholder matrix with NA populated in all the cells.
  3. Builds a User Defined Function getCorrelation. (The similarity/distance function between two different items will be the correlation coefficient)
  4. Runs nested for loop (2 loops) across the columns of the n*p matrix and in every iteration it calculates the correlation between Beer Pi & Beer Pj. The correlation value is then fed in the placeholder matrix created above in 2.
sparse_matrix = acast(newdf, profile_id~beer_beerid, value.var="Max") = matrix(NA,nrow = ncol(sparse_matrix), ncol = ncol(sparse_matrix))
getCorrelation = function(x,y) 
  this.correlation = cor(x,y, use = "na.or.complete")
for(i in 1:ncol(sparse_matrix)) {
  # Loop through the columns for each column
  for(j in 1:ncol(sparse_matrix)) {
    # Fill in placeholder with correlation values[i,j] = getCorrelation(as.matrix(sparse_matrix[,i]),as.matrix(sparse_matrix[,j]))
5.  Inspect few rows and columns of sparse_matrix and
##    358 642 1372 1558 1878 
## 1   NA  NA   NA 4.60   NA 
## 2   NA  NA   NA   NA   NA  
## 3   NA  NA   NA   NA   NA 
## 4  4.1  NA   NA 4.10   NA 
## 5   NA  NA   NA   NA   NA[1:5,1:10]
##             V1           V2          V3         V4           V5         V6
## 1   1.00000000 -0.361978529  0.10304733 0.22551219  0.653382807 0.14079738
## 2  -0.36197853  1.000000000 -0.46582101 0.75224386  0.007153734 0.07308733
## 3   0.10304733 -0.465821006  1.00000000 0.14425995  0.346162998 0.15152294
## 4   0.22551219  0.752243865  0.14425995 1.00000000  0.311272406 0.18850740
## 5   0.65338281  0.007153734  0.34616300 0.31127241  1.000000000 0.04889332
##             V7        V8 V9         V10
## 1   0.09173101 0.3726164 NA  0.27178957
## 2   1.00000000        NA NA  0.17957908
## 3   0.16156663 0.7247839 NA -0.05265814
## 4   0.30142459 0.9134926 NA  0.09969213
## 5  -0.01595918        NA NA  0.07393889
Once we have generated the similarity matrix, we can then build out a neighborhood matrix (p*11) where the row indexes are the unique beer names and for each row, we will list out the 10 closest beers based on correlation values in descending order. = matrix(NA, nrow=ncol(,ncol=11,dimnames=list(colnames(
for(i in 1:ncol(sparse_matrix)) 
{[i,] = (t(head(n=11,rownames([order([,i],decreasing=TRUE),][i]))))
The following functions & commands are more for boiler plate stuff. Once we have generated the neighborhood matrix, we want to lookup against the beerLookup table created above to derive the beer names
as.numeric.factor = function(x) {as.numeric(levels(x))[x]}

beer_id_to_name = function(beerlist){
  #beerLookup[beerLookup$beer_beerid %in% beerlist,1]
  beer_names =
  names(beer_names) = "beer_beerid"
  beer_names = merge(beerLookup,beer_names)[2]
} = cbind(,
names( = c("beer_beerid","RowIndex")
beer.list = as.numeric.factor([,1])
row.names( = beer_id_to_name(beer.list)

cellLookup = function(cell_list){
  cell_list =
  names(cell_list) = "RowIndex"
  cell_list$RowIndex = as.character(paste('V',cell_list$RowIndex,sep=''))
  beer_id = as.numeric.factor(merge(cell_list,[,2])
}[,1] = beer_id_to_name(cellLookup([,1]))[,2] = beer_id_to_name(cellLookup([,2]))[,3] = beer_id_to_name(cellLookup([,3]))[,4] = beer_id_to_name(cellLookup([,4]))[,5] = beer_id_to_name(cellLookup([,5]))[,6] = beer_id_to_name(cellLookup([,6]))[,7] = beer_id_to_name(cellLookup([,7]))[,8] = beer_id_to_name(cellLookup([,8]))[,9] = beer_id_to_name(cellLookup([,9]))[,10] = beer_id_to_name(cellLookup([,10]))[,11] = beer_id_to_name(cellLookup([,11]))
We can now look at the neighborhood matrix with the beer names populated. In the left most column, we have all the unique beer names.  For each record, the screenshot below shows 4 other most similar beers based on similarity score.

Last but not the least, we want to recommend some awesome beers for Mr Data Wrangler based on some of the beers he has given a rating of >=4 . 
# Suggestions based on Data Wranglers past rated beers with Overall Weighter score >= 4

wranglerBeers = beer_id_to_name(beerRatings.df[beerRatings.df$review_profilename == "Data Wrangler" & beerRatings.df$weightedOverallReview >= 4 ,2])
## [1] "Pliny The Elder" "Maharaja"        "Mongo"           "1000 IBU"
wranglerSuggestions =[row.names( %in% wranglerBeers,]

In the screenshot below, columns 6 through 11 have been split across multiple rows. please bear with the not such a nice output. I tried to best customize the Blogger HTML template to read in RMD ( R Markdown ) HTML template.

If you are a Craft beer enthusiast, hopefully you enjoyed reading this blog.  Be happy if you have any suggestions or analysis that you would want to see.

Thanks for visiting ! Ciao until the next posting.



Popular posts from this blog