Data & The Art of Beer Recommendation
I have been wanting to write this post for some time now. It continues an earlier post in which I used Trifacta to wrangle the beer reviews data and get it ready for some predictive fun!
The Dataset: Crowdsourced beer reviews from the website BeerAdvocate.com, where beer aficionados from all over the world have rated and critiqued beers. There are close to 1.6 million reviews from 1999 to 2012, spanning almost 66,000 different beers globally.
Task: Recommend some awesome beers for Mr Data Wrangler
Platform: R (hosted on Amazon EC2 m4.2xlarge)
So let the fun begin :)
1. Set the environment and load the required libraries
3. Create a lookup table that will contain Beer Names and Beer IDs.
4. Next we create a subset of the beermerge.df dataframe where we only select BeerId, review_profilename, and weightedOverallReview. I am also renaming the user osusullins to Data Wrangler. osusullins has reviewed about 5-6 beers and his rankings are very similar to how I would have ranked those same beers. So moving forward, I shall assume that whatever recommendations we derive for osusullins apply to Data Wrangler as well.
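Here is a minimal sketch of steps 3 and 4 on a toy data frame. The source column names beer_beerid and beer_name, the result name reviews.df, and the sample beers are assumptions for illustration only; the original code may have looked different.

```r
# Toy stand-in for beermerge.df (column names and beers are hypothetical)
beermerge.df <- data.frame(
  beer_beerid = c(101, 101, 202, 303),
  beer_name = c("Hop Bomb", "Hop Bomb", "Pine Haze", "Citra Storm"),
  review_profilename = c("osusullins", "hopfan", "osusullins", "hopfan"),
  weightedOverallReview = c(4.5, 4.0, 3.8, 4.2),
  stringsAsFactors = FALSE
)

# Step 3: one row per beer, mapping the ID to the name
beerLookup <- unique(beermerge.df[, c("beer_beerid", "beer_name")])
names(beerLookup) <- c("BeerId", "BeerName")

# Step 4: keep only the three columns the recommender needs
reviews.df <- data.frame(
  BeerId = beermerge.df$beer_beerid,
  review_profilename = beermerge.df$review_profilename,
  weightedOverallReview = beermerge.df$weightedOverallReview,
  stringsAsFactors = FALSE
)

# Rename the proxy user so his ratings stand in for Data Wrangler
is_proxy <- reviews.df$review_profilename == "osusullins"
reviews.df$review_profilename[is_proxy] <- "Data Wrangler"
```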
Next we create another lookup table for review_profilename and replace the profile names with their corresponding IDs. We are doing this to keep the data frame compact and clean.
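The profile lookup could be sketched as follows. The names profileLookup and ProfileId are assumptions, and the toy reviews are made up; the idea is simply to assign each distinct reviewer an integer ID and swap it in with match().

```r
# Toy reviews table (values are hypothetical)
reviews.df <- data.frame(
  BeerId = c(101, 101, 202),
  review_profilename = c("Data Wrangler", "hopfan", "Data Wrangler"),
  weightedOverallReview = c(4.5, 4.0, 3.8),
  stringsAsFactors = FALSE
)

# One row per distinct reviewer, numbered in order of first appearance
profiles <- unique(reviews.df$review_profilename)
profileLookup <- data.frame(
  ProfileId = seq_along(profiles),
  review_profilename = profiles,
  stringsAsFactors = FALSE
)

# Replace the profile name column with the numeric ID
reviews.df$ProfileId <- profileLookup$ProfileId[
  match(reviews.df$review_profilename, profileLookup$review_profilename)
]
reviews.df$review_profilename <- NULL
```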
5. Next we filter to keep only the beers that have been reviewed by at least 5 other people. We are doing this to avoid beers that could be a hit or miss, since we don't have enough information about them. There are close to 9500 distinct beers that fall under the American Double / Imperial IPA category. By setting this filter of >= 5 reviews, we remove about 340 beers from the pool.
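A minimal base R sketch of this filter, assuming a reviews table with a BeerId column (the data and the names review_counts, popular_ids, and filtered.df are hypothetical):

```r
# Toy data: beer 101 has 6 reviews, 202 has 2, 303 has 5
reviews.df <- data.frame(
  BeerId = c(rep(101, 6), rep(202, 2), rep(303, 5)),
  ProfileId = c(1:6, 1:2, 2:6),
  weightedOverallReview = c(4.5, 4, 4.2, 3.9, 4.1, 4.4, 3.5, 3, 4, 4.3, 3.8, 4.6, 4.1)
)

# Count reviews per beer and keep the IDs with at least 5
review_counts <- table(reviews.df$BeerId)
popular_ids <- names(review_counts)[review_counts >= 5]

# table() names are character, so compare against the ID as character
filtered.df <- reviews.df[as.character(reviews.df$BeerId) %in% popular_ids, ]
```

Here beer 202 drops out because it has only 2 reviews.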
6. Inspect a few rows and columns of sparse_matrix and data.beer.similarity
##   358 642 1372 1558 1878
## 1  NA  NA   NA 4.60   NA
## 2  NA  NA   NA   NA   NA
## 3  NA  NA   NA   NA   NA
## 4 4.1  NA   NA 4.10   NA
## 5  NA  NA   NA   NA   NA
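A reviewer-by-beer matrix like the one printed above, and the beer-to-beer correlations derived from it, could be built roughly like this. This is a sketch on made-up ratings; the use of tapply() and of Pearson correlation over pairwise-complete observations is an assumption, not necessarily what the original code did.

```r
# Toy long-format ratings (values are hypothetical)
filtered.df <- data.frame(
  ProfileId = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  BeerId = c(101, 202, 303, 101, 202, 303, 101, 202, 303, 101, 202),
  weightedOverallReview = c(4.5, 4.0, 2.0, 4.0, 3.5, 2.5, 3.0, 3.0, 4.5, 5.0, 4.5)
)

# Rows are reviewers, columns are beers; unrated combinations stay NA
sparse_matrix <- tapply(
  filtered.df$weightedOverallReview,
  list(filtered.df$ProfileId, filtered.df$BeerId),
  mean
)

# Correlate every pair of beer columns, using only the reviewers
# who rated both beers in the pair
data.beer.similarity <- cor(sparse_matrix, use = "pairwise.complete.obs")
```

Beers that few reviewers have in common produce NA or unstable correlations, which is one reason the minimum-review filter matters.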
Once we have generated the similarity matrix, we can build out a neighborhood matrix (p*11), where each row corresponds to one unique beer and lists its 10 closest beers in descending order of correlation.
##            V1           V2          V3         V4          V5         V6
## 1  1.00000000 -0.361978529  0.10304733 0.22551219 0.653382807 0.14079738
## 2 -0.36197853  1.000000000 -0.46582101 0.75224386 0.007153734 0.07308733
## 3  0.10304733 -0.465821006  1.00000000 0.14425995 0.346162998 0.15152294
## 4  0.22551219  0.752243865  0.14425995 1.00000000 0.311272406 0.18850740
## 5  0.65338281  0.007153734  0.34616300 0.31127241 1.000000000 0.04889332
##            V7        V8 V9         V10
## 1  0.09173101 0.3726164 NA  0.27178957
## 2  1.00000000        NA NA  0.17957908
## 3  0.16156663 0.7247839 NA -0.05265814
## 4  0.30142459 0.9134926 NA  0.09969213
## 5 -0.01595918        NA NA  0.07393889
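The neighborhood matrix described above could be built along these lines. The function name build_neighborhood and the toy correlation values are hypothetical; the real matrix would use k = 10, which gives the p*11 shape (the beer itself plus its 10 neighbors).

```r
# Toy symmetric correlation matrix for four beers (IDs are hypothetical)
ids <- c("101", "202", "303", "404")
data.beer.similarity <- matrix(
  c( 1.0, 0.8, -0.2, 0.5,
     0.8, 1.0,  0.1,  NA,
    -0.2, 0.1,  1.0, 0.3,
     0.5,  NA,  0.3, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(ids, ids)
)

build_neighborhood <- function(sim, k = 10) {
  k <- min(k, ncol(sim) - 1)
  out <- matrix(NA_character_, nrow = nrow(sim), ncol = k + 1,
                dimnames = list(rownames(sim), NULL))
  for (i in seq_len(nrow(sim))) {
    scores <- sim[i, ]
    scores[i] <- NA  # exclude the beer itself from its own neighbors
    # sort() drops NA values, so beers without a correlation are skipped
    ranked <- names(sort(scores, decreasing = TRUE))
    top <- head(ranked, k)
    out[i, 1] <- rownames(sim)[i]
    out[i, 1 + seq_along(top)] <- top
  }
  out
}

neighborhood <- build_neighborhood(data.beer.similarity, k = 2)
```

Rows with fewer than k valid correlations are padded with NA on the right.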
The following functions and commands are mostly boilerplate. Once we have generated the neighborhood matrix, we want to look up each entry against the beerLookup table created above to derive the beer names.
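One way to do this lookup is sketched below. The helper name ids_to_names, the BeerId and BeerName column names, and the sample beers are assumptions for illustration.

```r
# Toy lookup table and ID-based neighborhood matrix (values are hypothetical)
beerLookup <- data.frame(
  BeerId = c(101, 202, 303),
  BeerName = c("Hop Bomb", "Pine Haze", "Citra Storm"),
  stringsAsFactors = FALSE
)

neighborhood <- matrix(
  c("101", "202", "303",
    "202", "101", "303",
    "303", "202", NA),
  nrow = 3, byrow = TRUE
)

# Translate every cell from beer ID to beer name, keeping the matrix shape
ids_to_names <- function(id_matrix, lookup) {
  named <- lookup$BeerName[
    match(as.character(id_matrix), as.character(lookup$BeerId))
  ]
  matrix(named, nrow = nrow(id_matrix), ncol = ncol(id_matrix))
}

neighborhood.names <- ids_to_names(neighborhood, beerLookup)
```

Cells that are NA, or whose ID is missing from the lookup, stay NA.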
We can now look at the neighborhood matrix with the beer names populated. The leftmost column holds all the unique beer names. For each record, the screenshot below shows the 4 most similar beers based on similarity score.
Last but not least, we want to recommend some awesome beers for Mr Data Wrangler based on the beers he has rated >= 4.
##  "Pliny The Elder" "Maharaja" "Mongo" "1000 IBU"
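The recommendation step could look roughly like this: take the beers the user rated 4 or higher, collect their neighbors, and drop anything the user has already reviewed. The function name recommend_beers and the toy beers are hypothetical, and the original may have ranked or trimmed the candidates differently.

```r
# Toy name-based neighborhood matrix: column 1 is the beer, the rest its neighbors
neighborhood.names <- matrix(
  c("Hop Bomb",    "Pine Haze",   "Citra Storm",
    "Pine Haze",   "Hop Bomb",    "Malt Anchor",
    "Citra Storm", "Malt Anchor", "Hop Bomb",
    "Malt Anchor", "Citra Storm", "Pine Haze"),
  nrow = 4, byrow = TRUE
)

# Ratings already given by the user (hypothetical)
user_ratings <- c("Hop Bomb" = 4.5, "Malt Anchor" = 3.0)

recommend_beers <- function(ratings, neighbors, min_rating = 4) {
  liked <- names(ratings)[ratings >= min_rating]
  # Neighbor columns of every liked beer
  rows <- neighbors[neighbors[, 1] %in% liked, -1, drop = FALSE]
  # Flatten row by row so closer neighbors come first
  candidates <- unique(as.vector(t(rows)))
  candidates <- candidates[!is.na(candidates)]
  # Do not recommend beers the user has already reviewed
  setdiff(candidates, names(ratings))
}

recommendations <- recommend_beers(user_ratings, neighborhood.names)
```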
If you are a craft beer enthusiast, hopefully you enjoyed reading this post. I would be happy to hear any suggestions, or about any analysis you would like to see.
Thanks for visiting! Ciao until the next post.