

Showing posts from 2015

Good Food + Data. I can't ask for more

Been a couple of months since the last post...
Moving on from the Beer Recommendation topic of my previous blog post, this time the topic is food. The data set comes from the Kaggle competition "What's Cooking" and consists of approximately 40,000 recipes spanning 20 global cuisines.
How am I using this data?
In this post I will begin with data preparation steps, such as standardizing ingredient names, reshaping the data, and computing aggregations, so that the data is easy to consume downstream. I will then move into Gephi, where I will use network graph visualization to understand the connections between different elements, i.e. how different ingredients map to different cuisines and how often they are used. I know "Garam Masala" will be an important predictor for Indian cuisine, but my confidence intervals are pretty wide when it comes to something like "Anchovies".
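As a rough sketch of what those preparation steps might look like in R, here is a toy example of standardizing ingredient names and aggregating ingredient-by-cuisine counts. The column names and the tiny sample data frame are my own illustration, not the actual Kaggle "What's Cooking" schema.

```r
library(reshape2)

# Hypothetical long-format recipe data (stand-in for the Kaggle set)
recipes <- data.frame(
  id         = c(1, 1, 2, 2, 3),
  cuisine    = c("indian", "indian", "italian", "italian", "indian"),
  ingredient = c("Garam Masala", "onions", "anchovies", "Olive Oil", "garam-masala"),
  stringsAsFactors = FALSE
)

# Standardize ingredient names: lower-case and collapse separators,
# so "Garam Masala" and "garam-masala" become the same token
recipes$ingredient <- gsub("[-_ ]+", " ", tolower(recipes$ingredient))

# Aggregate: how often each ingredient appears in each cuisine
freq <- dcast(recipes, ingredient ~ cuisine,
              fun.aggregate = length, value.var = "id")
print(freq)
```

A table like `freq` (ingredient vs. cuisine counts) is also a convenient edge list to feed into Gephi for the network visualization step.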

Hopefully, by the end, you & I will learn the art of data …
Data & The Art of Beer Recommendation 

I have been wanting to write this blog for some time now. It is a continuation of an earlier post, where I used Trifacta to wrangle the beer reviews data and make it ready for some predictive fun!

The Dataset: Crowd-sourced beer reviews from a website where beer aficionados from all over the world have rated and critiqued beers. There are close to 1.6 million reviews from 1999 to 2012, spanning almost 66,000 different beers globally.

Task: Recommend some awesome beers for Mr Data Wrangler

Platform: R (hosted on an Amazon EC2 m4.2xlarge instance)

So let the fun begin :)

1. Set the environment and load the required libraries

library(ggplot2)
library(data.table)
library(reshape2)
library(reshape)
library(Matrix)
library(dummies)
library(plyr)
setwd("/home/vulcan/Python")
load("beermerge.Rda")

2. Calculate the weighted average review score and filter to select only the beers t…
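The snippet above ends mid-step, but the idea of a weighted average review score can be sketched independently. The example below uses a toy data frame with made-up column names (`beer_name`, `review_overall`); the real `beermerge` schema is not shown in the post, so treat this as an assumption. It applies a Bayesian-style weighted average, so beers with few reviews are pulled toward the global mean rather than topping the list on one lucky 5-star rating.

```r
# Toy stand-in for the loaded beermerge data frame (hypothetical columns)
beermerge <- data.frame(
  beer_name      = c("A", "A", "B"),
  review_overall = c(4.5, 4.0, 3.0),
  stringsAsFactors = FALSE
)

C <- mean(beermerge$review_overall)  # global mean score across all reviews
m <- 1                               # minimum-review weight (tuning parameter)

# Per-beer review count and plain mean
agg <- aggregate(review_overall ~ beer_name, data = beermerge,
                 FUN = function(x) c(n = length(x), avg = mean(x)))
scores <- data.frame(beer_name = agg$beer_name,
                     n   = agg$review_overall[, "n"],
                     avg = agg$review_overall[, "avg"])

# Weighted score: blend each beer's mean with the global mean,
# in proportion to how many reviews it has
scores$weighted <- (scores$n / (scores$n + m)) * scores$avg +
                   (m / (scores$n + m)) * C
print(scores)
```

With this scheme, a beer rated 5.0 by a single reviewer no longer outranks a beer averaging 4.5 across thousands of reviews, which is usually what you want before filtering down to a recommendation shortlist.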