Been a couple of months since the last post...
Moving on from the Beer Recommendation post on my previous blog, this time the topic is food. The data set comes from the Kaggle competition "What's Cooking" and consists of approximately 40,000 recipes spanning 20 global cuisines.
How am I using this data?
In this post I will begin with the data preparation steps, such as standardizing ingredient names, reshaping, and aggregating, to make the data easy to consume downstream. I will then move into Gephi for network graph visualization to understand the connections between different elements, i.e. how different ingredients map to different cuisines and how often they are used. I know "Garam-Masala" will be the most important predictor for Indian cuisine, however my confidence intervals are pretty wide when it comes to something like "Anchovies".
Hopefully, by the end, you & I will learn the art of data driven cooking.
Tools used: Trifacta (data wrangling & preparation) and Gephi (network graph visualization)
Stage 1 : Let's Wrangle the Data !
1) The training data-set is a pretty-printed JSON file with a bunch of extra tabs & line-breaks, which makes it human readable but not ideal for reading directly into any of the programs.
2) Load this JSON file into Trifacta. I see a structure similar to the raw data shown above. However, I can quickly go through a few steps of cleaning the empty spaces and extra characters by interacting with the data, whereby I get suggestions as to how I might transform it.
My goal in this step is to create a JSON structure where every recipe is folded into a single record and is represented as a series of key-value pairs separated by commas. The reason for doing so is that Trifacta can then automatically infer that structure as a Map type, which can then be un-nested into individual fields at the click of a button.
As shown in the image below, the right hand side column is inferred as a Map type with 3 nested keys, namely id, cuisine, and ingredients.
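For readers without Trifacta, the folding step can be roughly sketched in plain Python. This is only an illustration of the idea, not what Trifacta does internally; the function name `to_json_lines` and the two sample recipes are made up for the example.

```python
import json

def to_json_lines(pretty_json_text):
    """Fold a pretty-printed JSON array of recipes into one compact record per line."""
    recipes = json.loads(pretty_json_text)
    # separators=(",", ":") removes the extra spaces json.dumps would otherwise add
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in recipes)

# Hypothetical sample in the same shape as the competition file (id, cuisine, ingredients)
sample = """
[
  {
    "id": 1,
    "cuisine": "greek",
    "ingredients": [
      "feta cheese",
      "olive oil"
    ]
  },
  {
    "id": 2,
    "cuisine": "indian",
    "ingredients": [
      "garam masala"
    ]
  }
]
"""

lines = to_json_lines(sample).split("\n")
for line in lines:
    print(line)
```

Each printed line is one recipe with its three keys, which is the shape that gets inferred as a Map in the step above.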
3) After un-nesting column2 (the right side column in the above image), the three fields are structured as columns as shown below. Straight out of the box, the ingredients field is also inferred as an array object. My goal is to create a single row for each ingredient, keeping the id and cuisine field combination as the ID. In other words, I will turn my data from wide to long, so the number of rows will increase. This lets me easily perform joins on those ingredients later in the process, and also perform updates & filters based on the values in a single column instead of looking across multiple columns.
4) Filter out some very common ingredients like water, salt, and sugar, which occur pretty uniformly across all cuisines and would not provide any additional information to cluster/classify the cuisines.
5) We now standardize the ingredient names so we can bucket them into fewer categories. This involves consolidating similarly spelled names. As shown in the screenshot below, Chicken Breasts appears in 4-5 different forms. Within Trifacta, I can use the standardize option, which does this task. I have control over the groups being wider or narrower, and I can also manually assign values that were not automatically grouped to larger groups.
6) Once I have prepared this data structurally and cleansed it to the best of my knowledge, I can go ahead and do some visual profiling, where I get crucial information such as the outliers. On average, the string length of ingredient names is about 11-12 characters. However, there are some ingredients whose length is around 65-70 characters, which is evident from the long right tail. In the string length outliers section, I can see some of those names, such as "Bertolli vineyard premium collection red vinaigrette" and a few more. I could then edit these to something more standard, or maybe delete them if there are only 1-2 of them. The thought is that if I have 40,000 recipes almost evenly distributed between 20 cuisines, and only 1 of them uses this particular ingredient, there is no way my predictive model will pick up this signal to predict the cuisine type. Further, I will add a filter on the edges/connections so that an ingredient must be used at least 3 times within a cuisine for that connection to be kept.
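The wide-to-long reshape described above can be sketched in a few lines of Python. The function name `wide_to_long` and the sample recipes are my own illustration, not something taken from Trifacta.

```python
def wide_to_long(recipes):
    """Turn one row per recipe into one row per (id, cuisine, ingredient)."""
    rows = []
    for recipe in recipes:
        for ingredient in recipe["ingredients"]:
            rows.append((recipe["id"], recipe["cuisine"], ingredient))
    return rows

# Hypothetical sample: two recipes (wide) become three rows (long)
recipes = [
    {"id": 1, "cuisine": "greek", "ingredients": ["feta cheese", "olive oil"]},
    {"id": 2, "cuisine": "indian", "ingredients": ["garam masala"]},
]

long_rows = wide_to_long(recipes)
for row in long_rows:
    print(row)
```

With the ingredient in a single column, filtering or joining on it becomes a one-column operation.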
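I do not know which algorithm Trifacta's standardize option uses under the hood, so the sketch below swaps in a simple "fingerprint" approach as an assumption: lowercase the name, keep only letters, crudely drop a trailing "s", and sort the words. Names with the same fingerprint are grouped and mapped to the most frequent spelling in the group. The function names and sample values are hypothetical.

```python
import re
from collections import Counter, defaultdict

def fingerprint(name):
    """Build a grouping key: lowercase words, naive singular form, sorted."""
    tokens = re.findall(r"[a-z]+", name.lower())
    # Naive singularization (an assumption): drop a trailing "s" from longer words
    tokens = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return " ".join(sorted(set(tokens)))

def standardize(names):
    """Replace every name with the most frequent spelling in its fingerprint group."""
    groups = defaultdict(Counter)
    for name in names:
        groups[fingerprint(name)][name] += 1
    canonical = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [canonical[fingerprint(name)] for name in names]

names = [
    "chicken breasts",
    "chicken breasts",
    "Chicken Breast",
    "chicken-breasts",
    "olive oil",
]
standardized = standardize(names)
print(standardized)
```

This is much cruder than an interactive tool: it will not catch misspellings, and the trailing "s" rule can merge names that should stay apart, so the groups still need a manual review.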
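A rough stand-in for the string length profiling is sketched below. The cutoff of 40 characters is an arbitrary assumption for the example, and `length_profile` is a made-up helper, not a Trifacta feature.

```python
from statistics import mean

def length_profile(names, cutoff=40):
    """Return the average name length and the unique names longer than `cutoff`."""
    unique = sorted(set(names))
    average = mean(len(n) for n in unique)
    outliers = [n for n in unique if len(n) > cutoff]
    return average, outliers

# Hypothetical sample; the long name is the one quoted in the step above
names = [
    "olive oil",
    "soy sauce",
    "garam masala",
    "bertolli vineyard premium collection red vinaigrette",
]

average, outliers = length_profile(names)
print(average, outliers)
```

The flagged names can then be renamed to something more standard or dropped, as described above.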
Some interesting facts & figures so far:
Unique ingredients: 3648. However, some of the ingredients were used only once or twice in the entire data-set, and I will eliminate those from my downstream analysis. I am then left with 3588 unique ingredients that were used 3 or more times. These will be referred to as Nodes in the Network Graph.
Connections: There are 11667 unique connections between an ingredient and a type of cuisine after filtering based on the criteria described earlier. These connections are called Edges in the Network Graph, and they depict the interconnections between ingredients and cuisines.
Cuisine Complexity: On average, Irish, British, and Russian cuisines tend to use fewer ingredients per recipe (approx 5), whereas Thai, French, and Cajun tend to be more involved and use about 13 ingredients per recipe. Indian, Southern US, Jamaican, and Moroccan fall in the middle of the pack with about 9 ingredients per preparation.
The visual profiler also helps me understand the interaction between different fields of data. In the image below, I am focusing on the ingredients in the main frame, however when I click on Greek as the cuisine type in the left frame, the ingredients typically used in Greek cuisine get highlighted.
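The cuisine complexity figure is just the average number of ingredients per recipe, grouped by cuisine. A minimal sketch, with a made-up function name and toy recipes (the numbers here are not from the real data):

```python
from collections import defaultdict

def avg_ingredients_per_cuisine(recipes):
    """Average ingredient count per recipe, keyed by cuisine."""
    totals = defaultdict(lambda: [0, 0])  # cuisine -> [ingredient total, recipe total]
    for recipe in recipes:
        totals[recipe["cuisine"]][0] += len(recipe["ingredients"])
        totals[recipe["cuisine"]][1] += 1
    return {cuisine: ing / rec for cuisine, (ing, rec) in totals.items()}

# Toy recipes for illustration only
recipes = [
    {"id": 1, "cuisine": "irish", "ingredients": ["potatoes", "butter"]},
    {"id": 2, "cuisine": "irish", "ingredients": ["cabbage", "bacon", "potatoes", "butter"]},
    {"id": 3, "cuisine": "thai", "ingredients": ["fish sauce", "lime", "chili", "basil", "garlic"]},
]

complexity = avg_ingredients_per_cuisine(recipes)
print(complexity)
```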
7) The network graph that I plan to create in the second stage requires a couple of different data-sets. It will be a heterogeneous directed graph where the nodes contain cuisine types as well as the ingredients, and the edges are the connections between cuisines and ingredients.
In order to create the edges, I group by cuisine type and ingredient and use count() as the value. This count shows me the strength of the connection/edge. In the image below, I see that olive oil was used in 138 Italian recipes within the sample shown. Not surprising eh...
group by Ingredients,count() UNION group by Cuisines,count()
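Here is how I read the two aggregations, sketched in Python on the long (id, cuisine, ingredient) rows. Treating the UNION line as the node list, and pointing each edge from ingredient to cuisine, are my assumptions; the function names and toy rows are made up.

```python
from collections import Counter

def build_edges(rows):
    """Edge weight = number of rows per (ingredient, cuisine) pair."""
    return Counter((ingredient, cuisine) for _, cuisine, ingredient in rows)

def build_nodes(rows):
    """Node list = ingredient counts UNION cuisine counts, tagged with a type."""
    ingredient_counts = Counter(ingredient for _, _, ingredient in rows)
    cuisine_counts = Counter(cuisine for _, cuisine, _ in rows)
    nodes = [(name, "ingredient", n) for name, n in ingredient_counts.items()]
    nodes += [(name, "cuisine", n) for name, n in cuisine_counts.items()]
    return nodes

# Toy rows: (recipe id, cuisine, ingredient)
rows = [
    (1, "italian", "olive oil"),
    (1, "italian", "garlic"),
    (2, "italian", "olive oil"),
    (3, "greek", "olive oil"),
]

edges = build_edges(rows)
nodes = build_nodes(rows)
print(edges)
print(nodes)
```

The two results correspond to the edge table and node table that a graph tool such as Gephi expects as separate inputs.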
This concludes my data prep stage for now.
Stage 2 : Network Graphs. Identifying how the elements interact in space
I will be using Gephi to look at the components and the connections in space. Helpful to see how clusters form.
Here is the very first representation without any filtering. I have set the node size & color as a function of overall degree (incoming connections + outgoing connections), and it's pretty obvious that Mexican, Italian, French, and Southern US lead the pack. This is a heterogeneous graph in the sense that my nodes contain both the ingredients and the recipe (I have labeled the recipe with the Cuisine Name). Ingredients have only outgoing connections, whereas Cuisines have only incoming connections.
Some of the more commonly used ingredients are visible as red dots appearing in the center.
We now turn some filtering on, where I will keep only the edges that have a weight of 3 or more, i.e. keep only the connections between an ingredient and a cuisine type if they happen 3 or more times.
I will also filter on the nodes and only keep the ones that have a degree of 20 or more, i.e. that particular ingredient is used at least 20 times in the overall data set. The starting numbers that I am using to filter are somewhat arbitrary, with a little bit of intuition. It all depends on how dense I want to keep the graph.
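The same two filters can be sketched outside Gephi. Since the node filter is described as "used at least 20 times", I am assuming a weighted degree (the sum of an ingredient's edge weights), computed after the edge filter, and I only filter ingredient nodes. Apart from the 138 for olive oil and Italian, the weights below are invented for the example.

```python
from collections import Counter

def filter_graph(edges, min_weight=3, min_degree=20):
    """Keep edges with weight >= min_weight, then ingredients with enough total weight."""
    kept = {edge: w for edge, w in edges.items() if w >= min_weight}
    # Weighted degree per ingredient, over the surviving edges only (an assumption)
    degree = Counter()
    for (ingredient, _cuisine), w in kept.items():
        degree[ingredient] += w
    return {(i, c): w for (i, c), w in kept.items() if degree[i] >= min_degree}

# (ingredient, cuisine) -> weight; toy numbers
edges = {
    ("olive oil", "italian"): 138,
    ("olive oil", "greek"): 40,
    ("anchovies", "italian"): 4,
    ("anchovies", "greek"): 2,
    ("saffron", "spanish"): 25,
}

filtered = filter_graph(edges)
print(filtered)
```

Raising `min_weight` and `min_degree` thins the graph out; lowering them keeps it dense.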
In the above two network graphs, I had to manually position the nodes representing different cuisines on the perimeter and there is no particular order. I can see that each of those cuisines is connected to a bunch of ingredients.
Let's now use some algorithms and see if we can create clusters of cuisines that should sit close together based on shared connections. The more shared connections, the higher the gravitational pull between the nodes. I will be using an algorithm called Force Atlas within Gephi, and I change the color theme slightly so that node colors indicate whether a node is a cuisine or an ingredient, and node size is a function of incoming connections (in-degree).
I am already seeing all the Asian cuisines (Thai, Filipino, Vietnamese, Korean, Chinese) start forming a cluster on the bottom right. Indian and my amigo bros do eat a lot of beans :)
Spanish, Greek, Moroccan cling on well... no surprises, but it's pretty cool to visualize in this manner.
I did not really know that Italian and Southern US have a lot of ingredients in common... so that's a surprise.
Let's zoom in on the above image.
If you happen to like Brazilian, Russian, & Jamaican... Best of luck keeping those ingredients...
Let's filter it down a bit more to only keep the ingredients that have been used at least 50 times overall, and only keep the edges that have a weight of 10 or more. I will now slightly increase the overall repulsion between the nodes so I can remove some of the overlap that's happening. I will focus on the Asian cuisines...
The strongest links between Japanese and Chinese are sesame oil, soy sauce, rice vinegar, oyster sauce... Ginger is the strongest connection between Chinese and Indian...and the list goes on..
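If you want to pull such links out of the edge list rather than reading them off the picture, one possible sketch is below. Ranking a shared ingredient by the smaller of its two edge weights is my own choice of "strength", and the weights are invented for the example.

```python
def strongest_shared(edges, cuisine_a, cuisine_b, top=5):
    """Ingredients linked to both cuisines, ranked by the weaker of the two weights."""
    a = {i: w for (i, c), w in edges.items() if c == cuisine_a}
    b = {i: w for (i, c), w in edges.items() if c == cuisine_b}
    shared = {i: min(a[i], b[i]) for i in a.keys() & b.keys()}
    # Sort by weight descending, then by name so ties come out in a stable order
    return sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

# (ingredient, cuisine) -> weight; toy numbers
edges = {
    ("soy sauce", "japanese"): 50,
    ("sesame oil", "japanese"): 30,
    ("mirin", "japanese"): 40,
    ("soy sauce", "chinese"): 80,
    ("sesame oil", "chinese"): 60,
    ("ginger", "chinese"): 70,
}

links = strongest_shared(edges, "japanese", "chinese")
print(links)
```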
I will let you all do the rest of the exploration.
Thanks for visiting and reading my blog. If you have suggestions on other things that might be interesting to explore and analyze, please feel free to drop a line or two in the comments...