Thanks to built-in data formats like the h-recipe microformat and the schema.org Recipe schema, many of the recipes published on the web are semantically marked up. Even better, there is a Ruby gem called hangry that parses these formats. In a short time, I had transformed the recipes into structured data.

What interested me most was the ingredients, and here I found my next problem: I had lists of human-readable ingredients, but nothing structured enough to compare quantities, find similarities, or convert units.

Ingredients are difficult

The first examples I looked at seemed pretty simple:


  “2 tablespoons of butter.”

  “2 tablespoons of flour.”

  “1/2 cup of white wine.”

  “1 cup of chicken broth.”


It seemed like a clear pattern was emerging, and maybe a single line of Ruby would be enough:

quantity, unit, name = description.split(" ", 3)

Regrettably, reality was much more complex. I found more and more examples that did not fit this simple pattern. A few ingredients had multiple quantities that had to be combined (“3 cups and 2 spoons”, or “2 packs of 10 ounces”); some gave alternative quantities in metric and imperial, or in both cups and ounces; others followed the ingredient name with preparation instructions, or listed several ingredients together in the same item.
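To make the failure concrete, here is a hypothetical sketch of that naive three-way split. It handles the simple cases, but a compound quantity lands in the wrong field entirely:

```ruby
# Naive parse: assume every description is "<quantity> <unit> <name>".
def naive_parse(description)
  quantity, unit, name = description.split(" ", 3)
  { quantity: quantity, unit: unit, name: name }
end

naive_parse("2 tablespoons of butter")
# => { quantity: "2", unit: "tablespoons", name: "of butter" }

# A compound quantity breaks the pattern: "2 spoons" ends up in the name.
naive_parse("3 cups and 2 spoons of flour")
# => { quantity: "3", unit: "cups", name: "and 2 spoons of flour" }
```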

The special cases piled higher and higher, and my simple Ruby code became more and more tangled. I stopped feeling comfortable with the code, then I stopped believing that refactoring would save it, and eventually I threw it away.

I needed a whole new plan.

Named entity recognition

This seemed like the perfect problem for supervised machine learning: I had lots of data I wanted to categorize; categorizing a single example manually was easy enough; but manually identifying a general pattern was at best difficult, and at worst impossible.

When I considered my options, named entity recognition seemed the right tool. Named entity recognizers identify predefined categories in text; in my case, I wanted one to recognize the names, quantities, and units of ingredients.

I decided on the Stanford NER, which uses a conditional random field (CRF) sequence model. To be honest, I don’t understand the math behind this particular type of model, but you can read the paper if you want all the gory details. What mattered to me was that I could train this NER model on my own data set.

The process I had to follow to train my model was based on Jane Austen’s example from the Stanford NER FAQ.

Training the model

The first thing I did was to collect my sample data. Within a single recipe, the way the ingredients are written is quite uniform. I wanted to make sure I had a good range of formats, so I combined the ingredients from about 30,000 online recipes into a single list, shuffled them, and chose the first 1,500 for my training set.
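That sampling step can be sketched in a few lines of Ruby (the function name and the fixed seed are my own choices, not part of any library):

```ruby
# Shuffle the combined ingredient lines and keep a fixed-size sample
# for hand labeling. Seeding the shuffle makes the sample reproducible.
def sample_training_set(lines, size: 1500, seed: 42)
  lines.map(&:strip)
       .reject(&:empty?)
       .shuffle(random: Random.new(seed))
       .first(size)
end
```

Given the full list of scraped ingredient lines, `sample_training_set(lines)` returns the 1,500 lines to label.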

It looked something like this:

sugar to dust the cake

1 cup and 1/2 cup of diced smoked salmon

1/2 cup whole almonds (3 oz), toasted

Then I used part of Stanford’s suite of NLP tools to split the descriptions into tokens.

The following command reads text from standard input and writes the tokens to standard output:

java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer

In this case, I wanted to build a model that works on a single ingredient description, not a complete list of ingredient descriptions. In NLP terms, this means each ingredient description must be treated as a separate document. To represent this to Stanford’s NER tools, we need to separate each set of tokens with a blank line.

I added the separators with a small shell script:

while read line; do

  echo "$line" | java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer >> train.tok

  echo >> train.tok

done < train.txt

Several hours of hand-labeling in vim later, the results looked something like this:

confectioners  NAME

'              NAME

sugar          NAME

for            O

dusting        O

the            O

cake           O

1 1/2          QUANTITY

cups           UNIT

diced          O

smoked         NAME

salmon         NAME

1/2            QUANTITY

cup            UNIT

whole          O

almonds        NAME

-LRB-          O

3              QUANTITY

oz             UNIT

-RRB-          O

,              O

toasted        O

Now that the training set was finished, I could build the model:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \

  -trainFile train.tsv \

  -serializeTo ner-model.ser.gz \

  -prop train.prop

The train.prop file I used was very similar to the Stanford NER FAQ’s example file, austen.prop.
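For reference, a properties file in the style of austen.prop looks roughly like this; the feature flags below follow the FAQ’s example, and are not necessarily the exact set I ended up with:

```
# Columns in the TSV: the token, then its label
map = word=0,answer=1

trainFile = train.tsv
serializeTo = ner-model.ser.gz

# Feature flags along the lines of the Stanford NER FAQ example
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
```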

Testing the model

One of the downsides of machine learning is that it is a bit opaque. I knew I trained a model, but I didn’t know how accurate it would be. Fortunately, Stanford provides test tools to let you know how well your model can generalize to new examples.

I took about 500 more random examples from my data set and went through the same fascinating process of manually labeling tokens. I now had a test set I could use to validate my model. The accuracy measurements would be based on how the token labels produced by the model differed from the token labels I had written by hand.

I tested the model using this command:


java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier \

  -loadClassifier ner-model.ser.gz \

  -testFile test.tsv

This test command outputs the test data with the label I’d given each token and the label the model predicted for each token, followed by a summary of the accuracy:

CRFClassifier tagged 4539 words in 514 documents at 3953.83 words per second.

         Entity       P       R      F1    TP   FP   FN

           NAME  0.8327  0.7764  0.8036   448   90  129

       QUANTITY  0.9678  0.9821  0.9749   602   20   11

           UNIT  0.9501  0.9630  0.9565   495   26   19

         Totals  0.9191  0.9067  0.9129  1545  136  159

The column headings are a bit opaque, but they are standard machine learning metrics that make good sense with a little explanation.

P is precision: the number of tokens of a given type that the model correctly identified, out of the total number of tokens the model predicted to be of that type. 83% of the tokens the model identified as NAME tokens really were NAME tokens, 97% of the tokens it identified as QUANTITY tokens really were QUANTITY tokens, and so on.

R is recall: the number of tokens of a given type that the model correctly identified, out of the total number of tokens of that type in the test set. The model found 78% of the NAME tokens, 98% of the QUANTITY tokens, and so on.

F is the F1 score, which combines precision and recall. A model can score well on one of these alone while still being quite poor: if a model labeled every single token as NAME, it would get an excellent recall score for NAME (and a terrible precision score). Combining the two into an F1 score gives a single number that is more representative of overall quality.

TP, FP and FN are true positives, false positives and false negatives respectively.
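As a sanity check, the NAME row can be reproduced from its raw counts:

```ruby
# Precision, recall, and F1 computed from true positive,
# false positive, and false negative counts.
def precision(tp, fp)
  tp.to_f / (tp + fp)
end

def recall(tp, fn)
  tp.to_f / (tp + fn)
end

def f1(p, r)
  2 * p * r / (p + r)
end

p_name = precision(448, 90)        # => 0.8327 (rounded)
r_name = recall(448, 129)          # => 0.7764 (rounded)
f_name = f1(p_name, r_name)        # => 0.8036 (rounded)
```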

Using the model

Now that I had a model, and confidence that it was reasonably accurate, I could use it to classify new examples that were not in the training or test sets.

Here is the command to run the model:

$ echo "1/2 cup of flour" | \

  java -cp stanford-ner/stanford-ner.jar \

  edu.stanford.nlp.ie.crf.CRFClassifier \

  -loadClassifier ner-model.ser.gz \

  -readStdin


Invoked on Wed Sep 27 08:18:42 EDT 2017 with arguments: -loadClassifier

ner-model.ser.gz -readStdin



Loading classifier from ner-model.ser.gz … done [0.3 sec].

1/2/QUANTITY cup/UNIT of/O flour/NAME

CRFClassifier tagged 4 words in 1 documents at 18.87 words per second.

The log lines are written to standard error, so they can be silenced to leave just the tagged output:

$ echo "1/2 cup of flour" | \

  java -cp stanford-ner/stanford-ner.jar \

  edu.stanford.nlp.ie.crf.CRFClassifier \

  -loadClassifier ner-model.ser.gz \

  -readStdin 2>/dev/null

1/2/QUANTITY cup/UNIT of/O flour/NAME
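The slash-tagged output is easy to fold back into a structure. A sketch of one way to do it (collecting tokens by label is my own convention, not part of the Stanford tools):

```ruby
# Turn "1/2/QUANTITY cup/UNIT of/O flour/NAME" into a hash,
# collecting the tokens that share each label.
def parse_tagged(line)
  result = Hash.new { |h, k| h[k] = [] }
  line.split.each do |pair|
    parts = pair.split("/")
    # The label is the last slash-separated part; everything
    # before it is the token (which may itself contain "/").
    token = parts[0..-2].join("/")
    label = parts[-1]
    result[label] << token unless label == "O"
  end
  result.transform_values { |tokens| tokens.join(" ") }
end

parse_tagged("1/2/QUANTITY cup/UNIT of/O flour/NAME")
# => {"QUANTITY"=>"1/2", "UNIT"=>"cup", "NAME"=>"flour"}
```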

Iterating on the model

Even with these seemingly high F1 scores, the model was only as good as its training set. When I went back and ran my full set of ingredient descriptions through the model I quickly discovered some flaws.

The most obvious problem was that the model could not recognize fluid ounces as a unit. When I looked back at the training set and the test set, there was not a single example of fluid ounces, whether spelled out or abbreviated as fl oz.

My random sample was not large enough to truly represent the data.

I selected additional training and test examples, taking care to include various representations of fluid ounces in my training and test sets. The updated model scored similarly on the updated test sets and had no more problems with fluid ounces.
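One cheap guard against that kind of gap is to check a sample for the spellings you care about before sitting down to label it. A hypothetical helper, not part of the original pipeline:

```ruby
# Count how many sampled lines mention each unit spelling, so that
# missing variants (like "fl oz") are obvious before hand-labeling.
def unit_coverage(lines, spellings)
  spellings.to_h do |unit|
    [unit, lines.count { |line| line.downcase.include?(unit) }]
  end
end

sample = ["1/2 cup of flour", "4 fl oz of milk", "2 cups sugar"]
unit_coverage(sample, ["cup", "fl oz", "tablespoon"])
# => {"cup"=>2, "fl oz"=>1, "tablespoon"=>0}
```

A zero count for a unit you expect to see is a strong hint that the random sample needs topping up.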