Varietal Features

as Revealed by the Analysis of On-line Wine Reviews

 

A project for Algonquin College’s Sommelier program

 

Jim Elder

 

October 21, 2000


_______________________

 

 

Each issue of the Wine Spectator magazine reviews several hundred wines, with tasting notes and ratings, mostly from blind tastings.  Thousands of these reviews are available at the Wine Spectator web site via the internet, at http://www.winespectator.com, where reviews can be sorted by varietal and other criteria.  A typical wine review reads:

 

Merlot: Sonoma

$60; Rated: 94/100

 

Wonderful richness and fruit definition, with spicy, toasty oak, ripe plum and black cherry, black olive, herb and sage flavors lingering on the long, complex aftertaste. Best from 1999 through 2006. (500 cases produced)

 

Do wines of each variety have characteristic features?  If so, what features are associated with which varietals?  How different are varietal wines of the major grapes?

 

In this project, over 8,500 reviews of ten varietal wines (four white and six red) are analysed to determine whether there is a characteristic vocabulary associated with each varietal.

 

 

Method

 

1) Collect reviews.  The Wine Spectator web site was selected as the most suitable source of wine reviews, for the following reasons:

·     large number of reviews available (several thousand), providing sufficient numbers to reveal patterns

·     the professed quality of the reviews (blind tasting, controlled conditions, etc) from a team of several reviewers (reducing “reviewer bias”)

·     an emphasis on varietal wines (typically New World) rather than blends

·     standardized format of reviews (making it easier to compare reviews)

·     online (making them easy to process)

·     their reviews are from the last two years, reducing variance due to changing standards over time

 

Wine Spectator makes its database available via a web page interface, and by direct query.  By direct query, it is possible to capture reviews of only unblended varietal wines.

 

2) Parse each review into salient words and phrases.  The text of all reviews is broken into individual words so that the frequency of each word can be counted.  Common words (such as ‘the’, ‘and’, ‘is’, and ‘a’) and punctuation are eliminated.  Adjacent words are linked into word-pairs and counted.  Word-pairs that are relevant, such as “toasty oak”, are designated as single words (to facilitate later analysis) by concatenating them with an underscore, e.g., “toasty_oak”.  This prevents phrases like ‘green pepper’ from contributing to the count for ‘pepper’.
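The parsing step described above can be sketched in Python as follows. This is only an illustration, not the original program: the ignore list and word-pair list are hypothetical stand-ins for the lists built up interactively, and splitting at punctuation so that pairs never span a comma is an assumption.

```python
import re
from collections import Counter

# Hypothetical lists; the original program's lists were built up interactively.
IGNORE = {"the", "and", "is", "a", "with", "on"}
WORD_PAIRS = {("toasty", "oak"), ("black", "cherry"), ("green", "pepper")}

def segments(text):
    """Split text at punctuation and return a list of lower-case word lists."""
    return [re.findall(r"[a-z]+", seg) for seg in re.split(r"[,.;:()]", text.lower())]

def join_pairs(words, pairs=WORD_PAIRS):
    """Replace each designated adjacent pair with one underscore-joined token."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            out.append(words[i] + "_" + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

def parse_review(text, ignore=IGNORE, pairs=WORD_PAIRS):
    """Return the salient tokens of one review: pairs joined, ignored words dropped."""
    tokens = []
    for words in segments(text):
        tokens.extend(w for w in join_pairs(words, pairs) if w not in ignore)
    return tokens

def word_frequencies(reviews):
    """Count how often each token occurs across a list of review texts."""
    freq = Counter()
    for review in reviews:
        freq.update(parse_review(review))
    return freq
```

With these lists, a phrase such as "green pepper" is counted once under green_pepper and adds nothing to the count for pepper, which is the behaviour the underscore convention is meant to give.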

 

This is a screen shot of the program that parses reviews:

 

 

The first box lists the files containing reviews downloaded from Wine Spectator.  Clicking on a .txt file name causes that file to be analysed.  Above, the analysis for merlot.txt is displayed.  The middle-left box lists the regions of the wines in the reviews; 198 of the wines came from Napa, CA.  The lower-left boxes list words that are ignored, made into word-pairs, and considered as synonyms (for the purpose of the list in the middle, not subsequent analysis).  The list in the middle shows the frequency of single words, word-pairs, and synonyms (for example, black_cherry* includes black_cherry and dark_cherry).  The frequencies of potential word-pairs (ones not previously selected as relevant) are listed in the right-most box.  Clicking on a single word in the middle list causes it to be added to the ‘ignore’ list, and clicking on a word-pair in the right list causes it to be added to the ‘word-pair’ list, for subsequent runs.  By repeatedly running the program and selecting words and word-pairs, it doesn’t take long to arrive at a suitable parsing result.

 

Here is a typical review and, below, the resulting parser output:

 

Merlot: Sonoma

$60; Rated: 94/100

 

Wonderful richness and fruit definition, with spicy, toasty oak, ripe plum and black cherry, black olive, herb and sage flavors lingering on the long, complex aftertaste. Best from 1999 through 2006. (500 cases produced)

 

… becomes, after parsing:

 

wonderful richness|fruit definition|spicy|toasty_oak|

ripe_plum|black_cherry|black_olive|herb|sage|lingering|

long|complex aftertaste|

 

Distilling the reviews in this fashion makes it easier to isolate the keywords that might be distinctive for each varietal.

 

3) Decide what keywords will be used.  As a starting point, the words of Noble’s wine aroma wheel are used, and then supplemented by words that appear from the previous parsing step (for example, ‘mineral’ is a common descriptor not found in Noble’s wheel).  Synonyms are identified (e.g., “herbal, herb, herbaceous, herby”, or “black_currant, currant, cassis”).
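One way to handle the synonym sets is to map every variant to a single canonical keyword before counting. The sketch below uses the two synonym sets quoted above; the choice of which word in each set serves as the canonical name is an assumption, as are the function and variable names.

```python
# Synonym sets from the examples in the text; the canonical names
# ("herbal", "black_currant") are an assumption made for illustration.
SYNONYM_SETS = {
    "herbal": ["herbal", "herb", "herbaceous", "herby"],
    "black_currant": ["black_currant", "currant", "cassis"],
}

# Invert the sets into a variant -> canonical keyword lookup.
CANONICAL = {
    variant: keyword
    for keyword, variants in SYNONYM_SETS.items()
    for variant in variants
}

def canonicalize(tokens):
    """Replace each token by its canonical keyword; other tokens pass through unchanged."""
    return [CANONICAL.get(token, token) for token in tokens]
```

A token that belongs to no synonym set, such as ‘mineral’, is left as it is, so the same function can be applied to every parsed review.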

 

4) Count the occurrence of each word or synonym set, by varietal.  This is done with another program shown below:

 

 

The left box lists available parsed reviews.  Clicking on a file name causes that file to be analysed, with the results displayed at the right and written to disk.  The classification is modeled after the wine aroma wheel, where aromas are categorized hierarchically from most specific at the edge to most general at the inner ring (e.g., ‘grapefruit’ on the edge, which is a subset of ‘citrus’ in the middle ring, which is a subset of ‘fruity’ on the innermost ring).  “Cascade counts up” means that a count in an outer ring (e.g., ‘grapefruit’) is added to all associated inner rings (‘citrus’ and ‘fruity’, in the case of ‘grapefruit’).  If a wine was called ‘fruity’, it would cause a count in only the innermost ring, ‘fruity’.  If a wine was called ‘citrusy’, it would cause a count in both the ‘citrus’ and ‘fruity’ rings.  “Cascade counts down” means that a count in an inner ring is added to all associated outer rings (as shown in the example above).  Cascading down means that most information about inner rings is transferred and merged into the outer rings, and thus a full profile of the wine can be viewed just by looking at the single outermost ring.
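The two cascading modes can be illustrated with a small Python sketch. It assumes the hierarchy is stored as a child-to-parent map covering only the grapefruit/lemon/citrus/fruity fragment mentioned above; the names PARENT, cascade_up and cascade_down are hypothetical, and the original program may be organized quite differently.

```python
from collections import Counter

# Fragment of the aroma wheel as a child -> parent map (outer ring -> inner ring).
PARENT = {"grapefruit": "citrus", "lemon": "citrus", "citrus": "fruity"}

def ancestors(term):
    """Yield the inner-ring categories that contain term, nearest first."""
    node = PARENT.get(term)
    while node is not None:
        yield node
        node = PARENT.get(node)

def cascade_up(raw):
    """Add each term's count to every inner ring that contains it."""
    total = Counter(raw)
    for term, count in raw.items():
        for inner in ancestors(term):
            total[inner] += count
    return total

def cascade_down(raw):
    """Add each inner-ring count to every outer-ring term beneath it."""
    total = Counter(raw)
    for term in set(PARENT) | set(PARENT.values()):
        for inner in ancestors(term):
            total[term] += raw.get(inner, 0)
    return total
```

Given raw counts of 2 for ‘grapefruit’, 5 for ‘citrus’ and 1 for ‘fruity’, cascading up leaves ‘grapefruit’ at 2 and raises ‘citrus’ to 7 and ‘fruity’ to 8, whereas cascading down raises ‘grapefruit’ to 8 and gives ‘lemon’ a count of 6 even though it was never mentioned directly.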

 

The analyser matches words that are exactly like the keywords listed (e.g., ‘grapefruit’) and like the keyword with ‘s’ or ‘y’ added.  Thus ‘has a grapefruity taste’ and ‘smells like grapefruits’ would both cause a hit on ‘grapefruit’ (and on the inner rings of that category).
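The matching rule is simple enough to sketch directly in Python. The function names are hypothetical, and counting every matching token (rather than at most one hit per review) is an assumption, since the text does not say which the analyser does.

```python
def matches_keyword(word, keyword):
    """True if word is the keyword itself or the keyword with 's' or 'y' added."""
    return word in (keyword, keyword + "s", keyword + "y")

def count_hits(tokens, keywords):
    """Count, for each keyword, the tokens that match it under the rule above."""
    counts = {keyword: 0 for keyword in keywords}
    for token in tokens:
        for keyword in keywords:
            if matches_keyword(token, keyword):
                counts[keyword] += 1
    return counts
```

Note that the rule is deliberately narrow: ‘grape’ does not match ‘grapefruit’, and a variant such as ‘herbaceous’ has to be covered by a synonym set rather than by suffix matching.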

 

5) Merge the results and compare varietals.  The results for the outermost ring are collected into an Excel file to allow side-by-side comparisons.  The ten wines and 94 keywords represent too much information to be easily displayed by Excel charts, but almost as effective is a technique that represents Excel cell values by a series of ‘|’ characters, making a stack proportional to the size of each number.  For example, the fragment of an Excel worksheet below shows the results for four white wines.  In each case, the stack of ‘|’ characters is proportional to the percentage of reviews that mention the associated feature in the wine.  The large stack of ‘|’ under Sauvignon Blanc in ‘Grapefruit’ means many reviews mentioned ‘grapefruit’ (or ‘grapefruity’ or ‘grapefruits’) for that variety, whereas ‘grapefruit’ was not as common a feature of Gewurztraminer wines.  Recall that due to cascading counts from inner rings, a mention of ‘citrus’ would cause a count for all outer-ring members of that class, which in this case would be ‘grapefruit’ and ‘lemon’.  Looking at the results below, we can surmise that there were many mentions of ‘citrus’ for Sauvignon Blanc, and a few reviewers that were more specific, using ‘grapefruit’.
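The ‘|’ stack display can be imitated outside a spreadsheet. The Python sketch below is an illustration only: the scale of one ‘|’ per two percentage points and the column widths are assumptions, and the percentages in the example are made up, not results from the study. (In a worksheet, a text-repeat function such as REPT could produce the same effect; how the original worksheet was built is not stated.)

```python
def percent_of_reviews(hits, total_reviews):
    """Percentage of a varietal's reviews that mention a feature."""
    return 100.0 * hits / total_reviews if total_reviews else 0.0

def bar(percent, percent_per_mark=2.0):
    """Stack of '|' characters whose length is proportional to percent."""
    return "|" * int(percent / percent_per_mark)

def render_table(table, varietals, percent_per_mark=2.0):
    """table maps feature -> {varietal: percent}; returns one text line per feature."""
    lines = []
    for feature, by_varietal in table.items():
        cells = [bar(by_varietal.get(v, 0.0), percent_per_mark) for v in varietals]
        row = feature.ljust(12) + "  ".join(cell.ljust(20) for cell in cells)
        lines.append(row.rstrip())
    return lines

# Made-up percentages, for illustration only (not results from the study).
example = {"Grapefruit": {"Sauvignon Blanc": 30.0, "Gewurztraminer": 6.0}}

for line in render_table(example, ["Sauvignon Blanc", "Gewurztraminer"]):
    print(line)
```

Because each cell is plain text, many features and varietals fit on one page, which is the advantage the stack display has over a chart.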

 



The four white wines and six red varietal wines chosen were mentioned as distinctive on Noble’s wine aroma wheel and her web site (http://wineserver.ucdavis.edu/Acnoble/waw.html).  Reviews for the following varieties were analysed:

 

Red                      White

Cabernet Franc           Chardonnay

Cabernet Sauvignon       Sauvignon Blanc

Merlot                   Gewurztraminer

Pinot Noir               Riesling

Shiraz/Syrah

Zinfandel

 

 

Results

 

The results are shown on the following two pages, one page for reds and the other for whites.  These are the results for the outer-most ring, with counts from the inner-more rings added in.  To condense the display, some categories of Noble’s aroma wheel that were not found in any reviews have been removed.  The absence of most ‘chemical’ features is likely due to only wines free of defects being included in the reviews.

 

 





The results displayed this way are a little tricky to interpret.  For example, looking at the white wine results, it’s clear that Sauvignon Blanc has by far the most hits in herbaceous.  Note that most reviewers simply used “herbal” (or some variant of that word) rather than a more specific aroma (e.g., “bell pepper”), so the counts are similar for all members of that category, with only a few reviewers being more specific (‘Hay, Straw’).

 

Distinguishing characteristics can be found by looking for counts that stand out (for example, ‘Rose’ for Gewurztraminer).

 

8,651 reviews (original and parsed) and word frequency tables are available.

 

 

Discussion

 

The results show considerable consistency in reviews.  For each particular variety, there are certain features that are often mentioned, but many more that are never mentioned.  This consistency of clustering could be interpreted as a sign of good reviewers (in the same way that 1,000 people asked to name the colour of a clear noon sky would tend to consistently answer “blue”) or as a sign of mechanistic reviews or editing.

 

Some varietals have very similar profiles.  Merlot and Cabernet Sauvignon are nearly identical, except that ‘currant’ and ‘cedar’ are slightly more frequent in Cabernet Sauvignon – no single feature could distinguish their two profiles in the results.

 

On the other hand, ‘rose’ and ‘lychee’ are clearly identified with Gewurztraminer, and ‘tobacco’ is a reliably distinctive feature of Cabernet Franc in the reviews.

 

The four white varietals examined have profiles centered on ‘apple/peach/apricot’ and ‘citrus’, but distinguished as described below:

 

                                      Apple/peach/apricot/grapefruit/lemon

Chardonnay                   + melon + fig + smoky + mineral + RESINOUS(oak) + vanilla + caramel

Sauvignon Blanc             + HERBACEOUS + melon + fig + mineral

Gewurztraminer              – lemon + pepper + ROSE + LYCHEE + honey

Riesling                          + MINERAL + vanilla/honey + almond

 

The six reds have overlapping characteristics but differ in emphasis. All six reds had cherry, black_cherry, plum, anise, and “toasty oak” occurring with high probability in their reviews, with occasional mention of vanilla, smoke, coffee, sage (grass?), leather, chocolate, and mint, plus some notable tendencies, as listed below:

 

                                      Plum/cherry/black_cherry/resinous/anise/chocolate

Cabernet Franc              + TOBACCO + blackberry – anise

Cabernet Sauvignon       + CURRANT + blackberry

Merlot                           + blackberry

Pinot Noir                      + blackberry/raspberry/strawberry + tea – chocolate

Shiraz/Syrah                   + BLACKBERRY + raspberry + PEPPER (less cherry)

Zinfandel                        + blackberry/raspberry/strawberry (jammy) + tar + sage

 

A possible problem with aggregating so many reviews is that if there were different clusters of characteristics within a particular varietal, that clustering would be obscured.  For example, cool-climate Chardonnays differ from warm-climate Chardonnays in characteristic ways, but in this study, all Chardonnays have been lumped together, creating an aggregate profile that (theoretically) may not describe any single Chardonnay!  However, this is an interesting first step, and of course refinements would improve the results, at the cost of complexity and difficulty of interpretation –  interesting answers always raise more questions!

 

There are other descriptors not included in the study, such as those associated with tannins, acid, and body, that are features of a particular varietal and help distinguish it to tasters.

 

In summary, the analysis of thousands of wine reviews reveals that wines of each variety do have characteristic features that can be derived by looking for a pattern of descriptors in the wine reviews, especially for the white varietals.

 

 

Ideas for further work

 

·        Compare reviews for cool- and warm-climate wines within the same varietal

·        Analyse reviews for other varietals

·        Examine descriptors of acid, tannin, and texture (by varietal).

·        Try different sources of reviews.  Compare the profiles for varietals from different sources.

 

Credits and Sources

 

Wine reviews came from the Wine Spectator web site (http://www.winespectator.com), using direct queries (enabling capture of reviews of varietals, not blends, which is not possible for all varietals via their web page interface).

 

Wine aroma vocabulary came from A.C. Noble’s Wine Aroma Wheel (http://wineserver.ucdavis.edu/Acnoble/waw.html).

 

All the software is original.