< Hands-On Project > Analyzing French Wine Data

羅子函|Doris Lo
7 min read · Aug 8, 2024


As I embark on a new journey in France later this year, I’d like to prepare myself for both academics and daily life. That sparked the idea for this hands-on project: learning the fundamentals of data analytics along with a bit of French culture, namely wine.

Cover photo generated by AI 😆

Before I start, please note that this article illustrates how I navigated a data analytics project with a non-STEM academic background, so code and models are not discussed in depth. If you’d like to see the wine recommendations and insights, please go to <Wine Insights Uncorked — Learning French Wines with Vivino> for my full analysis and to Tableau Public for interactive visualizations.

Here’s the table of contents; feel free to scroll down to the topics that interest you:

  1. Defining Goals
  2. Collecting and Preparing the Data
  3. Analyzing and Visualizing the Data
  4. My Takeaways

1. Defining Goals

Before this project, my experience in data analytics was fragmented, and I could only rely on limited tools (e.g. Excel and internal company tools) to work efficiently. With that in mind, my goals for the project are:

  1. Familiarizing myself with common data analytics tools and methods, including Python, Tableau, and statistics.
  2. Building an end-to-end project with basic steps such as data collection, cleansing, analysis, and visualization.

I got the idea of choosing wine as my topic from Kaggle.com (ref), where I rekindled my connection with Bordeaux. Because I spent time on exchange in Bordeaux, France, one of the best-known wine-growing regions in the world, people have often assumed I know wine well simply because I’ve been there (yeah, I know it’s funny) and asked me to pick wines for the occasion.

However, the truth is I don’t know much about wine. With this opportunity, I’d like to become ‘wine literate’, as in:

  1. Being able to recognize the wine regions of France and the terms used to describe a wine’s features
  2. Knowing wine market trends to make a satisfying purchase decision

Some of the questions I’d like to answer with the analysis:

Which wines/regions are the most popular?
Which wine bottles are good value for the money?
Which wineries produce the best-selling and best-value wines?
What are 3 taste features of the most popular wine bottle?

2. Collecting and Preparing the Data

Vivino was a convincing choice for my data source, since I’ve seen Vivino recommendations in the wine aisles at Carrefour and on the wine menu at an Italian restaurant. I was confident there would be plenty to explore and that it would meet my project needs.

Vivino.com is one of the largest global online wine marketplaces, with over 60 million users, a vast database of 18 million wines, and 103 million reviews on the website.

After several attempts at learning Python at my uni, I realized I don’t enjoy coding that much, so I used YouTube tutorial videos like Python for Data Analytics — Full Course for Beginners, along with ChatGPT, to refresh the knowledge needed for data extraction. ChatGPT was very helpful and got me to understand the process in no time.

My first conversation with ChatGPT for this project

Communities like GitHub, Reddit, and Kaggle were useful as well, since web scraping on Vivino.com is widely discussed there. That let me skip the process of using tools like Beautiful Soup or Scrapy to parse the HTML. If you are curious about how to scrape the data, I highly recommend checking out the communities mentioned above.

With those resources at hand, I focused instead on refining the code to fetch data that could improve the quality of my analysis. Considering how enormous Vivino’s database is, I set some parameters for the target dataset:

  • Country: France
  • Currency: Euros
  • Price range: below €100
  • Grape type: Varietal
  • Rating: from 1 (min) to 5 (max)

In the end, the fields I scraped successfully included region name, winery, wine name, vintage, price, and average rating, among others. The dataset’s metadata is shown on my Tableau Public. I took three major data-wrangling actions: correcting garbled text with Python, grouping data (for example, into region categories), and creating fields such as coordinates.

Garbled text showed up because of French special characters like “é” in “rosé”
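For readers wondering what such a fix can look like: one common cause of this kind of garbling is UTF-8 text that was decoded as Latin-1 somewhere along the way. The sketch below illustrates a repair under that assumption. The helper name `fix_mojibake` is hypothetical, and this is not necessarily the code used in the project.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Assumption: the garbling comes from exactly this mix-up. Text that
    does not fit the pattern is returned unchanged.
    """
    try:
        # Turn the wrongly decoded characters back into the original
        # bytes, then decode those bytes with the correct encoding.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already-correct text, or text in another script, lands here.
        return text


print(fix_mojibake("rosÃ©"))                # rosé
print(fix_mojibake("MoÃ«t & Chandon"))      # Moët & Chandon
print(fix_mojibake("rosé"))                 # rosé (left untouched)
```

With pandas, the same helper could be applied to a text column with `df["wine_name"].map(fix_mojibake)`, where `wine_name` is a placeholder column name.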

Visualizing the geographic distribution of the wine regions was one of my priorities, and it was the most challenging part. Tableau doesn’t support geocodes for the French wine regions, so the missing coordinates had to be filled in manually. At first, I asked ChatGPT to find estimated coordinates for all the wine regions, but because of the intricate classification rules for French wine regions (appellations, in French), ChatGPT wasn’t able to process the data. Thanks to a powerful Google Maps extension, Geocode by Awesome Table, I was able to gather all the coordinates. Still, given the size of the data, I’d consider using Python libraries for this task if I started over.
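As a rough idea of what the Python route could look like, here is a minimal sketch that looks up each region name once and stores its coordinates. It assumes a geocoding function that returns an object with `latitude` and `longitude` attributes (or `None` when nothing is found), which is how geopy's `Nominatim.geocode` behaves. The function name `geocode_regions` and the app name in the comment are hypothetical, and results for small appellations would still need a manual check.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]


def geocode_regions(
    regions: Iterable[str],
    geocode: Callable[[str], object],
    pause_seconds: float = 1.0,
) -> Dict[str, Optional[Coordinates]]:
    """Look up (latitude, longitude) for each distinct region name."""
    coords: Dict[str, Optional[Coordinates]] = {}
    for region in regions:
        if region in coords:
            continue  # each region only needs one lookup
        location = geocode(f"{region}, France")
        coords[region] = (
            (location.latitude, location.longitude) if location else None
        )
        # Public geocoders are rate limited, so pause between requests.
        time.sleep(pause_seconds)
    return coords


# Possible usage with geopy (needs `pip install geopy` and internet access):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="wine-region-demo")  # hypothetical app name
# coords = geocode_regions(["Bordeaux", "Bourgogne"], geolocator.geocode)
```

Passing the geocoder in as an argument keeps the loop easy to test and makes it simple to swap in a different geocoding service.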

3. Analyzing and Visualizing the Data

Tableau Visualization

To perform quantitative analyses of the dataset, I defined popularity and good value by numbers:

  • Popularity = Number of Reviews * Average Rating
  • Good Value = Price / Popularity
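I calculated these in Tableau, but the same two definitions can be sketched in pandas. The column names and the three rows below are made-up placeholders, not values from my dataset. Note that with this definition a lower Good Value score means a lower price per unit of popularity, so the best-value wines sort to the top in ascending order.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
wines = pd.DataFrame(
    {
        "wine": ["A", "B", "C"],
        "price": [20.0, 45.0, 12.0],
        "num_reviews": [1000, 500, 200],
        "avg_rating": [4.0, 4.5, 3.5],
    }
)

# Popularity = Number of Reviews * Average Rating
wines["popularity"] = wines["num_reviews"] * wines["avg_rating"]

# Good Value = Price / Popularity (lower means better value)
wines["good_value"] = wines["price"] / wines["popularity"]

# Best value first
best_value = wines.sort_values("good_value")
print(best_value[["wine", "price", "popularity", "good_value"]])
```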

There is a limitation in the data, as the number of reviews for non-vintage (N.V.) wines, usually champagne, is aggregated, while reviews for the vintage wines are counted by the year of production. In this analysis, the vintage wines are grouped by their unique wine IDs, which means the number of reviews and average rating of a wine bottle produced by the same winery across different years are calculated as a whole. This method provides a more comprehensive insight into wine popularity.
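The grouping can be sketched in pandas as follows. This is an illustration rather than the exact calculation from the project: it assumes the combined rating is a review-weighted average, so a vintage with more reviews counts for more, and the column names and rows are made-up placeholders.

```python
import pandas as pd

# Made-up rows: one line per vintage of a wine (column names are assumptions).
vintages = pd.DataFrame(
    {
        "wine_id": [1, 1, 2],
        "vintage": [2018, 2019, 2020],
        "num_reviews": [100, 300, 50],
        "avg_rating": [4.0, 4.4, 3.8],
    }
)

# Weight each vintage's rating by its number of reviews.
vintages["rating_points"] = vintages["num_reviews"] * vintages["avg_rating"]

# Collapse all vintages of the same wine ID into one row.
per_wine = vintages.groupby("wine_id").agg(
    num_reviews=("num_reviews", "sum"),
    rating_points=("rating_points", "sum"),
)
per_wine["avg_rating"] = per_wine["rating_points"] / per_wine["num_reviews"]

print(per_wine[["num_reviews", "avg_rating"]])
```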

Moving forward, I used Tableau Desktop to perform the majority of my analyses. It was my very first time signing up for and downloading Tableau Desktop.

Tableau offers students a free one-year Tableau license and e-learning courses. Students can get verified with their school email to access the resources. Check out Tableau for Students.

Before I started analyzing the dataset, I built a general understanding of Tableau with Tableau e-learning. I watched Tableau tutorials on Coursera too, but personally I found the former more helpful because of the step-by-step instructions in its hands-on activities. The analysis itself went fast: most of the questions were answered while I created the visualization sheets.

Sheet showing price distribution

Learning Tableau tricks was an enjoyable process, and they truly help keep the viz compact while adding detail to the visualizations.

Keyword Analysis

In addition to Tableau, I used Python to perform keyword analysis and find the 3 taste features of the most popular wine: Impérial Brut Champagne from Moët & Chandon. For a small dataset, Excel can handle this analysis too. Here’s part of my code for keyword analysis:

from nltk.corpus import stopwords
from langdetect import detect, DetectorFactory
# ... (omitted)

# Remove duplicate and missing reviews, then lowercase the text
df.drop_duplicates(subset=['reviews'], inplace=True)
df.dropna(subset=['reviews'], inplace=True)
df['reviews'] = df['reviews'].str.lower()
# ... (omitted)

from collections import Counter

# Combine all reviews into a single string
all_reviews = ' '.join(df['reviews'])

# Split the string into words
words = all_reviews.split()

# Count the frequency of each word
word_counts = Counter(words)

# Print the most common words
print(word_counts.most_common(10))

I scraped reviews in many languages, such as English, French, and Japanese, and garbled text showed up again, so I used the Python library langdetect to sort the reviews by language before finding keywords.
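The language filtering and stopword removal are in the omitted parts of my excerpt, so here is a minimal, self-contained sketch of how those two steps can fit together. It is not the original code: `keep_language` and `top_keywords` are hypothetical helper names, the stopword set is a tiny stand-in for nltk's English list, and the detector is passed in as an argument so the example runs without extra installs.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption); nltk's
# stopwords.words("english") is a much fuller alternative.
STOPWORDS = {"a", "an", "and", "the", "of", "with", "is", "it", "this", "very", "to", "in"}


def keep_language(reviews, detect, lang="en"):
    """Keep only the reviews that `detect` labels as `lang`."""
    kept = []
    for text in reviews:
        try:
            if detect(text) == lang:
                kept.append(text)
        except Exception:
            # Detectors can fail on empty or emoji-only text; skip those.
            continue
    return kept


def top_keywords(reviews, n=10):
    """Count lowercase words, ignoring punctuation and stopwords."""
    words = re.findall(r"[a-z]+", " ".join(reviews).lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(n)


# Possible usage with langdetect (needs `pip install langdetect`):
# from langdetect import detect, DetectorFactory
# DetectorFactory.seed = 0  # makes detection results reproducible
# english = keep_language(df["reviews"].tolist(), detect)
# print(top_keywords(english))
```

Filtering by language first matters because a plain word count would otherwise mix French or Japanese words into the English keyword ranking.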

Apple, citrus, brioche, green, toast, pear and lemon are the keywords showing up most frequently in the user reviews.

Some extras

As there is no time field in my dataset, I couldn’t observe trends over time. Instead, I visualized when the user reviews were created, a by-product of the data gathered for keyword analysis, and saw seasonality in the reviews.
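I built this view in Tableau, but the underlying idea is simply counting reviews per calendar month. A minimal sketch, assuming the review timestamps are ISO 8601 strings such as `2023-12-24T18:30:00` (the format and the function name `reviews_per_month` are assumptions):

```python
from collections import Counter
from datetime import datetime


def reviews_per_month(timestamps):
    """Count reviews per calendar month (1-12), pooled across all years."""
    months = Counter(datetime.fromisoformat(ts).month for ts in timestamps)
    # Include months with zero reviews so gaps stay visible in a chart.
    return {month: months.get(month, 0) for month in range(1, 13)}


# Made-up timestamps for illustration only.
sample = ["2023-12-24T18:30:00", "2022-12-31T23:00:00", "2023-07-14T12:00:00"]
print(reviews_per_month(sample))
```

Pooling all years into twelve buckets is what makes a seasonal pattern, such as a peak around the year-end holidays, stand out.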

I learned linear regression from MITx: Supply Chain Analytics and used it to analyze the relationship between price and rating. This method strengthened my reasoning about the correlation between prices and ratings and made my conclusions more persuasive.

4. My Takeaways

The quality of data is key. Missing fields such as coordinates and region categories, along with garbled text, incorrect prices, and duplicate records, were roadblocks for the analysis. On top of that, the lack of grape types and wine types in the data limited the quality of the analysis as well. Rather than spending a lot of effort on data wrangling, improving the data collection process can save a lot of time. Data cleansing is inevitable in most cases, so it’s important to weigh how much each data improvement contributes to your project goal and prioritize accordingly.

Consistency in the grind. I learned this idea from reading Hidden Potential by Adam Grant:

Infusing Passion into Practice.

— Adam Grant

I knew nothing about data and wine before this project. After regularly putting time into both topics, it’s exciting to see how far I’ve come. Doing anything from scratch is no easy work, and the progress can be subtle, but don’t let the boring work frustrate you. It’s how you build your superpowers!

Thanks for reading about my new journey with data analytics! I hope you enjoyed the article, and don’t hesitate to let me know if you have any feedback.

Stay tuned if you’d like to see similar topics.

See My Tableau Public: Tzuhan Lo
Connect With Me on LinkedIn: Tzuhan Lo

Project inspirations
1. Vivino.com
2. Coursera: 5 Data Analytics Projects for Beginners
3. NYC Data Science: Wine 101: Gathering Data From Vivino
4. Datacamp: 8 Types of Data Analytics to Improve Decision-Making
5. Dataiku: 7 Fundamental Steps to Complete a Data Analytics Project

Clap 👏 this article to let me know you are here! :)
