Today, the mention of wine is almost synonymous with its value. We usually want to know how much a wine costs and how good it is. In other words, the question we want answered is: does the wine offer the best value? One way to judge the quality of a wine is by sentiment, which consumers often generate in the form of reviews on the internet. In many ways, sentiment has been transformed by the internet. Before the internet, sentiment about a product came from people you were close to, such as friends and family, or from reputable sources such as professionals. The internet brought the rise of mass consumer sentiment, with consumers posting product reviews on the platforms of notable companies like Amazon and Google. Many of these reviews can be fake, written by sellers trying to gain an edge in selling their product. The wine industry, however, has not only consumer sentiment but also professional wine sentiment.
Wine sentiment therefore comes not only from consumers but also from a small group of professional wine tasters. These professionals are often criticized for rating wines randomly, with no rhyme or reason. This led us to ask: do professional wine tasters know what they are doing?
There are many professional wine tasters, and their reviews can be found on many websites. We found a precompiled dataset of professional wine reviews on kaggle.com and decided to use it.
!pip install spacy
!pip install nltk
import spacy
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.corpus import twitter_samples
import pandas as pd
import numpy as np
import seaborn as sns
import string
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import accuracy_score
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# raw data
wine_df = pd.read_csv("wine.csv")
reviews = wine_df
reviews = reviews.drop(['Unnamed: 0', 'description', 'designation', 'region_2', \
'taster_twitter_handle', 'title'], axis=1)
# reviews = wine_df.sample(20000).reset_index()
wine_df = wine_df.drop(['Unnamed: 0', 'description', 'designation', 'region_2', \
'taster_twitter_handle', 'title'], axis=1)
wine_df.head()
Here we remove all rows that don't record a price for a bottle of wine, because price is the main focus of the data visualization portion below.
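# drop=True keeps the old row labels from being added as a spurious numeric 'index' column
reviews = reviews.dropna(subset=['price']).reset_index(drop=True)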
reviews.head()
Before we get into the data analysis, let's get a general overview of some wine trends and relationships that we should be aware of. We first want a good understanding of the spread of the data. Data visualization will help us highlight relationships and trends in the dataset that we might otherwise have missed just by looking at the dataframe.
First and foremost, let's get a summary of the rows and columns we will be dealing with. This includes the data type, the number of missing values, the number of unique values, and the first few values of each column. This gives us a good summary of each column in the wine dataframe.
From the summary table, we can see that many of these variables are categorical. In addition, region_1, price, and taster_name have a lot of missing values. One interesting finding we would otherwise not have known is that there are only 19 unique wine tasters in this data. This means that 19 people tasted around 130,000 wines, which is pretty impressive!
def summary_table(df):
    # one row per column: dtype, missing count, unique count, and the first three values
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes,columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name','dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    return summary
summary_table(reviews)
As we mentioned before, most of the columns are categorical variables. However, there are two quantitative columns: points and price. Here we compute basic descriptive statistics for these two columns. One interesting thing we can see is that the points only range from 80 to 100, which is a pretty small range given the number of rows in the dataset.
reviews.describe()
The histogram below shows how the points, which range from 80 to 100 as mentioned above, are distributed. As you can see, the point distribution is almost normal. This is pretty insightful because it suggests that these wine tasters cluster their reviews towards the middle, rarely giving really low or high points. Next to the histogram is a plot of the sorted points, which acts as a cumulative distribution and reinforces the idea that most points are clustered in the middle.
The histogram below also provides some insight into the question of whether these wine reviewers are guessing randomly. Since scores drawn uniformly at random would produce an essentially flat histogram, we can most likely conclude that this is not the case. However, the question of whether these wine reviewers are biased towards price is still open, as we would not expect the distribution of points to be this normal.
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
g = sns.countplot(x='points', data=reviews, color='#1f77b4')
g.set_title("Distribution of Points", fontsize=20)
g.set_xlabel("Points", fontsize=15)
g.set_ylabel("Count", fontsize=15)
plt.subplot(1,2,2)
plt.scatter(range(reviews.shape[0]), np.sort(reviews.points.values), color='#1f77b4')
plt.xlabel('Index', fontsize=15)
plt.ylabel('Points', fontsize=15)
plt.title("Cumulative Points Distribution", fontsize=20)
plt.show()
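The "not uniform, so not random guessing" argument can also be checked numerically. The sketch below is one possible way to do it, not part of the original analysis: it runs a chi-square goodness-of-fit test of the score counts against a uniform distribution. The helper name `uniformity_test` and the two toy samples are assumptions made for illustration; the toy samples are generated here and are not the wine data.

```python
import numpy as np
import pandas as pd
from scipy import stats


def uniformity_test(points, low=80, high=100):
    """Chi-square goodness-of-fit test of integer scores against a uniform distribution.

    Hypothetical helper: scores outside [low, high] are ignored.
    """
    points = pd.Series(points)
    # Count every score in [low, high], including scores that never occur.
    observed = points.value_counts().reindex(range(low, high + 1), fill_value=0)
    # With no f_exp argument, scipy.stats.chisquare assumes equal expected counts.
    statistic, p_value = stats.chisquare(observed.to_numpy())
    return float(statistic), float(p_value)


# Toy scores, NOT the wine data: a bell-shaped sample and a uniform sample.
rng = np.random.default_rng(0)
bell = np.clip(np.round(rng.normal(88, 3, size=5000)), 80, 100).astype(int)
flat = rng.integers(80, 101, size=5000)

bell_stat, bell_p = uniformity_test(bell)
flat_stat, flat_p = uniformity_test(flat)
print(f"bell-shaped: chi2={bell_stat:.1f}, p={bell_p:.3g}")
print(f"uniform:     chi2={flat_stat:.1f}, p={flat_p:.3g}")
```

On the real data this would be called as `uniformity_test(reviews['points'])`. A tiny p-value only rules out uniformly random scores; it says nothing on its own about whether the scores are biased by price.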
This is much like the previous graph, but here we first convert the points into a categorical variable with values from 0 to 5 before we plot. Converting the points into a categorical variable will later allow us to run machine learning algorithms on the data. As you can see from the plot, there are barely any wines in categories 0 and 5.
def points_to_categorical(points):
    # map a score to a rating category: 80-82 -> 0, 83-86 -> 1, 87-89 -> 2,
    # 90-93 -> 3, 94-97 -> 4, anything else (98-100 in this dataset) -> 5
    if points in range(80, 83):
        return 0
    elif points in range(83, 87):
        return 1
    elif points in range(87, 90):
        return 2
    elif points in range(90, 94):
        return 3
    elif points in range(94, 98):
        return 4
    else:
        return 5
reviews["rating_cat"] = reviews["points"].apply(points_to_categorical)
total = len(reviews)
plt.figure(figsize=(14,6))
g = sns.countplot(x='rating_cat', color='#1f77b4',
data=reviews)
g.set_title("Points as Categorical Variable Distribution", fontsize=20)
g.set_xlabel("Categories ", fontsize=15)
g.set_ylabel("Total Count", fontsize=15)
sizes=[]
plt.show()
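For a dataset of this size, the same binning can be done in one vectorised step. The following is a minimal sketch using `pandas.cut`, offered as an alternative rather than the method used here. The function name `points_to_categorical_cut` and the constant `BIN_EDGES` are hypothetical, and the edges are chosen to reproduce the six categories of `points_to_categorical` for integer scores from 80 to 100.

```python
import pandas as pd

# Right-closed bins: (79, 82] -> 0, (82, 86] -> 1, (86, 89] -> 2,
# (89, 93] -> 3, (93, 97] -> 4, (97, 100] -> 5
BIN_EDGES = [79, 82, 86, 89, 93, 97, 100]


def points_to_categorical_cut(points):
    """Vectorised binning of a Series of scores into categories 0-5 (hypothetical helper)."""
    codes = pd.cut(points, bins=BIN_EDGES, labels=False)
    # pd.cut yields NaN outside the bins; mirror the original `else: return 5` branch.
    return codes.fillna(5).astype(int)


# Toy scores covering both ends of every bin, NOT taken from the wine data.
toy_points = pd.Series([80, 82, 83, 86, 87, 89, 90, 93, 94, 97, 98, 100])
toy_cats = points_to_categorical_cut(toy_points)
print(toy_cats.value_counts().sort_index())
```

On the notebook's dataframe this would be `reviews["rating_cat"] = points_to_categorical_cut(reviews["points"])`. Calling `value_counts()` on the result gives the exact size of each category, which makes it easy to confirm how sparse the lowest and highest categories are.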
Below we can see the price distribution for all prices less than 300 dollars. As you can see, there are relatively few wines above 100 dollars. Most wines are clustered between 0 and 100 dollars, with the peak at around 20 dollars. It is interesting how fast the frequency of wines drops off, signaling a very saturated market for wines under 100 dollars.
plt.figure(figsize=(12,5))
g = sns.kdeplot(reviews.query('price < 300').price, color='#1f77b4')
g.set_title("Price Distribution (Price < 300)", fontsize=20)
g.set_xlabel("Price (USD)", fontsize=15)
g.set_ylabel("Density", fontsize=15)
plt.show()
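The statements about where prices cluster can be backed with numbers rather than read off the curve. The sketch below is one way to do that, assuming a hypothetical helper `price_summary` that reports the share of wines at or below a few price thresholds and the most common price. The thresholds and the toy prices are illustrative assumptions, not values from the dataset.

```python
import pandas as pd


def price_summary(prices, thresholds=(20, 50, 100, 300)):
    """Share of wines priced at or below each threshold, plus the most common price."""
    prices = pd.Series(prices).dropna()
    # The mean of a boolean mask is the fraction of rows where it is True.
    shares = {t: float((prices <= t).mean()) for t in thresholds}
    # Round to whole dollars so that the mode is meaningful.
    modal_price = float(prices.round().mode().iloc[0])
    return shares, modal_price


# Toy prices in dollars, NOT taken from the wine dataset.
toy_prices = pd.Series([12, 15, 20, 20, 20, 25, 35, 60, 120, 450])
shares, modal_price = price_summary(toy_prices)
for threshold, share in shares.items():
    print(f"<= {threshold} dollars: {share:.0%}")
print(f"most common price: {modal_price:.0f} dollars")
```

On the real data the call would be `price_summary(reviews['price'])`.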
So far, we have only been looking at one variable at a time. Arguably one of the most important relationships in this dataset is that between points and price. In essence, we want to start asking how price affects the points given by the wine reviewers. For the plots below, we should expect a linear trend in which cheaper wines are awarded fewer points and more expensive wines are awarded more. In addition, we should see high-value wines in the low-price, high-points quadrant and low-value wines in the high-price, low-points quadrant. Looking at the scatter plot and hexplot below for wines priced under 300 dollars, we can see that almost all wines follow this trend, but there is a clear, if slight, skew towards low-price, high-points wines. The hexplot shows that most wines are priced at around 25 dollars and given about 87.5 points.
The scatter plot below also suggests that a high price does not guarantee quality, with many wines over 150 dollars getting fewer than 90 points, which would seem to indicate a lack of bias. However, because we know the vast majority of prices are clustered between 0 and 100 dollars, we should expect a point distribution that closely resembles the price distribution; instead, we see a normal distribution. This indicates that reviews are based more on subjectivity than objectivity.
#Scatter Plot
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 300])
# hexplot
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 300], kind='hex', gridsize=20)
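The strength of the price and points trend seen in these plots can be summarised with correlation coefficients. The sketch below is an illustration under stated assumptions, not a result from this analysis: the helper `price_points_correlation` is hypothetical, the toy frame is made up, and the log-price variant is included only because prices are strongly right-skewed.

```python
import numpy as np
import pandas as pd
from scipy import stats


def price_points_correlation(df, price_col="price", points_col="points"):
    """Pearson (raw and log price) and Spearman correlation between price and points."""
    data = df[[price_col, points_col]].dropna()
    data = data[data[price_col] > 0]  # log is only defined for positive prices
    pearson_r, _ = stats.pearsonr(data[price_col], data[points_col])
    log_r, _ = stats.pearsonr(np.log(data[price_col]), data[points_col])
    # Spearman uses ranks, so it captures any monotonic trend, not just a linear one.
    spearman_rho, _ = stats.spearmanr(data[price_col], data[points_col])
    return {
        "pearson": float(pearson_r),
        "pearson_log_price": float(log_r),
        "spearman": float(spearman_rho),
    }


# Toy frame, NOT the wine data: points rise steadily while price grows much faster.
toy = pd.DataFrame({
    "price": [10, 15, 20, 30, 50, 80, 150, 250],
    "points": [83, 85, 86, 88, 89, 91, 92, 94],
})
result = price_points_correlation(toy)
print(result)
```

On the notebook's dataframe this would be `price_points_correlation(reviews[reviews['price'] < 300])`. If the log-price correlation comes out clearly higher than the raw one, that would suggest points grow with the order of magnitude of the price rather than linearly with dollars.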