Do Professional Wine Reviewers Know What They're Doing?

Aron Sun and Jordan Foster
CMSC320
Fall 2020

Introduction

Today, the mention of wine is almost synonymous with its value. We usually want to know two things: how much does the wine cost, and how good is it? In other words, does the wine offer good value for its price? One way to judge the quality of a wine is by sentiment, which is most often generated by consumers in the form of reviews on the internet. Sentiment has transformed with the rise of the internet. Before the internet, sentiment about a product came from those you were close to, such as friends and family, or from reputable sources such as professionals. The internet, however, brought mass consumer sentiment, with consumers posting reviews on platforms run by companies like Amazon and Google. Many of these reviews can be fake, written by sellers trying to gain an edge in selling their product. The wine industry, though, has not only consumer sentiment but also professional sentiment.

Because wine sentiment comes not only from consumers but also from a small group of professional wine tasters, these professionals are often criticized for rating wines with no rhyme or reason. In this tutorial, we try to answer the question: do professional wine tasters know what they are doing?

Data Collection

There are many professional wine tasters, and their reviews can be found on numerous websites. Rather than scraping these sites ourselves, we use a precompiled dataset of professional wine reviews from kaggle.com.

In [14]:
!pip install spacy
!pip install nltk
import spacy
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.corpus import twitter_samples
import pandas as pd
import numpy as np
import seaborn as sns
import string
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import accuracy_score
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
In [15]:
# raw data
wine_df = pd.read_csv("wine.csv")
reviews = wine_df
reviews = reviews.drop(['Unnamed: 0', 'description', 'designation', 'region_2', \
                        'taster_twitter_handle', 'title'], axis=1)
# reviews = wine_df.sample(20000).reset_index()
wine_df = wine_df.drop(['Unnamed: 0', 'description', 'designation', 'region_2', \
                        'taster_twitter_handle', 'title'], axis=1)
wine_df.head()
Out[15]:
country points price province region_1 taster_name variety winery
0 Italy 87 NaN Sicily & Sardinia Etna Kerin O’Keefe White Blend Nicosia
1 Portugal 87 15.0 Douro NaN Roger Voss Portuguese Red Quinta dos Avidagos
2 US 87 14.0 Oregon Willamette Valley Paul Gregutt Pinot Gris Rainstorm
3 US 87 13.0 Michigan Lake Michigan Shore Alexander Peartree Riesling St. Julian
4 US 87 65.0 Oregon Willamette Valley Paul Gregutt Pinot Noir Sweet Cheeks

Cleaning the Data

Here we remove all rows that don't record a price for the bottle of wine, since price is the main focus of the data visualization below.

In [16]:
reviews = reviews.dropna(subset=['price']).reset_index()
reviews.head()
Out[16]:
index country points price province region_1 taster_name variety winery
0 1 Portugal 87 15.0 Douro NaN Roger Voss Portuguese Red Quinta dos Avidagos
1 2 US 87 14.0 Oregon Willamette Valley Paul Gregutt Pinot Gris Rainstorm
2 3 US 87 13.0 Michigan Lake Michigan Shore Alexander Peartree Riesling St. Julian
3 4 US 87 65.0 Oregon Willamette Valley Paul Gregutt Pinot Noir Sweet Cheeks
4 5 Spain 87 15.0 Northern Spain Navarra Michael Schachner Tempranillo-Merlot Tandem

Data Exploration

Before we get into the data analysis, let's get a general overview of some wine trends and relationships we should be aware of. We first want a good understanding of the spread of the data. Data visualization will help highlight trends and relationships in the dataset that we might otherwise have missed just by looking at the dataframe.

Summary of Data

First and foremost, let's get a summary of the rows and columns we will be dealing with. For each column this includes the data type, the number of missing values, the number of unique values, and a few example values. This gives us a good overview of each column in the wine dataframe.

From the summary table, we can see that most of these variables are categorical. We can also see that region_1 and taster_name have a lot of missing values (price did as well, before we dropped those rows above). One interesting finding we would otherwise not have known is that there are only 19 unique wine tasters in this data. That means 19 people tasted roughly 121,000 wines, which is pretty impressive!

In [17]:
def summary_table(df):
    # Build a per-column summary: dtype, number of missing values, number of
    # unique values, and the first three rows as example values.
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    return summary
In [18]:
summary_table(reviews)
Dataset Shape: (120975, 9)
Out[18]:
Name dtypes Missing Uniques First Value Second Value Third Value
0 index int64 0 120975 1 2 3
1 country object 59 42 Portugal US US
2 points int64 0 21 87 87 87
3 price float64 0 390 15 14 13
4 province object 59 422 Douro Oregon Michigan
5 region_1 object 19575 1204 NaN Willamette Valley Lake Michigan Shore
6 taster_name object 24496 19 Roger Voss Paul Gregutt Alexander Peartree
7 variety object 1 697 Portuguese Red Pinot Gris Riesling
8 winery object 0 15855 Quinta dos Avidagos Rainstorm St. Julian
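
To see how those roughly 121,000 reviews are split among the 19 tasters, one quick check (illustrative, not part of the original summary) is a value count on the taster column:

# Number of reviews attributed to each named taster
# (rows with a missing taster are excluded by value_counts)
reviews['taster_name'].value_counts().head()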

Distribution of Points and Price

As we mentioned before, most of the columns are categorical variables. However, there are two quantitative columns: points and price. Here we compute basic summary statistics for these two columns. One interesting thing we can see is that the points only range from 80 to 100, which is a pretty small range given the number of rows in the dataset.

In [19]:
reviews.describe()
Out[19]:
index points price
count 120975.000000 120975.000000 120975.000000
mean 65045.760628 88.421881 35.363389
std 37512.060879 3.044508 41.022218
min 1.000000 80.000000 4.000000
25% 32574.500000 86.000000 17.000000
50% 65144.000000 88.000000 25.000000
75% 97506.500000 91.000000 42.000000
max 129970.000000 100.000000 3300.000000

Graph of Point Distribution

The histogram below shows how the points, which range from 80 to 100 as mentioned above, are distributed. As you can see, the point distribution is roughly normal. This is insightful because it suggests that the wine tasters cluster their reviews towards the middle, rarely giving very low or very high points. Next to the histogram is a cumulative distribution of points, which reinforces the idea that most points are clustered in the middle.

The histogram also provides some insight into whether these wine reviewers are guessing randomly. If they were, the distribution would be roughly uniform, with each of the 21 possible scores holding a similar share of reviews, so we can most likely conclude that is not the case. However, whether these reviewers are biased towards price is still an open question, since a roughly normal distribution of points on its own tells us little about how the scores relate to price.

In [20]:
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
g = sns.countplot(x='points', data=reviews, color='#1f77b4')
g.set_title("Distribution of Points", fontsize=20) 
g.set_xlabel("Points", fontsize=15) 
g.set_ylabel("Count", fontsize=15) 

plt.subplot(1,2,2)  
plt.scatter(range(reviews.shape[0]), np.sort(reviews.points.values), color='#1f77b4')
plt.xlabel('Index', fontsize=15)  
plt.ylabel('Points', fontsize=15)  
plt.title("Cummlative Points Distribuition", fontsize=20) 

plt.show()
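
A quick numeric version of the uniform-guessing argument above (an extra check we add here, not part of the original analysis): under uniform guessing, each of the 21 possible scores would hold about 1/21 ≈ 4.8% of the reviews, which can be compared with the share actually held by the single most common score.

# Share each score would hold if reviewers scored uniformly at random
print('uniform baseline:     ', 1 / reviews['points'].nunique())
# Share of reviews actually held by the most common score
print('observed modal share: ', reviews['points'].value_counts(normalize=True).max())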

Rating Categories

This is much like the previous graph, but here we first convert the points into a categorical variable ranging from 0 to 5 before plotting (roughly: 0 for 80-82, 1 for 83-86, 2 for 87-89, 3 for 90-93, 4 for 94-97, and 5 for 98-100). Converting the points into categories will later let us apply classification algorithms to the data. As you can see from the plot, there are barely any wines in the 0 and 5 categories.

In [21]:
def points_to_categorical(points):
    # Bin the 80-100 point scale into six categories (0 = lowest, 5 = highest)
    if 80 <= points <= 82:
        return 0
    elif 83 <= points <= 86:
        return 1
    elif 87 <= points <= 89:
        return 2
    elif 90 <= points <= 93:
        return 3
    elif 94 <= points <= 97:
        return 4
    else:
        return 5

reviews["rating_cat"] = reviews["points"].apply(points_to_categorical)
In [22]:
total = len(reviews)
plt.figure(figsize=(14,6))

g = sns.countplot(x='rating_cat',  color='#1f77b4',
                  data=reviews)
g.set_title("Point as Catigorical Variable Distribution", fontsize=20)
g.set_xlabel("Categories ", fontsize=15)
g.set_ylabel("Total Count", fontsize=15)

sizes=[]

plt.show()

Price Distribution

Below we can see the price distribution for all wines priced under 300 dollars. As you can see, there are relatively few wines above 100 dollars. Most wines are clustered between 0 and 100 dollars, with the peak price at around 20 dollars. It is interesting how fast the frequency of wines drops off, signaling a very saturated market for wines under 100 dollars.

In [23]:
plt.figure(figsize=(12,5))

g = sns.kdeplot(reviews.query('price < 300').price, color='#1f77b4')
g.set_title("Price Distribuition Filtered 300", fontsize=20)
g.set_xlabel("Prices(US)", fontsize=15)
g.set_ylabel("Frequency Distribuition", fontsize=15)


plt.show()

Scatter Plot of Points vs Price of Wine

So far, we have only been looking at one variable at a time. Arguably one of the most important relationships in this dataset is that between points and price. In essence, we want to start asking how price affects the points given by the wine reviewers. For the plots below, we might expect a roughly linear trend: the cheaper the wine, the fewer points awarded, and the more expensive the wine, the more points awarded. In addition, high-value wines should fall in the low-price, high-points quadrant, and low-value wines in the high-price, low-points quadrant. Looking at the scatter plot and hexplot below for wines priced under 300 dollars, we can see that most wines roughly follow this trend, but there is a clear, if slight, skew towards low-price, high-points wines. The hexplot shows that most wines are priced around 25 dollars and given around 87.5 points.

The scatter plot below also suggests that a high price does not guarantee quality, with many wines over 150 dollars getting fewer than 90 points, which would seem to indicate a lack of bias. However, we know the vast majority of prices are clustered between 0 and 100 dollars with a long right tail, so if points simply tracked price we would expect a similarly skewed point distribution; instead, we see a roughly normal one. This suggests the reviews are based more on subjectivity than objectivity.
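
One quick way to quantify this contrast (a small check we add here, using the scipy.stats module already imported above) is to compare the skewness of the two columns; price should come out strongly right-skewed while points should be close to symmetric.

# Skewness of price vs. points: a large positive value means a long right tail,
# a value near zero means a roughly symmetric distribution.
print('price skew: ', stats.skew(reviews['price']))
print('points skew:', stats.skew(reviews['points']))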

In [24]:
#Scatter Plot
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 300])
Out[24]:
<seaborn.axisgrid.JointGrid at 0x7fc2bf3ecf70>
In [25]:
# hexplot
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 300], kind='hex', gridsize=20)
Out[25]:
<seaborn.axisgrid.JointGrid at 0x7fc2c4d74130>
TF-IDF
In [1]:
!pip install nltk
import numpy as np 
import pandas as pd 
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, r2_score, mean_squared_error
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import nltk
nltk.download('stopwords')
Out[1]:
True

Wine Points Prediction

Note: a sample of the data is used here because the Jupyter notebook kernel runs out of memory and dies when the analysis is performed on the full dataset. As a result, the predictions depend on the randomly drawn sample, and the reported scores vary by roughly $\pm1\%$ between runs.
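
If reproducibility matters more than drawing a fresh sample on each run, one option (not what the cell below does) is to fix the seed by passing random_state to sample, e.g. replacing the sampling line in the next cell with:

# Hypothetical variant of the sampling line below: a fixed random_state makes
# the 20,000-row sample (and the scores that follow) reproducible across runs.
reviews = wine_df.sample(20000, random_state=0)[['points'] + str_cols].reset_index()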

In [4]:
wine_df = pd.read_csv("wine.csv")
str_cols = ['description', 'price',  'title', 'variety', 'country', 'designation', 'province', 'winery']

reviews = wine_df.sample(20000)[['points'] + str_cols].reset_index()
reviews = reviews.drop(['index'], axis=1)
reviews.head()
Out[4]:
points description price title variety country designation province winery
0 83 Barely ripe, with green citrus and feline spra... 18.0 Starmont 2009 Sauvignon Blanc (Napa Valley) Sauvignon Blanc US NaN California Starmont
1 90 This 100% Syrah shows intense blackberry, crèm... 45.0 Donelan 2010 Cuvee Christine Syrah (Sonoma Cou... Syrah US Cuvee Christine California Donelan
2 93 This year's dry conditions produced a fine cro... 47.0 Burmester 1989 Colheita Tawny (Port) Port Portugal Colheita Tawny Port Burmester
3 86 This is a plush and upfront wine that offers s... 14.0 Pelassa 2005 Barbera d'Alba Barbera Italy NaN Piedmont Pelassa
4 87 Pineapple and mango aromas mix with notes of b... 12.0 Columbia Crest 2014 Grand Estates Chardonnay (... Chardonnay US Grand Estates Washington Columbia Crest

We first have to convert the features we are going to use from categorical to numerical variables. This gives the features numeric values we can feed into the different forms of analysis below to predict the points given to a bottle of wine.
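
For reference, pd.factorize maps each distinct label in a column to an integer code and maps missing values to -1, which is why -1s appear in the designation column below. A tiny made-up example:

# Illustrative example (toy data, not the wine dataset): distinct labels become
# integer codes in order of first appearance, and missing values become -1.
codes, uniques = pd.factorize(pd.Series(['Syrah', 'Port', 'Syrah', None]))
print(codes)    # array of integer codes: [ 0  1  0 -1]
print(uniques)  # the distinct labels, in code order: ['Syrah', 'Port']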

In [5]:
# assign numerical values to string columns

factorized_wine = reviews[str_cols].drop(['description'], axis=1).copy()
for col in str_cols[2:]:
    factorized_wine[col] = pd.factorize(reviews[col])[0]

factorized_wine.head()
Out[5]:
price title variety country designation province winery
0 18.0 0 0 0 -1 0 0
1 45.0 1 1 0 0 0 1
2 47.0 2 2 1 1 1 2
3 14.0 3 3 2 -1 2 3
4 12.0 4 4 0 2 3 4

Now we take the variables we just factorized, along with the price of the wine, as our X values; our y value is what we are trying to predict, which in this case is the points awarded to a bottle of wine.

In [6]:
X = factorized_wine.to_numpy('int64')
y = reviews['points'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Below we perform several different forms of prediction to see which one produces the best result.

We then need to determine how accurate each model's predictions are. We do this with the model's score() method, which for regression models returns the coefficient of determination of the prediction ($r^2$), i.e. the fraction of the observed variation in y that can be explained by the model (for the classifiers below, score() instead returns mean accuracy). We also compute the root mean squared error (RMSE) of the predictions.
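
For reference, these two metrics can be written (with $y_i$ the true points, $\hat{y}_i$ the predicted points, and $\bar{y}$ the mean of the true points) as:

$ R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} $

$ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i} (y_i - \hat{y}_i)^2} $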

linear regression

In [7]:
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)

print('r2 score:', model.score(X_test,y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.025621093791445948
rmse score: 3.004001678134892

As you can see, this isn't a very good prediction model, so let's try some other methods and see what we get.

linear discriminant analysis

In [8]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

pred = lda_model.predict(X_test)

print('r2 score:', lda_model.score(X_test,y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.132
rmse score: 3.1166648841349627

The results from this method are not good either, so on to the next one.

classification tree

In [9]:
from sklearn import tree

dt_model = tree.DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

pred = dt_model.predict(X_test)

print('r2 score:', dt_model.score(X_test,y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.1468
rmse score: 3.2819506394825626

The methods we have tried so far, including this one, are getting us nowhere and show very little sign of improvement, so let's pivot in a different direction and try to predict the points based on the description of the wine.

Incorporating description

In [11]:
reviews.head()
Out[11]:
points description price title variety country designation province winery
0 83 Barely ripe, with green citrus and feline spra... 18.0 Starmont 2009 Sauvignon Blanc (Napa Valley) Sauvignon Blanc US NaN California Starmont
1 90 This 100% Syrah shows intense blackberry, crèm... 45.0 Donelan 2010 Cuvee Christine Syrah (Sonoma Cou... Syrah US Cuvee Christine California Donelan
2 93 This year's dry conditions produced a fine cro... 47.0 Burmester 1989 Colheita Tawny (Port) Port Portugal Colheita Tawny Port Burmester
3 86 This is a plush and upfront wine that offers s... 14.0 Pelassa 2005 Barbera d'Alba Barbera Italy NaN Piedmont Pelassa
4 87 Pineapple and mango aromas mix with notes of b... 12.0 Columbia Crest 2014 Grand Estates Chardonnay (... Chardonnay US Grand Estates Washington Columbia Crest

Because we are now focusing on the description (review) of the wine, here is an example of one:

In [12]:
reviews['description'][5]
Out[12]:
"A crisp, minerally wave of apples and pear start this wine from the cool-climate region of Elgin, and on the palate, it's equally delicate. Dry but with a touch of pretty sweetness, the wine is embraceable and a great solo sip."

We remove punctuation and other special characters and convert everything to lower case, since capitalization carries no significance here.

In [13]:
descriptions = []

for descrip in reviews['description']:
    line = re.sub(r'\W', ' ', str(descrip))
    line = line.lower()
    descriptions.append(line)
    
len(descriptions)
Out[13]:
20000

Here we use TfidfVectorizer. To understand what it does, term frequency-inverse document frequency (TF-IDF) must be explained first. TF-IDF is a measure that evaluates how relevant a word is to a document within a collection of documents. TF-IDF can be defined as follows:

$ \text{Term Frequency (TF)} = \frac{\text{Frequency of a word}}{\text{Total number of words in document}} $

$ \text{Inverse Document Frequency (IDF)} = \log{\frac{\text{Total number of documents}}{\text{Number of documents that contain the word}}} $

$ \text{TF-IDF} = \text{TF} \cdot \text{IDF} $

In turn, TfidfVectorizer gives us a matrix of feature vectors (one per description) that we can use as predictors.
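
To make this concrete, here is a tiny illustrative example (a made-up three-document corpus, not the wine data) of what TfidfVectorizer produces:

# Toy example: fit TfidfVectorizer on three short "documents" and inspect
# the learned vocabulary and the shape of the resulting TF-IDF matrix.
toy_docs = ["crisp apple and pear", "ripe blackberry and plum", "crisp dry apple"]
toy_vec = TfidfVectorizer(stop_words=stopwords.words('english'))
toy_X = toy_vec.fit_transform(toy_docs).toarray()
print(sorted(toy_vec.vocabulary_))  # e.g. ['apple', 'blackberry', 'crisp', ...]
print(toy_X.shape)                  # (3 documents, number of vocabulary terms)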

We pass TfidfVectorizer the parameters max_features, min_df, max_df, and stop_words. max_features keeps only the top n terms by frequency across the corpus; min_df=7 drops terms that appear in fewer than 7 descriptions; max_df causes the vectorizer to ignore terms with a document frequency strictly higher than the given threshold, and because we pass a float, terms appearing in more than 80% of descriptions are ignored. stop_words lets us pass in a set of stop words, which are words that add little to no meaning to a sentence, such as I, our, him, and her. Following this, we fit and transform the data and then split it into training and testing sets.

In [14]:
y = reviews['points'].values
vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
X = vec.fit_transform(descriptions).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)

Now that we've split the data, we use RandomForestRegressor() to make our prediction. Because it is a random forest algorithm, its prediction is the average of the predictions of the individual decision trees it builds.

In [15]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
Out[15]:
RandomForestRegressor()
In [16]:
pred = rfr.predict(X_test)
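
As a side note (an extra sanity check, not part of the original notebook), the "average of the trees" behaviour can be verified directly from the fitted forest's estimators_ attribute:

# Each fitted tree can predict on its own; the forest's prediction for a row
# is (up to floating point error) the mean of the individual tree predictions.
tree_preds = np.array([t.predict(X_test[:1])[0] for t in rfr.estimators_])
print(tree_preds.mean(), rfr.predict(X_test[:1])[0])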

Now we check how good our model is at predicting the points for a bottle of wine.

In [17]:
print('r2 score:', rfr.score(X_test, y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.4950601387266653
rmse score: 2.1461066422710684
In [18]:
cvs = cross_val_score(rfr, X_test, y_test, cv=10)
cvs.mean()
Out[18]:
0.43710399909256914

This is based solely on the description of the wine. As you can see, it is a large improvement in both the $r^2$ score and the RMSE over every method we tried above. However, it is still not ideal, for a couple of reasons. The first is the $r^2$ score, which measures how much of the variation in points the model explains: a large portion of that variation is still unaccounted for.

The other issue concerns what happens when the model's predictions are wrong. Taken on its own, an RMSE above 2 might suggest that when we miss, we miss badly. However, given that we are predicting integer point values for bottles of wine, that is not really the case: the RMSE tells us that a typical prediction is off by about 2.1 points. That is reasonable, but still less than ideal.

Below we see if we can improve upon these shortcomings.

Combining features

Next we combine the features obtained from TfidfVectorizer with the features we factorized earlier, matching them row by row.

In [19]:
wine_X = factorized_wine.to_numpy('int64')
X = np.concatenate((wine_X,X),axis=1)
In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)
In [21]:
rfr_fac = RandomForestRegressor()
rfr_fac.fit(X_train, y_train)
Out[21]:
RandomForestRegressor()
In [22]:
fac_pred = rfr_fac.predict(X_test)

Next we assess the prediction the same way as above: we use score(), and then perform a 10-fold cross-validation and take the mean of the scores.

In [23]:
print('r2 score:', rfr_fac.score(X_test, y_test))
print('rmse score:', mean_squared_error(y_test, fac_pred, squared=False))
r2 score: 0.5738683914990945
rmse score: 1.9715297968836283
In [24]:
fac_cvs = cross_val_score(rfr_fac, X_test, y_test, cv=10)
fac_cvs.mean()
Out[24]:
0.5221010937266104

As we can see from the scores computed above, this is an improvement over using the wine description (review) alone as a feature: the $r^2$ score improved by about 0.08 and the RMSE by about 0.17. The model still isn't all that reliable, though. With a mean cross-validated $r^2$ of only slightly above 0.5, it explains just over half of the variation in points across the sampled bottles.

Conclusion

After comparing price to points, we learned that the majority of the data is clustered towards the middle of the point scale, with few outliers in either direction. Furthermore, most wines follow the trend of being awarded more points as the price increases.

From these trends, we attempted to determine whether we can actually predict how many points a bottle of wine will receive. Given that the best model we could obtain took into account several features, including the price of the wine, and still explains only about half of the variation in points (a mean cross-validated $r^2$ of roughly 0.52), we are led to believe that the point system reflected in this dataset is more subjective than objective.