Do Professional Wine Reviewers Know What They're Doing?

Aron Sun and Jordan Foster
CMSC320
Fall 2020

Introduction

Today, the mention of wine is almost synonymous with its value. We usually want to know two things: how much does the wine cost, and how good is it? In other words, does the wine offer good value for its price? One way to judge the quality of a wine is by sentiment, which is most often generated by consumers in the form of reviews on the internet. Sentiment has transformed with the rise of the internet. Before the internet, sentiment about a product came from those you were close to, such as friends and family, or from reputable sources such as professionals. The internet, however, brought mass consumer sentiment, with consumers posting reviews on platforms run by companies like Amazon and Google. Many of these reviews can be fake, written by sellers trying to gain an edge in selling their product. The wine industry, though, has not only consumer sentiment but also professional sentiment.

Because wine sentiment comes not only from consumers but also from a small group of professional wine tasters, these professionals are often criticized for rating wines with no rhyme or reason. In this tutorial, we try to answer the question: do professional wine tasters know what they are doing?

Data Collection

There are many professional wine tasters, and their reviews can be found on numerous websites. Rather than scraping these sites ourselves, we use a precompiled dataset of professional wine reviews from kaggle.com.

In [14]:
!pip install spacy
!pip install nltk
import spacy
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.corpus import twitter_samples
import pandas as pd
import numpy as np
import seaborn as sns
import string
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import accuracy_score
from sklearn import linear_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
In [15]:
# raw data
wine_df = pd.read_csv("wine.csv")
reviews = wine_df
reviews = reviews.drop(['Unnamed: 0', 'description', 'designation', 'region_2', \
                        'taster_twitter_handle', 'title'], axis=1)
# reviews = wine_df.sample(20000).reset_index()
wine_df = wine_df.drop(['Unnamed: 0', 'description', 'designation', 'region_2', \
                        'taster_twitter_handle', 'title'], axis=1)
wine_df.head()
Out[15]:
country points price province region_1 taster_name variety winery
0 Italy 87 NaN Sicily & Sardinia Etna Kerin O’Keefe White Blend Nicosia
1 Portugal 87 15.0 Douro NaN Roger Voss Portuguese Red Quinta dos Avidagos
2 US 87 14.0 Oregon Willamette Valley Paul Gregutt Pinot Gris Rainstorm
3 US 87 13.0 Michigan Lake Michigan Shore Alexander Peartree Riesling St. Julian
4 US 87 65.0 Oregon Willamette Valley Paul Gregutt Pinot Noir Sweet Cheeks

Cleaning the Data

Here we remove all rows that don't record a price for the bottle of wine, since price is the main focus of the data visualization below.

In [16]:
reviews = reviews.dropna(subset=['price']).reset_index()
reviews.head()
Out[16]:
index country points price province region_1 taster_name variety winery
0 1 Portugal 87 15.0 Douro NaN Roger Voss Portuguese Red Quinta dos Avidagos
1 2 US 87 14.0 Oregon Willamette Valley Paul Gregutt Pinot Gris Rainstorm
2 3 US 87 13.0 Michigan Lake Michigan Shore Alexander Peartree Riesling St. Julian
3 4 US 87 65.0 Oregon Willamette Valley Paul Gregutt Pinot Noir Sweet Cheeks
4 5 Spain 87 15.0 Northern Spain Navarra Michael Schachner Tempranillo-Merlot Tandem

Data Exploration

Before we get into the data analysis, let's get a general overview of some wine trends and relationships we should be aware of. We first want a good understanding of the spread of the data. Data visualization will help highlight trends and relationships in the dataset that we might otherwise have missed just by looking at the dataframe.

Summary of Data

First and foremost, let's get a summary of the rows and columns we will be dealing with. For each column this includes the data type, the number of missing values, the number of unique values, and a few example values. This gives us a good overview of each column in the wine dataframe.

From the summary table, we can see that most of these variables are categorical. We can also see that region_1 and taster_name have a lot of missing values (price did as well, before we dropped those rows above). One interesting finding we would otherwise not have known is that there are only 19 unique wine tasters in this data. That means 19 people tasted roughly 121,000 wines, which is pretty impressive!

In [17]:
def summary_table(df):
    # Build a per-column summary: dtype, number of missing values, number of
    # unique values, and the first three rows as example values.
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    return summary
In [18]:
summary_table(reviews)
Dataset Shape: (120975, 9)
Out[18]:
Name dtypes Missing Uniques First Value Second Value Third Value
0 index int64 0 120975 1 2 3
1 country object 59 42 Portugal US US
2 points int64 0 21 87 87 87
3 price float64 0 390 15 14 13
4 province object 59 422 Douro Oregon Michigan
5 region_1 object 19575 1204 NaN Willamette Valley Lake Michigan Shore
6 taster_name object 24496 19 Roger Voss Paul Gregutt Alexander Peartree
7 variety object 1 697 Portuguese Red Pinot Gris Riesling
8 winery object 0 15855 Quinta dos Avidagos Rainstorm St. Julian
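
To see how those roughly 121,000 reviews are split among the 19 tasters, one quick check (illustrative, not part of the original summary) is a value count on the taster column:

# Number of reviews attributed to each named taster
# (rows with a missing taster are excluded by value_counts)
reviews['taster_name'].value_counts().head()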

Distribution of Points and Price

As we mentioned before, most of the columns are categorical variables. However, there are two quantitative columns: points and price. Here we compute basic summary statistics for these two columns. One interesting thing we can see is that the points only range from 80 to 100, which is a pretty small range given the number of rows in the dataset.

In [19]:
reviews.describe()
Out[19]:
index points price
count 120975.000000 120975.000000 120975.000000
mean 65045.760628 88.421881 35.363389
std 37512.060879 3.044508 41.022218
min 1.000000 80.000000 4.000000
25% 32574.500000 86.000000 17.000000
50% 65144.000000 88.000000 25.000000
75% 97506.500000 91.000000 42.000000
max 129970.000000 100.000000 3300.000000

Graph of Point Distribution

The histogram below shows how the points, which range from 80 to 100 as mentioned above, are distributed. As you can see, the point distribution is roughly normal. This is insightful because it suggests that the wine tasters cluster their reviews towards the middle, rarely giving very low or very high points. Next to the histogram is a cumulative distribution of points, which reinforces the idea that most points are clustered in the middle.

The histogram also provides some insight into whether these wine reviewers are guessing randomly. If they were, the distribution would be roughly uniform, with each of the 21 possible scores holding a similar share of reviews, so we can most likely conclude that is not the case. However, whether these reviewers are biased towards price is still an open question, since a roughly normal distribution of points on its own tells us little about how the scores relate to price.

In [20]:
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
g = sns.countplot(x='points', data=reviews, color='#1f77b4')
g.set_title("Distribution of Points", fontsize=20) 
g.set_xlabel("Points", fontsize=15) 
g.set_ylabel("Count", fontsize=15) 

plt.subplot(1,2,2)  
plt.scatter(range(reviews.shape[0]), np.sort(reviews.points.values), color='#1f77b4')
plt.xlabel('Index', fontsize=15)  
plt.ylabel('Points', fontsize=15)  
plt.title("Cummlative Points Distribuition", fontsize=20) 

plt.show()
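
A quick numeric version of the uniform-guessing argument above (an extra check we add here, not part of the original analysis): under uniform guessing, each of the 21 possible scores would hold about 1/21 ≈ 4.8% of the reviews, which can be compared with the share actually held by the single most common score.

# Share each score would hold if reviewers scored uniformly at random
print('uniform baseline:     ', 1 / reviews['points'].nunique())
# Share of reviews actually held by the most common score
print('observed modal share: ', reviews['points'].value_counts(normalize=True).max())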

Rating Categories

This is much like the previous graph, but here we first convert the points into a categorical variable ranging from 0 to 5 before plotting (roughly: 0 for 80-82, 1 for 83-86, 2 for 87-89, 3 for 90-93, 4 for 94-97, and 5 for 98-100). Converting the points into categories will later let us apply classification algorithms to the data. As you can see from the plot, there are barely any wines in the 0 and 5 categories.

In [21]:
def points_to_categorical(points):
    # Bin the 80-100 point scale into six categories (0 = lowest, 5 = highest)
    if 80 <= points <= 82:
        return 0
    elif 83 <= points <= 86:
        return 1
    elif 87 <= points <= 89:
        return 2
    elif 90 <= points <= 93:
        return 3
    elif 94 <= points <= 97:
        return 4
    else:
        return 5

reviews["rating_cat"] = reviews["points"].apply(points_to_categorical)
In [22]:
total = len(reviews)
plt.figure(figsize=(14,6))

g = sns.countplot(x='rating_cat',  color='#1f77b4',
                  data=reviews)
g.set_title("Point as Catigorical Variable Distribution", fontsize=20)
g.set_xlabel("Categories ", fontsize=15)
g.set_ylabel("Total Count", fontsize=15)

sizes=[]

plt.show()

Price Distribution

Below we can see the price distribution for all wines priced under 300 dollars. As you can see, there are relatively few wines above 100 dollars. Most wines are clustered between 0 and 100 dollars, with the peak price at around 20 dollars. It is interesting how fast the frequency of wines drops off, signaling a very saturated market for wines under 100 dollars.

In [23]:
plt.figure(figsize=(12,5))

g = sns.kdeplot(reviews.query('price < 300').price, color='#1f77b4')
g.set_title("Price Distribuition Filtered 300", fontsize=20)
g.set_xlabel("Prices(US)", fontsize=15)
g.set_ylabel("Frequency Distribuition", fontsize=15)


plt.show()

Scatter Plot of Points vs Price of Wine

So far, we have only been looking at one variable at a time. Arguably one of the most important relationships in this dataset is that between points and price. In essence, we want to start asking how price affects the points given by the wine reviewers. For the plots below, we might expect a roughly linear trend: the cheaper the wine, the fewer points awarded, and the more expensive the wine, the more points awarded. In addition, high-value wines should fall in the low-price, high-points quadrant, and low-value wines in the high-price, low-points quadrant. Looking at the scatter plot and hexplot below for wines priced under 300 dollars, we can see that most wines roughly follow this trend, but there is a clear, if slight, skew towards low-price, high-points wines. The hexplot shows that most wines are priced around 25 dollars and given around 87.5 points.

The scatter plot below also suggests that a high price does not guarantee quality, with many wines over 150 dollars getting fewer than 90 points, which would seem to indicate a lack of bias. However, we know the vast majority of prices are clustered between 0 and 100 dollars with a long right tail, so if points simply tracked price we would expect a similarly skewed point distribution; instead, we see a roughly normal one. This suggests the reviews are based more on subjectivity than objectivity.
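
One quick way to quantify this contrast (a small check we add here, using the scipy.stats module already imported above) is to compare the skewness of the two columns; price should come out strongly right-skewed while points should be close to symmetric.

# Skewness of price vs. points: a large positive value means a long right tail,
# a value near zero means a roughly symmetric distribution.
print('price skew: ', stats.skew(reviews['price']))
print('points skew:', stats.skew(reviews['points']))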

In [24]:
#Scatter Plot
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 300])
Out[24]:
<seaborn.axisgrid.JointGrid at 0x7fc2bf3ecf70>
In [25]:
# hexplot
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 300], kind='hex', gridsize=20)
Out[25]:
<seaborn.axisgrid.JointGrid at 0x7fc2c4d74130>
TF-IDF
In [1]:
!pip install nltk
import numpy as np 
import pandas as pd 
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, r2_score, mean_squared_error
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import nltk
nltk.download('stopwords')
Out[1]:
True

Wine Points Prediction

Note: a sample of the data is used here because the Jupyter notebook kernel runs out of memory and dies when the analysis is performed on the full dataset. As a result, the predictions depend on the randomly drawn sample, and the reported scores vary by roughly $\pm1\%$ between runs.
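
If reproducibility matters more than drawing a fresh sample on each run, one option (not what the cell below does) is to fix the seed by passing random_state to sample, e.g. replacing the sampling line in the next cell with:

# Hypothetical variant of the sampling line below: a fixed random_state makes
# the 20,000-row sample (and the scores that follow) reproducible across runs.
reviews = wine_df.sample(20000, random_state=0)[['points'] + str_cols].reset_index()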

In [4]:
wine_df = pd.read_csv("wine.csv")
str_cols = ['description', 'price',  'title', 'variety', 'country', 'designation', 'province', 'winery']

reviews = wine_df.sample(20000)[['points'] + str_cols].reset_index()
reviews = reviews.drop(['index'], axis=1)
reviews.head()
Out[4]:
points description price title variety country designation province winery
0 83 Barely ripe, with green citrus and feline spra... 18.0 Starmont 2009 Sauvignon Blanc (Napa Valley) Sauvignon Blanc US NaN California Starmont
1 90 This 100% Syrah shows intense blackberry, crèm... 45.0 Donelan 2010 Cuvee Christine Syrah (Sonoma Cou... Syrah US Cuvee Christine California Donelan
2 93 This year's dry conditions produced a fine cro... 47.0 Burmester 1989 Colheita Tawny (Port) Port Portugal Colheita Tawny Port Burmester
3 86 This is a plush and upfront wine that offers s... 14.0 Pelassa 2005 Barbera d'Alba Barbera Italy NaN Piedmont Pelassa
4 87 Pineapple and mango aromas mix with notes of b... 12.0 Columbia Crest 2014 Grand Estates Chardonnay (... Chardonnay US Grand Estates Washington Columbia Crest

We first have to convert the features we are going to use from categorical to numerical variables. This gives the features numeric values we can feed into the different forms of analysis below to predict the points given to a bottle of wine.
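
For reference, pd.factorize maps each distinct label in a column to an integer code and maps missing values to -1, which is why -1s appear in the designation column below. A tiny made-up example:

# Illustrative example (toy data, not the wine dataset): distinct labels become
# integer codes in order of first appearance, and missing values become -1.
codes, uniques = pd.factorize(pd.Series(['Syrah', 'Port', 'Syrah', None]))
print(codes)    # array of integer codes: [ 0  1  0 -1]
print(uniques)  # the distinct labels, in code order: ['Syrah', 'Port']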

In [5]:
# assign numerical values to string columns

factorized_wine = reviews[str_cols].drop(['description'], axis=1).copy()
for col in str_cols[2:]:
    factorized_wine[col] = pd.factorize(reviews[col])[0]

factorized_wine.head()
Out[5]:
price title variety country designation province winery
0 18.0 0 0 0 -1 0 0
1 45.0 1 1 0 0 0 1
2 47.0 2 2 1 1 1 2
3 14.0 3 3 2 -1 2 3
4 12.0 4 4 0 2 3 4

Now we take the variables we just factorized, along with the price of the wine, as our X values; our y value is what we are trying to predict, which in this case is the points awarded to a bottle of wine.

In [6]:
X = factorized_wine.to_numpy('int64')
y = reviews['points'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

Below we perform several different forms of prediction to see which one produces the best result.

We then need to determine how accurate each model's predictions are. We do this with the model's score() method, which for regression models returns the coefficient of determination of the prediction ($r^2$), i.e. the fraction of the observed variation in y that can be explained by the model (for the classifiers below, score() instead returns mean accuracy). We also compute the root mean squared error (RMSE) of the predictions.
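
For reference, these two metrics can be written (with $y_i$ the true points, $\hat{y}_i$ the predicted points, and $\bar{y}$ the mean of the true points) as:

$ R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} $

$ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i} (y_i - \hat{y}_i)^2} $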

linear regression

In [7]:
from sklearn import linear_model

model = linear_model.LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)

print('r2 score:', model.score(X_test,y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.025621093791445948
rmse score: 3.004001678134892

As you can see, this isn't a very good prediction model, so let's try some other methods and see what we get.

linear discriminant analysis

In [8]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)

pred = lda_model.predict(X_test)

print('r2 score:', lda_model.score(X_test,y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.132
rmse score: 3.1166648841349627

The results from this method are not good either, so on to the next one.

classification tree

In [9]:
from sklearn import tree

dt_model = tree.DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

pred = dt_model.predict(X_test)

print('r2 score:', dt_model.score(X_test,y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.1468
rmse score: 3.2819506394825626

The methods we have tried so far, including this one, are getting us nowhere and show very little sign of improvement, so let's pivot in a different direction and try to predict the points based on the description of the wine.

Incorporating description

In [11]:
reviews.head()
Out[11]:
points description price title variety country designation province winery
0 83 Barely ripe, with green citrus and feline spra... 18.0 Starmont 2009 Sauvignon Blanc (Napa Valley) Sauvignon Blanc US NaN California Starmont
1 90 This 100% Syrah shows intense blackberry, crèm... 45.0 Donelan 2010 Cuvee Christine Syrah (Sonoma Cou... Syrah US Cuvee Christine California Donelan
2 93 This year's dry conditions produced a fine cro... 47.0 Burmester 1989 Colheita Tawny (Port) Port Portugal Colheita Tawny Port Burmester
3 86 This is a plush and upfront wine that offers s... 14.0 Pelassa 2005 Barbera d'Alba Barbera Italy NaN Piedmont Pelassa
4 87 Pineapple and mango aromas mix with notes of b... 12.0 Columbia Crest 2014 Grand Estates Chardonnay (... Chardonnay US Grand Estates Washington Columbia Crest

Because we are now focusing on the description (review) of the wine, here is an example of one:

In [12]:
reviews['description'][5]
Out[12]:
"A crisp, minerally wave of apples and pear start this wine from the cool-climate region of Elgin, and on the palate, it's equally delicate. Dry but with a touch of pretty sweetness, the wine is embraceable and a great solo sip."

We remove punctuation and other special characters and convert everything to lower case, since capitalization carries no significance here.

In [13]:
descriptions = []

for descrip in reviews['description']:
    line = re.sub(r'\W', ' ', str(descrip))
    line = line.lower()
    descriptions.append(line)
    
len(descriptions)
Out[13]:
20000

Here we use TfidfVectorizer. To understand what it does, term frequency-inverse document frequency (TF-IDF) must be explained first. TF-IDF is a measure that evaluates how relevant a word is to a document within a collection of documents. TF-IDF can be defined as follows:

$ \text{Term Frequency (TF)} = \frac{\text{Frequency of a word}}{\text{Total number of words in document}} $

$ \text{Inverse Document Frequency (IDF)} = \log{\frac{\text{Total number of documents}}{\text{Number of documents that contain the word}}} $

$ \text{TF-IDF} = \text{TF} \cdot \text{IDF} $

In turn, TfidfVectorizer gives us a matrix of feature vectors (one per description) that we can use as predictors.
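
To make this concrete, here is a tiny illustrative example (a made-up three-document corpus, not the wine data) of what TfidfVectorizer produces:

# Toy example: fit TfidfVectorizer on three short "documents" and inspect
# the learned vocabulary and the shape of the resulting TF-IDF matrix.
toy_docs = ["crisp apple and pear", "ripe blackberry and plum", "crisp dry apple"]
toy_vec = TfidfVectorizer(stop_words=stopwords.words('english'))
toy_X = toy_vec.fit_transform(toy_docs).toarray()
print(sorted(toy_vec.vocabulary_))  # e.g. ['apple', 'blackberry', 'crisp', ...]
print(toy_X.shape)                  # (3 documents, number of vocabulary terms)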

We pass TfidfVectorizer the parameters max_features, min_df, max_df, and stop_words. max_features keeps only the top n terms by frequency across the corpus; min_df=7 drops terms that appear in fewer than 7 descriptions; max_df causes the vectorizer to ignore terms with a document frequency strictly higher than the given threshold, and because we pass a float, terms appearing in more than 80% of descriptions are ignored. stop_words lets us pass in a set of stop words, which are words that add little to no meaning to a sentence, such as I, our, him, and her. Following this, we fit and transform the data and then split it into training and testing sets.

In [14]:
y = reviews['points'].values
vec = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8, stop_words=stopwords.words('english'))
X = vec.fit_transform(descriptions).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)

Now that we've split the data, we use RandomForestRegressor() to make our prediction. Because it is a random forest algorithm, its prediction is the average of the predictions of the individual decision trees it builds.

In [15]:
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
Out[15]:
RandomForestRegressor()
In [16]:
pred = rfr.predict(X_test)
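
As a side note (an extra sanity check, not part of the original notebook), the "average of the trees" behaviour can be verified directly from the fitted forest's estimators_ attribute:

# Each fitted tree can predict on its own; the forest's prediction for a row
# is (up to floating point error) the mean of the individual tree predictions.
tree_preds = np.array([t.predict(X_test[:1])[0] for t in rfr.estimators_])
print(tree_preds.mean(), rfr.predict(X_test[:1])[0])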

Now we check how good our model is at predicting the points for a bottle of wine.

In [17]:
print('r2 score:', rfr.score(X_test, y_test))
print('rmse score:', mean_squared_error(y_test, pred, squared=False))
r2 score: 0.4950601387266653
rmse score: 2.1461066422710684
In [18]:
cvs = cross_val_score(rfr, X_test, y_test, cv=10)
cvs.mean()
Out[18]:
0.43710399909256914

This is based solely on the description of the wine. As you can see, it is a large improvement in both the $r^2$ score and the RMSE over every method we tried above. However, it is still not ideal, for a couple of reasons. The first is the $r^2$ score, which measures how much of the variation in points the model explains: a large portion of that variation is still unaccounted for.

The other issue concerns what happens when the model's predictions are wrong. Taken on its own, an RMSE above 2 might suggest that when we miss, we miss badly. However, given that we are predicting integer point values for bottles of wine, that is not really the case: the RMSE tells us that a typical prediction is off by about 2.1 points. That is reasonable, but still less than ideal.

Below we see if we can improve upon these shortcomings.

Combining features

Next we combine the features obtained from TfidfVectorizer with the features we factorized earlier, matching them row by row.

In [19]:
wine_X = factorized_wine.to_numpy('int64')
X = np.concatenate((wine_X,X),axis=1)
In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 12)
In [21]:
rfr_fac = RandomForestRegressor()
rfr_fac.fit(X_train, y_train)
Out[21]:
RandomForestRegressor()
In [22]:
fac_pred = rfr_fac.predict(X_test)

Next we assess the prediction the same way as above: we use score(), and then perform a 10-fold cross-validation and take the mean of the scores.

In [23]:
print('r2 score:', rfr_fac.score(X_test, y_test))
print('rmse score:', mean_squared_error(y_test, fac_pred, squared=False))
r2 score: 0.5738683914990945
rmse score: 1.9715297968836283
In [24]:
fac_cvs = cross_val_score(rfr_fac, X_test, y_test, cv=10)
fac_cvs.mean()
Out[24]:
0.5221010937266104

As we can see from the scores computed above, this is an improvement over using the wine description (review) alone as a feature: the $r^2$ score improved by about 0.08 and the RMSE by about 0.17. The model still isn't all that reliable, though. With a mean cross-validated $r^2$ of only slightly above 0.5, it explains just over half of the variation in points across the sampled bottles.

Conclusion

After comparing price to points, we learned that the majority of the data is clustered towards the middle of the point scale, with few outliers in either direction. Furthermore, most wines follow the trend of being awarded more points as the price increases.

From these trends, we attempted to determine whether we can actually predict how many points a bottle of wine will receive. Given that the best model we could obtain took into account several features, including the price of the wine, and still explains only about half of the variation in points (a mean cross-validated $r^2$ of roughly 0.52), we are led to believe that the point system reflected in this dataset is more subjective than objective.