Machine Learning Tutorial - Tuning your Parameters

Machine Learning - Parameter Tuning

In this video I will show you how you can easily tune the crap out of your model… using Python and scikit-learn.

The model we will be using in this video is again the model from the video about sentiment analysis… but slightly changed… so if you haven’t watched that video yet… now would be a good time to go for it!

I will explain a few things about this model in this video so that you can follow along… but I won’t go into detail… for that… watch my other video.

Alright… now I have been talking for a while, but what’s this thing “tuning your model”? And how does it help us make better predictions…

In Machine Learning we build models and try to predict certain values… and while building our models there are parameters called hyperparameters which kind of control how our model works. Some of those parameters are the learning rate, the number of epochs to run and the stopping criteria… but there are many more…

The performance of our model changes depending on these parameters… so our goal is to choose them wisely… to make our model perform better… and this is what this video will be about.

So… now let’s get this going… and let’s dive into the code….

Preparing the Data and Creating the Model

This is basically the model from the last video… except that in this example I’m using a “Pipeline” to build the model. I also used the Pipeline in the code of the video about sentiment analysis which I uploaded on GitHub… so have a look at that. It’s just a shorter and more convenient way of building your model… and in our case it gives us an easier way to access the parameters for tuning… but we will see this soon.

 

import json as j
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

# read the reviews line by line and wrap them into one valid JSON array
json_data = None
with open('/Users/jo/Documents/coding-maniac/2_sentiment_analysis/code/sentiment_tut1/data/yelp_academic_dataset_review.json') as data_file:
    lines = data_file.readlines()
    joined_lines = "[" + ",".join(lines) + "]"
    json_data = j.loads(joined_lines)

data = pd.DataFrame(json_data)
data.head()

# drop the neutral 3-star reviews and derive a binary sentiment target
data = data[data.stars != 3]
data['sentiment'] = data['stars'] >= 4

# hold out 20% of the data as the final test set
X_train, X_test, y_train, y_test = train_test_split(data, data.sentiment, test_size=0.2)

# bag of words -> tf-idf weighting -> logistic regression classifier
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', LogisticRegression())])
 

Let’s step through this code really quick… here we are loading our dataset… and have to transform it a bit to make it valid json… then we create a Pandas DataFrame out of it...

After this… we need to preprocess our data… we filter out all 3-star reviews since we cannot be sure whether those are positive or negative… and then we create a new column in the data frame

which is called sentiment… which will be our target vector… we assign the value 1 to this column if the review has a rating of 4 or 5 and 0 if the star rating is 1 or 2…
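
Just as a small side note… the comparison in the code above actually produces True/False values… scikit-learn handles those just fine, but if you prefer the explicit 1/0 encoding I just described, a tiny sketch (my addition, not from the video) would look like this:

# optional: store the sentiment as explicit 1/0 integers instead of True/False
data['sentiment'] = (data['stars'] >= 4).astype(int)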

The next thing we are gonna do is to split our data into a training and a test set… and the final thing to do is to create the model…

Now we are all set up… and ready to go...
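
One more small note… train_test_split shuffles the data randomly on every run… if you want to compare tuning experiments against each other it can be handy to fix the split with a random_state (again my addition, not part of the original code):

# fixing random_state makes the train/test split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    data, data.sentiment, test_size=0.2, random_state=42)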

Checking Model Performance with Cross-Validation

In this example we will be using k-fold cross-validation with k = 5…

So what does this mean… this will split our training data into 5 parts and train and evaluate our model multiple times, every time using a different part as validation set and the others as training set… so we will get 5 scores out of this method…

This k-fold algorithm is very helpful, especially for parameter tuning… you should never tune your parameters using the test dataset… since this will make you vulnerable to overfitting… you should always use a validation dataset for performance tuning… and when done with the tuning, use your test dataset to check your model's performance…

But splitting your data into 3 datasets reduces the amount of training data… this is where cross-validation comes in handy… because you don’t have to set aside a separate validation dataset…

If you want to know more about overfitting… my last video addresses overfitting and ways to avoid it… check it out…
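
If you want to see what those 5 folds actually look like, here is a minimal sketch (my addition) using scikit-learn's KFold directly… cross_val_score, which we use below, does this splitting for us internally, so this is just for illustration:

from sklearn.model_selection import KFold

# split the training data into 5 folds; every fold is used once for validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_train)):
    print("fold %d: %d training rows, %d validation rows"
          % (fold, len(train_idx), len(val_idx)))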

 

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training data, using all CPU cores
scores = cross_val_score(pipeline, X_train.text, y_train, scoring='accuracy', cv=5, n_jobs=-1)

mean = scores.mean()
std = scores.std()
print(mean)
print(std)

 

Because we are getting 5 values as the result… we use the mean to get our model's accuracy for the cross-validation…

We also check the standard deviation to see how much variance there is in the results…

So… this is our baseline… now let’s see how to tune the parameters…

Tuning the Model and (hopefully) Improving Performance

First of all… we need to check which parameters exist that we can tune… we do this with the method “get_params”.

 

print(pipeline.get_params())

 

The parameters which are important for us are at the bottom and include a double underscore in their names.

Those are the parameters we are able to tune… I won’t go into detail about what they stand for… check the documentation for that…

but as you can see, thanks to using the Pipeline we are able to tune the parameters of all three components of our model… the vectorizer, the tfidf transformer and the classifier…
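
The output of get_params() can get pretty long… if you just want the tunable double-underscore names, a quick one-liner (my addition) does the trick:

# list only the nested parameters we can address in a grid, e.g. 'classifier__C'
print([name for name in pipeline.get_params() if '__' in name])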

With that in mind.. let’s tune our model…

 

We will use GridSearchCV to tune our model… what this will do… it will use cross-validation to fit and evaluate our model multiple times with all the parameter variations we feed to it…

and tell us the parameter combination that performed best…

For this to happen we need to tell it which parameters and values to test… this is done using a dictionary… where the key is the parameter name (in the step__parameter form we just saw) and the value is a list of values to try for this parameter…

I picked these parameters to tune… as you see I’m tuning parameters for the vectorizer as well as for the classifier…

With this dictionary set up… we can create an instance of GridSearchCV and start fitting it….

 

from sklearn.model_selection import GridSearchCV

grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],  # unigrams only vs. unigrams + bigrams
    'vectorizer__stop_words': [None, 'english'],
    'classifier__penalty': ['l1', 'l2'],
    # note: on newer scikit-learn versions the default lbfgs solver does not support
    # 'l1'; in that case add e.g. 'classifier__solver': ['liblinear'] to the grid
    'classifier__C': [1.0, 0.8],
    'classifier__class_weight': [None, 'balanced'],
    'classifier__n_jobs': [-1]
}

# try every parameter combination with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, param_grid=grid, scoring='accuracy', n_jobs=-1, cv=5)
grid_search.fit(X=X_train.text, y=y_train)

print("-----------")
print(grid_search.best_score_)
print(grid_search.best_params_)

 

But be careful… building all those models takes a lot of time… so the more parameters you choose the longer the tuning will take… I have the latest MacBook Pro and this setup takes about 2 hours on my laptop…
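
To get a feeling for why it takes so long you can count the fits up front… this little sketch (not from the video) just multiplies the number of parameter combinations by the number of folds:

from sklearn.model_selection import ParameterGrid

# every parameter combination is trained once per cross-validation fold
n_candidates = len(ParameterGrid(grid))
print(n_candidates, "parameter combinations")
print(n_candidates * 5, "model fits in total with cv=5")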

 

When it is done… we can see the best achieved accuracy score and the parameters that made this happen….
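
By the way… GridSearchCV doesn’t only remember the winner… it keeps the scores of every combination in cv_results_… if you’re curious, something like this (my addition, reusing the pandas import from above) shows the top candidates:

# look at all tried combinations, best first
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())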

Actually Checking Performance with the Test Dataset

Now that we have our tuned parameters we can create a model using them and try it on our test dataset which the model has never seen before… so this should be unbiased…

 

pipeline2 = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    # note: on newer scikit-learn versions the 'l1' penalty needs a solver that
    # supports it, e.g. solver='liblinear'
    ('classifier', LogisticRegression(C=1.0, class_weight=None, n_jobs=-1, penalty='l1'))])

# model: the untuned baseline pipeline, model2: the pipeline with the tuned parameters
model = pipeline.fit(X_train.text, y_train)
model2 = pipeline2.fit(X_train.text, y_train)

predicted = model.predict(X_test.text)
predicted2 = model2.predict(X_test.text)

# accuracy on the held-out test set
print("model1: " + str(np.mean(predicted == y_test)))
print("model2: " + str(np.mean(predicted2 == y_test)))

 

And here you go… this is the final result…
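
One last aside… instead of typing the best parameters into a new Pipeline by hand, you could also use grid_search.best_estimator_, which GridSearchCV (with its default refit=True) has already refit on the full training data… a minimal sketch:

# the refit best pipeline can predict directly on the held-out test set
predicted_best = grid_search.best_estimator_.predict(X_test.text)
print("best_estimator_: " + str(np.mean(predicted_best == y_test)))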

Now play around with it and have fun

 
