Hyperparameter Tuning with Grid Search in Python
 

Hyperparameter tuning with Grid Search Cross-Validation (CV) is another technique for finding the best parameters for your model. In this post, I will provide code snippets showing how to perform hyperparameter tuning with Grid Search using Scikit-Learn (Sklearn).

 

1. Load your data for hyperparameter tuning with Grid Search CV

To perform the task, you will need data. We will once again use the well-known Boston housing dataset for regression. It is a very good dataset for practicing your regression skills.

 

To perform hyperparameter tuning with Grid Search, we will use the GridSearchCV class from the sklearn.model_selection module. Additionally, the RandomForestRegressor class from Scikit-Learn will perform the Random Forest regression in this problem.

# import the datasets module, Grid Search Cross-Validation, and the Random Forest regressor
from sklearn import datasets
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestRegressor

# get the boston dataset
data = datasets.load_boston()
X = data.data
y = data.target
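Note that load_boston has been deprecated and was removed in scikit-learn 1.2. If your version no longer ships it, a minimal substitute (assuming you are happy to practice on a different regression dataset) is the California housing dataset, which scikit-learn downloads and caches on first use; the rest of the post works the same way.

# alternative data loading if load_boston is unavailable (scikit-learn 1.2+)
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X = data.data    # feature matrix
y = data.target  # regression target (median house value)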

 

2. Load the model parameters to be tested using hyperparameter tuning with Grid Search CV

In this step, we set up the candidate values we want to test. As opposed to Random Search hyperparameter tuning, we set a fixed list of values for our model's hyperparameters. Grid Search will test every single combination of these values and keep the parameters that work best.

 

For example, in our case below, say that we set the following hyperparameters:

  • number of estimators: [10, 20]
  • minimum samples split: [0.1, 0.3]

 

When doing Grid Search, the model will first train and validate (test) using n_estimators = 10 and min_samples_split = 0.1, leaving the rest at their defaults. It will then test n_estimators = 10 with min_samples_split = 0.3, then 20 with 0.1, and finally 20 with 0.3. Out of those combinations, we keep the model with the best result. For example, if n_estimators = 10 and min_samples_split = 0.3 yields the highest score, then you can use those values to predict on your test set.
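To make that enumeration concrete, here is a small sketch using scikit-learn's ParameterGrid (the toy grid below mirrors the two-hyperparameter example above, not the full grid we tune later):

# list every combination Grid Search would evaluate for the toy example above
from sklearn.model_selection import ParameterGrid

toy_grid = {'n_estimators': [10, 20], 'min_samples_split': [0.1, 0.3]}
for params in ParameterGrid(toy_grid):
    print(params)  # prints the 4 combinations, e.g. {'min_samples_split': 0.1, 'n_estimators': 10}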

 

To know which hyperparameters are available to tune, refer to the sklearn documentation for a particular model or, if your IDE allows it, use the tooltip/context window, which gives you more information about a function.

# Load the model parameters to be tested
model_params = {
    'n_estimators': [50, 100, 150],
    'max_features': ['sqrt', 0.3, 0.6, 0.9, 1.0],
    'min_samples_split': [0.1, 0.3, 0.6]
}
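If you are unsure which keys are valid in this dictionary, a quick way to check (besides the documentation) is to print every hyperparameter the estimator exposes:

# print all hyperparameter names accepted by RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

print(sorted(RandomForestRegressor().get_params().keys()))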

 

3. Create, initialize and test your model

Now that we have arbitrarily set the ranges of the hyperparameters we want to test, we can finally perform the hyperparameter tuning itself. We will use 5-fold cross-validation. 

As a reminder, K-fold cross-validation is a resampling technique that ensures every data point from the original dataset gets the chance to appear in both the training and the validation set. We then fit the Grid Search meta-estimator to our dataset using its built-in fit method. 
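As a small illustration of the splitting itself (separate from the grid search code, using dummy data), KFold from sklearn.model_selection shows how each sample rotates between the training and validation folds:

# illustrate how 5-fold cross-validation rotates samples between train and validation
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # 10 dummy samples
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(toy_X)):
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")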

 

[Figure: k-fold cross-validation]

 

# create random forest regressor model
rf_model = RandomForestRegressor()

# set up the Grid Search meta-estimator
# this will evaluate the 45 parameter combinations over 5 folds of cross-validation (225 fits in total)
clf = GridSearchCV(rf_model, model_params, cv=5)

# train the grid search meta-estimator to find the best model out of the 45 candidates
model = clf.fit(X, y)

Now that we have fitted the model, let us see what the best hyperparameters look like.

print(model.best_estimator_.get_params())
{'bootstrap': True, 
'ccp_alpha': 0.0,
'criterion': 'mse',
'max_depth': None,
'max_features': 0.6,
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_impurity_split': None,
'min_samples_leaf': 1,
'min_samples_split': 0.1,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 50,
'n_jobs': None,
'oob_score': False,
'random_state': None,
'verbose': 0,
'warm_start': False}
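In practice, you usually only need the subset of parameters you actually tuned and the corresponding cross-validated score. Both are exposed directly on the fitted GridSearchCV object, and since refitting is enabled by default, best_estimator_ has already been retrained on the full dataset and can predict straight away:

# only the tuned parameters and their mean cross-validated score (R squared by default for a regressor)
print(model.best_params_)
print(model.best_score_)

# best_estimator_ is refit on the whole dataset, so it can be used for prediction directly
predictions = model.best_estimator_.predict(X)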

That is it. As you can see above, there are other parameters that we could have tuned for this model (e.g., min_samples_leaf, max_depth, etc.). Which ones you choose is up to you and depends on the data, the type of project, and your knowledge of the algorithm. The more you understand an algorithm, the easier it becomes, to some extent, to anticipate the effect of a parameter on the model. This ability comes with experience.

 

Why is Grid Search Hyperparameter tuning important?

Grid Search is different from Random Search. Random Search works well when you need to discover potential hyperparameters that could suit a particular model. Indeed, Random Search randomly samples hyperparameter values from a distribution, which means the algorithm can efficiently pick and test different values (within a range) for the hyperparameters in a fixed number of steps. 

 

Grid Search, however, works much better for targeted verification of specific values (that you set manually). This specificity and ability to “spot check” is particularly important in domains where the accuracy of your model is critical (medicine, fault detection, fraud detection, etc.).

 

In fact, it may be a good idea to combine Random Search and Grid Search to maximize your chances of improving the model you work with. For example, say that you ran a Random Search CV whose best hyperparameters came out as param_a = 10 and param_b = 2. You can then test whether the values around those best values improve your model further. Hence, you can set up a grid that tests, for example, param_a = [8, 10, 12] and param_b = [1, 2, 3].
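Here is a minimal sketch of that two-step workflow, reusing rf_model, X, and y from earlier; the ranges below are purely illustrative:

# step 1: Random Search over wide, illustrative ranges to find a promising region
from sklearn.model_selection import RandomizedSearchCV

wide_params = {'n_estimators': list(range(50, 300, 50)), 'max_depth': list(range(2, 20))}
rand_search = RandomizedSearchCV(rf_model, wide_params, n_iter=20, cv=5, random_state=42)
rand_search.fit(X, y)
best_n = rand_search.best_params_['n_estimators']
best_depth = rand_search.best_params_['max_depth']

# step 2: Grid Search over the neighbours of the Random Search winners
narrow_params = {'n_estimators': [best_n - 10, best_n, best_n + 10],
                 'max_depth': [best_depth - 1, best_depth, best_depth + 1]}
refined = GridSearchCV(rf_model, narrow_params, cv=5)
refined.fit(X, y)
print(refined.best_params_)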

 

After running the above grid of hyperparameters, you may end up with param_a = 12 and param_b = 2 as the best values. In that case, checking the values adjacent to the Random Search result with Grid Search allowed you to improve your model. Additionally, it is highly advisable to use Grid Search to test a few hyperparameter values that are known to improve a model.

 

Understandably, Grid Search does have some major drawbacks. For starters, it is a very expensive algorithm to run. As you may have noticed, every parameter value you add gets trained against every combination of all the other hyperparameter values you are tuning, so the number of fits grows exponentially with the number of hyperparameters. Additionally, we used 5-fold cross-validation, meaning that each combination is fitted 5 times. Consequently, the search can take an impractically long time if the CV or the grid is large, making it unfeasible in some projects.
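To estimate that cost before launching a search, you can count the combinations with ParameterGrid and multiply by the number of folds; passing n_jobs=-1 to GridSearchCV at least spreads the fits across all CPU cores:

# estimate the number of fits before running the search
from sklearn.model_selection import ParameterGrid

n_candidates = len(ParameterGrid(model_params))  # 3 * 5 * 3 = 45 combinations
print(n_candidates * 5)                          # 225 fits with 5-fold cross-validation

# run the fits in parallel across all CPU cores
clf_parallel = GridSearchCV(rf_model, model_params, cv=5, n_jobs=-1)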

 

Finally, as I mentioned at the beginning, Grid Search is great at improving a model. In my experience, though, you generally already get good results by simply using Random Search. You can check out more about how to use Random Search in the preceding link.

 

 

If you made it this far in the article, thank you very much.

 

I hope this information was of use to you. 

 

Feel free to use any information from this page. I’d appreciate it if you could simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam. 

 


 

 
