[PYTHON] Let's tune the model hyperparameters with scikit-learn!

What is hyperparameter tuning?

Some parameters must be fixed in advance, before the model is trained, and which ones depends on the model. (For example, the number of clusters in k-means, the strength of the regularization term in an SVC, or the depth of a decision tree.)

These are called "hyperparameters". The trouble is that, even for the same model, the accuracy can change significantly depending on their values.

Hyperparameter tuning means choosing good values for them using the training data!

Grid search and random search

Of the available tuning methods, we cover two: grid search and random search. Roughly speaking, for a hyperparameter α they work as follows.

・ For grid search, specify the range of candidate values for α in advance (e.g. 0, 1, 2, 3, 4, 5), measure the accuracy of the model for each value, and adopt the value that gives the best accuracy.

・ For random search, specify the distribution that α follows in advance (e.g. a normal distribution with mean 0 and standard deviation 1), draw values from it at random, measure the accuracy of the model for each value, and adopt the one that gives the best accuracy.

As described above, neither method decides the hyperparameter α directly. Instead, you first decide a range or a distribution, and then let the training data pick the value. (See the References for more details!)
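The two procedures can be sketched in plain Python. This is only a toy illustration: the `score` function is a made-up stand-in for "train the model with α and measure its accuracy", and the function names and the seed are my own assumptions.

```python
import random

def score(alpha):
    # Hypothetical stand-in for "train the model with alpha and measure accuracy".
    return 1.0 - 0.05 * (alpha - 0.7) ** 2

def grid_search(candidates):
    # Evaluate every value in the predefined range and keep the best one.
    return max(candidates, key=score)

def random_search(n_iter, seed=0):
    # Draw n_iter values from a normal distribution (mean 0, std 1)
    # and keep the best-scoring one.
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n_iter)]
    return max(samples, key=score)

best_grid = grid_search([0, 1, 2, 3, 4, 5])
best_random = random_search(n_iter=50)
print("grid search picked   alpha =", best_grid)
print("random search picked alpha =", round(best_random, 3))
```

Grid search can only return one of the listed values, while random search can land anywhere the distribution puts probability mass.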

Python code

Both methods are built into scikit-learn, so we will use them! The code targets Python 3.5.1 and scikit-learn 0.18.1.

This time, we take data from the UCI Machine Learning Repository and tune the parameters of a RandomForestClassifier on a two-class classification problem. The full code has been uploaded to GitHub.

STEP1 Download data from UCI repository

Grid_and_Random_Search.ipynb


 import pandas as pd

 df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases'
                  '/breast-cancer-wisconsin/wdbc.data', header=None)

For clarity, name the column to be predicted Target and the other columns a0, a2, a3, and so on.


 columns_list = [] 
 for i in range(df.shape[1]):
     columns_list.append("a%d"%i) 
 columns_list[1] = "Target" 
 df.columns = columns_list

STEP2 Divide the data


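 y = df["Target"].values                 # diagnosis labels: "M" (malignant) or "B" (benign)
 X = df.drop(["a0","Target"],axis=1)     # a0 is the sample ID, so it is not used as a feature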

Split the data into training data and test data.


 #split X,y into train and test (0.5:0.5)
 #train_test_split lives in sklearn.model_selection since scikit-learn 0.18
 from sklearn.model_selection import train_test_split

 X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.5,random_state=2017)

STEP3 Check the accuracy of the model in the default state.


 from sklearn.ensemble import RandomForestClassifier
 from sklearn.metrics import classification_report

 def model_check(model):
     """Fit the model on the training split and print a report for both splits."""
     model.fit(X_train,y_train)
     train_report = classification_report(y_train,model.predict(X_train))
     test_report  = classification_report(y_test,model.predict(X_test))

     print("""【{model_name}】\n Train Accuracy: \n{train}
           \n Test Accuracy:  \n{test}""".format(model_name=model.__class__.__name__, train=train_report, test=test_report))

 #model_check prints the reports itself and returns None, so call it directly
 model_check(RandomForestClassifier())

Output result 1(Default)


    [RandomForestClassifier]
     Train Accuracy: 
                 precision    recall  f1-score   support

              B       1.00      1.00      1.00        67
              M       1.00      1.00      1.00        75

    avg / total       1.00      1.00      1.00       142


     Test Accuracy:  
                 precision    recall  f1-score   support

              B       0.89      0.93      0.91        72
              M       0.93      0.89      0.91        70

    avg / total       0.91      0.91      0.91       142

The accuracy turned out to be 1.00 on the training data and 0.91 on the test data. From here on, we implement grid search and random search, following reference 3.
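The report above prints precision, recall, and f1-score rather than a single accuracy figure. The minimal sketch below uses made-up labels (not the article's data) to illustrate why the averaged recall can be read as the accuracy. I am assuming here that the avg / total row is a support-weighted average; the function names are my own.

```python
def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    # Recall of each class, averaged with weights equal to the class support.
    total = 0.0
    for label in set(y_true):
        support = sum(t == label for t in y_true)
        hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        total += (hits / support) * support
    return total / len(y_true)

# Hypothetical labels, for illustration only.
y_true = ["B", "B", "B", "B", "M", "M", "M", "M", "M", "M"]
y_pred = ["B", "B", "B", "M", "M", "M", "M", "M", "B", "M"]

acc = accuracy(y_true, y_pred)
rec = weighted_recall(y_true, y_pred)
print(acc, rec)
```

Each class contributes its number of correctly classified samples, so the weighted sum is just the total number of correct predictions divided by the sample count.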

STEP4 grid search


 #Grid search

 #GridSearchCV lives in sklearn.model_selection since scikit-learn 0.18
 from sklearn.model_selection import GridSearchCV

 # use a full grid over all parameters (3*6*3*3*3*2*2 = 1944 combinations)
 param_grid = {"max_depth": [2,3, None],
               "n_estimators":[50,100,200,300,400,500],
               "max_features": [1, 3, 10],
               "min_samples_split": [2, 3, 10],
               "min_samples_leaf": [1, 3, 10],
               "bootstrap": [True, False],
               "criterion": ["gini", "entropy"]}

 forest_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                  param_grid = param_grid,   
                  scoring="accuracy",  #metric to maximize
                  cv = 3,              #3-fold cross-validation
                  n_jobs = 1)          #number of CPU cores to use

 forest_grid.fit(X_train,y_train) #fit

 forest_grid_best = forest_grid.best_estimator_ #best estimator
 print("Best Model Parameter: ",forest_grid.best_params_)
 model_check(forest_grid_best) #print the train/test reports for the tuned model

Output result 2(Grid search)


    [RandomForestClassifier]
     Train Accuracy: 
                 precision    recall  f1-score   support

              B       0.99      0.99      0.99        67
              M       0.99      0.99      0.99        75

    avg / total       0.99      0.99      0.99       142


     Test Accuracy:  
                 precision    recall  f1-score   support

              B       0.96      0.89      0.92        72
              M       0.89      0.96      0.92        70

    avg / total       0.92      0.92      0.92       142

On the test data, both the overall accuracy and the average f1-score increased, from 0.91 to 0.92!
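The `cv = 3` setting above means each parameter combination is scored by 3-fold cross-validation on the training data. The plain-Python sketch below shows the basic idea of splitting sample indices into folds. It is a simplification under my own assumptions: the helper name is hypothetical, and scikit-learn uses a stratified variant for classifiers, which this sketch ignores.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds.

    Yields (train_indices, validation_indices) pairs, one per fold.
    """
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once, and the score of a parameter combination is the average over the folds.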

STEP5 Random search

Random search uses scipy to represent the distributions that the parameters follow. This time, the number of iterations is set to 1944, the same as the number of combinations tried by the grid search.
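As a quick check of that iteration count, the sketch below enumerates the grid from STEP4 with `itertools.product`. The variable names `n_candidates` and `n_fits` are my own, and the fit count assumes 3-fold cross-validation without counting the final refit of the best model.

```python
from itertools import product

# Same grid as in STEP4.
param_grid = {"max_depth": [2, 3, None],
              "n_estimators": [50, 100, 200, 300, 400, 500],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# One candidate per element of the Cartesian product of all value lists.
combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_fits = n_candidates * 3  # each candidate is trained once per fold

print("candidates:", n_candidates)
print("cross-validation fits:", n_fits)
```

Adding one more value to any list multiplies the total, which is why full grids get expensive quickly.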


#Random search
#RandomizedSearchCV lives in sklearn.model_selection since scikit-learn 0.18
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

param_dist = {"max_depth": [3, None],                  #lists are sampled uniformly
              "n_estimators":[50,100,200,300,400,500],
              "max_features": sp_randint(1, 11),       #integers 1..10
              "min_samples_split": sp_randint(2, 11),  #integers 2..10
              "min_samples_leaf": sp_randint(1, 11),   #integers 1..10
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

forest_random = RandomizedSearchCV( estimator=RandomForestClassifier( random_state=0 ),
                                    param_distributions=param_dist,
                                    cv=3,              #3-fold cross-validation
                                    n_iter=1944,          #number of sampled candidates
                                    scoring="accuracy", #metric to maximize
                                    n_jobs=1,           #number of CPU cores to use
                                    verbose=0,          
                                    random_state=1)

#fit on the training split only, so that the test data stays unseen during tuning
forest_random.fit(X_train,y_train)
forest_random_best = forest_random.best_estimator_ #best estimator
print("Best Model Parameter: ",forest_random.best_params_)
model_check(forest_random_best) #print the train/test reports for the tuned model

Output result 3(Random search)


    [RandomForestClassifier]
     Train Accuracy: 
                 precision    recall  f1-score   support

              B       1.00      1.00      1.00        67
              M       1.00      1.00      1.00        75

    avg / total       1.00      1.00      1.00       142


     Test Accuracy:  
                 precision    recall  f1-score   support

              B       0.94      0.92      0.93        72
              M       0.92      0.94      0.93        70

    avg / total       0.93      0.93      0.93       142

Compared to the default case, the average precision, recall, and f1-score on the test data each rose by 2 points (from 0.91 to 0.93)!
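The `sp_randint(1, 11)` objects used in STEP5 are frozen scipy distributions. The short sketch below shows what such a distribution produces when sampled. The sample size and seed are arbitrary choices of mine.

```python
from scipy.stats import randint as sp_randint

# Uniform over the integers 1, 2, ..., 10 (the upper bound 11 is excluded).
dist = sp_randint(1, 11)

samples = dist.rvs(size=1000, random_state=0)
print("smallest sample:", samples.min())
print("largest sample: ", samples.max())
print("P(value = 5):   ", dist.pmf(5))
print("P(value = 11):  ", dist.pmf(11))
```

Because the upper bound is exclusive, `sp_randint(2, 11)` for `min_samples_split` never proposes a value above 10.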

Summary

Both grid search and random search improved the accuracy! However, I think the effect is hard to see here because I picked a dataset on which the accuracy was already high. The effect of tuning may be easier to see on data where the default model is less accurate.

The full code has been uploaded to github.

References

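  1. Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281-305.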
  2. http://qiita.com/SE96UoC5AfUt7uY/items/c81f7cea72a44a7bfd3a
  3. http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html
