[Python] Examine the parameters of RandomForestClassifier in the Kaggle / Titanic tutorial

Introduction

In the Kaggle / Titanic tutorial, the model is trained with RandomForestClassifier(). Let's adjust its parameters and see whether the accuracy improves.

Data preparation

import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt

train_data = pd.read_csv("../train.csv")

from sklearn.model_selection import train_test_split

train_data_orig = train_data
train_data, cv_data = train_test_split(train_data_orig, test_size=0.3, random_state=1)

We used train_test_split to split the data into train : cv = 7 : 3 (cv: cross-validation set).
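
One caveat (my addition, not in the original tutorial): with a plain random split, the survival ratio can differ slightly between the two parts. Passing stratify keeps it identical in both; the rest of this article uses the plain split above.

# Sketch: stratified variant of the split (not used below)
train_data, cv_data = train_test_split(train_data_orig, test_size=0.3, random_state=1,
                                       stratify=train_data_orig["Survived"])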

train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 623 entries, 114 to 37
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  623 non-null    int64  
 1   Survived     623 non-null    int64  
 2   Pclass       623 non-null    int64  
 3   Name         623 non-null    object 
 4   Sex          623 non-null    object 
 5   Age          496 non-null    float64
 6   SibSp        623 non-null    int64  
 7   Parch        623 non-null    int64  
 8   Ticket       623 non-null    object 
 9   Fare         623 non-null    float64
 10  Cabin        135 non-null    object 
 11  Embarked     622 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 63.3+ KB

cv_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 862 to 92
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  268 non-null    int64  
 1   Survived     268 non-null    int64  
 2   Pclass       268 non-null    int64  
 3   Name         268 non-null    object 
 4   Sex          268 non-null    object 
 5   Age          218 non-null    float64
 6   SibSp        268 non-null    int64  
 7   Parch        268 non-null    int64  
 8   Ticket       268 non-null    object 
 9   Fare         268 non-null    float64
 10  Cabin        69 non-null     object 
 11  Embarked     267 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 27.2+ KB

There are 623 train rows and 268 cv rows, 891 in total.

Learn according to the tutorial

from sklearn.ensemble import RandomForestClassifier

features = ["Pclass", "Sex", "SibSp", "Parch"]

y = train_data["Survived"]
y_cv = cv_data["Survived"]
X = pd.get_dummies(train_data[features])
X_cv = pd.get_dummies(cv_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1, max_features="auto")
model.fit(X, y)
predictions = model.predict(X_cv)

print('Train score: {}'.format(model.score(X, y)))
print('CV score: {}'.format(model.score(X_cv, y_cv)))

Train score: 0.8394863563402889
CV score: 0.753731343283582

The train accuracy is about 84%, but the cv accuracy is only about 75%. Is it overfitting?

Manually change the parameters of RandomForestClassifier

n_estimators

Try changing the value of n_estimators.

rfc_results = pd.DataFrame(columns=["train", "cv"])

for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=1, max_features="auto")
    model.fit(X, y)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)

rfc_results

        train        cv
1    0.826645  0.753731
10   0.833066  0.753731
100  0.839486  0.753731

As the number of decision trees increases, the train score increases slightly, but the cv score does not change.
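
As an aside (my addition, not part of the tutorial): a random forest can also estimate its generalization accuracy without a held-out set, using the out-of-bag (OOB) samples that each tree never sees during bagging. A minimal sketch, reusing X and y from above:

# Sketch: out-of-bag estimate as a cross-check on the cv score
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1,
                               oob_score=True)  # keep per-tree out-of-bag predictions
model.fit(X, y)
print("OOB score: {}".format(model.oob_score_))  # accuracy on out-of-bag samples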

max_depth

Try changing the value of max_depth.

max_depth = 2

for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=2, random_state=1, max_features="auto")
    model.fit(X, y)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)

rfc_results

        train        cv
1    0.813804  0.731343
10   0.818620  0.753731
100  0.817014  0.761194

The cv score improved to about 76%.

max_depth = 3

for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=3, random_state=1, max_features="auto")
    model.fit(X, y)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)

rfc_results

        train        cv
1    0.818620  0.753731
10   0.825040  0.776119
100  0.825040  0.768657

I got a cv score of 77.6%.

max_depth = 4

for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=4, random_state=1, max_features="auto")
    model.fit(X, y)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)

rfc_results

        train        cv
1    0.823435  0.764925
10   0.828250  0.761194
100  0.826645  0.764925

The cv score is about 76.5%.

From the above, max_depth=3 with n_estimators=10 gave the highest cv score.
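
matplotlib was imported at the top but never used, so as a small sketch (my addition), the rfc_results table from the last run can be plotted to eyeball the gap between the train and cv scores:

# Sketch: visualize the train/cv gap for the runs stored in rfc_results
ax = rfc_results.plot(logx=True, marker="o")  # index holds n_estimators
ax.set_xlabel("n_estimators")
ax.set_ylabel("accuracy")
plt.show()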

Automatically change the parameters of RandomForestClassifier

Let's find the best parameters with grid search (GridSearchCV). This method tries every combination of the listed parameter values and keeps the one with the best cross-validated score.

from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 3, 4, 5, None],
              "n_estimators": [1, 3, 10, 30, 100],
              "max_features": ["auto", None]}

model_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=1),
                          param_grid=param_grid,
                          scoring="accuracy",  # evaluation metric
                          cv=3,                # number of cross-validation folds
                          n_jobs=1)            # number of CPU cores to use

model_grid.fit(X, y)

model_grid_best = model_grid.best_estimator_  # estimator refit with the best parameters
print("Best Model Parameter: ", model_grid.best_params_)

Note the import line. Some articles on the net write from sklearn.grid_search import GridSearchCV, but that did not work for me: the sklearn.grid_search module was removed in scikit-learn 0.20, and GridSearchCV now lives in sklearn.model_selection.

Best Model Parameter:  {'max_depth': 3, 'max_features': 'auto', 'n_estimators': 10}

As in the manual runs, max_depth=3 and n_estimators=10 were the best. I also tried two settings of max_features, and "auto" won.

print('Train score: {}'.format(model.score(X, y)))
print('CV score: {}'.format(model.score(X_cv, y_cv)))

Train score: 0.826645264847512
CV score: 0.7649253731343284

Note that model here still holds the last manually trained classifier (max_depth=4, n_estimators=100), which is why these scores match that run; scoring model_grid_best instead reproduces the max_depth=3, n_estimators=10 row above (train 0.825040, cv 0.776119).
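
GridSearchCV also records every combination it tried in its cv_results_ attribute. As a small sketch (my addition), loading it into a DataFrame makes it easy to see the runner-up settings:

# Sketch: list all 50 tried parameter combinations, best cv score first
grid_results = pd.DataFrame(model_grid.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(grid_results[cols].sort_values("rank_test_score").head(10))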

Submit to Kaggle / Titanic

I generated predictions with these parameters and submitted them to Kaggle. However, the accuracy was 0.77751, the same as with the tutorial's parameters. Too bad.
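
For reference, a minimal sketch of the submission step, assuming the competition's test file sits next to train.csv as ../test.csv (the path is my assumption):

# Sketch: predict with the grid-search winner and write the submission file
test_data = pd.read_csv("../test.csv")  # assumed path, mirroring ../train.csv
X_test = pd.get_dummies(test_data[features])
output = pd.DataFrame({"PassengerId": test_data["PassengerId"],
                       "Survived": model_grid_best.predict(X_test)})
output.to_csv("my_submission.csv", index=False)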

Conclusion

Even before trying to improve the features, we were able to improve the score a little just by tuning the hyperparameters. Next, I would like to work on the features.

References

- Scikit-learn splits data for training and testing: train_test_split
- Add columns and rows to pandas.DataFrame (assign, append, etc.)
- Let's tune the model hyperparameters with scikit-learn!
- Scikit-learn's GridSearchCV for hyperparameter search
- What to do when sklearn grid search cannot be used in Python
