[Python] Random Forest (implementation / parameter summary)

Introduction

This article summarizes the implementation and parameters of Random Forest.

What is Random Forest?

A model that combines multiple decision trees to improve prediction performance.

The learning flow is as follows:

① Prepare multiple decision tree models.
② For each decision tree, randomly draw the same number of samples from the original training data, allowing duplicates (subtly varying the training data per tree adds variety to what the trees learn).
③ Produce the final answer from the predictions of all the trees. Classification model → majority vote. Regression model → mean.
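A minimal sketch of this flow, built by hand with scikit-learn decision trees on the iris dataset (illustrative only; names like n_trees are my own):

python.py

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# ① Prepare multiple decision tree models
n_trees = 10
trees = []
for i in range(n_trees):
    # ② Bootstrap: draw the same number of rows, allowing duplicates
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# ③ Final answer by majority vote (classification)
preds = np.array([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)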

Random forest features

Random forest is an ensemble learning method classified as bagging.

Bagging

Different training sets are drawn by resampling with replacement (the bootstrap method) and used to build multiple different models (weak learners). The outputs of these models are then combined, by averaging or voting, into the final model.
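scikit-learn also provides bagging directly; a minimal sketch with BaggingClassifier, whose default base learner is a decision tree (the dataset here is just for illustration):

python.py

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 10 models is trained on a bootstrap sample of the data;
# the default base learner of BaggingClassifier is a decision tree
bag = BaggingClassifier(n_estimators=10, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())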

Implementation

This time, we will work with the [SIGNATE] automobile evaluation competition. Link below. https://signate.jp/competitions/122

Data preprocessing

Read the data and convert the string values to numeric values.

python.py


import pandas as pd
import numpy as np

# Read the tab-separated data and drop the id column
df = pd.read_csv('train.tsv', delimiter='\t')
df = df.drop('id', axis=1)

# Explanatory variables: map each category string to an integer
df = df.replace({'buying': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'maint': {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}})
df = df.replace({'doors': {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6}})
df = df.replace({'persons': {'2': 2, '4': 4, 'more': 6}})
df = df.replace({'lug_boot': {'small': 1, 'med': 2, 'big': 3}})
df = df.replace({'safety': {'low': 1, 'med': 2, 'high': 3}})

# Objective variable
df = df.replace({'class': {'unacc': 1, 'acc': 2, 'good': 3, 'vgood': 4}}) 
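The repeated replace calls can also be collapsed into one nested dict, which pandas' replace accepts as well; an equivalent sketch:

python.py

# Equivalent one-shot version of the replacements above
mapping = {
    'buying':   {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4},
    'maint':    {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4},
    'doors':    {'2': 2, '3': 3, '4': 4, '5': 5, '5more': 6},
    'persons':  {'2': 2, '4': 4, 'more': 6},
    'lug_boot': {'small': 1, 'med': 2, 'big': 3},
    'safety':   {'low': 1, 'med': 2, 'high': 3},
    'class':    {'unacc': 1, 'acc': 2, 'good': 3, 'vgood': 4},
}
df = df.replace(mapping)
print(df.dtypes)  # confirm every column is now numeric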

Split the data into training data and evaluation data.

python.py


from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state=0)

# Split the training data into explanatory variables (X_train) and the objective variable (y_train)
X_train = train_set.drop('class', axis=1)
y_train = train_set['class']

# Split the evaluation data into explanatory variables (X_test) and the objective variable (y_test)
X_test = test_set.drop('class', axis=1)
y_test = test_set['class']

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)


(691, 6)
(173, 6)
(691,)
(173,)
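Since the class distribution in this dataset is skewed (as the support column in the report below shows), a stratified split is worth considering; a variant of the split above, not used in the results that follow:

python.py

# Optional: a stratified split keeps the class ratio the same in train and test
train_set, test_set = train_test_split(df, test_size=0.2, random_state=0,
                                       stratify=df['class'])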

Random forest implementation

python.py


# Random forest
from sklearn.ensemble import RandomForestClassifier
# Evaluation metrics
from sklearn import metrics

model = RandomForestClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)

print(metrics.classification_report(y_test, pred))


              precision    recall  f1-score   support

           1       0.97      0.96      0.97       114
           2       0.84      0.88      0.86        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.92       173
   macro avg       0.85      0.85      0.85       173
weighted avg       0.92      0.92      0.92       173

The accuracy is 92%. Next, let's tune the parameters.
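Before tuning, it can also help to see which classes are being confused; a minimal sketch using the metrics module already imported:

python.py

# Rows are the true classes (1-4), columns are the predicted classes
print(metrics.confusion_matrix(y_test, pred))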

Parameter overview

Here are the most important parameters to tune (a short sketch follows the list).

①n_estimators

The number of decision trees. Specify an integer (default: 100).

②criterion

The measure used to choose splits within each decision tree. 'gini': Gini impurity (default). 'entropy': entropy-based information gain.

③max_depth

The maximum depth of each decision tree. Specify an integer or None (default: None). This parameter is important for suppressing overfitting: a small value tends to lower accuracy, while a large value raises accuracy but makes overfitting more likely.

④min_samples_split

The minimum number of samples required to split a node (a node with fewer samples than this value is not split further). Specify an integer or a fraction (default: 2). In general, too small a value makes the model prone to overfitting.

⑤max_leaf_nodes

The maximum number of leaf nodes in each decision tree. Specify an integer or None (default: None).

⑥min_samples_leaf

The minimum number of samples required at a leaf node after a split. Specify an integer or a fraction (default: 1).
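As referenced above, here is a sketch showing where each parameter goes; the values shown are the scikit-learn defaults, not tuned values:

python.py

# Defaults written out explicitly -- the comments match the numbering above
model_example = RandomForestClassifier(
    n_estimators=100,      # ① number of decision trees
    criterion='gini',      # ② split quality measure ('gini' or 'entropy')
    max_depth=None,        # ③ maximum depth of each tree
    min_samples_split=2,   # ④ minimum samples needed to split a node
    max_leaf_nodes=None,   # ⑤ maximum number of leaf nodes
    min_samples_leaf=1,    # ⑥ minimum samples required at a leaf
)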

Parameter tuning with grid search

python.py


# Grid search
from sklearn.model_selection import GridSearchCV

# Parameter candidates to search
search_gs = {
    "max_depth": [None, 5, 25],
    "n_estimators": [150, 180],
    "min_samples_split": [4, 8, 12],
    "max_leaf_nodes": [None, 10, 30],
}

model_gs = RandomForestClassifier()
# Grid search settings (the old iid argument was removed from scikit-learn, so it is not passed)
gs = GridSearchCV(model_gs,
                  search_gs,
                  cv=5)
# Fit the model
gs.fit(X_train, y_train)
# Display the best parameters found
print(gs.best_params_)

{'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 4, 'n_estimators': 180}
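GridSearchCV refits the best combination on the full training data by default, so the tuned model can also be taken directly from the search object (the next section re-trains explicitly with the same settings); a short sketch:

python.py

# Best cross-validation score and predictions from the refit best model
print(gs.best_score_)
pred_best = gs.best_estimator_.predict(X_test)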

Check the result

python.py


clf_rand = RandomForestClassifier(max_depth=None,
                                  max_leaf_nodes=None,
                                  min_samples_split=4,
                                  n_estimators=180)
model_rand = clf_rand.fit(X_train, y_train)
pred_rand = model_rand.predict(X_test)

print(metrics.classification_report(y_test, pred_rand))



              precision    recall  f1-score   support

           1       1.00      0.97      0.99       114
           2       0.87      0.95      0.91        42
           3       0.71      0.56      0.63         9
           4       0.89      1.00      0.94         8

    accuracy                           0.95       173
   macro avg       0.87      0.87      0.87       173
weighted avg       0.95      0.95      0.95       173

Conclusion

The accuracy improved from 92% to 95%!

Parameter tuning is important, but to push accuracy even higher, I think data preprocessing (feature engineering) becomes the key!
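As one starting point for that feature work, a trained random forest exposes impurity-based feature importances; a minimal sketch on the tuned model above:

python.py

# Which features the tuned forest relied on most (impurity-based importances)
importances = pd.Series(model_rand.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))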
