In the Kaggle Titanic tutorial, the model is a RandomForestClassifier(). Let's adjust its parameters and see whether the accuracy improves.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the Kaggle Titanic training data
train_data = pd.read_csv("../train.csv")

# Keep the original and split it into train : cv = 7 : 3
train_data_orig = train_data
train_data, cv_data = train_test_split(train_data_orig, test_size=0.3, random_state=1)
We used train_test_split to split the data into train : cv = 7 : 3 (cv: cross-validation).
train_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 623 entries, 114 to 37
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 623 non-null int64
1 Survived 623 non-null int64
2 Pclass 623 non-null int64
3 Name 623 non-null object
4 Sex 623 non-null object
5 Age 496 non-null float64
6 SibSp 623 non-null int64
7 Parch 623 non-null int64
8 Ticket 623 non-null object
9 Fare 623 non-null float64
10 Cabin 135 non-null object
11 Embarked 622 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 63.3+ KB
cv_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 862 to 92
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 268 non-null int64
1 Survived 268 non-null int64
2 Pclass 268 non-null int64
3 Name 268 non-null object
4 Sex 268 non-null object
5 Age 218 non-null float64
6 SibSp 268 non-null int64
7 Parch 268 non-null int64
8 Ticket 268 non-null object
9 Fare 268 non-null float64
10 Cabin 69 non-null object
11 Embarked 267 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 27.2+ KB
That gives 623 training rows and 268 cv rows, 891 in total.
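As a quick sanity check (a small sketch of my own, not in the tutorial), the row counts can be confirmed directly:

# The 7:3 split should give 623 + 268 = 891 rows in total
print(len(train_data), len(cv_data), len(train_data_orig))
# => 623 268 891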
from sklearn.ensemble import RandomForestClassifier

features = ["Pclass", "Sex", "SibSp", "Parch"]
y = train_data["Survived"]
y_cv = cv_data["Survived"]

# One-hot encode the categorical features (e.g. Sex -> Sex_female / Sex_male)
X = pd.get_dummies(train_data[features])
X_cv = pd.get_dummies(cv_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1, max_features="auto")
model.fit(X, y)
predictions = model.predict(X_cv)
print('Train score: {}'.format(model.score(X, y)))
print('CV score: {}'.format(model.score(X_cv, y_cv)))
Train score: 0.8394863563402889
CV score: 0.753731343283582
The train score is about 84%, but the cv score is only about 75%. Is it overfitting?
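A single 7:3 split can be noisy, so one way to double-check (a sketch of my own, reusing the X and y above; not part of the original post) is k-fold cross-validation with cross_val_score:

from sklearn.model_selection import cross_val_score

# Mean and spread of accuracy over 5 folds of the training portion
scores = cross_val_score(RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
                         X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())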
n_estimators
Try changing the value of n_estimators.
rfc_results = pd.DataFrame(columns=["train", "cv"])
for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.826645 | 0.753731 |
| 10 | 0.833066 | 0.753731 |
| 100 | 0.839486 | 0.753731 |
As the number of decision trees increases, the train score increases slightly, but the cv score does not change.
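Since matplotlib is already imported, the scores can also be plotted against n_estimators (a minimal sketch of my own; the log-scaled x-axis and the float cast are my choices):

# rfc_results has n_estimators as its index and "train"/"cv" columns;
# cast to float because values assigned via .loc may be stored as object dtype
rfc_results.astype(float).plot(marker="o", logx=True)
plt.xlabel("n_estimators")
plt.ylabel("accuracy")
plt.show()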
max_depth
Try changing the value of max_depth.
max_depth = 2
for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=2, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.813804 | 0.731343 |
| 10 | 0.81862 | 0.753731 |
| 100 | 0.817014 | 0.761194 |
I got a cv score of 76%.
max_depth = 3
for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=3, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.81862 | 0.753731 |
| 10 | 0.82504 | 0.776119 |
| 100 | 0.82504 | 0.768657 |
I got a cv score of 77.6%.
max_depth = 4
for n in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=n, max_depth=4, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[n] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.823435 | 0.764925 |
| 10 | 0.82825 | 0.761194 |
| 100 | 0.826645 | 0.764925 |
The cv score is about 76.5%.
From the above, max_depth = 3 with n_estimators = 10 gave the highest cv score.
Next, find the best parameters with grid search (GridSearchCV): a method that tries every combination of the listed parameter values and picks the best one.
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 3, 4, 5, None],
              "n_estimators": [1, 3, 10, 30, 100],
              "max_features": ["auto", None]}

model_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=1),
                          param_grid=param_grid,
                          scoring="accuracy",  # evaluation metric
                          cv=3,                # 3-fold cross-validation
                          n_jobs=1)            # number of CPU cores to use
model_grid.fit(X, y)

model_grid_best = model_grid.best_estimator_  # estimator refit with the best parameters
print("Best Model Parameter: ", model_grid.best_params_)
Note the import line. Some pages found on the net write from sklearn.grid_search import GridSearchCV, but that failed for me: the sklearn.grid_search module was removed in newer versions of scikit-learn, and GridSearchCV now lives in sklearn.model_selection.
Best Model Parameter: {'max_depth': 3, 'max_features': 'auto', 'n_estimators': 10}
As in the manual search, max_depth = 3 and n_estimators = 10 were the best. I also tried two settings of max_features, and "auto" was the better one.
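GridSearchCV also records the score of every combination in its cv_results_ attribute; here is a short sketch (my own addition) for inspecting the top candidates:

# Rank all parameter combinations by mean cross-validated accuracy
cv_results = pd.DataFrame(model_grid.cv_results_)
print(cv_results.sort_values("rank_test_score")[["params", "mean_test_score"]].head())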
print('Train score: {}'.format(model_grid_best.score(X, y)))
print('CV score: {}'.format(model_grid_best.score(X_cv, y_cv)))
Train score: 0.8250401284109149
CV score: 0.7761194029850746
I predicted the test results with these parameters and submitted them to Kaggle. However, the accuracy was 0.77751, the same as with the tutorial's original parameters. Oh well.
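For reference, the submission file can be built the same way as in the tutorial (a sketch, assuming test.csv sits next to train.csv):

# Predict on the Kaggle test set with the tuned model and write a submission file
test_data = pd.read_csv("../test.csv")
X_test = pd.get_dummies(test_data[features])
output = pd.DataFrame({"PassengerId": test_data.PassengerId,
                       "Survived": model_grid_best.predict(X_test)})
output.to_csv("submission.csv", index=False)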
So even before working on the features, we were able to improve a little just by tuning the hyperparameters. Next, I would like to look at the features themselves.
References

- Scikit-learn splits data for training and testing: train_test_split
- Add columns and rows to pandas.DataFrame (assign, append, etc.)
- Let's tune the model hyperparameters with scikit-learn!
- Scikit-learn's GridSearchCV for hyperparameter search
- What to do when sklearn grid search cannot be used in Python