[PYTHON] (Kaggle) Titanic Survivor Prediction Model: Assessing the Impact of Adjusting Random Forest Parameters

1 Introduction

As a machine learning tutorial, I'm revisiting the Titanic survivor prediction problem, which is a rite of passage for everyone learning the field.

Last time, we used a random forest to predict Titanic survival. This time, we evaluate how changing the parameters we used then affects the accuracy.

The full program is available here: https://qiita.com/Fumio-eisan/items/77339fc737a3d8cfe179

2 Assessing the impact of random forest parameters

A random forest is an algorithm that builds many decision trees and takes a majority vote of their predictions, which smooths out the overfitting of the individual trees.

Its features include:

    1. Overfitting is unlikely thanks to ensemble learning by majority vote
    2. No need to standardize or normalize the features
    3. Few hyperparameters (roughly, the number of samples and the number of features used per tree)
    4. You can see which features are important (see the sketch after this list)

And so on.
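As a supplement to the last point, here is a minimal sketch (my own addition, not part of the original article) of how a fitted RandomForestClassifier exposes per-feature importances through its feature_importances_ attribute. It assumes the train_X / train_y data prepared in the previous post, with train_X as a pandas DataFrame.

from sklearn.ensemble import RandomForestClassifier

# Fit a small forest on the training data prepared in the previous article
clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
clf.fit(train_X, train_y)

# feature_importances_ sums to 1.0; larger values mean the feature was used
# more often (and more effectively) for splitting
for name, importance in sorted(zip(train_X.columns, clf.feature_importances_),
                               key=lambda t: t[1], reverse=True):
    print(f'{name}: {importance:.3f}')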

It is sometimes called robust, "unfashionable" technology, in contrast to flashier techniques such as neural networks and deep learning. As a machine learning beginner, I wanted to actually run the code and get a feel for it.

Reference page: http://aiweeklynews.com/archives/50653819.html

2-1 About the influence of n_estimators


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, accuracy_score
import matplotlib.pyplot as plt

x = []
y = []
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50]:
    clf = RandomForestClassifier(
        n_estimators=i,
        max_depth=5,
        random_state=0)
    clf = clf.fit(train_X, train_y)
    pred = clf.predict(test_X)
    # ROC curve and AUC (computed for reference; only accuracy is plotted)
    fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
    auc(fpr, tpr)
    # Record the accuracy for each number of trees
    y.append(accuracy_score(test_y, pred))
    x.append(i)

plt.scatter(x, y)
plt.xlabel('n_estimators')
plt.ylabel('accuracy')
plt.show()

[Figure: accuracy vs. n_estimators (005.png)]

n_estimators is the number of decision trees. You can see that the accuracy saturates at around 10 trees. Raising it further to 50, the value actually decreased, which may be a sign of overfitting. https://amalog.hateblo.jp/entry/hyper-parameter-search
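The linked article is about hyperparameter search. As a hedged sketch (my own addition, not from the original post), the same sweep over n_estimators can also be done with scikit-learn's GridSearchCV, which scores each setting by cross-validation on the training data instead of a single hold-out split:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search over the number of trees with 5-fold cross-validation
param_grid = {'n_estimators': [1, 5, 10, 20, 30, 40, 50]}
search = GridSearchCV(
    RandomForestClassifier(max_depth=5, random_state=0),
    param_grid,
    scoring='accuracy',
    cv=5)
search.fit(train_X, train_y)

print(search.best_params_)  # e.g. {'n_estimators': 10}
print(search.best_score_)   # mean cross-validated accuracy of the best setting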

2-2 About the influence of criterion


x = []
y = []   # accuracy with the Gini criterion
z = []   # accuracy with the entropy criterion
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 100]:
    # Forest using Gini impurity as the split criterion
    clf = RandomForestClassifier(
        criterion='gini',
        n_estimators=i,
        max_depth=5,
        random_state=0)
    clf = clf.fit(train_X, train_y)
    pred = clf.predict(test_X)
    fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
    auc(fpr, tpr)
    y.append(accuracy_score(test_y, pred))

    # Forest using information entropy as the split criterion
    clf_e = RandomForestClassifier(
        criterion='entropy',
        n_estimators=i,
        max_depth=5,
        random_state=0)
    clf_e = clf_e.fit(train_X, train_y)
    pred_e = clf_e.predict(test_X)
    fpr, tpr, thresholds = roc_curve(test_y, pred_e, pos_label=1)
    auc(fpr, tpr)
    z.append(accuracy_score(test_y, pred_e))

    x.append(i)

plt.xlabel('n_estimators')
plt.ylabel('accuracy')
plt.plot(x, y, label="gini")
plt.plot(x, z, label="entropy")
plt.legend(bbox_to_anchor=(1, 1), loc='center right', borderaxespad=0, fontsize=18)
plt.show()

[Figure: accuracy vs. n_estimators for the gini and entropy criteria (006.png)]

Here we evaluated the difference between using Gini impurity and information entropy as the split criterion. As introduced last time, the Gini criterion is said to be suited to continuous (regression-like) data and information entropy to categorical data. With fewer than 10 trees the accuracy is almost identical; from 10 to 40 trees the Gini criterion scores higher, but beyond that information entropy has the advantage. I don't yet understand the reason for this, so it is homework for me.

http://data-analysis-stats.jp/2019/01/14/%E6%B1%BA%E5%AE%9A%E6%9C%A8%E5%88%86%E6%9E%90%E3%81%AE%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E8%A7%A3%E8%AA%AC/
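For reference, here is a minimal sketch (my own addition) of the two impurity measures for a vector of class probabilities p_i: Gini impurity is 1 - Σ p_i², and information entropy is -Σ p_i log2 p_i.

import numpy as np

def gini_impurity(p):
    """Gini impurity 1 - sum(p_i^2) for a vector of class probabilities."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Information entropy -sum(p_i * log2(p_i)), ignoring zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Example: a node with 60% survivors and 40% non-survivors
print(gini_impurity([0.6, 0.4]))  # 0.48
print(entropy([0.6, 0.4]))        # about 0.971

Both measures are zero for a pure node and largest when the classes are evenly mixed; the tree chooses splits that reduce the chosen measure the most.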

2-3 About the influence of random_state


from sklearn.ensemble import RandomForestClassifier

x = []
y = []   # accuracy with the Gini criterion
z = []   # accuracy with the entropy criterion
for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100]:
    # Only random_state is varied; n_estimators is fixed at 10
    clf = RandomForestClassifier(
        criterion='gini',
        n_estimators=10,
        max_depth=5,
        random_state=i)
    clf = clf.fit(train_X, train_y)
    pred = clf.predict(test_X)
    fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
    auc(fpr, tpr)
    y.append(accuracy_score(test_y, pred))

    clf_e = RandomForestClassifier(
        criterion='entropy',
        n_estimators=10,
        max_depth=5,
        random_state=i)
    clf_e = clf_e.fit(train_X, train_y)
    pred_e = clf_e.predict(test_X)
    fpr, tpr, thresholds = roc_curve(test_y, pred_e, pos_label=1)
    auc(fpr, tpr)
    z.append(accuracy_score(test_y, pred_e))

    x.append(i)

plt.xlabel('random_state')
plt.ylabel('accuracy')
plt.plot(x, y, label="gini")
plt.plot(x, z, label="entropy")
plt.legend(bbox_to_anchor=(1, 1), loc='center right', borderaxespad=0, fontsize=18)
plt.show()

[Figure: accuracy vs. random_state for the gini and entropy criteria (007.png)]

Next, I varied random_state. You can see that the accuracy is not particularly stable. Here too the behavior seems to settle down at around 10. In the first place, the purpose of this parameter is to ensure reproducibility.

https://teratail.com/questions/99054

> The same random numbers can be generated by specifying random_state. The purpose is to ensure reproducibility. Random numbers are needed because a subset of the data has to be randomly sampled to build each weak learner.
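As a small sketch of this point (my own addition, not from the quoted answer): two forests trained with the same random_state produce identical predictions, while a different seed changes the bootstrap samples and feature choices, so results can differ.

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Same data, same random_state -> identical forests and identical predictions
clf_a = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(train_X, train_y)
clf_b = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(train_X, train_y)
print(np.array_equal(clf_a.predict(test_X), clf_b.predict(test_X)))  # True

# A different seed can give different predictions (and a different accuracy)
clf_c = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=1).fit(train_X, train_y)
print(np.array_equal(clf_a.predict(test_X), clf_c.predict(test_X)))  # not necessarily True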

3 Summary

From this check, we found that n_estimators, the number of decision trees, has the largest effect on the accuracy. We confirmed that once the value is made reasonably large, the accuracy saturates. If training takes a lot of computation time, it would be worth adding logic to stop automatically once the accuracy saturates.
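One hedged way to implement that idea (my own sketch, not from the original article) is scikit-learn's warm_start option, which keeps the already-fitted trees and only adds new ones, so the forest can be grown gradually until the hold-out accuracy stops improving:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Grow the forest in steps of 5 trees; warm_start=True reuses the trees
# fitted so far instead of retraining from scratch each time
clf = RandomForestClassifier(n_estimators=5, max_depth=5,
                             warm_start=True, random_state=0)
best_acc = 0.0
for n in range(5, 105, 5):
    clf.n_estimators = n
    clf.fit(train_X, train_y)
    acc = accuracy_score(test_y, clf.predict(test_X))
    if acc <= best_acc:
        # Simplistic stopping rule (an assumption for this sketch):
        # stop as soon as adding trees no longer improves hold-out accuracy
        break
    best_acc = acc

print(clf.n_estimators, best_acc)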
