[Machine learning] Understanding random forest

1. Purpose

If you want to try machine learning, anyone can implement it relatively easily with scikit-learn and similar libraries. However, to achieve results at work, or to raise your own level, **an explanation like "I don't know the background, but I got this result" is clearly weak**.

In [Machine learning] Understanding decision trees from both scikit-learn and mathematics, which I posted last time, I described decision trees in detail. This time I will summarize the random forest, which is used more often in practical work and in competitions such as Kaggle.

Unlike usual, I will not go into the mathematics this time. My own understanding used to stop at **"a random forest is a combination of decision trees"**, so I organized the topic for myself. **The purpose this time is to help you understand "what a random forest is" and "what to do for parameter tuning" while keeping the background in mind**.

Also, this time I referred to O'Reilly's [Machine Learning Starting with Python](https://www.amazon.co.jp/dp/4873117984) by Andreas C. Müller.

2. Ensemble learning and random forest

To understand random forests, let's first touch on ensemble learning.

(1) What is ensemble learning?

Ensemble learning is ** a way to build more powerful models by combining multiple machine learning models **.

There are many machine learning models such as "logistic regression", "SVM", and "decision tree", but each of these makes predictions for data independently.

In general, however, there are many cases where some kind of **majority vote**, in which several people come together to reach an answer, produces better results than one person answering at their own discretion.

Ensemble learning applies exactly this way of thinking: it is a learning method that makes a final decision based on the judgment results of multiple machine learning models. The image is below.

(Figure: several models each make a judgment, and the final prediction is decided by majority vote)
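To make the idea tangible, here is a minimal sketch of a majority-vote ensemble using scikit-learn's VotingClassifier with the three models named above (the dataset and settings are arbitrary choices for illustration, not something used later in this post):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three independent models; voting="hard" (the default) takes a majority vote
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=5000)),
    ("svm", SVC()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))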

(2) Types of ensemble learning

There are two main types of ensemble learning methods, "bagging" and "boosting". Random forest makes predictions based on bagging.

◆ What is bagging?

It is a method of training multiple models in parallel on samples drawn with the **bootstrap** method. When new data comes in, the models take a majority vote for classification, or average their predictions for regression.


What is bootstrap?

A method of sampling some data from the original data by **sampling with replacement**. Because each item drawn is returned to the original data before the next draw, the same data point may be selected many times.
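A minimal sketch of bootstrap sampling with NumPy (the toy dataset is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # toy dataset: 0..9

# One bootstrap sample: same size as the original, drawn with replacement
sample = rng.choice(data, size=len(data), replace=True)
print(sample)  # duplicates are expected, and some values will be missing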

◆ What is boosting?

A method that prepares multiple models and trains them in series. Each new model is built while referring to the results of the model created before it.

AdaBoost is one example of a model based on boosting (not covered in detail this time).
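Although boosting is out of scope this time, for reference it can be tried just as easily; a minimal sketch with scikit-learn's AdaBoostClassifier (dataset and settings are arbitrary illustrations):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost trains weak learners one after another, each focusing on
# the examples the previous ones got wrong
clf = AdaBoostClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))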

(3) What is Random Forest?

A random forest is a collection of many slightly different decision trees, combined with the **bagging** form of ensemble learning.

A decision tree on its own has the drawback of easily overfitting; random forests are one way to deal with this problem.

As mentioned under bagging, several groups of data are randomly sampled from the original data, and a decision tree is built on each group, with each tree overfitting its own sample.

**The idea is that if you build many decision trees that overfit in different directions, you can reduce the degree of overfitting by averaging their results**.

Let's illustrate this idea.

STEP1: Randomly sample from the original data with the bootstrap and create N groups of data.

STEP2: Create a decision tree model for each of the N groups.

STEP3: Make a prediction with each of the N decision tree models.

STEP4: Take a majority vote of the N predictions (the average for regression) and make the final prediction.

(Figure: the original data is bootstrapped into N groups, one decision tree per group, and their votes decide the final prediction)
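To make STEP1 through STEP4 concrete, here is a minimal hand-rolled sketch of the idea (this is not how scikit-learn implements random forests internally; among other things it omits the feature sampling discussed below):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
N = 25  # number of decision trees
trees = []
for _ in range(N):
    # STEP1: bootstrap sample (with replacement) of the training data
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    # STEP2: build one decision tree per sampled group
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# STEP3: each tree makes a prediction; STEP4: majority vote over the N trees
votes = np.array([tree.predict(X_test) for tree in trees])
majority = (votes.mean(axis=0) >= 0.5).astype(int)  # labels here are 0/1
print("accuracy =", (majority == y_test).mean())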

(4) Parameters when implementing random forest with scikit-learn

The concrete implementation with scikit-learn comes in the next section, but first I will explain how to set the main parameters.

As a premise, however, random forests are known to give reasonably good accuracy without much parameter tuning (and without rescaling the data, e.g. standardization). So here I will only introduce the parameters, and in the implementation that follows I will build the model with the default settings.

[Machine Learning Starting with Python](https://www.amazon.co.jp/dp/4873117984), introduced at the beginning, lists the "important parameters to adjust" on page 87: n_estimators and max_features.

◆ n_estimators

Sets how many decision trees to prepare; it is the N of the "N groups" shown in the figure. The larger it is, the better (the image is a majority vote taken among more people), but increasing it too much costs time and memory, so in practice it comes down to striking a balance.

◆ max_features

This is the first time I mention it, but there is actually one more thing done when sampling the data in STEP1: **selecting the features**. Not all features are used for model construction; the features are also randomly assigned when building the decision tree for each group. max_features sets the number of features used in each group.

Increasing max_features makes the individual decision trees more similar to one another, while decreasing it makes them differ significantly; but if it is too small, you end up with decision trees that cannot fit the data.

For max_features, "Machine Learning Starting with Python" states that the default values are generally fine.
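For reference, a minimal sketch of setting these two parameters explicitly (the values are arbitrary illustrations; note that in scikit-learn, max_features is the number of features considered at each split of each tree):

from sklearn.ensemble import RandomForestClassifier

# n_estimators: the number of trees N; max_features: features sampled per split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1234)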

3. Implement a random forest with scikit-learn

Now let's actually implement a random forest with scikit-learn.

(1) Data set

Use Kaggle's Kickstarter Projects dataset. https://www.kaggle.com/kemical/kickstarter-projects

(2) Import what you need, read data

(i) Import

import pandas as pd  # import pandas
import datetime  # for date processing of the original data
from sklearn.model_selection import train_test_split  # for splitting the data
from sklearn.ensemble import RandomForestClassifier  # random forest

(ii) Data reading


df = pd.read_csv(r"C:~~\ks-projects-201801.csv")

(iii) A look at the data

From the following, you can see that the dataset has the shape (378661, 15).

df.shape

Let's also take a quick look at the data with .head().

df.head()

(3) Data shaping

(i) Number of campaign days

Since the focus this time is the random forest, I will omit the details; the data contains the start time and end time of each crowdfunding campaign, so we convert these into the "number of campaign days".

df['deadline'] = pd.to_datetime(df["deadline"])  # end time
df["launched"] = pd.to_datetime(df["launched"])  # start time
df["days"] = (df["deadline"] - df["launched"]).dt.days  # campaign length in days

(ii) About the objective variable

I will omit the details here as well: the objective variable "state" has categories other than success ("successful") and failure ("failed"), but this time we only use the success and failure rows.

df = df[(df["state"] == "successful") | (df["state"] == "failed")]

Then replace success with 1 and failure with 0.

df["state"] = df["state"].replace("failed",0)
df["state"] = df["state"].replace("successful",1)

(iii) Delete unnecessary columns

Before building the model, we delete ID and name, which we judge unnecessary (name arguably deserves a look, but we drop it this time), as well as the variables that only become known once the crowdfunding campaign has actually run.

df = df.drop(["ID","name","deadline","launched","backers","pledged","usd pledged","usd_pledged_real","usd_goal_real"], axis=1)

(iv) Categorical variable processing

Process the categorical variables with pd.get_dummies.

df = pd.get_dummies(df,drop_first = True)
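For illustration, a minimal sketch of what pd.get_dummies does to a categorical column (toy data, not rows from this dataset):

import pandas as pd

toy = pd.DataFrame({"currency": ["USD", "GBP", "USD", "EUR"]})
# drop_first=True drops one level (here currency_EUR) as redundant,
# leaving the binary columns currency_GBP and currency_USD
print(pd.get_dummies(toy, drop_first=True))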

(4) Finally, the main subject: data splitting and random forest

(i) Data splitting

First, split the data into training data and test data.

train_data = df.drop("state", axis=1)  # explanatory variables
y = df["state"].values  # objective variable
X = train_data.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

(ii) Random forest

clf = RandomForestClassifier(random_state=1234)  # default settings, as discussed above
clf.fit(X_train, y_train)
print("score=", clf.score(X_test, y_test))  # accuracy on the test data

Running the above should give an accuracy of about 0.638. For a basic model, that's all there is to it!

4. Conclusion

How was it? My view is that when you cannot interpret highly complex code from the start, it is very important not to worry about accuracy at first and simply to implement a basic series of steps with scikit-learn and the like.

Once you get used to that, however, I feel it is very important to understand, from the background, how these models work behind the scenes. As I learn more, I would like to update this random forest article to a deeper level.

Some parts may be hard to follow, but I hope this helps deepen your understanding.
