[PYTHON] Read kaggle Courses --- Intermediate Machine Learning 6

I started Twitter

Hello! My name is Nike! I finally took the plunge and started Twitter! I can't tell you anything useful yet, but I hope to absorb the knowledge and experience of my seniors!

Well then, thank you again for reading this time!


Intermediate Machine Learning digs deeper into machine learning

~ Flow of Intermediate Machine Learning ~

  1. Introduction
  2. Missing values
  3. Categorical variables
  4. Pipelines
  5. Cross-validation
  6. XGBoost
  7. Data leakage

This time we cover the contents of part 6!

XGBoost stands for eXtreme Gradient Boosting. It is an implementation of gradient boosting that adds extra accuracy and speed. Scikit-learn has its own gradient boosting implementation, but XGBoost seems to have some technical advantages. Let's dig a little deeper into XGBoost.

Python: Try using XGBoost (reference site)
What is a gradient boosting decision tree? (reference site)

Ensemble method

I've already learned about a technique called Random Forest, which is categorized as an "ensemble method". An "ensemble method" combines the predictions of multiple models. Random Forest is an ensemble method because it integrates the predictions of multiple decision trees. (* There is also the term "ensemble learning", but I couldn't find a clear difference. *)

There are three types of ensemble learning:

  - Bagging
  - Boosting
  - Stacking

Do all advanced machine learning users use it?! An explanation of the mechanism of ensemble learning and its three types (reference site)

And gradient boosting is also a form of ensemble learning.
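As a small side experiment of my own (not part of the course), scikit-learn has ready-made estimators for all three types, so you can get a feel for them in a few lines. The data and estimator choices below are just assumptions for illustration.

# A minimal sketch of the three types of ensemble learning in scikit-learn
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Bagging: train many models on bootstrap samples of the data and average their predictions
bagging = BaggingRegressor(n_estimators=50, random_state=0)

# Boosting: add models one by one, each trying to correct the current ensemble's errors
boosting = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Stacking: let a meta-model learn how to combine the predictions of several base models
stacking = StackingRegressor(
    estimators=[('tree', DecisionTreeRegressor()), ('ridge', Ridge())],
    final_estimator=Ridge())

for name, model in [('bagging', bagging), ('boosting', boosting), ('stacking', stacking)]:
    print(name, model.fit(X_demo, y_demo).score(X_demo, y_demo))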

What is gradient boosting?

First, boosting is a method of building up an entire ensemble by repeatedly adding models according to a fixed procedure. At first there is only a single, immature model, but it is optimized more and more by the models added afterwards.

And gradient boosting is a technique that "reframes this as a problem of minimizing a loss function, and uses gradient information to find the direction that reduces the loss." About Gradient Boosting --Preparation-- (reference site)

(Figure: the gradient boosting cycle. Quoted from Kaggle.)

First, the ensemble is initialized with a single model. Then we enter the cycle below.

  1. Make predictions with every model in the existing ensemble and combine them into a single prediction
  2. Evaluate that prediction with a loss function (see here)
  3. Fit a new model to be added to the ensemble. Specifically, its parameters are adjusted so that adding this model reduces the ensemble's loss. (The "gradient" in "gradient boosting" comes from using gradient descent on the loss to determine how to fit the new model.)
  4. Add the new model to the ensemble
  5. Repeat ...
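To make the cycle more concrete, here is a minimal from-scratch sketch of my own (not the course's code) for the squared-error case. With squared error, the negative gradient of the loss is just the residual, so each new tree is simply fit to the residuals of the current ensemble.

# A minimal sketch of the boosting cycle (squared error: each new tree fits the residuals)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1):
    # Initialize the ensemble with a constant prediction (the mean of y)
    base_pred = float(np.mean(y))
    current_pred = np.full(len(y), base_pred)
    trees = []
    for _ in range(n_estimators):
        # 1.-2. Evaluate the current ensemble; for squared error the
        #       negative gradient of the loss is simply the residual
        residuals = y - current_pred
        # 3. Fit a new model so that adding it reduces the ensemble's loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        # 4. Add the new model to the ensemble, scaled by the learning rate
        trees.append(tree)
        current_pred = current_pred + learning_rate * tree.predict(X)
    return base_pred, trees

def gradient_boost_predict(X, base_pred, trees, learning_rate=0.1):
    # Sum the contributions of every model in the ensemble
    pred = np.full(len(X), base_pred)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred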

Write code

The data used is here, the same as before. The exercise is divided into three parts.

  1. Creating an initial model
  2. Improve model performance
  3. Try breaking the model (← !?)

First of all, preparation.

import pandas as pd
from sklearn.model_selection import train_test_split

# Data reading
X = pd.read_csv('train.csv', index_col='Id')
X_test_full = pd.read_csv('test.csv', index_col='Id')

# Exclude rows where the objective variable is missing and separate the objective variable from the data
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice              
X.drop(['SalePrice'], axis=1, inplace=True)

# Separate verification data and learning data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" represents the number of unique values ​​in the column
# Extract columns of category data with low cardinality
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Extract numerical data
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Combine the extracted numbers and columns of category data
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# Perform One-Hot encoding (pandas allows you to write shorter code than before)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)
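One note of my own (not from the course): align with join='left' keeps only the columns that exist in the training data, and any dummy column the validation or test data lacks is filled with NaN. XGBoost can handle NaN inputs natively, so this works as-is, but a missing dummy column really means "this category never appears", so you can also fill those columns with 0 explicitly:

# Optional: fill the columns newly created by align with 0 instead of NaN
X_train, X_valid = X_train.align(X_valid, join='left', axis=1, fill_value=0)
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)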

1. Creating an initial model

Here we're doing the same thing we did with Random Forest. Define the model → Fit the model → Predict → Validate. That's the flow.

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Define model
my_model_1 = XGBRegressor(random_state=0)

# Fit model
my_model_1.fit(X_train, y_train)

# Predict
predictions_1 = my_model_1.predict(X_valid)

# Verification: Calculate MAE
mae_1 = mean_absolute_error(predictions_1, y_valid)
print("Mean Absolute Error:" , mae_1)

Execution result

Mean Absolute Error: 17662.736729452055

2. Improve model performance

Here is the real thrill of XGBoost. XGBoost has a terrifying number of parameters, and tuning them improves performance. (* Of course, other models can also be improved by tuning their parameters. *) Here are the typical parameters for improving the model's performance.

  - n_estimators ... The number of times the cycle above is repeated. Too small and the model underfits; too large and it overfits. Values around 100-1000 are typical, but the right value depends heavily on the learning_rate described below.
  - learning_rate ... Determines the weight given to each added model during learning. The default is 0.1.
  - eval_metric ... Determines the loss function used for evaluation.
  - early_stopping_rounds ... Automatically derives the appropriate n_estimators. For example, if you pass 5, the cycle stops once the validation score has deteriorated for 5 rounds in a row. Therefore, set n_estimators to a large value. If you set this parameter, you must also pass validation data via eval_set.
  - eval_set ... Used to pass the validation data.
  - n_jobs ... Determines the number of parallel processes. Set it to the number of cores on your machine. It doesn't improve the score, but on very large datasets it reduces the execution time.
  - verbose ... If set to False, the intermediate training progress is not displayed. If you pass a number, the progress is printed every that many rounds.

# Define model
my_model_2 = XGBRegressor(n_estimators=1000,
                          learning_rate=0.05,
                          eval_metric='mae')

# Fit model
my_model_2.fit(X_train, y_train, 
               early_stopping_rounds=5, 
               eval_set=[(X_valid, y_valid)], 
               verbose=1)


# Predict
predictions_2 = my_model_2.predict(X_valid)

# Calculate MAE
mae_2 = mean_absolute_error(predictions_2, y_valid)
print("Mean Absolute Error:" , mae_2)

Execution result. If verbose is set to False, the progress below is not displayed. You can see that the MAE decreases steadily as training proceeds.

[0]     validation_0-mae:172457.42188
[1]     validation_0-mae:163972.64062
[2]     validation_0-mae:155982.82812
......
[154]   validation_0-mae:16951.49609
[155]   validation_0-mae:16948.06641   # ←← This is the minimum value, but it is slightly different from the value shown in the final result ... most likely just a display/precision difference between XGBoost's log and the sklearn calculation
[156]   validation_0-mae:16954.53516
[157]   validation_0-mae:16962.16211
[158]   validation_0-mae:16956.42383
[159]   validation_0-mae:16956.51172
[160]   validation_0-mae:16952.38086
Mean Absolute Error: 16948.067128638697
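As a side note (my own addition, not from the course): when early stopping is used, the fitted model keeps track of the best round itself, so you don't have to read it off the log. The attribute names below are from the xgboost scikit-learn wrapper; their availability can vary a little between xgboost versions.

# Check the round chosen by early stopping (attributes of the xgboost sklearn wrapper;
# availability may vary slightly between xgboost versions)
print(my_model_2.best_iteration)  # e.g. 155, the round with the lowest validation MAE
print(my_model_2.best_score)      # the validation MAE at that round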

3. Try breaking the model

Let's set n_estimators = 1. I'm looking forward to the result.

# Define the model
my_model_3 = XGBRegressor(n_estimators=1,
                          learning_rate=0.05,
                          eval_metric='mae')

# Fit the model
my_model_3.fit(X_train, y_train)

# Get predictions
predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE
mae_3 = mean_absolute_error(predictions_3, y_valid)
print("Mean Absolute Error:" , mae_3)

Execution result

Mean Absolute Error: 172457.41701141777

This is essentially the same as the validation MAE at round [0] in the log above: with n_estimators=1 the ensemble consists of only the very first model, so it has barely learned anything and the error explodes.

For a beginner like me, this one was quite difficult. With ensemble methods and boosting, the amount of new jargon suddenly increased ... I was really helped by the owners of the various reference sites. Truly, thank you very much.

Next time, Intermediate Machine Learning will be complete! With that, I can finally say I'm properly studying machine learning. I'll finish it by the end of the year!

Thank you for reading until the end!
