[PYTHON] [Kaggle] Try using XGBoost

1. Purpose

Gradient boosting methods such as XGBoost and LightGBM are often used in competitions like Kaggle. However, I felt there were few articles and sites to use as references for them, and I had a lot of trouble when implementing them myself, so this time I would like to describe what I tried with XGBoost and what each parameter means.

2. Benefits of gradient boosting

- No need to impute missing values (XGBoost handles them natively; see the sketch after this list).

- There is no problem even with redundant features (explanatory variables with high correlation can be used as they are).

- The difference from Random Forest is that the trees are built in series, with each tree correcting the errors of the ones before it.

Because of these features, gradient boosting is often used.
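
To illustrate the first point, here is a minimal sketch with a made-up toy matrix, showing that an array containing NaN can be passed to XGBoost as-is:

import numpy as np
import xgboost as xgb

# Toy data: both features contain missing values
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0]])
y = np.array([10.0, 20.0, 30.0])

# DMatrix treats NaN as "missing" by default, so no imputation is needed
dtoy = xgb.DMatrix(X, label=y)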

3. Try using XGBoost

This time, I will implement it using Kaggle's House Prices competition.

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

(1) Preprocessing

(i) Imports

import numpy as np
import pandas as pd

# For splitting the data
from sklearn.model_selection import train_test_split

# XGBoost
import xgboost as xgb

(ii) Reading and combining the data

# Read the data
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

# Combine the data (flag which rows came from train.csv)
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False

df_all = pd.concat([df_train, df_test])  # DataFrame.append is deprecated in recent pandas
df_all.index = df_all["Id"]
df_all.drop("Id", axis = 1, inplace = True)

(iii) Dummy variables

# One-hot encode the categorical columns (drop_first avoids perfectly collinear dummies)
df_all = pd.get_dummies(df_all, drop_first=True)
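
As a quick illustration of what this line does, here is a minimal sketch with a made-up one-column frame (Street is an actual column in this dataset):

import pandas as pd

toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
print(pd.get_dummies(toy, drop_first=True))
#    Street_Pave
# 0            1
# 1            0
# 2            1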

(iv) Splitting the data

# Split df_all back into training data and test data
df_train = df_all[df_all["TrainFlag"] == True]
df_train = df_train.drop(["TrainFlag"], axis = 1)

df_test = df_all[df_all["TrainFlag"] == False]
df_test = df_test.drop(["TrainFlag"], axis = 1)
df_test = df_test.drop(["SalePrice"], axis = 1)
# Split into features/target and hold out 30% for validation
y = df_train["SalePrice"].values
X = df_train.drop("SalePrice", axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)

(2) Try using XGBoost

(i) Creating the XGBoost data

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_test, label=y_test)
dtest = xgb.DMatrix(df_test.values)

To use xgboost, the data must first be converted into its own format with xgb.DMatrix.

※ Important point ※

The third line passes df_test.values. Because df_test is a DataFrame, it must be converted to a NumPy array with .values; otherwise an error occurs later during model training.
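
A minimal sketch of the conversion in question:

# .values turns the DataFrame into a plain NumPy array
print(type(df_test))         # <class 'pandas.core.frame.DataFrame'>
print(type(df_test.values))  # <class 'numpy.ndarray'>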

(ii) Setting the parameters

params = {
        'objective': 'reg:squarederror',
        'silent': 1,
        'random_state': 1234,
        # Evaluation metric for learning (RMSE)
        'eval_metric': 'rmse',
    }
num_round = 500
# dtrain is the training data; dvalid is the test data used for evaluation
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]

* Explanation of parameters *

- objective: Specifies the loss function to be minimized. The default is reg:linear.
- silent: Specifies how the execution log is recorded. The default is 0 (the log is output).
- eval_metric: The evaluation metric for the data. There are options such as rmse and logloss.
- num_round: The maximum number of learning rounds.
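
For reference, a hypothetical sketch of how these parameters would change for a binary classification task (not needed for House Prices, which is a regression problem):

# Hypothetical classification setup, for contrast with the regression params above
params_clf = {
    'objective': 'binary:logistic',  # loss for 0/1 targets
    'eval_metric': 'logloss',        # the logloss metric mentioned above
    'random_state': 1234,
}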

(iii) Model training

model = xgb.train(params,
                    dtrain,  # training data
                    num_round,  # maximum number of learning rounds
                    early_stopping_rounds=20,
                    evals=watchlist,
                    )

- early_stopping_rounds: Means that training is stopped if the accuracy does not improve for 20 consecutive rounds. The maximum number of rounds is set in num_round, but if the accuracy stops improving for the number of rounds set in early_stopping_rounds, training is cut short.
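
When training stops early, the model remembers which round was best. A minimal sketch of how to inspect this after training (these attributes are set by xgb.train when early_stopping_rounds is used):

# Attributes recorded by early stopping
print(model.best_score)      # best RMSE observed on dvalid
print(model.best_iteration)  # the round that achieved it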

(iv) Prediction

# Predict
prediction_XG = model.predict(dtest, ntree_limit = model.best_ntree_limit)

# Round the decimals
prediction_XG = np.round(prediction_XG)

・ About ntree_limit: Passing ntree_limit = model.best_ntree_limit makes the prediction use only the trees up to the round with the best accuracy.
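
A minimal sketch of the difference (under the assumption that training above stopped early):

# Without ntree_limit, every tree built before stopping is used,
# including the rounds added after the best validation score
pred_all = model.predict(dtest)
# With ntree_limit, only the trees up to the best round are used
pred_best = model.predict(dtest, ntree_limit=model.best_ntree_limit)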

(v) Creating the submission file

submission = pd.DataFrame({"Id": df_test.index, "SalePrice": prediction_XG})
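
To actually submit, the DataFrame still has to be written out. A minimal sketch (the file name is arbitrary; Kaggle only cares about the columns):

# Write the submission file without the index column
submission.to_csv("submission.csv", index=False)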

That is all!

4. Conclusion

What did you think? You can implement this just by copying the code, but I feel it is very important to understand, at least roughly, what each line of code means.

I hope it will help you to deepen your understanding.
