[PYTHON] Implementation and explanation using XGBoost for beginners

Introduction

XGBoost often appears in Kaggle. There were many parts that I couldn't understand even after reading the code, so I investigated and summarized it as a beginner. Please note that the explanation may not be strictly accurate, because I have tried to write it in easy-to-understand terms and to avoid difficult words as much as possible. Please do not hesitate to let me know if you have any additions or corrections. In this article, I will explain XGBoost while implementing it.

Contents of this article

table of contents

  1. What is XGBoost?
  2. XGBoost installation
  3. Import the dataset
  4. Data set division, format conversion
  5. Model definition and training
  6. Model evaluation
  7. Confirm the importance of features

Operating environment

Windows: 10
Anaconda
Python: 3.7.4
pandas: 0.25.1
numpy: 1.16.5
scikit-learn: 0.21.2
XGBoost: 0.90

Dataset used in this article

This time, we will use scikit-learn's breast cancer dataset (Breast cancer wisconsin [diagnostic] dataset). The dataset contains feature data about the cell nuclei of breast tumors, and we will classify whether each tumor is malignant or benign.

Note

This article does not explain the detailed parameters of XGBoost.

About the source

The source code for this article is available below. https://github.com/Bacchan0718/qiita/blob/master/xgb_breast_cancer_wisconsin.ipynb

1. What is XGBoost?

XGBoost (eXtreme Gradient Boosting) is an implementation of the gradient boosting algorithm for decision trees. A decision tree is a method that classifies a dataset using a tree-shaped model like the one in the figure below. It can be used to analyze which factors influenced the result and to make predictions on future data.

[Figure: decision tree]

The gradient boosting algorithm is easier to understand if "gradient" and "boosting" are explained separately. "Gradient" refers to reducing the prediction error by minimizing the difference between the predicted value and the actual value. Boosting is an algorithm that improves prediction accuracy by combining weak classifiers (models that make inaccurate judgments on their own) in series. (The weak classifier here is the decision tree.)

[Figure: boosting]

Reference links
XGBoost: https://logmi.jp/tech/articles/322734
XGBoost: http://kamonohashiperry.com/archives/209
Gradient method: https://to-kei.net/basic-study/neural-network/optimizer/
Loss function: https://qiita.com/mine820/items/f8a8c03ef1a7b390e372
Decision tree: https://enterprisezine.jp/iti/detail/6323
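
To make the idea of "combining weak learners in series" more concrete, here is a minimal sketch using only NumPy. It is not how XGBoost is implemented internally. It assumes a toy regression problem, a squared-error loss, and a hand-written depth-1 tree (a "stump"). The names fit_stump and predict_stump and the toy data are hypothetical and exist only for this illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residual (a depth-1 tree)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Toy regression data (hypothetical, only for illustration)
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant guess
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residual = y - prediction        # negative gradient of the squared error (up to a constant)
    stump = fit_stump(x, residual)   # weak learner fitted to the current error
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(np.mean((y - prediction) ** 2))

print(errors[0], errors[-1])  # the mean squared error before and after boosting
```

Each stump on its own is a poor model, but because every new stump is fitted to what the previous ones still get wrong, the training error shrinks round by round.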

2. XGBoost installation

(1) Open Anaconda Prompt
Open it from Start > Anaconda3 (64-bit) > Anaconda Prompt.
[Figure: XGBoost installation]

(2) Run conda install -c anaconda py-xgboost

(3) Open Terminal from Anaconda
Open it from Anaconda Navigator > click the virtual environment to install into > Open Terminal.
[Figure: XGBoost installation 2]

(4) Run conda install py-xgboost
During execution, "Proceed ([y]/n)?" is displayed. Enter "y" and press Enter.

Now you can use it by running import xgboost as xgb in Jupyter Notebook.

Note

The installation method of XGBoost differs depending on the operating environment. If your environment is different from the one in this article, you may not be able to install it using this method.

3. Import the dataset

You can load the scikit-learn dataset as follows.

xgb_breast_cancer_wisconsin.ipynb


from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

The variable cancer is of type Bunch (a subclass of dictionary). The explanatory variables are stored in data, and the objective variable is stored in target.
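
As a small sketch of what the Bunch contains, the following prints its keys and the shapes of the stored arrays. A Bunch supports both dictionary-style access (cancer['data']) and attribute access (cancer.data).

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.keys())             # names of the items stored in the Bunch
print(cancer.data.shape)         # (569, 30): 569 samples, 30 features
print(cancer.target.shape)       # (569,)
print(cancer.feature_names[:3])  # first three column names
print(cancer.target_names)       # ['malignant' 'benign']
```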

xgb_breast_cancer_wisconsin.ipynb


X = cancer.data
y = cancer.target

The objective variable is the diagnosis of whether the tumor is malignant or benign. It is 0 for malignant and 1 for benign.
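
To confirm which label means what, and how many samples each class has, a short check such as the following can be used.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Count how many samples carry each label (0 and 1)
counts = np.bincount(cancer.target)
for label, name in enumerate(cancer.target_names):
    print(label, name, counts[label])
# 0 malignant 212
# 1 benign 357
```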

The summary is stored in DESCR.

xgb_breast_cancer_wisconsin.ipynb


print(cancer.DESCR)

A detailed description of the dataset can be found below. https://ensekitt.hatenablog.com/entry/2018/08/22/200000

4. Data set division, format conversion

Split the data into training data and test data.

xgb_breast_cancer_wisconsin.ipynb


import numpy as np
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
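
Because the split above is random, the results change on every run. The following is a sketch of a reproducible variant. The seed value 0 and the use of stratify are my assumptions and are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

X_train, X_test, Y_train, Y_test = train_test_split(
    cancer.data, cancer.target,
    test_size=0.2,           # 20% of the data goes to the test set
    random_state=0,          # arbitrary seed (assumption) so the split is repeatable
    stratify=cancer.target,  # keep the malignant/benign ratio similar in both sets
)
print(X_train.shape, X_test.shape)  # (455, 30) (114, 30)
```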

After splitting, convert the data into DMatrix, the dataset format handled by XGBoost. Pass the column names as feature_names so that they can be used when visualizing the features.

xgb_breast_cancer_wisconsin.ipynb


import xgboost as xgb
# Pass the column names as a plain list of strings
feature_names = list(cancer.feature_names)
xgb_train = xgb.DMatrix(X_train, label=Y_train, feature_names=feature_names)
xgb_test = xgb.DMatrix(X_test, label=Y_test, feature_names=feature_names)

5. Model definition and training

Define the parameters. This time, without tuning the detailed parameters, we only set objective, one of the learning task parameters, to "binary:logistic". With this objective, the model outputs a probability for binary classification.

xgb_breast_cancer_wisconsin.ipynb


param = {
    #Binary classification problem
    'objective': 'binary:logistic',  
} 
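
As a rough sketch of what "binary:logistic" means: the model produces a raw score for each sample, and the logistic (sigmoid) function turns that score into a probability between 0 and 1. The function name sigmoid and the raw scores below are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(raw_score):
    """Map a raw score (any real number) to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-raw_score))

raw_scores = np.array([-4.0, 0.0, 4.0])  # hypothetical raw scores
probabilities = sigmoid(raw_scores)
print(probabilities)  # a raw score of 0 maps to exactly 0.5
```

A large negative score gives a probability close to 0, and a large positive score gives a probability close to 1.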

Train the model.

xgb_breast_cancer_wisconsin.ipynb


model = xgb.train(param, xgb_train)

Reference link: https://blog.amedama.jp/entry/2019/01/29/235642

6. Model evaluation

Using the trained model, calculate the predicted probability for each sample in the test data. With binary:logistic, predict returns the probability that a sample belongs to class 1 (benign).

xgb_breast_cancer_wisconsin.ipynb


y_pred_proba = model.predict(xgb_test)

Check the contents of y_pred_proba.

xgb_breast_cancer_wisconsin.ipynb


print(y_pred_proba)

The contents are as follows.

[0.974865   0.974865   0.974865   0.974865   0.02652072 0.02652072
 0.02652072 0.93469375 0.15752992 0.9459383  0.05494327 0.974865
 0.974865   0.793823   0.95098037 0.974865   0.93770874 0.02652072
 0.92342764 0.96573967 0.92566985 0.95829874 0.9485401  0.974865
 0.96885294 0.974865   0.9670915  0.9495995  0.9719596  0.9671308
 0.974865   0.974865   0.9671308  0.974865   0.974865   0.974865
 0.96525717 0.9248287  0.4881295  0.974865   0.9670915  0.02652072
 0.974865   0.04612969 0.9459383  0.7825349  0.974865   0.02652072
 0.04585124 0.974865   0.1232813  0.974865   0.974865   0.3750245
 0.9522517  0.974865   0.05884887 0.02652072 0.02652072 0.02652072
 0.974865   0.94800293 0.9533147  0.974865   0.9177746  0.9665209
 0.9459383  0.02652072 0.974865   0.974865   0.974865   0.974865
 0.6874632  0.72485    0.31191444 0.02912194 0.96525717 0.09619693
 0.02652072 0.9719596  0.9346858  0.02652072 0.974865   0.02652072
 0.0688739  0.974865   0.64381874 0.97141886 0.974865   0.974865
 0.974865   0.1619863  0.974865   0.02652072 0.02652072 0.974865
 0.9670915  0.45661741 0.02652072 0.02652072 0.974865   0.03072577
 0.9670915  0.974865   0.9142289  0.7509865  0.9670915  0.02652072
 0.02652072 0.9670915  0.02652072 0.78484446 0.974865   0.974865  ]

The objective variable of the dataset is binary, so the predictions must be converted to 0 or 1. Using 0.5 as the threshold (reference value), convert values greater than 0.5 to 1 and all other values to 0.

xgb_breast_cancer_wisconsin.ipynb


y_pred = np.where(y_pred_proba > 0.5, 1, 0)
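
As a small worked example of the conversion, here are a few probabilities taken from the output above, plus a hypothetical value of exactly 0.5 to show how the boundary is handled by the > comparison.

```python
import numpy as np

# Three values from the output above, one hypothetical boundary value (0.5), one more from the output
y_pred_proba = np.array([0.974865, 0.02652072, 0.4881295, 0.5, 0.793823])

# Values greater than 0.5 become 1, everything else (including exactly 0.5) becomes 0
y_pred = np.where(y_pred_proba > 0.5, 1, 0)
print(y_pred)  # [1 0 0 0 1]
```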

Evaluate the model. This time, we use accuracy (the fraction of predictions that are correct) as the metric.

xgb_breast_cancer_wisconsin.ipynb


from sklearn.metrics import accuracy_score
acc = accuracy_score(Y_test, y_pred)
print(acc)

The accuracy is 0.9912280701754386.
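
Accuracy is simply the number of correct predictions divided by the number of samples. The test data has 114 samples, so the value above corresponds to 113 correct predictions and 1 mistake. The following sketch checks this arithmetic and computes accuracy by hand on a small set of hypothetical labels.

```python
import numpy as np

# 113 correct predictions out of 114 test samples
print(113 / 114)  # about 0.9912

# Accuracy by hand on hypothetical labels: 4 of the 5 predictions match
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0])
manual_accuracy = np.mean(y_true == y_hat)
print(manual_accuracy)  # 0.8
```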

7. Confirm the importance of features

A feature is a measurable characteristic used as input for learning. Plot a graph to check which features are strongly related to the objective variable. The graph can be saved as a PNG with fig.savefig.
Reference link: https://qiita.com/daichildren98/items/ebabef57bc19d5624682

xgb_breast_cancer_wisconsin.ipynb


import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(figsize=(8,15))
xgb.plot_importance(model, ax=ax1)
# Save the figure before displaying it
fig.savefig("FeatureImportance.png")
plt.show()

[Figure: feature importance plot (FeatureImportance.png)]

Looking at the graph, worst texture has the highest importance score, so it appears to be the feature most strongly related to the prediction.

In conclusion

XGBoost often appears in Kaggle, but when I first saw it, I didn't understand it at all, so I looked it up and summarized it. I plan to investigate and summarize the parameters in a future article.
