[PYTHON] GBDT library: I tried fuel consumption prediction (regression) with CatBoost

Until recently I had not heard of CatBoost, a GBDT (Gradient Boosting Decision Trees) library that is sometimes used alongside XGBoost and LightGBM, so I tried running it on a regression task.

What is CatBoost?

Here is the introductory text from the official website (translated with Google Translate):

CatBoost is an algorithm for gradient boosting on decision trees.
It is developed by Yandex researchers and engineers.
It is used for search, recommendation systems, personal assistants, self-driving cars, weather forecasting, and many other tasks at Yandex and at other organizations, including CERN, Cloudflare, and Careem taxi.

Reference article

- Regression: Predict fuel economy

Dataset used

- Auto MPG dataset
  - This is the dataset used in the TensorFlow Tutorials.
  - The task is to predict the fuel efficiency of a car. Explanatory variables include the number of cylinders, displacement, horsepower, weight, and so on.

Contents

The following code was run on Google Colab (CPU).

Install CatBoost with pip

!pip install catboost -U

Download dataset

import urllib.request

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
file_path = './auto-mpg.data'
urllib.request.urlretrieve(url, file_path)

Data preprocessing

import pandas as pd

column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                   'Acceleration', 'Model Year', 'Origin'] 
dataset = pd.read_csv(file_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

# The goal this time is just to get the model running, so rows containing NaN are simply dropped
dataset = dataset.dropna().reset_index(drop=True)

# Origin is a categorical variable that CatBoost will handle, so convert it to a string type
dataset['Origin'] = dataset['Origin'].astype(str)

train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
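The split above relies on sample() picking 80% of the rows and drop() removing exactly those rows by index, which is why the index is reset after dropna(). As a minimal sketch on a made-up five-row table (the toy_ names and values are hypothetical, not part of the Auto MPG data), the same pattern can be checked like this:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "MPG": [18.0, 15.0, 36.0, 26.0, 31.0],
    "Weight": [3504.0, 3693.0, 1800.0, 2500.0, 2100.0],
})

# 80% of the rows go to training; the fixed random_state makes the split reproducible
toy_train = toy.sample(frac=0.8, random_state=0)
# Every row whose index was not sampled goes to the test set
toy_test = toy.drop(toy_train.index)

# pop() removes the target column from the features and returns it
toy_train_labels = toy_train.pop("MPG")
toy_test_labels = toy_test.pop("MPG")

print(len(toy_train), len(toy_test))
print(list(toy_train.columns))
```

Because drop() works on index labels, the two parts only form a clean partition when the index values are unique.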

Prepare a dataset for use by CatBoost

import numpy as np
from catboost import CatBoostRegressor, FeaturesData, Pool

def split_features(df):
    cfc = []
    nfc = []
    for column in df:
        if column == 'Origin':
            cfc.append(column)
        else:
            nfc.append(column)
    return df[cfc], df[nfc]

cat_train, num_train = split_features(train_dataset)
cat_test, num_test = split_features(test_dataset)

train_pool = Pool(
    data = FeaturesData(num_feature_data = np.array(num_train.values, dtype=np.float32), 
                    cat_feature_data = np.array(cat_train.values, dtype=object), 
                    num_feature_names = list(num_train.columns.values), 
                    cat_feature_names = list(cat_train.columns.values)),
    label =  np.array(train_labels, dtype=np.float32)
)

test_pool = Pool(
    data = FeaturesData(num_feature_data = np.array(num_test.values, dtype=np.float32), 
                    cat_feature_data = np.array(cat_test.values, dtype=object), 
                    num_feature_names = list(num_test.columns.values), 
                    cat_feature_names = list(cat_test.columns.values))
)

Training

model = CatBoostRegressor(iterations=2000, learning_rate=0.05, depth=5)
model.fit(train_pool)

The parameters above are taken unchanged from the reference article. Incidentally, training finished with total: 4.3 s reported in the log.

Inference / result plot

import matplotlib.pyplot as plt

preds = model.predict(test_pool)

xs = list(range(len(test_labels)))
plt.plot(xs, test_labels.values, color = 'r')
plt.plot(xs, preds, color = 'k')
plt.legend(['Target', 'Prediction'], loc = 'upper left')
plt.show()

The resulting plot: catboost_result.png (red: target, black: prediction)
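A plot gives a visual impression, but a numeric error is easier to compare between runs. The following is a minimal sketch of computing MAE and RMSE with NumPy. The two arrays are hypothetical values, not results from this article; in the code above they would correspond to test_labels.values and preds.

```python
import numpy as np

# Hypothetical targets and predictions, used only to illustrate the calculation
y_true = np.array([30.0, 14.0, 20.0, 33.0])
y_pred = np.array([28.0, 15.0, 22.0, 30.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))         # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
```

Both values are in the same unit as the target, so here they would read as errors in miles per gallon.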

Impressions

- This time I just ran the reference article almost as it is, but I'm glad I now understand the rough usage for regression.
- As is also noted in the comments of the Kaggle Kernel I referred to, BayesSearchCV seems to be a better choice for hyperparameter tuning, so I will try it next. ([This material](https://colab.research.google.com/github/lmassaron/kaggledays-2019-gbdt/blob/master/Kaggle%20Days%20Paris%20-%20%20GBDT%20workshop.ipynb#scrollTo=WvJdFN3xbyIz) looked helpful.)
