Until recently I didn't know about CatBoost, a GBDT (Gradient Boosting Decision Trees) library often mentioned alongside XGBoost and LightGBM, so I tried running it on a regression task.
What is CatBoost?
Here is the introductory text from the official website (machine translated):
CatBoost is an algorithm for gradient boosting on decision trees.
It is developed by Yandex researchers and engineers.
It is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks at Yandex and at other companies, including CERN, Cloudflare, and Careem taxi.
- Regression: predict fuel efficiency
  - Auto MPG dataset
    - This is the dataset used in the TensorFlow Tutorials here.
    - Predict the fuel efficiency of a car. Explanatory variables include number of cylinders, displacement, horsepower, weight, and so on.
The following code was run on Google Colab (CPU).
!pip install catboost -U
import urllib.request
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
file_path = './auto-mpg.data'
urllib.request.urlretrieve(url, file_path)
import pandas as pd
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(file_path, names=column_names,
                      na_values="?", comment='\t',
                      sep=" ", skipinitialspace=True)
# The goal this time is just to get the code running, so simply drop rows with NaN
dataset = dataset.dropna().reset_index(drop=True)
# The categorical variable Origin will be handled by CatBoost itself, so cast it to string
dataset['Origin'] = dataset['Origin'].astype(str)
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
import numpy as np
from catboost import CatBoostRegressor, FeaturesData, Pool
def split_features(df):
    # Split the columns into categorical and numerical feature frames
    cfc = []  # categorical feature columns
    nfc = []  # numerical feature columns
    for column in df:
        if column == 'Origin':
            cfc.append(column)
        else:
            nfc.append(column)
    return df[cfc], df[nfc]
cat_train, num_train = split_features(train_dataset)
cat_test, num_test = split_features(test_dataset)
train_pool = Pool(
    data=FeaturesData(num_feature_data=np.array(num_train.values, dtype=np.float32),
                      cat_feature_data=np.array(cat_train.values, dtype=object),
                      num_feature_names=list(num_train.columns.values),
                      cat_feature_names=list(cat_train.columns.values)),
    label=np.array(train_labels, dtype=np.float32)
)
test_pool = Pool(
    data=FeaturesData(num_feature_data=np.array(num_test.values, dtype=np.float32),
                      cat_feature_data=np.array(cat_test.values, dtype=object),
                      num_feature_names=list(num_test.columns.values),
                      cat_feature_names=list(cat_test.columns.values))
)
model = CatBoostRegressor(iterations=2000, learning_rate=0.05, depth=5)
model.fit(train_pool)
The parameters above are taken as-is from the reference article.
Incidentally, training finished with total: 4.3 s.
import matplotlib.pyplot as plt
preds = model.predict(test_pool)
xs = list(range(len(test_labels)))
plt.plot(xs, test_labels.values, color = 'r')
plt.plot(xs, preds, color = 'k');
plt.legend(['Target', 'Prediction'], loc = 'upper left');
plt.show()
The result of plotting is as follows.
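To back up the visual comparison with numbers, here is a minimal sketch that scores the predictions against the held-out labels (assuming scikit-learn, which Colab provides by default):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Compare the held-out predictions with the true MPG values
rmse = mean_squared_error(test_labels, preds) ** 0.5  # root mean squared error
mae = mean_absolute_error(test_labels, preds)         # mean absolute error
print(f'RMSE: {rmse:.2f} MPG, MAE: {mae:.2f} MPG')
```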
- This time I mostly just ran the reference article as-is, but I'm glad I got a rough understanding of how to use CatBoost for regression.
- As also noted in the Kaggle kernel I referenced, it seems better to use BayesSearchCV for hyperparameter tuning, so I'll try that next (a rough sketch follows below). (This [material](https://colab.research.google.com/github/lmassaron/kaggledays-2019-gbdt/blob/master/Kaggle%20Days%20Paris%20-%20%20GBDT%20workshop.ipynb#scrollTo=WvJdFN3xbyIz) looked helpful.)
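For reference, here is a rough, untested sketch of what that tuning might look like with scikit-optimize's BayesSearchCV wrapping CatBoostRegressor's scikit-learn interface. The search ranges are arbitrary placeholders of my own, not values from the reference article:

```python
!pip install scikit-optimize

from skopt import BayesSearchCV
from skopt.space import Integer, Real
from catboost import CatBoostRegressor

# Let CatBoost handle the categorical column by name instead of using FeaturesData
search = BayesSearchCV(
    estimator=CatBoostRegressor(cat_features=['Origin'], verbose=False),
    search_spaces={
        'iterations': Integer(500, 3000),
        'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
        'depth': Integer(3, 8),
    },
    n_iter=20,   # number of parameter settings sampled
    cv=3,        # 3-fold cross-validation
    random_state=0,
)
search.fit(train_dataset, train_labels)
print(search.best_params_)
```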