Until recently I didn't know about CatBoost, a GBDT (Gradient Boosting Decision Trees) library often mentioned alongside XGBoost and LightGBM, so I tried running it on a regression task.
What is CatBoost?
Here is the introductory text from the official website (machine translated):
CatBoost is an algorithm for gradient boosting on decision trees.
It is developed by Yandex researchers and engineers.
It is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks at Yandex and at other companies, including CERN, Cloudflare, and Careem taxi.
- Regression: predict fuel efficiency
  - Auto MPG dataset
    - This is the dataset used in the TensorFlow Tutorials here.
    - Predict the fuel efficiency of a car. Explanatory variables include number of cylinders, displacement, horsepower, weight, and so on.
The following code was run on Google Colab (CPU).
!pip install catboost -U
import urllib.request
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
file_path = './auto-mpg.data'
urllib.request.urlretrieve(url, file_path)
import pandas as pd
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(file_path, names=column_names,
                      na_values="?", comment='\t',
                      sep=" ", skipinitialspace=True)
# The goal this time is just to get the code running, so simply drop rows with NaN
dataset = dataset.dropna().reset_index(drop=True)
# The categorical variable Origin will be handled by CatBoost itself, so cast it to string
dataset['Origin'] = dataset['Origin'].astype(str)
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
import numpy as np
from catboost import CatBoostRegressor, FeaturesData, Pool
def split_features(df):
    # Split the columns into categorical and numerical feature frames
    cfc = []  # categorical feature columns
    nfc = []  # numerical feature columns
    for column in df:
        if column == 'Origin':
            cfc.append(column)
        else:
            nfc.append(column)
    return df[cfc], df[nfc]
cat_train, num_train = split_features(train_dataset)
cat_test, num_test = split_features(test_dataset)
train_pool = Pool(
    data=FeaturesData(num_feature_data=np.array(num_train.values, dtype=np.float32),
                      cat_feature_data=np.array(cat_train.values, dtype=object),
                      num_feature_names=list(num_train.columns.values),
                      cat_feature_names=list(cat_train.columns.values)),
    label=np.array(train_labels, dtype=np.float32)
)
test_pool = Pool(
    data=FeaturesData(num_feature_data=np.array(num_test.values, dtype=np.float32),
                      cat_feature_data=np.array(cat_test.values, dtype=object),
                      num_feature_names=list(num_test.columns.values),
                      cat_feature_names=list(cat_test.columns.values))
)
model = CatBoostRegressor(iterations=2000, learning_rate=0.05, depth=5)
model.fit(train_pool)
The parameters above are taken as-is from the reference article.
Incidentally, training finished with total: 4.3 s.
import matplotlib.pyplot as plt
preds = model.predict(test_pool)
xs = list(range(len(test_labels)))
plt.plot(xs, test_labels.values, color = 'r')
plt.plot(xs, preds, color = 'k');
plt.legend(['Target', 'Prediction'], loc = 'upper left');
plt.show()
The result of plotting is as follows.
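To back up the visual comparison with numbers, here is a minimal sketch that scores the predictions against the held-out labels (assuming scikit-learn, which Colab provides by default):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Compare the held-out predictions with the true MPG values
rmse = mean_squared_error(test_labels, preds) ** 0.5  # root mean squared error
mae = mean_absolute_error(test_labels, preds)         # mean absolute error
print(f'RMSE: {rmse:.2f} MPG, MAE: {mae:.2f} MPG')
```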
- This time I mostly just ran the reference article as-is, but I'm glad I got a rough understanding of how to use CatBoost for regression.
- As also noted in the Kaggle kernel I referenced, it seems better to use BayesSearchCV for hyperparameter tuning, so I'll try that next (a rough sketch follows below). (This [material](https://colab.research.google.com/github/lmassaron/kaggledays-2019-gbdt/blob/master/Kaggle%20Days%20Paris%20-%20%20GBDT%20workshop.ipynb#scrollTo=WvJdFN3xbyIz) looked helpful.)
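For reference, here is a rough, untested sketch of what that tuning might look like with scikit-optimize's BayesSearchCV wrapping CatBoostRegressor's scikit-learn interface. The search ranges are arbitrary placeholders of my own, not values from the reference article:

```python
!pip install scikit-optimize

from skopt import BayesSearchCV
from skopt.space import Integer, Real
from catboost import CatBoostRegressor

# Let CatBoost handle the categorical column by name instead of using FeaturesData
search = BayesSearchCV(
    estimator=CatBoostRegressor(cat_features=['Origin'], verbose=False),
    search_spaces={
        'iterations': Integer(500, 3000),
        'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
        'depth': Integer(3, 8),
    },
    n_iter=20,   # number of parameter settings sampled
    cv=3,        # 3-fold cross-validation
    random_state=0,
)
search.fit(train_dataset, train_labels)
print(search.best_params_)
```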