[Python] I don't want to search for hyperparameters because my IQ is 1 (how to use lightgbm_tuner)

This article was posted in the second IQ1 Advent Calendar 2019.

IQ1 machine learning

Data preprocessing and hyperparameter search are unavoidable in machine learning. But with an IQ of 1, I'd rather skip both if I could. As for preprocessing, today's popular gradient boosting tree algorithms (lightgbm, etc.) can handle missing values as-is and don't require categorical variables to be preprocessed, so the world has become relatively IQ1-friendly. Hyperparameter search, on the other hand, is too hard for IQ1: you have to know what each model's hyperparameters are and what range to search, which is very difficult.

lightgbm_tuner

Recently, optuna released a module called `optuna.integration.lightgbm_tuner` that automates lightgbm's hyperparameter search. This module is IQ1-friendly for a variety of reasons:

  1. It searches for hyperparameters fully automatically.
  2. It searches stepwise (one parameter at a time), so the search is fast.
  3. You can use it immediately by rewriting existing lightgbm code in just a few places (described later).

IQ1 automatic hyperparameter search

Install the latest version of optuna (0.19.0 at the time of writing) with pip install optuna --upgrade.
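To check that the upgrade took, a minimal sketch (nothing article-specific is assumed here):

import optuna
print(optuna.__version__)  # should print 0.19.0 or later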

Data set preparation

The dataset used this time is Kaggle's House Prices (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview), a dataset for predicting the price of a house from information about the land and the building.

First, download the data using kaggle's API

$ kaggle competitions download -c house-prices-advanced-regression-techniques

Unzipping the download produces train.csv and test.csv. Submitting is a hassle, so this time I will only use train.csv. Open python in a jupyter notebook (anything works) and read it with pandas.

import pandas as pd

df=pd.read_csv("train.csv")
print(df.shape)

Output:
(1460, 81)

We can see there are 1460 rows and 81 columns. Since these include the Id and the SalePrice to be predicted, the usable features come to 79 dimensions.

Let's drop the columns that we don't need for the time being and set the objective variable separately.

y=df.SalePrice
X=df.drop(["Id","SalePrice"],axis=1)

Next, check the missing values. We won't process them, since they go straight into lightgbm; this is just for confirmation.

X.loc[:,pd.isnull(X).any(axis=0)].columns

Output:
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Electrical', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
       'MiscFeature'],
      dtype='object')
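If you also want the number of missing values in each column, a quick sketch (not in the original):

# count missing values per column, showing only columns that have any
missing = pd.isnull(X).sum()
print(missing[missing > 0])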

Next, label-encode the columns whose elements are strings so lightgbm can treat them as categorical variables, then set their dtype to category. In a nutshell, label encoding just replaces each string with an integer so that no two distinct values map to the same number.

from sklearn.preprocessing import LabelEncoder

for name in X.columns:
    if X[name].dtype=="object":
        # LabelEncoder cannot take NaN as input, so replace it with the string "NAN"
        X[name]=X[name].fillna("NAN")
        le = LabelEncoder()
        le.fit(X[name])
        encoded = le.transform(X[name])
        X[name] = pd.Series(encoded).astype('category')
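As a quick check that the encoding worked (a sketch, not in the original):

# all former object columns should now be category dtype
print(X.dtypes.value_counts())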

Learn with ordinary lightgbm

Now that preprocessing is done, let's train lightgbm. This is the code for feeding the data into plain lightgbm.

import lightgbm as lgb
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)

train_dataset=lgb.Dataset(X_train,y_train)
valid_dataset=lgb.Dataset(X_test,y_test,reference=train_dataset)

# for time measurement, put %%time at the top of the jupyter cell
params={"objective":"regression",
                    "learning_rate":0.05}
model=lgb.train(params,
                train_set=train_dataset,
                valid_sets=[valid_dataset],
                num_boost_round=300,
                early_stopping_rounds=50)

Output:

...(abridged)...
Early stopping, best iteration is:
[113]	valid_0's l2: 6.65263e+08
CPU times: user 3.11 s, sys: 537 ms, total: 3.65 s
Wall time: 4.47 s

Training worked, and it took 4.47 seconds. By the way, plotting the predictions looks like this, with the predicted value on the horizontal axis and the true value on the vertical axis.

[Figure: predicted vs. true SalePrice scatter plot]
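The plotting code isn't in the article; a minimal sketch that would produce such a plot (matplotlib assumed) is:

import matplotlib.pyplot as plt

# predict on the validation split and compare with the truth
y_pred = model.predict(X_test)
plt.scatter(y_pred, y_test, s=8)
plt.xlabel("predicted SalePrice")
plt.ylabel("true SalePrice")
plt.show()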

Learn with lightgbm_tuner

Rewrite the above code for IQ1 hyperparameter search.

import lightgbm as lgb
import optuna.integration.lightgbm_tuner as lgb_tuner
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1)

train_dataset=lgb.Dataset(X_train,y_train)
valid_dataset=lgb.Dataset(X_test,y_test,reference=train_dataset)

params={"objective":"regression",
                    "learning_rate":0.05,
                    "metric":"l2"}
model=lgb_tuner.train(params,
                      train_set=train_dataset,
                      valid_sets=[valid_dataset],
                      num_boost_round=300,
                      early_stopping_rounds=50)

Can you tell where it was rewritten? Spot-the-difference, IQ1 edition.

The rewritten places are the following three:

  1. Add the import statement
  2. Change lgb.train to lgb_tuner.train
  3. Add the metric to optimize to params

By the way, the tuning time is as follows. It was slower than I expected...

Output:
CPU times: user 3min 24s, sys: 33.8 s, total: 3min 58s
Wall time: 3min 48s

The score on the validation data is as follows.

model.best_score

Output:
defaultdict(dict, {'valid_0': {'l2': 521150494.1730755}})
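The parameters the tuner settled on can also be inspected; as a sketch (assuming the returned lightgbm Booster exposes its params attribute, as it normally does):

# hyperparameters chosen by the stepwise search
print(model.params)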

Comparing the results of this experiment side by side, tuning has improved the performance properly. In RMSE terms, the average error drops from roughly √6.65e+08 ≈ 25,800 to √5.21e+08 ≈ 22,800.

                          lightgbm       lightgbm_tuner
Learning time             4.47 s         228 s
valid accuracy (MSE)      6.65263e+08    5.21150e+08

Looking at the plot, the higher-priced houses (toward the right) are also predicted noticeably better.

[Figure: predicted vs. true SalePrice scatter plot after tuning]

Wrapping up

With this, it seems even an IQ1 can do machine learning!! By the way, when I submitted the model built this way, it placed around 2000th. (There are about 4900 participants, including those at or below the sample_submission.csv baseline, so plenty of people are at IQ1 or below.)

Postscript

When I casually tried a hyperparameter search with plain optuna, the validation score improved but the submission score got slightly worse. (For submission, I retrained with learning_rate=0.05, num_boost_round=1000, early_stopping_rounds=50.) Did I overfit the validation data? Hyperparameter search really is hard at IQ1.

**If you have any practical hyperparameter search advice, please comment!!**

The tuning strategy this time was as follows:

- Line up a wide range of tunable parameters
- Set learning_rate coarsely (high) to get more trials in

The submission scores (lower is better):

- Default-parameter lightgbm: 0.13852
- lightgbm_tuner: 0.13174
- optuna: 0.13401

import optuna

def objective(trial):
    '''
    trial: one set of hyperparameters suggested by optuna
    '''
    # hyperparameters to search
    bagging_fraction = trial.suggest_uniform("bagging_fraction",0,1)
    bagging_freq = trial.suggest_int("bagging_freq",0,10)
    feature_fraction = trial.suggest_uniform("feature_fraction",0,1)
    lambda_l1 = trial.suggest_uniform("lambda_l1",0,50)
    lambda_l2 = trial.suggest_uniform("lambda_l2",0,50)
    min_child_samples = trial.suggest_int("min_child_samples",1,50)
    num_leaves = trial.suggest_int("num_leaves",2,50)
    max_depth = trial.suggest_int("max_depth",0,8)
    params={"learning_rate":0.5,
                    "objective":"regression",
                    "bagging_fraction":bagging_fraction,
                    "bagging_freq":bagging_freq,
                    "feature_fraction":feature_fraction,
                    "lambda_l1":lambda_l1,
                    "lambda_l2":lambda_l2,
                    "min_child_samples":min_child_samples,
                    "num_leaves":num_leaves,
                    "max_depth":max_depth}

    model_opt = lgb.train(params,train_set=train_dataset,valid_sets=[valid_dataset],
                          num_boost_round=70,early_stopping_rounds=10)
 
    return model_opt.best_score["valid_0"]["l2"]

study = optuna.create_study()  # minimizes the objective by default, which suits l2
study.optimize(objective, n_trials=500)
Output:

...(abridged)...
[I 2019-12-01 15:02:35,075] Finished trial#499 resulted in value: 537618254.528029. Current best value is 461466711.4731979 with parameters: {'bagging_fraction': 0.9973929186258068, 'bagging_freq': 2, 'feature_fraction': 0.9469601028256658, 'lambda_l1': 10.1589501379876, 'lambda_l2': 0.0306013767707684, 'min_child_samples': 2, 'num_leaves': 35, 'max_depth': 2}.

The validation score was 4.61467e+08.
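For reference, retraining for submission with the best parameters found could look like the sketch below. The exact retraining code isn't in the article; it uses the postscript's settings of learning_rate=0.05, num_boost_round=1000, early_stopping_rounds=50:

# start from the best trial's parameters and restore the fixed settings
best_params = dict(study.best_params)
best_params.update({"objective": "regression", "learning_rate": 0.05})
final_model = lgb.train(best_params,
                        train_set=train_dataset,
                        valid_sets=[valid_dataset],
                        num_boost_round=1000,
                        early_stopping_rounds=50)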
