[PYTHON] LightGBM UserWarning: Using categorical_feature in Dataset

This article looks at how to handle categorical variables in LightGBM, a Gradient Boosting Decision Tree library that has become a standard tool in machine learning. The LightGBM version at the time of writing is 2.3.0.

Conclusion

There are at least three ways to specify a categorical variable, but at the time of writing (3) dtype='category' seems to be the best choice. (1) and (2) are also popular, but each of them raises a UserWarning; have they been discouraged recently?

3 methods

1. Pass categorical_feature to Dataset

lgb_train = lgb.Dataset(X_train, y_train, categorical_feature=['A'])

X_train is a pandas.DataFrame and 'A' is the column name of the categorical variable.

UserWarning appears:

python3.7/site-packages/lightgbm/basic.py:1243: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')

Yes, I did specify categorical_feature in the Dataset. So what?

2. Pass categorical_feature to train()

gbm = lgb.train(params,
                lgb_train,
                categorical_feature=['A'],
                )

UserWarning:

python3.7/site-packages/lightgbm/basic.py:1247: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['A']

Well, is setting categorical_feature here discouraged too? If you set it in both the Dataset as in (1) and train() as in (2), the UserWarning does not appear, but that feels like needless duplication.
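The three observations so far (a warning in (1), a different warning in (2), and no warning when both are set to the same value) are consistent with train() simply comparing its own categorical_feature against the one stored on the Dataset. The sketch below is a toy model of that idea, not LightGBM's actual code: toy_set_categorical_feature and warnings_for are hypothetical names, and it assumes that both Dataset and train() default to categorical_feature='auto'.

```python
import warnings


def toy_set_categorical_feature(dataset_value, train_value):
    """Hypothetical model of how the two settings might be reconciled.

    dataset_value: what was passed to Dataset ('auto' if nothing was passed).
    train_value:   what was passed to train() ('auto' if nothing was passed).
    Returns the setting that ends up being used.
    """
    if dataset_value == train_value:
        # Identical settings: nothing to reconcile, so no warning.
        return dataset_value
    if train_value == "auto":
        # Case (1): only the Dataset carries an explicit setting.
        warnings.warn("Using categorical_feature in Dataset.")
        return dataset_value
    # Case (2): train() carries an explicit setting that differs.
    warnings.warn(
        "categorical_feature in Dataset is overridden.\n"
        "New categorical_feature is {}".format(sorted(train_value))
    )
    return train_value


def warnings_for(dataset_value, train_value):
    """Collect the warning messages emitted for one combination."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        toy_set_categorical_feature(dataset_value, train_value)
    return [str(w.message) for w in caught]


print(warnings_for(["A"], "auto"))   # method (1)
print(warnings_for("auto", ["A"]))   # method (2)
print(warnings_for(["A"], ["A"]))    # (1) and (2) together
print(warnings_for("auto", "auto"))  # nothing passed anywhere, as in (3)
```

Under this model, the only combinations that stay silent are the ones where the Dataset and train() agree, which would explain why setting both, or setting neither, avoids the warning.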

3. Use dtype='category'

X_train['A'] = X_train['A'].astype('category')

With this, the UserWarning does not appear. If you convert the column to the category dtype up front, you do not have to specify categorical_feature twice, once for the training Dataset and once for the validation Dataset, as you do in (1). The category dtype stores its values internally as a reasonably small integer type, so it is also RAM friendly. This looks good.
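To see what the category dtype does to storage, here is a minimal pandas-only sketch. The data is made up for illustration: the column 'A' holds three hypothetical string levels repeated many times, and the exact byte counts will depend on your pandas version.

```python
import pandas as pd

# Hypothetical data: one string column 'A' with three distinct levels.
X_train = pd.DataFrame({"A": ["red", "green", "blue"] * 1000})

# Memory used while the column still holds plain strings.
bytes_as_strings = X_train["A"].memory_usage(deep=True)

# Same conversion as in method (3).
X_train["A"] = X_train["A"].astype("category")
bytes_as_category = X_train["A"].memory_usage(deep=True)

# Each row is now a small integer code pointing into the list of categories.
codes_dtype = X_train["A"].cat.codes.dtype
categories = list(X_train["A"].cat.categories)

print("codes dtype:", codes_dtype)
print("categories:", categories)
print("bytes as strings:", bytes_as_strings)
print("bytes as category:", bytes_as_category)
```

With only a handful of distinct levels the codes fit in int8, so each row costs one byte plus a single shared copy of the level names. One thing to watch when you rely on this approach: the validation data should be converted to the category dtype with the same set of categories as the training data, otherwise the integer codes may not line up.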

I do not know when this UserWarning started to appear or whether it will stay. I wrote this article because I couldn't find any information about it online. It seems to be a recent change.
