[PYTHON] I tried "K-Fold Target Encoding"

Introduction

Around the end of 2019 and the beginning of 2020, I remember Target Encoding becoming a hot topic.

Target Encoding replaces a categorical variable with the mean of the objective (target) variable for that category. If you do this naively, however, a leak occurs, so some care is needed. To prevent leaks, you can use the Leave-One-Out method, which uses the mean computed without the row being converted, or split the data with K-fold and replace each row with the mean computed from the folds that do not contain that row.
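
For reference, here is a minimal sketch (my own illustration, assuming a DataFrame df with a categorical column Feature and a numeric column Target) of the naive, leaky version that replaces each category with the mean of the target over all rows, including the row being encoded:

import pandas as pd

# Naive target encoding: every row of a category gets the same mean,
# computed over ALL rows -- including the row itself, hence the leak.
df['Feature_TE'] = df.groupby('Feature')['Target'].transform('mean')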

There are already many explanations of Target Encoding, so I will not explain it in this article (for example, the site in the reference below is very helpful). Instead, I take the nice Target Encoding code from that site, actually run it, and see how it works.

Reference

- Python: How to use Target Encoding (https://blog.amedama.jp/entry/target-mean-encoding-types#Leave-one-out-TS-%E4%BD%BF%E3%81%A3%E3%81%A1%E3%82%83%E3%83%80%E3%83%A1)

Procedure

Sample data frame

A function that creates a sample data frame.

import numpy as np
import pandas as pd

def getRandomDataFrame(data, numCol):
    """Create a sample data frame with numCol rows.
    'train' has Feature and Target columns; 'test' has only Feature."""
    if data == 'train':
        key = ['A' if x == 0 else 'B' for x in np.random.randint(2, size=(numCol,))]
        value = np.random.randint(2, size=(numCol,))
        df = pd.DataFrame({'Feature': key, 'Target': value})
        return df
    elif data == 'test':
        key = ['A' if x == 0 else 'B' for x in np.random.randint(2, size=(numCol,))]
        df = pd.DataFrame({'Feature': key})
        return df
    else:
        print(';)')

You can generate a data frame with the following code. If 'test' is specified as the first argument, the objective variable column is not included. The second argument specifies the number of rows.

train = getRandomDataFrame('train', 10)
test = getRandomDataFrame('test', 10)

The contents of train and test are as shown in the two screenshots (train: Feature and Target columns; test: Feature only).
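
To check the actual contents yourself (the values vary from run to run because the data is random), you can simply print the frames:

print(train)
print(test)
print(train.groupby('Feature')['Target'].mean())  # per-category target means before any encoding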

K-fold Target Encoding

The K-fold Target Encoding class. It has fit and transform, so it can be used in the same way as sklearn's preprocessing classes. The test encoder takes the encoded train data as input and adds the Target Encoding feature. The part marked (1) in the comments fills the rows that become nan during the K-fold procedure with the overall mean; we will look at this later.

from sklearn import base
from sklearn.model_selection import KFold

class KFoldTargetEncoderTrain(base.BaseEstimator,
                               base.TransformerMixin):
    """How to use.
    targetc = KFoldTargetEncoderTrain('Feature','Target',n_fold=5)
    new_train = targetc.fit_transform(train)
    """
    def __init__(self,colnames,targetName,
                  n_fold=5, verbosity=True,
                  discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col
        
    def fit(self, X, y=None):
        return self
    
    def transform(self,X):        
        assert(type(self.targetName) == str)
        assert(type(self.colnames) == str)
        assert(self.colnames in X.columns)
        assert(self.targetName in X.columns)       
        
        mean_of_target = X[self.targetName].mean()
        kf = KFold(n_splits=self.n_fold, shuffle=False)  # random_state only applies when shuffle=True
        col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
        X[col_mean_name] = np.nan       
        
        for tr_ind, val_ind in kf.split(X):
            X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
            X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(X_tr.groupby(self.colnames)[self.targetName].mean())
            X[col_mean_name] = X[col_mean_name].fillna(mean_of_target)  # Fill rows that became nan with the overall mean --(1)
            
        if self.verbosity:            
            encoded_feature = X[col_mean_name].values
            print('Correlation between the new feature, {} and, {} is {}.'.format(col_mean_name,self.targetName, 
                                                                                  np.corrcoef(X[self.targetName].values,encoded_feature)[0][1]))
        if self.discardOriginal_col:
            X = X.drop(self.targetName, axis=1)
        return X
    
    
class TargetEncoderTest(base.BaseEstimator, base.TransformerMixin):
    """How to use.
    test_targetc = TargetEncoderTest(new_train,
                                      'Feature',
                                      'Feature_Kfold_Target_Enc')
    new_test = test_targetc.fit_transform(test)
    """
    
    def __init__(self,train,colNames,encodedName):
        
        self.train = train
        self.colNames = colNames
        self.encodedName = encodedName
        
    def fit(self, X, y=None):
        return self
    
    def transform(self,X):       
        mean =  self.train[[self.colNames, self.encodedName]].groupby(self.colNames).mean().reset_index() 
        
        dd = {}
        for index, row in mean.iterrows():
            dd[row[self.colNames]] = row[self.encodedName]
        # Copy the categorical column, then replace each category with its mean encoding
        X[self.encodedName] = X[self.colNames]
        X = X.replace({self.encodedName: dd})
        return X

Use them as follows. In the constructor of KFoldTargetEncoderTrain, specify the name of the categorical column to encode, the name of the objective variable column, and the number of folds. In the constructor of TargetEncoderTest, specify the encoded train data frame, the name of the encoded categorical column, and the name of the Target Encoding feature column ([encoded categorical column name]_Kfold_Target_Enc).

targetc = KFoldTargetEncoderTrain('Feature','Target',n_fold=5)
new_train = targetc.fit_transform(train)

test_targetc = TargetEncoderTest(new_train, 'Feature', 'Feature_Kfold_Target_Enc')
new_test = test_targetc.fit_transform(test)

Each has the following contents, shown in the two screenshots (new_train and new_test).

Let's check new_train. With 5 folds, the 10 rows are split into 5 folds of 2 rows each; the first fold is the first and second rows from the top. To encode these two rows, we look at the combined data of the other four folds, i.e. rows 3-10. There, the mean of Target in group A is 3/4 = 0.75 and in group B is 1/4 = 0.25, and these values are used to encode the first fold: its first and second rows are both A, so both become 0.75. The same procedure is applied to every fold.
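
As a quick check (a sketch reusing the train and new_train objects above; the exact numbers depend on the random data), the out-of-fold means for the first fold can be recomputed by hand:

# With shuffle=False, rows 3-10 (positions 2 and onward) form the other four folds
rest = train.iloc[2:]
print(rest.groupby('Feature')['Target'].mean())  # e.g. A -> 0.75, B -> 0.25
print(new_train.iloc[:2])  # the first two rows should carry these means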

Let's check new_test. The test data is encoded by taking, for each category, the mean of the Target Encoding feature over the train data. Here A is (0.75 + 0.75 + 0.6 + 0.8 + 0.5 + 0.5) / 6 = 0.65 and B is (0.3333 + 0.3333 + 0.0 + 0.0) / 4 ≈ 0.1667.
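
The same per-category means can be read off new_train directly (again, the numbers depend on the random data):

# Mean of the K-fold encoding per category; this is what each test row receives
print(new_train.groupby('Feature')['Feature_Kfold_Target_Enc'].mean())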

Next, consider the case where the encoding becomes nan.

train = getRandomDataFrame('train', 10)
train.loc[0, 'Feature'] = 'C'

With this data, encoding the first row requires the mean of group C in the remaining folds, but there is no C in those folds, so the Target Encoding feature becomes nan. It is therefore filled with the mean of the objective variable over all rows: C becomes (1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1) / 10 = 0.5.
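
To reproduce this (a sketch; the target values are those of the example above, so your run may differ), re-run the encoder on the modified frame and compare against the overall mean:

new_train = KFoldTargetEncoderTrain('Feature', 'Target', n_fold=5).fit_transform(train)
print(train['Target'].mean())                        # overall mean of the target, e.g. 0.5
print(new_train.loc[0, 'Feature_Kfold_Target_Enc'])  # the C row is filled with that mean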


Incidentally, if you comment out the part marked (1), the value stays np.nan. LightGBM can train and predict even when the features contain nan, so it may be better not to fill in the mean.
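
A minimal sketch of that option (assuming the lightgbm package is installed; LightGBM accepts np.nan in its features):

import lightgbm as lgb

X = new_train[['Feature_Kfold_Target_Enc']]  # may contain np.nan if (1) is commented out
y = new_train['Target']
# min_child_samples is relaxed only because this toy frame has 10 rows
model = lgb.LGBMClassifier(n_estimators=10, min_child_samples=1)
model.fit(X, y)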

Leave-one-out Target Encoding

This method is said to leak more strongly than K-fold Target Encoding (with a binary target, the leave-one-out value shifts depending on the row's own label, which a model can exploit), so it is said that you should not use it. Still, since I went to the trouble of writing it, I will put the code here as well.

class LOOTargetEncoderTrain(base.BaseEstimator,
                               base.TransformerMixin):
    """How to use.
    targetc = LOOTargetEncoderTrain('Feature','Target')
    new_train = targetc.fit_transform(train)
    """
    def __init__(self,colnames,targetName,
                  verbosity=True, discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col
        
    def fit(self, X, y=None):
        return self
    
    def transform(self,X):        
        assert(type(self.targetName) == str)
        assert(type(self.colnames) == str)
        assert(self.colnames in X.columns)
        assert(self.targetName in X.columns)
        
        # Column name suffix kept the same as in the K-fold version
        col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
        X[col_mean_name] = np.nan
        # Per-category sum and count of the target, used to compute the leave-one-out mean
        self.agg_X = X.groupby(self.colnames).agg({self.targetName: ['sum', 'count']})
        X[col_mean_name] = X.apply(self._loo_ts, axis=1)
        
        return X
        
    def _loo_ts(self, row):
        # Leave-one-out mean: subtract this row's target from the category sum and
        # reduce the count by one (a category that appears only once yields nan)
        group_ts = self.agg_X.loc[row[self.colnames]]
        loo_sum = group_ts.loc[(self.targetName, 'sum')] - row[self.targetName]
        loo_count = group_ts.loc[(self.targetName, 'count')] - 1
        return loo_sum / loo_count

In conclusion

This time I tried K-Fold Target Encoding.

If the objective variable is binary, there also seem to be ways to prevent overfitting, such as smoothing (a rough sketch follows).
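
For reference, a minimal sketch of smoothed target encoding (my own illustration, not code from the referenced post): the category mean is blended with the global mean, weighted by the category count and a smoothing parameter m. On its own this still uses every row, so in practice it would be combined with the K-fold split above.

# Smoothed target encoding: enc = (count * cat_mean + m * global_mean) / (count + m)
def smoothed_target_encoding(df, col, target, m=10.0):
    global_mean = df[target].mean()
    agg = df.groupby(col)[target].agg(['mean', 'count'])
    smooth = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
    return df[col].map(smooth)

train['Feature_Smooth_Enc'] = smoothed_target_encoding(train, 'Feature', 'Target')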
