Introduction

I took part in the 1st Beginner Limited Competition (https://signate.jp/competitions/292), held on SIGNATE in August. It was the first competition I worked on seriously, and I finished 13th with a final score of AUC = 0.8588949 (admittedly a rather half-finished result ...). In this competition, scoring above a certain threshold earns a promotion from Beginner to Intermediate, and I was promoted.

This post summarizes what I did and, as a note to my future self, what I should have done differently.

The model and analysis results of this competition are published in accordance with the competition's information disclosure policy.

Overview of the competition

The dataset is marketing campaign data for time deposits at a financial institution. The source of the data is here, but it appears to have been slightly processed. The evaluation metric is AUC. See the competition link above for details.
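As a reminder of what the metric measures, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; the function name `pairwise_auc` and the toy labels and scores are made up for illustration and are not competition data.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Toy labels and scores (illustrative only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75 (3 of the 4 pairs are ordered correctly)
```

This quadratic version is only meant to show the definition; for real data, `sklearn.metrics.roc_auc_score` gives the same quantity far more efficiently.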

Environment


$sw_vers 
ProductName:	Mac OS X
ProductVersion:	10.13.6
BuildVersion:	17G14019


$python --version
Python 3.7.3

What I did

0. Determine random_seed

This is like washing your hands before cooking, but it matters because otherwise results may not be reproducible later. Whenever a function takes a `random_seed` or `random_state` argument, be sure to set it so that the result can be reproduced.
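A minimal sketch of the idea, assuming only the standard `random` module and NumPy are in play; the helper name `seed_everything` is hypothetical. Library functions with their own `random_state` argument still need it passed explicitly.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed the global generators that plain Python and NumPy code draw from."""
    random.seed(seed)
    np.random.seed(seed)


random_state = 1234

# Draw some numbers, reseed, and draw again: both runs should match.
seed_everything(random_state)
first = (random.random(), np.random.rand(3).tolist())
seed_everything(random_state)
second = (random.random(), np.random.rand(3).tolist())
print(first == second)  # True
```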

1. Let's see what it looks like (H2O)

I loaded the data into H2O to inspect it and to see which algorithms rank highest when run through AutoML. Please see Past Articles for H2O. Decision-tree-based algorithms came out on top in the AutoML run, so I decided to go with LightGBM from here on.

2. Create a flow from data acquisition to machine learning model construction to prediction (Jupyter Notebook)

I prepared separate notebook files for data processing and for model construction (with a single file, readability suffers and unnecessary processing may be rerun each time).

2-1. Data processing part

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce


%matplotlib inline
pd.set_option('display.max_columns', None)
random_state = 1234

df = pd.read_csv('./0_rawdata/train.csv')

Here is some code for checking the data. Check the data types and whether there are any nulls ↓

df.info()
df.describe()

Visualization of numerical data ↓

df.hist( figsize=(14, 10), bins=20)

Visualization of character string data ↓

plt.figure( figsize = (20, 15))

cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
for i, col in enumerate(cols):
    plt.subplot(3,3,i+1)
    df[col].value_counts().plot.bar()
    plt.title(col)

Based on the visualizations above, I decided not to use `id` (of course) or `balance` and `pdays`, which seemed to have a uniform distribution, and removed them from the training data. `default` was also dropped because almost all of its values were no. In addition, I created the training data by converting string and categorical columns to numbers.
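The "mostly no" judgment for `default` was made by eye from the bar charts. One way to put a number on it is to compute, for each column, the share of rows taken by its most frequent value. The sketch below does this on a made-up toy frame; the helper name `dominant_share` and the 0.9 cutoff are assumptions for illustration, not values from the competition.

```python
import pandas as pd


def dominant_share(df, cols):
    """Return, per column, the fraction of rows taken by its most frequent value."""
    shares = {
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in cols
    }
    return pd.Series(shares).sort_values(ascending=False)


# Toy frame with made-up values (not the competition data).
toy = pd.DataFrame({
    "default": ["no"] * 19 + ["yes"],
    "housing": ["yes"] * 11 + ["no"] * 9,
})

shares = dominant_share(toy, ["default", "housing"])
print(shares)

# Columns above a hypothetical 0.9 cutoff are candidates for dropping.
print(shares[shares >= 0.9].index.tolist())  # ['default']
```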

df2 = df.copy()
df2 = df2.drop( columns=['id', 'balance', 'pdays', 'default'])

# month
month_map={
    'jan':1,
    'feb':2,
    'mar':3,
    'apr':4,
    'may':5,
    'jun':6,
    'jul':7,
    'aug':8,
    'sep':9,
    'oct':10,
    'nov':11,
    'dec':12}
# Map first, then fill: missing or unmapped months become 0
df2['month'] = df2['month'].map(month_map)
df2['month'] = df2['month'].fillna(0)

# job, marital, education, housing, loan, contact, poutcome
cols = ['job', 'marital', 'education', 'housing', 'loan', 'contact', 'poutcome']
ce_onehot = ce.OneHotEncoder(cols=cols,handle_unknown='impute')
ce_onehot.fit( df2 )
df2 = ce_onehot.transform( df2 )

df2['duration'] = df2['duration'] / 3600

df2.to_csv('mytrain.csv', index=False)

2-2. Model construction / prediction part


import pandas as pd
import numpy as np
import category_encoders as ce
import lightgbm as lgb
#import optuna
from optuna.integration import lightgbm as lgb_optuna
from sklearn import preprocessing
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_validate
from sklearn.metrics import roc_auc_score

pd.set_option('display.max_columns', None)

random_state = 1234
version = 'v1'

Split the data into training and validation sets (8:2).


df_train = pd.read_csv('mytrain.csv')

X = df_train.drop( columns=['y'] )
y = df_train['y']
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=random_state)

The following methods were used for model construction and accuracy verification.

- Cross-validation on the training data, split into 5 folds by stratified sampling
- Hyperparameter tuning is left to optuna
- The metric used for optimization is logloss
- Retrain the model on the entire training data and calculate the AUC on the validation data


def build():
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    
    lgb_train = lgb_optuna.Dataset(X_train, y_train)
    
    lgbm_params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'random_state':random_state,
        'verbosity': 0
    }
    
    tunecv = lgb_optuna.LightGBMTunerCV(
        lgbm_params,
        lgb_train,
        num_boost_round=100,
        early_stopping_rounds=20,
        seed = random_state,
        verbose_eval=20,
        folds=kf
    )
    
    tunecv.run()
    
    print( 'Best score = ',tunecv.best_score)
    print( 'Best params= ',tunecv.best_params)
    
    return tunecv

tunecv = build()
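As a side note on the `StratifiedKFold` used in `build()` above: stratification keeps the share of positive labels roughly the same in every fold, which matters when the target is imbalanced. The sketch below illustrates this on a synthetic target with 10% positives; the data is invented for the demonstration and the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

random_state = 1234

# Synthetic imbalanced target (10% positives); the features are irrelevant here.
y = np.array([1] * 100 + [0] * 900)
X = np.zeros((len(y), 1))

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

# Positive rate of each validation fold.
fold_rates = [y[valid_idx].mean() for _, valid_idx in kf.split(X, y)]
print(fold_rates)  # each fold holds 20 positives out of 200 rows, i.e. 0.1
```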

Retrain the model on the entire training data and calculate the AUC on the holdout (validation) data ↓

train_data = lgb.Dataset( X_train, y_train )
eval_data = lgb.Dataset(X_holdout, label=y_holdout, reference= train_data)
clf = lgb.train( tunecv.best_params, 
                train_data,
                valid_sets=eval_data,
                num_boost_round=50,
                verbose_eval=0
               )
y_pred = clf.predict( X_holdout )
print('AUC: ', roc_auc_score(y_holdout, y_pred))
# AUC:  0.8486429810797091

3. Trial and error while looking at data and accuracy

#	What I did	AUC	Submit score	Impressions
00	Use the process above as the baseline	0.8486	---	---
01	Change the encoding of `job`, `marital`, `education`, `poutcome` to target encoding	0.8458	---	It went down slightly, but I'll go with this for now
02	num_boost_round=200 (the learning curve suggested the score could still improve a little)	0.8536	---	It went up. Go with this
03	Noticed that the parameters used when retraining on the entire training data differed from those used for hyperparameter tuning. Unified them to num_boost_round=200, early_stopping_rounds=20.	0.8585	---	Go with this
04	Try setting the optimization metric to AUC	0.8557	---	It went down. Keep logloss
05	Change `loan`, `housing`, `contact` to ordinal encoding	0.8593	0.8556	The AUC went up, so I'll go with this. However, the submit score is a little low.
06	Check the difference between test data and training data. Visual comparison shows no big difference. I also built a model that predicts whether a row is test data, but its AUC was about 0.5, so I judged that the test and training data do not differ	---	---	---
07	Change the encoding of month (combine several months with a small amount of data)	0.8583	0.8585	Almost the same as the AUC of 03. Rejected.
08	Change the encoding of month (combine several months with a small amount of data)	0.8583	0.8585	AUC dropped from 05. Rejected.
09	Add last month's average of y as a column, like a time-series lag variable	0.8629	0.8559	The score on the training data improved, but I rejected it because the test score decreased.
10	Categorize `age` (combine `age` values with few rows)	0.8599	0.8588	Slightly improved. I will go with this.
11	Try applying PCA	0.8574	---	It went down
12	Try other algorithms (SVM, RandomForest, LogisticRegression)	---	---	It went down
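The check in row 06 of the table is often called adversarial validation: label training rows 0 and test rows 1, then see whether a classifier can tell them apart. An AUC near 0.5 suggests the two sets look alike. The sketch below runs the idea on synthetic data and uses scikit-learn's `LogisticRegression` as a stand-in, since the table does not say which model was used; the function name `adversarial_auc` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def adversarial_auc(train_X, test_X):
    """Cross-validated AUC of a classifier that tries to tell train rows from test rows."""
    X = np.vstack([train_X, test_X])
    is_test = np.concatenate([
        np.zeros(len(train_X), dtype=int),
        np.ones(len(test_X), dtype=int),
    ])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, is_test, cv=5, scoring="roc_auc").mean()


# Synthetic data: one test set from the same distribution, one shifted.
rng = np.random.default_rng(1234)
train_X = rng.normal(size=(1000, 5))
test_same = rng.normal(size=(1000, 5))
test_shifted = test_same + 1.0

auc_same = adversarial_auc(train_X, test_same)
auc_shifted = adversarial_auc(train_X, test_shifted)
print("same distribution:", auc_same)        # expected to be close to 0.5
print("shifted distribution:", auc_shifted)  # expected to be well above 0.5
```

When the AUC is clearly above 0.5, the features with the largest model weights or importances are a natural place to look for the difference.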

I tried other small changes besides the above, but the accuracy did not improve. Recording every trial also got tedious ... and before I knew it, the competition period was over.

What I should have done

- Data processing
  - If I had looked at the data more closely, including a few more cross-tabulations, I might have discovered something.
  - Try merging with the original data (UCI) (it has probably been partly processed, so some ingenuity is required)
  - Consider interaction terms
- Models
  - I could have tried an ensemble of LightGBM models with different random_state values.
- Interpretation
  - I should have dug deeper into where the accuracy was poor (I did find that age could be categorized, but I could have gone a little further)
- Tools and others
  - Git or anything similar would do, but I should have used a code management tool
  - Similarly, I should have introduced an experiment management tool (something like MLOps)
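The ensemble idea in the list above (the same model trained with several random_state values, predictions averaged) can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `GradientBoostingClassifier` in place of LightGBM, synthetic data from `make_classification`, and a hypothetical helper named `seed_average`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def seed_average(X_train, y_train, X_valid, seeds):
    """Train one model per seed and average the predicted positive-class probabilities."""
    preds = []
    for seed in seeds:
        # subsample < 1 makes each boosting run depend on its seed.
        model = GradientBoostingClassifier(
            n_estimators=50, subsample=0.8, random_state=seed
        )
        model.fit(X_train, y_train)
        preds.append(model.predict_proba(X_valid)[:, 1])
    return np.mean(preds, axis=0)


# Synthetic binary classification data (not the competition data).
X, y = make_classification(n_samples=600, n_features=10, random_state=1234)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

avg_pred = seed_average(X_train, y_train, X_valid, seeds=[1234, 1235, 1236])
print("AUC of the seed-averaged prediction:", roc_auc_score(y_valid, avg_pred))
```

Averaging over seeds mainly reduces the variance that comes from row or feature subsampling, so any gain is usually small but cheap to obtain.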

In closing

There are surely many other things I should have done, so I would appreciate any comments. For the next competition, I would like to incorporate new techniques while referring to these reflections and to Kaggle kernels.

[PYTHON] Signate_ Review of the 1st Beginner Limited Competition