[PYTHON] SIGNATE [1st _Beginner Limited Competition] How to Solve Bank Customer Targeting

Introduction

I participated in the "[1st _Beginner Limited Competition] Bank Customer Targeting" held in August 2020 on SIGNATE, one of Japan's machine learning competition platforms, and this article describes how I solved it, partly as a memorandum for myself. **Note that the solution is nothing particularly original; I hope it will be helpful for machine learning beginners** (the text is long).

Beginner limited competition

On SIGNATE, titles are awarded according to competition results, and the title you start with when you register is "Beginner". This competition was open only to people in the lowest Beginner class (apparently this was the first Beginner-limited competition).

Normally, to move up from Beginner to the next title, Intermediate, you have to finish in the top 60% of a competition at least once. In this competition, however, anyone who achieved the specified score was automatically promoted to Intermediate at that point.

I had only just registered on SIGNATE and was still a Beginner, so I participated.

piramid.png

Competition overview

The task is to predict, from customer attribute data and contact information from past campaigns, whether a customer opened an account as a result of a campaign conducted by a bank. In machine learning terms, this is a so-called "classification" problem.

The data provided is as follows. The train data was 27100 records and the test data was 18050 records.

| column | Header name | Data type | Description |
|---|---|---|---|
| 0 | id | int | Row serial number |
| 1 | age | int | Age |
| 2 | job | varchar | Occupation |
| 3 | marital | varchar | Unmarried / married |
| 4 | education | varchar | Education level |
| 5 | default | varchar | Whether there is a default (yes, no) |
| 6 | balance | int | Average annual balance (€) |
| 7 | housing | varchar | Mortgage (yes, no) |
| 8 | loan | varchar | Personal loan (yes, no) |
| 9 | contact | varchar | Contact method |
| 10 | day | int | Last contact day |
| 11 | month | char | Last contact month |
| 12 | duration | int | Last contact duration (seconds) |
| 13 | campaign | int | Number of contacts in the current campaign |
| 14 | pdays | int | Elapsed days: days since contact in the previous campaign |
| 15 | previous | int | Contact record: number of contacts with the customer before the current campaign |
| 16 | poutcome | varchar | Result of the previous campaign |
| 17 | y | boolean | Whether or not the customer applied for a fixed deposit (1: yes, 0: no) |

Execution environment

OS: Windows 10
Processor: Core i7-5500U
Memory: 16GB
Anaconda3 environment (Python 3.7.6)

Directory structure

Bank_Prediction
 ├ notebook/ ●●●.ipynb
 ├ input/ train.csv, test.csv
 └ output/ The prediction results are output here

Flow of predictive model creation

Create a prediction model in the following order.

1. EDA (Exploratory Data Analysis)
2. Data Preprocessing
3. Learning and prediction
4. Result

1. EDA (Exploratory Data Analysis)

First of all, we will perform an analysis to confirm the structure and characteristics of the given data. For the sake of simplicity in the article, I will omit the EDA result of the test data.

Data reading

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns

#Set the maximum number of display columns to 50
pd.set_option('display.max_columns', 50)

#Reading various data
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

Confirmation of number of records, number of features, data type, presence of missing values

train.info()

info.png
The number of records is 27100 and the number of features is 18, and we can see which features are numerical variables and which are categorical. There also appear to be no missing values in this data. Since this dataset was prepared for the competition, it is clean data with no missing values; with real-world data there are usually many missing values, and some form of imputation is commonly required.
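No imputation is needed here, but purely for reference, the following is a minimal sketch of the kind of imputation often applied to real-world data (illustrative only; these calls are no-ops on this competition's data).

#Sketch: typical imputation for real-world data (not needed for this competition's data)
#Numerical column: fill missing values with the median
train['balance'] = train['balance'].fillna(train['balance'].median())
#Categorical column: fill missing values with a dedicated "unknown" label
train['job'] = train['job'].fillna('unknown')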

Confirmation of basic statistics

train.describe()

describe.png

Histogram confirmation of each feature

train.hist(figsize=(20,20), color='r')

histgram.png
The target y indicates whether or not an account was opened; the number of opened accounts (1) is clearly very small compared to unopened ones (0), so the data is imbalanced.
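The degree of imbalance can also be confirmed numerically with a quick check (not part of the original notebook).

#Check the class balance of the target variable
print(train['y'].value_counts())
print(train['y'].value_counts(normalize=True)) #as ratios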

Correlation coefficient

colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.select_dtypes(exclude='object').astype(int).corr(),linewidths=0.1,vmax=1.0, vmin=-1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

corr.png Among the features, previous (the number of contacts with customers so far) seems to have the highest correlation with whether or not an account has been opened.

Confirmation of distribution between features


g = sns.pairplot(train, hue='y', palette = 'seismic',size=1.2,diag_kind = 'kde',diag_kws=dict(shade=True),plot_kws=dict(s=10) )
g.set(xticklabels=[])

histgram2.png

Blue shows the distribution of customers who did not open an account, and red shows those who did. Looking at age in the diagonal histograms, younger people seem more likely not to open an account. There is also a difference in the distributions for day (last contact day).

Check the number of elements in each categorical variable


for col in train.select_dtypes(include='object').columns:
    counts = train[col].value_counts().to_dict()
    print(col)
    print(len(counts))
    print(counts)

category.png
This confirms how many distinct elements each categorical variable contains.

2. Data Preprocessing

From here, we will perform data preprocessing for creating a prediction model.

Addition of features

First, for categorical variables, we added one feature that combines three features related to loans.

We also added features based on the numerical variables: for several existing features, the difference between each record's value and that feature's median. Adding squared or cubed versions of features is also said to sometimes improve generalization performance, but I did not try that this time (a sketch of what it could look like follows the code block below).

#Merge train and test data
train2 =  pd.concat([train,test],axis=0)

#Feature addition
train2['default_housing_loan'] = train2['default'].str.cat([train2['housing'],train2['loan']], sep='_')
train2['age_median'] = train2['age'] - train2['age'].median()
train2['day_median'] = train2['day'] - train2['day'].median()
train2['duration_median'] = train2['duration'] - train2['duration'].median()
train2['campaign_median'] = train2['campaign'] - train2['campaign'].median()
train2['previous_median'] = train2['previous'] - train2['previous'].median()
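As mentioned above, squared and cubed features were not tried in this solution; the following is only a sketch of what they could look like (the choice of columns is arbitrary, not something used here).

#Sketch: squared/cubed features (not used in this solution; columns chosen arbitrarily)
for col in ['age', 'balance', 'duration']:
    train2[col + '_squared'] = train2[col] ** 2
    train2[col + '_cubed'] = train2[col] ** 3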

Label Encoding

Categorical variables cannot be fed into the prediction model as they are, so they must be encoded. There are several encoding methods; since the algorithm used for learning this time is a gradient boosting tree, Label Encoding is used (One-Hot Encoding is often the better choice in other settings, such as regression problems).

The following is an example of Label Encoding applied to the feature marital.

married → 0
single → 1
divorced → 2

#Label Encoding
from sklearn.preprocessing import LabelEncoder

category = train2.select_dtypes(include='object')

for col in list(category):
    le = LabelEncoder()
    le.fit(train2[col])
    train2[col] = le.transform(train2[col])
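For reference, if One-Hot Encoding were preferred instead (as mentioned above), a minimal sketch with pandas might look like the following. It would replace the Label Encoding step rather than follow it, since after Label Encoding there are no object columns left.

#Sketch: One-Hot Encoding alternative (not used here; would be applied instead of Label Encoding)
categorical_cols = train2.select_dtypes(include='object').columns.tolist()
train2_ohe = pd.get_dummies(train2, columns=categorical_cols)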

3. Learning and prediction

Now that preprocessing of the given data is complete, we move on to training and prediction. The algorithm used for learning is LightGBM. This time, 20 models were created by changing the random seed used to split the data into training and validation sets, and the average of their predictions was taken as the final result (Random Seed Averaging). The hyperparameters are tuned with Optuna.

~~Also, because of the **imbalanced data, I specified "'class_weight': 'balanced'" in the LightGBM params**.~~ **(Correction) This was unnecessary, because AUC, the evaluation metric, is not affected by class imbalance. Also, class_weight is a parameter of LightGBM's Classifier (the scikit-learn API), so it cannot be specified in the params used here.**

train&predict



#import lightgbm
import optuna.integration.lightgbm as lgb #High para tuning with Optuna
from sklearn.model_selection import  train_test_split
import datetime

#Divide the merged train2 into train and test again
train = train2[:27100]
test = train2[27100:].drop(['y'],axis=1)

#Get the values of the objective and explanatory variables of train
target = train['y'].values
features = train.drop(['id','y'],axis=1).values

#test data
test_X = test.drop(['id'],axis=1).values

lgb_params = {'objective': 'binary',
              'metric': 'auc', #The evaluation index specified by the competition is AUC
              #'class_weight': 'balanced' #I didn't need it here
             }

#Random Seed Averaging: 20 runs
for i in range(20):

    #Split train into training data and validation data (a different random seed each run)
    train_X, val_X, train_y, val_y = train_test_split(features, target, test_size=0.2)


    #Create the datasets for LightGBM
    lgb_train = lgb.Dataset(train_X, train_y, feature_name = list(train.drop(['id','y'],axis=1))) #For training
    lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train) #For validation (early stopping)
    
    #Specify the categorical variables
    categorical_features = ['job', 'marital', 'education', 'default', 'balance','month',
                            'housing', 'loan','poutcome', 'default_housing_loan']

    #Training
    model = lgb.train(lgb_params, lgb_train, valid_sets=lgb_eval,
                      categorical_feature = categorical_features,
                      num_boost_round=1000,
                      early_stopping_rounds=20,
                      verbose_eval=10)

    pred = model.predict(test_X) #Predicted probability of applying for an account
    
    #Store each run's prediction result
    output = pd.DataFrame(pred, columns=['pred' + str(i+1)])
    if i == 0:
        output2 = output
    else:
        output2 = pd.concat([output2, output], axis=1)
    
    #End of for

#Average each prediction result
df_mean = output2.mean(axis='columns')
df_result = pd.concat([test['id'],df_mean],axis=1)

#Export with time attached to file name
now = datetime.datetime.now()
df_result.to_csv('../output/submission' + now.strftime('%Y%m%d_%H%M%S') + '.csv',index=None,header=None)

4. Result

The score (AUC) required by the competition was 0.85, and my **final score was 0.855**, so I was successfully promoted to Intermediate. **The final ranking was 62nd out of 787 participants**, which was neither bad nor particularly outstanding.

By the way, the transition of the score is as follows.

**0.8470: No Random Seed Averaging**
↓ (+0.0034)
**0.8504: Random Seed Averaging, 5 runs**
↓ (+0.0035)
~~**0.8539: Specifying "'class_weight': 'balanced'"**~~
↓ (+0.0016)
**0.8555: Random Seed Averaging, 20 runs**

~~ In my case, I feel that the specification of "'class_weight':'balanced'" was quite effective. ~~

In addition, there was one fatal mistake during the competition (it has been corrected in the code posted on Qiita), and without it I feel the score could have reached about 0.857 (a little disappointing).

By the way, on the forum (the competition bulletin board) it was mentioned that running Random Seed Averaging 100 times improves the score considerably. I should have increased the number of averaging runs (though I wasn't prepared to train for 10 hours lol).

Handling of imbalanced data

**(Correction) As noted above, AUC, the evaluation metric used here, is not affected by class imbalance, so this did not actually need to be considered this time. Also, class_weight is a parameter of LightGBM's Classifier (the scikit-learn API).**

I noticed that the training data this time was imbalanced. When training on imbalanced data, the model tends to simply predict the negative (majority) class, so the following kinds of processing are common.

1. Undersample the negative cases to match the number of positive cases
2. Weight the samples during training instead of undersampling

This time I did not undersample but used approach 2. I referred to the following page.

It is better to set class weight when classifying biased data in random forest

If you want to implement the undersampling of approach 1, the following page is helpful.

Downsampling + bagging with LightGBM --a memorandum of u ++

By the way, in my experience, whether undersampling or weighting works better depends on the problem, so I recommend trying both and adopting whichever gives the better score (minimal sketches of both are shown below).
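For reference, here are minimal sketches of both approaches, reusing the variable names from the code above (neither is part of the final solution).

#Sketch 1: undersample the negative class to match the number of positives
from sklearn.utils import resample

pos = train[train['y'] == 1]
neg = train[train['y'] == 0]
neg_down = resample(neg, replace=False, n_samples=len(pos), random_state=0)
train_balanced = pd.concat([pos, neg_down], axis=0)

#Sketch 2: weight the positive class instead of undersampling
#(with the lgb.train API this corresponds to scale_pos_weight;
# class_weight is a parameter of the scikit-learn LGBMClassifier)
lgb_params_weighted = dict(lgb_params, scale_pos_weight=len(neg) / len(pos))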

Other things I tried

I also tried Pseudo Labeling, but I didn't use it because it wasn't very effective in this competition.
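For those unfamiliar with it, Pseudo Labeling roughly means adding confidently predicted test records, with their predicted labels, to the training data and retraining. A minimal sketch using the arrays from the code above follows; the thresholds are hypothetical, and this is only an illustration, not what was submitted.

#Sketch: Pseudo Labeling (tried but not adopted; thresholds are hypothetical)
import numpy as np

confident = (pred > 0.95) | (pred < 0.05)       #test records the model is confident about
pseudo_X = test_X[confident]
pseudo_y = (pred[confident] > 0.5).astype(int)  #hard labels from the predicted probabilities

features_pl = np.concatenate([features, pseudo_X], axis=0)
target_pl = np.concatenate([target, pseudo_y], axis=0)
#Retrain the model on (features_pl, target_pl) in the same way as above.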

According to other participants, Target Encoding and Stacking were not very effective either, so it seems this was a competition best tackled in an orthodox way with a single model.
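For reference only, a minimal sketch of simple Target Encoding for a single column is shown below (the column choice is arbitrary, and a proper implementation would compute the means out-of-fold to avoid target leakage).

#Sketch: simple mean Target Encoding of one column (reportedly not very effective here)
te_means = train.groupby('job')['y'].mean()   #mean application rate per job category
train['job_te'] = train['job'].map(te_means)
test['job_te'] = test['job'].map(te_means)
#In practice, compute the means out-of-fold (e.g. with KFold) to avoid leakage.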

Supplement

Since the same subject is available as one of SIGNATE's practice problems, you can download the data from the following page and check how the code works if you want to actually run it.

[Practice question] Bank customer targeting

Finally

Although it was a Beginner-limited competition, it was very rewarding and I learned a lot. Next, I would like to take on Kaggle's MoA (drug mechanism-of-action) competition and ProbSpace's Splatoon competition. I have also applied for "AI QUEST", the AI human resources development program sponsored by the Ministry of Economy, Trade and Industry, so if I am lucky enough to be accepted, my days are going to be busy.

P.S. It took a lot of time to draw the SIGNATE title pyramid ...
