[PYTHON] A no-brainer LightGBM template for data analysis competitions


Recently, I've been hooked on data analysis competitions such as Kaggle and Signate, and I'm studying every day while taking part in several competitions a little at a time. Before I dig into the data of a new competition, I run a LightGBM template to get a feel for the difficulty of the competition and the tendencies of the data, so I'm publishing it here.

Overall picture

Data reading

Import the required libraries and load the data. If you start without checking the training data carefully, the data may turn out to be unexpectedly large, so check its size first.

from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the data
train_df = pd.read_csv("./train.csv")
test_df = pd.read_csv("./test.csv")

print(train_df.shape, test_df.shape)
(891, 12) (418, 11)
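
One way to check the size before committing to a full load is to count the rows in the file and read only the first few of them. The snippet below is a minimal sketch on a small throwaway CSV written to a temporary directory. The file contents and the `count_rows` helper are made up for illustration; with real competition data you would point it at `train.csv` instead.

```python
import os
import tempfile

import pandas as pd


def count_rows(path):
    """Count data rows in a CSV file without loading it (header excluded).

    Assumes no quoted field contains a line break.
    """
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f) - 1


# Write a small throwaway CSV so the example is self-contained
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "train.csv")
sample = pd.DataFrame({"PassengerId": range(1, 101), "Fare": [7.25] * 100})
sample.to_csv(csv_path, index=False)

n_rows = count_rows(csv_path)               # cheap: streams the file line by line
head_df = pd.read_csv(csv_path, nrows=5)    # loads only the first 5 rows
mem_bytes = head_df.memory_usage(deep=True).sum()

print(n_rows, head_df.shape, mem_bytes)
```

If the row count is far larger than expected, reading with `nrows` (or a sample) first keeps the initial exploration fast.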

Feature processing

First look at the data

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	

Looking at the data is important in any competition. Do at least a minimal check: for example, Survived is the objective variable, so it must not be used as a feature, and PassengerId and Name are unique to each row, so they are not used either.
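
The minimal check described above can be sketched as follows, on a toy frame with invented rows that only mimics some of the real column names. A column whose number of distinct values equals the number of rows is unique per row, which is a hint that it will not generalize as a feature.

```python
import numpy as np
import pandas as pd

# Toy data that mimics a few Titanic columns (values are invented)
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# Columns with as many distinct values as rows are unique per row
n_unique = toy_df.nunique()
unique_cols = n_unique[n_unique == len(toy_df)].index.tolist()

# Number of missing values per column
null_counts = toy_df.isna().sum()

print(unique_cols)
print(null_counts)
```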

Splitting into explanatory and objective variables

Split the training data into the explanatory variables and the objective variable.

train_x, train_y = train_df.drop("Survived", axis=1), train_df["Survived"]

Feature processing

Feature processing is kept to a minimum. It covers only the following three points.

- Filling missing values
- Qualitative variables -> quantitative variables (label encoding)
- Deleting unnecessary columns (PassengerId and Name)

def label_encording(data_col):
    '''
    Label encoding
    data_col     : One column of the data frame of interest
    '''
    le = LabelEncoder()
    le = le.fit(data_col)
    # Convert labels to integers
    data_col = le.transform(data_col)

    return data_col

def preprocess(df):
    '''
    Perform preprocessing
    df : pandas.DataFrame
        Target data frame
    '''
    df = df.drop("PassengerId", axis=1)
    df = df.drop("Name", axis=1)
    # Convert qualitative variables to numbers
    for column_name in df:
        if df[column_name].dtype == object:
            # Fill missing values with the string "NULL", then label-encode
            df[column_name] = df[column_name].fillna("NULL")
            df[column_name] = label_encording(df[column_name])
        elif pd.api.types.is_numeric_dtype(df[column_name]):
            # Fill missing values in integer and float columns with -999
            df[column_name] = df[column_name].fillna(-999)
    return df
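
To make the effect of these steps concrete, here is a small sketch on a made-up column (the values are invented, not taken from the competition data). It shows that filling missing values with the string "NULL" turns them into one more category, which `LabelEncoder` then numbers in sorted order along with the others.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up qualitative column with one missing value
embarked = pd.Series(["S", "C", np.nan, "Q"], dtype=object)

filled = embarked.fillna("NULL")        # the missing value becomes its own category
le = LabelEncoder().fit(filled)
codes = le.transform(filled)

print(list(le.classes_))   # categories in sorted order
print(codes.tolist())      # integer code for each row

# Made-up quantitative column: the missing value becomes -999
age = pd.Series([22.0, np.nan, 35.0]).fillna(-999)
print(age.tolist())
```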

When label encoding, the mapping from labels to integers must be the same for the training data and the test data, so both are concatenated and processed together.

all_x = pd.concat([train_x, test_df])
preprocessed_all_x = preprocess(all_x)

# Split the preprocessed data back into training data and test data
preprocessed_train_x, preprocessed_test_x = preprocessed_all_x[:train_x.shape[0]], preprocessed_all_x[train_x.shape[0]:]
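
The reason for processing both sets together can be seen in a small sketch. The label lists below are invented for illustration. An encoder fitted on each set separately numbers only the labels it has seen, so the same label can end up with different integers.

```python
from sklearn.preprocessing import LabelEncoder

train_col = ["S", "C", "Q", "S"]
test_col = ["Q", "S"]          # "C" never appears in the test data

# Fitting separately: each encoder numbers only the labels it has seen
sep_train = LabelEncoder().fit(train_col)   # classes: C, Q, S
sep_test = LabelEncoder().fit(test_col)     # classes: Q, S
code_s_train = int(sep_train.transform(["S"])[0])
code_s_test = int(sep_test.transform(["S"])[0])
print(code_s_train, code_s_test)            # the same label gets different codes

# Fitting once on the concatenated column keeps one shared mapping
joint = LabelEncoder().fit(train_col + test_col)
joint_train = joint.transform(train_col).tolist()
joint_test = joint.transform(test_col).tolist()
print(joint_train, joint_test)
```

An alternative with the same effect would be to fit the encoder on the training data only and reuse it for the test data, but that fails on labels that appear only in the test data.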


Create a class that trains LightGBM. See the official LightGBM documentation for detailed explanations of the parameters.

`objective` and `metric` should be changed according to the training data and the competition.
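
As a rough illustration of what changes between competitions, the sketch below lists typical `objective` and `metric` pairs for three common task types. The names `PARAMS_BY_TASK` and `make_params` are hypothetical helpers, and the pairing of metrics to tasks is only an assumption; in practice the metric should follow the competition's evaluation rule.

```python
# Hypothetical lookup table; the values are standard LightGBM objective/metric names
PARAMS_BY_TASK = {
    "binary": {"objective": "binary", "metric": "auc"},
    "regression": {"objective": "regression", "metric": "rmse"},
    "multiclass": {"objective": "multiclass", "metric": "multi_logloss"},
}


def make_params(task, num_class=None, seed=0):
    """Return a LightGBM parameter dict for the given task type."""
    params = dict(PARAMS_BY_TASK[task])   # copy so the table is not modified
    params["seed"] = seed
    if task == "multiclass":
        # LightGBM needs the number of classes for multiclass objectives
        if num_class is None:
            raise ValueError("multiclass needs num_class")
        params["num_class"] = num_class
    return params


binary_params = make_params("binary")
multi_params = make_params("multiclass", num_class=3)
print(binary_params)
print(multi_params)
```

A dict built this way could be passed as `params` when constructing the model class.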

# LightGBM
import lightgbm as lgb

class lightGBM:
    def __init__(self, params=None):
        self.model = None
        if params is not None:
            self.params = params
        else:
            # Default parameters; 'metric': 'auc' matches the training log shown later
            self.params = {'objective': 'binary',
                           'metric': 'auc',
                           'seed': 0,
                           'boosting_type': 'gbdt',
                           'reg_alpha': 0.0,
                           'reg_lambda': 0.0,
                           }
        self.num_round = 20000
        # Integer division: the early stopping callback expects an int
        self.early_stopping_rounds = self.num_round // 100

    def fit(self, tr_x, tr_y, va_x, va_y):
        self.target_columms = tr_x.columns
        print(tr_x.columns)
        # Convert to LightGBM datasets
        lgb_train = lgb.Dataset(tr_x, tr_y)
        lgb_eval = lgb.Dataset(va_x, va_y, reference=lgb_train)
        # Early stopping is passed as a callback (required in LightGBM 4.x)
        self.model = lgb.train(self.params,
                               lgb_train,
                               num_boost_round=self.num_round,
                               valid_names=['train', 'valid'],
                               valid_sets=[lgb_train, lgb_eval],
                               callbacks=[lgb.early_stopping(self.early_stopping_rounds)],
                               )
        return self.model

    def predict(self, x):
        pred = self.model.predict(x, num_iteration=self.model.best_iteration)
        return pred

    def get_feature_importance(self, target_columms=None):
        '''
        Feature importance output
        '''
        if target_columms is not None:
            self.target_columms = target_columms
        feature_imp = pd.DataFrame(sorted(zip(self.model.feature_importance(), self.target_columms)), columns=['Value','Feature'])
        return feature_imp

Definition of learner

def model_learning(model, x, y):
    '''
    Train the model.
    '''
    # Use the y passed in, not the global train_y
    tr_x, va_x, tr_y, va_y = train_test_split(x, y, test_size=0.2, random_state=0)
    return model.fit(tr_x, tr_y, va_x, va_y)
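
For reference, the hold-out split used by the learner can be checked on its own. The sketch below uses random stand-in arrays with the same number of rows as the training data (the values are not the real features) to show how many rows go to training and validation with `test_size=0.2`, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(891, 3))          # stand-in features
y_demo = rng.integers(0, 2, size=891)       # stand-in binary labels

tr_x, va_x, tr_y, va_y = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)
print(len(tr_x), len(va_x))

# The same random_state gives the same split on a second call
tr_x2, va_x2, _, _ = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)
same_split = np.array_equal(va_x, va_x2)
print(same_split)
```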

Defining the model as a class and passing it to the learner keeps source changes to a minimum when switching to a different model.

For example, to use XGBoost, you can swap the model being trained right away by writing a class with the same methods, as follows.

class XGBoost:
    def __init__(self, params=None):
        # Initialization process~~~
        pass

    def fit(self, tr_x, tr_y, va_x, va_y):
        # Training process~~~
        pass

    def predict(self, x):
        # Prediction process~~~
        pass

xgboost_model = XGBoost()
model_learning(xgboost_model, preprocessed_train_x, train_y)
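
The idea is that the learner only relies on `fit(tr_x, tr_y, va_x, va_y)` and `predict(x)`, so any class with those two methods can be plugged in. The sketch below illustrates this with a deliberately trivial stand-in model. `MeanRateModel` and `train_with_holdout` are hypothetical names, and the learner here is a simplified version that holds out the last rows instead of splitting randomly.

```python
import numpy as np


class MeanRateModel:
    """Hypothetical stand-in model: always predicts the training positive rate."""

    def __init__(self, params=None):
        self.rate = None

    def fit(self, tr_x, tr_y, va_x, va_y):
        # Validation data is accepted for interface compatibility but unused
        self.rate = float(np.mean(tr_y))
        return self

    def predict(self, x):
        return np.full(len(x), self.rate)


def train_with_holdout(model, x, y, valid_ratio=0.2):
    """Simplified learner: the last rows are held out for validation."""
    n_valid = int(len(x) * valid_ratio)
    tr_x, va_x = x[:-n_valid], x[-n_valid:]
    tr_y, va_y = y[:-n_valid], y[-n_valid:]
    return model.fit(tr_x, tr_y, va_x, va_y)


x = np.arange(20).reshape(10, 2)                 # made-up features
y = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 0])     # made-up labels

baseline = MeanRateModel()
train_with_holdout(baseline, x, y)
pred = baseline.predict(x)
print(baseline.rate, pred.shape)
```

A trivial model like this can also serve as a sanity-check baseline before training a real one.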


lightgbm_model = lightGBM()
model_learning(lightgbm_model, preprocessed_train_x, train_y)
Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin',
Training until validation scores don't improve for 200.0 rounds
Early stopping, best iteration is:
[172]	train's auc: 0.945026	valid's auc: 0.915613

Training is complete! It finished quickly.

Evaluation of the importance of features

With LightGBM, you can check which features the trained model used most often. This gives you hints for EDA in the next step. `Age`, `Ticket`, and `Fare` are at the top, so age and seat position seem to be important, and a natural next step is to look at things such as the correlation between `Age` and `Survived`.

  Value	Feature
0	32	Parch
1	58	SibSp
2	158	Embarked
3	165	Cabin
4	172	Sex
5	206	Pclass
6	1218	Fare
7	1261	Ticket
8	1398	Age
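
To put the table above in proportion, the sketch below rebuilds it as a data frame (values copied from the table) and computes each feature's share of the total. With default arguments, LightGBM's `feature_importance()` counts how many times a feature is used in a split, so the share is a share of splits rather than a measure of predictive gain.

```python
import pandas as pd

# Values copied from the importance table above
feature_imp = pd.DataFrame({
    "Value": [32, 58, 158, 165, 172, 206, 1218, 1261, 1398],
    "Feature": ["Parch", "SibSp", "Embarked", "Cabin", "Sex",
                "Pclass", "Fare", "Ticket", "Age"],
})

# Rank from most to least used and add each feature's share of all splits
ranked = feature_imp.sort_values("Value", ascending=False).reset_index(drop=True)
ranked["Share"] = ranked["Value"] / ranked["Value"].sum()

top3 = ranked.head(3)
top3_share = top3["Share"].sum()
print(top3[["Feature", "Value", "Share"]])
print(round(top3_share, 3))
```

High-cardinality columns such as a label-encoded `Ticket` tend to be split on often, so a high split count alone is a hint for EDA rather than proof of importance.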

Evaluation & submission file creation

Run the model on the test data. The model outputs probabilities, but the submission must contain either 0 or 1, so convert them accordingly.

# Predict on the test data
proba_ = lightgbm_model.predict(preprocessed_test_x)
proba = list(map(lambda x: 0 if x < 0.5 else 1, proba_))
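
The same thresholding can also be written as a vectorized NumPy expression. The sketch below uses made-up probabilities and checks that both forms give the same labels; the 0.5 cut-off is the one used above.

```python
import numpy as np

proba_example = np.array([0.12, 0.5, 0.73, 0.49])   # made-up probabilities

# Element-by-element version, as in the template
labels_map = list(map(lambda x: 0 if x < 0.5 else 1, proba_example))

# Vectorized version: True/False compared against the threshold, cast to 0/1
labels_np = (proba_example >= 0.5).astype(int)

print(labels_map, labels_np.tolist())
```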

Format the predictions to match the submission format. This is the easiest place to get stuck...

# Create the submission data
submit_df = pd.DataFrame({"Survived": proba})
submit_df.index.name = "PassengerId"
submit_df.index = submit_df.index + len(train_df) + 1
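
Since the index arithmetic is easy to get wrong, it can help to check the resulting CSV lines before submitting. The sketch below uses placeholder predictions and the row counts printed earlier (891 training rows, 418 test rows). It assumes, as the template does, that the test IDs continue directly after the training IDs; if that does not hold, taking the IDs from the test data's `PassengerId` column would be safer.

```python
import io

import pandas as pd

n_train, n_test = 891, 418          # row counts printed earlier
labels = [0] * n_test               # placeholder predictions

submit_example = pd.DataFrame({"Survived": labels})
submit_example.index = submit_example.index + n_train + 1   # IDs start at 892
submit_example.index.name = "PassengerId"

# Write to an in-memory buffer instead of a file to inspect the output
buffer = io.StringIO()
submit_example.to_csv(buffer, index=True)
lines = buffer.getvalue().splitlines()

print(lines[0])     # header row
print(lines[1])     # first data row
print(lines[-1])    # last data row
```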

Save the file with a name in the submit_{%Y-%m-%d-%H%M%S} format. This prevents accidental overwriting and saves you from having to think up a file name every time, which is convenient.

save_folder = "results"
if not os.path.exists(save_folder):
    # Create the output folder if it does not exist yet
    os.makedirs(save_folder)

submit_df.to_csv("{}/submit_{}.csv".format(save_folder, datetime.now().strftime("%Y-%m-%d-%H%M%S")),index=True)
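
The naming scheme can also be wrapped in a small helper so that it is reusable across competitions. This is a minimal sketch; `timestamped_path` is a hypothetical helper name, and the example writes into a temporary directory with a fixed timestamp so that the result is predictable.

```python
import os
import tempfile
from datetime import datetime


def timestamped_path(folder, prefix="submit", ext="csv", now=None):
    """Build '<folder>/<prefix>_<YYYY-mm-dd-HHMMSS>.<ext>', creating the folder."""
    os.makedirs(folder, exist_ok=True)      # no error if the folder already exists
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d-%H%M%S")
    return os.path.join(folder, "{}_{}.{}".format(prefix, stamp, ext))


base = tempfile.mkdtemp()
results_dir = os.path.join(base, "results")
path = timestamped_path(results_dir, now=datetime(2020, 8, 25, 12, 34, 56))
file_name = os.path.basename(path)
print(file_name)
```

Calling it without `now` uses the current time, which matches the behavior of the code above.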

In closing

When I submitted this result, the Public Score was 0.77033, which ranked 6610th out of 20114 participants (as of 08/25/2020). I think it's not a bad template for getting a quick feel for the difficulty and character of a competition by simply running it first.

I always feel that my EDA is too shallow, so I intend to do EDA more thoroughly from now on.

