[PYTHON] [For Kaggle beginners] Titanic (LightGBM)

■ Introduction

This time, I worked on the following competition with LightGBM and have summarized the approach briefly.

【Overview】
・ Titanic: Machine Learning from Disaster
・ Based on passenger information from the sunken ship "Titanic", predict which passengers survived and which did not.

【Target readers】
・ Kaggle beginners
・ Those who want to learn the basic code for LightGBM

## 1. Preparing the modules


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

## 2. Data preparation

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape)
print(test.shape)

# (891, 12)
# (418, 11)

train.head()

【Data items】
・ PassengerId: Passenger ID
・ Survived: Whether the passenger survived (0: did not survive, 1: survived)
・ Pclass: Ticket class (1: upper class, 2: middle class, 3: lower class)
・ Name: Passenger's name
・ Sex: Gender
・ Age: Age
・ SibSp: Number of siblings / spouses on board
・ Parch: Number of parents / children on board
・ Ticket: Ticket number
・ Fare: Ticket fare
・ Cabin: Cabin number
・ Embarked: Port of embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)


test.head()

Save the passenger IDs (PassengerId) of the test data; they are needed later for the submission file.


PassengerId = test['PassengerId']

The model is built from the train data only, but I want to preprocess the train and test data together, so I will combine them.

The train data has one extra column (the objective variable, Survived), so split it off first.


y = train['Survived']
train = train[[col for col in train.columns if col != 'Survived']]

print(train.shape)
print(test.shape)

# (891, 11)
# (418, 11)

Now that the number of items (features) in the train data and test data is the same, combine them.


X_total = pd.concat([train, test], axis=0)

print(X_total.shape)
X_total.head()

# (1309, 11)
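
The combine-then-split pattern can be sketched on toy data. The frames and values below are made up for illustration; only `pd.concat` and the slicing by row count mirror the article. Note that `pd.concat` keeps each frame's original row labels, so splitting the data back apart relies on row position, not on the index.

```python
import pandas as pd

# Toy stand-ins for train/test (hypothetical values, not the Titanic data).
toy_train = pd.DataFrame({"Age": [22.0, 38.0, None], "Sex": ["male", "female", "female"]})
toy_test = pd.DataFrame({"Age": [34.5, None], "Sex": ["male", "female"]})

# Stack the rows; the original row labels (0, 1, 2, 0, 1) are kept as-is.
toy_total = pd.concat([toy_train, toy_test], axis=0)

# Split back by position: the first n_train rows came from toy_train.
n_train = toy_train.shape[0]
toy_train_part = toy_total.iloc[:n_train]
toy_test_part = toy_total.iloc[n_train:]

print(toy_total.shape)
print(list(toy_total.index))
print(toy_train_part.shape, toy_test_part.shape)
```

Passing `ignore_index=True` to `pd.concat` would renumber the rows instead, which is optional when you only slice by position as done here.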


## 3. Preprocessing

First, check how many missing values there are.


print(X_total.isnull().sum())

'''
PassengerId       0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64
'''

LightGBM can build a model from string columns without converting them to numbers by hand. So here the missing values are simply filled with a placeholder (-999), and no numeric encoding is performed.


X_total.fillna(value=-999, inplace=True)

print(X_total.isnull().sum())

'''
PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
'''

Now, check which columns hold string data (hereafter called categorical columns).


categorical_col = [col for col in X_total.columns if X_total[col].dtype == 'object']
print('categorical_col:', categorical_col)

# categorical_col: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

Examine the data type of each categorical column.


for i in X_total[categorical_col]:
    print('{}: {}'.format(i, X_total[i].dtype))

'''
Name: object
Sex: object
Ticket: object
Cabin: object
Embarked: object
'''

LightGBM can handle these string columns, but they must be of the category type rather than the object type, so convert the data type.


for i in categorical_col:
    X_total[i] = X_total[i].astype("category")
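
To see what this conversion does, here is a minimal sketch on a toy column (values chosen for illustration). A pandas category column stores each value as an integer code that points into a list of categories. As far as I understand, LightGBM works with these codes internally when it is given a category-type column, which is why no manual label encoding is needed.

```python
import pandas as pd

# Toy Embarked-like column (hypothetical values).
s = pd.Series(["S", "C", "Q", "S"]).astype("category")

# The distinct values, sorted, become the categories.
print(list(s.cat.categories))  # ['C', 'Q', 'S']

# Each row is stored as an integer code indexing into the categories.
print(s.cat.codes.tolist())    # [2, 0, 1, 2]
```

Because the codes depend on the full set of categories, converting train and test together, as done here, keeps the same value mapped to the same code in both.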

Let's look at the data type of the total data (X_total).


for i in X_total.columns:
    print('{}: {}'.format(i, X_total[i].dtype))

'''
PassengerId: int64
Pclass: int64
Name: category
Sex: category
Age: float64
SibSp: int64
Parch: int64
Ticket: category
Fare: float64
Cabin: category
Embarked: category
'''

## 4. Creating a model

The model is built from the train data only. Since X_total also contains the test data, extract only the rows that are needed.

train_rows = train.shape[0]
X = X_total[:train_rows]

print(X.shape)
print(y.shape)

# (891, 11)
# (891,)

Now that the features and the objective variable for the train data are both available, split them further into training data and test data (a hold-out set taken from train.csv) to build and check the model.


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# (623, 11)
# (623,)
# (268, 11)
# (268,)
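
The shapes above follow from simple arithmetic. As a quick check, assuming scikit-learn rounds the test share up to a whole number of rows and gives the remainder to training (which matches the output above):

```python
import math

n_samples = 891   # rows in the train data
test_size = 0.3   # fraction passed to train_test_split

# Test rows are rounded up; the rest go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 623 268
```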

Set the parameters in a dictionary and pass them to LGBMClassifier() as keyword arguments.


params = {
"random_state": 42
}

cls = lgb.LGBMClassifier(**params)
cls.fit(X_train, y_train, categorical_feature = categorical_col)

'''
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
'''
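
The printed estimator shows the default values that LightGBM filled in. If you want to tune the model later, the same dictionary can carry more keys. The sketch below simply spells out a few of the defaults shown above, so it should behave the same as the one-key version. The name `params_explicit` is only for illustration.

```python
# A few of the defaults from the printed estimator, written out explicitly.
params_explicit = {
    "boosting_type": "gbdt",
    "learning_rate": 0.1,
    "n_estimators": 100,
    "num_leaves": 31,
    "max_depth": -1,      # -1 means no depth limit
    "random_state": 42,
}

# It would be passed the same way as before:
# cls = lgb.LGBMClassifier(**params_explicit)
print(sorted(params_explicit))
```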

Next, find the predicted value.

Taking column [:, 1] of predict_proba gives y_proba, the predicted probability of class 1 (Survived = 1). y_pred is the predicted class: 1 if that probability is greater than 0.5, and 0 otherwise.


y_proba = cls.predict_proba(X_test)[: , 1]
print(y_proba[:5])

y_pred = cls.predict(X_test)
print(y_pred[:5])

# [0.38007409 0.00666063 0.04531554 0.95244042 0.35233708]
# [0 0 0 1 0]
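
For a two-class problem, the relationship between the two outputs can be checked by hand. A minimal sketch using the five probabilities printed above, assuming a cut-off of 0.5:

```python
import numpy as np

# First five predicted probabilities of Survived = 1, copied from the output above.
y_proba_head = np.array([0.38007409, 0.00666063, 0.04531554, 0.95244042, 0.35233708])

# Label 1 where the probability exceeds 0.5, otherwise 0.
y_pred_head = (y_proba_head > 0.5).astype(int)

print(y_pred_head)  # [0 0 0 1 0]
```

Replacing 0.5 with another value is one way to trade precision against recall.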

## 5. Performance evaluation

Evaluate using the ROC curve and AUC.

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
plt.plot(fpr, tpr, label='AUC = %.3f' % (auc_score))
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)

print('accuracy:',accuracy_score(y_test, y_pred))
print('f1_score:',f1_score(y_test, y_pred))

# accuracy: 0.8208955223880597
# f1_score: 0.7446808510638298
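
As a side note on what AUC measures: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The toy labels and scores below are made up for illustration, and the pairwise count is a sketch of that definition, not how scikit-learn computes it.

```python
import numpy as np

# Toy labels and scores (hypothetical, not from the Titanic model).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive with every negative; ties count as half.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
auc_by_pairs = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(auc_by_pairs)  # 0.75
```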


We will also evaluate using the confusion matrix.


classes = [1, 0]
cm = confusion_matrix(y_test, y_pred, labels=classes)

cmdf = pd.DataFrame(cm, index=classes, columns=classes)

sns.heatmap(cmdf, annot=True, fmt='d')  # fmt='d' prints the counts as plain integers
print(classification_report(y_test, y_pred))

'''
              precision    recall  f1-score   support

           0       0.83      0.89      0.86       168
           1       0.80      0.70      0.74       100

    accuracy                           0.82       268
   macro avg       0.81      0.80      0.80       268
weighted avg       0.82      0.82      0.82       268

'''
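
The accuracy and F1 values can be reproduced from the four cells of the confusion matrix. The counts below are not printed in the article. I inferred them from the support, recall and accuracy in the report above, so treat them as an assumption.

```python
# Counts for the 268 hold-out rows, inferred from the report (an assumption).
tp, fn = 70, 30    # actual 1: predicted 1 / predicted 0
fp, tn = 18, 150   # actual 0: predicted 1 / predicted 0

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 4))
# 0.8209 0.8 0.7 0.7447
```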

## 6. Submit

Now that a model has been created and evaluated using the train data, feed it the test data and obtain the predicted values.

First, extract the part corresponding to the test data from the total data (X_total).

X_submit = X_total[train_rows:]

print(X_train.shape)
print(X_submit.shape)

# (623, 11)
# (418, 11)

X_submit has the same number of features (11) as the X_train used to build the model. Pass X_submit to the model to get the predicted values.


y_proba_submit = cls.predict_proba(X_submit)[: , 1]
print(y_proba_submit[:5])

y_pred_submit = cls.predict(X_submit)
print(y_pred_submit[:5])

# [0.00948223 0.02473048 0.01005387 0.50935871 0.45433965]
# [0 0 0 1 0]

Prepare the CSV data to submit to Kaggle.

First, create a data frame with the necessary information.


df_submit = pd.DataFrame(y_pred_submit, index=PassengerId, columns=['Survived'])
df_submit.head()

Then write it out as CSV data.


df_submit.to_csv('titanic_lgb_submit.csv')
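
Before uploading, it can help to confirm the file layout. The sketch below uses three made-up predictions and writes to an in-memory buffer instead of a file. It shows that a Series passed as `index` carries its name into the CSV header, so the file starts with a `PassengerId,Survived` line.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical values) for PassengerId and y_pred_submit.
toy_ids = pd.Series([892, 893, 894], name="PassengerId")
toy_pred = np.array([0, 0, 1])

toy_submit = pd.DataFrame(toy_pred, index=toy_ids, columns=["Survived"])

# Write to a buffer; the index is written as the first column.
buf = io.StringIO()
toy_submit.to_csv(buf)
lines = buf.getvalue().splitlines()

print(lines[0])  # PassengerId,Survived
print(lines[1])  # 892,0
```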

That completes the submission.

■ Finally

This time, I put together an article for Kaggle beginners. I hope it helps, even a little.

Thank you for reading.
