[PYTHON] [Kaggle for super beginners] Titanic (Logistic regression)

■ Introduction

I worked on a Kaggle competition for beginners and have summarized it briefly here.

【Overview】
・ Titanic: Machine Learning from Disaster
・ Based on passenger information from the sunken ship "Titanic", predict which passengers survived and which did not.

This time, we will create a model using logistic regression.

## 1. Preparing the modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_selection import RFE

%matplotlib inline

## 2. Data preparation

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


print(train.shape)
print(test.shape)
# (891, 12)
# (418, 11)


【Data items】
・ PassengerId: Passenger ID
・ Survived: Whether the passenger survived (0: did not survive, 1: survived)
・ Pclass: Ticket class (1: upper class, 2: middle class, 3: lower class)
・ Name: Passenger's name
・ Sex: Gender
・ Age: Age
・ SibSp: Number of siblings / spouses aboard
・ Parch: Number of parents / children aboard
・ Ticket: Ticket number
・ Fare: Fare
・ Cabin: Cabin number
・ Embarked: Port of embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)


Save the passenger IDs (PassengerId) of the test data, because they are needed for the submission file.

PassengerId = test['PassengerId']

The model is trained only on the train data, but the test data must have the same features when it is fed to the model.

If preprocessing such as One-Hot-Encoding is applied to each dataset separately, the train and test data can end up with different numbers of features, so the two datasets are combined and preprocessed together.
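To make the column mismatch concrete, here is a minimal sketch on toy data. The frames `train_toy` and `test_toy` and their cabin values are made up for illustration and are not taken from the competition files.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical values).
train_toy = pd.DataFrame({"Cabin": ["A1", "B2", "A1"]})
test_toy = pd.DataFrame({"Cabin": ["B2", "C3"]})

# Encoding each frame separately yields different column sets.
separate_train = pd.get_dummies(train_toy)
separate_test = pd.get_dummies(test_toy)
print(list(separate_train.columns))  # ['Cabin_A1', 'Cabin_B2']
print(list(separate_test.columns))   # ['Cabin_B2', 'Cabin_C3']

# Encoding the combined frame yields one shared column set,
# which can then be split back by row position.
combined = pd.get_dummies(pd.concat([train_toy, test_toy], axis=0))
train_enc = combined[: len(train_toy)]
test_enc = combined[len(train_toy):]
print(list(combined.columns))  # ['Cabin_A1', 'Cabin_B2', 'Cabin_C3']
print(list(train_enc.columns) == list(test_enc.columns))  # True
```

A model fitted on `separate_train` could not accept `separate_test`, because the two frames disagree on which columns exist.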

First, the train data has one extra column (the target variable Survived), so separate it out.

y = train['Survived']
train = train[[col for col in train.columns if col != 'Survived']]


print(train.shape)
print(test.shape)
# (891, 11)
# (418, 11)

Now that the number of items (features) in the train data and test data is the same, combine them.

X = pd.concat([train, test], axis=0)


print(X.shape)
# (1309, 11)


3-1. Preprocessing (whole)

First, check how many missing values each column has.

X.isnull().sum()


PassengerId       0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

A model cannot be built from string data as it is, so we will convert it to numerical values as we go.

For gender, convert to "male: 0, female: 1".

def code_transform(x):
    # Encode gender as a number: male -> 0, anything else (female) -> 1
    if x == 'male':
        y = 0
    else:
        y = 1
    return y

X['Sex'] = X['Sex'].apply(lambda x: code_transform(x))

Convert the port of embarkation to "C: 0, Q: 1, S: 2".

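def code_transform(x):
    # Encode the port: C -> 0, Q -> 1, anything else -> 2
    if x == 'C':
        y = 0
    elif x == 'Q':
        y = 1
    else:
        # 'S', and also the missing values, end up here
        y = 2
    return y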

X['Embarked'] = X['Embarked'].apply(lambda x: code_transform(x))
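As an alternative to an if/elif function, the same encoding can be written as a dictionary lookup with `Series.map`. The sketch below uses a toy series, not the real column. One difference worth noticing is that `map` leaves missing ports as NaN, so how they are handled has to be stated explicitly. Filling them with 2 (Southampton) here is only an assumption for the example.

```python
import numpy as np
import pandas as pd

# Toy port column with one missing value (hypothetical data).
embarked = pd.Series(["S", "C", "Q", np.nan])

# Dictionary lookup: values without a key (including NaN) stay NaN.
codes = embarked.map({"C": 0, "Q": 1, "S": 2})
print(codes.tolist())  # [2.0, 0.0, 1.0, nan]

# Filling afterwards makes the treatment of missing ports explicit.
codes_filled = codes.fillna(2).astype(int)
print(codes_filled.tolist())  # [2, 0, 1, 2]
```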

Now let's check which columns contain numbers and which contain strings.

numerical_col = [col for col in X.columns if X[col].dtype != 'object']
categorical_col = [col for col in X.columns if X[col].dtype == 'object']


print(numerical_col)
print(categorical_col)
# ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
# ['Name', 'Ticket', 'Cabin']

Separate the numeric column and the string column because we want to perform separate preprocessing.

X_num = X[numerical_col]
X_cat = X[categorical_col]


print(X_num.shape)
print(X_cat.shape)
# (1309, 8)
# (1309, 3)

3-2. Preprocessing (numerical column)

Check the contents of the data.


Fill in the missing values with the median of each column.

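# Assign the result instead of using inplace=True, which avoids pandas'
# SettingWithCopyWarning on a frame that was sliced out of X
X_num = X_num.fillna(X_num.median())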
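Here is a small sketch of what median imputation does column by column. The ages and fares are toy values chosen so the medians are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Toy numeric frame with one gap per column (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# DataFrame.median() skips NaN and returns one value per column.
medians = toy.median()
print(medians.to_dict())  # {'Age': 26.0, 'Fare': 8.05}

# fillna with a Series fills each column with its own median.
filled = toy.fillna(medians)
print(int(filled.isnull().sum().sum()))  # 0
```

Because each column gets its own median, the large fare in one row does not affect the value used to fill the missing age.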

Check that no missing values remain.

X_num.isnull().sum()


PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

3-3. Preprocessing (character string column)

Check the contents of the data.


Fill every missing value with the same string, 'missing'.

# Assign the result instead of using inplace=True on a sliced frame
X_cat = X_cat.fillna(value='missing')

Check that no missing values remain.

X_cat.isnull().sum()


Name      0
Ticket    0
Cabin     0
dtype: int64

Apply One-Hot-Encoding to convert all strings to numbers.

X_cat = pd.get_dummies(X_cat)


print(X_cat.shape)
# (1309, 2422)
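The jump from 3 string columns to 2422 encoded columns happens because One-Hot-Encoding creates one column per distinct value, and columns such as Name are almost entirely unique. A toy sketch of the effect, with made-up names and cabins:

```python
import pandas as pd

# Toy string columns (hypothetical values): Name is all unique,
# Cabin has a repeated 'missing' placeholder.
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
    "Cabin": ["missing", "C85", "missing", "C123"],
})

encoded = pd.get_dummies(toy)

# One output column per distinct value: 4 names + 3 cabins.
n_unique = int(toy.nunique().sum())
print(n_unique)       # 7
print(encoded.shape)  # (4, 7)
```

With this approach the number of features ends up larger than the number of rows, which is worth keeping in mind when reading the evaluation results later.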


3-4. Preprocessing (whole)

X_num and X_cat now have no missing values and contain only numerical data, so combine them back into a single dataset.

X_total = pd.concat([X_num, X_cat], axis=1)


print(X_total.shape)
# (1309, 2431)

The model should be built using only the train data, but X_total also contains the test data, so extract only the rows that are needed.

train_rows = train.shape[0]
X = X_total[:train_rows]

std = StandardScaler()
X = std.fit_transform(X)


print(X.shape)
print(y.shape)
# (891, 2431)
# (891,)

## 4. Creating a model

Now that the features and the target variable for the train data are ready, we split them further into training data and test data and build a model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)


print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
# (623, 2431)
# (623,)
# (268, 2431)
# (268,)
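The 623 / 268 split follows directly from `test_size=0.3` applied to 891 rows: the test size is rounded up (0.3 × 891 = 267.3, so 268) and the rest goes to training. A quick sketch with placeholder arrays of the same length (the zeros are dummy data, only the shapes matter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays with the same number of rows as the train data.
X_toy = np.zeros((891, 3))
y_toy = np.zeros(891, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)
print(X_tr.shape, X_te.shape)  # (623, 3) (268, 3)
```

Fixing `random_state` makes the same rows land in each part on every run, so the scores can be reproduced.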

logreg = LogisticRegression(class_weight='balanced')
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,

Next, find the predicted value.

Indexing the output of predict_proba with [:, 1] gives the predicted probability of class 1 (Survived = 1) for each row. y_pred is 1 if that probability is greater than 0.5 and 0 otherwise.

y_proba = logreg.predict_proba(X_test)[:, 1]

y_pred = logreg.predict(X_test)

print(y_proba[:5])
print(y_pred[:5])
# [0.90784721 0.09948558 0.36329043 0.18493678 0.43881127]
# [1 0 0 0 0]
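The relationship between the probabilities and the predicted labels can be checked directly. The sketch below fits a logistic regression on synthetic data (generated here only for illustration) and compares `predict` with a manual 0.5 threshold on the class 1 probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data; the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in model.classes_.
proba = model.predict_proba(X_toy)
print(model.classes_)  # [0 1]

p1 = proba[:, 1]                  # probability of class 1
pred = model.predict(X_toy)       # labels chosen by the model
manual = (p1 > 0.5).astype(int)   # labels from thresholding by hand

print(np.array_equal(pred, manual))
```

Keeping the probabilities around is useful because a threshold other than 0.5 can be tried without refitting the model.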

## 5. Performance evaluation

Evaluate the model using the ROC curve and AUC.

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
plt.plot(fpr, tpr, label='AUC = %.3f' % (auc_score))
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()  # show the AUC label that was set in plt.plot
plt.show()

print('accuracy:',accuracy_score(y_test, y_pred))
print('f1_score:',f1_score(y_test, y_pred))

# accuracy: 0.7723880597014925
# f1_score: 0.6013071895424837
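To see what these three numbers measure, here is a hand-checkable sketch on four toy samples (the labels and scores are made up). AUC is computed from the probabilities, while accuracy and F1 are computed from the labels obtained after thresholding at 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1
y_hat = (y_score > 0.5).astype(int)        # [0, 0, 0, 1]

# AUC: share of (positive, negative) pairs where the positive scores higher.
# Here 3 of the 4 pairs are ordered correctly.
auc = roc_auc_score(y_true, y_score)

# Accuracy: 3 of 4 labels are correct.
acc = accuracy_score(y_true, y_hat)

# F1 for class 1: precision 1/1, recall 1/2, harmonic mean 2/3.
f1 = f1_score(y_true, y_hat)

print(f"{auc:.3f} {acc:.3f} {f1:.3f}")  # 0.750 0.750 0.667
```

In this toy case the F1 score is lower than the accuracy because only half of the true positives are found, a gap that also appears in the results above.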

We will also evaluate using the confusion matrix.

classes = [1, 0]
cm = confusion_matrix(y_test, y_pred, labels=classes)

cmdf = pd.DataFrame(cm, index=classes, columns=classes)

sns.heatmap(cmdf, annot=True)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.76      0.95      0.84       170
           1       0.84      0.47      0.60        98

    accuracy                           0.77       268
   macro avg       0.80      0.71      0.72       268
weighted avg       0.79      0.77      0.75       268
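The heatmap itself is not reproduced here, but the four cell counts can be worked out from the reported scores and supports. The counts below (46, 52, 9, 161) are my own reconstruction, not values printed in the original output. The sketch rebuilds label arrays from them and confirms that they give the same accuracy and F1 score as above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Reconstructed counts (an inference from the report, not original output):
# survivors predicted as survivors / as non-survivors, and
# non-survivors predicted as survivors / as non-survivors.
tp, fn, fp, tn = 46, 52, 9, 161

y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_hat = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# labels=[1, 0] puts class 1 in the first row and column, as in the article.
cm = confusion_matrix(y_true, y_hat, labels=[1, 0])
print(cm.tolist())  # [[46, 52], [9, 161]]

acc = accuracy_score(y_true, y_hat)  # (46 + 161) / 268
f1 = f1_score(y_true, y_hat)         # 2*46 / (2*46 + 9 + 52)
print(f"{acc:.4f} {f1:.4f}")  # 0.7724 0.6013
```

Read this way, the model rarely flags a non-survivor as a survivor (9 cases) but misses more than half of the actual survivors (52 of 98), which is what the recall of 0.47 for class 1 expresses.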


## 6. Submit

Now that a model has been created and evaluated with the train data, feed it the test data from Kaggle and get the predicted values.

First, extract the part corresponding to the test data from the total data (X_total).

X_submit = X_total[train_rows:]
# Reuse the scaler fitted on the train data instead of refitting it on the test data
X_submit = std.transform(X_submit)


print(X_train.shape)
print(X_submit.shape)
# (623, 2431)
# (418, 2431)
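Scaling the submission rows can be done in two ways, and they do not give the same numbers. The toy sketch below (single column, made-up values) compares reusing the statistics learned from the training rows with re-estimating them from the submission rows. Which one a pipeline should use is a design decision. The common convention, as far as I know, is to reuse the fitted scaler so the model sees inputs on the scale it was trained on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_part = np.array([[0.0], [10.0], [20.0]])  # mean 10
submit_part = np.array([[10.0], [30.0]])        # mean 20

# Option A: learn mean/std from the training rows, then reuse them.
scaler = StandardScaler().fit(train_part)
reused = scaler.transform(submit_part)

# Option B: learn new mean/std from the submission rows themselves.
refit = StandardScaler().fit_transform(submit_part)

print(reused.ravel().round(3).tolist())  # [0.0, 2.449]
print(refit.ravel().round(3).tolist())   # [-1.0, 1.0]
```

The raw value 10.0 becomes 0.0 under option A and -1.0 under option B, so the same passenger would be presented to the model differently depending on the choice.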

X_submit has the same number of features (2431) as the X_train used to build the model. Feed X_submit into the model to get the predicted values.

y_proba_submit = logreg.predict_proba(X_submit)[:, 1]

y_pred_submit = logreg.predict(X_submit)

print(y_proba_submit[:5])
print(y_pred_submit[:5])
# [0.02342065 0.18232356 0.06760457 0.06219097 0.76277487]
# [0 0 0 0 1]

Prepare the CSV data to submit to Kaggle.

First, create a data frame with the necessary information.

df_submit = pd.DataFrame(y_pred_submit, index=PassengerId, columns=['Survived'])

Then convert it to CSV data.
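The export step can be sketched with `DataFrame.to_csv`. The example below uses three toy rows and writes to an in-memory buffer so that it runs anywhere. In practice a file path would be passed instead, and the name `submission.csv` in the comment is only a placeholder I chose.

```python
import io

import pandas as pd

# Toy predictions indexed by passenger ID (hypothetical values).
passenger_id = pd.Index([892, 893, 894], name="PassengerId")
df_toy = pd.DataFrame([0, 1, 0], index=passenger_id, columns=["Survived"])

# Write to an in-memory buffer; with a real file this would be
# df_toy.to_csv("submission.csv")  # placeholder file name
buffer = io.StringIO()
df_toy.to_csv(buffer)
csv_text = buffer.getvalue()

print(csv_text.splitlines()[0])  # PassengerId,Survived
print(csv_text.splitlines()[1])  # 892,0
```

Because the passenger IDs are stored in the index, the index is written as the first column, giving the two-column PassengerId / Survived layout.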


That completes the submission.

■ Finally

This article was written for Kaggle beginners. I hope it helped you, even a little.

Thank you for reading.
