[PYTHON] [Kaggle for super beginners] Titanic (Logistic regression)

■ Introduction

I worked on a Kaggle competition for beginners and have tried to summarize it briefly.

【Overview】
・ Titanic: Machine Learning from Disaster
・ Based on the passenger information of the sunken ship "Titanic", predict which passengers survived and which did not.

This time, we will create a model using logistic regression.

## 1. Preparing the modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_selection import RFE

%matplotlib inline

## 2. Data preparation

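# Load the Kaggle CSV files (the paths are an assumption; adjust them to your environment)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print(train.shape)
print(test.shape)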

# (891, 12)
# (418, 11)

【Data items】
・ PassengerId: Passenger ID
・ Survived: Whether the passenger survived (0: did not survive, 1: survived)
・ Pclass: Ticket class (1: Upper class, 2: Middle class, 3: Lower class)
・ Name: Passenger's name
・ Sex: Gender
・ Age: Age
・ SibSp: Number of siblings / spouses aboard
・ Parch: Number of parents / children aboard
・ Ticket: Ticket number
・ Fare: Passenger fare
・ Cabin: Cabin number
・ Embarked: Port of embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)

Save the passenger IDs (PassengerId) of the test data; they are needed later for the submission file.

PassengerId = test['PassengerId']

The model is trained only on the train data, but the test data must have the same features when it is fed to the model.

If preprocessing such as one-hot encoding is applied to the train and test data separately, the number of features can end up differing between them, so the two datasets are combined and preprocessed together.

First, the train data has one extra item (the objective variable, Survived), so separate it out.
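To illustrate why the two datasets are combined before encoding, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` are hypothetical and are not the Titanic files; the test frame deliberately lacks one category.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic files): the test frame lacks port 'Q'.
toy_train = pd.DataFrame({'Embarked': ['C', 'Q', 'S']})
toy_test = pd.DataFrame({'Embarked': ['S', 'C']})

# Encoding each frame separately yields different column sets.
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
print(sep_train.shape[1], sep_test.shape[1])  # 3 columns vs 2 columns

# Encoding the combined frame and splitting it afterwards keeps the columns aligned.
combined = pd.get_dummies(pd.concat([toy_train, toy_test], axis=0))
enc_train = combined[:len(toy_train)]
enc_test = combined[len(toy_train):]
print(list(enc_train.columns) == list(enc_test.columns))  # True
```

A model fitted on three columns cannot score a frame with two, which is the mismatch that combining the data avoids.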

y = train['Survived']
train = train[[col for col in train.columns if col != 'Survived']]

print(train.shape)
print(test.shape)

# (891, 11)
# (418, 11)

Now that the number of items (features) in the train data and test data is the same, combine them.

X = pd.concat([train, test], axis=0)

print(X.shape)

# (1309, 11)

## 3-1. Preprocessing (whole)

First, check how many missing values there are.

X.isnull().sum()

'''
PassengerId       0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64
'''

A model cannot be built from string data as-is, so we convert it to numeric values as we go.

For gender, convert to "male: 0, female: 1".

def code_transform(x):
    # Encode gender: 'male' -> 0, anything else ('female') -> 1
    if x == 'male':
        y = 0
    else:
        y = 1
    return y

X['Sex'] = X['Sex'].apply(code_transform)

Convert the port of embarkation to "C: 0, Q: 1, S: 2". Note that the two missing values in Embarked fall into the else branch and are therefore also encoded as 2 (S).

def code_transform(x):
    # Encode the port: 'C' -> 0, 'Q' -> 1, anything else ('S' or missing) -> 2
    if x == 'C':
        y = 0
    elif x == 'Q':
        y = 1
    else:
        y = 2
    return y

X['Embarked'] = X['Embarked'].apply(code_transform)
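As an alternative to the if/else functions, the same encoding can be sketched with a dictionary and `Series.map`. This is only an illustration on a toy column (the values and the name `port_codes` are assumptions, not part of the article); it makes the handling of missing values explicit instead of relying on the else branch.

```python
import pandas as pd

# Toy column (hypothetical values), including one missing entry.
embarked = pd.Series(['C', 'Q', 'S', None])

# Explicit mapping; values not found in the dict (including missing ones) become NaN.
port_codes = {'C': 0, 'Q': 1, 'S': 2}
encoded = embarked.map(port_codes)
n_unmapped = int(encoded.isnull().sum())

# Reproduce the behaviour of the else branch: anything unmapped becomes 2.
encoded = encoded.fillna(2).astype(int)
print(encoded.tolist())  # [0, 1, 2, 2]
```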

Now let's see which columns are numeric and which are strings.

numerical_col = [col for col in X.columns if X[col].dtype != 'object']
categorical_col = [col for col in X.columns if X[col].dtype == 'object']

print(numerical_col)
print(categorical_col)

# ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
# ['Name', 'Ticket', 'Cabin']

Separate the numeric columns from the string columns, because they need different preprocessing.

X_num = X[numerical_col]
X_cat = X[categorical_col]

print(X_num.shape)
print(X_cat.shape)

# (1309, 8)
# (1309, 3)

## 3-2. Preprocessing (numeric columns)

Check the contents of the data.

X_num.head()

Fill in the missing values with the median of each column.

# Reassign instead of using inplace=True, which can raise SettingWithCopyWarning on a slice of X
X_num = X_num.fillna(X_num.median())

Check the status of missing values.

X_num.isnull().sum()

'''
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
'''
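The median fill used above can be seen on a small example. This is a sketch on a toy frame (the values are made up for illustration); each column's missing entries are replaced by that column's own median.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Fare': [7.0, 8.0, np.nan, 100.0],
})

# median() skips NaN by default: Age -> 30.0, Fare -> 8.0
medians = toy.median()

# fillna with a Series fills each column with its matching value.
filled = toy.fillna(medians)
print(filled)
```

The median is less sensitive than the mean to extreme values such as the fare of 100.0 in the toy data.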

## 3-3. Preprocessing (string columns)

Check the contents of the data.

X_cat.head()

Fill every missing value with the string 'missing'.

# Reassign instead of using inplace=True, which can raise SettingWithCopyWarning on a slice of X
X_cat = X_cat.fillna(value='missing')

Check the status of missing values.

X_cat.isnull().sum()

'''
Name      0
Ticket    0
Cabin     0
dtype: int64
'''

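Apply one-hot encoding to convert all the strings to numbers. Each distinct value in Name, Ticket, and Cabin becomes its own column, so the number of columns grows sharply.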

X_cat = pd.get_dummies(X_cat)

print(X_cat.shape)

# (1309, 2423)
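The large column count printed above follows from how `pd.get_dummies` works: it creates one column per distinct value in each string column. A minimal sketch on a toy frame (hypothetical values) shows the relationship.

```python
import pandas as pd

# Toy string columns (hypothetical values): Name is unique per row, Cabin repeats.
toy_cat = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Cabin': ['C85', 'missing', 'missing', 'E46'],
})

dummies = pd.get_dummies(toy_cat)

# One dummy column per distinct value: 4 names + 3 cabins = 7 columns.
expected_cols = int(toy_cat.nunique().sum())
print(dummies.shape, expected_cols)
```

Columns that are nearly unique per passenger, such as Name, therefore contribute almost one column per row.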

## 3-4. Preprocessing (whole)

Both X_num and X_cat now have no missing values and contain only numeric data, so combine them back into a single dataset.

X_total = pd.concat([X_num, X_cat], axis=1)

print(X_total.shape)

# (1309, 2431)

The model should be created using only the train data, but X_total also contains the test data, so extract only the rows that are needed.

train_rows = train.shape[0]
X = X_total[:train_rows]

std = StandardScaler()
X = std.fit_transform(X)

print(X.shape)
print(y.shape)

# (891, 2431)
# (891,)
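The scaling step above learns a mean and standard deviation from the rows it is fitted on. The following is a small sketch on toy numbers (the arrays `train_part` and `new_part` are hypothetical) showing that `fit_transform` learns those statistics and `transform` reuses them on other rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (hypothetical values).
train_part = np.array([[1.0], [2.0], [3.0]])
new_part = np.array([[2.0], [4.0]])

scaler = StandardScaler()

# fit_transform learns mean = 2.0 and the population std sqrt(2/3) from train_part.
train_scaled = scaler.fit_transform(train_part)

# transform applies the already learned mean and std to new rows.
new_scaled = scaler.transform(new_part)

print(scaler.mean_, scaler.scale_)
print(new_scaled.ravel())
```

Because the value 2.0 equals the learned mean, it is mapped to 0 in `new_scaled`, which would not be the case if the scaler were refitted on `new_part`.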

## 4. Creating a model

Now that the features and the objective variable for the train data are ready, split them further into training data and held-out data for evaluation (named X_test and y_test here, not to be confused with Kaggle's test file), and create the model.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

# (623, 2431)
# (623,)
# (268, 2431)
# (268,)
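The 623 / 268 split printed above comes from `test_size=0.3`: scikit-learn rounds 891 × 0.3 = 267.3 up to 268 held-out rows. A minimal sketch with placeholder arrays (the names `X_toy` and `y_toy` are assumptions) reproduces the sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the train data.
X_toy = np.zeros((891, 3))
y_toy = np.arange(891) % 2

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)
print(X_tr.shape, X_val.shape)  # (623, 3) (268, 3)
```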

logreg = LogisticRegression(class_weight='balanced')
logreg.fit(X_train, y_train)

'''
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1, l1_ratio=None,
max_iter=100, multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
'''

Next, compute the predicted values.

Indexing the output of predict_proba with [:, 1] gives the predicted probability of class 1 (Survived = 1). y_pred is 1 when that probability is greater than 0.5 and 0 otherwise.

y_proba = logreg.predict_proba(X_test)[:, 1]
print(y_proba[:5])

y_pred = logreg.predict(X_test)
print(y_pred[:5])

# [0.90784721 0.09948558 0.36329043 0.18493678 0.43881127]
# [1 0 0 0 0]
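The relationship between the two outputs can be checked directly. This is a sketch on synthetic data (the arrays and the name `model` are assumptions, not the article's model): thresholding the class 1 probability at 0.5 gives the same labels as `predict` for a binary logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (hypothetical), labelled by a simple linear rule.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba is the probability of class 1.
proba = model.predict_proba(X_toy)[:, 1]

# Applying the 0.5 threshold by hand reproduces predict().
manual = (proba > 0.5).astype(int)
same = np.array_equal(manual, model.predict(X_toy))
print(same)
```

Keeping the probabilities around is useful because a threshold other than 0.5 can be applied later without refitting the model.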

## 5. Performance evaluation

Evaluate the model using the ROC curve and AUC.

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
plt.plot(fpr, tpr, label='AUC = %.3f' % (auc_score))
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
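To make the AUC value more concrete, here is a small sketch on four toy predictions (hypothetical labels and scores). The AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores (hypothetical values).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Manual check: count correctly ordered (positive, negative) pairs.
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(auc, pairwise)  # 0.75 0.75
```

Three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), which gives 0.75.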

print('accuracy:',accuracy_score(y_test, y_pred))
print('f1_score:',f1_score(y_test, y_pred))

# accuracy: 0.7723880597014925
# f1_score: 0.6013071895424837

We will also evaluate using the confusion matrix.

classes = [1, 0]
cm = confusion_matrix(y_test, y_pred, labels=classes)

cmdf = pd.DataFrame(cm, index=classes, columns=classes)

sns.heatmap(cmdf, annot=True)
print(classification_report(y_test, y_pred))
'''
              precision    recall  f1-score   support

           0       0.76      0.95      0.84       170
           1       0.84      0.47      0.60        98

    accuracy                           0.77       268
   macro avg       0.80      0.71      0.72       268
weighted avg       0.79      0.77      0.75       268
'''
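The numbers in the report can be related to the cells of the confusion matrix. The counts in the sketch below are not printed in the article; they are back-calculated from the support values, accuracy, and F1 score shown above, so treat them as an assumption.

```python
# Counts back-calculated from the report (assumption, not printed in the article).
tp, fn = 46, 52    # actual survivors (support 98)
fp, tn = 9, 161    # actual non-survivors (support 170)

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)      # of those predicted to survive, how many did
recall = tp / (tp + fn)         # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 4))
# 0.7724 0.84 0.47 0.6013
```

Read this way, the low recall for class 1 means that roughly half of the actual survivors in the held-out data were predicted as not surviving.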

## 6. Submit

Now that a model has been created and evaluated using the train data, feed it the test data and obtain the predicted values.

First, extract the part corresponding to the test data from the total data (X_total).

X_submit = X_total[train_rows:]
# Reuse the scaler fitted on the train data; refitting on the test data would scale it differently
X_submit = std.transform(X_submit)

print(X_train.shape)
print(X_submit.shape)

# (623, 2431)
# (418, 2431)

X_submit has the same number of features (2431) as the X_train used to fit the model. Feed X_submit into the model to get the predicted values.

y_proba_submit = logreg.predict_proba(X_submit)[:, 1]
print(y_proba_submit[:5])

y_pred_submit = logreg.predict(X_submit)
print(y_pred_submit[:5])

# [0.02342065 0.18232356 0.06760457 0.06219097 0.76277487]
# [0 0 0 0 1]

Prepare the CSV data to submit to Kaggle.

First, create a data frame with the necessary information.

df_submit = pd.DataFrame(y_pred_submit, index=PassengerId, columns=['Survived'])
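As a sketch of what such a data frame looks like once written out as CSV, the example below builds a tiny frame the same way and writes it to an in-memory buffer. The IDs, predictions, and the name `toy_submit` are made up for illustration; for a real submission a file path (for example a hypothetical 'submission.csv') would be passed to `to_csv` instead of the buffer.

```python
import io
import pandas as pd

# Toy IDs and predictions (hypothetical values).
passenger_id = pd.Series([892, 893, 894], name='PassengerId')
preds = [0, 1, 0]

toy_submit = pd.DataFrame(preds, index=passenger_id, columns=['Survived'])

# Write to an in-memory buffer; index_label names the index column explicitly.
buffer = io.StringIO()
toy_submit.to_csv(buffer, index_label='PassengerId')
csv_text = buffer.getvalue()
print(csv_text)
```

Because the passenger IDs are the index, they are written as the first column, followed by the Survived column.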