[PYTHON] I tried logistic regression analysis for the first time using Titanic data

Purpose

My background is basically infrastructure, so I had never touched anything beyond server-side programs. But since my work also involves big data infrastructure, I wanted to get a feel for what data scientists actually do on top of it, so I tried a survival analysis using the Titanic data. The purpose is to understand how a logistic regression analysis is done in Python.

Environment

- Python server: EC2 t2.micro
- OS: Red Hat Enterprise Linux 8 (HVM), SSD Volume Type
- Disk: General Purpose SSD (GP2), 10 GB
- Python: 3.7

Implementation procedure

(Setting up the Python environment itself is omitted.)

The work proceeded in the following steps:

  1. Data preparation
  2. Confirmation of training data
  3. Processing of training data
  4. Model construction
  5. Confirmation of test data
  6. Processing of test data
  7. Model prediction

1. Data preparation

The data (train.csv and test.csv) was downloaded from the Kaggle Titanic competition and placed on the EC2 instance.
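As a quick sanity check after placing the files (not part of the original steps), you can confirm both CSVs load and have the expected shapes:

## Check that both files are readable
import pandas as pd
for name in ("train.csv", "test.csv"):
    print(name, pd.read_csv(name).shape)  ## expect (891, 12) and (418, 11)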

2. Confirmation of training data

Variable     Definition                          Values                                          Remarks
PassengerId  Passenger ID                                                                        Primary key
Survived     Survival result                     0 = death, 1 = survival                         Objective variable
Pclass       Ticket class                        1 = 1st, 2 = 2nd, 3 = 3rd
Name         Name
Sex          Sex                                 male, female
Age          Age
SibSp        Number of siblings/spouses aboard   0, 1, 2, 3, 4, 5, 8
Parch        Number of parents/children aboard   0, 1, 2, 3, 4, 5, 6, 9
Ticket       Ticket number
Fare         Fare
Cabin        Cabin number
Embarked     Port of embarkation                 C = Cherbourg, Q = Queenstown, S = Southampton

Read the data into a DataFrame

import pandas as pd
df_train = pd.read_csv("train.csv")

Confirmation of missing values

Missing values exist in Age, Cabin, and Embarked.

df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

If you want to keep rows that contain missing values instead of discarding them, you need to impute them somehow. Age is numeric, so imputing with the mean or the median is a reasonable choice; this time I went with the median. (In practice you should decide by actually examining the data.)

For Embarked and Cabin, I decided to fill the missing values with 'U' and 'Unknown', respectively. Embarked can then be turned into dummy variables as-is, but Cabin has so many distinct values that dummy-encoding it directly looked impractical, and the main tendency seemed to be whether the cabin was known or not. So for Cabin I decided to encode Unknown as 0 and everything else as 1.

The following crosstabs, taken after imputing the missing values, confirm these tendencies.

pd.crosstab(df_train['Embarked'], df_train['Survived'])

Survived     0    1
Embarked
C           75   93
Q           47   30
S          427  217
U            0    2

pd.crosstab(df_train['Cabin'], df_train['Survived'])

Survived     0    1
Cabin
A10          1    0
A14          1    0
A16          0    1
A19          1    0
A20          0    1
A23          0    1
A24          1    0
A26          0    1
A31          0    1
A32          1    0
A34          0    1
A36          1    0
A5           1    0
A6           0    1
A7           1    0
B101         0    1
B102         1    0
B18          0    2
B19          1    0
B20          0    2
B22          1    1
B28          0    2
B3           0    1
B30          1    0
B35          0    2
B37          1    0
B38          1    0
B39          0    1
B4           0    1
B41          0    1
...        ...  ...
E121         0    2
E17          0    1
E24          0    2
E25          0    2
E31          1    0
E33          0    2
E34          0    1
E36          0    1
E38          1    0
E40          0    1
E44          1    1
E46          1    0
E49          0    1
E50          0    1
E58          1    0
E63          1    0
E67          1    1
E68          0    1
E77          1    0
E8           0    2
F E69        0    1
F G63        1    0
F G73        2    0
F2           1    2
F33          0    3
F38          1    0
F4           0    2
G6           2    2
T            1    0
Unknown    481  206
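To read these tables as rates rather than raw counts, crosstab can normalize each row; a small extra check that is not in the original flow:

## Survival rate per embarkation port (each row sums to 1)
pd.crosstab(df_train['Embarked'], df_train['Survived'], normalize='index')
## From the counts above: C is about 0.55, Q about 0.39, S about 0.34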

3. Processing of training data

The data is processed according to the policy decided above.

## Impute missing Age values with the median
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())
## Impute missing Cabin values with 'Unknown'
df_train['Cabin'] = df_train['Cabin'].fillna('Unknown')
## Impute missing Embarked values with 'U' (Unknown)
df_train['Embarked'] = df_train['Embarked'].fillna('U')
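A quick follow-up check (not in the original post) that the three columns are now fully filled:

## Confirm that no missing values remain in the training data
df_train.isnull().sum().sum()  ## expect 0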

Create dummy variables from the imputed data

df_train_dummies = pd.get_dummies(df_train, columns=['Sex','Pclass','SibSp','Parch','Embarked'])
df_train_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 30 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
dtypes: float64(2), int64(2), object(3), uint8(23)
memory usage: 68.8+ KB
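One design note: with one-hot encoding like this, each group of dummies sums to 1, so the columns within a group are linearly dependent (the so-called dummy variable trap). scikit-learn's regularized LogisticRegression copes with this, but if you prefer to remove the redundancy, get_dummies can drop one level per category; a sketch using a separate, hypothetical variable name:

## Optional variant: drop one dummy per category to remove the redundancy
df_train_dummies_alt = pd.get_dummies(
    df_train,
    columns=['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked'],
    drop_first=True,
)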

Standardize the numeric Age and Fare data. StandardScaler rescales each column to (value - mean) / standard deviation.

from sklearn.preprocessing import StandardScaler
## Fit one scaler per column on the training data, and keep them
## so the test data can later be transformed with the same parameters
age_scaler = StandardScaler()
fare_scaler = StandardScaler()
df_train_dummies['Age_scale'] = age_scaler.fit_transform(df_train_dummies[['Age']])
df_train_dummies['Fare_scale'] = fare_scaler.fit_transform(df_train_dummies[['Fare']])
df_train_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 32 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
dtypes: float64(4), int64(2), object(3), uint8(23)
memory usage: 82.7+ KB

For Cabin, generate a variable that is 0 for Unknown and 1 for anything else.

df_train_dummies['Cabin_New'] = df_train_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)
df_train_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 33 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
Cabin_New      891 non-null int64
dtypes: float64(4), int64(3), object(3), uint8(23)
memory usage: 89.7+ KB
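As an aside, the same Cabin flag can be written as a vectorized comparison; an equivalent form, assuming the 'Unknown' fill above:

## Equivalent one-liner for the Cabin flag
df_train_dummies['Cabin_New'] = (df_train_dummies['Cabin'] != 'Unknown').astype(int)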

4. Model construction

Build a model using the generated variables

from sklearn.linear_model import LogisticRegression
## Explanatory variables (features)
X = df_train_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
## Objective (target) variable
y = df_train_dummies['Survived']
model = LogisticRegression()
result = model.fit(X, y)
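Under the hood, logistic regression models the survival probability as P(Survived = 1) = 1 / (1 + exp(-(w·x + b))), where w are the coefficients checked below and b is the intercept. The fitted probabilities can be inspected directly; a small extra step not in the original flow:

## Predicted probabilities for the first few passengers;
## each row is [P(Survived = 0), P(Survived = 1)]
model.predict_proba(X)[:5]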

Evaluate the model

## Check the coefficients
result.coef_

array([[ 1.08240551, -1.49531858,  0.44672894,  0.1123565 , -0.9719985 ,
         0.76315352,  0.84866325,  0.44745114, -0.8278121 , -0.42535544,
        -0.51058583, -0.70842761,  0.22631714,  0.45736976,  0.05137953,
         0.30703108, -0.71378859, -0.41660858, -0.3246134 , -0.05695347,
        -0.04812828, -0.47921984,  0.17138853, -0.47504073,  0.08458894,
         0.83782699]])
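Reading a bare array against 26 column names is error-prone, so here is a small helper (not in the original post) that lines each coefficient up with its feature name:

## Pair each coefficient with its feature name, sorted by value
pd.Series(result.coef_[0], index=X.columns).sort_values()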

Gender clearly had the largest influence: Sex_female carries the largest positive coefficient (about 1.08) and Sex_male the largest negative one (about -1.50).

model.score(X, y)

0.8181818181818182
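Note that this 0.818 is accuracy measured on the same data the model was trained on, so it is likely optimistic. A quick sanity check, not in the original flow, is cross-validation:

## 5-fold cross-validated accuracy as a less optimistic estimate
from sklearn.model_selection import cross_val_score
cross_val_score(LogisticRegression(), X, y, cv=5).mean()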

5. Confirmation of test data

df_test = pd.read_csv("test.csv")

Confirmation of missing values

There are missing values in Age, Cabin, and Fare.

df_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

6. Processing of test data

The test data is processed the same way as the training data. Fare, which has one missing value here, is imputed with the median just like Age; for consistency, both medians are taken from the training data.

## Impute missing Age values with the training-data median
df_test['Age'] = df_test['Age'].fillna(df_train['Age'].median())
## Impute missing Cabin values with 'Unknown'
df_test['Cabin'] = df_test['Cabin'].fillna('Unknown')
## Impute the missing Fare value, also with the training-data median
df_test['Fare'] = df_test['Fare'].fillna(df_train['Fare'].median())

Create dummy variables from the imputed data

df_test_dummies = pd.get_dummies(df_test, columns=['Sex','Pclass','SibSp','Parch','Embarked'])

Since the test data has no missing Embarked values, get_dummies does not produce an Embarked_U column here, so add it filled with zeros.

df_test_dummies['Embarked_U'] = 0
df_test_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 29 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
dtypes: float64(2), int64(1), object(3), uint8(23)
memory usage: 29.1+ KB
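Note the opposite mismatch as well: the test data produces a Parch_9 column that the training data never had. It is harmless here because the feature columns are listed explicitly in step 7, but a more general way to keep train and test dummies aligned is to reindex against the training feature columns; a sketch, assuming the X defined in step 4 and a hypothetical name df_test_aligned:

## Align test dummies to the training feature columns:
## missing columns (e.g. Embarked_U) are added as 0,
## extra columns (e.g. Parch_9) are dropped
df_test_aligned = df_test_dummies.reindex(columns=X.columns, fill_value=0)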

Standardize the numeric Age and Fare data, reusing the scalers fitted on the training data.

## Transform with the training-fit scalers (do not refit on test data)
df_test_dummies['Age_scale'] = age_scaler.transform(df_test_dummies[['Age']])
df_test_dummies['Fare_scale'] = fare_scaler.transform(df_test_dummies[['Fare']])
df_test_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 31 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
Age_scale      418 non-null float64
Fare_scale     418 non-null float64
dtypes: float64(4), int64(1), object(3), uint8(23)
memory usage: 35.6+ KB

For Cabin, generate a variable that is 0 for Unknown and 1 for anything else, as before.

df_test_dummies['Cabin_New'] = df_test_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)

7. Model prediction

## Select the feature columns used for prediction
df_test_dummies_x = df_test_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
## Run the prediction
predict = model.predict(df_test_dummies_x)

Output the prediction results as CSV

output_csv = pd.concat([df_test_dummies['PassengerId'], pd.Series(predict)], axis=1)
output_csv.columns = ['PassengerId', 'Survived']
output_csv.to_csv('./submission.csv', index=False)
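A final sanity check before uploading, not in the original post: Kaggle expects exactly 418 rows with PassengerId and Survived columns.

## Verify the submission file format
check = pd.read_csv('./submission.csv')
print(check.shape)   ## expect (418, 2)
print(check.head())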

Finally

It was very much a "just try it and see" attempt, but I submitted the result to Kaggle. The score was 0.76076, which ranked 14284th as of December 30, 2020.
