My job is basically infrastructure, so I have never written anything other than server-side programs. However, since big data infrastructure is part of that world, I wanted to get a feel for what data scientists actually do on top of it, so I tried predicting survival with the well-known Titanic data. The goal is to understand how logistic regression analysis is done in Python.
- Python execution server: EC2 t2.micro
- OS: Red Hat Enterprise Linux 8 (HVM), SSD Volume Type
- Disk: General Purpose SSD (GP2), 10 GB
- Python: 3.7
(Setting up the Python environment is omitted.)
The work was carried out in the flow below.
The data was downloaded from here and placed on EC2.
| Variable | Definition | Values | Remarks |
|---|---|---|---|
| PassengerId | Passenger ID | | Primary key |
| Survived | Survival result | 0=died, 1=survived | Objective variable |
| Pclass | Ticket class | 1=1st, 2=2nd, 3=3rd | |
| Name | Name | | |
| Sex | Sex | male, female | |
| Age | Age | | |
| SibSp | Number of siblings/spouses aboard | 0, 1, 2, 3, 4, 5, 8 | |
| Parch | Number of parents/children aboard | 0, 1, 2, 3, 4, 5, 6, 9 | |
| Ticket | Ticket number | | |
| Fare | Fare | | |
| Cabin | Cabin number | | |
| Embarked | Port of embarkation | C=Cherbourg, Q=Queenstown, S=Southampton | |
## Load the training data
import pandas as pd
df_train = pd.read_csv("train.csv")
Missing values exist in Age, Cabin, and Embarked.
df_train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
If you want to use rows that contain missing values rather than discarding them, the missing values have to be filled in somehow.

Since Age is numerical, filling it with either the mean or the median seemed reasonable; this time I went with the median. (In practice you should decide by actually looking at the data.)

For Embarked and Cabin, I decided to fill the missing values with 'U' and 'Unknown', respectively. Embarked can be turned into dummy variables as it is, but Cabin has so many distinct values that creating dummies for it did not look practical. Since the survival tendency seemed to differ depending on whether the cabin was Unknown or not, I decided to encode it as 0 for Unknown and 1 for everything else.

The following crosstabs, taken after filling the missing values, show these tendencies.
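For example, a quick look at the Age distribution with describe() (which simply ignores the missing values) is one way to sanity-check the mean-vs-median choice:

## Compare the mean and the median of Age before choosing an imputation value
print(df_train['Age'].describe())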
pd.crosstab(df_train['Embarked'], df_train['Survived'])
Survived    0    1
Embarked
C          75   93
Q          47   30
S         427  217
U           0    2
pd.crosstab(df_train['Cabin'], df_train['Survived'])
Survived    0    1
Cabin
A10         1    0
A14         1    0
A16         0    1
A19         1    0
A20         0    1
A23         0    1
A24         1    0
A26         0    1
A31         0    1
A32         1    0
A34         0    1
A36         1    0
A5          1    0
A6          0    1
A7          1    0
B101        0    1
B102        1    0
B18         0    2
B19         1    0
B20         0    2
B22         1    1
B28         0    2
B3          0    1
B30         1    0
B35         0    2
B37         1    0
B38         1    0
B39         0    1
B4          0    1
B41         0    1
...       ...  ...
E121        0    2
E17         0    1
E24         0    2
E25         0    2
E31         1    0
E33         0    2
E34         0    1
E36         0    1
E38         1    0
E40         0    1
E44         1    1
E46         1    0
E49         0    1
E50         0    1
E58         1    0
E63         1    0
E67         1    1
E68         0    1
E77         1    0
E8          0    2
F E69       0    1
F G63       1    0
F G73       2    0
F2          1    2
F33         0    3
F38         1    0
F4          0    2
G6          2    2
T           1    0
Unknown   481  206
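The Cabin crosstab is hard to read row by row, but the Unknown row alone already suggests the tendency. A minimal sketch that compares survival rates for passengers with and without a recorded cabin (assuming Cabin has already been filled with 'Unknown' as in the code below):

## Survival rate for passengers with a recorded cabin vs. Cabin == 'Unknown'
cabin_known = df_train['Cabin'].ne('Unknown')
print(df_train.groupby(cabin_known)['Survived'].mean())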
Data processing will be carried out in accordance with the above policy.
## Fill missing Age values with the median
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())
## Fill missing Cabin values with Unknown
df_train['Cabin'] = df_train['Cabin'].fillna('Unknown')
## Fill missing Embarked values with U (Unknown)
df_train['Embarked'] = df_train['Embarked'].fillna('U')
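As a quick check that the imputation worked, the missing-value counts can be recomputed; every column should now report 0:

## All columns should now show 0 missing values
df_train.isnull().sum()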
Create dummy variables from the imputed data.
df_train_dummies = pd.get_dummies(df_train, columns=['Sex','Pclass','SibSp','Parch','Embarked'])
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 30 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
dtypes: float64(2), int64(2), object(3), uint8(23)
memory usage: 68.8+ KB
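One thing to note: within each group the dummy columns are mutually exclusive and sum to 1 (e.g. Sex_female + Sex_male = 1), so they are perfectly collinear. scikit-learn's regularized logistic regression can still fit them, but a common alternative is to drop one level per variable. A sketch, assuming the same column list as above:

## Alternative: drop the first dummy level of each variable to avoid perfect collinearity
df_train_dummies_alt = pd.get_dummies(
    df_train,
    columns=['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked'],
    drop_first=True)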
Standardize the numerical Age and Fare data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_train_dummies['Age_scale'] = scaler.fit_transform(df_train_dummies.loc[:, ['Age']])
df_train_dummies['Fare_scale'] = scaler.fit_transform(df_train_dummies.loc[:, ['Fare']])
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 32 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
dtypes: float64(4), int64(2), object(3), uint8(23)
memory usage: 82.7+ KB
For Cabin, generate a variable that is 0 for Unknown and 1 otherwise.
df_train_dummies['Cabin_New'] = df_train_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 33 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
Cabin_New      891 non-null int64
dtypes: float64(4), int64(3), object(3), uint8(23)
memory usage: 89.7+ KB
Build a model using the generated variables
from sklearn.linear_model import LogisticRegression
##Setting explanatory variables
X = df_train_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
##Setting the objective variable
y = df_train_dummies['Survived']
model = LogisticRegression()
result = model.fit(X, y)
Perform model evaluation
## Check the coefficients
result.coef_
array([[ 1.08240551, -1.49531858,  0.44672894,  0.1123565 , -0.9719985 ,
         0.76315352,  0.84866325,  0.44745114, -0.8278121 , -0.42535544,
        -0.51058583, -0.70842761,  0.22631714,  0.45736976,  0.05137953,
         0.30703108, -0.71378859, -0.41660858, -0.3246134 , -0.05695347,
        -0.04812828, -0.47921984,  0.17138853, -0.47504073,  0.08458894,
         0.83782699]])
The gender features (Sex_female, Sex_male) turned out to have the largest influence.
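The raw array is hard to read because the coefficients carry no labels. Pairing each coefficient with its column name makes the comparison easier (a small sketch; the order of coef_ follows the column order of X):

## Label each coefficient with its explanatory variable and sort by absolute value
coef = pd.Series(result.coef_[0], index=X.columns)
print(coef.reindex(coef.abs().sort_values(ascending=False).index))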
model.score(X, y)
0.8181818181818182
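Note that this 0.818 is the accuracy on the same data the model was trained on. As a rough sketch of generalization performance, cross-validation could be used instead (max_iter here is just an assumption to avoid convergence warnings):

from sklearn.model_selection import cross_val_score

## 5-fold cross-validation accuracy as a rough estimate of generalization performance
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(cv_scores.mean())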
df_test = pd.read_csv("test.csv")
There are missing values in Age, Cabin, and Fare.
df_test.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
The test data is processed in the same way as the training data. Fare, like Age, is filled with the median.
## Fill missing Age values with the median (of the training data)
df_test['Age'] = df_test['Age'].fillna(df_train['Age'].median())
## Fill missing Cabin values with Unknown
df_test['Cabin'] = df_test['Cabin'].fillna('Unknown')
## Fill missing Fare values with the median
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median())
Create dummy variables from the imputed data.
df_test_dummies = pd.get_dummies(df_test, columns=['Sex','Pclass','SibSp','Parch','Embarked'])
Since Embarked has no missing values in the test data, an Embarked_U column is added with every value set to 0.
df_test_dummies['Embarked_U'] = 0
df_test_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 29 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
dtypes: float64(2), int64(1), object(3), uint8(23)
memory usage: 29.1+ KB
Standardize the numerical Age and Fare data.
df_test_dummies['Age_scale'] = scaler.fit_transform(df_test_dummies.loc[:, ['Age']])
df_test_dummies['Fare_scale'] = scaler.fit_transform(df_test_dummies.loc[:, ['Fare']])
df_test_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 31 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
Age_scale      418 non-null float64
Fare_scale     418 non-null float64
dtypes: float64(4), int64(1), object(3), uint8(23)
memory usage: 35.6+ KB
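Note that StandardScaler is fit again here, so Age and Fare are scaled by the test set's own mean and standard deviation. A common alternative (just a sketch, not what was done above) is to fit one scaler per column on the training data and only transform the test data:

## Sketch: fit the scalers on the training data and reuse them for the test data
age_scaler = StandardScaler().fit(df_train_dummies[['Age']])
fare_scaler = StandardScaler().fit(df_train_dummies[['Fare']])
df_test_dummies['Age_scale'] = age_scaler.transform(df_test_dummies[['Age']])
df_test_dummies['Fare_scale'] = fare_scaler.transform(df_test_dummies[['Fare']])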
For Cabin, generate a variable that is 0 for Unknown and 1 otherwise.
df_test_dummies['Cabin_New'] = df_test_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)
## Define the data used for prediction
df_test_dummies_x = df_test_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
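The manual handling above is needed because the train and test dummies do not line up automatically: the test data has a Parch_9 column that the training data lacks, and was missing Embarked_U. As an alternative sketch, the test features could be aligned to the training feature list in one step (df_test_aligned is a hypothetical name):

## Align the test features to the training feature columns:
## missing dummy columns are filled with 0, extra ones (e.g. Parch_9) are dropped
df_test_aligned = df_test_dummies.reindex(columns=X.columns, fill_value=0)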
## Run the prediction
predict = model.predict(df_test_dummies_x)
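predict returns the 0/1 class labels. If survival probabilities are wanted instead, the same model also provides predict_proba (a small sketch):

## Probability of survival (class 1) for each test passenger
survival_proba = model.predict_proba(df_test_dummies_x)[:, 1]
print(survival_proba[:5])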
Output the prediction results to CSV.
output_csv = pd.concat([df_test_dummies['PassengerId'], pd.Series(predict)], axis=1)
output_csv.columns = ['PassengerId', 'Survived']
output_csv.to_csv('./submition.csv', index=False)
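A quick sanity check before submitting, just to confirm the file has one row per test passenger (418) and the two required columns:

## Expect 418 rows and the columns PassengerId, Survived
print(output_csv.shape)
print(output_csv.head())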
It was very much a "just try it for now" attempt, but I submitted the results to Kaggle. The score was 0.76076, which ranked 14284th as of December 30, 2020.