My background is on the infrastructure side, so I have never worked on anything other than server-side programs. But since big data platforms also come up in that work, I wanted to get a sense of what data scientists actually do, so I tried predicting survival with the well-known Titanic dataset. The goal is to understand how logistic regression analysis is done in Python.
- Python operation server: EC2 t2.micro
- OS: Red Hat Enterprise Linux 8 (HVM), SSD Volume Type
- Disk: General Purpose SSD (GP2), 10 GB
- Python: 3.7
(Python operating environment maintenance is omitted)
It was carried out in the following flow.
The data was downloaded from here and placed on EC2.
Variable | Definition | Values | Remarks
---|---|---|---
PassengerId | Passenger ID | | Primary key
Survived | Survival outcome | 0 = died, 1 = survived | Objective variable
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Name | |
Sex | Sex | male, female |
Age | Age | |
SibSp | Number of siblings/spouses aboard | 0, 1, 2, 3, 4, 5, 8 |
Parch | Number of parents/children aboard | 0, 1, 2, 3, 4, 5, 6, 9 |
Ticket | Ticket number | |
Fare | Fare | |
Cabin | Cabin number | |
Embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
import pandas as pd
df_train = pd.read_csv("train.csv")
Missing values exist in Age, Cabin, and Embarked.
df_train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
If you want to use the rows containing missing values rather than discarding them, some imputation is needed. Since Age is numeric, imputing with the mean or the median seemed appropriate; this time I went with the median. (In practice you should decide by looking at the data properly.) For Embarked and Cabin, I decided to fill the missing values with 'U' and 'Unknown', respectively. Embarked can be turned into dummy variables as-is, but Cabin has so many distinct values that creating dummy variables directly looked impractical, and the survival tendency seemed to differ mainly by whether the cabin was Unknown or not. So for Cabin I decided to encode the value as 0 for Unknown and 1 for everything else. The following crosstabs, taken after filling the missing values, confirm these tendencies.
pd.crosstab(df_train['Embarked'], df_train['Survived'])
Survived    0    1
Embarked
C          75   93
Q          47   30
S         427  217
U           0    2
pd.crosstab(df_train['Cabin'], df_train['Survived'])
Survived    0    1
Cabin
A10         1    0
A14         1    0
A16         0    1
A19         1    0
A20         0    1
A23         0    1
A24         1    0
A26         0    1
A31         0    1
A32         1    0
A34         0    1
A36         1    0
A5          1    0
A6          0    1
A7          1    0
B101        0    1
B102        1    0
B18         0    2
B19         1    0
B20         0    2
B22         1    1
B28         0    2
B3          0    1
B30         1    0
B35         0    2
B37         1    0
B38         1    0
B39         0    1
B4          0    1
B41         0    1
...       ...  ...
E121        0    2
E17         0    1
E24         0    2
E25         0    2
E31         1    0
E33         0    2
E34         0    1
E36         0    1
E38         1    0
E40         0    1
E44         1    1
E46         1    0
E49         0    1
E50         0    1
E58         1    0
E63         1    0
E67         1    1
E68         0    1
E77         1    0
E8          0    2
F E69       0    1
F G63       1    0
F G73       2    0
F2          1    2
F33         0    3
F38         1    0
F4          0    2
G6          2    2
T           1    0
Unknown   481  206
Data processing will be carried out in accordance with the above policy.
## Fill missing Age values with the median
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())
## Fill missing Cabin values with 'Unknown'
df_train['Cabin'] = df_train['Cabin'].fillna('Unknown')
## Fill missing Embarked values with 'U' (Unknown)
df_train['Embarked'] = df_train['Embarked'].fillna('U')
Create dummy variables from the imputed data.
df_train_dummies = pd.get_dummies(df_train, columns=['Sex','Pclass','SibSp','Parch','Embarked'])
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 30 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
dtypes: float64(2), int64(2), object(3), uint8(23)
memory usage: 68.8+ KB
Standardize the numeric Age and Fare columns.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_train_dummies['Age_scale'] = scaler.fit_transform(df_train_dummies.loc[:, ['Age']])
df_train_dummies['Fare_scale'] = scaler.fit_transform(df_train_dummies.loc[:, ['Fare']])
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 32 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
dtypes: float64(4), int64(2), object(3), uint8(23)
memory usage: 82.7+ KB
For Cabin, generate variables so that Unknown is 0 and others are 1.
df_train_dummies['Cabin_New'] = df_train_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 33 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
Cabin_New      891 non-null int64
dtypes: float64(4), int64(3), object(3), uint8(23)
memory usage: 89.7+ KB
Build a model using the generated variables
from sklearn.linear_model import LogisticRegression
##Setting explanatory variables
X = df_train_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
##Setting the objective variable
y = df_train_dummies['Survived']
model = LogisticRegression()
result = model.fit(X, y)
Evaluate the model.
## Check the coefficients
result.coef_
array([[ 1.08240551, -1.49531858, 0.44672894, 0.1123565 , -0.9719985 , 0.76315352, 0.84866325, 0.44745114, -0.8278121 , -0.42535544, -0.51058583, -0.70842761, 0.22631714, 0.45736976, 0.05137953, 0.30703108, -0.71378859, -0.41660858, -0.3246134 , -0.05695347, -0.04812828, -0.47921984, 0.17138853, -0.47504073, 0.08458894, 0.83782699]])
It turns out that the gender variables have the strongest influence.
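The raw `coef_` array is hard to read on its own; pairing each coefficient with its column name makes the comparison easier. Below is a minimal sketch with two illustrative columns standing in for the 26-column `X` above (the toy data is an assumption for demonstration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data: two illustrative feature columns standing in for the full X
X = pd.DataFrame({
    "Sex_female": [1, 0, 1, 0, 1, 0],
    "Age_scale":  [-0.5, 0.3, 1.2, -1.0, 0.0, 0.8],
})
y = pd.Series([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Pair each coefficient with its column name for readability
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs.sort_values(ascending=False))
```

With the real data, `pd.Series(result.coef_[0], index=X.columns)` would label all 26 coefficients the same way.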
model.score(X, y)
0.8181818181818182
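Note that `model.score(X, y)` measures accuracy on the same data the model was fit on, so it tends to be optimistic. A common way to get a more honest estimate is k-fold cross-validation; here is a minimal sketch with synthetic stand-in data (the data generation is an assumption for demonstration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 5 features, label driven by feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validation: fit on 4 folds, score the held-out fold each time
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```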
df_test = pd.read_csv("test.csv")
There are missing values in Age, Cabin, and Fare.
df_test.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
Process the test data in the same way as the training data. Fare, like Age, is imputed with the median.
## Fill missing Age values with the training-data median
df_test['Age'] = df_test['Age'].fillna(df_train['Age'].median())
## Fill missing Cabin values with 'Unknown'
df_test['Cabin'] = df_test['Cabin'].fillna('Unknown')
## Fill missing Fare values with the median
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median())
Create dummy variables from the imputed data.
df_test_dummies = pd.get_dummies(df_test, columns=['Sex','Pclass','SibSp','Parch','Embarked'])
Since the test data has no missing Embarked values, get_dummies does not produce an Embarked_U column, so add one with all values set to 0.
df_test_dummies['Embarked_U'] = 0
df_test_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 29 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
dtypes: float64(2), int64(1), object(3), uint8(23)
memory usage: 29.1+ KB
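Adding the missing dummy column by hand works, but pandas can also align the test dummies to the training columns automatically with `reindex`: absent columns (like Embarked_U here) are created and filled with 0, and categories the training data never saw (like the test set's Parch_9) are dropped. A small sketch with toy frames (the toy values are assumptions for demonstration):

```python
import pandas as pd

# Toy frames: the training data contains an Embarked value 'U'
# that the test data lacks
train = pd.get_dummies(pd.DataFrame({"Embarked": ["C", "Q", "S", "U"]}))
test = pd.get_dummies(pd.DataFrame({"Embarked": ["C", "S", "S"]}))

# Align test columns to the training columns: Embarked_U is created
# and filled with 0; columns unseen in training would be dropped
test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(list(test_aligned.columns))
```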
Standardize the numeric Age and Fare columns.
df_test_dummies['Age_scale'] = scaler.fit_transform(df_test_dummies.loc[:, ['Age']])
df_test_dummies['Fare_scale'] = scaler.fit_transform(df_test_dummies.loc[:, ['Fare']])
df_test_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 31 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
Age_scale      418 non-null float64
Fare_scale     418 non-null float64
dtypes: float64(4), int64(1), object(3), uint8(23)
memory usage: 35.6+ KB
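One caveat: the scaling step above calls `fit_transform` again on the test data, so the test set is standardized with its own mean and standard deviation. The more common pattern is to fit the scaler on the training data only and reuse those statistics for the test data. A minimal sketch with toy age values (the numbers are assumptions for demonstration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy age columns standing in for the real train/test data
train_ages = np.array([[22.0], [38.0], [26.0], [35.0]])
test_ages = np.array([[30.0], [40.0]])

scaler = StandardScaler().fit(train_ages)   # learn mean/std from train only
train_scaled = scaler.transform(train_ages)
test_scaled = scaler.transform(test_ages)   # reuse the training statistics
print(scaler.mean_[0])  # mean learned from the training ages -> 30.25
```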
For Cabin, generate variables so that Unknown is 0 and others are 1.
df_test_dummies['Cabin_New'] = df_test_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)
## Define the data used for prediction
df_test_dummies_x = df_test_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
## Run the prediction
predict = model.predict(df_test_dummies_x)
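`predict` returns hard 0/1 labels. If you also want the estimated survival probabilities, `predict_proba` exposes them; a sketch with synthetic stand-in data (the data generation is an assumption for demonstration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: label driven by feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

labels = model.predict(X[:5])        # hard 0/1 labels
proba = model.predict_proba(X[:5])   # column 1 is P(class = 1)
print(proba.shape)
```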
Output the prediction results to CSV.
output_csv = pd.concat([df_test_dummies['PassengerId'], pd.Series(predict)], axis=1)
output_csv.columns = ['PassengerId', 'Survived']
output_csv.to_csv('./submition.csv', index=False)
It was very much a "just give it a try" attempt, but I submitted the results to Kaggle. The score was 0.76076, which ranked 14284th as of December 30, 2020.