My job is basically infrastructure, so I have never written anything other than server-side programs. However, since big data infrastructure is part of that world, I wanted to get a feel for what data scientists actually do on top of it, so I tried predicting survival with the well-known Titanic data. The goal is to understand how logistic regression analysis is done in Python.
- Python execution server: EC2 t2.micro
- OS: Red Hat Enterprise Linux 8 (HVM), SSD Volume Type
- Disk: General Purpose SSD (GP2), 10 GB
- Python: 3.7
(Setting up the Python environment is omitted.)
The work was carried out in the flow below.
The data was downloaded from here and placed on EC2.
| Variable | Definition | Values | Remarks |
|---|---|---|---|
| PassengerId | Passenger ID | | Primary key |
| Survived | Survival result | 0=died, 1=survived | Objective variable |
| Pclass | Ticket class | 1=1st, 2=2nd, 3=3rd | |
| Name | Name | | |
| Sex | Sex | male, female | |
| Age | Age | | |
| SibSp | Number of siblings/spouses aboard | 0, 1, 2, 3, 4, 5, 8 | |
| Parch | Number of parents/children aboard | 0, 1, 2, 3, 4, 5, 6, 9 | |
| Ticket | Ticket number | | |
| Fare | Fare | | |
| Cabin | Cabin number | | |
| Embarked | Port of embarkation | C=Cherbourg, Q=Queenstown, S=Southampton | |
## Load the training data
import pandas as pd
df_train = pd.read_csv("train.csv")
Missing values exist in Age, Cabin, and Embarked.
df_train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
If you want to use rows that contain missing values rather than discarding them, the missing values have to be filled in somehow.

Since Age is numerical, filling it with either the mean or the median seemed reasonable; this time I went with the median. (In practice you should decide by actually looking at the data.)

For Embarked and Cabin, I decided to fill the missing values with 'U' and 'Unknown', respectively. Embarked can be turned into dummy variables as it is, but Cabin has so many distinct values that creating dummies for it did not look practical. Since the survival tendency seemed to differ depending on whether the cabin was Unknown or not, I decided to encode it as 0 for Unknown and 1 for everything else.

The following crosstabs, taken after filling the missing values, show these tendencies.
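For example, a quick look at the Age distribution with describe() (which simply ignores the missing values) is one way to sanity-check the mean-vs-median choice:

## Compare the mean and the median of Age before choosing an imputation value
print(df_train['Age'].describe())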
pd.crosstab(df_train['Embarked'], df_train['Survived'])
Survived    0    1
Embarked
C          75   93
Q          47   30
S         427  217
U           0    2
pd.crosstab(df_train['Cabin'], df_train['Survived'])
Survived    0    1
Cabin
A10         1    0
A14         1    0
A16         0    1
A19         1    0
A20         0    1
A23         0    1
A24         1    0
A26         0    1
A31         0    1
A32         1    0
A34         0    1
A36         1    0
A5          1    0
A6          0    1
A7          1    0
B101        0    1
B102        1    0
B18         0    2
B19         1    0
B20         0    2
B22         1    1
B28         0    2
B3          0    1
B30         1    0
B35         0    2
B37         1    0
B38         1    0
B39         0    1
B4          0    1
B41         0    1
...       ...  ...
E121        0    2
E17         0    1
E24         0    2
E25         0    2
E31         1    0
E33         0    2
E34         0    1
E36         0    1
E38         1    0
E40         0    1
E44         1    1
E46         1    0
E49         0    1
E50         0    1
E58         1    0
E63         1    0
E67         1    1
E68         0    1
E77         1    0
E8          0    2
F E69       0    1
F G63       1    0
F G73       2    0
F2          1    2
F33         0    3
F38         1    0
F4          0    2
G6          2    2
T           1    0
Unknown   481  206
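The Cabin crosstab is hard to read row by row, but the Unknown row alone already suggests the tendency. A minimal sketch that compares survival rates for passengers with and without a recorded cabin (assuming Cabin has already been filled with 'Unknown' as in the code below):

## Survival rate for passengers with a recorded cabin vs. Cabin == 'Unknown'
cabin_known = df_train['Cabin'].ne('Unknown')
print(df_train.groupby(cabin_known)['Survived'].mean())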
Data processing will be carried out in accordance with the above policy.
## Fill missing Age values with the median
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].median())
## Fill missing Cabin values with Unknown
df_train['Cabin'] = df_train['Cabin'].fillna('Unknown')
## Fill missing Embarked values with U (Unknown)
df_train['Embarked'] = df_train['Embarked'].fillna('U')
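As a quick check that the imputation worked, the missing-value counts can be recomputed; every column should now report 0:

## All columns should now show 0 missing values
df_train.isnull().sum()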
Create dummy variables from the imputed data.
df_train_dummies = pd.get_dummies(df_train, columns=['Sex','Pclass','SibSp','Parch','Embarked'])
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 30 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
dtypes: float64(2), int64(2), object(3), uint8(23)
memory usage: 68.8+ KB
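One thing to note: within each group the dummy columns are mutually exclusive and sum to 1 (e.g. Sex_female + Sex_male = 1), so they are perfectly collinear. scikit-learn's regularized logistic regression can still fit them, but a common alternative is to drop one level per variable. A sketch, assuming the same column list as above:

## Alternative: drop the first dummy level of each variable to avoid perfect collinearity
df_train_dummies_alt = pd.get_dummies(
    df_train,
    columns=['Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked'],
    drop_first=True)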
Standardize the numerical Age and Fare data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_train_dummies['Age_scale'] = scaler.fit_transform(df_train_dummies.loc[:, ['Age']])
df_train_dummies['Fare_scale'] = scaler.fit_transform(df_train_dummies.loc[:, ['Fare']])
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 32 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
dtypes: float64(4), int64(2), object(3), uint8(23)
memory usage: 82.7+ KB
For Cabin, generate a variable that is 0 for Unknown and 1 otherwise.
df_train_dummies['Cabin_New'] = df_train_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)
df_train_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 33 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Name           891 non-null object
Age            891 non-null float64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          891 non-null object
Sex_female     891 non-null uint8
Sex_male       891 non-null uint8
Pclass_1       891 non-null uint8
Pclass_2       891 non-null uint8
Pclass_3       891 non-null uint8
SibSp_0        891 non-null uint8
SibSp_1        891 non-null uint8
SibSp_2        891 non-null uint8
SibSp_3        891 non-null uint8
SibSp_4        891 non-null uint8
SibSp_5        891 non-null uint8
SibSp_8        891 non-null uint8
Parch_0        891 non-null uint8
Parch_1        891 non-null uint8
Parch_2        891 non-null uint8
Parch_3        891 non-null uint8
Parch_4        891 non-null uint8
Parch_5        891 non-null uint8
Parch_6        891 non-null uint8
Embarked_C     891 non-null uint8
Embarked_Q     891 non-null uint8
Embarked_S     891 non-null uint8
Embarked_U     891 non-null uint8
Age_scale      891 non-null float64
Fare_scale     891 non-null float64
Cabin_New      891 non-null int64
dtypes: float64(4), int64(3), object(3), uint8(23)
memory usage: 89.7+ KB
Build a model using the generated variables
from sklearn.linear_model import LogisticRegression
##Setting explanatory variables
X = df_train_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
##Setting the objective variable
y = df_train_dummies['Survived']
model = LogisticRegression()
result = model.fit(X, y)
Perform model evaluation
## Check the coefficients
result.coef_
array([[ 1.08240551, -1.49531858,  0.44672894,  0.1123565 , -0.9719985 ,
         0.76315352,  0.84866325,  0.44745114, -0.8278121 , -0.42535544,
        -0.51058583, -0.70842761,  0.22631714,  0.45736976,  0.05137953,
         0.30703108, -0.71378859, -0.41660858, -0.3246134 , -0.05695347,
        -0.04812828, -0.47921984,  0.17138853, -0.47504073,  0.08458894,
         0.83782699]])
The gender features (Sex_female, Sex_male) turned out to have the largest influence.
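The raw array is hard to read because the coefficients carry no labels. Pairing each coefficient with its column name makes the comparison easier (a small sketch; the order of coef_ follows the column order of X):

## Label each coefficient with its explanatory variable and sort by absolute value
coef = pd.Series(result.coef_[0], index=X.columns)
print(coef.reindex(coef.abs().sort_values(ascending=False).index))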
model.score(X, y)
0.8181818181818182
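Note that this 0.818 is the accuracy on the same data the model was trained on. As a rough sketch of generalization performance, cross-validation could be used instead (max_iter here is just an assumption to avoid convergence warnings):

from sklearn.model_selection import cross_val_score

## 5-fold cross-validation accuracy as a rough estimate of generalization performance
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(cv_scores.mean())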
df_test = pd.read_csv("test.csv")
There are missing values in Age, Cabin, and Fare.
df_test.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
The test data is processed in the same way as the training data. Fare, like Age, is filled with the median.
## Fill missing Age values with the median (of the training data)
df_test['Age'] = df_test['Age'].fillna(df_train['Age'].median())
## Fill missing Cabin values with Unknown
df_test['Cabin'] = df_test['Cabin'].fillna('Unknown')
## Fill missing Fare values with the median
df_test['Fare'] = df_test['Fare'].fillna(df_test['Fare'].median())
Create dummy variables from the imputed data.
df_test_dummies = pd.get_dummies(df_test, columns=['Sex','Pclass','SibSp','Parch','Embarked'])
Since Embarked has no missing values in the test data, an Embarked_U column is added with every value set to 0.
df_test_dummies['Embarked_U'] = 0
df_test_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 29 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
dtypes: float64(2), int64(1), object(3), uint8(23)
memory usage: 29.1+ KB
Standardize the numerical Age and Fare data.
df_test_dummies['Age_scale'] = scaler.fit_transform(df_test_dummies.loc[:, ['Age']])
df_test_dummies['Fare_scale'] = scaler.fit_transform(df_test_dummies.loc[:, ['Fare']])
df_test_dummies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 31 columns):
PassengerId    418 non-null int64
Name           418 non-null object
Age            418 non-null float64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          418 non-null object
Sex_female     418 non-null uint8
Sex_male       418 non-null uint8
Pclass_1       418 non-null uint8
Pclass_2       418 non-null uint8
Pclass_3       418 non-null uint8
SibSp_0        418 non-null uint8
SibSp_1        418 non-null uint8
SibSp_2        418 non-null uint8
SibSp_3        418 non-null uint8
SibSp_4        418 non-null uint8
SibSp_5        418 non-null uint8
SibSp_8        418 non-null uint8
Parch_0        418 non-null uint8
Parch_1        418 non-null uint8
Parch_2        418 non-null uint8
Parch_3        418 non-null uint8
Parch_4        418 non-null uint8
Parch_5        418 non-null uint8
Parch_6        418 non-null uint8
Parch_9        418 non-null uint8
Embarked_C     418 non-null uint8
Embarked_Q     418 non-null uint8
Embarked_S     418 non-null uint8
Age_scale      418 non-null float64
Fare_scale     418 non-null float64
dtypes: float64(4), int64(1), object(3), uint8(23)
memory usage: 35.6+ KB
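Note that StandardScaler is fit again here, so Age and Fare are scaled by the test set's own mean and standard deviation. A common alternative (just a sketch, not what was done above) is to fit one scaler per column on the training data and only transform the test data:

## Sketch: fit the scalers on the training data and reuse them for the test data
age_scaler = StandardScaler().fit(df_train_dummies[['Age']])
fare_scaler = StandardScaler().fit(df_train_dummies[['Fare']])
df_test_dummies['Age_scale'] = age_scaler.transform(df_test_dummies[['Age']])
df_test_dummies['Fare_scale'] = fare_scaler.transform(df_test_dummies[['Fare']])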
For Cabin, generate a variable that is 0 for Unknown and 1 otherwise.
df_test_dummies['Cabin_New'] = df_test_dummies['Cabin'].map(lambda x: 0 if x == 'Unknown' else 1).astype(int)
## Define the data used for prediction
df_test_dummies_x = df_test_dummies[['Sex_female','Sex_male','Pclass_1','Pclass_2','Pclass_3','SibSp_0','SibSp_1','SibSp_2','SibSp_3','SibSp_4','SibSp_5','SibSp_8','Parch_0','Parch_1','Parch_2','Parch_3','Parch_4','Parch_5','Parch_6','Embarked_C','Embarked_Q','Embarked_S','Embarked_U','Age_scale','Fare_scale','Cabin_New']]
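The manual handling above is needed because the train and test dummies do not line up automatically: the test data has a Parch_9 column that the training data lacks, and was missing Embarked_U. As an alternative sketch, the test features could be aligned to the training feature list in one step (df_test_aligned is a hypothetical name):

## Align the test features to the training feature columns:
## missing dummy columns are filled with 0, extra ones (e.g. Parch_9) are dropped
df_test_aligned = df_test_dummies.reindex(columns=X.columns, fill_value=0)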
## Run the prediction
predict = model.predict(df_test_dummies_x)
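predict returns the 0/1 class labels. If survival probabilities are wanted instead, the same model also provides predict_proba (a small sketch):

## Probability of survival (class 1) for each test passenger
survival_proba = model.predict_proba(df_test_dummies_x)[:, 1]
print(survival_proba[:5])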
Output the prediction results to CSV.
output_csv = pd.concat([df_test_dummies['PassengerId'], pd.Series(predict)], axis=1)
output_csv.columns = ['PassengerId', 'Survived']
output_csv.to_csv('./submition.csv', index=False)
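A quick sanity check before submitting, just to confirm the file has one row per test passenger (418) and the two required columns:

## Expect 418 rows and the columns PassengerId, Survived
print(output_csv.shape)
print(output_csv.head())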
It was very much a "just try it for now" attempt, but I submitted the results to Kaggle. The score was 0.76076, which ranked 14284th as of December 30, 2020.