[PYTHON] A story of a high-school-graduate technician trying to predict Titanic survival


** Sites I referred to for this post ** https://yolo-kiyoshi.com/2020/01/22/post-1588/ https://www.codexa.net/kaggle-titanic-beginner/ https://qiita.com/suzumi/items/8ce18bc90c942663d1e6


Thoughts / Bias

- Surviving in a frigid environment seems difficult. Were women and children given priority on the lifeboats?

- Weren't people of high social status given preferential treatment?

Data confirmation

Check the data with info(). From info(), you can see that both Age and Cabin have missing values.
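
As a rough sketch of that check (assuming the training CSV has already been read into train_data with pandas; the file path is an assumption), it would look something like this:

import pandas as pd

# Load the Kaggle Titanic training data (path is an assumption)
train_data = pd.read_csv('train.csv')

# Column types and non-null counts: Age and Cabin show fewer non-null entries
train_data.info()

# Missing values per column, largest first
print(train_data.isnull().sum().sort_values(ascending=False))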

What to do with missing values?

Summarizing my own long experience with data analysis, my rule of thumb for missing values is roughly as follows. How do you all handle it?

In this case, Age has a moderate amount of missing data, so the textbook move would be to fill it with the mean or median, and Cabin is missing so much that it would normally just be dropped. That approach has already been explained clearly by other people, though, so here I'd like to get my hands dirty and handle this part a little differently.
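
For reference, the standard approach mentioned above would look roughly like this (a sketch only; it is not what this post actually does):

# Baseline handling, for reference only:
# fill Age with the median and drop the mostly-missing Cabin column
baseline = train_data.copy()
baseline['Age'] = baseline['Age'].fillna(baseline['Age'].median())
baseline = baseline.drop(columns=['Cabin'])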

Missing values in Age

For the missing Age values, I think the key is the Name column. In particular, the honorific title it contains is valuable information: it tells you male or female, adult or child, married or unmarried for women, high status, and so on.

This ship's passengers average out to roughly their 30s, but the presence of children and of older people in high positions would drag down the accuracy of filling Age with a single overall average.

So, first of all, I will extract the title.

in: #Show name
train_data['Name']

out:
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

If you scan the data, you can see there is a title between the "," and the ".".

in: # Extract the title from the Name column
# Merge train and test
train_data1 = train_data.copy()
test_data1 = test_data.copy()
train_data1['train_or_test'] = 'train' 
test_data1['train_or_test'] = 'test' 
test_data1['Survived'] = np.nan #Set Survived column to NaN for testing
all_data = pd.concat(
    [
        train_data1,
        test_data1
    ],
    sort=False,
    axis=0  # concatenate train_data1 and test_data1 vertically (row-wise)
).reset_index(drop=True)
# Extract the title from each Name in all_data and compute the average age per title
all_data['honorific'] = all_data['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
Average_age=all_data['Age'].groupby(all_data['honorific']).agg(['count','mean','median','std'])
Average_age

Doing this gives you the average age for each title.

Now let's fill in the missing ages using the average age for each title.

in: # Apply the average age for each title to the missing values
f = lambda x: x.fillna(x.mean())
age_complement = all_data.groupby('honorific').transform(f)
age_complement.info()


out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
PassengerId    1309 non-null int64
Survived       1308 non-null float64
Pclass         1309 non-null int64
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Fare           1309 non-null float64
dtypes: float64(3), int64(4)
memory usage: 71.7 KB

This fills in the missing ages. However, the grouped transform only returned the columns it could process, so the other columns disappeared. Let's transfer the filled Age back into the original all_data.

del(all_data['Age'])  # delete the original Age column from all_data
all_data['Age'] = age_complement['Age']  # put the imputed Age values back into all_data
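
Incidentally, a sketch of an alternative that fills only the Age column in place (not what this post does, but it avoids the column loss) would be:

# Alternative sketch: fill only Age, grouped by honorific, leaving the other columns of all_data untouched
all_data['Age'] = all_data.groupby('honorific')['Age'].transform(lambda s: s.fillna(s.mean()))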

This completes the processing of missing values in Age data.

Fare variation that bothered me while working with the data

Looking at the data, the variation in Fare bothered me. It seems that Fare is not the price per person but the total paid by the group sharing a ticket. So let's count the number of duplicate tickets and divide Fare by that count to get back to a reasonable per-person price.

in: # 1. Create a dictionary with the number of occurrences of each Ticket
double_check_dict = all_data['Ticket'].value_counts().to_dict()

# 2. Add the duplicate-ticket count as a new column
all_data['double_check'] = all_data['Ticket'].apply(lambda x: double_check_dict[x] if x in double_check_dict else 0)
all_data


double_check holds the number of duplicate tickets. As expected, Fare is inflated where the same ticket appears many times, so let's divide Fare by that count.

in:
all_data['Fare']=all_data['Fare']/all_data['double_check']
all_data


With this, we were able to suppress price variations.

** If Cabin rooms are stratified by price, couldn't most of the missing rooms be inferred from Fare? That is my thinking. **

Think about Cabin

Let's break down the rows that contain Cabin data.

--Extract the first letter of the Cabin value and break it down into A, B, C, D, ...
--Groups that booked multiple rooms are written like "C12 C13 C14", separated by spaces, so the number of rooms rented is the number of spaces + 1.

in:
cabin_data = all_data.dropna(subset=['Cabin'])  # extract only the rows that have Cabin data
cabin_data['Cabin_id'] = cabin_data['Cabin'].map(lambda x: x[0])  # put the first letter of Cabin into Cabin_id
cabin_data['room'] = cabin_data['Cabin'].map(lambda x: x.count(' ')) + 1  # number of rooms = number of spaces + 1
cabin_data.head(50) 


Consider whether the room charge can be worked out separately for each Pclass

To see whether the rank of each room makes a price difference, first look at the data as a whole, then split it by Pclass 1, 2, 3 and look at the prices per room letter.

cabin_data_1 = cabin_data.query('Pclass == 1')
cabin_data_2 = cabin_data.query('Pclass == 2')
cabin_data_3 = cabin_data.query('Pclass == 3')

Average_ageC=cabin_data['Fare'].groupby(cabin_data['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Here are the overall prices:

Average_ageC=cabin_data_1['Fare'].groupby(cabin_data_1['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Pclass 1:

Average_ageC=cabin_data_2['Fare'].groupby(cabin_data_2['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Pclass 2:

Average_ageC=cabin_data_3['Fare'].groupby(cabin_data_3['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Pclass 3:

Looking at it roughly: if the fare differed by room letter within the same class band, the missing Cabin values could be estimated from Fare, but the prices are almost the same, so it does not look like they can be estimated precisely. Hmm, too bad.

Get the number of family members from SibSp and Parch

all_data['famiry_size'] = all_data['SibSp'] + all_data['Parch'] + 1  # family size = SibSp + Parch + the passenger themselves

Divide Fare into 11 bins

I split it into 11 bins, but I feel there is still room to dig deeper here.

# Split Fare into 11 quantile bins
all_data['Fare_bin'] = pd.qcut(all_data.Fare, 11)  # store the binned Fare as Fare_bin
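
Just as a sanity check (a small sketch, assuming the Fare_bin column created above), the bin boundaries and counts can be inspected like this:

# Show the 11 quantile bins and how many passengers fall into each
print(all_data['Fare_bin'].value_counts().sort_index())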

Divide Sex

from sklearn.preprocessing import LabelEncoder

sex_col = ['Sex']
le = LabelEncoder()
for col in sex_col:
    all_data[col] = le.fit_transform(all_data[col])  # encode Sex (male / female) as 0 and 1

Convert 'Pclass', 'Embarked', 'honorific', 'Fare_bin', and 'famiry_size' into dummy variables

cat_col = ['Pclass','Embarked','honorific','Fare_bin','famiry_size',]
all_data = pd.get_dummies(all_data, drop_first=True, columns=cat_col)  # turn these columns into 0/1 dummy variables

That's all for dividing each column.
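
As a quick check (a sketch; the exact column names depend on the dummy encoding above), the resulting feature set can be inspected like this:

# Confirm the shape and the columns produced by the preprocessing steps
print(all_data.shape)
print(all_data.columns.tolist())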

Divide all_data into train and test data

from sklearn.model_selection import train_test_split
 
train = all_data.query('train_or_test == "train"')  # extract the rows where train_or_test is "train"
test = all_data.query('train_or_test == "test"')
#Define target variables and columns unnecessary for learning
target_col = 'Survived'
drop_col = ['PassengerId','Survived', 'Name', 'Fare', 'Ticket', 'Cabin', 'train_or_test','Parch','SibSp','honorific_Jonkheer','honorific_Mme','honorific_Dona','honorific_Lady','honorific_Ms',]
#Holds only the features required for learning
train_feature = train.drop(columns=drop_col)
test_feature = test.drop(columns=drop_col)
train_tagert = train[target_col]
#Split train data
X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_tagert, test_size=0.3, random_state=0, stratify=train_tagert)

Learning

We use RandomForestClassifier, SVC, LogisticRegression, and CatBoostClassifier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier


rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
print('='*20)
print('RandomForestClassifier')
print(f'accuracy of train set: {rfc.score(X_train, y_train)}')
print(f'accuracy of test set: {rfc.score(X_test, y_test)}')


lr = LogisticRegression(random_state=0)
lr.fit(X_train, y_train)
print('='*20)
print('LogisticRegression')
print(f'accuracy of train set: {lr.score(X_train, y_train)}')
print(f'accuracy of test set: {lr.score(X_test, y_test)}')

svc = SVC(random_state=0)
svc.fit(X_train, y_train)
print('='*20)
print('SVC')
print(f'accuracy of train set: {svc.score(X_train, y_train)}')
print(f'accuracy of test set: {svc.score(X_test, y_test)}')


cat = CatBoostClassifier(random_state=0)
cat.fit(X_train, y_train)
print('='*20)
print('CAT')
print(f'accuracy of train set: {cat.score(X_train, y_train)}')
print(f'accuracy of test set: {cat.score(X_test, y_test)}')

The output:

RandomForestClassifier
accuracy of train set: 0.9678972712680578
accuracy of test set: 0.8246268656716418
LogisticRegression
accuracy of train set: 0.8426966292134831
accuracy of test set: 0.832089552238806
SVC
accuracy of train set: 0.8330658105939005
accuracy of test set: 0.8134328358208955
CAT
accuracy of train set: 0.9085072231139647
accuracy of test set: 0.8507462686567164

Adjust hyperparameters with Optuna and K-fold cross-validation

RandomForestClassifier

from sklearn.model_selection import StratifiedKFold, cross_validate
import optuna

cv = 10
def objective(trial):
    
    param_grid_rfc = {
        "max_depth": trial.suggest_int("max_depth", 5, 15),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
        'min_samples_split': trial.suggest_int("min_samples_split", 7, 15),
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
        'max_features': trial.suggest_int("max_features", 3, 10),
        "random_state": 0
    }
 
    model = RandomForestClassifier(**param_grid_rfc)
    
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()
 
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
rfc_best_param = study.best_params


LogisticRegression

import warnings
warnings.filterwarnings('ignore')

def objective(trial):
    
    param_grid_lr = {
        'C' : trial.suggest_int("C", 1, 100),
        "random_state": 0
    }

    model = LogisticRegression(**param_grid_lr)
    
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
lr_best_param = study.best_params 

SVC

import warnings
warnings.filterwarnings('ignore')

def objective(trial):
    
    param_grid_svc = {
        'C' : trial.suggest_int("C", 50, 200),
        'gamma': trial.suggest_loguniform("gamma", 1e-4, 1.0),
        "random_state": 0,
        'kernel': 'rbf'
    }

    model = SVC(**param_grid_svc)
    
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
svc_best_param = study.best_params

CatBoostClassifier

from sklearn.model_selection import train_test_split
from catboost import Pool
import sklearn.metrics
X = train_feature
y = train_tagert
categorical_features_indices = np.where(X.dtypes != np.float64)[0]  # indices of the non-float (categorical/integer) columns
def objective(trial):
    #Separate training data and test data
    X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_tagert, test_size=0.35, random_state=0, stratify=train_tagert)
    train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
    test_pool = Pool(X_test,y_test, cat_features=categorical_features_indices)

    #Parameter specification
    params = {
        'iterations' : trial.suggest_int('iterations', 50, 300),                         
        'depth' : trial.suggest_int('depth', 4, 10),                                       
        'learning_rate' : trial.suggest_loguniform('learning_rate', 0.01, 0.3),               
        'random_strength' :trial.suggest_int('random_strength', 0, 100),                       
        'bagging_temperature' :trial.suggest_loguniform('bagging_temperature', 0.01, 100.00), 
        'od_type': trial.suggest_categorical('od_type', ['IncToDec', 'Iter']),
        'od_wait' :trial.suggest_int('od_wait', 10, 50)
    }

    #Learning
    model = CatBoostClassifier(**params)
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)

    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()

if __name__ == '__main__':
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=5)
    cat_best_param = study.best_params
    print(study.best_value)
    print(cat_best_param)


Verify accuracy again with the adjusted parameters

# 5-Fold CV /Evaluate the model with Accuracy
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

rfc_best = RandomForestClassifier(**rfc_best_param)
print('RandomForestClassifier')
scores = cross_validate(rfc_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

lr_best = LogisticRegression(**lr_best_param)
print('LogisticRegression')
scores = cross_validate(lr_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

svc_best = SVC(**svc_best_param)
print('SVC')
scores = cross_validate(svc_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

cat_best = CatBoostClassifier(**cat_best_param)
print('CAT')
scores = cross_validate(cat_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

The results:

RandomForestClassifier
mean:0.827152263628423, std:0.029935476082608138
LogisticRegression
mean:0.8294309818436642, std:0.03568888547665349
SVC
mean:0.826034945192669, std:0.03392425879847107
CAT
mean:0.8249241166088076, std:0.030217830226771592

LogisticRegression comes out most accurate on average, but its std is also the highest.

# RandomForest
rfc_best = RandomForestClassifier(**rfc_best_param)
rfc_best.fit(train_feature, train_tagert)

# LogisticRegression
lr_best = LogisticRegression(**lr_best_param)
lr_best.fit(train_feature, train_tagert)

# SVC
svc_best = SVC(**svc_best_param)
svc_best.fit(train_feature, train_tagert)

#CatBoostClassifier
cat_best = CatBoostClassifier(**cat_best_param)
cat_best.fit(train_feature, train_tagert)


#Predict each
pred = {
    'rfc': rfc_best.predict(test_feature).astype(int),
    'lr': lr_best.predict(test_feature).astype(int),
    'svc': svc_best.predict(test_feature).astype(int),
    'cat': cat_best.predict(test_feature).astype(int)
}


#File output
for key, value in pred.items():
    pd.concat(
        [
            pd.DataFrame(test.PassengerId, columns=['PassengerId']).reset_index(drop=True),
            pd.DataFrame(value, columns=['Survived'])
       ],
        axis=1
    ).to_csv(f'output_{key}.csv', index=False)

Result

When submitted to Kaggle, SVC gave the best score.

After that I tried adjusting various columns and hyperparameters, but for now this is my best score. It will probably change depending on how the preprocessing is applied, so if I come up with a better method in the future I will try it out and verify it in practice.
