[PYTHON] A story of a high-school-graduate technician trying to predict Titanic survival


** Sites I referred to for this post ** https://yolo-kiyoshi.com/2020/01/22/post-1588/ https://www.codexa.net/kaggle-titanic-beginner/ https://qiita.com/suzumi/items/8ce18bc90c942663d1e6


Thoughts / Bias

- Surviving in a frigid environment seems difficult. Were women and children given priority on the lifeboats?

- Weren't people of high social status given preferential treatment?

Data confirmation

Check the data with info(). From info(), you can see that both Age and Cabin have missing values.
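
As a rough sketch of that check (assuming the training CSV has already been read into train_data with pandas; the file path is an assumption), it would look something like this:

import pandas as pd

# Load the Kaggle Titanic training data (path is an assumption)
train_data = pd.read_csv('train.csv')

# Column types and non-null counts: Age and Cabin show fewer non-null entries
train_data.info()

# Missing values per column, largest first
print(train_data.isnull().sum().sort_values(ascending=False))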

What to do with missing values?

Summarizing my own long experience with data analysis, my rule of thumb for missing values is roughly as follows. How do you all handle it?

In this case, Age has a moderate amount of missing data, so the textbook move would be to fill it with the mean or median, and Cabin is missing so much that it would normally just be dropped. That approach has already been explained clearly by other people, though, so here I'd like to get my hands dirty and handle this part a little differently.
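
For reference, the standard approach mentioned above would look roughly like this (a sketch only; it is not what this post actually does):

# Baseline handling, for reference only:
# fill Age with the median and drop the mostly-missing Cabin column
baseline = train_data.copy()
baseline['Age'] = baseline['Age'].fillna(baseline['Age'].median())
baseline = baseline.drop(columns=['Cabin'])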

Missing values in Age

For the missing Age values, I think the key is the Name column. In particular, the honorific title it contains is valuable information: it tells you male or female, adult or child, married or unmarried for women, high status, and so on.

This ship's passengers average out to roughly their 30s, but the presence of children and of older people in high positions would drag down the accuracy of filling Age with a single overall average.

So, first of all, I will extract the title.

in: #Show name
train_data['Name']

out:
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

If you scan the data, you can see there is a title between the "," and the ".".

in: # Extract the title from the Name column
# Merge train and test
train_data1 = train_data.copy()
test_data1 = test_data.copy()
train_data1['train_or_test'] = 'train' 
test_data1['train_or_test'] = 'test' 
test_data1['Survived'] = np.nan #Set Survived column to NaN for testing
all_data = pd.concat(
    [
        train_data1,
        test_data1
    ],
    sort=False,
    axis=0  # concatenate train_data1 and test_data1 vertically (row-wise)
).reset_index(drop=True)
# Extract the title from each Name in all_data and compute the average age per title
all_data['honorific'] = all_data['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
Average_age=all_data['Age'].groupby(all_data['honorific']).agg(['count','mean','median','std'])
Average_age

Doing this gives you the average age for each title.

Now let's fill in the missing ages using the average age for each title.

in: # Apply the average age for each title to the missing values
f = lambda x: x.fillna(x.mean())
age_complement = all_data.groupby('honorific').transform(f)
age_complement.info()


out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
PassengerId    1309 non-null int64
Survived       1308 non-null float64
Pclass         1309 non-null int64
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Fare           1309 non-null float64
dtypes: float64(3), int64(4)
memory usage: 71.7 KB

This fills in the missing ages. However, the grouped transform only returned the columns it could process, so the other columns disappeared. Let's transfer the filled Age back into the original all_data.

del(all_data['Age'])  # delete the original Age column from all_data
all_data['Age'] = age_complement['Age']  # put the imputed Age values back into all_data
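
Incidentally, a sketch of an alternative that fills only the Age column in place (not what this post does, but it avoids the column loss) would be:

# Alternative sketch: fill only Age, grouped by honorific, leaving the other columns of all_data untouched
all_data['Age'] = all_data.groupby('honorific')['Age'].transform(lambda s: s.fillna(s.mean()))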

This completes the processing of missing values in Age data.

Fare variation that bothered me while working with the data

Looking at the data, the variation in Fare bothered me. It seems that Fare is not the price per person but the total paid by the group sharing a ticket. So let's count the number of duplicate tickets and divide Fare by that count to get back to a reasonable per-person price.

in: # 1. Create a dictionary with the number of occurrences of each Ticket
double_check_dict = all_data['Ticket'].value_counts().to_dict()

# 2. Add the duplicate-ticket count as a new column
all_data['double_check'] = all_data['Ticket'].apply(lambda x: double_check_dict[x] if x in double_check_dict else 0)
all_data


double_check holds the number of duplicate tickets. As expected, Fare is inflated where the same ticket appears many times, so let's divide Fare by that count.

in:
all_data['Fare']=all_data['Fare']/all_data['double_check']
all_data


With this, we were able to suppress price variations.

** If Cabin rooms are stratified by price, couldn't most of the missing rooms be inferred from Fare? That is my thinking. **

Think about Cabin

Let's break down the rows that contain Cabin data.

--Extract the first letter of the Cabin value and break it down into A, B, C, D, ...
--Groups that booked multiple rooms are written like "C12 C13 C14", separated by spaces, so the number of rooms rented is the number of spaces + 1.

in:
cabin_data = all_data.dropna(subset=['Cabin'])  # extract only the rows that have Cabin data
cabin_data['Cabin_id'] = cabin_data['Cabin'].map(lambda x: x[0])  # put the first letter of Cabin into Cabin_id
cabin_data['room'] = cabin_data['Cabin'].map(lambda x: x.count(' ')) + 1  # number of rooms = number of spaces + 1
cabin_data.head(50) 


Consider whether the room charge can be worked out separately for each Pclass

To see whether the rank of each room makes a price difference, first look at the data as a whole, then split it by Pclass 1, 2, 3 and look at the prices per room letter.

cabin_data_1 = cabin_data.query('Pclass == 1')
cabin_data_2 = cabin_data.query('Pclass == 2')
cabin_data_3 = cabin_data.query('Pclass == 3')

Average_ageC=cabin_data['Fare'].groupby(cabin_data['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Here are the overall prices:

Average_ageC=cabin_data_1['Fare'].groupby(cabin_data_1['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Pclass 1:

Average_ageC=cabin_data_2['Fare'].groupby(cabin_data_2['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Pclass 2:

Average_ageC=cabin_data_3['Fare'].groupby(cabin_data_3['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC

Pclass 3:

Looking at it roughly: if the fare differed by room letter within the same class band, the missing Cabin values could be estimated from Fare, but the prices are almost the same, so it does not look like they can be estimated precisely. Hmm, too bad.

Get the number of family members from SibSp and Parch

all_data['famiry_size'] = all_data['SibSp'] + all_data['Parch'] + 1  # family size = SibSp + Parch + the passenger themselves

Divide Fare into 11 bins

I split it into 11 bins, but I feel there is still room to dig deeper here.

# Split Fare into 11 quantile bins
all_data['Fare_bin'] = pd.qcut(all_data.Fare, 11)  # store the binned Fare as Fare_bin
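
Just as a sanity check (a small sketch, assuming the Fare_bin column created above), the bin boundaries and counts can be inspected like this:

# Show the 11 quantile bins and how many passengers fall into each
print(all_data['Fare_bin'].value_counts().sort_index())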

Divide Sex

from sklearn.preprocessing import LabelEncoder

sex_col = ['Sex']
le = LabelEncoder()
for col in sex_col:
    all_data[col] = le.fit_transform(all_data[col])  # encode Sex (male / female) as 0 and 1

Convert 'Pclass', 'Embarked', 'honorific', 'Fare_bin', and 'famiry_size' into dummy variables

cat_col = ['Pclass','Embarked','honorific','Fare_bin','famiry_size',]
all_data = pd.get_dummies(all_data, drop_first=True, columns=cat_col)  # turn these columns into 0/1 dummy variables

That's all for dividing each column.
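
As a quick check (a sketch; the exact column names depend on the dummy encoding above), the resulting feature set can be inspected like this:

# Confirm the shape and the columns produced by the preprocessing steps
print(all_data.shape)
print(all_data.columns.tolist())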

Divide all_data into train and test data

from sklearn.model_selection import train_test_split
 
train = all_data.query('train_or_test == "train"')  # extract the rows where train_or_test is "train"
test = all_data.query('train_or_test == "test"')
#Define target variables and columns unnecessary for learning
target_col = 'Survived'
drop_col = ['PassengerId','Survived', 'Name', 'Fare', 'Ticket', 'Cabin', 'train_or_test','Parch','SibSp','honorific_Jonkheer','honorific_Mme','honorific_Dona','honorific_Lady','honorific_Ms',]
#Holds only the features required for learning
train_feature = train.drop(columns=drop_col)
test_feature = test.drop(columns=drop_col)
train_tagert = train[target_col]
#Split train data
X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_tagert, test_size=0.3, random_state=0, stratify=train_tagert)

Learning

We use RandomForestClassifier, SVC, LogisticRegression, and CatBoostClassifier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier


rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
print('='*20)
print('RandomForestClassifier')
print(f'accuracy of train set: {rfc.score(X_train, y_train)}')
print(f'accuracy of test set: {rfc.score(X_test, y_test)}')


lr = LogisticRegression(random_state=0)
lr.fit(X_train, y_train)
print('='*20)
print('LogisticRegression')
print(f'accuracy of train set: {lr.score(X_train, y_train)}')
print(f'accuracy of test set: {lr.score(X_test, y_test)}')

svc = SVC(random_state=0)
svc.fit(X_train, y_train)
print('='*20)
print('SVC')
print(f'accuracy of train set: {svc.score(X_train, y_train)}')
print(f'accuracy of test set: {svc.score(X_test, y_test)}')


cat = CatBoostClassifier(random_state=0)
cat.fit(X_train, y_train)
print('='*20)
print('CAT')
print(f'accuracy of train set: {cat.score(X_train, y_train)}')
print(f'accuracy of test set: {cat.score(X_test, y_test)}')

The output:

RandomForestClassifier
accuracy of train set: 0.9678972712680578
accuracy of test set: 0.8246268656716418
LogisticRegression
accuracy of train set: 0.8426966292134831
accuracy of test set: 0.832089552238806
SVC
accuracy of train set: 0.8330658105939005
accuracy of test set: 0.8134328358208955
CAT
accuracy of train set: 0.9085072231139647
accuracy of test set: 0.8507462686567164

Adjust hyperparameters with Optuna and K-fold cross-validation

RandomForestClassifier

from sklearn.model_selection import StratifiedKFold, cross_validate
import optuna

cv = 10
def objective(trial):
    
    param_grid_rfc = {
        "max_depth": trial.suggest_int("max_depth", 5, 15),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
        'min_samples_split': trial.suggest_int("min_samples_split", 7, 15),
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
        'max_features': trial.suggest_int("max_features", 3, 10),
        "random_state": 0
    }
 
    model = RandomForestClassifier(**param_grid_rfc)
    
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()
 
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
rfc_best_param = study.best_params


LogisticRegression

import warnings
warnings.filterwarnings('ignore')

def objective(trial):
    
    param_grid_lr = {
        'C' : trial.suggest_int("C", 1, 100),
        "random_state": 0
    }

    model = LogisticRegression(**param_grid_lr)
    
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
lr_best_param = study.best_params 

SVC

import warnings
warnings.filterwarnings('ignore')

def objective(trial):
    
    param_grid_svc = {
        'C' : trial.suggest_int("C", 50, 200),
        'gamma': trial.suggest_loguniform("gamma", 1e-4, 1.0),
        "random_state": 0,
        'kernel': 'rbf'
    }

    model = SVC(**param_grid_svc)
    
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
svc_best_param = study.best_params

CatBoostClassifier

from sklearn.model_selection import train_test_split
from catboost import Pool
import sklearn.metrics
X = train_feature
y = train_tagert
categorical_features_indices = np.where(X.dtypes != np.float64)[0]  # indices of the non-float (categorical/integer) columns
def objective(trial):
    #Separate training data and test data
    X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_tagert, test_size=0.35, random_state=0, stratify=train_tagert)
    train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
    test_pool = Pool(X_test,y_test, cat_features=categorical_features_indices)

    #Parameter specification
    params = {
        'iterations' : trial.suggest_int('iterations', 50, 300),                         
        'depth' : trial.suggest_int('depth', 4, 10),                                       
        'learning_rate' : trial.suggest_loguniform('learning_rate', 0.01, 0.3),               
        'random_strength' :trial.suggest_int('random_strength', 0, 100),                       
        'bagging_temperature' :trial.suggest_loguniform('bagging_temperature', 0.01, 100.00), 
        'od_type': trial.suggest_categorical('od_type', ['IncToDec', 'Iter']),
        'od_wait' :trial.suggest_int('od_wait', 10, 50)
    }

    #Learning
    model = CatBoostClassifier(**params)
    # 5-Fold CV /Evaluate the model with Accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)

    # the study maximizes, so return the mean CV accuracy as-is
    return scores['test_score'].mean()

if __name__ == '__main__':
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=5)
    cat_best_param = study.best_params
    print(study.best_value)
    print(cat_best_param)


Verify accuracy again with the adjusted parameters

# 5-Fold CV /Evaluate the model with Accuracy
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

rfc_best = RandomForestClassifier(**rfc_best_param)
print('RandomForestClassifier')
scores = cross_validate(rfc_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

lr_best = LogisticRegression(**lr_best_param)
print('LogisticRegression')
scores = cross_validate(lr_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

svc_best = SVC(**svc_best_param)
print('SVC')
scores = cross_validate(svc_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

cat_best = CatBoostClassifier(**cat_best_param)
print('CAT')
scores = cross_validate(cat_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')

The results:

RandomForestClassifier
mean:0.827152263628423, std:0.029935476082608138
LogisticRegression
mean:0.8294309818436642, std:0.03568888547665349
SVC
mean:0.826034945192669, std:0.03392425879847107
CAT
mean:0.8249241166088076, std:0.030217830226771592

LogisticRegression comes out most accurate on average, but its std is also the highest.

# RandomForest
rfc_best = RandomForestClassifier(**rfc_best_param)
rfc_best.fit(train_feature, train_tagert)

# LogisticRegression
lr_best = LogisticRegression(**lr_best_param)
lr_best.fit(train_feature, train_tagert)

# SVC
svc_best = SVC(**svc_best_param)
svc_best.fit(train_feature, train_tagert)

#CatBoostClassifier
cat_best = CatBoostClassifier(**cat_best_param)
cat_best.fit(train_feature, train_tagert)


#Predict each
pred = {
    'rfc': rfc_best.predict(test_feature).astype(int),
    'lr': lr_best.predict(test_feature).astype(int),
    'svc': svc_best.predict(test_feature).astype(int),
    'cat': cat_best.predict(test_feature).astype(int)
}


#File output
for key, value in pred.items():
    pd.concat(
        [
            pd.DataFrame(test.PassengerId, columns=['PassengerId']).reset_index(drop=True),
            pd.DataFrame(value, columns=['Survived'])
       ],
        axis=1
    ).to_csv(f'output_{key}.csv', index=False)

Result

When submitted to Kaggle, SVC gave the best score.

After that I tried adjusting various columns and hyperparameters, but for now this is my best score. It will probably change depending on how the preprocessing is applied, so if I come up with a better method in the future I will try it out and verify it in practice.
