**Sites I referred to this time**
https://yolo-kiyoshi.com/2020/01/22/post-1588/
https://www.codexa.net/kaggle-titanic-beginner/
https://qiita.com/suzumi/items/8ce18bc90c942663d1e6
--Surviving in frigid water seems difficult. Were women and children given priority on the lifeboats?
--Weren't people of high social status given preferential treatment?
Check with info(). From info(), you can see that both Age and Cabin have missing values.
Summarizing my own long experience with data analysis, the usual guideline goes like this: a column with a moderate amount of missing data is filled with the mean or median, and a heavily missing column is dropped. By that rule, Age (moderately missing) would get the mean or median, and Cabin (heavily missing) would not be used. I would like to leave it at that, but that straightforward approach has already been covered clearly by others, so let's get our hands dirty on this part instead.
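For reference, here is a minimal way to quantify that missingness (assuming train_data was read from the usual Kaggle train.csv):
in:
train_data.isnull().sum() #Count missing values per column
In the standard train set this shows Age with 177 of 891 values missing, Cabin with 687 missing, and Embarked with 2.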
Regarding the missing Age values, I think the key is Name. In particular, the honorific title it contains is valuable information: it tells men from women, adults from children, married from unmarried women, high status, and so on. The passengers average out to roughly their 30s, but the presence of children and of older people in high positions means that imputing a single overall average would make Age less accurate.
So, first of all, I will extract the title.
in: #Show name
train_data['Name']
out:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
Taking a bird's-eye view of the data, the title sits between ", " and ". ".
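To make that concrete, splitting one of the names above at ', ' and then at '. ' isolates the title:
in:
name = 'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
name.split(', ')[1].split('. ')[0] #Take the part after ', ', then cut it off at '. '
out:
'Mrs'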
in: #I want to extract the title from the name
#Merge train and test
train_data1 = train_data.copy()
test_data1 = test_data.copy()
train_data1['train_or_test'] = 'train'
test_data1['train_or_test'] = 'test'
test_data1['Survived'] = np.nan #Set the Survived column to NaN for the test data
all_data = pd.concat(
    [
        train_data1,
        test_data1
    ],
    sort=False,
    axis=0 #Concatenate train_data1 and test_data1 vertically (along the rows)
).reset_index(drop=True)
#Extract the title from all_data and aggregate Age statistics per title
all_data['honorific'] = all_data['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
Average_age = all_data['Age'].groupby(all_data['honorific']).agg(['count','mean','median','std'])
Average_age
Running this gives the average age (plus count, median, and std) for each title. Now, based on this table, let's fill in the missing ages with the per-title averages.
in: #Apply the per-title average age to the missing values
f = lambda x: x.fillna(x.mean())
age_complement = all_data.groupby('honorific').transform(f)
age_complement.info()
out:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
PassengerId 1309 non-null int64
Survived 1308 non-null float64
Pclass 1309 non-null int64
Age 1309 non-null float64
SibSp 1309 non-null int64
Parch 1309 non-null int64
Fare 1309 non-null float64
dtypes: float64(3), int64(4)
memory usage: 71.7 KB
This completes the age imputation. But the other columns disappeared: groupby().transform() keeps only the columns it can compute on, so the non-numeric ones are dropped (and the remaining numeric ones, such as Survived, also got group-mean-filled, which we simply ignore). So we transfer the imputed Age back into the original all_data.
del(all_data['Age']) #Delete the old Age column from all_data
all_data['Age'] = age_complement['Age'] #Recreate the Age column with the imputed values
This completes the processing of missing values in Age data.
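As an aside, a minimal alternative sketch that avoids the disappearing-column issue by transforming only the Age column in place (same grouping, same result):
in:
all_data['Age'] = all_data.groupby('honorific')['Age'].transform(lambda s: s.fillna(s.mean()))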
Looking at the data, the variation in Fare bothered me. Could it be that Fare is not the price per person but the total paid by the whole group on one ticket? So we count how many passengers share each ticket and convert Fare back to a reasonable per-person price.
in: # 1. Create a dictionary of how many passengers share each Ticket
double_check_dict = all_data['Ticket'].value_counts().to_dict()
# 2. Add the share count to the DataFrame as a new column
all_data['double_check'] = all_data['Ticket'].apply(lambda x: double_check_dict[x] if x in double_check_dict else 0)
all_data
double_check is the number of passengers sharing a ticket. Sure enough, Fare soars exactly where many passengers share one ticket, so we divide Fare by that count.
in:
all_data['Fare']=all_data['Fare']/all_data['double_check']
all_data
With this, we were able to suppress price variations.
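As a quick sanity check (a hypothetical verification, not part of the original flow), every passenger sharing a ticket should now show the identical per-person fare:
in:
shared = all_data[all_data['double_check'] > 1]
shared.groupby('Ticket')['Fare'].nunique().max() #Expect 1: one per-person price per shared ticket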
**If Cabin rooms are stratified by price, couldn't we recover most of the missing rooms from Fare? That is my thought.**
We will decompose the rows that contain Cabin data.
--Extract the first letter of the Cabin value (the deck) and break it down into A, B, C, D, ...
--A group that stayed in multiple rooms is recorded like "C12 C13 C14", space-separated, so the number of rooms rented is (number of spaces + 1).
in:
cabin_data = all_data.dropna(subset=['Cabin']).copy() #Extract only the rows that have Cabin data
cabin_data['Cabin_id'] = cabin_data['Cabin'].map(lambda x: x[0]) #Put the first letter of Cabin (the deck) into Cabin_id
cabin_data['room'] = cabin_data['Cabin'].map(lambda x: x.count(' ')) + 1 #Number of rooms = number of spaces + 1
cabin_data.head(50)
To see whether each deck carries a price difference, we first treat the data as one whole, then split by Pclass 1/2/3 and consider the per-deck prices within each class.
cabin_data_1 = cabin_data.query('Pclass == 1')
cabin_data_2 = cabin_data.query('Pclass == 2')
cabin_data_3 = cabin_data.query('Pclass == 3')
Average_ageC=cabin_data['Fare'].groupby(cabin_data['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC
The above are the overall per-deck prices.
Average_ageC=cabin_data_1['Fare'].groupby(cabin_data_1['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC
Pclass1
Average_ageC=cabin_data_2['Fare'].groupby(cabin_data_2['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC
Pclass2
Average_ageC=cabin_data_3['Fare'].groupby(cabin_data_3['Cabin_id']).agg(['count','mean','median','std','max','min'])
Average_ageC
Pclass3
Eyeballing it roughly, that is how it looks. If the fares clearly differed from deck to deck within the same class band, we could estimate the missing Cabin values, but they are nearly identical, so an exact estimate does not seem possible. Hmm, too bad.
all_data['famiry_size'] = all_data['SibSp'] + all_data['Parch'] + 1 #Family size = SibSp + Parch + the passenger themselves
Next we split Fare into 11 quantile bins; I feel there is still room for deeper digging here.
#Fare split
all_data['Fare_bin'] = pd.qcut(all_data.Fare, 11) #Divide Fare into 11 quantile bins as Fare_bin
from sklearn.preprocessing import LabelEncoder #Needed for the encoding below
sex_col = ['Sex']
le = LabelEncoder()
for col in sex_col:
    all_data[col] = le.fit_transform(all_data[col]) #Encode Sex (male/female) as 0 and 1
cat_col = ['Pclass','Embarked','honorific','Fare_bin','famiry_size']
all_data = pd.get_dummies(all_data, drop_first=True, columns=cat_col) #Turn 'Pclass','Embarked','honorific','Fare_bin','famiry_size' into 0/1 dummy variables
That's all for the column transformations.
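For illustration, a toy example (a hypothetical mini-frame, not the actual data) of what drop_first=True does: the alphabetically first category becomes the implicit baseline encoded as all zeros:
in:
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
pd.get_dummies(toy, columns=['Embarked'], drop_first=True)
out:
   Embarked_Q  Embarked_S
0           0           1
1           0           0
2           1           0
3           0           1
(newer pandas versions print True/False instead of 1/0)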
from sklearn.model_selection import train_test_split
train = all_data.query('train_or_test == "train"') #Extract the train rows
test = all_data.query('train_or_test == "test"')
#Define the target variable and the columns unnecessary for learning
target_col = 'Survived'
drop_col = ['PassengerId','Survived', 'Name', 'Fare', 'Ticket', 'Cabin', 'train_or_test','Parch','SibSp','honorific_Jonkheer','honorific_Mme','honorific_Dona','honorific_Lady','honorific_Ms',]
#Keep only the features required for learning
train_feature = train.drop(columns=drop_col)
test_feature = test.drop(columns=drop_col)
train_tagert = train[target_col]
#Split the train data
X_train, X_test, y_train, y_test = train_test_split(
    train_feature, train_tagert, test_size=0.3, random_state=0, stratify=train_tagert)
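A quick note on stratify (a hypothetical check, not in the original): it keeps the Survived ratio the same in both splits, roughly the overall survival rate of about 38%:
in:
y_train.mean(), y_test.mean() #Both should be close to 0.38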
We use RandomForestClassifier, SVC, LogisticRegression, and CatBoostClassifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
print('='*20)
print('RandomForestClassifier')
print(f'accuracy of train set: {rfc.score(X_train, y_train)}')
print(f'accuracy of test set: {rfc.score(X_test, y_test)}')
lr = LogisticRegression(random_state=0)
lr.fit(X_train, y_train)
print('='*20)
print('LogisticRegression')
print(f'accuracy of train set: {lr.score(X_train, y_train)}')
print(f'accuracy of test set: {lr.score(X_test, y_test)}')
svc = SVC(random_state=0)
svc.fit(X_train, y_train)
print('='*20)
print('SVC')
print(f'accuracy of train set: {svc.score(X_train, y_train)}')
print(f'accuracy of test set: {svc.score(X_test, y_test)}')
cat = CatBoostClassifier(random_state=0)
cat.fit(X_train, y_train)
print('='*20)
print('CAT')
print(f'accuracy of train set: {cat.score(X_train, y_train)}')
print(f'accuracy of test set: {cat.score(X_test, y_test)}')
out:
RandomForestClassifier
accuracy of train set: 0.9678972712680578
accuracy of test set: 0.8246268656716418
LogisticRegression
accuracy of train set: 0.8426966292134831
accuracy of test set: 0.832089552238806
SVC
accuracy of train set: 0.8330658105939005
accuracy of test set: 0.8134328358208955
CAT
accuracy of train set: 0.9085072231139647
accuracy of test set: 0.8507462686567164
Next, we tune the hyperparameters of each model with Optuna. First, RandomForestClassifier.
from sklearn.model_selection import StratifiedKFold, cross_validate
import optuna
cv = 10
def objective(trial):
    param_grid_rfc = {
        "max_depth": trial.suggest_int("max_depth", 5, 15),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
        'min_samples_split': trial.suggest_int("min_samples_split", 7, 15),
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
        'max_features': trial.suggest_int("max_features", 3, 10),
        "random_state": 0
    }
    model = RandomForestClassifier(**param_grid_rfc)
    # 5-fold CV / evaluate the model with accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    #direction='maximize', so return the mean CV accuracy as-is
    return scores['test_score'].mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
rfc_best_param = study.best_params
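Optionally, the tuning history can be inspected as a DataFrame via the standard Optuna API (an extra step, not part of the original post):
in:
study.trials_dataframe()[['number', 'value']].head() #Trial index and its mean CV accuracy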
LogisticRegression
import warnings
warnings.filterwarnings('ignore')
def objective(trial):
    param_grid_lr = {
        'C' : trial.suggest_int("C", 1, 100),
        "random_state": 0
    }
    model = LogisticRegression(**param_grid_lr)
    # 5-fold CV / evaluate the model with accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    #direction='maximize', so return the mean CV accuracy as-is
    return scores['test_score'].mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
lr_best_param = study.best_params
SVC
import warnings
warnings.filterwarnings('ignore')
def objective(trial):
    param_grid_svc = {
        'C' : trial.suggest_int("C", 50, 200),
        'gamma': trial.suggest_loguniform("gamma", 1e-4, 1.0),
        "random_state": 0,
        'kernel': 'rbf'
    }
    model = SVC(**param_grid_svc)
    # 5-fold CV / evaluate the model with accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    #direction='maximize', so return the mean CV accuracy as-is
    return scores['test_score'].mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
print(study.best_value)
svc_best_param = study.best_params
CatBoostClassifier
from sklearn.model_selection import train_test_split
from catboost import Pool
import sklearn.metrics
X = train_feature
y = train_tagert #The training target (not test_feature)
categorical_features_indices = np.where(X.dtypes != np.float64)[0] #Indices of the non-float (categorical) columns
def objective(trial):
    #Separate training data and test data
    X_train, X_test, y_train, y_test = train_test_split(
        train_feature, train_tagert, test_size=0.35, random_state=0, stratify=train_tagert)
    #CatBoost Pools (not used by cross_validate below, which refits the model itself)
    train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
    test_pool = Pool(X_test, y_test, cat_features=categorical_features_indices)
    #Parameter specification
    params = {
        'iterations' : trial.suggest_int('iterations', 50, 300),
        'depth' : trial.suggest_int('depth', 4, 10),
        'learning_rate' : trial.suggest_loguniform('learning_rate', 0.01, 0.3),
        'random_strength' : trial.suggest_int('random_strength', 0, 100),
        'bagging_temperature' : trial.suggest_loguniform('bagging_temperature', 0.01, 100.00),
        'od_type': trial.suggest_categorical('od_type', ['IncToDec', 'Iter']),
        'od_wait' : trial.suggest_int('od_wait', 10, 50)
    }
    #Build the model
    model = CatBoostClassifier(**params)
    # 5-fold CV / evaluate the model with accuracy
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_validate(model, X=X_train, y=y_train, cv=kf)
    #direction='maximize', so return the mean CV accuracy as-is
    return scores['test_score'].mean()
if __name__ == '__main__':
    study = optuna.create_study(direction='maximize') #Maximize accuracy, matching the other studies
    study.optimize(objective, n_trials=5)
    cat_best_param = study.best_params
    print(study.best_value)
    print(cat_best_param)
# 5-Fold CV /Evaluate the model with Accuracy
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rfc_best = RandomForestClassifier(**rfc_best_param)
print('RandomForestClassifier')
scores = cross_validate(rfc_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')
lr_best = LogisticRegression(**lr_best_param)
print('LogisticRegression')
scores = cross_validate(lr_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')
svc_best = SVC(**svc_best_param)
print('SVC')
scores = cross_validate(svc_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')
cat_best = CatBoostClassifier(**cat_best_param) #Instantiate with the tuned parameters
print('CAT')
scores = cross_validate(cat_best, X=train_feature, y=train_tagert, cv=kf)
print(f'mean:{scores["test_score"].mean()}, std:{scores["test_score"].std()}')
out:
RandomForestClassifier
mean:0.827152263628423, std:0.029935476082608138
LogisticRegression
mean:0.8294309818436642, std:0.03568888547665349
SVC
mean:0.826034945192669, std:0.03392425879847107
CAT
mean:0.8249241166088076, std:0.030217830226771592
LogisticRegression has the highest mean accuracy, but its standard deviation is also on the high side.
# RandomForest
rfc_best = RandomForestClassifier(**rfc_best_param)
rfc_best.fit(train_feature, train_tagert)
# LogisticRegression
lr_best = LogisticRegression(**lr_best_param)
lr_best.fit(train_feature, train_tagert)
# SVC
svc_best = SVC(**svc_best_param)
svc_best.fit(train_feature, train_tagert)
#CatBoostClassifier
cat_best = CatBoostClassifier(**cat_best_param)
cat_best.fit(train_feature, train_tagert)
#Predict with each model
pred = {
    'rfc': rfc_best.predict(test_feature).astype(int),
    'lr': lr_best.predict(test_feature).astype(int),
    'svc': svc_best.predict(test_feature).astype(int),
    'cat': cat_best.predict(test_feature).astype(int)
}
#File output
for key, value in pred.items():
    pd.concat(
        [
            pd.DataFrame(test.PassengerId, columns=['PassengerId']).reset_index(drop=True),
            pd.DataFrame(value, columns=['Survived'])
        ],
        axis=1
    ).to_csv(f'output_{key}.csv', index=False)
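As a final hypothetical check (not in the original post), each output file should match the Kaggle submission format of 418 rows and the two columns PassengerId and Survived:
in:
sub = pd.read_csv('output_svc.csv')
sub.shape #Expect (418, 2) for the Titanic test set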
When submitted to Kaggle, SVC gave the best score.
After that, I tried adjusting various columns and hyperparameters, but for now this is my highest score. I think the result will change depending on how the preprocessing is applied, so if I come up with a better method in the future, I will verify it in practice.