As a tutorial for learning machine learning, this post is a record of how I predicted the Titanic survivors, the classic exercise that everyone tries at some point.
About the version used
The data used was downloaded from the following Kaggle competition page after registering: https://www.kaggle.com/c/titanic
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Read the training data
df = pd.read_csv('train.csv')
The data is loaded into a DataFrame called df. Now let's look at the first five rows of the data we read in.
df.head()
Survived is 1 for survived and 0 for died. Please see the official competition page for what the other columns represent.
Next, let's look at the histogram.
df.hist(figsize=(12,12))
plt.show()
You can see that most passengers are in their 20s and 30s, and that most have a Pclass of 3 (the cheapest class).
#Correlation heatmap of the numeric columns
plt.figure(figsize=(15, 15))
sns.heatmap(df.corr(), annot=True)
The indicators with a relatively high correlation coefficient with Survived are Fare (0.26) and Pclass (-0.34). You might think that once the Pclass grade is decided the fare follows naturally, but the two are treated as separate indicators here.
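If you want to read the correlations with Survived off directly rather than from the heatmap, the following one-liner is a minimal sketch of my own; the numeric_only=True argument is for newer pandas versions (drop it on older ones) so that string columns are skipped.

#List correlation coefficients with Survived, sorted (my own addition, not part of the original analysis)
df.corr(numeric_only=True)['Survived'].sort_values()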
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
You can see that Age and Cabin (the cabin number) have many missing values. For ways to handle missing values, refer to the following. https://qiita.com/0NE_shoT_/items/8db6d909e8b48adcb203
This time I decided to fill Age with the median. Embarked (the port of embarkation) is filled with the most frequent value, S. The remaining column with many missing values, Cabin, is dropped.
from sklearn.model_selection import train_test_split
#Missing value processing
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna('S')
#Conversion of categorical variables
df['Sex'] = df['Sex'].apply(lambda x: 1 if x == 'male' else 0)
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
#Delete unnecessary columns
df = df.drop(['Cabin', 'Name', 'PassengerId', 'Ticket'], axis=1)
#Split into features and labels, then hold out test data
train_X = df.drop('Survived', axis=1)
train_y = df.Survived
(train_X, test_X, train_y, test_y) = train_test_split(train_X, train_y, test_size=0.3, random_state=0)
This time, 30% of the data is held out as test data (test_size=0.3).
from sklearn.tree import DecisionTreeClassifier

#Train a decision tree classifier and predict on the test data
clf = DecisionTreeClassifier(criterion='gini', random_state=0)
clf = clf.fit(train_X, train_y)
pred = clf.predict(test_X)
#Calculate the accuracy
from sklearn.metrics import roc_curve, auc, accuracy_score
fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
auc(fpr, tpr)
accuracy_score(test_y, pred)
I referred to the following for how to handle the decision tree parameters. http://data-analysis-stats.jp/2019/01/14/%E6%B1%BA%E5%AE%9A%E6%9C%A8%E5%88%86%E6%9E%90%E3%81%AE%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E8%A7%A3%E8%AA%AC/
criterion='gini'   : 0.7798507462686567
criterion='entropy': 0.7910447761194029
The accuracy was slightly higher with entropy. According to the reference URL, as a rough guide the Gini impurity is said to be better suited to continuous features and entropy to categorical features: the Gini impurity aims to minimize misclassification, while entropy selects splits by information gain. Since this dataset contains a lot of categorical data such as sex and the port of embarkation, entropy may be the more appropriate choice.
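As a quick check, here is a minimal sketch of my own that reuses the train_X/test_X split from above to train one tree per criterion and print its accuracy; the exact scores depend on the split and random_state.

#Compare the two split criteria on the same held-out data (my own addition)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

for criterion in ['gini', 'entropy']:
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree = tree.fit(train_X, train_y)
    print(criterion, accuracy_score(test_y, tree.predict(test_X)))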
This article is also helpful, as it covers decision trees in detail. https://qiita.com/3000manJPY/items/ef7495960f472ec14377
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0) #Create a random forest instance
clf = clf.fit(train_X, train_y) #Train the model with the fit method using the training data and labels
pred = clf.predict(test_X)
fpr, tpr, thresholds = roc_curve(test_y, pred, pos_label=1)
auc(fpr, tpr)
accuracy_score(test_y, pred)
0.8283582089552238
Random forests come in a classification version (Classifier) and a regression version (Regressor). Since the goal this time is to classify survival versus death, we use the classifier.
I experimented with the learning parameters and arguments of the RandomForestClassifier class while referring to this page. https://data-science.gr.jp/implementation/iml_sklearn_random_forest.html
As I increased n_estimators, the accuracy improved. That said, n_estimators (like the number of epochs in a neural network) is said to often cause trouble when set to very large values, spending a lot of computation for little gain. Please see the following for details and countermeasures, and the sketch below for a concrete comparison. https://amalog.hateblo.jp/entry/hyper-parameter-search
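To see the effect of n_estimators concretely, here is a minimal sketch of my own that sweeps a few values on the same held-out split; it uses a separate variable name (rf) so the clf used for the final prediction is not overwritten, and the exact scores will vary.

#Sweep n_estimators and print the accuracy for each value (my own addition)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

for n in [10, 50, 100, 200]:
    rf = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=0)
    rf = rf.fit(train_X, train_y)
    print(n, accuracy_score(test_y, rf.predict(test_X)))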
From the results so far, the random forest achieved a higher accuracy than the decision tree, so I decided to use the random forest for the final prediction.
fin = pd.read_csv('test.csv')
fin.head()
passenger_id = fin['PassengerId'] #Keep the PassengerId column for the submission file
fin.isnull().sum()
#Missing value processing
fin['Fare'] = fin['Fare'].fillna(fin['Fare'].median())
fin['Age'] = fin['Age'].fillna(fin['Age'].median())
fin['Embarked'] = fin['Embarked'].fillna('S')
#Conversion of categorical variables
fin['Sex'] = fin['Sex'].apply(lambda x: 1 if x == 'male' else 0)
fin['Embarked'] = fin['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
#Delete unnecessary columns
fin = fin.drop(['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1)
#Predict with the random forest and create the submission file
predictions = clf.predict(fin)
submission = pd.DataFrame({'PassengerId': passenger_id, 'Survived': predictions})
submission.to_csv('submission.csv', index=False)
The result was around 4900th out of roughly 16000. I would like to keep working to improve it.