[PYTHON] [Note] I want to thoroughly preprocess the Titanic data -Age edition-

This is a second attempt at my first post, which I couldn't finish because I ran out of energy partway through. I'm editing it to make it a little easier to follow.


The Age column of the Titanic data has many missing values, so I thought accuracy might improve if all of them were filled in.

Given data

The libraries imported this time are as follows.

import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import RandomForestRegressor as RFR

I habitually abbreviate long names, so if you run into an unfamiliar name along the way, assume it is one of these aliases.

Putting the original data in a DataFrame looks like this.

PassengerId  Survived  Pclass  Name                     Sex   Age   SibSp  Parch  Ticket     Fare    Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris  male  22.0  1      0      A/5 21171  7.2500  NaN    S
...          ...       ...     ...                      ...   ...   ...    ...    ...        ...     ...    ...
891          0         3       Dooley, Mr. Patrick      male  32.0  0      0      370376     7.7500  NaN    Q

Writing it all out is a bit tedious, so only the first and last rows are shown. This is the Titanic training data that everyone knows. Let's check its summary with df.info().

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

The data has 891 rows × 12 columns. Age, Cabin, and Embarked have missing values, and Name, Sex, Ticket, Cabin, and Embarked are object (string) columns, so they cannot be used as they are.

[Preprocessing 1] Sex and Embarked

This is a step that probably everyone does. Sex and Embarked are object columns, but since each has only two or three distinct values, I replace them with simple integers. Embarked has two missing values, but because there are so few and the distribution is skewed toward one value, I fill them with the mode.

# Count how many times each value appears in Embarked
df['Embarked'].value_counts()
S    644
C    168
Q     77
# Convert Sex to numbers
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
# Convert Embarked to numbers
df['Embarked'] = df['Embarked'].map({'S':0, 'C':1, 'Q':2})
# S (0) is the most common Embarked value, so fill the missing values with it
df['Embarked'] = df['Embarked'].fillna(0)
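As a small aside, the mode does not have to be hard-coded as 0. The following is a minimal sketch on a made-up toy column (the `toy` DataFrame and its values are assumptions for illustration, not the Titanic data) that looks the mode up first and then maps to integers.

```python
import pandas as pd

# Toy stand-in for the Embarked column (made-up values, two of them missing).
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', None, 'Q', 'S', None]})

# value_counts() skips missing values by default.
counts = toy['Embarked'].value_counts()

# mode() returns a Series because ties are possible; take the first entry.
most_common = toy['Embarked'].mode()[0]

# Fill with the mode first, then map the strings to integers.
toy['Embarked'] = toy['Embarked'].fillna(most_common).map({'S': 0, 'C': 1, 'Q': 2})
```

Filling before mapping keeps the column as integers, whereas filling after mapping (as above in the post) leaves it as float64 because of the temporary NaN values.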

[Baseline] Fill in Age with the mean

As a point of comparison, I first fill in the missing Age values with the mean. I then feed the data into a random forest and use the score as the baseline.

# Work on a copy (df_base) so that Age in df keeps its missing values for the later steps
df_base = df.copy()
# Fill the missing Age values with the mean for now
df_base['Age'] = df_base['Age'].fillna(df_base['Age'].mean())

# Build the feature data without the object columns Name, Ticket, and Cabin
df_data = df_base[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
df_label = df_base['Survived']  # 1-D Series, which is the shape fit() expects for labels

(train_data, test_data, train_label, test_label) = tts(df_data, df_label, test_size=0.3, random_state=0)
# tts is the alias for train_test_split

clf = RFC()  # RFC is the alias for RandomForestClassifier
clf.fit(train_data, train_label)
clf.score(test_data, test_label)

Since I only want to compare before and after the preprocessing this time, the parameters are left at their defaults.
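One caveat with default parameters: a random forest is randomized, so RFC() without random_state can give a slightly different score on each run. The sketch below uses synthetic data from make_classification (an assumption, not the Titanic set) and a hypothetical helper fit_and_score to show that fixing random_state makes the score repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split as tts

# Synthetic stand-in data with 7 features; not the Titanic set.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
train_X, test_X, train_y, test_y = tts(X, y, test_size=0.3, random_state=0)


def fit_and_score(seed):
    """Hypothetical helper: train a forest with a fixed seed and return its test accuracy."""
    clf = RFC(random_state=seed)  # the seed pins the bootstrap samples and feature subsets
    clf.fit(train_X, train_y)
    return clf.score(test_X, test_y)


score_a = fit_and_score(0)
score_b = fit_and_score(0)  # same seed, same data, so the same score as score_a
```

With the seed fixed, a change in the score can be attributed to the preprocessing rather than to run-to-run noise.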

The result is...


The baseline turned out a little high... Can we aim even higher?

[Preprocessing 2] Observing the Age feature

I'd like to believe that survival depends on age. So I thought it would be great if the approximate age could be inferred from the other data.

First, let's look at the correlations with the known Age values.

Let's look at the correlation coefficients with df.corr() after dropping the rows with missing Age using dropna.

     PassengerId  Survived   Pclass     Sex        Age       SibSp      Parch      Fare      Embarked
Age  0.033207     -0.069809  -0.331339  -0.084153  1.000000  -0.232625  -0.179191  0.091566  0.007461
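The dropna-then-corr step can be sketched as follows on a made-up toy frame (the `toy` values are assumptions for illustration). Note that on pandas 2.0 and later, corr() raises an error if string columns are present, so numeric_only=True is passed here.

```python
import pandas as pd

# Toy frame: one string column plus a few numeric ones, with one missing Age.
toy = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd', 'e'],   # string column, like Name/Ticket/Cabin
    'Pclass': [3, 1, 3, 2, 1],
    'SibSp': [1, 0, 4, 0, 1],
    'Age': [22.0, 54.0, None, 30.0, 38.0],
})

# Keep only the rows where Age is known.
known_age = toy.dropna(subset=['Age'])

# Correlation of every numeric column with Age; string columns are skipped.
age_corr = known_age.corr(numeric_only=True)['Age']
```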

Looking at this, Pclass, SibSp, and Parch show the strongest correlations with Age (about -0.33, -0.23, and -0.18), and all three are negative. The negative correlation with Pclass probably means that older passengers tend to travel in a higher class (a smaller Pclass number). As for SibSp and Parch, big families include many children, which would explain the relationship. I suspect Fare is not strongly correlated because it is totaled per family.

What is the relationship with Name?

At this stage, the remaining string columns are Name, Ticket, and Cabin. Of these, Name contains titles such as Mr and Miss. I learned in English class that Miss is for unmarried women and Mrs is for married women. So I thought that analyzing the title might tell us something about age.

[Preprocessing 3-1] Extract the title from Name

In Name, the title sits between "," and ".", as in "Braund, Mr. Owen Harris". I extract just that part and save it in a column named "Honorific".

I used apply() for the extraction this time, but map() with the same lambda gave exactly the same result. Watch out here!! There is a half-width space between "," and the title. At first I left it out, the titles weren't matched, and I lost a lot of time wondering why.

# Don't forget the half-width space after the comma!
df['Honorific'] = df['Name'].apply(lambda x: x.split(', ')[1].split('.')[0])
df['Honorific'].value_counts()
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Lady              1
Ms                1
the Countess      1
Don               1
Mme               1
Jonkheer          1
Sir               1
Capt              1
Name: Honorific, dtype: int64
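As an alternative to the two split() calls, the same extraction can be sketched with a regular expression via Series.str.extract, which also tolerates a missing or extra space after the comma. The sample strings below are assumptions that follow the Name format shown above.

```python
import pandas as pd

# Sample strings in the same "Surname, Title. Given names" format.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Dooley, Mr. Patrick',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# Capture everything between the comma (plus optional whitespace) and the first period.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

# The split-based version from the post, for comparison.
titles_split = names.apply(lambda x: x.split(', ')[1].split('.')[0])
```

Both approaches give the same titles on these samples; the regex version simply moves the whitespace handling into the pattern.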

Several titles appear only once. Let's look at the Age statistics for each title with describe().
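The exact call is not shown in this post. One way to get per-title Age statistics, sketched here on a made-up toy frame (the values are assumptions), is to group by the title column and call describe() on Age.

```python
import pandas as pd

# Toy frame with a title column and an Age column (one Age missing).
toy = pd.DataFrame({
    'Honorific': ['Mr', 'Mr', 'Miss', 'Master', 'Capt', 'Mr'],
    'Age': [22.0, 32.0, 21.0, 4.0, 70.0, None],
})

# One row per title, with count/mean/std/min/quartiles/max of the known ages.
age_by_title = toy.groupby('Honorific')['Age'].describe()
```

The count column only counts rows where Age is present, so comparing it with value_counts() shows how many ages are missing for each title.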


The result looks like this.

Honorific     count  mean       std        min    25%     50%   75%    max
Capt          1.0    70.000000  NaN        70.00  70.000  70.0  70.00  70.0
Col           2.0    58.000000  2.828427   56.00  57.000  58.0  59.00  60.0
Don           1.0    40.000000  NaN        40.00  40.000  40.0  40.00  40.0
Dr            6.0    42.000000  12.016655  23.00  35.000  46.5  49.75  54.0
Jonkheer      1.0    38.000000  NaN        38.00  38.000  38.0  38.00  38.0
Lady          1.0    48.000000  NaN        48.00  48.000  48.0  48.00  48.0
Major         2.0    48.500000  4.949747   45.00  46.750  48.5  50.25  52.0
Master        36.0   4.574167   3.619872   0.42   1.000   3.5   8.00   12.0
Miss          146.0  21.773973  12.990292  0.75   14.125  21.0  30.00  63.0
Mlle          2.0    24.000000  0.000000   24.00  24.000  24.0  24.00  24.0
Mme           1.0    24.000000  NaN        24.00  24.000  24.0  24.00  24.0
Mr            398.0  32.368090  12.708793  11.00  23.000  30.0  39.00  80.0
Mrs           108.0  35.898148  11.433628  14.00  27.750  35.0  44.00  63.0
Ms            1.0    28.000000  NaN        28.00  28.000  28.0  28.00  28.0
Rev           6.0    43.166667  13.136463  27.00  31.500  46.5  53.25  57.0
Sir           1.0    49.000000  NaN        49.00  49.000  49.0  49.00  49.0
the Countess  1.0    33.000000  NaN        33.00  33.000  33.0  33.00  33.0

Master seems to be used for young boys. But there are Masters as old as 12 and Misters as young as 11... Also, Mrs should mean a married woman, but the youngest is 14. Really?? And std (standard deviation) is NaN when there is only one person. That makes sense.

To use the titles as data, they have to be converted to numbers. However, converting every single one is tedious.

Looking at the table above, the titles held by only one person all have a known Age, so I will ignore them this time. I put the rows with the titles to use in df_name, and keep the unused rows as well because they will be needed later. Putting "~" in front of the condition negates it, so it selects everything else.

# Rows to use (copy() avoids SettingWithCopyWarning when the column is overwritten below)
df_name = df[df['Honorific'].isin(['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Major', 'Mlle', 'Col'])].copy()
# Rows not used
df_unneed_name = df[~df['Honorific'].isin(['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Major', 'Mlle', 'Col'])].copy()
# Convert Honorific to numbers
df_name['Honorific'] = df_name['Honorific'].map({'Mr':0, 'Miss':1, 'Mrs':2, 'Master':3, 'Dr':4, 'Rev':5, 'Major':6, 'Mlle':7, 'Col':8})
df_name['Honorific'].value_counts()
0    517
1    182
2    125
3     40
4      7
5      6
8      2
7      2
6      2
Name: Honorific, dtype: int64
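To make the role of "~" concrete, here is a minimal sketch on a made-up toy column (the `toy` frame and the `keep_titles` shortlist are assumptions for illustration). Storing the boolean mask once and negating it guarantees that the two parts never overlap and together cover every row.

```python
import pandas as pd

toy = pd.DataFrame({'Honorific': ['Mr', 'Miss', 'Capt', 'Mrs', 'Sir']})
keep_titles = ['Mr', 'Miss', 'Mrs']   # hypothetical shortlist of titles to keep

# Boolean mask: True where the title is in the shortlist.
mask = toy['Honorific'].isin(keep_titles)

used = toy[mask].copy()      # rows whose title is in the list
unused = toy[~mask].copy()   # "~" flips the mask, so this is every other row
```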

[Preprocessing 3-2] Predict Age from other features

Alright, everything is finally ready, so let's make the prediction. The rows that have Age are used for training, and the rows with missing Age are the ones to predict.

df_Agefill = df_name.dropna(subset=['Age'])
df_Agefill_data = df_Agefill[['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked','Honorific']]    # Training data
df_Agefill_label = df_Agefill[['Age']]    # Training labels

df_Agenull = df_name[df_name['Age'].isnull()].copy()
df_Agenull_data = df_Agenull[['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked','Honorific']]  # Test data (rows whose Age is missing)
df_Agenull_label = df_Agenull[['Age']].copy()  # Test labels (all NaN for now; filled in with the predictions)

By the way, looking at the correlation coefficients of the training data with df_Agefill.corr(), the correlation coefficient between Age and Honorific was


Even though I worked so hard... No, I won't give up! The correlation coefficient only captures linear relationships, after all, so it may just be that the numbers I assigned to the titles don't line up with age!
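The point about the assigned numbers can be illustrated with a small sketch on made-up ages (the `toy` values are assumptions, not taken from the dataset). The same four titles are encoded twice: once with arbitrary integer codes and once with each title's mean age, which is used here purely for illustration.

```python
import pandas as pd

# Made-up ages for four titles, two people each.
toy = pd.DataFrame({
    'Honorific': ['Mr', 'Mr', 'Miss', 'Miss', 'Mrs', 'Mrs', 'Master', 'Master'],
    'Age': [30.0, 34.0, 20.0, 24.0, 35.0, 37.0, 3.0, 5.0],
})

# Encoding 1: arbitrary integer codes, in the same order as in the post.
arbitrary = toy['Honorific'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3})

# Encoding 2: replace each title by the mean age of its group (illustration only).
mean_age = toy.groupby('Honorific')['Age'].transform('mean')

# Pearson correlation of each encoding with Age.
r_arbitrary = arbitrary.corr(toy['Age'])
r_mean = mean_age.corr(toy['Age'])
```

On this toy data the arbitrary codes give a noticeably weaker correlation than the mean-age encoding, even though both carry exactly the same grouping. A random forest splits on thresholds rather than fitting a line, so a weak linear correlation does not by itself mean the feature is useless.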

# RFR is the alias for RandomForestRegressor
clf = RFR()
clf.fit(df_Agefill_data, df_Agefill_label['Age'])  # pass the labels as a 1-D Series

# Store the predictions in the label DataFrame
age_answer = clf.predict(df_Agenull_data)
df_Agenull_label = df_Agenull_label.assign(Age=age_answer)

5   	37.987776
17  	31.422079
19  	26.808000
26  	32.879936
28  	20.253988
... 	...
859 	24.929030
863 	15.495167
868 	25.716969
878 	27.344498
888 	7.838333

Since Age is a continuous value rather than a class, I use the regression tool RandomForestRegressor. The predictions are floating-point numbers, but they look plausible, don't they? That said, there is no ground truth for these rows, so I can't tell whether they are accurate.
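Although the missing rows have no answer, the rows with a known Age do, so part of them can be held out to get a rough idea of the error. The following is a minimal sketch on synthetic data (the `toy` frame, its two features, and the formula generating Age are all assumptions for illustration), comparing the regressor with simply predicting the mean.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split as tts

# Synthetic stand-in: Age loosely depends on two integer features plus noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'Honorific': rng.integers(0, 4, n),
})
toy['Age'] = 40 - 5 * toy['Pclass'] + 3 * toy['Honorific'] + rng.normal(0, 2, n)

# Hold out 30% of the rows with known Age to act as the "answer".
X = toy[['Pclass', 'Honorific']]
y = toy['Age']
train_X, valid_X, train_y, valid_y = tts(X, y, test_size=0.3, random_state=0)

reg = RFR(random_state=0)
reg.fit(train_X, train_y)
mae_model = mean_absolute_error(valid_y, reg.predict(valid_X))

# Reference point: always predicting the training mean, as mean imputation does.
mae_mean = mean_absolute_error(valid_y, np.full(len(valid_y), train_y.mean()))
```

If the regressor's error on the held-out rows is clearly below the mean-only error, the predicted ages are at least more informative than mean imputation.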

[Result] Feed the Age-filled data into the random forest

Combine the rows whose Age was filled in with the rest of the original data.

# Put the predicted Age back into the rows that were missing it
df_Agenull = df_Agenull.assign(Age=df_Agenull_label['Age'])
# Combine the rows that originally had Age, the newly filled rows, and the rows set aside by title
df_Age = pd.concat([df_Agefill, df_Agenull, df_unneed_name])
# concat leaves the rows out of their original order, so sort by index to restore it
df_Age = df_Age.sort_index()
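The effect of sort_index() after concat can be seen in a tiny sketch (the `toy` frame and the filled-in ages are made-up values for illustration).

```python
import pandas as pd

toy = pd.DataFrame({'Age': [22.0, None, 35.0, None]}, index=[0, 1, 2, 3])

known = toy.dropna(subset=['Age'])                         # index 0 and 2
missing = toy[toy['Age'].isnull()].assign(Age=[30.0, 28.0])  # index 1 and 3, now filled

# concat stacks the pieces in the order given, so the index comes out as 0, 2, 1, 3.
combined = pd.concat([known, missing])

# Sorting by index restores the original row order.
restored = combined.sort_index()
```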

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    float64
 12  Honorific    891 non-null    object 
dtypes: float64(3), int64(6), object(4)
memory usage: 97.5+ KB

Apart from the new Honorific column, everything is back to the original shape, and Age is now complete. Now, let's see the accuracy with the random forest!!

# Exactly the same steps as in the mean-filled comparison run
df_data = df_Age[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
df_label = df_Age['Survived']  # 1-D Series, which is the shape fit() expects for labels

(train_data, test_data, train_label, test_label) = tts(df_data, df_label, test_size=0.3, random_state=0)

clf = RFC()
clf.fit(train_data, train_label)
clf.score(test_data, test_label)

The result is...


A very marginal difference!! But it did go up a little...??

The improvement wasn't commensurate with the effort, but well, I'm glad it went up even a little! (lol)

Emi-chan, it went up slightly! (Manzai King)
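When the difference is this small, a single train/test split may not be enough to tell a real improvement from noise. One option, sketched here on synthetic data from make_classification (an assumption, not the Titanic set), is to average the score over several folds with cross_val_score and look at the spread as well.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 7 features; not the Titanic set.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# 5-fold cross-validation: five accuracy scores, one per held-out fold.
scores = cross_val_score(RFC(random_state=0), X, y, cv=5)

mean_score = scores.mean()  # average accuracy over the folds
spread = scores.std()       # how much the accuracy varies between folds
```

Running this once for the mean-filled data and once for the predicted-Age data would show whether the gap between the two means is larger than the fold-to-fold spread.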
