[PYTHON] [Note] I want to fully preprocess the Titanic data (Age edition)

This is a second attempt at my first article, which I could not finish because I ran out of energy partway through. I have edited it to make it a little easier to follow.

Purpose

The Titanic data has many missing values in Age, so I thought accuracy would improve if all of them were filled in.

Given data

The libraries imported this time are as follows.

import pandas as pd
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import RandomForestRegressor as RFR

I tend to abbreviate long names on my own, so if an unfamiliar name turns up along the way, assume it is one of these aliases.

Putting the original data in a DataFrame (called df below) looks like this.

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...
891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

Writing out every row would be tedious, so only the first and last rows are shown. This is the familiar Titanic training data. Let's check its summary with df.info().

df.info()
#result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

The data has 12 columns × 891 rows. Age, Cabin, and Embarked have missing values, and Name, Sex, Ticket, Cabin, and Embarked are object (string) columns, so they cannot be used as they are.
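From the df.info() output above, the gaps work out to 891 - 714 = 177 missing in Age, 891 - 204 = 687 in Cabin, and 891 - 889 = 2 in Embarked. The same check can be done directly in pandas. Here is a minimal sketch on a made-up three-row frame (the values in `toy` are invented for illustration, not taken from the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 32.0],
    "Cabin": [np.nan, np.nan, "C85"],
    "Embarked": ["S", "C", np.nan],
    "Fare": [7.25, 71.28, 7.75],
})

# Number of missing values per column
missing = toy.isnull().sum()
print(missing)

# Columns that are not numeric and therefore need encoding before modelling
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

On the real data, the same two calls on df should list the three columns with gaps and the five string columns.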

[Preprocessing 1] Sex and Embarked

This is a step that probably everyone does. Sex and Embarked are object columns, but since they take only two or three distinct values, I simply replace them with numbers. Embarked also has two missing values; because there are so few and the distribution is skewed toward one value, I fill them with the mode.

# Count how often each value appears in Embarked
df.Embarked.value_counts()
# Result
S    644
C    168
Q     77

# Convert Sex to numbers
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
# Convert Embarked to numbers
df['Embarked'] = df['Embarked'].map({'S':0, 'C':1, 'Q':2})
# Fill the missing Embarked values with the most common value, S (0)
df['Embarked'] = df['Embarked'].fillna(0)
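The code above hard-codes 0 because S was seen to be the most frequent value. As a hedged alternative, the mode can be looked up programmatically and filled in before mapping, which also keeps the column as integers instead of turning it into floats. A minimal sketch on a made-up Series (`embarked` here is toy data, not the real column):

```python
import numpy as np
import pandas as pd

# Toy Embarked column with two gaps (made-up values)
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() returns the most frequent value(s); take the first one
most_common = embarked.mode()[0]

# Fill first, then map, so no NaN is left to force a float dtype
filled = embarked.fillna(most_common)
codes = filled.map({"S": 0, "C": 1, "Q": 2})
print(most_common, codes.tolist())
```

Either order gives the same numbers; the difference is only in the resulting dtype.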

[Baseline] Fill Age with the mean

As a point of comparison, I first fill the missing Age values with the mean, run the result through a random forest, and use that score as the reference.

# Work on a copy made with df.copy() so that df keeps its missing Age values for later.
# (df_base is simply the name used here for that copy.)
df_base = df.copy()
# For the reference run, fill the missing Age values with the mean
df_base['Age'] = df_base['Age'].fillna(df_base['Age'].mean())

# Build the data without the object columns Name, Ticket, and Cabin
df_data = df_base[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
df_label = df_base['Survived']

# tts is the alias for train_test_split
(train_data, test_data, train_label, test_label) = tts(df_data, df_label, test_size=0.3, random_state=0)

clf = RFC()  # RFC is the alias for RandomForestClassifier
clf.fit(train_data, train_label)
clf.score(test_data, test_label)

Since I only want to compare before and after the preprocessing, the parameters are left at their defaults.

The result is...

0.8134328358208955

The reference score came out fairly high... Can we aim even higher?

[Preprocessing 2] Examining Age and the other features

I want to believe that survival depends on age. So I thought it would be great if the approximate age could be estimated from the other columns.

First, let's look at how the other features correlate with Age where it is known.

After removing the rows with missing Age using dropna, look at the correlation coefficients with df.corr().

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
Age	0.033207	-0.069809	-0.331339	-0.084153	1.000000	-0.232625	-0.179191	0.091566	0.007461
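The Age row above can be produced in one line. A minimal sketch, assuming pandas 1.5 or later (where `corr` accepts `numeric_only`; recent versions refuse to correlate string columns unless it is set) and using a made-up toy frame instead of the real df:

```python
import numpy as np
import pandas as pd

# Toy frame with one string column and one missing Age (made-up values)
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Pclass": [3, 1, 3, 2, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
})

# Drop rows without Age, correlate the numeric columns, keep only the Age column
age_corr = toy.dropna(subset=["Age"]).corr(numeric_only=True)["Age"]
print(age_corr)
```

On the real data this would be `df.dropna(subset=['Age']).corr(numeric_only=True)['Age']`.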

Looking at this, Pclass, SibSp, and Parch show the strongest correlations with Age, although at roughly 0.18 to 0.33 in absolute value they are still modest. The correlation with Pclass is negative, perhaps because people move up to a better class (a smaller Pclass number) as they get older. For SibSp and Parch, large families include many children, which would explain the relationship. Fare is probably not strongly correlated because it is totaled per family.

What is the relationship with Name?

At this stage, the remaining string columns are Name, Ticket, and Cabin. Of these, Name contains titles such as Mr and Miss. In English class I learned that Miss is for unmarried women and Mrs for married women. So maybe analyzing the title would tell us something about age? That was my thinking.

[Preprocessing 3-1] Extract the title from Name

In Name, the title sits between "," and ".", as in "Braund, Mr. Owen Harris". Extract only that part and save it in a column named "Honorific".

I extracted the string with apply() this time, but map() with the same lambda gave exactly the same result. Watch out here! There is a space between "," and the title. I left it out at first, could not work out why the titles were not being recognized, and lost a lot of time there.

# Don't forget the space after the comma!
df['Honorific'] = df['Name'].apply(lambda x: x.split(', ')[1].split('.')[0])
df['Honorific'].value_counts()
#result
Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Col               2
Major             2
Lady              1
Ms                1
the Countess      1
Don               1
Mme               1
Jonkheer          1
Sir               1
Capt              1
Name: Honorific, dtype: int64
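As an aside, the space problem can be sidestepped with a regular expression that tolerates any whitespace after the comma. This is an alternative sketch, not the method used in this article; the first two names appear in the data above, while the third is a made-up example to show a multi-word title:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Dooley, Mr. Patrick",
    "Example, the Countess. of Somewhere",  # made-up name for illustration
])

# Capture everything between the comma (plus optional whitespace) and the next "."
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

Because `\s*` absorbs the space, the extracted titles have no leading blank and can be matched directly in the later steps.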

Quite a few titles appear only once. Let's look at the statistics with describe().

df.groupby('Honorific')['Age'].describe()

The result looks like this.

Honorific	count	mean	std	min	25%	50%	75%	max
Capt	1.0	70.000000	NaN	70.00	70.000	70.0	70.00	70.0
Col	2.0	58.000000	2.828427	56.00	57.000	58.0	59.00	60.0
Don	1.0	40.000000	NaN	40.00	40.000	40.0	40.00	40.0
Dr	6.0	42.000000	12.016655	23.00	35.000	46.5	49.75	54.0
Jonkheer	1.0	38.000000	NaN	38.00	38.000	38.0	38.00	38.0
Lady	1.0	48.000000	NaN	48.00	48.000	48.0	48.00	48.0
Major	2.0	48.500000	4.949747	45.00	46.750	48.5	50.25	52.0
Master	36.0	4.574167	3.619872	0.42	1.000	3.5	8.00	12.0
Miss	146.0	21.773973	12.990292	0.75	14.125	21.0	30.00	63.0
Mlle	2.0	24.000000	0.000000	24.00	24.000	24.0	24.00	24.0
Mme	1.0	24.000000	NaN	24.00	24.000	24.0	24.00	24.0
Mr	398.0	32.368090	12.708793	11.00	23.000	30.0	39.00	80.0
Mrs	108.0	35.898148	11.433628	14.00	27.750	35.0	44.00	63.0
Ms	1.0	28.000000	NaN	28.00	28.000	28.0	28.00	28.0
Rev	6.0	43.166667	13.136463	27.00	31.500	46.5	53.25	57.0
Sir	1.0	49.000000	NaN	49.00	49.000	49.0	49.00	49.0
the Countess	1.0	33.000000	NaN	33.00	33.000	33.0	33.00	33.0

Master seems to be used for young boys. But there is a Master aged 12 and a Mr aged 11... Also, Mrs is supposed to mean a married woman, yet the youngest is 14. Really? And std (standard deviation) is NaN when there is only one person, which makes sense.

To use the titles as data, they have to be converted to numbers. Converting every single one would be tedious, though.

Looking at the table above, every title held by only one person already has an Age, so I will ignore those this time. Put the rows with the titles to be used in df_name, and keep the unused rows too, because they will be needed later. Putting "~" in front of the condition selects everything else.

# Titles to use
use_titles = ['Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Rev', 'Major', 'Mlle', 'Col']
# Rows to use (copied so the column can be overwritten safely)
df_name = df[df['Honorific'].isin(use_titles)].copy()
# Rows not used
df_unneed_name = df[~df['Honorific'].isin(use_titles)]
# Convert Honorific to numbers
df_name['Honorific'] = df_name['Honorific'].map({'Mr':0, 'Miss':1, 'Mrs':2, 'Master':3, 'Dr':4, 'Rev':5, 'Major':6, 'Mlle':7, 'Col':8})
df_name['Honorific'].value_counts()
# Result
0    517
1    182
2    125
3     40
4      7
5      6
8      2
7      2
6      2
Name: Honorific, dtype: int64
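To make the "~" trick concrete: isin() returns a boolean mask, and "~" flips every True to False and vice versa, so the two selections split the rows without overlap. A minimal sketch on made-up rows (the `.copy()` calls are there because older pandas versions may emit a SettingWithCopyWarning when a column of a filtered frame is overwritten):

```python
import pandas as pd

toy = pd.DataFrame({"Honorific": ["Mr", "Miss", "Capt", "Mrs", "Sir"]})
keep = ["Mr", "Miss", "Mrs"]

# Boolean mask: True where the title is one of the kept ones
mask = toy["Honorific"].isin(keep)

used = toy[mask].copy()     # rows whose title is in the list
unused = toy[~mask].copy()  # "~" inverts the mask: all the other rows

# Overwriting the column on the copy leaves toy untouched
used["Honorific"] = used["Honorific"].map({"Mr": 0, "Miss": 1, "Mrs": 2})
print(used["Honorific"].tolist(), unused["Honorific"].tolist())
```

Every row lands in exactly one of the two frames, which is what allows them to be concatenated back together later.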

[Preprocessing 3-2] Predict Age from the other features

All right, everything is finally ready, so let's make the prediction. The rows that have an Age are used for training, and the rows where Age is missing are the ones to predict.

df_Agefill = df_name.dropna(subset=['Age'])
df_Agefill_data = df_Agefill[['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked','Honorific']]    # Training features
df_Agefill_label = df_Agefill[['Age']]    # Training target

# .copy() so that Age can be written into these frames later without warnings
df_Agenull = df_name[df_name['Age'].isnull()].copy()
df_Agenull_data = df_Agenull[['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked','Honorific']]  # Features of the rows to predict
df_Agenull_label = df_Agenull[['Age']].copy()  # Placeholder for the predicted Age (all NaN for now)

By the way, looking at the correlation coefficients of the training data with df_Agefill.corr(), the correlation between Age and Honorific turned out to be

-0.095726

Even after all that work... No, I won't give up! The correlation coefficient only measures linear relationships, so the numbers I assigned to the titles may simply not line up with Age!
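That suspicion can be checked on a small scale: the Pearson coefficient changes a lot depending on which integer each title happens to receive. Below is a minimal sketch on made-up ages (the `toy` frame and both codings are illustrative assumptions), comparing the arbitrary order used in this article with codes ordered by the mean age of each title:

```python
import pandas as pd

# Made-up ages, two passengers per title
toy = pd.DataFrame({
    "title": ["Master", "Master", "Miss", "Miss", "Mr", "Mr", "Mrs", "Mrs"],
    "age":   [3, 5, 20, 24, 30, 34, 36, 40],
})

# Coding A: arbitrary order, as in the article (Mr=0, Miss=1, Mrs=2, Master=3)
code_a = toy["title"].map({"Mr": 0, "Miss": 1, "Mrs": 2, "Master": 3})

# Coding B: titles numbered from youngest to oldest mean age
order = toy.groupby("title")["age"].mean().sort_values().index
code_b = toy["title"].map({t: i for i, t in enumerate(order)})

corr_a = toy["age"].corr(code_a)
corr_b = toy["age"].corr(code_b)
print(corr_a, corr_b)
```

The same titles give a very different coefficient under the two codings, so a low value says little about how informative the title is. Tree-based models such as random forests split on thresholds rather than fitting a line, so they can often still make use of arbitrary codes like these.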

# RFR is the alias for RandomForestRegressor
clf = RFR()
# Pass the target as a 1-D column
clf.fit(df_Agefill_data, df_Agefill_label['Age'])

# Store the predictions in the label DataFrame
age_answer = clf.predict(df_Agenull_data)
df_Agenull_label['Age'] = age_answer
df_Agenull_label

#result
        Age
5   	37.987776
17  	31.422079
19  	26.808000
26  	32.879936
28  	20.253988
... 	...
859 	24.929030
863 	15.495167
868 	25.716969
878 	27.344498
888 	7.838333

Because Age is a continuous value rather than a class, I use the regression version, RandomForestRegressor. The predictions are floating-point numbers, but they look plausible, don't they? That said, there are no true values for these rows, so I can't tell how accurate they are.

[Result] Feed the Age-filled data to the random forest

Combine the rows with the filled-in Age back with the rest of the data.
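One way to get at least a rough idea of the accuracy is to hide some of the known ages, predict them, and compare the error with what mean filling would give. This was not done in the article; the sketch below only illustrates the idea on synthetic data (the features, the target formula, and the noise level are all assumptions made up for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three small integer features and a target that depends on them
rng = np.random.default_rng(0)
n = 300
X = rng.integers(0, 4, size=(n, 3)).astype(float)
y = 10 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 3, size=n)

# Hold out 30% of the rows whose target is known
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

# Mean absolute error of the regressor vs. always predicting the training mean
mae_model = mean_absolute_error(y_val, model.predict(X_val))
mae_mean = mean_absolute_error(y_val, np.full_like(y_val, y_train.mean()))
print(mae_model, mae_mean)
```

With the real data, X and y would be df_Agefill_data and the Age column of df_Agefill; if the regressor's error is not clearly below the mean-fill error, the extra effort is unlikely to pay off downstream.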

# Write the predicted Age back into the rows that were missing it
df_Agenull['Age'] = df_Agenull_label['Age']
# Combine the rows that originally had Age, the newly filled rows, and the rows set aside because of their title
df_Age = pd.concat([df_Agefill, df_Agenull, df_unneed_name])
# The rows are now out of order, so sort by index to restore the original order
df_Age = df_Age.sort_index()
df_Age.info()

#result
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    float64
 12  Honorific    891 non-null    object 
dtypes: float64(3), int64(6), object(4)
memory usage: 97.5+ KB

Apart from the new Honorific column, everything is back to its original shape, and Age no longer has missing values. Now let's see the accuracy with the random forest!

# Exactly the same procedure as in the mean-fill run
df_data = df_Age[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
df_label = df_Age['Survived']

(train_data, test_data, train_label, test_label) = tts(df_data, df_label, test_size=0.3, random_state=0)

clf = RFC()
clf.fit(train_data, train_label)
clf.score(test_data, test_label)

The result is...

0.832089552238806

A very slight difference! But it did go up a little...?

The improvement was not really commensurate with the effort, but I'm glad it went up at least a little! (lol)

Emi-chan, it went up slightly! (Manzai King)
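How slight is "slightly"? With test_size=0.3 on 891 rows the test set has 268 passengers, so the two scores correspond to 218 and 223 correct predictions, a difference of 5 people. Since RFC() is used without a fixed random_state, part of that gap may be run-to-run variation. A hedged way to check is to average cross-validated scores over several seeds; the sketch below does this on synthetic data from make_classification (a stand-in, not the Titanic features), with a reduced number of trees to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 7 Titanic features (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Mean 5-fold accuracy for several forest seeds
scores_by_seed = []
for seed in range(3):
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=5)
    scores_by_seed.append(scores.mean())

# How much the score moves from the seed alone
spread = max(scores_by_seed) - min(scores_by_seed)
print(scores_by_seed, spread)

# The article's two scores expressed as numbers of correct predictions
n_test = 268
correct_mean_fill = round(0.8134328358208955 * n_test)
correct_rf_fill = round(0.832089552238806 * n_test)
print(correct_mean_fill, correct_rf_fill)
```

Running the same loop once with the mean-filled data and once with the regressor-filled data would show whether the gain is larger than the spread between seeds.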