Python: Ship Survival Prediction Part 2

This article is a continuation of Part 1.

Analysis and next action

In the previous part, we compared mean values using pivot tables. Building on those findings, we now move on to visualization.

Visualize and understand data

Now let's visualize the data and check some of our assumptions.

Correlation of numerical data

Start by looking at the correlation between the numerical features and the target (Survived). A histogram shows the distribution of the data over evenly spaced ranges (bins). This helps answer questions tied to a particular range (for example, by dividing Age into ranges you can answer the question, "Is the survival rate of infants better?"). Note that the vertical axis of a histogram represents the number of data points (passengers).

import pandas as pd
import numpy as np   # numpy is used later (np.zeros) when guessing ages
import matplotlib.pyplot as plt
%matplotlib inline

train_df = pd.read_csv('./8010_titanic_data/train.csv')
test_df = pd.read_csv('./8010_titanic_data/test.csv')
combine = [train_df, test_df]


# 1. Remove missing values from Age.
Age_se = train_df["Age"].dropna()

# 2. Visualize Age with a histogram.
train_df["Age"].hist(bins=20) # bins=20 sets the number of bins (classes)
plt.show()

# 3. Visualize Age separately for Survived=0 and Survived=1.
#    Tip: passing by= to the hist method makes this comparison easy.
train_df["Age"].hist(by=train_df["Survived"])
plt.show()

(Figure: Age histograms, overall and split by Survived)

Result

Next actions

1. Consider Age in the model training.
2. Complete the missing values of Age.
3. Create a feature that divides Age into ranges.
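
As a quick check of the infant question raised earlier, here is a minimal sketch (the cutoff Age <= 5 for "infant" is an arbitrary, purely illustrative choice):

# Survival rate of very young children vs. the overall rate (Age <= 5 is an illustrative cutoff).
infant_rate = train_df[train_df["Age"] <= 5]["Survived"].mean()
overall_rate = train_df["Survived"].mean()
print(infant_rate, overall_rate)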

Correlation between categorical and numerical features

# 1. For the deceased (Survived=0), create a histogram of Age for each Pclass.
train_df[train_df["Survived"]==0]["Age"].hist(by=train_df["Pclass"])
plt.show()

# 2. For survivors (Survived=1), create a histogram of Age for each Pclass.
train_df[train_df["Survived"]==1]["Age"].hist(by=train_df["Pclass"])
plt.show()

(Figure: Age histograms by Pclass, for the deceased and for survivors)

Result

Decision: consider Pclass as a model feature.

Correlation of features (category values)

Check the correlation between categorical features. Here we look at the relationship between Survived and Pclass / Sex, split by Embarked.

# 1. For passengers who embarked at Cherbourg (Embarked='C'), create a pivot table of Survived by Pclass / Sex.
cor_category = train_df[train_df["Embarked"]=='C'][["Pclass", "Survived", "Sex"]].groupby(["Pclass", "Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)

# Split the data by Sex into male and female.
cor_category_male = cor_category[cor_category["Sex"]=='male'].sort_values('Pclass')
cor_category_female = cor_category[cor_category["Sex"]=='female'].sort_values('Pclass')

#Add graphs using the Matplotlib library.
plt.plot(['1', '2', '3'], cor_category_male["Survived"], label="male")
plt.plot(['1', '2', '3'], cor_category_female["Survived"], label="female")

plt.legend()
plt.xlabel("Pclass")
plt.ylabel("Survived")
plt.show()


# 2. For passengers who embarked at Queenstown (Embarked='Q'), create a pivot table of Survived by Pclass / Sex.
cor_category = train_df[train_df["Embarked"]=='Q'][["Pclass", "Survived", "Sex"]].groupby(["Pclass", "Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)

# Split the data by Sex into male and female.
cor_category_male = cor_category[cor_category["Sex"]=='male'].sort_values('Pclass')
cor_category_female = cor_category[cor_category["Sex"]=='female'].sort_values('Pclass')

#Add graphs using the Matplotlib library.
plt.plot(['1', '2', '3'], cor_category_male["Survived"], label="male")
plt.plot(['1', '2', '3'], cor_category_female["Survived"], label="female")

plt.legend()
plt.xlabel("Pclass")
plt.ylabel("Survived")
plt.show()

# 3. For passengers who embarked at Southampton (Embarked='S'), create a pivot table of Survived by Pclass / Sex.
cor_category = train_df[train_df["Embarked"]=='S'][["Pclass", "Survived", "Sex"]].groupby(["Pclass", "Sex"], as_index=False).mean().sort_values(by="Survived", ascending=False)

# Split the data by Sex into male and female.
cor_category_male = cor_category[cor_category["Sex"]=='male'].sort_values('Pclass')
cor_category_female = cor_category[cor_category["Sex"]=='female'].sort_values('Pclass')

#Add graphs using the Matplotlib library.
plt.plot(['1', '2', '3'], cor_category_male["Survived"], label="male")
plt.plot(['1', '2', '3'], cor_category_female["Survived"], label="female")

plt.legend()
plt.xlabel("Pclass")
plt.ylabel("Survived")
plt.show()
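
The three code blocks above repeat the same steps for each port. As a side note, the same three plots could be produced with a single loop; this is just a sketch, not part of the original code:

# Sketch: one loop over the three ports instead of three copies of the same block.
for port in ['C', 'Q', 'S']:
    pivot = train_df[train_df["Embarked"] == port][["Pclass", "Survived", "Sex"]] \
        .groupby(["Pclass", "Sex"], as_index=False).mean()
    for sex in ['male', 'female']:
        rates = pivot[pivot["Sex"] == sex].sort_values('Pclass')
        plt.plot(rates["Pclass"].astype(str), rates["Survived"], label=sex)
    plt.title("Embarked = " + port)
    plt.legend()
    plt.xlabel("Pclass")
    plt.ylabel("Survived")
    plt.show()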

Result

1. Female passengers showed a much higher survival rate than males.
2. The male survival rate for Embarked = C was higher than for the other Embarked values.
3. This is likely a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct correlation between Embarked and Survived.
4. Male survival rates were higher for Embarked = C and S than for Embarked = Q.

Decision

1. Add Sex to the model features.
2. Add Embarked to the model features.

Data shaping

We will proceed with converting, creating, and completing features.

# 1. Remove the Ticket and Cabin features from train_df.
train_df = train_df.drop(["Ticket", "Cabin"], axis=1)

# 2. Remove the Ticket and Cabin features from test_df.
test_df = test_df.drop(["Ticket", "Cabin"], axis=1)

print ("-"*25+ "After"+ "-"*25)

# 3. Check the current number of rows and columns of train_df.
print(train_df.shape)

# 4. Check the current number of rows and columns of test_df.
print(test_df.shape)

Create a new feature from an existing feature

combine = [train_df, test_df]
# 1. For both train_df and test_df, extract the title that appears before the dot (.) in Name and store it in a new feature called Title.
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

# 2. Cross-tabulate Title by Sex. pd.crosstab is used for cross tabulation.

pd.crosstab(train_df['Title'], train_df['Sex'])
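
As an aside, here is a minimal standalone example of what the regular expression above captures (the sample name simply follows the "Surname, Title. Given names" format of the dataset):

# The pattern r' ([A-Za-z]+)\.' captures the word followed by a dot, i.e. the title.
sample = pd.Series(["Braund, Mr. Owen Harris"])
print(sample.str.extract(r' ([A-Za-z]+)\.', expand=False))  # -> Mr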

(Table: crosstab of Title by Sex)

It turns out that there are titles such as Master, Mr, Miss, and Mrs. Now replace the infrequent values with a single value called Rare.

for dataset in combine:
    # 1. In the Title of train_df and test_df, replace 'Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', and 'Dona' with 'Rare'.
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    # 2. Similarly, rewrite Mlle and Ms to Miss.
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    # 3. Rewrite Mme to Mrs.
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

# 4. Group by Title and calculate the mean of Survived.
train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

(Table: mean Survived by Title)

Convert these titles into ordinal data to make them easier to use in a predictive model.

# 1. In train_df and test_df, map Title using {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}

for dataset in combine:
    dataset["Title"] = dataset["Title"].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

# 2. Display the first few rows of train_df.
train_df.head()

(Table: first rows of train_df with the encoded Title)

Finally, remove the Name and PassengerId features.

# 1. Remove Name and PassengerId from train_df.
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)

# 2. Remove Name from test_df.
test_df = test_df.drop(['Name'], axis=1)

combine = [train_df, test_df]
# Check the number of rows and columns of train_df and test_df.
print(train_df.shape, test_df.shape)

Conversion of categorical data

Convert categorical string values into numeric values. Here, Sex is converted to a binary value.


# 1. Convert Sex: female=1, male=0.
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

# Display the first few rows
train_df.head()

Complement continuous numerical data

Now we need to estimate and complete the features with missing or null values. First, let's take a look at Age.

There are three ways to complement continuous numerical data.

1. Generate random numbers between the mean minus the standard deviation and the mean plus the standard deviation (a sketch of this method follows after this list).

2. A more accurate way to infer missing values is to use other correlated features. So far we know that Age correlates with Sex and Pclass, so we can infer Age from each combination of Pclass and Sex: for example, use the median age of passengers with Pclass=1 and Sex=0, Pclass=1 and Sex=1, and so on.

3. Combine methods 1 and 2: instead of guessing Age from the median alone, use random numbers between the mean and the standard deviation for each combination of Pclass and Sex.
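
As a reference, here is a minimal sketch of method 1. It works on a copy of the Age column, so it does not affect the method-2 flow used below:

# Method 1 sketch: fill missing ages with random values between mean - std and mean + std.
import numpy as np

age_filled = train_df["Age"].copy()   # work on a copy so train_df itself is untouched
age_mean = age_filled.mean()
age_std = age_filled.std()
n_missing = age_filled.isnull().sum()
age_filled[age_filled.isnull()] = np.random.uniform(age_mean - age_std, age_mean + age_std, size=n_missing)
print(age_filled.isnull().sum())      # 0 missing values remain in the copy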

# 1. For Pclass=1, create an Age histogram for each Sex.
train_df[train_df["Pclass"]==1]["Age"].hist(by=train_df["Sex"], bins=20)
plt.show()

# 2. For Pclass=2, create an Age histogram for each Sex.
train_df[train_df["Pclass"]==2]["Age"].hist(by=train_df["Sex"], bins=20)
plt.show()

# 3. For Pclass=3, create an Age histogram for each Sex.
train_df[train_df["Pclass"]==3]["Age"].hist(by=train_df["Sex"], bins=20)
plt.show()

(Figure: Age histograms by Sex for each Pclass)

Prepare an empty array to store the inferred Age estimates for each combination of Pclass and Sex.

# 1. Create a 2x3 array (Sex x Pclass) to store the guessed Age values.
guess_ages = np.zeros((2,3))
#Display the contents of the array.
print(guess_ages)

# 2. Store the guessed Age (the median) for each combination in the array.
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j+1)]['Age'].dropna()

            # For method 3, use the following instead:
            #import random
            #age_mean = guess_df.mean()
            #age_std = guess_df.std()
            #age_guess = random.uniform(age_mean - age_std, age_mean + age_std)
            
            age_guess = guess_df.median()
 
            # Round to the nearest 0.5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

# Display the guessed ages and the first 5 rows.
print(guess_ages)
train_df.head()

Next, create a feature called AgeBand that divides Age into 5 bands.

# 1. Divide the continuous Age values into 5 equal-width bins, converting them to discrete values.
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)

# 2. Create a pivot table of AgeBand and Survived.
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)

Replace Age with discrete values (ordinal data).

# 1. Convert Age to 0 if 16 or less, 1 if over 16 and up to 32, 2 if over 32 and up to 48, 3 if over 48 and up to 64, and 4 if over 64.
for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
    
# Display the first 5 rows
train_df.head()
# 1. Drop AgeBand from train_df since it is no longer needed.
train_df = train_df.drop(['AgeBand'], axis=1)

# Display the first 5 rows
train_df.head()

Create new features by combining existing features

Create a new feature, FamilySize, that combines Parch and SibSp.

combine = [train_df, test_df]
# 1. Create a new feature, FamilySize, by adding SibSp, Parch, and 1 (the passenger themselves).
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    
# 2. Group by FamilySize and compute the mean of Survived.
train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

(Table: mean Survived by FamilySize)

Next, create a new feature called IsAlone. This feature contains 1 for passengers traveling alone and 0 for those traveling with family.

for dataset in combine:
    # 1. Create a feature called IsAlone and initialize all rows to 0.
    dataset['IsAlone'] = 0
    # 2. Set IsAlone to 1 where FamilySize is 1.
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
    
# 3. Group by IsAlone and output the mean of Survived.
train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
# Remove Parch, SibSp, and FamilySize from train_df and test_df.
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)

combine = [train_df, test_df]

train_df.head()

You can also create an artificial feature that combines Pclass and Age. Create a feature called Age*Class, which stores a value weighted by both age and cabin class.

# 1. Create a new feature called Age*Class and store the product of Age and Pclass.
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

(Table: Age*Class, Age, and Pclass for the first 10 rows)

Completing categorical features

Embarked takes the values S, Q, and C, according to the port of embarkation. There are two missing values in the training dataset. Replace these missing values with the most frequent value.

# 1. Drop NaN from Embarked and store its mode in a new variable, freq_port.
freq_port = train_df.Embarked.dropna().mode()[0]
print(freq_port)
# 2. Replace the missing Embarked values in train_df and test_df with freq_port.
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

# 3. Group by Embarked and output the mean of Survived.
train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

(Table: mean Survived by Embarked)

# 1. Map Embarked using {'S': 0, 'C': 1, 'Q': 2}.
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()

(Table: first rows of train_df with the encoded Embarked)

Completing numerical data

There is only one missing Fare value, and it is in test_df. Substitute the median for this missing value.

# 1. Fill the missing Fare value in test_df with the median of Fare.
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].dropna().median())

test_df.head()

(Table: first rows of test_df after filling Fare)

Next, create a feature called FareBand that divides Fare into four quantile-based bands.

# 1. Divide the continuous Fare values into four quantile-based bins (qcut), converting them to discrete values.
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)

# 2.Create a FareBand and Survived PivotTable.
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

(Table: mean Survived by FareBand)

Replace Fare with discrete values (ordinal data).

# 1. Convert Fare to 0 if 7.91 or less, 1 if over 7.91 and up to 14.454, 2 if over 14.454 and up to 31, and 3 if over 31.
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
    
train_df.head(10)
