[PYTHON] Let's try a rough analysis of the Titanic sinking data

Purpose

As a review of a Python course on Udemy, I analyzed data on the sinking of the Titanic. The environment is Windows 10 and Python 3.5.2. Everything was written in Jupyter Notebook.



Now let's start by importing pandas.

import pandas as pd
from pandas import Series, DataFrame

First, download train.csv from the Kaggle Titanic competition page (https://www.kaggle.com/c/titanic) to get the data on the sinking of the Titanic.

titanic.PNG

#Read csv file
titanic_df = pd.read_csv('train.csv')
#View the beginning of the file and check the dataset
titanic_df.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
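
Before going further, it can help to check which columns have missing values, since head() already shows NaN in Cabin. The following is a minimal sketch, assuming a small frame (the hypothetical name sample_df) rebuilt by hand from the five rows above; on the real data the same call would be titanic_df.isnull().sum().

```python
import numpy as np
import pandas as pd

# The first five rows shown above, reduced to a few columns.
# sample_df is a hypothetical name used only for this sketch.
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
})

# Count missing values per column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

In these five rows only Cabin has gaps; the full file may of course have missing values in other columns as well.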

This analysis looks at the following points: the passengers, the ports of embarkation, the fares, and the survival rate.


About passengers

#Import numpy, matplotlib and seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#The data has an Age column but no column that marks children
#Here, passengers under 10 years old are treated as children

def male_female_child(passenger):
    age, sex = passenger
    if age < 10:
        return 'child'
    else:
        return sex

#Add a new column called person that holds 'male', 'female' or 'child'
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child, axis=1)
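
The row-wise apply above can also be written without a Python-level loop. This is only a sketch of one alternative, using made-up rows (toy_df is a hypothetical name) rather than train.csv. Series.mask replaces a value wherever the condition is True, and a missing Age compares as False, so such passengers keep their Sex just as they do in male_female_child.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
toy_df = pd.DataFrame({
    'Age': [4.0, 30.0, np.nan, 9.0, 10.0],
    'Sex': ['male', 'female', 'male', 'female', 'male'],
})

def male_female_child(passenger):
    age, sex = passenger
    return 'child' if age < 10 else sex

# Row-wise version, as in the article
by_apply = toy_df[['Age', 'Sex']].apply(male_female_child, axis=1)

# Vectorized version: put 'child' wherever Age < 10, keep Sex otherwise
by_mask = toy_df['Sex'].mask(toy_df['Age'] < 10, 'child')

print(by_mask.tolist())
```

Both versions give the same labels on this toy frame, including the row whose age is missing.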

#Count the passengers in each cabin class, split by person (survivors and victims alike)
sns.countplot(x='Pclass', data=titanic_df, hue='person')

output_8_1.png
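
The counts behind a plot like this can also be tabulated directly. A minimal sketch with pd.crosstab follows, assuming made-up rows (toy_df is a hypothetical name); on the real data the inputs would be titanic_df['Pclass'] and titanic_df['person'].

```python
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'person': ['male', 'female', 'child', 'child', 'child', 'male'],
})

# Rows are cabin classes, columns are person categories, cells are counts
counts = pd.crosstab(toy_df['Pclass'], toy_df['person'])
print(counts)
```

Combinations that never occur show up as 0, which makes gaps such as a class with no children easy to spot.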

Compared with first class, the other classes have overwhelmingly more children. Let's look at the age distribution next.

titanic_df['Age'].hist(bins=70)

output_10_1.png

titanic_df['Age'].mean()
29.69911764705882

The average age of the passengers as a whole was about 30. Now let's use FacetGrid to see the age distribution for each cabin class.

fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()

output_13_2.png

I now have a rough picture of who the passengers were.
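
The per-class age picture can be backed up with numbers as well. This is a minimal sketch, assuming made-up rows (toy_df is a hypothetical name) that include missing ages; count, mean and median all skip NaN, so count also shows how many ages are actually known in each class.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3],
    'Age': [38.0, 54.0, 30.0, np.nan, 22.0, 4.0, np.nan],
})

# Known ages, mean age and median age per cabin class (NaN is ignored)
age_by_class = toy_df.groupby('Pclass')['Age'].agg(['count', 'mean', 'median'])
print(age_by_class)
```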


About the harbor

By the way, the Embarked column (to embark means to board) holds three values: "C", "Q" and "S". According to the Kaggle page, these stand for the ports of Cherbourg, Queenstown and Southampton, respectively. Since I am not familiar with any of these places, I will infer what they were like from the cabin classes of the passengers who boarded at each port.

sns.countplot(x='Embarked', data=titanic_df, hue='Pclass', color='g')

output_17_1.png

Apparently Southampton is the largest of these ports, at least in passenger numbers. It also looks as though Cherbourg was a wealthier area than Queenstown.
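
Because the ports differ so much in passenger numbers, comparing shares rather than raw counts may make the "which port was wealthier" guess easier to check. A minimal sketch follows, assuming made-up rows (toy_df is a hypothetical name); normalize='index' turns each row of the cross-tabulation into fractions that sum to 1.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
toy_df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Pclass': [1, 2, 3, 3, 1, 3, 3, 3],
})

# Share of each cabin class among the passengers of each port
share = pd.crosstab(toy_df['Embarked'], toy_df['Pclass'], normalize='index')
print(share)
```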


About fares

#See the distribution of fares by cabin class using FacetGrid
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=3)
fig.map(sns.kdeplot, 'Fare', shade=True)
highest = titanic_df['Fare'].max()
fig.set(xlim=(0,highest))
fig.add_legend()

output_20_2.png

The resulting graph is very hard to read.

titanic_df['Fare'].max()
512.32920000000001
titanic_df['Fare'].mean()
32.2042079685746

Some passengers obviously paid a great deal of money...

titanic_df[titanic_df['Fare']>300]
     PassengerId  Survived  Pclass  Name                                Sex     Age   SibSp  Parch  Ticket    Fare      Cabin        Embarked  person
258  259          1         1       Ward, Miss. Anna                    female  35.0  0      0      PC 17755  512.3292  NaN          C         female
679  680          1         1       Cardeza, Mr. Thomas Drake Martinez  male    36.0  0      1      PC 17755  512.3292  B51 B53 B55  C         male
737  738          1         1       Lesurer, Mr. Gustave J              male    35.0  0      0      PC 17755  512.3292  B101         C         male

These three people paid an extremely high fare compared with the other passengers. (And, of course, all three survived the disaster.)

For those who are interested, I will post the site links for the above three people.

Let's redraw the graph with these three people excluded.

drop_idx = [258, 679, 737]
titanic_df2 = titanic_df.drop(drop_idx)

fig = sns.FacetGrid(titanic_df2, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Fare', shade=True)
highest = titanic_df2['Fare'].max()
fig.set(xlim=(0,highest))
fig.add_legend()

output_27_2.png

That still isn't very easy to read... Let's look at a histogram of the fares instead.

titanic_df['Fare'].hist(bins=70)

output_29_1.png
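
The three outliers were dropped earlier by hard-coded index labels. A boolean mask on Fare is one alternative that avoids hard-coding them. The sketch below assumes a small frame (the hypothetical name sample_df) holding only the fares and index labels of the rows printed in this article, and the threshold of 300 is the same one used in the filter above.

```python
import pandas as pd

# Fares copied from the rows shown in this article, with their original index labels.
# sample_df is a hypothetical name used only for this sketch.
sample_df = pd.DataFrame(
    {'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292, 512.3292, 512.3292]},
    index=[0, 1, 2, 3, 4, 258, 679, 737],
)

# Dropping by label, as in the article
dropped_by_label = sample_df.drop([258, 679, 737])

# Keeping only rows at or below the threshold gives the same frame
kept_by_mask = sample_df[sample_df['Fare'] <= 300]

print(kept_by_mask.equals(dropped_by_label))
```

The mask version keeps working if the rows are reordered or re-indexed, which the label list does not.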


About survival rate

First, let's take a quick look at the overall survival rate.

titanic_df['Survivor'] = titanic_df.Survived.map({0:'Dead', 1:'Alive'})
sns.countplot(x='Survivor', data=titanic_df, palette='husl')

output_31_1.png
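
Since Survived is coded as 0 and 1, its mean is the overall survival rate, so the figure behind the bar chart can be read off directly. A minimal sketch, assuming made-up outcomes (toy_df is a hypothetical name):

```python
import pandas as pd

# Made-up 0/1 outcomes for illustration only (not taken from train.csv)
toy_df = pd.DataFrame({'Survived': [0, 1, 1, 0, 0]})
toy_df['Survivor'] = toy_df['Survived'].map({0: 'Dead', 1: 'Alive'})

# Mean of a 0/1 column is the fraction of 1s, i.e. the survival rate
survival_rate = toy_df['Survived'].mean()

# The same information as shares of each label
shares = toy_df['Survivor'].value_counts(normalize=True)

print(survival_rate)
print(shares)
```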

Next, cabin class and survival rate.

#factorplot was renamed catplot in seaborn 0.9; there, pass kind='point' to get the same plot
sns.factorplot(x='Pclass', y='Survived', data=titanic_df)

output_32_1.png

So far, the result seems to be reasonable.

Let's take a closer look using hue.

sns.factorplot(x='Pclass', y='Survived', hue='person', data=titanic_df, aspect=2)

output_34_1.png

This puzzled me for a moment, but as confirmed above, there were almost no children in first class, so the child survival rate there rests on only a handful of passengers.
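
One way to see how few passengers stand behind each point is to put the group size next to the rate. This is a minimal sketch, assuming made-up rows (toy_df is a hypothetical name); on the real data the same groupby would run on titanic_df.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'person': ['child', 'female', 'male', 'child', 'male',
               'child', 'child', 'female', 'male'],
    'Survived': [0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Survival rate (mean of 0/1) and number of passengers per class and person
summary = toy_df.groupby(['Pclass', 'person'])['Survived'].agg(['mean', 'count'])
print(summary)
```

A rate computed from a count of one or two should be read with caution, whatever value it takes.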

Let's draw a regression line on a graph of survival rate against age.

generations = [10,20,30,40,50,60,70,80]
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df,
           hue_order=[1,2,3], x_bins=generations)

output_36_1.png

Although there is some variation, the survival rate differs by roughly 10 to 20% between cabin classes regardless of age.

sns.lmplot(x='Age', y='Survived', hue='Sex', data=titanic_df, palette='summer',
           x_bins=generations)

output_38_1.png

However, when compared by sex, the survival rate of women turned out to increase with age.
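
A rough numeric counterpart to these plots is to cut Age into 10-year bands and take the survival rate per band and sex. The sketch below uses made-up rows (toy_df, edges and age_band are hypothetical names) and explicit band edges with pd.cut; it does not try to reproduce exactly how x_bins groups the points in lmplot.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from train.csv)
toy_df = pd.DataFrame({
    'Age': [5, 15, 25, 25, 35, 45, 45, 65],
    'Sex': ['female', 'male', 'female', 'male',
            'female', 'male', 'female', 'male'],
    'Survived': [1, 0, 1, 0, 1, 0, 1, 0],
})

# 10-year bands: [0, 10), [10, 20), ... [70, 80)
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80]
toy_df['age_band'] = pd.cut(toy_df['Age'], bins=edges, right=False)

# Survival rate per sex and age band, keeping only bands that actually occur
rate = toy_df.groupby(['Sex', 'age_band'], observed=True)['Survived'].mean()
print(rate)
```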


Impressions

I'm not used to Jupyter Notebook yet, but having written a fair amount this time, I found that a cell can be switched to Markdown with the shortcut `Esc` + `M`.
