As a review of a Python course on Udemy, I analyzed data on the sinking of the Titanic. The environment is Windows 10 and Python 3.5.2. Everything was written in Jupyter Notebook.
Now let's start by importing pandas.
import pandas as pd
from pandas import Series, DataFrame
First, download the train.csv file from the Kaggle page for this competition (https://www.kaggle.com/c/titanic) to get the data on the sinking of the Titanic.
#Read csv file
titanic_df = pd.read_csv('train.csv')
#View the beginning of the file and check the dataset
titanic_df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
The following are important points in this analysis.
# Import numpy, matplotlib and seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# The data has an age column but no column marking children
# Here, passengers under 10 years old are treated as children
def male_female_child(passenger):
    age, sex = passenger
    if age < 10:
        return 'child'
    else:
        return sex
# Add a new column called person that holds male, female or child
titanic_df['person'] = titanic_df[['Age','Sex']].apply(male_female_child, axis=1)
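As an aside, the same kind of column can be built without a row-wise apply. This is only a minimal sketch on a small made-up frame (`sample` is hypothetical and not the Kaggle data). Rows with a missing age keep their sex label, just as in the function above, because comparing NaN with 10 gives False.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, not taken from train.csv
sample = pd.DataFrame({
    'Age': [5.0, 30.0, np.nan, 9.0],
    'Sex': ['male', 'female', 'male', 'female'],
})

# Replace the sex label with 'child' wherever Age is under 10.
# A missing Age compares as False, so those rows keep their sex label.
sample['person'] = sample['Sex'].mask(sample['Age'] < 10, 'child')
print(sample)
```

For a dataset of this size either approach is fine; the column-wise version mainly saves typing.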
#Count the number of passengers (including survivors)
sns.countplot(x='Pclass', data=titanic_df, hue='person')
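The counts behind a plot like this can also be tabulated directly, which makes it easier to read off exact numbers. The following is a rough sketch using `pd.crosstab` on made-up rows (`sample` is hypothetical); on the real data the two Series would come from `titanic_df`.

```python
import pandas as pd

# Hypothetical sample rows, not taken from train.csv
sample = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'person': ['male', 'female', 'child', 'child', 'child', 'male'],
})

# Rows are cabin classes, columns are person categories, cells are counts.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(sample['Pclass'], sample['person'])
print(counts)
```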
Compared with first class, third class has overwhelmingly more children. Let's look at the age distribution here.
titanic_df['Age'].hist(bins=70)
titanic_df['Age'].mean()
29.69911764705882
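One thing worth keeping in mind about this average: pandas skips missing values by default when computing `mean()`, so rows without an age do not contribute. A small sketch on made-up rows (`sample` is hypothetical) shows how to count the missing values and how to get the mean per cabin class.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, not taken from train.csv
sample = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age': [40.0, np.nan, 30.0, 20.0, 10.0],
})

# Number of rows where Age is missing
n_missing = sample['Age'].isnull().sum()

# mean() ignores NaN by default, so only four ages are averaged here
overall_mean = sample['Age'].mean()

# Mean age for each cabin class
mean_by_class = sample.groupby('Pclass')['Age'].mean()
print(n_missing, overall_mean)
print(mean_by_class)
```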
The average age of all passengers was about 30. Now let's use FacetGrid to see the age distribution for each cabin class.
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()
Now I have a rough picture of who the passengers were.
By the way, the Embarked column (the port of embarkation) contains three values: "C", "Q" and "S". According to the Kaggle page, these stand for the ports of Cherbourg, Queenstown and Southampton, respectively. Since I am not familiar with any of these places, I will infer what they were like from the cabin classes of the passengers who boarded at each port.
sns.countplot(x='Embarked', data=titanic_df, hue='Pclass', color='g')
Apparently Southampton had by far the most passengers of the three. I would also guess that Cherbourg was a wealthier place than Queenstown.
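Because the ports differ so much in passenger numbers, comparing shares rather than raw counts may make the "richer port" guess easier to check. Here is a minimal sketch on made-up rows (`sample` is hypothetical), using the `normalize='index'` option of `pd.crosstab` so that each row sums to 1.

```python
import pandas as pd

# Hypothetical sample rows, not taken from train.csv
sample = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'Q', 'Q', 'S', 'S', 'S', 'S'],
    'Pclass':   [1, 1, 3, 3, 3, 1, 2, 3, 3],
})

# Share of each cabin class among the passengers of each port
share = pd.crosstab(sample['Embarked'], sample['Pclass'], normalize='index')
print(share)
```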
# See the distribution of fares by cabin class using FacetGrid
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=3)
fig.map(sns.kdeplot, 'Fare', shade=True)
highest = titanic_df['Fare'].max()
fig.set(xlim=(0,highest))
fig.add_legend()
The resulting graph looks badly distorted.
titanic_df['Fare'].max()
512.32920000000001
titanic_df['Fare'].mean()
32.2042079685746
There are people who are obviously paying a lot of money ...
titanic_df[titanic_df['Fare']>300]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | person |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C | female |
679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C | male |
737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C | male |
These three people paid an extremely high fare compared with the other passengers. (And, of course, all three survived the accident.)
For those who are interested, I will post the site links for the above three people.
I will redraw the graph with these three people excluded.
drop_idx = [258, 679, 737]
titanic_df2 = titanic_df.drop(drop_idx)
fig = sns.FacetGrid(titanic_df2, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Fare', shade=True)
highest = titanic_df2['Fare'].max()
fig.set(xlim=(0,highest))
fig.add_legend()
It still isn't very easy to read ... Let's take a look at a histogram of the fares.
titanic_df['Fare'].hist(bins=70)
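Going back to the step that dropped the three outliers: instead of typing the index labels by hand, the same rows can be excluded with a boolean filter. This is just a sketch on made-up fares (`sample` and `threshold` are names introduced here for illustration); the cutoff of 300 is the same one used in the filter earlier.

```python
import pandas as pd

# Hypothetical sample fares, not the full train.csv column
sample = pd.DataFrame({'Fare': [7.25, 71.2833, 512.3292, 8.05, 512.3292]})

# Same cutoff as the earlier filter on Fare
threshold = 300

# Keep only the rows at or below the cutoff; the original frame is unchanged
trimmed = sample[sample['Fare'] <= threshold]
print(trimmed)
```

A filter like this keeps working if the row order or index changes, which a hard-coded list of labels does not.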
First, let's take a quick look at the overall survival rate.
titanic_df['Survivor'] = titanic_df.Survived.map({0:'Dead', 1:'Alive'})
sns.countplot(x='Survivor', data=titanic_df, palette='husl')
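The plot shows counts, so to get the rate itself as a number something like the following may help. It is a minimal sketch on made-up rows (`sample` is hypothetical). Because Survived is coded as 0 and 1, its mean is the share of survivors.

```python
import pandas as pd

# Hypothetical sample rows, not taken from train.csv
sample = pd.DataFrame({'Survived': [0, 0, 0, 1, 1]})

# Share of each label instead of raw counts
labels = sample['Survived'].map({0: 'Dead', 1: 'Alive'})
rates = labels.value_counts(normalize=True)

# Mean of a 0/1 column equals the fraction of 1s
survival_rate = sample['Survived'].mean()
print(rates)
print(survival_rate)
```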
Next, cabin class and survival rate.
sns.factorplot(x='Pclass', y='Survived', data=titanic_df)
So far, the result seems to be reasonable.
Let's take a closer look using hue.
sns.factorplot(x='Pclass', y='Survived', hue='person', data=titanic_df, aspect=2)
The result puzzled me for a moment, but as confirmed above, there were almost no children in first class.
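When a group is very small, its survival rate can swing a lot, so it may be worth looking at the group size next to the rate. The sketch below uses made-up rows (`sample` is hypothetical) and groups by cabin class and person category, returning both the mean of Survived and the number of rows in each group.

```python
import pandas as pd

# Hypothetical sample rows, not taken from train.csv
sample = pd.DataFrame({
    'Pclass':   [1, 1, 1, 3, 3, 3, 3],
    'person':   ['female', 'male', 'male', 'child', 'child', 'male', 'male'],
    'Survived': [1, 0, 1, 1, 0, 0, 0],
})

# 'mean' is the survival rate, 'count' is how many passengers it is based on
summary = sample.groupby(['Pclass', 'person'])['Survived'].agg(['mean', 'count'])
print(summary)
```

Groups that do not occur at all, such as first-class children in this toy data, simply do not appear in the result.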
Let's draw a regression line on the survival rate graph for each age.
generations = [10,20,30,40,50,60,70,80]
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df,
           hue_order=[1,2,3], x_bins=generations)
Although there is some variation, you can see that cabin class makes a difference of 10 to 20% in survival rate regardless of age.
sns.lmplot(x='Age', y='Survived', hue='Sex', data=titanic_df, palette='summer',
           x_bins=generations)
However, when compared by gender, it was found that the survival rate of women increased with age.
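As a rough numeric cross-check of trends like these, the ages can be cut into ten-year bands and the survival rate computed per band. This is only a sketch on made-up rows (`sample`, `edges` and `labels` are names introduced here), and `pd.cut` with interval edges is not necessarily the same binning that seaborn applies for `x_bins`, so the numbers should be read as an approximation of the plot rather than a reproduction of it.

```python
import pandas as pd

# Hypothetical sample rows, not taken from train.csv
sample = pd.DataFrame({
    'Age':      [5, 15, 25, 28, 35, 45],
    'Survived': [1, 0, 1, 0, 1, 1],
})

# Ten-year bands: (0, 10], (10, 20], ... (70, 80]
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80]
labels = ['{}-{}'.format(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
sample['age_band'] = pd.cut(sample['Age'], bins=edges, labels=labels)

# Survival rate per band; observed=True skips bands with no passengers
rate_by_band = sample.groupby('age_band', observed=True)['Survived'].mean()
print(rate_by_band)
```

On the real data, adding 'Sex' or 'Pclass' to the groupby keys would give the per-group version.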
I'm not used to Jupyter Notebook yet, but after writing a lot this time, I found that a cell can be switched to Markdown with the shortcut `Esc` + `M`.