[PYTHON] Check the raw data with Kaggle's Titanic (kaggle ⑥)

Introduction

This is the story of my first time participating in a Kaggle competition. In the previous article, "Try all scikit-learn models with Kaggle's Titanic", I performed cross-validation on the scikit-learn models and was able to raise the score a little. This time, I would like to do what should have been done first: check the raw data.

Table of contents

  1. Significance of checking raw data
  2. Result
  3. Check the raw data
  4. Training
  5. Full code
  6. Summary


1. Significance of checking raw data

I read a book called "The Power of Analysis that Changes the Company". One of its points is that you should check the raw data before analyzing it: outliers cannot be found without looking at the raw data, so before starting any analysis, first visualize the raw data and confirm whether there are any outliers. The book says you should make this a habit. So let's check the raw data, look for abnormal values, and reconsider how to use each column.

2. Result

To give the result first: by scrutinizing the input data, the score increased a little, to "0.80382". That is in the top 9% (as of January 7, 2020). Let's walk through the flow up to submission.

3. Check the raw data

Let's check some raw data.

Fare

Let's make a scatter plot of Fare for each Pclass (ticket class). It looks like this.

[Figure: 20200109_01.png — scatter plot of Fare by Pclass]

The horizontal axis is Pclass. Fares in class "1" tend to be high, and the grade of the ticket class seems to improve in the order 1 > 2 > 3. The scatter plot also shows fares of "0" in every Pclass. Let's take a look at the raw data, sorted by Fare in ascending order.
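The table below can be reproduced with a short sketch (assuming df holds train.csv, as in the full code in section 5):

# Sort the training data by Fare in ascending order and inspect the lowest fares
df.sort_values('Fare').head(20)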

PassengerId Survived Pclass Sex  Age  SibSp Parch Ticket  Fare   Cabin       Embarked
180         0        3      male 36   0     0     LINE    0      NaN         S
264         0        1      male 40   0     0     112059  0      B94         S
272         1        3      male 25   0     0     LINE    0      NaN         S
278         0        2      male NaN  0     0     239853  0      NaN         S
303         0        3      male 19   0     0     LINE    0      NaN         S
414         0        2      male NaN  0     0     239853  0      NaN         S
467         0        2      male NaN  0     0     239853  0      NaN         S
482         0        2      male NaN  0     0     239854  0      NaN         S
598         0        3      male 49   0     0     LINE    0      NaN         S
634         0        1      male NaN  0     0     112052  0      NaN         S
675         0        2      male NaN  0     0     239856  0      NaN         S
733         0        2      male NaN  0     0     239855  0      NaN         S
807         0        1      male 39   0     0     112050  0      A36         S
816         0        1      male NaN  0     0     112058  0      B102        S
823         0        1      male 38   0     0     19972   0      NaN         S
379         0        3      male 20   0     0     2648    4.0125 NaN         C
873         0        1      male 33   0     0     695     5      B51 B53 B55 S
327         0        3      male 61   0     0     345364  6.2375 NaN         S
844         0        3      male 34.5 0     0     2683    6.4375 NaN         C

Sorted in ascending order of Fare, rows with Fare "0" appear in each of Pclass 1, 2, and 3. Fare "0" probably does not mean a free ticket; it seems to mean "fare unknown". Let's exclude Fare "0" from the training data. Excluding it and creating the scatter plot again gives the following.
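The exclusion itself is a one-line boolean filter (the same line appears in the full code in section 5):

# Keep only the rows whose Fare is not 0, then renumber the index
df = df[df['Fare'] != 0].reset_index(drop=True)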

[Figure: 20200109_02.png — scatter plot of Fare by Pclass, Fare 0 excluded]

It's a little easier to see now. The low point in Pclass "1" is also a concern: the table above shows a row with Fare "5" in Pclass "1". This may also be an outlier, so let's exclude it as well.

[Figure: 20200109_03.png — scatter plot of Fare by Pclass, Fare 0 and Fare 5 excluded]

Now the fares for each Pclass fall within their own distinct ranges in the scatter plot.

Ticket

The ticket number is a nominal scale. Let's sort the rows by ticket number in ascending order.

PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
258 1 1 female 30 0 0 110152 86.5 B77 S
505 1 1 female 16 0 0 110152 86.5 B79 S
760 1 1 female 33 0 0 110152 86.5 B77 S
263 0 1 male 52 1 1 110413 79.65 E67 S
559 1 1 female 39 1 1 110413 79.65 E67 S
586 1 1 female 18 0 2 110413 79.65 E68 S
111 0 1 male 47 0 0 110465 52 C110 S
476 0 1 male NaN 0 0 110465 52 A14 S
431 1 1 male 28 0 0 110564 26.55 C52 S
367 1 1 female 60 1 0 110813 75.25 D37 C

Looking at the ticket numbers, no regularity can be read from them: some are just numbers, others combine letters and numbers. You can also see that several people share the same ticket number. People with the same ticket number often share a surname, so they are presumably families. Moreover, comparing Survived among people with the same ticket number, Survived tends to be the same for the same ticket. So let's adopt a policy of labeling by ticket number, as in the image below.

PassengerId Survived Ticket Ticket (label)
505 1 110152 Ticket A
258 1 110152 Ticket A
760 1 110152 Ticket A
586 1 110413 Ticket B
559 1 110413 Ticket B
263 0 110413 Ticket B
111 0 110465 Ticket C
476 0 110465 Ticket C
431 1 110564 NaN
367 1 110813 NaN

We want to group identical ticket numbers, so tickets that appear only once become "NaN". Tickets A, B, and C could be used as numeric labels as they are, but One-Hot encoding is used to make it explicit that they are labels. The image is as follows. The full source code appears later, but One-Hot encoding can be done with pandas.get_dummies; a minimal sketch follows the table.

PassengerId Survived Ticket A Ticket B Ticket C
505 1 1 0 0
258 1 1 0 0
760 1 1 0 0
586 1 0 1 0
559 1 0 1 0
263 0 0 1 0
111 0 0 0 1
476 0 0 0 1
431 1 0 0 0
367 1 0 0 0
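A minimal sketch of this grouping (mirroring the full code in section 5; df_all is assumed to be the concatenation of the train and test data):

# Keep only ticket numbers that appear more than once; unique tickets fall outside
# the category set and therefore get all-zero dummy columns (the "NaN" rows above)
ticket_values = df_all['Ticket'].value_counts()
ticket_values = ticket_values[ticket_values > 1]
categories = set(ticket_values.index.tolist())

df['Ticket'] = pandas.Categorical(df['Ticket'], categories=categories)
df = pandas.get_dummies(df, columns=['Ticket'])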

SibSp (number of siblings/spouses) / Parch (number of parents/children)

SibSp and Parch were also graphed in a previous article, but let's graph them again.

[Figures: 20200119_02.png, 20200119_01.png — Survived counts by SibSp and by Parch]

There was no significant difference in the correlation coefficients, but the graphs for both SibSp and Parch show the following:

- When SibSp or Parch is 0, Survived is more often 0 (about twice as often).
- When SibSp or Parch is 1 or 2, Survived 0 and 1 occur about equally often.
- When SibSp or Parch is 3 or more, the sample size is small.

Last time, I excluded these columns from the training data because their correlation coefficients were small, but it seems that the values 0, 1, and 2 can be used as label data. Extracting only the rows where SibSp is less than 3 (and likewise for Parch) and checking the correlation coefficient gives the following.

# Check Cramér's V when SibSp is less than 3
df_SibSp = df[df['SibSp'] < 3]
cramersV(df_SibSp['Survived'], df_SibSp['SibSp'])

# Display the cross-tabulation of Survived and SibSp (less than 3)
cross_sibsp = pandas.crosstab(df_SibSp['Survived'], df_SibSp['SibSp'])
cross_sibsp

cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()

The value of Cramér's V is:

0.16260950922794606

[Figure: 20200120_01.png — Survived counts by SibSp (less than 3)]

The correlation coefficient was 0.16, a weak correlation. I'll omit the details, but Parch gives similar results. So, as with Ticket, let's try One-Hot encoding SibSp and Parch. The image is as follows (a code sketch follows the table).

PassengerId Survived SibSp_1 SibSp_2 SibSp_3 SibSp_4 SibSp_5 SibSp_8
505 1 0 0 0 0 0 0
258 1 0 0 0 0 0 0
760 1 0 0 0 0 0 0
586 1 0 0 0 0 0 0
559 1 1 0 0 0 0 0
263 0 1 0 0 0 0 0
111 0 0 0 0 0 0 0
476 0 0 0 0 0 0 0
431 1 0 0 0 0 0 0
367 1 1 0 0 0 0 0
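A minimal sketch for SibSp (Parch is handled the same way; again this mirrors the full code in section 5, with df_all holding train and test combined):

# Fix the category set from all SibSp values seen in train and test,
# so both DataFrames end up with identical dummy columns
categories = set(df_all['SibSp'].value_counts().index.tolist())
df['SibSp'] = pandas.Categorical(df['SibSp'], categories=categories)
df_test['SibSp'] = pandas.Categorical(df_test['SibSp'], categories=categories)

df = pandas.get_dummies(df, columns=['SibSp'])
df_test = pandas.get_dummies(df_test, columns=['SibSp'])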

Cabin (room number)

Let's check Cabin. Of the roughly 900 rows of training data (train.csv), only about 200 have a Cabin value. Cabin is a nominal scale. Treating rows that share the first character as one group and aggregating gives the following.

[Figure: 20200120_03.png — Survived counts grouped by the first character of Cabin]

In every group there are many Survived "1" rows, so the first character looks useful as label data. For Cabin, too, let's One-Hot encode the first character. The image is as follows (a code sketch follows the table).

PassengerId Survived Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T
505 1 0 1 0 0 0 0 0 0
258 1 0 1 0 0 0 0 0 0
760 1 0 1 0 0 0 0 0 0
586 1 0 0 0 0 1 0 0 0
559 1 0 0 0 0 1 0 0 0
263 0 0 0 0 0 1 0 0 0
111 0 0 0 1 0 0 0 0 0
476 0 1 0 0 0 0 0 0 0
431 1 0 0 1 0 0 0 0 0
367 1 0 0 0 1 0 0 0 0
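A minimal sketch of this encoding (note: Cabin is ultimately dropped from the final model, so this does not appear in the full code in section 5):

# Take the first character of Cabin as a deck label; NaN stays NaN and
# produces all-zero dummy columns
df['Cabin'] = df['Cabin'].str[0]
df = pandas.get_dummies(df, columns=['Cabin'])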

4. Training

Let's train based on the work so far. The input data is as follows.

No. Item   Meaning                     Conversion method
1   Pclass Ticket class                Standardization
2   Sex    Sex                         Quantify (label encoding)
3   SibSp  Number of siblings/spouses  One-Hot encoding
4   Parch  Number of parents/children  One-Hot encoding
5   Ticket Ticket number               One-Hot encoding
6   Fare   Fare                        Standardization
7   Cabin  Room number                 One-Hot encoding of the first character

Trying all the models as in kaggle ⑤, and tuning the parameters by grid search as in kaggle ④, I arrived at the following model:

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='exponential', max_depth=6,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=1, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
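For reference, a sketch of the kind of grid search that yields such a model (the parameter grid here is an assumption for illustration, not the exact grid from kaggle ④; x_train and y_train are assumed to be prepared as in section 5):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; it includes the winning values loss='exponential', max_depth=6
param_grid = {
    'loss': ['deviance', 'exponential'],
    'max_depth': [4, 5, 6, 7],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=1), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_estimator_)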

5. Full code

The full code is below. However, when I actually trained, the score did not improve with "Cabin" included, so in the end I excluded Cabin.

import numpy
import pandas 
import matplotlib.pyplot as plt

######################################
# Cramér's coefficient of association (Cramér's V)
# >= 0.5  : very strong correlation
# >= 0.25 : strong correlation
# >= 0.1  : slightly weak correlation
# <  0.1  : no correlation
######################################
def cramersV(x, y):
    """
    Calc Cramer's V.

    Parameters
    ----------
    x : {numpy.ndarray, pandas.Series}
    y : {numpy.ndarray, pandas.Series}
    """
    table = numpy.array(pandas.crosstab(x, y)).astype(numpy.float32)
    n = table.sum()
    colsum = table.sum(axis=0)
    rowsum = table.sum(axis=1)
    expect = numpy.outer(rowsum, colsum) / n
    chisq = numpy.sum((table - expect) ** 2 / expect)
    return numpy.sqrt(chisq / (n * (numpy.min(table.shape) - 1)))

######################################
# Correlation ratio
# >= 0.5  : very strong correlation
# >= 0.25 : strong correlation
# >= 0.1  : slightly weak correlation
# <  0.1  : no correlation
######################################
def CorrelationV(x, y):
    """
    Calc Correlation ratio 

    Parameters
    ----------
    x : nominal scale {numpy.ndarray, pandas.Series}
    y : ratio   scale {numpy.ndarray, pandas.Series}
    """
    # The correlation ratio is (between-class variation) / (total variation),
    # computed here as 1 - (within-class variation) / (total variation)
    variation = ((y - y.mean()) ** 2).sum()
    within_class = sum([((y[x == i] - y[x == i].mean()) ** 2).sum() for i in numpy.unique(x)])
    return 1 - within_class / variation

# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')

# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')

# Extract 'PassengerId' (to combine with the result later)
df_test_index = df_test[['PassengerId']]

df_all = pandas.concat([df, df_test], sort=False)

##############################
# Data preprocessing:
# extract the required columns
##############################
df = df[['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']]
df_test = df_test[['Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']]

##############################
# Draw a scatter plot of Fare vs. Pclass
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()

##############################
# Exclude rows with Fare 0
##############################
df = df[df['Fare'] != 0].reset_index(drop=True)

##############################
# Draw the scatter plot of Fare vs. Pclass again
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
#plt.xlim(1, 3)
plt.ylim(0, 300)
plt.show()

##############################
# Exclude the Fare 5 outlier (Pclass 1)
##############################
df = df[df['Fare'] != 5].reset_index(drop=True)

##############################
# Draw the final scatter plot of Fare vs. Pclass
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()

##############################
# Display the cross-tabulation of Survived and Age
##############################
cross_age = pandas.crosstab(df_all['Survived'], round(df_all['Age'],-1))
cross_age

cross_age.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()

##############################
# Display the cross-tabulation of Survived and SibSp
##############################
cross_sibsp = pandas.crosstab(df['Survived'], df['SibSp'])
cross_sibsp

cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()

# Check Cramér's V when SibSp is less than 3
df_SibSp = df[df['SibSp'] < 3]
cramersV(df_SibSp['Survived'], df_SibSp['SibSp'])

##############################
# Display the cross-tabulation of Survived and SibSp (less than 3)
##############################
cross_sibsp = pandas.crosstab(df_SibSp['Survived'], df_SibSp['SibSp'])
cross_sibsp

cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()

##############################
# Display the cross-tabulation of Survived and Parch
##############################
cross_parch = pandas.crosstab(df['Survived'], df['Parch'])
cross_parch

cross_parch.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()

# Check Cramér's V when Parch is less than 3
df_Parch = df[df['Parch'] < 3]
cramersV(df_Parch['Survived'], df_Parch['Parch'])

##############################
# Display the cross-tabulation of Survived and Parch (less than 3)
##############################
cross_parch = pandas.crosstab(df_Parch['Survived'], df_Parch['Parch'])
cross_parch

cross_parch.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()

from sklearn.preprocessing import LabelEncoder
##############################
# Data preprocessing:
# quantify the labels
##############################
##############################
# Sex
##############################
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
df_test['Sex'] = encoder_sex.transform(df_test['Sex'].values)

##############################
# Data preprocessing:
# One-Hot encoding
##############################
##############################
# SibSp
##############################
SibSp_values = df_all['SibSp'].value_counts()
SibSp_values = pandas.Series(SibSp_values.index, name='SibSp')
categories = set(SibSp_values.tolist())
df['SibSp'] = pandas.Categorical(df['SibSp'], categories=categories)
df_test['SibSp'] = pandas.Categorical(df_test['SibSp'], categories=categories)

df = pandas.get_dummies(df, columns=['SibSp'])
df_test = pandas.get_dummies(df_test, columns=['SibSp'])

##############################
# Parch
##############################
Parch_values = df_all['Parch'].value_counts()
Parch_values = pandas.Series(Parch_values.index, name='Parch')
categories = set(Parch_values.tolist())
df['Parch'] = pandas.Categorical(df['Parch'], categories=categories)
df_test['Parch'] = pandas.Categorical(df_test['Parch'], categories=categories)

df = pandas.get_dummies(df, columns=['Parch'])
df_test = pandas.get_dummies(df_test, columns=['Parch'])

##############################
# Ticket
##############################
ticket_values = df_all['Ticket'].value_counts()
ticket_values = ticket_values[ticket_values > 1]
ticket_values = pandas.Series(ticket_values.index, name='Ticket')
categories = set(ticket_values.tolist())
df['Ticket'] = pandas.Categorical(df['Ticket'], categories=categories)
df_test['Ticket'] = pandas.Categorical(df_test['Ticket'], categories=categories)

df = pandas.get_dummies(df, columns=['Ticket'])
df_test = pandas.get_dummies(df_test, columns=['Ticket'])

##############################
# Data preprocessing:
# standardize numeric columns
##############################
from sklearn.preprocessing import StandardScaler

# Standardize Pclass and Fare
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']

df_test_std = pandas.DataFrame(standard.transform(df_test[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df_test['Pclass'] = df_test_std['Pclass']
df_test['Fare'] = df_test_std['Fare']

##############################
# Data preprocessing:
# fill missing values
##############################
df_test = df_test.fillna({'Fare':0})

# Prepare the training data
x_train = df.drop(columns='Survived').values
y_train = df[['Survived']].values
# Flatten y_train to a 1-D array
y_train = numpy.ravel(y_train)

##############################
# Build the model
# GradientBoostingClassifier
##############################
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1, loss='exponential', learning_rate=0.1, max_depth=6)

# Delete any result.csv left over from a previous run
import os
if os.path.exists('./result.csv'):
    os.remove('./result.csv')

##############################
# Training
##############################
model.fit(x_train, y_train)

##############################
# Predict the results
##############################
x_test = df_test.values
y_test = model.predict(x_test)

# Combine PassengerId with the predicted results
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)

# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)

When I submitted this, the score became "0.80382".

6. Summary

The score exceeded 0.8, which put me in the top 10%. The input data finally used is as follows.

No. Item   Meaning                     Conversion method
1   Pclass Ticket class                Standardization
2   Sex    Sex                         Quantify (label encoding)
3   SibSp  Number of siblings/spouses  One-Hot encoding
4   Parch  Number of parents/children  One-Hot encoding
5   Ticket Ticket number               One-Hot encoding
6   Fare   Fare                        Standardization

Up to this point I have been studying with scikit-learn. There are other machine learning frameworks as well, so let's try another one. Next time I would like to train with Keras.

History

2020/01/29 First edition released
2020/02/03 Corrected typographical errors
2020/02/15 Added link to the next article
