[PYTHON] Check the correlation with Kaggle's Titanic (kaggle③)

Introduction

This is the story of my first time participating in a Kaggle competition. In the previous article, "Learning with Kaggle's Titanic", I went through all the steps from training to submission, but the result was that I lost to the sample data (76%). This time, I would like to scrutinize the data in order to raise my score in the "Titanic competition".

Table of contents

  1. About correlation
  2. Variable scale
  3. Correlation coefficient
  4. Examine the Titanic correlation coefficient
  5. Result
  6. Summary

  References
  History

1. About correlation

Last time I picked the training data arbitrarily and trained on "Pclass", "Sex", "Age", and "Fare". This time, I would like to select the data used for training on reasoned grounds.

A scatter plot is useful for checking correlation. A scatter plot is a graph like the one below.

20191221_02.png

If the scatter plot looks like the one above, you can see a relationship that rises to the right (a positive correlation). For example, with height on the horizontal axis and weight on the vertical axis, taller people tend to be heavier, so the scatter plot should look like the one above. (At least, I think it would.)

However, a scatter plot is effective only when both the horizontal axis and the vertical axis are quantitative variables. If you make a scatter plot of Titanic's "gender" (qualitative variable) and "survival" (qualitative variable), there are only two values on the horizontal axis, "male" and "female", and only two on the vertical axis, "0" and "1". The scatter plot then consists of only four points, which is not very meaningful. The method for examining correlation depends on the type of data (quantitative variable or qualitative variable).
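
The "only four points" problem can be seen in a small sketch. The data below is toy data I made up for illustration (not the Titanic file), and the variable names are my own. It counts how many distinct points a scatter plot could show and compares that with a crosstab, which keeps the counts.

```python
import pandas

# Toy data, made up for illustration (not the Titanic file).
toy = pandas.DataFrame({
    "Sex":      ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 1, 0, 0],
})

# A scatter plot can only show the distinct (Sex, Survived) pairs: at most 4.
distinct_points = toy.drop_duplicates().shape[0]
print(distinct_points)

# A crosstab keeps the counts that overlapping points would hide.
counts = pandas.crosstab(toy["Sex"], toy["Survived"])
print(counts)
```

Eight passengers collapse into four overlapping points, while the crosstab still shows how many fall into each combination.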

2. Variable scale

Earlier, the terms quantitative variable and qualitative variable came up. The data (variables) we handle are classified into the following scales according to their properties.

Variable type | Scale type | Meaning | Example
Qualitative variable | Nominal scale | Scale used only to distinguish categories | Gender, prefecture
Qualitative variable | Ordinal scale | Scale in which only the order (magnitude relation) is meaningful | Rank, seismic intensity
Quantitative variable | Interval scale | Evenly spaced scale in which the intervals are meaningful | Temperature, calendar year (AD)
Quantitative variable | Ratio scale (proportional scale) | Scale in which both intervals and ratios are meaningful (0 is the origin) | Height, price

Scatter plots are useful when both variables being compared are quantitative. I think they are also effective for a quantitative variable paired with an ordinal scale. They are not very effective when one of the variables is a nominal scale. For example, if you plot Titanic's "survival (0, 1)" against "fare" in a scatter plot, it looks like this:

20191221_03.png

The horizontal axis (survival) takes only the two values 0 and 1, so the correlation is not clear. For nominal scales, scatter plots are unlikely to be very effective.

Besides scatter plots, correlation can also be expressed as a number called a "correlation coefficient". Let's take a look at that next.

3. Correlation coefficient

The correlation coefficient is used as an indicator of how much the two sets of data are related. numpy has a function to find the correlation coefficient.

numpy.corrcoef(x, y)[0, 1]
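
As a minimal usage sketch of the call above (the height and weight numbers are made up for illustration): numpy.corrcoef returns a 2x2 matrix, and [0, 1] picks out the coefficient between the first and the second argument.

```python
import numpy

# Made-up values for illustration only.
height = numpy.array([150.0, 160.0, 165.0, 170.0, 180.0])  # cm
weight = numpy.array([48.0, 55.0, 61.0, 66.0, 77.0])       # kg

matrix = numpy.corrcoef(height, weight)  # 2x2 correlation matrix
r = matrix[0, 1]                         # coefficient between height and weight

print(matrix.shape)
print(r)
```

The diagonal entries are each variable's correlation with itself (1.0), which is why only the off-diagonal entry is read.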

What numpy's corrcoef computes is properly called "Pearson's product-moment correlation coefficient". It is used for the correlation between two "quantitative variables", as described earlier. There are other measures, such as "Cramér's coefficient of association" (Cramér's V) and the "correlation ratio", and which one to use depends on the scales of the variables. The table below summarizes the variable scales and the corresponding measures.

Variable 1 | Variable 2 | Correlation coefficient used | Titanic variables
Nominal scale | Nominal scale | Cramér's coefficient of association (Cramér's V) | Sex (gender), Ticket (ticket number), Cabin (room number), Embarked (port of embarkation)
Nominal scale | Ordinal scale | Rank correlation | Pclass (ticket class)
Nominal scale | Quantitative variable | Correlation ratio | Age, SibSp, Parch, Fare
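
The table can also be written down as a small lookup, which is handy when looping over columns. This is only a sketch: the helper name suggest_measure and the string keys are hypothetical names of my own, not part of any library, and variable 1 is assumed to be a nominal scale such as Survived.

```python
# Hypothetical helper: the names and keys are my own, not from any library.
# Variable 1 is assumed to be a nominal scale (for example Survived).
MEASURE_BY_SCALE = {
    "nominal": "Cramér's coefficient of association (Cramér's V)",
    "ordinal": "rank correlation",
    "quantitative": "correlation ratio",
}

def suggest_measure(scale_of_variable_2):
    """Return the measure the table pairs with the scale of variable 2."""
    return MEASURE_BY_SCALE[scale_of_variable_2]

# Scales of the Titanic columns as classified in the table.
titanic_scales = {
    "Sex": "nominal", "Embarked": "nominal", "Pclass": "ordinal",
    "Age": "quantitative", "SibSp": "quantitative",
    "Parch": "quantitative", "Fare": "quantitative",
}
for name, scale in titanic_scales.items():
    print(name, "->", suggest_measure(scale))
```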

Below are code samples for "Cramér's coefficient of association" and the "correlation ratio". I have not written separate code for rank correlation; I think it behaves much like the correlation ratio, so I use the correlation ratio for the ordinal scale as well.

import numpy 
import pandas 
######################################
# Cramér's coefficient of association (Cramér's V)
# >= 0.5  : very strong correlation
# >= 0.25 : strong correlation
# >= 0.1  : slightly weak correlation
# <  0.1  : no correlation
######################################
def cramersV(x, y):
    """
    Calc Cramer's V.

    Parameters
    ----------
    x : {numpy.ndarray, pandas.Series}
    y : {numpy.ndarray, pandas.Series}
    """
    table = numpy.array(pandas.crosstab(x, y)).astype(numpy.float32)
    n = table.sum()
    colsum = table.sum(axis=0)
    rowsum = table.sum(axis=1)
    expect = numpy.outer(rowsum, colsum) / n
    chisq = numpy.sum((table - expect) ** 2 / expect)
    return numpy.sqrt(chisq / (n * (numpy.min(table.shape) - 1)))

######################################
# Correlation ratio
# >= 0.5  : very strong correlation
# >= 0.25 : strong correlation
# >= 0.1  : slightly weak correlation
# <  0.1  : no correlation
######################################
def CorrelationV(x, y):
    """
    Calc Correlation ratio 

    Parameters
    ----------
    x : nominal scale {numpy.ndarray, pandas.Series}
    y : ratio   scale {numpy.ndarray, pandas.Series}
    """
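    # Total variation of y around its overall mean
    variation = ((y - y.mean()) ** 2).sum()
    # Within-class variation: spread of y around the mean of each class of x
    within_class = sum(((y[x == i] - y[x == i].mean()) ** 2).sum() for i in numpy.unique(x))
    # 1 - (within-class / total) is the between-class share of the variation
    return 1 - within_class / variation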

Pearson's correlation coefficient is a number from -1 to 1, while Cramér's coefficient of association and the correlation ratio range from 0 to 1. How a value should be weighed depends on the formula. Rough guidelines for the values are as follows.

〇 Cramér's coefficient of association and correlation ratio

Value | Correlation
0.5 or more | Very strong correlation
0.25 or more | Strong correlation
0.1 or more | Slightly weak correlation
Less than 0.1 | No correlation

〇 Pearson's product-moment correlation coefficient (absolute value)

Value | Correlation
0.7 or more | Very strong correlation
0.4 or more | Strong correlation
0.2 or more | Slightly weak correlation
Less than 0.2 | No correlation
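
The guidelines can be turned into a small labeling function. This is a sketch only: describe_strength and the dictionary keys are hypothetical names of my own, and I am assuming the lowest Pearson band means "below 0.2".

```python
# Hypothetical helper; the thresholds are the guideline bands from the tables,
# assuming the lowest Pearson band means "below 0.2".
THRESHOLDS = {
    "cramers_v": (0.5, 0.25, 0.1),
    "correlation_ratio": (0.5, 0.25, 0.1),
    "pearson": (0.7, 0.4, 0.2),
}

def describe_strength(value, measure="cramers_v"):
    """Map a coefficient to the guideline label for the given measure."""
    very_strong, strong, weak = THRESHOLDS[measure]
    size = abs(value)  # Pearson's r can be negative; only its size matters here
    if size >= very_strong:
        return "very strong correlation"
    if size >= strong:
        return "strong correlation"
    if size >= weak:
        return "slightly weak correlation"
    return "no correlation"

print(describe_strength(0.3, "cramers_v"))
print(describe_strength(-0.75, "pearson"))
```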

In addition to the correlation coefficient, for nominal scales the relationship can also be shown by graphing a "crosstab" (cross tabulation).
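
A crosstab of raw counts can be hard to compare when the groups differ in size. As a small sketch on made-up data (not the Titanic file), pandas.crosstab accepts normalize="index", which turns each row into proportions.

```python
import pandas

# Toy data, made up for illustration (not the Titanic file).
toy = pandas.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Each row sums to 1: the share of Survived 0 / 1 within each class.
rates = pandas.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates)
```

Plotting rates with kind='bar' and stacked=True gives bars of equal height, so the survival share per group can be read directly.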

4. Examine the Titanic correlation coefficient

Now, let's look at the correlation coefficient and a graph for each Titanic variable. Create a "New Notebook" for the Titanic competition and define the "Cramér's coefficient of association" and "correlation ratio" functions above. Then read and prepare the training data with the following code.

import matplotlib.pyplot as plt

# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')

##############################
# Data preprocessing
# Fill missing values
##############################
# Replace NaN in Age with -1
df = df.fillna({'Age':-1})
# Replace NaN in Embarked with the string 'null'
df = df.fillna({'Embarked':'null'})

##############################
# Data preprocessing
# Convert labels (strings) to numbers
##############################
from sklearn.preprocessing import LabelEncoder
# Convert Sex and Embarked to integer codes using LabelEncoder
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
encoder_embarked = LabelEncoder()
df['Embarked'] = encoder_embarked.fit_transform(df['Embarked'].values)
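
To see which number each label ends up with, here is a pandas-only sketch on made-up values. It assumes codes are assigned in sorted order of the labels, which is how LabelEncoder orders its classes; the variable names are my own.

```python
import pandas

# Made-up sample of Embarked values, including the 'null' placeholder.
embarked = pandas.Series(["S", "C", "null", "Q", "S"])

# Assign integer codes in sorted label order (uppercase sorts before lowercase).
categories = sorted(embarked.unique())
mapping = {label: code for code, label in enumerate(categories)}
codes = embarked.map(mapping)

print(mapping)
print(codes.tolist())
```

The codes are only labels. Their order and spacing carry no meaning, which is fine for a crosstab-based measure but should not be read as a quantity.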

Pclass (ticket class)

The ticket class is an "ordinal scale". Here I check it with the correlation ratio.

######################################################
# Data analysis 1
# Examine the correlation between Survived and Pclass (ordinal scale)
######################################################
CorrelationV(df['Survived'], df['Pclass'])
0.11456941170524215

The result is "slightly weak correlation". Let's graph the crosstab.

cross_pclass = pandas.crosstab(df['Survived'], df['Pclass'])
cross_pclass.T.plot(kind='bar', stacked=True)
plt.show()

20191222_01.png

When the class is 1, survival "1" exceeds 50%. When the class is 3, survival "1" is only about 1/4. The coefficient is weak at about 0.11, but looking at the graph, there seems to be a fair correlation.
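
One thing that may explain the gap between the number and the graph: as far as I can tell, the value computed by CorrelationV is the squared form of the correlation ratio (often written η²), and when the grouping variable has only two classes it equals the square of Pearson's r. So 0.11 corresponds to |r| of roughly 0.34. A small check on made-up random data (correlation_ratio below is my own restatement of the same formula):

```python
import numpy

rng = numpy.random.default_rng(0)
group = rng.integers(0, 2, size=200)         # binary variable, like Survived
value = rng.normal(size=200) + 0.8 * group   # numeric variable, made up

def correlation_ratio(x, y):
    """1 - (within-class variation / total variation), as in CorrelationV."""
    total = ((y - y.mean()) ** 2).sum()
    within = sum(((y[x == g] - y[x == g].mean()) ** 2).sum() for g in numpy.unique(x))
    return 1 - within / total

eta_squared = correlation_ratio(group, value)
r = numpy.corrcoef(group, value)[0, 1]

# The two printed numbers agree up to floating-point error.
print(eta_squared, r ** 2)
```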

Name

I will skip the name for now. It changes the order of things, but it is also necessary to "observe the data". I would like to come back to the name in an "Observing data" section that I will cover separately.

Sex

Gender is a "nominal scale". Check it with Cramér's coefficient of association.

######################################################
# Data analysis 2
# Examine the correlation between Survived and Sex (nominal scale)
######################################################
cramersV(df['Survived'], df['Sex'])
0.5433513740027712

The result is "very strong correlation". Let's graph the crosstab.

cross_sex = pandas.crosstab(df['Survived'], df['Sex'])
cross_sex.T.plot(kind='bar', stacked=True)
plt.show()

20191222_02.png

Certainly, the results differ greatly between men and women.
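
As a sanity check on the measure itself: for a 2x2 table such as Sex versus Survived, Cramér's coefficient of association equals the absolute value of Pearson's r between the 0/1 codes. The sketch below uses made-up data and repeats the chi-square calculation inline rather than calling cramersV.

```python
import numpy
import pandas

# Made-up 0/1 data for illustration (not the Titanic file).
sex      = numpy.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
survived = numpy.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# Chi-square statistic from observed and expected counts.
table = pandas.crosstab(sex, survived).to_numpy().astype(float)
n = table.sum()
expected = numpy.outer(table.sum(axis=1), table.sum(axis=0)) / n
chi_squared = ((table - expected) ** 2 / expected).sum()

v = numpy.sqrt(chi_squared / (n * (min(table.shape) - 1)))
r = numpy.corrcoef(sex, survived)[0, 1]

# v and |r| agree up to floating-point error for a 2x2 table.
print(v, abs(r))
```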

Age

Age is a "quantitative variable (ratio scale)". Check it with the correlation ratio.

######################################################
# Data analysis 3
# Examine the correlation between Survived and Age (ratio scale)
######################################################
CorrelationV(df['Survived'], df['Age'])
0.0001547299039139638

"No correlation". Let's graph the crosstab, grouping ages in 10-year increments.

cross_age = pandas.crosstab(df['Survived'], round(df['Age'],-1))
cross_age.T.plot(kind='bar', stacked=True)
plt.show()

20191222_03.png

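It looks as though survival "1" becomes rare from the 50s onward, but the result is "no correlation". It was surprising that age had so little effect. Note that missing ages were replaced with -1 above, which may also influence this value.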
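
A note on the 10-year grouping: rounding with -1 goes to the nearest multiple of 10, so ages 16 and 24 both land on "20", and the -1 placeholder lands on "0". As an alternative sketch on made-up ages, pandas.cut with explicit edges gives true decade bins and leaves the placeholder out.

```python
import pandas

# Made-up ages; the last value is the -1 placeholder for a missing age.
ages = pandas.Series([4.0, 16.0, 24.0, 29.0, 36.0, 62.0, -1.0])

# Rounding to the nearest 10: 16 and 24 both become 20, and -1 becomes 0.
rounded = ages.round(-1)

# Explicit decade bins [0, 10), [10, 20), ...; values outside become NaN.
decades = pandas.cut(ages, bins=list(range(0, 101, 10)), right=False)

print(rounded.tolist())
print(decades.tolist())
```

With the decade bins, pandas.crosstab(df['Survived'], decades) would count only the passengers whose age is known.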

SibSp (number of siblings / spouse)

SibSp is a "quantitative variable (ratio scale)". Check it with the correlation ratio.

######################################################
# Data analysis 4
# Examine the correlation between Survived and SibSp (ratio scale)
######################################################
CorrelationV(df['Survived'], df['SibSp'])
0.0012476789275327471

"No correlation". Let's graph the crosstab.

cross_sibsp = pandas.crosstab(df['Survived'], df['SibSp'])
cross_sibsp.T.plot(kind='bar', stacked=True)
plt.show()

20191222_05.png

No significant correlation can be seen by looking at the graph.

Parch (number of parents / children)

Parch is a "quantitative variable (ratio scale)". Check it with the correlation ratio.

######################################################
# Data analysis 5
# Examine the correlation between Survived and Parch (ratio scale)
######################################################
CorrelationV(df['Survived'], df['Parch'])
0.006663360100801152

"No correlation". Let's graph the crosstab.

cross_parch = pandas.crosstab(df['Survived'], df['Parch'])
cross_parch.T.plot(kind='bar', stacked=True)
plt.show()

20191222_05.png

No significant correlation can be seen here either.

Ticket (ticket number)

I will also skip the ticket number this time. I would like to come back to it in the "Observing data" section that I will cover separately.

Fare (fare)

Fare is a "quantitative variable (ratio scale)". I standardize it first and then check it with the correlation ratio. (I think the result would be the same without standardization.)
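
The guess in parentheses can be checked on made-up data: the correlation ratio compares variation within classes to total variation, so shifting and rescaling the values should not change it. In the sketch below, correlation_ratio is my own restatement of the CorrelationV formula, and the fares are random numbers, not Titanic data.

```python
import numpy

rng = numpy.random.default_rng(1)
survived = rng.integers(0, 2, size=300)                          # made-up 0/1 labels
fare = rng.exponential(scale=30.0, size=300) + 20.0 * survived   # made-up fares

def correlation_ratio(x, y):
    """1 - (within-class variation / total variation), as in CorrelationV."""
    total = ((y - y.mean()) ** 2).sum()
    within = sum(((y[x == g] - y[x == g].mean()) ** 2).sum() for g in numpy.unique(x))
    return 1 - within / total

# Standardize by hand: zero mean, unit (population) standard deviation.
standardized = (fare - fare.mean()) / fare.std()

before = correlation_ratio(survived, fare)
after = correlation_ratio(survived, standardized)

# The two values agree up to floating-point error.
print(before, after)
```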

##############################
# Data preprocessing
# Standardize numeric columns
##############################
from sklearn.preprocessing import StandardScaler

# Standardize Pclass, Sex and Fare (zero mean, unit variance)
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Sex', 'Fare']]), columns=['Pclass', 'Sex', 'Fare'])

# Replace Fare with its standardized value
df['Fare'] = df_std['Fare']

######################################################
# Data analysis 6
# Examine the correlation between Survived and Fare (ratio scale)
######################################################
CorrelationV(df['Survived'], df['Fare'])
0.06620664646184327

"No correlation". Let's graph the crosstab. Left as it is, there would be too many distinct values, so group them into bins of width 0.2.

######################################
# Bin the standardized fare in steps of 0.2
#         x < -0.8 ⇒ -1.0
# -0.8 <= x < -0.6 ⇒ -0.8
# -0.6 <= x < -0.4 ⇒ -0.6
# -0.4 <= x < -0.2 ⇒ -0.4
# -0.2 <= x <  0   ⇒ -0.2
#  0   <= x <  0.2 ⇒  0.0
#  0.2 <= x <  0.4 ⇒  0.2
#  0.4 <= x <  0.6 ⇒  0.4
#  0.6 <= x <  0.8 ⇒  0.6
#  0.8 <= x <  1.0 ⇒  0.8
#  1.0 <= x        ⇒  1.0
######################################
def one_fifth(x):
    if  x < -0.8:
        return -1.0
    elif -0.8 <= x and x < -0.6:
        return -0.8
    elif -0.6 <= x and x < -0.4:
        return -0.6
    elif -0.4 <= x and x < -0.2:
        return -0.4
    elif -0.2 <= x and x < 0:
        return -0.2
    elif 0 <= x and x < 0.2:
        return 0.0
    elif 0.2 <= x and x < 0.4:
        return 0.2
    elif 0.4 <= x and x < 0.6:
        return 0.4
    elif 0.6 <= x and x < 0.8:
        return 0.6
    elif 0.8 <= x and x < 1.0:
        return 0.8
    else:
        return 1.0

df['Fare_convert'] = df['Fare'].apply(one_fifth)
cross_fare = pandas.crosstab(df['Survived'], df['Fare_convert'])
cross_fare.T.plot(kind='bar', stacked=True)
plt.show()

20191225_01.png

As the fare goes up, the share of survival "1" appears to increase. The coefficient is low, but there may be a correlation.
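
The same binning can be written more compactly with pandas.cut. This is an alternative sketch, not the code used for the graph above: the edges and labels mirror the half-open ranges of one_fifth, and the input values are made up.

```python
import numpy
import pandas

# Bin edges and the label given to each half-open range [edge, next edge).
edges = [-numpy.inf, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, numpy.inf]
labels = [-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

# Made-up standardized fares, including values that sit exactly on an edge.
fare_std = pandas.Series([-0.9, -0.8, 0.0, 0.19, 0.6, 1.0, 5.0])

# right=False makes each bin include its left edge, like the if-chain.
binned = pandas.cut(fare_std, bins=edges, labels=labels, right=False).astype(float)
print(binned.tolist())
```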

Cabin (room number)

The room number will also be skipped this time. I would like to come back to it in the "Observing data" section that I will cover separately.

Embarked (port of embarkation)

The port of embarkation is a "nominal scale". Check it with Cramér's coefficient of association.

######################################################
# Data analysis 7
# Examine the correlation between Survived and Embarked (nominal scale)
######################################################
cramersV(df['Survived'], df['Embarked'])
0.18248384812341217

Close, but at 0.18 it does not reach the 0.25 guideline for a "strong correlation"; it is at most a "slightly weak correlation". Let's graph the crosstab.

cross_embarked = pandas.crosstab(df['Survived'], df['Embarked'])
cross_embarked.T.plot(kind='bar', stacked=True)
plt.show()

20191225_02.png

It is hard to tell from the graph whether there is a correlation or not.

5. Result

The variables that show a correlation are Pclass (ticket class) and Sex (gender). Fare does not reach the guideline value either, but the graph suggests there is some correlation.
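
For reference, the coefficients reported in section 4 can be lined up in one place. The sketch below only sorts the numbers from this article; note that Sex and Embarked were measured with Cramér's coefficient of association and the rest with the correlation ratio, so the ranking is a rough guide rather than a like-for-like comparison.

```python
import pandas

# Coefficients as reported in section 4 of this article.
reported = pandas.Series({
    "Pclass": 0.11456941170524215,
    "Sex": 0.5433513740027712,
    "Age": 0.0001547299039139638,
    "SibSp": 0.0012476789275327471,
    "Parch": 0.006663360100801152,
    "Fare": 0.06620664646184327,
    "Embarked": 0.18248384812341217,
})

# Largest coefficient first.
ranked = reported.sort_values(ascending=False)
print(ranked)
```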

6. Summary

Based on the correlation coefficients and the crosstab graphs, I would like to use Pclass (ticket class), Sex (gender), and Fare (fare) as input parameters. Next comes model selection, but that is for next time.

References

Calculate the relationship between variables of various scales (Python) https://qiita.com/shngt/items/45da2d30acf9e84924b7

Calculating Cramér's coefficient of association https://qiita.com/canard0328/items/5ea4115d964b448903ba

History

2019/12/25 First edition released
2019/12/29 Added link to the next article
