[Python] Predicting Kaggle's Hello World, Titanic Survivors, with Logistic Regression - Modeling -

0. Introduction

- Background: I finally registered with Kaggle, so I gave it a try. (Better late than never.)
- Purpose: To introduce the flow of preparing, analyzing, modeling, predicting, and evaluating the results. (This article goes up to modeling; the prediction part continues at https://qiita.com/anWest/items/cad88fe7b6f26fe86480.) I will skip the mathematical details, validity, and rigor for now (I intend to write a separate commentary on logistic regression).
- Environment: Kaggle Kernel Notebook

1. Preparing for data analysis

- Get the Titanic survivor data: once you have signed up for Kaggle, get the data from the example competition.
- Create a Notebook with the Kaggle Kernel: click "New Notebook" on the competition page.

- Read the data: when you create a Notebook from the competition page, the first cell is pre-filled, so just execute it.
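For reference, the auto-generated first cell looks roughly like the following (the exact contents depend on when the Notebook was created, so treat this as a sketch):

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

import os
# List the input files available to this kernel
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))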

- Read the csv data and store it as a dataframe

train_csv = pd.read_csv('../input/titanic/train.csv', sep=',')
train_csv.head()

[Output: the first five rows of train_csv]

--Examine the data summary

#Data dimension
train_csv.shape

#Output result
(891, 12)

It contains data for 891 passengers across 12 columns.


#Number of missing values in each column
train_csv.isnull().sum()

#Output result
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

There are many missing values for Age and Cabin.
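To see those counts as proportions of the 891 rows, one extra line (not in the original flow) helps:

#Fraction of missing values per column
train_csv.isnull().mean()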

2. Analyze the data and select explanatory variables

The details of the data are described on the competition page, so I will omit them. The task is to predict who survived from the attributes of the Titanic's passengers. There are several passenger attributes, and we will decide which ones to use, and how, while analyzing the data.

2.0. Check the objective variable

Whether or not a passenger survived is stored in Survived as 0/1: 0 means death, 1 means survival. How many people survived?

#Survival probability
train_csv['Survived'].mean()

#Output result
 0.3838383838383838

It seems that a little less than 40% of people survived.

train_csv['Survived'].value_counts()

#Output result
0    549
1    342
Name: Survived, dtype: int64

In terms of headcount, it breaks down like this.

Now let's look at the relationship between each attribute and Survived.

2.1. Pclass (ticket class)

Pclass contains integer values from 1 to 3; the smaller the number, the better the class. Judging from the movie, upper-class passengers were given priority on the lifeboats. Let's cross-tabulate it with Survived and look at the impact (*1).

pd.crosstab(train_csv['Pclass'], train_csv['Survived'])

#Output result
Survived    0    1
Pclass
1          80  136
2          97   87
3         372  119

The mortality rate of 3rd-class passengers is overwhelmingly higher than in the other classes; conversely, 1st-class passengers have a high survival rate. The better the class, the higher the survival rate, so this time I will use Pclass in the model as it is.
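To express the same relationship as rates rather than counts, a quick extra check (not in the original flow) could be:

#Survival rate by ticket class
train_csv.groupby('Pclass')['Survived'].mean()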

2.2. Name

It may be related to survival, but I won't use it this time because it's difficult to explain the cause and effect. (Extracting just the title might yield something.)

2.3. Sex

Sex contains the raw strings male and female. In an emergency, women's survival tends to be prioritized, so presumably more women survived than men here as well. Cross-tabulate with Survived.

pd.crosstab(train_csv['Sex'], train_csv['Survived'])

#Output result
Survived    0    1
Sex
female     81  233
male      468  109

Most of the women survived, while few of the men did. So I will make a female dummy and put it in the model.

2.4. Age

Age is numeric, but it has some missing values. At the analysis stage, the missing values are excluded from the comparison. Generally speaking, younger people should be more likely to survive, so let's compare the average ages of the deceased and the surviving passengers (*2).

#Average age of deaths
train_csv[train_csv['Age'].notnull() & (train_csv['Survived'] == 0)].Age.mean()

#Output result
30.62617924528302

#Average age of survivors
train_csv[train_csv['Age'].notnull() & (train_csv['Survived'] == 1)].Age.mean()

#Output result
28.343689655172415

The survivors are younger. However, as it stands, the missing values mean Age cannot go into the model. The quickest and most popular fix is to impute missing values with a representative value (the mean or the median). This time, the missing values are filled with the overall mean and Age is included in the model.
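As a quick check, the candidate representative values can be compared side by side (a small extra step; which one to use is a judgment call):

#Candidate representative values for Age
print(train_csv['Age'].mean())    #overall mean (used for imputation below)
print(train_csv['Age'].median())  #median, a robust alternative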

2.5. SibSp and Parch

SibSp holds the number of siblings and spouses aboard, and Parch the number of parents and children. People with larger families seem more likely to be found, so my hunch is that they were more likely to survive. Let's draw histograms of SibSp / Parch for the survivors and the deceased.

import matplotlib.pyplot as plt

# SibSp
plt.hist([train_csv[train_csv['Survived']==0].SibSp, train_csv[train_csv['Survived']==1].SibSp], label=['Died', 'Survived'])
plt.legend()

[Histogram of SibSp for the deceased vs. the survivors]

#Parch
plt.hist([train_csv[train_csv['Survived']==0].Parch, train_csv[train_csv['Survived']==1].Parch], label=['Died', 'Survived'])
plt.legend()

[Histogram of Parch for the deceased vs. the survivors]

In both cases, deaths dominate at 0, while at 1 or more the survivors equal or outnumber the deaths. A painful reality for those traveling alone. Since the distributions of SibSp and Parch are similar, I will use Parch as the representative this time (putting similar variables in together invites multicollinearity). Parch is turned into a dummy variable, 0 vs. 1 or more, and put into the model.
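One way to back up the "similar variables" intuition (an extra check, not in the original article) is to look at their correlation:

#Correlation between SibSp and Parch
train_csv[['SibSp', 'Parch']].corr()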

2.6. Ticket

It may be related, but I don't really understand what the numbers and letters mean, so I can't explain what effect it would have. I won't use it this time.

2.7. Fare

Fare is a numeric amount. Presumably, the more a passenger paid, the more likely they were to survive.

#Average fare of the deceased
train_csv[train_csv['Survived']==0].Fare.mean()

#Output result
22.117886885245877

#Average fare of survivors
train_csv[train_csv['Survived']==1].Fare.mean()

#Output result
48.39540760233917

There was a big difference. Let's also look at the distribution.

plt.hist([train_csv[train_csv['Survived']==0].Fare, train_csv[train_csv['Survived']==1].Fare], label=['Died', 'Survived'], bins=20)
plt.legend()

[Histogram of Fare for the deceased vs. the survivors]

It's a little hard to see, but there do seem to be relatively more survivors at higher fares. So I will put Fare into the model as it is.

2.8. Cabin

Since there are too many missing values, I decided not to use it this time.

2.9. Embarked (boarding place)

The port of embarkation is recorded as one of three letters. Differences in health and economic conditions, race, and boarding time specific to the place of residence may affect survival rates. There are missing values, but only two, so I will ignore them.

pd.crosstab(train_csv['Embarked'], train_csv['Survived'])

#Output result
Survived    0    1
Embarked
C          75   93
Q          47   30
S         427  217

The death toll among passengers who boarded at S (Southampton) stands out. Let's make an S dummy and put it in the model.

Why did so many of the Southampton passengers die? If anyone knows, please let me know.

3. Modeling with logistic regression

Let's model based on the analysis result.

3.1. Data processing

The modeling itself is easy, but the variables need some processing first. Like this:

train = train_csv.copy() #Work on a copy of the raw data
train = train.drop(['Name', 'SibSp', 'Ticket', 'Cabin'], axis=1) #Drop unused columns

#Make a female dummy
train['Female'] = train['Sex'].map(lambda x: 0 if x == 'male' else 1).astype(int)

#Fill missing Age values with the mean
train['Age'].fillna(train['Age'].mean(), inplace=True)

#Turn Parch into a dummy: 0 when Parch == 0, 1 when Parch >= 1
train['Parch_d'] = train['Parch'].map(lambda x: 0 if x == 0 else 1).astype(int)

#Turn Embarked into an S / other dummy
#Note: as written, this is 0 for S and 1 for everything else (a "non-S" dummy)
train['Embarked_S'] = train['Embarked'].map(lambda x: 0 if x == 'S' else 1).astype(int)
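Before modeling, it is worth a quick sanity check that no missing values remain in the feature columns (an extra step, not in the original):

#Confirm the features used below have no missing values
train[['Pclass', 'Age', 'Parch_d', 'Fare', 'Female', 'Embarked_S']].isnull().sum()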

3.2. Modeling

Then model.

#Model generation
from sklearn.linear_model import LogisticRegression
X = train[['Pclass', 'Age', 'Parch_d', 'Fare', 'Female', 'Embarked_S']]
y = train['Survived']

model = LogisticRegression()
result = model.fit(X, y)
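As a quick smoke test that the fitted model behaves sensibly (an extra step, not in the original), one can look at the predicted survival probabilities for a few passengers:

#Predicted probabilities (P(died), P(survived)) for the first five passengers
model.predict_proba(X[:5])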

3.3. Modeling results

First, check what kind of model the model has become.

#coefficient
result.coef_
#Output result
array([[-1.02255162e+00, -2.89166539e-02, -7.14935760e-02,
         1.19056911e-03,  2.49662371e+00, -4.29002495e-01]])

#Intercept
result.intercept_
#Output result
array([1.98119965])

Female (female dummy) seems to have the most influence.
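The raw array is hard to read, so pairing each coefficient with its column name helps (a small convenience, not in the original):

#Label each coefficient with its feature name
pd.Series(result.coef_[0], index=X.columns)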

Next, let's see whether the created model is any good (*3). First, let's check the model's score.

model.score(X, y)

#Output result
0.792368125701459

It doesn't look bad. For LogisticRegression, score() returns the accuracy: the fraction of passengers whose survival the model predicts correctly (the value above is on the training data). It ranges from 0 to 1, and the closer it is to 1, the better the model fits.
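To make explicit what score() computes, the same number can be reproduced by hand (a minimal sketch):

from sklearn.metrics import accuracy_score

pred = model.predict(X)  #predicted 0/1 labels on the training data
accuracy_score(y, pred)  #same value as model.score(X, y)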

Now let's predict and check against the test data. Continue to the prediction article: → https://qiita.com/anWest/items/cad88fe7b6f26fe86480

4. Remarks

- *1: Strictly speaking, a chi-square test is needed.
- *2: Ideally, a t-test and an effect size should also be computed.
- *3: There are various metrics for evaluating a model, and they need to be chosen according to the purpose.
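For reference, the tests mentioned in *1 and *2 might be run like this (a sketch using scipy, not part of the original analysis; it assumes the imports and train_csv from above):

from scipy import stats

# *1: chi-square test of independence on the Pclass x Survived cross-tabulation
chi2, p, dof, expected = stats.chi2_contingency(
    pd.crosstab(train_csv['Pclass'], train_csv['Survived']))
print(chi2, p)

# *2: Welch's t-test on the ages of the deceased vs. the survivors
ages = train_csv.dropna(subset=['Age'])
print(stats.ttest_ind(ages[ages['Survived'] == 0]['Age'],
                      ages[ages['Survived'] == 1]['Age'],
                      equal_var=False))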
