[PYTHON] <Course> Machine Learning Chapter 3: Logistic Regression Model

Machine learning

Table of contents
- Chapter 1: Linear Regression Model
- [Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
- [Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
- [Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
- [Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
- [Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
- [Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)

Chapter 3: Logistic Regression Model

Description of logistic regression model

Input (an m-dimensional vector of explanatory variables):

   x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m

Output (a binary class label):

   y \in \{0, 1\}
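The definitions above only give the input and output spaces. As a minimal sketch, assuming the standard formulation in which logistic regression passes the linear score w^T x + b through the sigmoid function to obtain P(y = 1 | x), the model can be written in a few lines. The names `sigmoid` and `predict_proba` and the weights `w`, `b` are hypothetical, chosen only for illustration; they are not fitted values.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x) for a logistic regression model with weights w and bias b."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical weights for m = 2 input features (illustrative values only)
w = np.array([0.5, -1.0])
b = 0.0
x = np.array([2.0, 1.0])    # linear score: 0.5*2 - 1.0*1 + 0 = 0

p = predict_proba(x, w, b)
print(p)                    # 0.5, because sigmoid(0) = 0.5
label = int(p >= 0.5)       # threshold at 0.5 to obtain y in {0, 1}
print(label)
```

Thresholding the probability at 0.5 is the usual default; a different cutoff can be used when the two kinds of error have different costs.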


(Practice 3) Predict the survival rate of a 30-year-old man using the Titanic dataset

Google drive mount

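from google.colab import drive
#Mount Google Drive at /content/drive (the path prefix used when reading the CSV file)
drive.mount('/content/drive')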

0. Data display

#Import syntax: from <module name> import <class, function, or variable name>
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Magic command for displaying matplotlib plots inline (no need to call plt.show())
%matplotlib inline

The code below assumes a study_ai_ml folder placed directly under My Drive in Google Drive.

#Read titanic data csv file
titanic_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/titanic_train.csv')
#View the first rows of the file to check the dataset
titanic_df.head()

I examined the meaning of variables.

- PassengerId: Passenger ID
- Survived: Survival result (1: survived, 0: died)
- Pclass: Passenger class (1 is the highest class)
- Name: Passenger's name
- Sex: Gender
- Age: Age
- SibSp: Number of siblings and spouses aboard
- Parch: Number of parents and children aboard
- Ticket: Ticket number
- Fare: Fare paid
- Cabin: Cabin number
- Embarked: Port of embarkation (Cherbourg, Queenstown, Southampton)

1. Logistic regression

Delete unnecessary data / complement missing values

#Drop the columns that seem unnecessary for prediction
titanic_df.drop(['PassengerId','Pclass', 'Name', 'SibSp','Parch','Ticket','Fare','Cabin','Embarked'], axis=1, inplace=True)

#Display the data after dropping those columns
titanic_df.head()


#Show rows containing null values
titanic_df[titanic_df.isnull().any(axis=1)].head(10)


#Fill nulls in the Age column with the mean, storing the result in a new AgeFill column

titanic_df['AgeFill'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
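To make the effect of the fill step concrete, here is a small sketch on a toy DataFrame. The ages are made-up values, not Titanic rows, and the name `toy` is hypothetical; the point is only that `mean()` skips NaN and `fillna` leaves the original column untouched.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (made-up values, not Titanic data)
toy = pd.DataFrame({"Age": [20.0, np.nan, 30.0, 40.0, np.nan]})

mean_age = toy["Age"].mean()           # NaN is skipped: (20 + 30 + 40) / 3 = 30.0
toy["AgeFill"] = toy["Age"].fillna(mean_age)

print(toy)
print(toy["AgeFill"].isnull().sum())   # 0: no missing values remain in AgeFill
print(toy["Age"].isnull().sum())       # 2: the original Age column is unchanged
```

If a fill value that is less sensitive to outliers is preferred, `toy["Age"].median()` could be used in place of the mean.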

#Show rows containing null again (AgeFill is now filled in where Age is null)
titanic_df[titanic_df.isnull().any(axis=1)]



1. Logistic regression

Implementation (determine life or death from gender and age)

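#AgeFill now holds the filled values, so Age could be dropped here
#(left commented out because the Age column is plotted again later)
#titanic_df = titanic_df.drop(['Age'], axis=1)
#Encode Sex as a numeric Gender column: female = 0, male = 1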
titanic_df['Gender'] = titanic_df['Sex'].map({'female': 0, 'male': 1}).astype(int)


Let's draw the distribution of life and death by gender and age

#Fix the random seed so the jitter in the plot is reproducible
np.random.seed(0)

xmin, xmax = -5, 85
ymin, ymax = -0.5, 1.3

#Row indices of passengers who died (Survived == 0) and who survived (Survived == 1)
index_notsurvived = titanic_df[titanic_df["Survived"]==0].index
index_survived = titanic_df[titanic_df["Survived"]==1].index

from matplotlib.colors import ListedColormap
fig, ax = plt.subplots()
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
#Add a small random vertical jitter to Gender so overlapping points stay visible
sc = ax.scatter(titanic_df.loc[index_notsurvived, 'AgeFill'],
                titanic_df.loc[index_notsurvived, 'Gender']+(np.random.rand(len(index_notsurvived))-0.5)*0.1,
                color='r', label='Not Survived', alpha=0.3)
sc = ax.scatter(titanic_df.loc[index_survived, 'AgeFill'],
                titanic_df.loc[index_survived, 'Gender']+(np.random.rand(len(index_survived))-0.5)*0.1,
                color='b', label='Survived', alpha=0.3)
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.legend(bbox_to_anchor=(1.4, 1.03))


On the vertical axis 1 is male and 0 is female; red points are passengers who died and blue points are passengers who survived. The plot shows that a relatively large share of the women survived.
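The visual impression from the scatter plot can also be checked numerically. The sketch below uses a toy DataFrame with made-up rows (not the Titanic file) and the hypothetical name `rate_by_gender`; it relies on the fact that the mean of a 0/1 column equals the fraction of 1s.

```python
import pandas as pd

# Toy data (made-up rows, not the Titanic file): Gender 0 = female, 1 = male
toy = pd.DataFrame({
    "Gender":   [0, 0, 0, 0, 1, 1, 1, 1],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate per group
rate_by_gender = toy.groupby("Gender")["Survived"].mean()
print(rate_by_gender)
```

Applied to the real data, the same pattern would be `titanic_df.groupby("Gender")["Survived"].mean()`.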

#Create a list of age and gender only
data2 = titanic_df.loc[:, ["AgeFill", "Gender"]].values


array([[22.        ,  1.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [29.69911765,  0.        ],
       [26.        ,  1.        ],
       [32.        ,  1.        ]])

Let's make a survival graph by age

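#Split the data into two DataFrames: Survived == 0 and Survived == 1
split_data = []
for survived in [0,1]:
    split_data.append(titanic_df[titanic_df["Survived"] == survived])

#Stacked histogram of age (AgeFill) for the two groups
temp = [i["AgeFill"].dropna() for i in split_data ]
plt.hist(temp, histtype="barstacked", bins=16)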


Because the missing ages were filled with the mean, the bin containing the mean is inflated. Let's draw the graph again using only the rows where Age is actually recorded.

temp = [i["Age"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)


Check the survival of men and women with a stacked histogram

temp = [i["Gender"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)

The result is shown in the figure (LOG8.jpg).

1. Logistic regression

Implementation (determines life or death from 2 variables)

#Create an array holding only the survival flag
label2 =  titanic_df.loc[:,["Survived"]].values
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression()
model2.fit(data2, label2)


/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,

Predict a 30-year-old man

#Feature order matches data2: [AgeFill, Gender], with male = 1
model2.predict([[30, 1]])







The prediction returned is 0 (death).



Check the predicted probabilities behind that decision

model2.predict_proba([[30, 1]])



array([[0.80664059, 0.19335941]])

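The first value is the probability of class 0 (death) and the second is the probability of class 1 (survival): the model estimates about an 81% chance of death and a 19% chance of survival.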
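For a binary model, these probabilities come from applying the sigmoid function to the linear score built from the fitted coefficients. The sketch below reproduces `predict_proba` by hand on synthetic stand-in data; the data-generating rule and the names `X`, `y`, `x_new`, and `manual` are made up for illustration, so the printed numbers will not match the Titanic result above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for [AgeFill, Gender] and Survived (not the Titanic data)
rng = np.random.RandomState(0)
X = np.column_stack([rng.uniform(0, 80, 300), rng.randint(0, 2, 300)])
# Made-up rule with noise: gender 0 survives more often than gender 1
y = (rng.rand(300) < np.where(X[:, 1] == 0, 0.75, 0.2)).astype(int)

model = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X, y)

x_new = np.array([[30, 1]])                        # 30 years old, gender = 1
z = x_new @ model.coef_.T + model.intercept_       # linear score w.x + b
p_survive = 1.0 / (1.0 + np.exp(-z))               # sigmoid gives P(class 1)
manual = np.hstack([1.0 - p_survive, p_survive])   # columns: [P(0), P(1)]

print(manual)
print(model.predict_proba(x_new))                  # agrees with the manual calculation
```

The column order of `predict_proba` follows `model.classes_`, which is `[0, 1]` here, so the first column is the probability of death and the second the probability of survival.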

