<Course> Machine Learning Chapter 3: Logistic Regression Model


Table of contents

- Chapter 1: Linear Regression Model
- [Chapter 2: Nonlinear Regression Model](https://qiita.com/matsukura04583/items/baa3f2269537036abc57)
- [Chapter 3: Logistic Regression Model](https://qiita.com/matsukura04583/items/0fb73183e4a7a6f06aa5)
- [Chapter 4: Principal Component Analysis](https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c)
- [Chapter 5: Algorithm 1 (k-nearest neighbor method (kNN))](https://qiita.com/matsukura04583/items/543719b44159322221ed)
- [Chapter 6: Algorithm 2 (k-means)](https://qiita.com/matsukura04583/items/050c98c7bb1c9e91be71)
- [Chapter 7: Support Vector Machine](https://qiita.com/matsukura04583/items/6b718642bcbf97ae2ca8)

Chapter 3: Logistic Regression Model

Description of logistic regression model

x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m
y \in \left\{0, 1\right\}

LOG3.jpg
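Written out explicitly, logistic regression models the probability that $y = 1$ by applying the sigmoid function to a linear combination of the inputs (the weight vector $w \in \mathbb{R}^m$ and bias $w_0$ are the usual notation, assumed here):

P(y = 1 \mid x) = \sigma(w^T x + w_0) = \frac{1}{1 + \exp\left(-(w^T x + w_0)\right)}

An input whose probability exceeds 0.5 is predicted as class 1, otherwise class 0.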

(Practice 3) Predict the survival rate of a 30-year-old man using the Titanic dataset

Google drive mount

from google.colab import drive
drive.mount('/content/drive')

0. Data display

#from module name import class name (or function name or variable name)
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Magic command to display matplotlib plots inline (no need to call plt.show())
%matplotlib inline

In the following, the study_ai_ml folder is used directly under My Drive in Google Drive.

#Read titanic data csv file
titanic_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/titanic_train.csv')
#View the beginning of the file and check the dataset
titanic_df.head(5)
(Screenshot: output of titanic_df.head(5))

I examined the meaning of variables.

- PassengerId: passenger ID
- Survived: survival result (1: survived, 0: died)
- Pclass: passenger class (1 is the highest class)
- Name: passenger's name
- Sex: gender
- Age: age
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: boarding fare
- Cabin: room number
- Embarked: port of embarkation (Cherbourg, Queenstown, Southampton)

1. Logistic regression

Delete unnecessary data / complement missing values

#Drop the columns that seem unnecessary for prediction
titanic_df.drop(['PassengerId','Pclass', 'Name', 'SibSp','Parch','Ticket','Fare','Cabin','Embarked'], axis=1, inplace=True)

#Display data with some columns dropped
titanic_df.head()

LOG1.jpg

#Show rows containing null
titanic_df[titanic_df.isnull().any(axis=1)].head(10)

LOG2.jpg
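Here `isnull().any(axis=1)` (the positional form `any(1)` does the same thing) builds a boolean mask that is True for every row with at least one missing value, and indexing the DataFrame with that mask keeps only those rows. A minimal sketch with toy data (not the Titanic set):

```python
import pandas as pd
import numpy as np

# Row 0 is complete; row 1 is missing Age; row 2 is missing Sex
df = pd.DataFrame({'Age': [22.0, np.nan, 30.0],
                   'Sex': ['male', 'female', None]})

mask = df.isnull().any(axis=1)      # True for rows 1 and 2
print(df[mask].index.tolist())      # [1, 2]
```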

#Fill nulls in the Age column with the mean

titanic_df['AgeFill'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

#Show rows containing null again (Age nulls are now filled)
titanic_df[titanic_df.isnull().any(axis=1)]

#titanic_df.dtypes
#titanic_df.head()

LOG3.jpg
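Note that the code above fills the missing ages with the mean; the median is a common alternative, since it is less sensitive to outliers such as a few very old passengers. A minimal sketch with toy data (illustrative values, not the Titanic set):

```python
import pandas as pd
import numpy as np

# Toy ages with one missing value
age = pd.Series([22.0, 38.0, np.nan, 26.0, 80.0])

# Fill with the mean (what the notebook does) vs. the median
filled_mean = age.fillna(age.mean())      # mean = (22+38+26+80)/4 = 41.5
filled_median = age.fillna(age.median())  # median of [22, 26, 38, 80] = 32.0

print(filled_mean[2], filled_median[2])   # 41.5 32.0
```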

2. Logistic regression

Implementation (determine life or death from gender and age)

#The Age column could now be dropped, since its filled-in version is in AgeFill
#titanic_df = titanic_df.drop(['Age'], axis=1)
#Encode Sex as Gender: female = 0, male = 1
titanic_df['Gender'] = titanic_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
titanic_df.head()

LOG4.jpg
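The `map()` call replaces each value via the dictionary; any value not covered by the dictionary would become NaN, which is why `.astype(int)` is only safe when every value in the column is mapped. A toy sketch:

```python
import pandas as pd

sex = pd.Series(['female', 'male', 'male', 'female'])

# Every value appears in the dict, so the result has no NaN
# and can be safely cast to int
gender = sex.map({'female': 0, 'male': 1}).astype(int)
print(gender.tolist())  # [0, 1, 1, 0]
```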

Let's draw the distribution of life and death by gender and age

np.random.seed(0)  #Note: writing np.random.seed = 0 would overwrite the function instead of seeding

xmin, xmax = -5, 85
ymin, ymax = -0.5, 1.3

index_notsurvived = titanic_df[titanic_df["Survived"]==0].index
index_survived = titanic_df[titanic_df["Survived"]==1].index

from matplotlib.colors import ListedColormap
fig, ax = plt.subplots()
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
#Jitter the 0/1 Gender values slightly so overlapping points remain visible
sc = ax.scatter(titanic_df.loc[index_notsurvived, 'AgeFill'],
                titanic_df.loc[index_notsurvived, 'Gender']+(np.random.rand(len(index_notsurvived))-0.5)*0.1,
                color='r', label='Not Survived', alpha=0.3)
sc = ax.scatter(titanic_df.loc[index_survived, 'AgeFill'],
                titanic_df.loc[index_survived, 'Gender']+(np.random.rand(len(index_survived))-0.5)*0.1,
                color='b', label='Survived', alpha=0.3)
ax.set_xlabel('AgeFill')
ax.set_ylabel('Gender')
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.legend(bbox_to_anchor=(1.4, 1.03))

LOG5.jpg

Since 1 is male and 0 is female, with red for the dead and blue for the survivors, the plot shows that a relatively large share of the women survived.

#Extract only age and gender as a NumPy array
data2 = titanic_df.loc[:, ["AgeFill", "Gender"]].values
data2

result


array([[22.        ,  1.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       ...,
       [29.69911765,  0.        ],
       [26.        ,  1.        ],
       [32.        ,  1.        ]])
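The `.loc[:, [...]].values` call turns the two selected columns into an `(n_samples, 2)` NumPy array, which is the shape scikit-learn expects for the feature matrix. A toy sketch of that selection:

```python
import pandas as pd

# Miniature frame with the same column names as the notebook
df = pd.DataFrame({'AgeFill': [22.0, 38.0],
                   'Gender': [1, 0],
                   'Survived': [0, 1]})

X = df.loc[:, ['AgeFill', 'Gender']].values
print(X.shape)  # (2, 2)
```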

Let's plot survival by age as a stacked histogram

split_data = []
for survived in [0,1]:
    split_data.append(titanic_df[titanic_df.Survived==survived])

temp = [i["AgeFill"].dropna() for i in split_data ]
plt.hist(temp, histtype="barstacked", bins=16)

LOG6.jpg

Since the missing ages were filled with the mean, the bin containing the mean is inflated. Let's graph again using only the rows without missing values.

temp = [i["Age"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)

LOG7.jpg

Check the survival rate of men and women with a stacked histogram

temp = [i["Gender"].dropna() for i in split_data]
plt.hist(temp, histtype="barstacked", bins=16)

LOG8.jpg

3. Logistic regression

Implementation (predict life or death from the two variables)

#Extract only the survival flag as the label array
label2 =  titanic_df.loc[:,["Survived"]].values
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression()
model2.fit(data2, label2)

result


/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
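Both warnings above can be silenced by passing an explicit solver and flattening the labels to a 1-D array with `ravel()`. A sketch on synthetic data (not the Titanic features; the data-generating rule here is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Labels determined by the first feature, stored as a (100, 1) column vector
# to mimic the shape that triggered the DataConversionWarning
y_col = (X[:, 0] > 0).astype(int).reshape(-1, 1)

# solver='lbfgs' silences the FutureWarning; ravel() gives y shape (n_samples,)
model = LogisticRegression(solver='lbfgs')
model.fit(X, y_col.ravel())
print(model.score(X, y_col.ravel()))  # high accuracy on this separable toy data
```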

Predict a 30-year-old man

model2.predict([[30,1]])

result


array([0])

A prediction of 0 (death) is returned. Let's look at the probabilities behind that judgment.

model2.predict_proba([[30,1]])

result


array([[0.80664059, 0.19335941]])

This shows a death probability of about 80% and a survival probability of about 20%.
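Those probabilities are simply the sigmoid of the model's linear score: `predict_proba` applies the logistic function to `decision_function`, and `predict` returns the class whose probability exceeds 0.5. A sketch of that relationship on a tiny fitted model (toy data, assumed for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
m = LogisticRegression(solver='lbfgs').fit(X, y)

z = m.decision_function([[1.5]])     # the linear score w^T x + b
p1 = 1.0 / (1.0 + np.exp(-z))        # sigmoid of that score
proba = m.predict_proba([[1.5]])     # [[P(y=0), P(y=1)]]
print(np.allclose(p1, proba[0, 1]))  # True: P(y=1) is the sigmoid of the score
```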

