[PYTHON] Logistic Regression (for beginners) -Code Edition-

This time, I will summarize the implementation of logistic regression.

■ Logistic regression procedure

We will proceed with the following 6 steps.

  1. Preparation of module
  2. Data preparation
  3. Data visualization
  4. Creating a model
  5. Predict classification
  6. Model evaluation

1. Preparation of module

First, import the required modules.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Module for visualization
import seaborn as sns

#Module to read the dataset
from sklearn.datasets import load_iris

#Module for standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

#Module that separates training data and test data
from sklearn.model_selection import train_test_split

#Module to perform logistic regression
from sklearn.linear_model import LogisticRegression

#Module to evaluate classification
from sklearn.metrics import classification_report

#Modules that handle confusion matrices
from sklearn.metrics import confusion_matrix

2. Data preparation

This time, we will use the iris dataset for binary classification.

First get the data, standardize it, and then split it.


#Loading iris dataset
iris = load_iris()

#Divide into explanatory variables (X) and objective variable (y)
X, y = iris.data[:100, [0, 2]], iris.target[:100]

#Standardization (zero mean, unit variance)
std = StandardScaler()
X = std.fit_transform(X)

#Divide into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

To make this a binary classification problem, we use only the first 100 rows of the dataset (Setosa and Versicolor). We also narrow the explanatory variables down to two (Sepal Length and Petal Length) so that the data is easy to plot.
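As a quick check of the slicing described above, the following sketch confirms which classes and features remain. It only uses attributes of the object returned by load_iris (target_names, feature_names); nothing here is specific to this article beyond the slice itself.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Same slice as in the article: first 100 rows, columns 0 and 2
X, y = iris.data[:100, [0, 2]], iris.target[:100]

# Which classes are left, and how many samples of each
print(iris.target_names[np.unique(y)])   # ['setosa' 'versicolor']
print(np.bincount(y))                    # [50 50]

# Which features were kept
print([iris.feature_names[i] for i in [0, 2]])
print(X.shape)                           # (100, 2)
```

The iris rows are ordered by class (50 rows per species), which is why taking the first 100 rows leaves exactly two classes.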

Standardization matters when the features (explanatory variables) have very different scales. For example, if one feature takes 2-digit values and another takes 4-digit values, the influence of the latter becomes large. Standardization aligns the scales by transforming every feature to mean 0 and variance 1.
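To see what standardization does, here is a minimal sketch that reproduces StandardScaler by hand with NumPy and checks the resulting mean and standard deviation. The names X_raw, X_manual, and X_scaled are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Same slice as in the article: first 100 rows, sepal length and petal length
iris = load_iris()
X_raw = iris.data[:100, [0, 2]]

# By hand: subtract each column's mean, divide by each column's standard deviation
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# With scikit-learn
X_scaled = StandardScaler().fit_transform(X_raw)

# Both approaches should agree up to floating-point error
print(np.allclose(X_manual, X_scaled))

# After standardization each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Both np.std and StandardScaler use the population standard deviation (dividing by n, not n - 1), which is why the two results match.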

random_state fixes the seed value so that the data is split the same way every time the code is run.
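The effect of random_state can be checked with a small sketch: splitting twice with the same seed should give identical arrays. The names split_a, split_b, and same are hypothetical and exist only for this check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]

# Split twice with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=123)
split_b = train_test_split(X, y, test_size=0.2, random_state=123)

# Every returned array (X_train, X_test, y_train, y_test) should be identical
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

# test_size=0.2 on 100 samples gives 80 training rows and 20 test rows
print(split_a[0].shape, split_a[1].shape)   # (80, 2) (20, 2)
```

If random_state is omitted, each run shuffles differently, so the predictions and scores later in the article would change from run to run.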

3. Data visualization

Let's plot the data before classification by logistic regression.


#Creating drawing objects and subplots
fig, ax = plt.subplots()

#Setosa plot
ax.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 
           marker = 'o', label = 'Setosa')

#Versicolor plot
ax.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1],
           marker = 'x', label = 'Versicolor')

#Axis label settings
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Petal Length')

#Legend settings
ax.legend(loc = 'best')

plt.show()

For Setosa, we select the rows where y_train == 0 and plot feature 0 (Sepal Length) on the horizontal axis and feature 1 (Petal Length) on the vertical axis. Versicolor (y_train == 1) is plotted in the same way. Because the data has been standardized, both axes show standardized values.
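The indexing used in the plot, such as X_train[y_train == 0, 0], combines a boolean mask for the rows with a column index. Here is a minimal sketch on a tiny hand-made array; X_demo and y_demo are illustrative values, not the article's training data.

```python
import numpy as np

# Four illustrative samples: [sepal length, petal length]
X_demo = np.array([[5.1, 1.4],
                   [7.0, 4.7],
                   [4.9, 1.4],
                   [6.4, 4.5]])
y_demo = np.array([0, 1, 0, 1])   # 0: Setosa, 1: Versicolor

# Boolean mask: True where the label is 0
mask = (y_demo == 0)
print(mask)                       # [ True False  True False]

# Rows selected by the mask, column 0 (sepal length)
setosa_sepal = X_demo[mask, 0]
print(setosa_sepal)               # [5.1 4.9]

# Same rows, column 1 (petal length)
setosa_petal = X_demo[mask, 1]
print(setosa_petal)               # [1.4 1.4]
```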


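Output result
[Figure: scatter plot of the training data, Sepal Length (horizontal axis) vs. Petal Length (vertical axis), with Setosa and Versicolor shown as separate markers]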

4. Creating a model

Create an instance of LogisticRegression and fit it to the training data.


#Create an instance
logreg = LogisticRegression()

#Create a model from training data
logreg.fit(X_train, y_train)
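After fitting, the learned parameters can be inspected through the coef_ and intercept_ attributes. The sketch below rebuilds the data as in step 2 and then computes two points on the decision boundary; the names w0, w1, b, and boundary_points are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Rebuild the data as in step 2
iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Learned parameters: one weight per feature plus an intercept
w0, w1 = logreg.coef_[0]
b = logreg.intercept_[0]
print('weights:', w0, w1, 'intercept:', b)

# Points on the decision boundary satisfy w0*x0 + w1*x1 + b = 0
x0 = np.array([-2.0, 2.0])            # two illustrative standardized sepal-length values
x1 = -(w0 * x0 + b) / w1              # solve the boundary equation for x1
boundary_points = np.column_stack([x0, x1])

# decision_function returns w0*x0 + w1*x1 + b, so this should be approximately zero
print(logreg.decision_function(boundary_points))
```

Drawing a line through boundary_points on the scatter plot from step 3 (for example with ax.plot) would show where the model separates the two classes.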

5. Predict classification

Now that the model is complete, we first predict the probability of classification.

#Predict the probability of classification (column 1 = probability of class 1, Versicolor)
y_proba = logreg.predict_proba(X_test)[:, 1]
print('y_proba:', y_proba)


Output result


y_proba: [0.02210131 0.99309888 0.95032727 0.04834431 0.99302674 0.04389388
 0.10540851 0.99718459 0.90218405 0.03983599 0.08000775 0.99280579
 0.99721384 0.78408501 0.08947531 0.01793823 0.99798469 0.01793823
 0.99429762 0.9920454 ]

The probability is the output of the sigmoid function, which is a number in the range 0 to 1. The closer it is to 0, the higher the probability of Setosa, and the closer it is to 1, the higher the probability of Versicolor.

\sigma(z)=\frac{1}{1+\exp(-z)}
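To connect the formula to the code, the sketch below applies a hand-written sigmoid to the linear score z returned by decision_function and compares the result with predict_proba. This assumes the default LogisticRegression settings on a two-class problem, as used in this article; the sigmoid helper and y_proba_manual are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Rebuild the data and model as in steps 2 and 4
iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
logreg = LogisticRegression().fit(X_train, y_train)


def sigmoid(z):
    """Sigmoid function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# z is the linear score w . x + b for each test sample
z = logreg.decision_function(X_test)
y_proba_manual = sigmoid(z)

# Probability of class 1 (Versicolor) as returned by scikit-learn
y_proba = logreg.predict_proba(X_test)[:, 1]

print(np.allclose(y_proba_manual, y_proba))
```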

Next, let's predict the result of classification.


#Predict classification results
y_pred = logreg.predict(X_test)
print('y_pred:', y_pred)


Output result


y_pred: [0 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1]

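The predicted class is obtained by thresholding the probability output by the sigmoid function earlier: values below 0.5 are classified as 0: Setosa, and values above 0.5 are classified as 1: Versicolor. The cross-entropy error function is not applied at prediction time. It is the loss that is minimized when the model is trained with fit (scikit-learn also adds an L2 regularization term by default). For a single sample with label y and predicted probability p(x,w), it is:

L(w)=-\left[y\log(p(x,w))+(1-y)\log(1-p(x,w))\right]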
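The two ideas in this step can be sketched in code: the class is obtained by thresholding the probability at 0.5, and the cross-entropy (with a leading minus sign, averaged over samples) can be computed by hand and compared with sklearn.metrics.log_loss. The names y_pred_manual, ce_manual, and eps are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Rebuild the data and model as in steps 2 and 4
iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
logreg = LogisticRegression().fit(X_train, y_train)

y_proba = logreg.predict_proba(X_test)[:, 1]

# Classification: probability above 0.5 -> class 1 (Versicolor), otherwise class 0 (Setosa)
y_pred_manual = (y_proba > 0.5).astype(int)
print(np.array_equal(y_pred_manual, logreg.predict(X_test)))

# Cross-entropy on the test set, averaged over samples
eps = 1e-15                                  # guard against log(0)
p = np.clip(y_proba, eps, 1 - eps)
ce_manual = -np.mean(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))

# Should agree with scikit-learn's log_loss
print(ce_manual, log_loss(y_test, y_proba))
```

A lower cross-entropy means the predicted probabilities are closer to the true labels.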

6. Model evaluation

Since this is a classification problem (binary classification), we evaluate the model using a confusion matrix.


#Create a confusion matrix
classes = [1, 0]
cm = confusion_matrix(y_test, y_pred, labels=classes)

#Data frame
cmdf = pd.DataFrame(cm, index=classes, columns=classes)

#Plot the confusion matrix
sns.heatmap(cmdf, annot=True)


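Output result
[Figure: heatmap of the confusion matrix, with rows (true labels) and columns (predicted labels) in the order 1, 0]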

Next, compute the numerical values of the evaluation metrics.


#Output precision, recall, and F1 score
print(classification_report(y_test, y_pred))


Output result
[Figure: classification report showing precision, recall, and F1 score for each class]
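The metrics in the classification report can be derived directly from the confusion matrix. The following sketch does this for class 1 (Versicolor) and compares the results with scikit-learn's precision_score, recall_score, and f1_score; the names tp, fn, fp, and tn are introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Rebuild the data, model, and predictions as in steps 2 to 5
iris = load_iris()
X, y = iris.data[:100, [0, 2]], iris.target[:100]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)
logreg = LogisticRegression().fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# With labels=[1, 0], rows are true labels and columns are predictions, both in the order 1, 0
cm = confusion_matrix(y_test, y_pred, labels=[1, 0])
tp, fn = cm[0]   # true 1: predicted 1 (TP), predicted 0 (FN)
fp, tn = cm[1]   # true 0: predicted 1 (FP), predicted 0 (TN)

precision = tp / (tp + fp)   # of the samples predicted as 1, how many are really 1
recall = tp / (tp + fn)      # of the samples that are really 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_test, y_pred),
      recall_score(y_test, y_pred),
      f1_score(y_test, y_pred))
```

Listing the classes as [1, 0] puts the positive class first, so the top-left cell of the heatmap is the number of true positives.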

With the above, we have evaluated how well the model classifies Setosa and Versicolor.

■ Finally

With logistic regression, a model is created and evaluated by following steps 1 to 6 above.

This time, for beginners, I have summarized only the implementation (code). When I find the time, I would like to write an article about the theory (mathematical formulas).

Thank you for reading.
