[Python] I thoroughly explained the theory and implementation of logistic regression

Introduction

In this article, we will implement logistic regression using tensorflow and scikit-learn.

I summarized linear regression in the previous article, so please see the following article if you like.

Try multiple regression analysis using scikit-learn, tensorflow and keras in Python

This time we will use the iris dataset.

About the iris dataset

The iris dataset contains measurements of a flower called the iris.

There are 50 samples for each of the three iris varieties, setosa, versicolor, and virginica, for a total of 150 samples.

Let's actually look at the contents.

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(iris_df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Since the column names are stored in iris.feature_names, passing them to the columns argument of pandas' DataFrame produces the output above.

sepal length holds the length of the sepal, sepal width the width of the sepal, petal length the length of the petal, and petal width the width of the petal.

You can display the correct labels as shown below.

print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

In this way, the iris varieties setosa, versicolor, and virginica are set to 0, 1, and 2, respectively.

This time, we will delete the data of virginica because we will perform binary classification by logistic regression.

We will then create a model that predicts, from the explanatory variables, whether a sample is setosa or versicolor.

About logistic regression

Let me explain logistic regression.

Logistic regression is called regression, but what we do is binary classification.

It makes predictions by computing a probability from some explanatory variables. In this example, we will identify whether a flower is setosa or versicolor based on its sepal and petal measurements.

Now, a short digression: let's consider what a machine learning model looks like when everything is approximated linearly.

Now, let's assume that the objective variable is "the number of visitors to the pool" and the explanatory variable is "temperature", and perform linear regression.

Number of visitors = β_0 + β_1 × Temperature

In this case, the number of visitors on the left is called the response variable, and the formula on the right is called the linear predictor.

But this is obviously a bad model, isn't it? Depending on the temperature, the right-hand side can become negative, while the number of visitors can never be negative.

Therefore, to make the response variable correspond properly to the linear predictor, we change the model as follows.

log (number of visitors) = β_0 + β_1 × Temperature

By applying the logarithmic function to the left side in this way, we were able to make both sides correspond correctly. The logarithmic function in this case is called the link function.

Next, let's take passing an entrance examination as the objective variable. In this case it naturally takes only the values 0 or 1, since you either pass or fail.

Let's set the explanatory variable to study time. The following models are possible.

Pass / Fail (1, 0) = β_0 + β_1 × Study time

This model is not a good one either. The left side takes only the values 0 or 1, but the right side obviously does not.

Then, what if the objective variable is the pass rate as follows?

Pass rate = β_0 + β_1 × Study time

This model is not a good model either. The pass rate on the left side ranges from 0 to 1, but not on the right side.

A logit function is used as a link function to associate this response variable with a linear predictor. The logit function is expressed by the following formula.

f(p) = ln(\frac{p}{1-p}) \quad (0 < p < 1)

Please refer to the article here for the logit function.

For an event A that occurs with probability p, the ratio of the probability that A occurs to the probability that it does not occur, $\frac{p}{1-p}$, is called the odds.

The logit function has the following form.

(Figure: graph of the logit function)
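As a quick numerical check (a small sketch I am adding here, not from the original article), you can see how the logit maps probabilities in (0, 1) to the whole real line:

import numpy as np

def logit(p):
    """Log of the odds p / (1 - p)."""
    return np.log(p / (1 - p))

# Probabilities near 0 map to large negative values, 0.5 maps to 0,
# and probabilities near 1 map to large positive values.
for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    print(p, logit(p))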

If you use the logit function as the link function, you can create the following model with the pass rate as p.

ln(\frac{p}{1-p}) = β_0 + β_1 × Study time

This model is a good model because the left side and the right side correspond.

However, what we want is a model that predicts the pass rate itself, so let's solve for p. Let the study time be $x_1$.

\begin{align}
ln(\frac{p}{1-p}) &= β_0 + β_1 × x_1\\
ln(\frac{p}{1-p}) &= z \quad  (z = β_0 + β_1 × x_1)\\
\frac{p}{1-p}&=e^z\\
p& = e^z - e^zp\\
p &= \frac{e^z}{1+e^z}\\
p & = \frac{1}{1+e^{-z}}
\end{align}

In this way we arrive at the sigmoid function. Generalizing z a little further to

z= β_0 + β_1 × x_1  + β_2 × x_2 ... + β_n × x_n

gives the formula used in logistic regression, and the goal is to optimize the coefficients β.
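To make this concrete, here is a minimal sketch (my own illustration with made-up coefficients, not from the original article) of how a probability is computed from the explanatory variables:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients beta_0, beta_1, beta_2 and one sample x_1, x_2
beta_0 = -1.0
beta = np.array([0.5, 2.0])
x = np.array([1.2, 0.8])

z = beta_0 + beta @ x
p = sigmoid(z)  # probability that the objective variable is 1
print(p)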

About β optimization

Considering the following equation, β can be optimized by the method of least squares as in multiple regression analysis.

ln(\frac{p}{1-p}) = β_0 + β_1 × x_1  + β_2 × x_2 ... + β_n × x_n

Please refer to the article here for the method of least squares.

The other approach uses the maximum likelihood method.

About maximum likelihood method

The maximum likelihood method is summarized in the article here in an easy-to-understand manner, so please refer to it.

I will briefly explain the maximum likelihood method.

In order to think about the maximum likelihood method, we need to understand the likelihood and the likelihood function.

The following is a quote from Kotobank.

Likelihood: the value obtained by substituting the observed values for the random variables in the probability density function; in other words, the probability density evaluated at the observed values. When this is regarded as a function of unknown parameters, it is called the likelihood function. The natural logarithm of the likelihood function is called the log-likelihood. Given the observed values and their probability distribution, the parameter value that maximizes the likelihood or log-likelihood gives one natural estimator of the parameter. This is called the maximum likelihood estimator, and it has various asymptotic properties, such as converging to the true value of the population parameter and following an asymptotically normal distribution as the sample size increases.

The following is a quote from wikipedia.

A likelihood function is, in statistics, a function that expresses how plausible it is that the underlying condition was such-and-such, inferred conversely from the observed result that arose under that condition; it is regarded as a function with that condition as the variable. It is also simply called the likelihood.

That may still be unclear, so let's check it with a formula.

When $N$ data points $x_1, x_2, x_3, ..., x_N$ are observed, the likelihood function is given by the following expression, where $p(x)$ is the probability that each value occurs.

p(x_1)p(x_2)\cdots p(x_N)

In short, you can think of the maximum likelihood method as a way of estimating the probability distribution that the original data follow, by considering how probable the observed data would be under each candidate distribution.

For example, consider tossing a coin. Of course, heads and tails have the same probability unless the coin is rigged.

In statistics, this is a Bernoulli trial.

However, in the real world we rarely deal with things like coins whose probability distribution is known intuitively. So this time, let's assume that the probabilities of heads and tails are not necessarily equal.

How can I find the probability distribution that this coin follows?

The natural approach is to toss the coin many times and keep a record. Suppose that when we tossed the coin three times, the result was heads, heads, tails.

If the probability of heads is $p$, the probability of tails is $1-p$, so the probability that this outcome occurs, that is, the likelihood function, is $p^2(1-p)$.

Now, this outcome actually happened. For example, if the probability of heads is $\frac{1}{2}$ and the probability of tails is $\frac{1}{2}$, the probability of this outcome is $(\frac{1}{2})^2 \times \frac{1}{2} = \frac{1}{8}$. If instead the probability of heads is $\frac{2}{3}$ and the probability of tails is $\frac{1}{3}$, the probability of this outcome is $(\frac{2}{3})^2 \times \frac{1}{3} = \frac{4}{27}$.

Thinking along these lines, the maximum likelihood method asks: given the observation "the coin was tossed three times and came up heads, heads, tails", and writing $p$ for the probability of heads and $1-p$ for the probability of tails, which value of $p$ is the most plausible?

Since the event "three tosses came up heads, heads, tails" actually occurred, the most plausible $p$ is the one that maximizes the probability of that event occurring.

In other words, to find the most plausible $p$ (maximum likelihood estimation), it is enough to choose the $p$ for which the likelihood function, the "probability that the observed event occurs", is largest.

In this example, then, we just need to find the $p$ that maximizes $p^2(1-p)$; we can either take the logarithm and then differentiate, or differentiate it directly, and find the $p$ that gives the maximum.
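As a quick check of this example (the calculation below is mine, not in the original text), differentiating the likelihood directly and setting the derivative to zero gives

\frac{d}{dp}p^2(1-p) = 2p - 3p^2 = p(2 - 3p) = 0 \quad \Rightarrow \quad p = \frac{2}{3}

so the maximum likelihood estimate is $p = \frac{2}{3}$, which matches the intuition that two of the three tosses came up heads.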

This way of working back from observed results to the probability distribution that the original data follow is called maximum likelihood estimation. (Bayesian estimation is a related but distinct approach to the same problem.)

In this coin example we knew from the start that the outcomes follow a Bernoulli distribution, so the likelihood function was easy to write down; in practice it is harder, because you first have to work out which probability distribution (normal distribution, etc.) the original data follow.

The slides on here were very easy to understand, so please refer to them.

About β optimization again

Let's consider the likelihood function for logistic regression. In logistic regression, when the objective variable takes the values (0, 1) and a data point has explanatory variables $x_1, x_2, x_3, ..., x_n$, the probability $p$ that the objective variable of that data point is 1 is

p  = \frac{1}{1+e^{-z}} \quad (z= β_0 + β_1 × x_1  + β_2 × x_2 ... + β_n × x_n)

The probability that the objective variable is 1 is $p$, and the probability that it is 0 is $1-p$.

Now consider the case where we already know the value (0 or 1) of the objective variable for a given data point, that is, we know which class it belongs to (in this iris example, whether it is setosa or versicolor).

Writing $L$ for the probability of observing that data point's label, and $t_n$ for its objective variable (0 or 1), $L$ can be expressed using the above $p$ (the probability that the objective variable is 1) as

L = p^{t_n}(1-p)^{1-t_n}

You can see that when the objective variable is 1, $L$ equals $p$, and when the objective variable is 0, $L$ equals $1-p$.

In other words, $L$ is the "probability that the observed event occurs" once the objective variable, that is, the result, is known. Under the condition that the event did occur, the $p$ that maximizes $L$ can be regarded as the most plausible $p$.

This example was for one piece of data, so now let's think about all the data.

Since the data points are independent of each other, the likelihood function for the whole dataset is the product of the likelihood functions of the $N$ data points, as follows.

L(β)= \prod_{n = 1}^{N}
p_n^{t_n}(1-p_n)^{1-t_n}

By finding the $p$ that maximizes this likelihood function $L(β)$ over the $N$ data points, that is, the corresponding $β$ (since $p$ is a function of $β$), we can optimize the parameters $β$.

To make the above formula easier to calculate, take the logarithm of both sides and add a minus to get the following formula.

E(β) = -\log L(β) = - \sum^N_{n = 1}\{t_n \log p_n + (1-t_n) \log(1-p_n)\}

The $ E (β) $ derived in this way is called the cross entropy error function.
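As a rough illustration (a minimal sketch I am adding here, not part of the original article), the cross entropy error can be computed directly with NumPy:

import numpy as np

def cross_entropy_error(t, p, eps=1e-10):
    """Negative log-likelihood for binary targets t (0/1) and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

# Hypothetical targets and predicted probabilities
t = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.8, 0.6])
print(cross_entropy_error(t, p))

The better the predicted probabilities match the targets, the smaller this value becomes.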

Implemented with scikit-learn

Let's implement it with scikit-learn. First, let's prepare the dataset.

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt

iris = load_iris()
iris_data = iris.data[:100]
iris_target_data = pd.DataFrame(iris.target[:100], columns=['Species'])
iris_df = pd.DataFrame(iris_data, columns=iris.feature_names)

print(iris_df)

    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.1               3.5                1.4               0.2
1                 4.9               3.0                1.4               0.2
2                 4.7               3.2                1.3               0.2
3                 4.6               3.1                1.5               0.2
4                 5.0               3.6                1.4               0.2
..                ...               ...                ...               ...
95                5.7               3.0                4.2               1.2
96                5.7               2.9                4.2               1.3
97                6.2               2.9                4.3               1.3
98                5.1               2.5                3.0               1.1
99                5.7               2.8                4.1               1.3

We will classify the two data, setosa and versicolor. Therefore, we only handle the first 100 pieces of data.

The first 50 rows are data for setosa, and the latter 50 are data for versicolor.

Let's take a look at the contents of iris_target_data.

print(iris_target_data)

    Species
0         0
1         0
2         0
3         0
4         0
..      ...
95        1
96        1
97        1
98        1
99        1

As you can see, it is a DataFrame containing 0 and 1 labels.

The reason for converting what was originally a numpy array into a DataFrame is to use the DataFrame's apply method, as follows.

def species(num):
    if num == 0:
        return 'setosa'
    else:
        return 'versicolor'


iris_target_speacies = iris_target_data['Species'].apply(species)
print(iris_target_speacies)

0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...
95    versicolor
96    versicolor
97    versicolor
98    versicolor
99    versicolor

By using the apply method in this way, the function passed as an argument is applied to the specified column, and you get back the return values as a Series. Specifying axis=1 in pd.concat below combines objects horizontally.

The following code combines iris_df and iris_target_speacies.

iris_all_data = pd.concat([iris_df, iris_target_speacies], axis=1)
print(iris_all_data)

    sepal length (cm)  sepal width (cm)  ...  petal width (cm)     Species
0                 5.1               3.5  ...               0.2      setosa
1                 4.9               3.0  ...               0.2      setosa
2                 4.7               3.2  ...               0.2      setosa
3                 4.6               3.1  ...               0.2      setosa
4                 5.0               3.6  ...               0.2      setosa
..                ...               ...  ...               ...         ...
95                5.7               3.0  ...               1.2  versicolor
96                5.7               2.9  ...               1.3  versicolor
97                6.2               2.9  ...               1.3  versicolor
98                5.1               2.5  ...               1.1  versicolor
99                5.7               2.8  ...               1.3  versicolor

Let's visualize the data with the following code. It draws pairwise scatter plots of the features, with the distribution of each feature on the diagonal, colored by species.

sns.pairplot(iris_all_data, hue='Species')
plt.show()

(Figure: seaborn pairplot of the four features, colored by species)

It looks like this data can be classified cleanly.

Let's also look at the frequency of each sepal length value with the following code.

sns.countplot(x='sepal length (cm)', data=iris_all_data, hue='Species')
plt.show()

(Figure: count plot of sepal length by species)

I think you now have a rough feel for the data.

Training on all the data

Let's create a model with the following code. Here we train on all the data as is, without splitting it into training and test sets.

logistic_model = LogisticRegression()
logistic_model.fit(iris_df, iris_target_data)

This completes the model creation.
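As a side note (this is my addition, not in the original article): depending on your scikit-learn version, passing a one-column DataFrame as the target may raise a DataConversionWarning asking for a 1-d array. If that bothers you, an equivalent fit with the target flattened looks like this:

# Equivalent fit with the target passed as a 1-d array
logistic_model = LogisticRegression()
logistic_model.fit(iris_df, iris_target_data['Species'].values)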

The following code outputs the values of the coefficients

β_1,β_2,β_3,β_4

print(logistic_model.coef_[0])

[-0.40247392 -1.46382925 2.23785648 1.00009294]

Since this is hard to interpret as is, the following code presents it more clearly.

coeff_df = DataFrame([iris_df.columns, logistic_model.coef_[0]]).T
print(coeff_df)

                   0         1
0  sepal length (cm) -0.402474
1   sepal width (cm)  -1.46383
2  petal length (cm)   2.23786
3   petal width (cm)   1.00009

From this result, it can be seen that when the petal length and petal width are large, the probability of being versicolor is high, and when the sepal width and sepal length are large, the probability of being setosa is high.

You can see that this result is in good agreement with the graph.

Let's check the prediction accuracy with the following code, testing the trained model on the same data it was trained on.

print(logistic_model.score(iris_df, iris_target_data))

1.0

It can be classified with 100% accuracy. It's amazing.

You can also check the accuracy by the following methods.

class_predict = logistic_model.predict(iris_df)
print(class_predict)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

By using the predict method, you can get the predicted labels for the given data as a numpy array.

By comparing this class_predict with the correct labels using metrics.accuracy_score, the prediction accuracy can be calculated.

print(metrics.accuracy_score(iris_target_data, class_predict))

1.0

Of course, this method also gives 100%.
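If you also want the predicted probabilities themselves rather than just the class labels, scikit-learn provides predict_proba (a small extra example I am adding here):

# Each row gives the probability of class 0 (setosa) and class 1 (versicolor)
proba = logistic_model.predict_proba(iris_df)
print(logistic_model.classes_)  # column order of predict_proba
print(proba[:5])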

Training with the data split into training and test sets

Next, let's split the data into training and test sets and train only on the training data.

The code looks like this:

X_train, X_test, Y_train, Y_test = train_test_split(iris_df, iris_target_data)
logistic_model2 = LogisticRegression()
logistic_model2.fit(X_train, Y_train)
coeff_df2 = DataFrame([iris_df.columns, logistic_model2.coef_[0]]).T
print(coeff_df2)
print(logistic_model2.score(X_test, Y_test))

                   0         1
0  sepal length (cm) -0.381626
1   sepal width (cm)  -1.33919
2  petal length (cm)   2.12026
3   petal width (cm)  0.954906
1.0

Even if the data for testing and training were separated, the prediction accuracy was 100%. It's amazing.
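Keep in mind that train_test_split shuffles the data randomly, so the exact coefficients and score can change from run to run. If you want a reproducible result, you can fix the seed (a small addition of mine, not in the original):

# Fixing random_state makes the split, and hence the result, reproducible
X_train, X_test, Y_train, Y_test = train_test_split(
    iris_df, iris_target_data, test_size=0.25, random_state=0)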

Implemented with tensorflow

To implement it with tensorflow, you need to understand the underlying formulas. Let's review them.

p  = \frac{1}{1+e^{-z}} \quad (z= β_0 + β_1 × x_1  + β_2 × x_2 ... + β_n × x_n)

The cross entropy error is used as the loss function. It is the negative logarithm of the likelihood function.

E(β) = -\log L(β) = - \sum^N_{n = 1}\{t_n \log p_n + (1-t_n) \log(1-p_n)\}

Let's implement it. Let's prepare the data with the following code.

from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import tensorflow as tf

iris = load_iris()
iris_data = iris.data[:100]
iris_target_data = pd.DataFrame(iris.target[:100], columns=['Species'])
iris_df = pd.DataFrame(iris_data, columns=iris.feature_names)
X_train, X_test, Y_train, Y_test = train_test_split(iris_df, iris_target_data)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(75, 4)
(25, 4)
(75, 1)
(25, 1)

Let's create a model with the following code.

X = tf.placeholder(dtype=tf.float32, shape=[None, 4])
W = tf.Variable(tf.zeros([4, 1]))
b = tf.Variable(tf.zeros([1, 1]))
y = tf.nn.sigmoid(tf.matmul(X, W) + b)
y = tf.clip_by_value(y, 1e-10, 1.0 - 1e-10)
t = tf.placeholder(dtype=tf.float32, shape=[None, 1])

The explanatory variables are stored in X, and the objective variable is stored in t. y corresponds to p in the formula above.

Because the cross entropy error takes the logarithm of y, y = 0 would cause an error, so y = tf.clip_by_value(y, 1e-10, 1.0 - 1e-10) clips y to a minimum of 1e-10 and a maximum of 1.0 - 1e-10.

Let's minimize the cross entropy error function with Adam using the following code.

cross_entropy = -tf.reduce_sum(t * tf.log(y) + (1 - t) * tf.log(1 - y))  # negative log-likelihood (note the minus sign)
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)

Now that the model is complete, let's run it with the code below.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):
        if step % 10 == 0:
            loss_val = sess.run(cross_entropy, feed_dict={X: X_train, t: Y_train})
            W_val = sess.run(W)
            b_val = sess.run(b)
            print('Step: {},   Loss: {} W_val: {} b_val: {}'.format(step, loss_val, W_val, b_val))

        sess.run(train_step, feed_dict={X: X_train, t: Y_train})

    correct_prediction = sess.run(tf.cast(tf.greater(y, 0.5), tf.int32), feed_dict={X: X_train, t: Y_train})
    print('Evaluation of training data:', metrics.accuracy_score(Y_train, correct_prediction))

    test_data = tf.placeholder(dtype=tf.float32, shape=[None, 4])
    test_prediction = tf.nn.sigmoid(tf.matmul(test_data, W_val) + b_val)
    test_correct_prediction = sess.run(tf.cast(tf.greater(test_prediction, 0.5), tf.int32), feed_dict={test_data: X_test})

    print('Evaluation of test data:', metrics.accuracy_score(Y_test, test_correct_prediction))

Step: 9990, Loss: 0.10178631544113159 W_val: [[ 0.25520977] [-3.2158086 ] [ 3.3751483 ] [ 3.4625392 ]] b_val: [[-3.2248065]]
Evaluation of training data: 1.0
Evaluation of test data: 1.0

tf.greater(y, 0.5) returns a Boolean tensor that is True where y is greater than 0.5 and False otherwise.

Passing it to tf.cast converts True to 1 and False to 0.

This is the end of implementation by tensorflow.
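One caveat I should add (not in the original article): the code above uses the TensorFlow 1.x API (tf.placeholder, tf.Session, tf.train.AdamOptimizer). If you are on TensorFlow 2.x, one way to run it as-is is through the compatibility module:

# Run TF1-style code on TensorFlow 2.x via the compatibility layer
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()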

At the end

Thank you for staying with me this far.

Thank you for your hard work.
