When I tried to write about logistic regression, I ended up finding the mean and variance of the logistic distribution.

I often use logistic regression at work, but whenever a small question comes up, it is surprisingly hard to find the information I want. So I'd like to summarize logistic regression as a memo for myself.

Logistic regression seems to be used especially often in the medical field, but of course it is widely used elsewhere as well, thanks to its high interpretability, simple model, and good accuracy.

Now consider the problem of predicting which of the two classes $C_0$ and $C_1$ the input vector $x$ is assigned to.

Let $y \in \{0,1\}$ be the objective variable (output) and $x \in \mathbb{R}^d$ the explanatory variable (input). Here, $y = 0$ when $x$ is assigned to $C_0$, and $y = 1$ when it is assigned to $C_1$.


## Linear discrimination

I have prepared 20 artificially generated data points (I just made them up). This time I will use Python.

```python
import matplotlib.pyplot as plt

# 20 artificially generated data points
x = [1.3, 2.5, 3.1, 4, 5.8, 6, 7.5, 8.4, 9.9, 10, 11.1, 12.2, 13.8, 14.4, 15.6, 16, 17.7, 18.1, 19.5, 20]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

log_scatter.png

I think the simplest approach to the classification problem is linear discrimination.

In the linear model, the output $y$ is linear in the input $x$ ($y(x) = \beta_0 + \beta_1 x$), so $y$ is a real number. One more step is needed to adapt this to the classification problem.

For example, we can transform the linear function with a nonlinear function $f(\cdot)$:

y(x) = f(\beta_0 + \beta_1 x)

In this case, for instance, the following activation function can be considered:

f(z) = \left\{
\begin{array}{ll}
1 & (z \geq 0.5) \\
0 & (z \lt 0.5)
\end{array}
\right.
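As a minimal sketch in Python (my addition; the function name `f` simply mirrors the formula above):

```python
import numpy as np

def f(z):
    # Step activation: 1 if z >= 0.5, otherwise 0
    return np.where(z >= 0.5, 1, 0)
```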

Now let's make predictions using this activation function. First, we train a linear model, estimating its parameters by the least squares method.

As an aside, this book is highly recommended as a starting point for machine learning. It begins with the minimum mathematics required to start studying the subject; I don't know of many books of this type, and it's quite good. The code is also very easy to read. In practice you would use a library, but for study purposes I think writing everything from scratch like this is best.

The original code is published on the Support Page.

```python
import matplotlib.pyplot as plt
import numpy as np


def reg(x, y):
    # Least squares estimates of slope a and intercept b
    n = len(x)
    a = ((np.dot(x, y) - y.sum() * x.sum() / n) /
         ((x**2).sum() - x.sum()**2 / n))
    b = (y.sum() - a * x.sum()) / n
    return a, b


x = np.array(x)
y = np.array(y)
a, b = reg(x, y)

print('y =', b, '+', a, 'x')

plt.scatter(x, y)
xmax = x.max()
plt.plot([0, xmax], [b, a * xmax + b])         # fitted regression line
plt.axhline(0.5, ls="--", color="r")           # classification threshold
plt.axhline(0, linewidth=1, ls="--", color="black")
plt.axhline(1, linewidth=1, ls="--", color="black")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

The estimated model is

\hat{y} = -0.206 + 0.07x

log_reg.png

Next, applying the activation function defined earlier, the decision boundary appears to be around $x = 10$.
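To make that concrete, here is a quick sketch (my addition) that solves $\hat{a}x + \hat{b} = 0.5$ for $x$ using the estimates printed above:

```python
# Decision boundary: where the fitted line crosses the 0.5 threshold
a_hat, b_hat = 0.07, -0.206
print((0.5 - b_hat) / a_hat)   # about 10.1
```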

Now, there are some problems with this method. The least squares method is equivalent to maximum likelihood estimation when the conditional probability distribution is assumed to be normal.

On the other hand, a binary objective variable like ours is clearly far from normally distributed, which causes various problems. For details, see "Pattern Recognition and Machine Learning" (Bishop), but the main ones are:

- The approximation accuracy of the class posterior probabilities is poor.
- The flexibility of the linear model is low. (Because of these two, the predicted "probability" can fall outside $[0,1]$.)
- Predictions that are "too correct" are penalized.

Therefore, we adopt an appropriate probability model and consider a classification algorithm with better properties than least squares.

## Logistic distribution

Before we get into logistic regression, let's take a good look at the logistic distribution. Given the input vector $x$, the conditional probability of class $C_1$ is

\begin{eqnarray}
P(y=1|x)&=&\frac{P(x|y=1)P(y=1)}{P(x|y=1)P(y=1)+P(x|y=0)P(y=0)}\\
\\
&=&\frac{1}{1+\frac{P(x|y=0)P(y=0)}{P(x|y=1)P(y=1)}}\\
\\
&=&\frac{1}{1+e^{-\log\frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0)}}}\\
\end{eqnarray}

If we set $a = \log\frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0)}$, then

P(y=1|x)=\frac{1}{1+e^{-a}}

This function, denoted $\sigma(a)$, is the distribution function of the logistic distribution (the logistic sigmoid).

The form of the distribution function is as follows.

```python
import numpy as np
from matplotlib import pyplot as plt

a = np.arange(-8., 8., 0.001)
y = 1 / (1 + np.exp(-a))   # logistic sigmoid

plt.plot(a, y)
plt.axhline(0, linewidth=1, ls="--", color="black")
plt.axhline(1, linewidth=1, ls="--", color="black")
plt.xlabel("a")
plt.ylabel("σ(a)")
plt.show()
```

logit_P.png

You can see that the range is within $ (0,1) $.



## Mean and variance of the logistic distribution

As mentioned above, the distribution function of the logistic distribution is
\sigma(x)=\frac{1}{1+e^{-x}}

The probability density function $f(x)$ is obtained by differentiating $\sigma(x)$:

\begin{eqnarray}
f(x)&=&\frac{d}{dx}\frac{1}{1+e^{-x}}\\
\\
&=&\frac{e^{-x}}{(1+e^{-x})^2}
\end{eqnarray}
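Incidentally (a standard identity, my addition, easy to verify from the definitions above), this density can be written compactly in terms of $\sigma$ itself:

f(x) = \sigma(x)\{1-\sigma(x)\}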

The shape of the probability density function is as follows.

```python
import numpy as np
from matplotlib import pyplot as plt

x = np.arange(-8., 8., 0.001)
y = np.exp(-x) / ((1 + np.exp(-x))**2)   # logistic density f(x)

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```

logit_f.png

Given a distribution, one naturally wants to know its mean and variance, so let's compute them right away. The moment generating function $M(t)$ is

M(t) = \int_{-\infty}^{\infty}e^{tx}\frac{e^{-x}}{(1+e^{-x})^2}dx

By substituting $y = \frac{1}{1+e^{-x}}$ (so that $dy = f(x)dx$ and $x = -\log(\frac{1}{y}-1)$), we get

\begin{eqnarray}
M(t) &=& \int_{0}^{1}e^{-t\log(\frac{1}{y}-1)}dy\\
\\
&=& \int_{0}^{1}(\frac{1}{y}-1)^{-t}dy\\
\\
&=& \int_{0}^{1}(\frac{1-y}{y})^{-t}dy\\
\\
&=& \int_{0}^{1}(\frac{1}{y})^{-t}(1-y)^{-t}dy\\
\\
&=& \int_{0}^{1}y^t(1-y)^{-t}dy\\
\\
&=& \int_{0}^{1}y^{(t+1)-1}(1-y)^{(-t+1)-1}dy\\
\\
&=& Beta(t+1,1-t)\\
\\
&=& \frac{\Gamma(t+1)\Gamma(1-t)}{\Gamma((t+1)+(1-t))}\\
\\
&=& \frac{\Gamma(t+1)\Gamma(1-t)}{\Gamma(2)}=\Gamma(t+1)\Gamma(1-t)
\end{eqnarray}

(Whew, that was rough...!)

(Note that the Beta-function integral converges only for $|t| < 1$, but that is enough, since we only need derivatives at $t = 0$.)

Furthermore, assuming we may interchange differentiation and integration (we swap the order without proof), the first derivative of this moment generating function is

\begin{eqnarray}
\frac{dM(t)}{dt}=\Gamma'(t+1)\Gamma(1-t)-\Gamma(t+1)\Gamma'(1-t)
\end{eqnarray}

And if $ t = 0 $,

\begin{eqnarray}
M'(0)=\Gamma'(1)\Gamma(1)-\Gamma(1)\Gamma'(1)=0
\end{eqnarray}

That is, $E[X] = M'(0) = 0$. Next, we find $E[X^2]$.

\begin{eqnarray}
\frac{d^2M(t)}{dt^2}&=&\Gamma''(t+1)\Gamma(1-t)-\Gamma'(t+1)\Gamma'(1-t)-\Gamma'(t+1)\Gamma'(1-t)+\Gamma(t+1)\Gamma''(1-t)\\
\\
&=& \Gamma''(t+1)\Gamma(1-t)-2\Gamma'(t+1)\Gamma'(1-t)+\Gamma(t+1)\Gamma''(1-t)
\end{eqnarray}

If $ t = 0 $,

\begin{eqnarray}
M''(0)&=&\Gamma''(1)-2\Gamma'(1)^2+\Gamma''(1)\\
\\
&=& 2\Gamma''(1)-2\Gamma'(1)^2
\end{eqnarray}

Now define $\psi(x) = \frac{d}{dx}\log\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}$ and differentiate it. Then

\begin{eqnarray}
\frac{d}{dx}\psi(x)=\frac{\Gamma''(x)\Gamma(x)-\Gamma'(x)^2}{\Gamma(x)^2}
\end{eqnarray}

That is, $\psi'(1) = \Gamma''(1) - \Gamma'(1)^2$ (using $\Gamma(1) = 1$). By the way, it [seems](https://ja.wikipedia.org/wiki/%E3%83%9D%E3%83%AA%E3%82%AC%E3%83%B3%E3%83%9E%E9%96%A2%E6%95%B0)[^1] that $\psi'(1) = \zeta(2)$, so it evaluates to $\psi'(1) = \frac{\pi^2}{6}$.

Therefore $M''(0) = 2 \times \frac{\pi^2}{6}$, and we obtain $E[X^2] = M''(0) = \frac{\pi^2}{3}$. From this,

\begin{eqnarray}
V[X]&=&E[X^2]-E[X]^2\\
\\
&=& \frac{\pi^2}{3} - 0\\
\\
&=& \frac{\pi^2}{3}
\end{eqnarray}

It turns out that the expected value of the logistic distribution is $0$ and its variance is $\frac{\pi^2}{3}$.
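As a quick numerical sanity check (my addition, not in the original derivation), we can sample from NumPy's standard logistic distribution and compare:

```python
import numpy as np

# Draw samples from the standard logistic distribution
samples = np.random.logistic(loc=0, scale=1, size=1_000_000)

print(samples.mean())                # should be close to 0
print(samples.var(), np.pi**2 / 3)   # both close to 3.2899...
```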

By the way, the derivatives of the logarithm of the gamma function are apparently called polygamma functions; in particular, the first derivative is called the digamma function.

(That was tough, and when the $\zeta$ function suddenly appeared I got lost, so I can't really claim I computed it all myself...)



## Logistic regression

Now consider $p = \sigma(\beta x)$, where the argument of the logistic function is a linear combination. Solving this for $\beta x$:

\begin{eqnarray}
p &=& \frac{1}{1+e^{-\beta x}}\\
\\
(1+e^{-\beta x})p &=& 1\\
\\
p+e^{-\beta x}p &=& 1\\
\\
e^{-\beta x} &=& \frac{1-p}{p}\\
\\
-\beta x &=& \log\frac{1-p}{p}\\
\\
\beta x &=& \log\frac{p}{1-p}\\
\\
\end{eqnarray}

(The equals signs are aligned, but it's still a little hard to read...)

In statistics, the right-hand side is called the log odds (logit).

Put the other way around, this means that if we model the log odds with a linear regression and solve for $p$, we obtain an estimate of the probability of assignment to each class.

By the way, for $p \in (0,1)$ the odds satisfy $\frac{p}{1-p} \in (0, \infty)$ and the log odds satisfy $\log\frac{p}{1-p} \in (-\infty, \infty)$, so the range of the log odds matches the range of a linear function.
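Here is a tiny round-trip check (an illustrative sketch, my addition): applying the sigmoid and then taking the log odds recovers the original linear value:

```python
import numpy as np

bx = 1.5                       # some value of the linear term beta * x
p = 1 / (1 + np.exp(-bx))      # sigmoid: maps into (0, 1)
print(np.log(p / (1 - p)))     # log odds: recovers 1.5
```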



## Logistic regression parameter estimation

I'm starting to lose track of things, so let me sort out the letters and symbols. The data set is $D = \{X, Y\}$, with
Y = \left(
\begin{array}{c}
y_1\\
\vdots\\
y_n
\end{array}
\right),\quad y_i \in \{ 0,1 \},(i=1,...n)

Regarding $X$, I want the constant term included among the parameters, and since changing notation is a pain, I write

X = \left(
\begin{array}{cccc}
1 & x_{11} & \cdots & x_{1d}\\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nd}
\end{array}
\right)

Also, for $i = 1, ..., n$, let $x_i = (1, x_{i1}, ..., x_{id})^T$ (that is, $x_i$ is the transposed $i$-th row of $X$).

The likelihood function for the parameter vector $\beta = (\beta_0, \beta_1, ..., \beta_d)$ is

L(\beta) = P(Y | \beta)= \prod_{i=1}^{n} \sigma(\beta x_i)^{y_i}\{1-\sigma(\beta x_i)\}^{1-y_i}

and the negative log-likelihood (cross-entropy error) is

E(\beta)=-\log L(\beta)= -\sum_{i=1}^{n}\{y_i\log \sigma(\beta x_i)+(1-y_i)\log(1-\sigma(\beta x_i))\}

We find the parameter $\beta$ by solving the problem of minimizing this.
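As a minimal sketch of this error function (the function names are mine, not from the post), it can be written directly in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neg_log_likelihood(beta, X, y):
    # E(beta) = -sum_i [ y_i log sigma(beta x_i) + (1 - y_i) log(1 - sigma(beta x_i)) ]
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```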

However, due to the nonlinearity of $\sigma(\cdot)$, the maximum likelihood solution cannot be derived analytically.

Fortunately, since $E$ is a convex function, it has a unique minimum. We find this minimum by Newton's method.


## Newton's method

Newton's method is also called the Newton-Raphson method. Google will tell you plenty about it even without my notes, so I'll leave the explanation to other sources. In short, it is a numerical technique for finding the roots of an equation. Among the books I own, there are explanations in:

- The Essence of Machine Learning (Kato), p. 247
- Pattern Recognition and Machine Learning (Bishop), p. 207
- Basics of Statistical Learning (Hastie), p. 140
- Solvable Galois Theory (Fujita), p. 74
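For concreteness, here is a minimal sketch (my own illustration under the notation above, not code from these books) of a Newton-Raphson update for the loss $E(\beta)$, using the gradient $X^T(p - y)$ and Hessian $X^T W X$ with $W = \mathrm{diag}(p_i(1 - p_i))$:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, n_iter=20):
    # Newton-Raphson: beta <- beta - H^{-1} grad
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                       # predicted probabilities
        grad = X.T @ (p - y)                        # gradient of E(beta)
        H = X.T @ (X * (p * (1 - p))[:, None])      # Hessian X^T W X
        beta -= np.linalg.solve(H, grad)            # Newton step
    return beta

# Usage with the toy data from the beginning (prepend a column of ones):
# X = np.column_stack([np.ones(len(x)), x])
# beta = fit_logistic(X, np.array(y))
```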



## I'm exhausted

I didn't think it would be so hard just to calculate the mean and variance of the logistic distribution. I'm exhausted.

## ★ References ★

[1] Kato: The Essence of Machine Learning (2018)
[2] Hastie, Tibshirani, Friedman: Basics of Statistical Learning (2014)
[3] Bishop: Pattern Recognition and Machine Learning (2006)
[4] Fujita: Solvable Galois Theory (2013)

[^1]: If you search, you can find various sources, but many are direct PDF links to lecture notes and the like, so I've linked to Wikipedia instead.
