In this article, I will introduce the basic idea of logistic regression along with practical Python code. I intend to summarize what I studied as a student in an easy-to-understand way, so I hope it will be helpful to everyone from beginners to people who write code regularly.
Logistic regression is a method commonly used for binary classification. To explain the idea, let's use the famous Titanic data as an example.
Suppose we have $n$ explanatory variables $x_1, x_2, \dots, x_n$ (the passenger attributes in the Titanic example), and let $x_0 = 1$ stand for the intercept term. Define the vectors
X =
\begin{bmatrix}
x_0 \\
x_1 \\
x_2 \\
\vdots \\
x_n \\
\end{bmatrix}
W =
\begin{bmatrix}
β_0 \\
β_1 \\
β_2 \\
\vdots \\
β_n \\
\end{bmatrix}
Then let
Z=β_0+β_1x_1+β_2x_2+...+β_nx_n …① \\
=W^{T}X
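To make the vector notation concrete, here is a minimal NumPy sketch with made-up numbers, computing $Z$ as the dot product $W^{T}X$:

```python
import numpy as np

# Hypothetical numbers purely for illustration: x_0 = 1 (intercept) plus three features
X = np.array([1.0, 3.0, 22.0, 7.25])
W = np.array([0.5, -0.8, -0.03, 0.01])

Z = W @ X  # W^T X = beta_0*x_0 + beta_1*x_1 + ... + beta_n*x_n
print(Z)
```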
In this case, logistic regression expresses the probability p as follows.
p=\frac{1}{1+e^{-(β_0+β_1x_1+β_2x_2+...+β_nx_n)}}
=\frac{1}{1+e^{-Z}} …②
The basic flow of logistic regression is then to classify each observation according to whether the obtained probability p exceeds a threshold (for example, 0.5).
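As a tiny illustration (the value of $Z$ and the 0.5 threshold are just example choices), this is what ② plus the threshold rule looks like in code:

```python
import numpy as np

def sigmoid(z):
    """Equation ②: p = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = 0.8                # some linear combination beta_0 + beta_1*x_1 + ...
p = sigmoid(z)         # probability between 0 and 1
label = int(p >= 0.5)  # classify by whether p exceeds the threshold
print(p, label)
```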
Equation ② can be rewritten equivalently in the following form.
\log\frac{p}{1-p}=β_0+β_1x_1+β_2x_2+...+β_nx_n
(Reference) The equivalent transformation (skip this if it is already clear to you): multiplying the numerator and denominator of the right-hand side of ② by $e^Z$ gives
p=\frac{e^Z}{1+e^Z}\\
⇔p(1+e^Z)=e^Z\\
⇔e^Z(1-p)=p\\
⇔e^Z=\frac{p}{1-p}\\
⇔\log\frac{p}{1-p}=Z=β_0+β_1x_1+β_2x_2+...+β_nx_n …③
The left-hand side of ③, that is,
\log\frac{p}{1-p}
is called the "log odds" (or "logit"). In other words, logistic regression is the computation of finding, for given factors
x_1,x_2,x_3,...,x_n
the optimal parameters
β_0,β_1,β_2,...,β_n
such that the objective variable p satisfies equation ③.
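As a quick numerical sanity check of ③ (the value 0.8 is made up): if $p = 0.8$, the odds are $0.8/0.2 = 4$ and the log odds are $\log 4 ≈ 1.386$; applying the sigmoid ② to that value recovers $p$:

```python
import numpy as np

p = 0.8
log_odds = np.log(p / (1 - p))        # logit, the left side of ③
p_back = 1 / (1 + np.exp(-log_odds))  # sigmoid, equation ②
print(log_odds, p_back)               # ≈ 1.386, 0.8
```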
By the way, graphing equation ② gives an S-shaped curve. As the graph shows, the input value $Z$ ranges over $-∞ < Z < ∞$, while the output value $p$ stays within $0 < p < 1$.
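The figure itself is not reproduced here, but a minimal matplotlib sketch like the following draws it:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
p = 1 / (1 + np.exp(-z))  # equation ②

plt.plot(z, p)
plt.xlabel("Z")
plt.ylabel("p")
plt.title("Sigmoid (logistic) function")
plt.show()
```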
Unlike multiple regression analysis, which basically assumes that the residuals are normally distributed, logistic regression conveniently does not assume a normal distribution for the errors.
So how do you find the optimal parameters? The calculation in this chapter is carried out automatically when you use LogisticRegression() in Python, but if you rely on the library in real business analysis without knowing how it works, the model becomes a black box, so it is worth understanding the mechanism once.
There are various methods for optimizing the parameters, but here I would like to introduce Newton's method.
Let's look at equation ② again.
p=\frac{1}{1+e^{-(β_0+β_1x_1+β_2x_2+...+β_nx_n)}}
=\frac{1}{1+e^{-Z}} …②
Suppose we have $n$ data points, each consisting of a label $Y_i \in \{0, 1\}$ and a feature vector $x^{(i)}$, and write $p_i$ for the predicted probability of the $i$-th observation. The log-likelihood function is
L(w)=\log\left(\prod_{i=1}^{n}p_i^{Y_i}(1-p_i)^{1-Y_i}\right)\\
⇔L(w)=\sum_{i=1}^{n}Y_i\log{p_i}+\sum_{i=1}^{n}(1-Y_i)\log{(1-p_i)}\\
⇔L(w)=\sum_{i=1}^{n}Y_i\log{\frac{1}{1+e^{-w^Tx^{(i)}}}}+\sum_{i=1}^{n}(1-Y_i)\log{\left(1-\frac{1}{1+e^{-w^Tx^{(i)}}}\right)}
We want to find the $w$ that maximizes this, which is equivalent to minimizing $E(w) = -L(w)$.
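What we minimize, $E(w)$, can be written directly in NumPy as a small sketch (the function name and signature are my own, not from any library):

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """E(w) = -L(w) for labels y in {0, 1} and design matrix X (one row per observation)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted probabilities p_i
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```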
So how can we find the parameter $w$ that minimizes $E(w)$? The idea is to start from some tentative $w^{(t)}$ and repeatedly update it as $w^{(t+1)} ← w^{(t)} - d^{(t)}$ so that it approaches the optimal vector. How, then, do we choose $d^{(t)}$? By a second-order Taylor expansion,
E(w^{(t)}-d^{(t)})∼E(w)-\frac{δE(w)}{δw}d^{(t)}+\frac{1}{2}\frac{δ^2E(w)}{δwδw}{d^{(t)}}^2 …④
Since we only need the $d^{(t)}$ that minimizes this, we differentiate ④ with respect to $d^{(t)}$ and set the result to zero:
-\frac{δE(w)}{δw}+\frac{δ^2E(w)}{δwδw}{d^{(t)}}=0 \\
⇔{d^{(t)}}=\left(\frac{δ^2E(w)}{δwδw}\right)^{-1}\frac{δE(w)}{δw}
w^{(t+1)}←w^{(t)}-\left(\frac{δ^2E(w)}{δwδw}\right)^{-1}\frac{δE(w)}{δw} …⑤
This is Newton's method.
Here, considering a single observation for simplicity (dropping the index $i$),
\frac{δE(w)}{δp}=-\frac{δ}{δp}\left(Y\log{p}+(1-Y)\log{(1-p)}\right)\\
=-\frac{Y}{p}+\frac{1-Y}{1-p} \\
=\frac{p-Y}{p(1-p)}
\frac{δp}{δw}=\frac{δ}{δw}\frac{1}{1+e^{-w^Tx}} \\
=\frac{δ}{δw}(1+e^{-w^Tx})^{-1} \\
=\frac{xe^{-w^Tx}}{(1+e^{-w^Tx})^2} \\
=px(1-p)
Therefore,
\frac{δE(w)}{δw}=\frac{p-Y}{p(1-p)}px(1-p)=x(p-Y) …⑥
Also, the second derivative of $ E (w) $ is
\frac{δ^2E(w)}{δwδw}=\frac{δ}{δw}x\left(\frac{1}{1+e^{-w^Tx}}-Y\right) \\\
=px^2(1-p) …⑦
Substituting ⑥ and ⑦ into ⑤ gives
w^{(t+1)}←w^{(t)}-\frac{1}{px^2(1-p)}x(p-Y) …⑧
Based on the above, the parameters are optimized by repeating the update ⑧ until it converges; a sketch of the algorithm is shown below.
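Here is a minimal NumPy sketch of that algorithm; variable names are my own. It uses the gradient and Hessian summed over all observations, i.e. the matrix forms of ⑥ and ⑦: $X^T(p-Y)$ and $X^T\mathrm{diag}(p(1-p))X$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_newton(X, y, n_iter=10):
    """Newton's method for logistic regression.

    X: (n_samples, n_features) design matrix (include a column of ones for the intercept)
    y: (n_samples,) labels in {0, 1}
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)                   # ⑥ summed over all observations
        hess = X.T @ np.diag(p * (1 - p)) @ X  # ⑦ summed over all observations
        w = w - np.linalg.solve(hess, grad)    # update ⑤
    return w
```

In practice you would also add a convergence check (stop once the update becomes small enough) and usually some regularization, but this conveys the core of the update.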
So far, I have explained Newton's method. Because it uses the second derivative, Newton's method converges quickly, but each iteration is computationally expensive, which is its drawback.
logistic.py
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C=10, solver='newton-cg')
lr.fit(X_train, y_train)
In Python, you can implement logistic regression with the simple code above, using a library called scikit-learn. The code sets two parameters, C and solver. The solver parameter chooses the method for finding the optimal parameters described above; here 'newton-cg', a Newton-type method, is specified. As for C, it controls the regularization: it is the inverse of the regularization strength (smaller values mean stronger regularization), and it is generally the parameter you will tune most often.
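Putting it together, a minimal end-to-end sketch might look like the following (the DataFrame `df` and the Titanic-style column names are hypothetical; substitute your own preprocessed data):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# df is assumed to be a preprocessed pandas DataFrame with numeric columns only
X = df[["Pclass", "Sex", "Age", "Fare"]]  # hypothetical feature columns
y = df["Survived"]                        # binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lr = LogisticRegression(C=10, solver="newton-cg")
lr.fit(X_train, y_train)

print(lr.score(X_test, y_test))      # accuracy on held-out data
print(lr.predict_proba(X_test)[:5])  # predicted probabilities p for the first 5 rows
```

Tuning C (for example with cross-validation) is usually where most of the effort goes.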
So far, I have explained the mathematical idea behind logistic regression and how to write simple code for it in Python, using the Titanic data as an example. In fact, logistic regression is also applied in various fields of social science.
For example, in the accounting field, there is research that analyzes a company's bankruptcy risk with logistic regression, using P&L figures such as the operating profit margin together with the presence or absence of a going-concern (GC) note.