A deep learning course that builds skills you can put to practical use in the field in 3 months
● Classification problem: the problem of assigning a given input (numerical values) to a class.
● Data handled in classification:
Input ... each element is called an explanatory variable or feature. An m-dimensional vector (a scalar if m = 1).
Output ... the objective variable. Takes the value 0 or 1.
[Explanatory variables] $x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m$  [Objective variable] $y \in \{0, 1\}$
● Logistic regression is a supervised machine learning model for solving classification problems (it learns from labeled teacher data).
● A linear combination of the input and the m-dimensional parameters is fed into the sigmoid function.
● The output is the probability that y = 1 (the output range of the sigmoid function makes this interpretation possible).
[Parameter]
w=(w_1,w_2,\dots,w_m)^T \in \mathbb{R}^m
[Linear combination]
\hat y = w^Tx + w_0 = \sum^{m}_{j=1} w_jx_j + w_0
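As a minimal sketch of this linear combination (NumPy assumed; the numbers are made-up example values):

```python
import numpy as np

# Made-up example with m = 3 explanatory variables
x = np.array([1.0, 2.0, 0.5])    # input vector x
w = np.array([0.3, -0.2, 0.8])   # weight parameters w
w0 = 0.1                         # bias (intercept) w_0

# Linear combination: y_hat = w^T x + w_0
y_hat = w @ x + w0
print(y_hat)  # 0.3*1.0 + (-0.2)*2.0 + 0.8*0.5 + 0.1 = 0.4
```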
● Sigmoid function: the input is a real number and the output is a value between 0 and 1, so it can express a probability; it is a monotonically increasing function.
● The shape of the sigmoid function changes with the parameter $a$:
increasing $a$ ⇒ the slope of the curve near $x = 0$ becomes steeper;
making $a$ extremely large ⇒ the curve approaches the unit step function.
[Sigmoid function]
\sigma(x) = \frac{1}{1+\exp(-ax)}
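A small sketch of this sigmoid with the gain parameter $a$ (NumPy assumed); increasing $a$ steepens the slope near $x = 0$, and a very large $a$ approaches the unit step function:

```python
import numpy as np

def sigmoid(x, a=1.0):
    """Sigmoid with gain parameter a: 1 / (1 + exp(-a*x))."""
    return 1.0 / (1.0 + np.exp(-a * x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x, a=1.0))    # gentle S-curve, values strictly between 0 and 1
print(sigmoid(x, a=100.0))  # almost 0 for x < 0 and almost 1 for x > 0: close to a unit step
```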
● A useful property of the sigmoid function: its derivative can be expressed in terms of the sigmoid function itself. This fact simplifies the calculation when differentiating the likelihood function.
\begin{align}
\frac{\partial \sigma(x)}{\partial x} &= \frac{\partial}{\partial x}\Bigl(\frac{1}{1+\exp(-ax)}\Bigr)\\
&=(-1)\cdot\{1+\exp(-ax)\}^{-2}\cdot \exp(-ax)\cdot(-a)\\
&=\frac{a\exp(-ax)}{\left\{1+\exp(-ax)\right\}^2}=\frac{a}{1+\exp(-ax)}\cdot\frac{1+\exp(-ax)-1}{1+\exp(-ax)}\\
&=a\sigma(x)(1-\sigma(x))
\end{align}
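The identity above can be checked numerically with a central-difference approximation (a sketch reusing the sigmoid defined earlier; the values of $a$ and $x$ are arbitrary):

```python
import numpy as np

def sigmoid(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))

a, x, h = 2.0, 0.7, 1e-6
numerical = (sigmoid(x + h, a) - sigmoid(x - h, a)) / (2 * h)  # central difference
analytic = a * sigmoid(x, a) * (1 - sigmoid(x, a))             # a * sigma(x) * (1 - sigma(x))
print(numerical, analytic)  # the two values agree to several decimal places
```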
● Compute the linear combination of the data.
● When the i-th data point is given, the output of the sigmoid function is the probability that Y = 1 for that data point.
[Value we want to find] $P(Y=1 \mid x) = \sigma(w_0 + w_1x_1 + \dots + w_mx_m)$
⇒ the probability that $Y = 1$ given the realized values of the explanatory variables; $w_0 + \dots$ is the linear combination of the data with the parameters.
[Formula (single explanatory variable)] $P(Y=1 \mid x) = \sigma(w_0 + w_1x_1)$ ($w_0$: intercept, $w_1$: regression coefficient, $x_1$: explanatory variable)
● There are various probability distributions: normal, t, gamma, uniform, Dirichlet, and so on.
● Bernoulli distribution: a discrete probability distribution that takes the value 1 with probability $p$ and the value 0 with probability $1-p$.
⇒ The classic example is the rate of heads and tails in a coin toss.
[Expressing the probabilities that $Y = 0$ and $Y = 1$ in a single formula]
$P(y) = p^y(1-p)^{1-y}$
● Estimating the parameter of the Bernoulli distribution: from the observed data, we want to estimate the distribution (parameter) that most plausibly generated that data. ⇒ "Maximum likelihood estimation"
● Joint (simultaneous) probability: the probability that several data points are obtained at the same time. If the random variables are independent, it is the product of the individual probabilities. ⇒ For the heads/tails outcomes of coin tosses, the results are independent, so multiplying the probabilities is sufficient.
● Likelihood function: the data are held fixed and only the parameter is varied. The estimation method that selects the parameter maximizing the likelihood function is called "maximum likelihood estimation".
[Probability that $y = y_1$ in one trial] $P(y) = p^y(1-p)^{1-y}$
[Probability that $y_1, \dots, y_n$ occur simultaneously in n trials]
P(y_1,y_2,\dots,y_n;p)=\prod_{i=1}^{n}p^{y_{i}}(1-p)^{1-y_{i}}
[Likelihood function when the data $y_1, \dots, y_n$ have been obtained]
P(y_1,y_2,\dots,y_n;p)=\prod_{i=1}^{n}p^{y_{i}}(1-p)^{1-y_{i}}
⇒ Fix the data that were actually obtained and find the parameter $\hat p$ that maximizes the likelihood.
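For the plain Bernoulli case the maximizer has a closed form, $\hat p$ = sample mean; a small sketch with made-up coin-toss data that checks this with a grid search over the log-likelihood:

```python
import numpy as np

# Made-up observations y_1 .. y_n (1 = heads, 0 = tails)
y = np.array([1, 0, 1, 1, 0, 1, 1, 0])

def log_likelihood(p, y):
    """log of prod_i p^y_i (1-p)^(1-y_i) = sum_i y_i*log(p) + (1-y_i)*log(1-p)."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# The p that maximizes the log-likelihood on a grid matches the sample mean
grid = np.linspace(0.01, 0.99, 99)
p_hat = grid[np.argmax([log_likelihood(p, y) for p in grid])]
print(p_hat, y.mean())  # grid maximizer ~0.62, sample mean 0.625
```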
● In logistic regression, the probability $p$ is given by the sigmoid function, and the parameters to be estimated are the weight parameters.
P(Y=y_1|x_1)=p_1^{y_1}(1-p_1)^{1-y_1}=\sigma(w^Tx_1)^{y_1}(1-\sigma(w^Tx_1))^{1-y_1}\\
P(Y=y_2|x_2)=p_2^{y_2}(1-p_2)^{1-y_2}=\sigma(w^Tx_2)^{y_2}(1-\sigma(w^Tx_2))^{1-y_2}\\
\vdots\\
P(Y=y_n|x_n)=p_n^{y_n}(1-p_n)^{1-y_n}=\sigma(w^Tx_n)^{y_n}(1-\sigma(w^Tx_n))^{1-y_n}
⇒ $w^Tx$ is the same linear combination as in linear regression; once $w$ is determined, $p$ is computed and thus the probability is obtained.
● Likelihood function formula
\begin{align}
P(y_1,y_2,\dots,y_n|w_0,w_1,\dots,w_m)&=\prod_{i=1}^{n}p_i^{y_{i}}(1-p_i)^{1-y_{i}}\\
&=\prod_{i=1}^{n}\sigma(w^Tx_i)^{y_i}(1-\sigma(w^Tx_i))^{1-y_i}\\
&=L(w)
\end{align}
⇒ Maximize the logarithm of $L(w)$!
● Taking the logarithm makes the differentiation easier: the product over joint probabilities becomes a sum, and exponents turn into multiplication.
● **The point that maximizes the log-likelihood function is the same as the point that maximizes the likelihood function** (because the logarithm is a monotonically increasing function).
● Rather than maximizing the likelihood, we attach a minus sign and minimize the negative log-likelihood, matching the "minimize" convention of the least squares method.
\begin{align}
E(w_0,w_1,\dots,w_m)&=-\log L(w_0,w_1,\dots,w_m)\\
&=-\sum_{i=1}^n\{y_i\log p_i+(1-y_i)\log(1-p_i)\}
\end{align}
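A short sketch of this negative log-likelihood (the cross-entropy error) for given labels $y_i$ and predicted probabilities $p_i$ (NumPy assumed; the values are made up):

```python
import numpy as np

def neg_log_likelihood(y, p, eps=1e-12):
    """E(w) = -sum_i { y_i*log(p_i) + (1-y_i)*log(1-p_i) }; eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])          # labels
p = np.array([0.9, 0.2, 0.7, 0.6])  # model outputs sigma(w^T x_i)
print(neg_log_likelihood(y, p))     # smaller is better (the fit is more plausible)
```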
● Gradient descent: an approach that updates the parameters sequentially through iterative learning. **$\eta$ is a hyperparameter called the learning rate** and controls how easily the model converges.
[Linear regression model (least squares method)] The value at which the derivative of the MSE with respect to the parameters becomes 0 can be found analytically. [Logistic regression model (maximum likelihood method)] Setting the derivative of the log-likelihood function with respect to the parameters to 0 does not yield a solution that can be obtained analytically, so iterative updates are needed.
w^{(k+1)}=w^{(k)}-\eta\frac{\partial E(w)}{\partial w}
● Differentiate $E(w)$ (the negative log-likelihood) with respect to the coefficients and the bias
\begin{align}
\frac{\partial E(w)}{\partial w}&=\sum_{i=1}^{n}\frac{\partial E_i(w)}{\partial p_i}\frac{\partial p_i}{\partial w}\\
&=-\sum_{i=1}^{n}\Bigl(\frac{y_i}{p_i}-\frac{1-y_i}{1-p_i}\Bigr)\frac{\partial p_i}{\partial w}\\
&=-\sum_{i=1}^{n}\Bigl(\frac{y_i}{p_i}-\frac{1-y_i}{1-p_i}\Bigr)p_i(1-p_i)x_i\\
&=-\sum_{i=1}^{n}(y_i(1-p_i)-p_i(1-y_i))x_i\\
&=-\sum_{i=1}^{n}(y_i-p_i)x_i
\end{align}
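The closed form $-\sum_i(y_i - p_i)x_i$ can be verified against a numerical gradient on a tiny made-up data set (a sketch; the bias $w_0$ is folded into $w$ by appending a constant 1 to each $x_i$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def E(w, X, y):
    """Negative log-likelihood."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])  # last column = constant 1 for the bias
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
w = rng.normal(size=3)

p = sigmoid(X @ w)
analytic = -X.T @ (y - p)                                  # -sum_i (y_i - p_i) x_i
h = 1e-6
numeric = np.array([(E(w + h * np.eye(3)[j], X, y) - E(w - h * np.eye(3)[j], X, y)) / (2 * h)
                    for j in range(3)])
print(analytic, numeric)  # the two gradients agree closely
```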
● When the parameters are no longer updated ⇒ the gradient has become 0 (= an optimal solution has been found).
w^{(k+1)}=w^{(k)}+\eta\sum_{i=1}^{n}(y_i-p_i)x_i
⇒ In gradient descent, all n data points have to be used to compute a single parameter update. When n is huge, the data may no longer fit in memory and the computation time becomes enormous. ⇒ **Use the stochastic gradient descent method (SGD)**.
● Randomly select one data point at a time and update the parameters with it.
● With the same amount of computation that gradient descent needs for one update, the parameters are updated n times, so the optimal solution can be searched for efficiently.
● Because SGD looks at different data from step to step, it is unlikely to get stuck at a point whose gradient happens to be 0 near the initial value.
w^{(k+1)}=w^{(k)}+\eta(y_i-p_i)x_i
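A minimal SGD training loop using this update rule (a sketch on synthetic data; the learning rate, epoch count, and random seed are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n, m = 200, 2
X = np.hstack([rng.normal(size=(n, m)), np.ones((n, 1))])  # bias folded in as a constant 1
true_w = np.array([2.0, -1.0, 0.5])
y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)    # synthetic 0/1 labels

w = np.zeros(m + 1)
eta = 0.1                                                  # learning rate (hyperparameter)
for epoch in range(20):
    for i in rng.permutation(n):                           # visit the data in random order
        p_i = sigmoid(X[i] @ w)
        w = w + eta * (y[i] - p_i) * X[i]                  # w <- w + eta * (y_i - p_i) * x_i
print(w)  # roughly recovers the direction of true_w
```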
● Confusion matrix: a table that breaks down the model's prediction results on the validation data from four perspectives:
(1) the model's prediction result, "positive" or "negative", and (2) the validation-data result, "positive" or "negative", arranged as a 2x2 table.
| Prediction result \ Validation data result | positive | negative |
|---|---|---|
| positive | True Positive (TP) | False Positive (FP) |
| negative | False Negative (FN) | True Negative (TN) |
● Evaluating classification: the accuracy (correct answer rate) is often used, but when the data are imbalanced it gives a misleading evaluation. With 80 spam emails and 20 normal emails, a classifier that judges every email as spam still achieves 80% accuracy.
● Instead, evaluate using "Recall" and "Precision".
For a disease diagnosis, a re-examination caused by a false alarm is better than missing the disease ⇒ prioritize Recall.
For spam filtering, exclude only the emails you are certain about and let a person check the rest ⇒ prioritize Precision.
● F value: ideally both recall and precision should be high, but there is a trade-off between them (it becomes a see-saw), so the F value, the harmonic mean of the two, is used to balance them.
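As a sketch (with made-up predictions and labels), the confusion-matrix counts and the accuracy / recall / precision / F value derived from them:

```python
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # validation-data labels (1 = positive)
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])  # model predictions

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)                      # of the actual positives, how many were found
precision = tp / (tp + fp)                   # of the predicted positives, how many were correct
f_value = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
print(accuracy, recall, precision, f_value)  # 0.75 each for this toy example
```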