[PYTHON] Machine learning
**What is machine learning?**
A computer program is said to learn from experience E with respect to task T and performance measure P if its performance at task T, as measured by P, improves with experience E (Tom Mitchell, 1997).
- Suppose a computer program is given data and solves task T.
- When unknown data is input, it produces output Y1.
- Output Y1 can be measured by the performance measure P.
- New data is then input and output Y2 is produced.
- If Y2, measured by the performance measure P, is an improvement over Y1, the computer program can be said to have learned.
**Linear regression**
- **Regression problem**
- Explanatory variable (input)
$ x = (x_1, x_2, \cdots, x_m)^T \in ℝ^m $
- Objective variable (output)
$ y \in ℝ^1$
- **Linear regression model**
- One of the machine learning models for solving regression problems
- Supervised learning
- A model that outputs a linear combination of the inputs and m-dimensional parameters
- A hat is added to the predicted value (to distinguish it from the y in the teacher data)
**Teacher data**
$$ \{(x_i, y_i)\,;\, i = 1, \dots, n\} $$
**Parameters** (same number of dimensions as the input variables)
$$ w = (w_1, w_2, \dots, w_m)^T \in ℝ^m $$
**Linear combination** (inner product of the unknown parameters w and the input x)
$$\hat{y} = w^Tx + w_0 = \sum_{j = 1}^{m} w_jx_j + w_0 $$
- **Linear combination**
- Sum of the inner products of the unknown parameters w and the input x
- The intercept $w_0$ is added (the intercept is the intersection with the y-axis; it translates the line along the y-axis)
- Even if the input vector $x_j$ is multidimensional, the output is one-dimensional (a scalar)
- **Model parameters**
- Set of weights $w_j$ (how each feature affects the predicted value)
- The best weights (slope) are estimated by the least squares method
- *Simple regression model (one explanatory variable, m = 1)*
$ y = w_0 + w_1x_1 + \epsilon $
(objective variable = intercept + regression coefficient × explanatory variable + error)
**Simultaneous equations**
$y_1 = w_0 + w_1x_1 + \epsilon_1 $
$y_2 = w_0 + w_1x_2 + \epsilon_2 $
$\vdots$
$y_n = w_0 + w_1x_n + \epsilon_n $
**Matrix representation**
$ y = Xw + \epsilon $
- *Multiple regression model (explanatory variables m > 1)*
$ y = w_0 + w_1x_1 + w_2x_2 + \epsilon $, etc.
**Data splitting**
- Mean squared error (MSE)
- Average of the squared errors between the data and the model
- Depends only on the parameters
$ MSE_{train} = \frac{1}{n_{train}} \sum_{i=1}^{n_{train}}(\hat{y_i}^{(train)} - y_i^{(train)})^2 $
- Search for the parameters that minimize the mean squared error of the training data (the gradient is 0 at the minimum)
- $\hat{w}$ is the point in w-space where MSE is minimized
When $\frac{\partial}{\partial w} MSE_{train} = 0$ (the point where the derivative of MSE with respect to w becomes 0), ...
- **Regression coefficients**
$ \hat{w} = (X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)} $
Since the predicted value $\hat{y}$ is the inner product of the regression coefficients $\hat{w}$ and the explanatory variables x, ...
- **Predicted value**
$\hat{y} = X(X^{(train)T}X^{(train)})^ {-1}X^{(train)T}y^{(train)}$
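
A minimal sketch (not from the original notes; NumPy only, synthetic data) of estimating $\hat{w}$ with the normal equation above and computing the predicted values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-3, 3, size=(n, 1))
y = 2.0 * x[:, 0] + 1.0 + rng.normal(scale=0.5, size=n)   # true w1 = 2, w0 = 1 (assumed)

X = np.hstack([np.ones((n, 1)), x])            # design matrix with an intercept column
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y       # (X^T X)^{-1} X^T y
y_hat = X @ w_hat                              # predicted values
print("estimated [w0, w1]:", w_hat)
print("train MSE:", np.mean((y_hat - y) ** 2))
```

In practice `np.linalg.lstsq` or `np.linalg.pinv` is numerically preferable to forming the explicit inverse.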
**Non-linear regression model**
- Basis functions
$ y_i = f(x_i) + \epsilon_i $
x is transformed by a non-linear map $\phi$, and the inner product with w is then taken (the model remains linear in the parameters w).
$ y_i = w_0 + \sum_{j=1}^{m} w_j\phi_j(x_i) + \epsilon_i $
Examples of basis functions: polynomial functions, Gaussian basis functions, (B-)spline functions, etc.
- Non-linear regression based on one-dimensional basis functions
- Polynomial (1st to 9th order)
$\phi_j(x) = x^j$
- Gaussian basis functions
$ \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2h_j}\right)$
  - Since it has the form of a normal distribution, it integrates to 1 (when normalized).
  - The center (mean) of the Gaussian and how widely the peak spreads are defined by a parameter called the bandwidth $h_j$ → this can also be used for two-dimensional basis functions.
- Non-linear regression based on two-dimensional basis functions
- Two-dimensional Gaussian basis functions
$\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^T(x - \mu_j)}{2h_j}\right)$
  - Basis functions can be placed anywhere in the data space
- Basis expansion method
- Explanatory variable
$ x_i = (x_{i1}, x_{i2}, \cdots, x_{im}) \in ℝ^m $
The explanatory variables are converted into a basis function vector (x is transformed by K maps $\phi$ prepared in advance), and ...
- Non-linear function vector (K-dimensional feature vector)
$ \phi(x_i) = (\phi_1(x_i), \phi_2(x_i), \cdots, \phi_K(x_i))^T \in ℝ^K $
Since there are n training data points, applying the maps to all of them yields n K-dimensional feature vectors. Stacking them gives the design matrix.
- Design matrix
$ \Phi^{(train)} = (\phi(x_1), \phi(x_2), \cdots, \phi(x_n))^T \in ℝ^{n \times K} $
After that, find the predicted value using the maximum likelihood method.
- Predicted value
$\hat{y} = \Phi(\Phi^{(train)T}\Phi^{(train)})^ {-1}\Phi^{(train)T}y^{(train)}$
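
A minimal sketch of the basis expansion above (NumPy only; the centers `mu`, bandwidth `h`, and sine-shaped data are illustrative assumptions): the design matrix $\Phi$ is built from Gaussian basis functions and fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)   # non-linear target (assumed)

mu = np.linspace(0, 1, 10)    # basis centers mu_j (assumed, evenly spaced)
h = 0.05                      # bandwidth h_j (shared here for simplicity)

def phi(x):
    """Gaussian basis functions plus a bias column."""
    gauss = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * h))
    return np.hstack([np.ones((len(x), 1)), gauss])

Phi = phi(x)                                     # design matrix (n x (K+1))
w_hat = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y  # least squares on the transformed features
y_hat = Phi @ w_hat
print("train MSE:", np.mean((y_hat - y) ** 2))
```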
- Overfitting and underfitting
- In the case of underfitting:
The model has low expressive power $\rightarrow$ use a more expressive model
- In the case of overfitting:
Increase training data /
Delete unnecessary basis functions (variables) /
Regularization
**Regularization**
- Regularization method (penalization method)
A penalty is imposed to reduce model complexity
$ S_\gamma = (y - \Phi w)^T(y - \Phi w) + \gamma R(w) $
- Role of the regularization term R
- No R: least squares estimator
- L2 norm: Ridge estimator (circular constraint region, shrinkage estimation)
Constrain the parameters to a region around the origin and find the point that minimizes MSE within that region
- L1 norm: Lasso estimator (diamond-shaped constraint region, sparse estimation)
The tangent point that minimizes MSE tends to lie at a corner of the constraint region, so some parameters (e.g. $w_1$) are estimated to be exactly 0. Unnecessary basis functions and variables can therefore be eliminated at the estimation stage (when there are many explanatory variables, it is useful that variables with little effect on prediction are set to 0 in a single estimation).
- The parameter $\gamma$ determines the size of the constraint region (see the sketch below)
Making $\gamma$ smaller → the constraint region becomes larger
Making $\gamma$ larger → the constraint region becomes smaller
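
A minimal sketch assuming scikit-learn is available: its `alpha` argument plays the role of $\gamma$, and the L1-penalized Lasso drives some coefficients to exactly 0 while Ridge only shrinks them (the data here are synthetic).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 informative variables (assumed)
y = X @ w_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: some coefficients become exactly 0
print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
```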
- Choosing the right model
- Holdout method
If there is not much data, it does not give a good performance evaluation
- Cross-validation
The data are split into training and evaluation sets for each iteration
Even if the average accuracy (the CV value) is lower than the accuracy of a model validated by the holdout method, this does not necessarily mean that its generalization performance is lower.
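
A minimal sketch assuming scikit-learn: the same model evaluated once by holdout and by 5-fold cross-validation, whose mean score is the CV value (the data and `Ridge(alpha=1.0)` are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=0.3, size=80)

model = Ridge(alpha=1.0)

# Holdout: a single train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("holdout R^2:", model.fit(X_tr, y_tr).score(X_te, y_te))

# Cross-validation: the CV value is the average score over the 5 folds
scores = cross_val_score(model, X, y, cv=5)
print("CV value (mean R^2):", scores.mean())
```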
**Logistic regression**
- Classification problem
**Explanatory variable (input)**
$ x = (x_1, x_2, \cdots, x_m)^T \in ℝ^m $
**Objective variable (output)**
$ y \in \{0, 1\}$
- Logistic regression model
- One of the machine learning models for solving classification problems
- Supervised learning
- The linear combination of the input and the m-dimensional parameters is passed through the sigmoid function
- The output is the value of the probability that y = 1
**Teacher data**
$ \{(x_i, y_i)\,;\, i = 1, \dots, n\} $
**Parameters** (same number of dimensions as the input variables)
$ w = (w_1, w_2, \dots, w_m)^T \in ℝ^m $
**Linear combination** (inner product of the unknown parameters w and the input x)
$\hat{y} = w^Tx + w_0 = \sum_{j = 1}^{m} w_jx_j + w_0 $
- Sigmoid function
- Input is real
- Output is 0-1
- As parameter a increases, the slope of the curve near x = 0 increases.
- Changes in bias change the position of the step
$ \sigma(x) = \frac{1}{1 + exp(-ax)}$
- Properties of the sigmoid function
- Easy differentiation
- In model parameter estimation, it is necessary to find the points that minimize and maximize criteria such as MSE and likelihood → It is advantageous to be able to easily differentiate.
$ \frac{\partial \sigma(x)}{\partial x} = \frac{\partial}{\partial x}\biggl(\frac{1}{1 + \exp(-ax)}\biggr) = (-1) \cdot \{1 + \exp(-ax)\}^{-2} \cdot \exp(-ax) \cdot (-a) = \frac{a\,\exp(-ax)}{\{1 + \exp(-ax)\}^2} = \frac{a}{1 + \exp(-ax)} \cdot \frac{1 + \exp(-ax) - 1}{1 + \exp(-ax)} = a\,\sigma(x)\,(1 - \sigma(x)) $
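
A minimal sketch (NumPy only) that checks the derivative above numerically with a central difference, for a = 1:

```python
import numpy as np

def sigmoid(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))

x, eps = 0.7, 1e-6
numerical = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid(x) * (1.0 - sigmoid(x))                     # a * sigma(x) * (1 - sigma(x)), a = 1
print(numerical, analytic)   # the two values agree closely
```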
- Logistic regression formulation
$ P(Y = 1|x) = \sigma(w_0 + w_1x_1 + ... + w_mx_m) $
(Probability that Y = 1 given the explanatory variable x) = (sigmoid applied to the linear combination of the data and the parameters)
- Bernoulli distribution
- One of various distribution models such as normal distribution
- A discrete probability distribution (takes the value 1 with probability p and 0 with probability 1 - p)
Probability of $y = y_1$ in a single trial
$P(y)=p^y(1-p)^{1-y}$
- Maximum likelihood estimation (what is the most likely P?)
- Simultaneous probability
Each random variable is independent → Multiplication of probabilities
**Probability that $y_1, \dots, y_n$ occur simultaneously in n trials**
$ P(y_1, y_2, \dots, y_n; p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} $ ($p$ and $y_i$ are known)
- Likelihood = the probability of drawing a specific sample $x_k$ from the target population $\mu$
- The joint probability, viewed as a function of the parameter, is the likelihood function
- Choosing the parameter that maximizes the likelihood function is called maximum likelihood estimation
Likelihood function when the data $y_1, \dots, y_n$ have been obtained
$ P(y_1, y_2, \dots, y_n; p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} $ ($p$ is unknown, $y_i$ are known)
- Log-likelihood function
- Addition and subtraction are easier to compute than repeatedly multiplying complicated values
→ Take the logarithm to convert the product into a sum
- The likelihood function is maximized at the same coefficients w as before taking the logarithm (the logarithmic function is monotonically increasing)
→ This optimal w is called the maximum likelihood estimator (a local or global optimal solution)
- Multiply the log-likelihood by minus one and minimize it, which aligns it with least squares minimization
→ Gradient descent method
$E(w_0, w_1, \dots, w_m) = -\log L(w_0, w_1, \dots, w_m) = -\sum_{i=1}^{n}\{y_i \log p_i + (1 - y_i) \log(1 - p_i)\}$
- Cross entropy (negative log-likelihood function)
- The parameters are updated in proportion to the error in the predicted probability ($y_i - p_i$)
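
A minimal sketch (NumPy only; `y` and `p` are made-up values) of computing the cross entropy, i.e. the negative log-likelihood $E(w)$ above, from labels and predicted probabilities:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])            # observed labels (made up)
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # predicted probabilities P(y=1|x) (made up)

cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(cross_entropy)   # smaller when p matches y well
```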
- Gradient descent
- An approach in which the parameters are updated sequentially (by computer) through iterative learning
- $\eta$ is a hyperparameter called the learning rate that adjusts how easily the model parameters converge
- The derivatives with respect to the parameters can be obtained analytically, as with MSE
$w^{(k+1)} = w^{(k)} + \eta \sum_{i=1}^{n}(y_i - p_i)x_i$
In the gradient descent method, the sum over all n data points is needed for each parameter update.
→ This leads to problems such as insufficient memory and huge computation time.
- Stochastic Gradient Descent (SGD)
- Randomly (probabilistically) select data one by one and update parameters
- Since the parameters can be updated n times with the same amount of computation that gradient descent needs for a single update, the optimal solution can be searched for efficiently
- Parameter updates are somewhat irregular (compared to gradient descent), making it harder to get stuck in a local optimum near the initial value (the logistic regression objective has only a single peak, so this matters little here)
$w^{(k+1)} = w^{(k)} + \eta (y_i - p_i)x_i$
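
A minimal sketch (NumPy only, synthetic data) of stochastic gradient descent for logistic regression using the update rule above; `eta` is the learning rate hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # bias column + features
w_true = np.array([-0.5, 2.0, -1.0])                        # assumed true parameters
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(m + 1)
eta = 0.1                                    # learning rate (hyperparameter)
for epoch in range(50):
    for i in rng.permutation(n):             # pick data points one at a time, at random
        p_i = 1 / (1 + np.exp(-X[i] @ w))    # predicted probability for sample i
        w += eta * (y[i] - p_i) * X[i]       # SGD update: w <- w + eta (y_i - p_i) x_i
print("estimated w:", np.round(w, 2))
```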
- Model evaluation
- Confusion matrix
- Validation data label : model prediction
- TP (true positive) = ○:○, correctly predicted positive
- FP (false positive) = ×:○, wrongly predicted positive (a normal case may be judged abnormal)
- FN (false negative) = ○:×, wrongly predicted negative (an abnormal case may be judged normal)
- TN (true negative) = ×:×, correctly predicted negative
- Accuracy (correct answer rate)
$\frac {TP+TN}{TP+FN+FP+TN}$
- Recall
- Percentage of "really positive things" that can be predicted as positive
- Example: Disease screening (oversight is NG)
$\frac{TP}{TP+FN}$
- Precision
- Percentage of the cases the model predicts as Positive that are "really Positive"
- Example: spam email filtering (overlooking some spam is acceptable)
$\frac{TP}{TP+FP}$
- F-measure: harmonic mean of Recall and Precision
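
A minimal sketch (NumPy only; the label vectors are made up) of computing the confusion-matrix counts and the metrics above:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # validation labels (made up)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model predictions (made up)

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)          # of the truly positive cases, how many were found
precision = tp / (tp + fp)       # of the predicted positives, how many were correct
f_measure = 2 * recall * precision / (recall + precision)
print(accuracy, recall, precision, f_measure)
```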
**Principal component analysis (PCA)**
- Dimensional compression
- Compress the structure of multivariate data into a smaller number of indicators
- I want to reduce information loss as much as possible
- Analysis and visualization are possible with a small number of variables (about 2 or 3 dimensions)
- Learning data
$ x = (x_1, x_2, \cdots, x_m)^T \in ℝ^m $
- Average (vector)
$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $
- Data matrix
$ \bar{X} = (x_1 - \bar{x}, \dots, x_n - \bar{x})^T $ (center the data so that it is distributed around the origin)
- Covariance matrix
$ \Sigma = Var(\bar{X}) = \frac{1}{n} \bar{X}^T \bar{X}$
- Vector after linear transformation
$ s_j = (s_{1j}, ... , s_{nj})^T = \bar{X}a_j$
$a_j \in ℝ^m $
- (The n data points are transformed into $s_j$ by a common coefficient vector $a_j$, which plays the same role as the unknown parameter w in regression)
- (j is the index of the destination axis, e.g. when compressing 3-D data to 1 or 2 dimensions)
- Information loss can be suppressed by maximizing the variance of the converted data $ s_j $
- Search for a mapping axis that maximizes the variance of variables after linear transformation
- Variance after linear transformation
$Var(s_j) = \frac1n s^T_j s_j \\ = \frac1n (\bar{X}a_j)^T (\bar{X}a_j) \\ = \frac1n a^T_j \bar{X}^T \bar{X} a_j \\ = a^T_j Var(\bar{X}) a_j $
- Constraint optimization problem
- Insert a constraint that the norm is 1 (without the constraint, there are infinite solutions)
- Objective function
$ \arg\max_{a_j \in ℝ^m} \; a^T_j Var(\bar{X})\, a_j $ (the variance after the transformation)
- Constraints
$ a^T_j a_j =1 $
- Lagrange function
$ E(a_j) = a^T_j Var(\bar{X}) a_j - \lambda(a^T_j a_j - 1)$
Lagrange function = objective function - (Lagrange multiplier) × (constraint)
(find the point that maximizes it with respect to a)
- Differentiation
$ \frac{\partial E(a_j)}{\partial a_j} = 2\, Var(\bar{X})\, a_j - 2 \lambda a_j = 0 $
- Solution
- This has the form of an eigenvalue problem (eigenvalue and eigenvector)
$ Var(\bar{X})\, a_j = \lambda a_j $
- The variance of the projection destination matches the eigenvalue
$Var(s_1) = a_1^T Var(\bar{X}) a_1 = \lambda_1 a_1^T a_1 = \lambda_1 $
- Contribution rate
- How much information was lost as a result of dimensional compression
- Sum of eigenvalues matches the variance of the data in the original space
$ V_{total} = \sum^{m}_{i=1} \lambda_i $
Total variance of the original data = Total variance of the principal components
- Contribution rate: The ratio of the variance of a certain axis (kth principal component) to the total variance
- Cumulative contribution rate: the proportion of the total variance (information) retained when the data are compressed to principal components 1 through k
$c_k = \frac{\lambda_k}{\sum_{i=1}^{m}\lambda_i} $
= (Variance of the kth principal component) / (Total variance of the principal component)
$c_k = \frac { \sum_{j=1}^{k} \lambda_j}{\sum_{i=1}^{m}\lambda_i} \\ $
= (Variance of the 1st through kth principal components) / (Total variance of the principal components)
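
A minimal sketch (NumPy only, synthetic data) of PCA by eigendecomposition of the covariance matrix of the centered data, including the contribution rates:

```python
import numpy as np

rng = np.random.default_rng(5)
# correlated 3-D data (synthetic, for illustration)
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

X_bar = X - X.mean(axis=0)                  # center the data around the origin
cov = X_bar.T @ X_bar / len(X_bar)          # Var(X_bar)
eigval, eigvec = np.linalg.eigh(cov)        # eigh: symmetric matrix, ascending eigenvalues
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # reorder to descending

S = X_bar @ eigvec[:, :2]                   # scores on the first two principal axes
contribution = eigval / eigval.sum()        # contribution rate of each component
print("contribution rates:", np.round(contribution, 3))
print("cumulative (first 2):", np.round(contribution[:2].sum(), 3))
```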
**k-means clustering**
- Set initial values for the cluster centers
- For each data point, compute the distance to each cluster center and assign it to the closest cluster
- Compute the mean vector (center) of each cluster
- Repeat the previous two steps until convergence
  - If the initial values of the centers are changed, the clustering result also changes; the result also changes if the value of K is changed
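
A minimal sketch (NumPy only, synthetic two-cluster data) of the steps listed above:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial cluster centers
    for _ in range(n_iter):
        # distance from every point to every center, then nearest-cluster assignment
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each cluster center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)   # results depend on the initial centers and on the chosen K
```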