[PYTHON] Machine learning
**What is machine learning?**
A computer program is said to learn from experience E with respect to task T and performance measure P if its performance at task T, as measured by P, improves with experience E (Tom Mitchell, 1997).
・ Suppose you feed data into a computer program and it solves task T.
・ When unknown data is input, the program produces output Y1.
・ Output Y1 can be measured by the performance index P.
・ New data is input and the program produces output Y2.
・ If Y2 is an improvement over Y1 as measured by the performance index P, the program can be said to have learned.
**Linear regression**
**Regression problem**
Explanatory variable (input)
$ x = (x_1, x_2, \dots, x_m)^T \in ℝ^m $
Objective variable (output)
$ y \in ℝ^1 $
**Linear regression model**
One of the machine learning models for solving regression problems
Supervised learning
A model that outputs a linear combination of the input and m-dimensional parameters
A hat is added to the predicted value (to distinguish it from y in the teacher data)
**Teacher data**
$$ \{(x_i, y_i);\; i = 1, \dots, n\} $$
**Parameters** (same number of dimensions as the input variables)
$$ w = (w_1, w_2, \dots, w_m)^T \in ℝ^m $$
**Linear combination** (inner product of the unknown parameter w and the input x)
$$ \hat{y} = w^Tx + w_0 = \sum_{j = 1}^{m} w_jx_j + w_0 $$

**Linear combination**
The inner product of the unknown parameter w and the input x
The intercept $w_0$ is added (the intercept is the intersection with the y-axis and has the effect of translating the line along the y-axis)
Even if the input vector $x$ is multi-dimensional, the output is one-dimensional (a scalar).
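As a quick check of the formula above, here is a minimal NumPy sketch; the weights, intercept, and input are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical parameters and input (m = 3)
w = np.array([0.5, -1.2, 2.0])  # weights w_1..w_m
w0 = 0.7                        # intercept w_0
x = np.array([1.0, 2.0, 3.0])   # one input vector

# y_hat = w^T x + w_0: a scalar even though x is multi-dimensional
y_hat = w @ x + w0
print(y_hat)  # 0.5*1.0 - 1.2*2.0 + 2.0*3.0 + 0.7 = 4.8
```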

**Model parameters**
・ The set of weights $w_j$ (how each feature affects the prediction)
・ The best weights (slope) are estimated by the least squares method
*Simple regression model (explanatory variable m = 1)*
$ y = w_0 + w_1x_1 + \epsilon $
(objective variable = intercept + regression coefficient × explanatory variable + error)
**Simultaneous equations**
$ y_1 = w_0 + w_1x_1 + \epsilon_1 $
$ y_2 = w_0 + w_1x_2 + \epsilon_2 $
$ \vdots $
$ y_n = w_0 + w_1x_n + \epsilon_n $
** Matrix representation **
$ y = Xw + \epsilon $
*Multiple regression model (explanatory variables m > 1)*
$ y = w_0 + w_1x_1 + w_2x_2 + \epsilon $ etc.
**Data split**
Split the data into training data (for fitting the model) and test data (for evaluating generalization).
Mean squared error (average of the squared residuals)
The sum of the squared errors between the data and the model, divided by the number of data points
Depends only on the parameters
$ MSE_{train} = \frac{1}{n_{train}} \sum_{i=1}^{n_{train}}(\hat{y}_i^{(train)} - y_i^{(train)})^2 $
Search for the parameters that minimize the mean squared error of the training data (gradient 0 at the minimum)
At the MSE minimum point $\hat{w}$ in w-space,
$ \frac{\partial}{\partial w} MSE_{train} = 0 $ (the point where the derivative of MSE with respect to w becomes 0), so ...
 ** Regression coefficient **
$ \hat{w} = (X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)} $
Since the predicted value $\hat{y}$ is the inner product of the regression coefficient $\hat{w}$ and the explanatory variable x, ...
 ** Predicted value **
$ \hat{y} = X(X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)} $
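The two closed-form expressions above can be checked numerically. Below is a minimal NumPy sketch on synthetic data (the true weights and noise level are assumptions for illustration); it solves the normal equation rather than inverting the matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = 2x + 1 + noise (assumed for illustration)
n = 50
x = rng.uniform(0, 10, n)
y = 2 * x + 1 + rng.normal(0, 1, n)

# Design matrix with a leading column of ones for the intercept w_0
X = np.column_stack([np.ones(n), x])

# Normal equation: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # approximately [1, 2]

# Predicted values: y_hat = X w_hat
y_hat = X @ w_hat
```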
** Nonlinear regression model **

Basis functions
$ y_i = f(x_i) + \epsilon_i $
x is first transformed by nonlinear maps $\phi_j$, and then the inner product with w is taken (the model remains linear in the parameters w).
$ y_i = w_0 + \sum_{j=1}^{m} w_j\phi_j(x_i) + \epsilon_i $
Examples of basis functions: polynomial functions, Gaussian basis functions, (B-)spline functions, etc.

Nonlinear regression based on one-dimensional basis functions
Polynomial (1st to 9th order)
$ \phi_j(x) = x^j $
Gaussian basis
$ \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2h_j}\right) $
It has the shape of a normal distribution (with the normalizing constant it would integrate to 1).
The location of the center and the spread of the bump are set by the parameters $\mu_j$ and the bandwidth $h_j$ → can be extended to two-dimensional basis functions
Nonlinear regression based on two-dimensional basis functions
2D Gaussian basis
$ \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^T(x - \mu_j)}{2h_j}\right) $
Basis functions can be placed anywhere in the data space
Basis expansion method
Explanatory variable
$ x_i = (x_{i1}, x_{i2}, \dots, x_{im})^T \in ℝ^m $
Convert the explanatory variables to basis function vectors (transform x with K maps $\phi_j$ prepared in advance), and ...
Nonlinear function vector (K-dimensional feature vector)
$ \phi(x_i) = (\phi_1(x_i), \phi_2(x_i), \dots, \phi_K(x_i))^T \in ℝ^K $
Since there are n training data points, applying the maps to each of them yields n K-dimensional feature vectors. Stacking these gives the design matrix.
Design matrix
$ \Phi^{(train)} = (\phi(x_1), \phi(x_2), \dots, \phi(x_n))^T \in ℝ^{n \times K} $
After that, find the predicted value using the maximum likelihood method.
 Predicted value
$ \hat{y} = \Phi(\Phi^{(train)T}\Phi^{(train)})^{-1}\Phi^{(train)T}y^{(train)} $
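Putting the pieces together, here is a minimal sketch of nonlinear regression with Gaussian basis functions: build the design matrix Φ, solve the same closed form as linear regression, and predict. The toy target function, the centers $\mu_j$, and the bandwidth h are assumptions for illustration.

```python
import numpy as np

def gauss_basis(x, mu, h):
    """Gaussian basis function phi_j(x) = exp(-(x - mu_j)^2 / (2 h))."""
    return np.exp(-(x - mu) ** 2 / (2 * h))

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 30)  # assumed toy target

# K = 9 basis centers spread over the data space, common bandwidth h
mu = np.linspace(0, 1, 9)
h = 0.02

# Design matrix Phi: one row per data point, one column per basis function
Phi = gauss_basis(x_train[:, None], mu[None, :], h)   # shape (n, K)

# Same closed form as linear regression, with Phi in place of X
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_train)

# Prediction at new points
x_new = np.linspace(0, 1, 5)
y_pred = gauss_basis(x_new[:, None], mu[None, :], h) @ w_hat
```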
Overfitting and underfitting
In case of underfitting:
The model has low expressive power $\rightarrow$ use a more expressive model
In case of overfitting:
Increase the training data /
Delete unnecessary basis functions (variables) /
Regularization
**Regularization**

Regularization method (penalization method)
A penalty is imposed to reduce model complexity
$ S_\gamma = (y - \Phi w)^T(y - \Phi w) + \gamma R(w) $

Role of the regularization term R
・ No R ・・・ least squares estimator
・ L2 norm ・・・ Ridge estimator (circular constraint region, shrinkage estimation)
Constrain the parameters to a region around the origin and find the point that minimizes MSE within that region
・ L1 norm ・・・ Lasso estimator (diamond-shaped constraint region, sparse estimation)
The contact point that minimizes MSE tends to lie at a corner of the diamond, so some parameters (e.g. $w_1$) are estimated to be exactly 0. Unnecessary basis functions and variables can therefore be eliminated at the estimation stage (when there are many explanatory variables, it is useful that variables with little effect on the prediction are set to 0 in a single estimation).

Parameter $\gamma$ = size of the constraint region
Making $\gamma$ smaller → the constraint region becomes larger
Making $\gamma$ larger → the constraint region becomes smaller
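As a sketch of shrinkage versus sparse estimation, the snippet below fits Ridge and Lasso with scikit-learn on synthetic data; scikit-learn's `alpha` plays the role of $\gamma$ above, and the data-generating setup is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, m = 100, 10
X = rng.normal(size=(n, m))
# Only the first two features matter; the rest are noise (assumed setup)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, n)

# alpha corresponds to gamma: larger alpha = stronger penalty
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # all coefficients shrunk toward 0, none exactly 0
print(lasso.coef_)  # coefficients of the weak features driven to exactly 0
```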

Choosing the right model
Holdout method
If there is not much data, it does not give a good performance evaluation
Cross-validation
The data is split anew into training and evaluation sets for each iteration
Even if the average accuracy (the CV value) is lower than that of a model validated by holdout, it does not follow that the holdout-validated model has higher generalization performance.
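A minimal cross-validation sketch with scikit-learn's `cross_val_score`; the synthetic data and the Ridge model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(0, 0.3, 100)

# 5-fold cross-validation: each fold is used once for evaluation,
# the rest for training; the mean score is the CV value
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
print(scores.mean())  # average R^2 across the folds
```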
**Logistic regression**
Classification problem
**Explanatory variable (input)**
$ x = (x_1, x_2, \dots, x_m)^T \in ℝ^m $
**Objective variable (output)**
$ y \in \{0, 1\} $
Logistic regression model
One of the machine learning models for solving classification problems
Supervised learning
The linear combination of the input and the m-dimensional parameters is passed through a sigmoid function
The output is the value of the probability that y = 1
**Teacher data**
$ \{(x_i, y_i);\; i = 1, \dots, n\} $
**Parameters** (same number of dimensions as the input variables)
$ w = (w_1, w_2, \dots, w_m)^T \in ℝ^m $
**Linear combination** (inner product of the unknown parameter w and the input x)
$ \hat{y} = w^Tx + w_0 = \sum_{j = 1}^{m} w_jx_j + w_0 $
Sigmoid function
Input is any real number
Output lies between 0 and 1
As the parameter a increases, the slope of the curve near x = 0 increases.
A change in the bias shifts the position of the step
$ \sigma(x) = \frac{1}{1 + \exp(-ax)} $
Properties of the sigmoid function
Easy to differentiate
In model parameter estimation, we need to find the points that minimize or maximize criteria such as MSE and the likelihood → being easy to differentiate is an advantage.
$ \frac{\partial \sigma(x)}{\partial x} = \frac{\partial}{\partial x} \biggl( \frac{1}{1 + \exp(-ax)} \biggr) \\ = (-1) \cdot \{1 + \exp(-ax)\}^{-2} \cdot \exp(-ax) \cdot (-a) \\ = \frac{a \exp(-ax)}{\{1 + \exp(-ax)\}^2} = \frac{a}{1 + \exp(-ax)} \cdot \frac{1 + \exp(-ax) - 1}{1 + \exp(-ax)} \\ = a\sigma(x)(1 - \sigma(x)) $
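The derivative identity above is easy to verify numerically; the sketch below compares the analytic form $a\sigma(x)(1-\sigma(x))$ with a finite-difference approximation.

```python
import numpy as np

def sigmoid(x, a=1.0):
    """sigma(x) = 1 / (1 + exp(-a x))"""
    return 1.0 / (1.0 + np.exp(-a * x))

def sigmoid_grad(x, a=1.0):
    """Analytic derivative: a * sigma(x) * (1 - sigma(x))"""
    s = sigmoid(x, a)
    return a * s * (1.0 - s)

# Numerical check of the identity derived above
x, eps = 0.3, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(numeric, sigmoid_grad(x))  # the two values agree
```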
Logistic regression formulation
$ P(Y = 1 \mid x) = \sigma(w_0 + w_1x_1 + \dots + w_mx_m) $
(probability that Y = 1 when the explanatory variable x is given) = (sigmoid applied to the linear combination of the parameters and the data)
Bernoulli distribution
One of various distribution models, like the normal distribution
A discrete probability distribution (takes the value 1 with probability p and the value 0 with probability 1 - p)
Probability of $y = y_1$ in one trial
$ P(y) = p^y(1-p)^{1-y} $
Maximum likelihood estimation (which p is the most plausible?)
Joint probability
Each random variable is independent → the probabilities are multiplied
**Probability that $y_1, \dots, y_n$ occur simultaneously in n trials**
$ P(y_1, y_2, \dots, y_n; p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} $ (here p is known and the $y_i$ vary)
Likelihood = the probability of drawing a specific sample $x_k$ from the target population
The joint probability viewed as a function of the parameter = likelihood function
Choosing the parameter that maximizes the likelihood function is called maximum likelihood estimation.
Likelihood function when the data $y_1, \dots, y_n$ have been obtained
$ L(p) = P(y_1, y_2, \dots, y_n; p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} $ (here p is unknown and the $y_i$ are known)
Log-likelihood function
Addition and subtraction are easier to compute than repeated multiplication of complicated values
→ Take the logarithm to convert the products into sums
The likelihood function is maximized at the same coefficients w as before taking the logarithm (the logarithm is monotonically increasing)
→ This optimal w is called the maximum likelihood estimator (a local or global optimal solution)
Multiply the log-likelihood by minus one and minimize it, which aligns it with least squares minimization
→ Gradient descent method
$ E(w_0, w_1, \dots, w_m) = -\log L(w_0, w_1, \dots, w_m) = -\sum_{i=1}^{n}\{y_i \log p_i + (1 - y_i) \log(1 - p_i)\} $
Cross entropy (negative log-likelihood function)
The parameters are updated according to the probability the model assigned to the wrong class
Gradient descent
An approach that updates the parameters sequentially (by computer) through iterative learning
$\eta$ is a hyperparameter called the learning rate that adjusts how easily the model parameters converge.
Unlike the MSE of linear regression, the minimizer has no closed form, but the gradient with respect to the parameters can still be derived analytically
$ w^{(k+1)} = w^{(k)} + \eta \sum_{i=1}^{n}(y_i - p_i)x_i $
In the gradient descent method, the sum over all n data points must be computed for each parameter update
→ This leads to problems such as insufficient memory and huge computation time.
Stochastic Gradient Descent (SGD)
Randomly (stochastically) select data points one at a time and update the parameters
Since the parameters can be updated n times with the same amount of computation as one update of the gradient descent method, the optimal solution can be searched for efficiently.
Parameter updates are somewhat irregular (compared to gradient descent), which makes it harder to get stuck in a locally optimal solution close to the initial value (logistic regression has only a single optimum, so this matters little here).
$ w^{(k+1)} = w^{(k)} + \eta (y_i - p_i)x_i $
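A minimal NumPy sketch of the SGD update above on synthetic data; the true weights, learning rate, and epoch count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data (assumed): true weights [2, -1], intercept 0.5
n = 200
X = rng.normal(size=(n, 2))
p_true = 1 / (1 + np.exp(-(X @ np.array([2.0, -1.0]) + 0.5)))
y = rng.binomial(1, p_true)

# Add a column of ones so w[0] acts as the intercept w_0
Xb = np.column_stack([np.ones(n), X])

w = np.zeros(3)
eta = 0.1  # learning rate (hyperparameter)
for epoch in range(100):
    for i in rng.permutation(n):         # visit data points in random order
        p_i = 1 / (1 + np.exp(-Xb[i] @ w))
        w += eta * (y[i] - p_i) * Xb[i]  # w <- w + eta (y_i - p_i) x_i
print(w)  # roughly recovers [0.5, 2, -1]
```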
Model evaluation
Confusion matrix
Validation data result : model prediction result
TP (true positive) = ○:○, correctly predicted as positive
FP (false positive) = ×:○, wrongly predicted as positive (e.g., classifying a normal person as abnormal)
FN (false negative) = ○:×, wrongly predicted as negative (e.g., classifying an abnormal person as normal)
TN (true negative) = ×:×, correctly predicted as negative
Accuracy (correct answer rate)
$ \frac{TP+TN}{TP+FN+FP+TN} $
Recall
The percentage of "truly positive" cases that the model predicts as positive
Example: disease screening (oversights are unacceptable)
$ \frac{TP}{TP+FN} $
Precision
The percentage of cases the model predicts as positive that are "truly positive"
Example: spam email detection (overlooking some spam is acceptable)
$ \frac{TP}{TP+FP} $
F-measure: the harmonic mean of Recall and Precision
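The four counts and the metrics above can be computed directly; the labels below are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # made-up validation labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # made-up model predictions

tp = np.sum((y_true == 1) & (y_pred == 1))  # 3
fp = np.sum((y_true == 0) & (y_pred == 1))  # 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # 3

accuracy  = (tp + tn) / (tp + fp + fn + tn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f1        = 2 * recall * precision / (recall + precision)
print(accuracy, recall, precision, f1)  # 0.75 0.75 0.75 0.75
```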
**Principal component analysis (PCA)**
Dimensionality reduction
Compress the structure of multivariate data into a smaller number of indicators
We want to reduce the information loss as much as possible
With a small number of variables (about 2 or 3 dimensions), analysis and visualization become possible
Learning data
$ x_i = (x_{i1}, x_{i2}, \dots, x_{im})^T \in ℝ^m \quad (i = 1, \dots, n) $
Average (vector)
$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $
Data matrix
$ \bar{X} = (x_1 - \bar{x}, \dots, x_n - \bar{x})^T $ (center the data so that it is distributed around the origin)
Covariance matrix
$ \Sigma = Var(\bar{X}) = \frac1n \bar{X}^T \bar{X} $
Vector after linear transformation
$ s_j = (s_{1j}, \dots, s_{nj})^T = \bar{X}a_j $
$ a_j \in ℝ^m $
(the n data points are transformed by a common coefficient vector $a_j$ into $s_j$)
(j indexes the destination axis, e.g., compressing 3-dimensional data to 1 or 2 dimensions)
Information loss can be suppressed by maximizing the variance of the transformed data $s_j$
→ Search for a projection axis that maximizes the variance of the variables after the linear transformation
 Variance after linear transformation
$Var(s_j) = \frac1n s^T_j s_j \\ = \frac1n (\bar{X}a_j)^T (\bar{X}a_j) \\ = \frac1n a^T_j \bar{X}^T \bar{X} a_j \\ = a^T_j Var(\bar{X}) a_j $
Constrained optimization problem
Impose the constraint that the norm is 1 (without the constraint, there are infinitely many solutions)
Objective function
$ \arg\max_{a_j \in ℝ^m} \; a^T_j Var(\bar{X}) a_j $ (the variance at the projection destination)
 Constraints
$ a^T_j a_j =1 $
Lagrange function
$ E(a_j) = a^T_j Var(\bar{X}) a_j - \lambda(a^T_j a_j - 1) $
Lagrange function = objective function - Lagrange multiplier × constraint
(find the point that maximizes this with respect to a)
Differentiation
$ \frac{\partial E(a_j)}{\partial a_j} = 2 Var(\bar{X}) a_j - 2 \lambda a_j = 0 $
 Solution
The solution takes the form of an eigenvalue problem (eigenvalue and eigenvector)
$ Var(\bar{X}) a_j = \lambda a_j $
 The variance of the projection destination matches the eigenvalue
$Var(s_1) = a_1^T Var(\bar{X}) a_1 = \lambda_1 a_1^T a_1 = \lambda_1 $
Contribution rate
An indicator of how much of the information is retained (how much was lost) as a result of the dimensional compression
 Sum of eigenvalues matches the variance of the data in the original space
$ V_{total} = \sum^{m}_{i=1} \lambda_i $
Total variance of the original data = Total variance of the principal components
Contribution rate: the ratio of the variance of a certain axis (the k-th principal component) to the total variance
Cumulative contribution rate: the proportion of the variance retained when compressing to the 1st through k-th principal components
$ c_k = \frac{\lambda_k}{\sum_{i=1}^{m}\lambda_i} $
= (variance of the k-th principal component) / (total variance of the principal components)
$ C_k = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{i=1}^{m}\lambda_i} $
= (variance of the 1st through k-th principal components) / (total variance of the principal components)
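A minimal NumPy sketch of the whole procedure: center the data, form the covariance matrix, take its eigendecomposition, and read off the contribution rates. The synthetic data is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with very different variances along the three axes (assumed)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])

# Center the data so it is distributed around the origin
Xc = X - X.mean(axis=0)

# Covariance matrix Var(X_bar) = (1/n) X_bar^T X_bar
cov = Xc.T @ Xc / len(Xc)

# Eigendecomposition: eigenvectors are the projection axes a_j,
# eigenvalues are the variances of the projected data
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Contribution rate and cumulative contribution rate
c = eigvals / eigvals.sum()
print(c, np.cumsum(c))

# Project onto the first two principal components
S = Xc @ eigvecs[:, :2]
```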
**K-means clustering**
・ Set initial values for the cluster centers
・ For each data point, compute the distance to each cluster center and assign the point to the nearest cluster
・ Recompute the mean vector (center) of each cluster
・ Repeat the assignment and recomputation steps until convergence
If the initial values of the centers are changed, the clustering result also changes; the result also changes if the value of K is changed (see the sketch below).
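A minimal NumPy sketch of the algorithm, assuming toy 2D data with two well-separated blobs (a production version would also need to handle clusters that become empty).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial cluster centers: k points chosen at random from the data
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers

# Toy data: two blobs (assumed)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
```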