[PYTHON] Machine learning
**What is machine learning?**
A computer program is said to learn from experience E with respect to task T and performance measure P if its performance at task T, as measured by P, improves with experience E (Tom Mitchell, 1997).
- Suppose a computer program is given data and solves task T.
- When unknown data is input, it produces output Y1.
- Output Y1 can be measured by the performance measure P.
- New data is then input and output Y2 is produced.
- If Y2, measured by the performance measure P, is an improvement over Y1, the computer program can be said to have learned.
**Linear regression**
- **Regression problem**
- Explanatory variable (input)
$ x = (x_1, x_2, \cdots, x_m)^T \in ℝ^m $
- Objective variable (output)
$ y \in ℝ^1$
- **Linear regression model**
- One of the machine learning models for solving regression problems
- Supervised learning
- A model that outputs a linear combination of the inputs and m-dimensional parameters
- A hat is added to the predicted value (to distinguish it from the y in the teacher data)
**Teacher data**
$$ \{(x_i, y_i)\,;\, i = 1, \dots, n\} $$
**Parameters** (same number of dimensions as the input variables)
$$ w = (w_1, w_2, \dots, w_m)^T \in ℝ^m $$
**Linear combination** (inner product of the unknown parameters w and the input x)
$$\hat{y} = w^Tx + w_0 = \sum_{j = 1}^{m} w_jx_j + w_0 $$
- **Linear combination**
- Sum of the inner products of the unknown parameters w and the input x
- The intercept $w_0$ is added (the intercept is the intersection with the y-axis; it translates the line along the y-axis)
- Even if the input vector $x_j$ is multidimensional, the output is one-dimensional (a scalar)
- **Model parameters**
- Set of weights $w_j$ (how each feature affects the predicted value)
- The best weights (slope) are estimated by the least squares method
- *Simple regression model (one explanatory variable, m = 1)*
$ y = w_0 + w_1x_1 + \epsilon $
(objective variable = intercept + regression coefficient × explanatory variable + error)
**Simultaneous equations**
$y_1 = w_0 + w_1x_1 + \epsilon_1 $
$y_2 = w_0 + w_1x_2 + \epsilon_2 $
$\vdots$
$y_n = w_0 + w_1x_n + \epsilon_n $
**Matrix representation**
$ y = Xw + \epsilon $
- *Multiple regression model (explanatory variables m > 1)*
$ y = w_0 + w_1x_1 + w_2x_2 + \epsilon $, etc.
**Data splitting**
- Mean squared error (MSE)
- Average of the squared errors between the data and the model
- Depends only on the parameters
$ MSE_{train} = \frac{1}{n_{train}} \sum_{i=1}^{n_{train}}(\hat{y_i}^{(train)} - y_i^{(train)})^2 $
- Search for the parameters that minimize the mean squared error of the training data (the gradient is 0 at the minimum)
- $\hat{w}$ is the point in w-space where MSE is minimized
When $\frac{\partial}{\partial w} MSE_{train} = 0$ (the point where the derivative of MSE with respect to w becomes 0), ...
- **Regression coefficients**
$ \hat{w} = (X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)} $
Since the predicted value $\hat{y}$ is the inner product of the regression coefficients $\hat{w}$ and the explanatory variables x, ...
- **Predicted value**
$\hat{y} = X(X^{(train)T}X^{(train)})^ {-1}X^{(train)T}y^{(train)}$
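
A minimal sketch (not from the original notes; NumPy only, synthetic data) of estimating $\hat{w}$ with the normal equation above and computing the predicted values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-3, 3, size=(n, 1))
y = 2.0 * x[:, 0] + 1.0 + rng.normal(scale=0.5, size=n)   # true w1 = 2, w0 = 1 (assumed)

X = np.hstack([np.ones((n, 1)), x])            # design matrix with an intercept column
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y       # (X^T X)^{-1} X^T y
y_hat = X @ w_hat                              # predicted values
print("estimated [w0, w1]:", w_hat)
print("train MSE:", np.mean((y_hat - y) ** 2))
```

In practice `np.linalg.lstsq` or `np.linalg.pinv` is numerically preferable to forming the explicit inverse.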
**Non-linear regression model**
- Basis functions
$ y_i = f(x_i) + \epsilon_i $
x is transformed by a non-linear map $\phi$, and the inner product with w is then taken (the model remains linear in the parameters w).
$ y_i = w_0 + \sum_{j=1}^{m} w_j\phi_j(x_i) + \epsilon_i $
Examples of basis functions: polynomial functions, Gaussian basis functions, (B-)spline functions, etc.
- Non-linear regression based on one-dimensional basis functions
- Polynomial (1st to 9th order)
$\phi_j(x) = x^j$
- Gaussian basis functions
$ \phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2h_j}\right)$
  - Since it has the form of a normal distribution, it integrates to 1 (when normalized).
  - The center (mean) of the Gaussian and how widely the peak spreads are defined by a parameter called the bandwidth $h_j$ → this can also be used for two-dimensional basis functions.
- Non-linear regression based on two-dimensional basis functions
- Two-dimensional Gaussian basis functions
$\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^T(x - \mu_j)}{2h_j}\right)$
  - Basis functions can be placed anywhere in the data space
- Basis expansion method
- Explanatory variable
$ x_i = (x_{i1}, x_{i2}, \cdots, x_{im}) \in ℝ^m $
The explanatory variables are converted into a basis function vector (x is transformed by K maps $\phi$ prepared in advance), and ...
- Non-linear function vector (K-dimensional feature vector)
$ \phi(x_i) = (\phi_1(x_i), \phi_2(x_i), \cdots, \phi_K(x_i))^T \in ℝ^K $
Since there are n training data points, applying the maps to all of them yields n K-dimensional feature vectors. Stacking them gives the design matrix.
- Design matrix
$ \Phi^{(train)} = (\phi(x_1), \phi(x_2), \cdots, \phi(x_n))^T \in ℝ^{n \times K} $
After that, find the predicted value using the maximum likelihood method.
- Predicted value
$\hat{y} = \Phi(\Phi^{(train)T}\Phi^{(train)})^ {-1}\Phi^{(train)T}y^{(train)}$
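
A minimal sketch of the basis expansion above (NumPy only; the centers `mu`, bandwidth `h`, and sine-shaped data are illustrative assumptions): the design matrix $\Phi$ is built from Gaussian basis functions and fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)   # non-linear target (assumed)

mu = np.linspace(0, 1, 10)    # basis centers mu_j (assumed, evenly spaced)
h = 0.05                      # bandwidth h_j (shared here for simplicity)

def phi(x):
    """Gaussian basis functions plus a bias column."""
    gauss = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * h))
    return np.hstack([np.ones((len(x), 1)), gauss])

Phi = phi(x)                                     # design matrix (n x (K+1))
w_hat = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y  # least squares on the transformed features
y_hat = Phi @ w_hat
print("train MSE:", np.mean((y_hat - y) ** 2))
```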
- Overfitting and underfitting
- In the case of underfitting:
The model has low expressive power $\rightarrow$ use a more expressive model
- In the case of overfitting:
Increase training data /
Delete unnecessary basis functions (variables) /
Regularization
**Regularization**
- Regularization method (penalization method)
A penalty is imposed to reduce model complexity
$ S_\gamma = (y - \Phi w)^T(y - \Phi w) + \gamma R(w) $
- Role of the regularization term R
- No R: least squares estimator
- L2 norm: Ridge estimator (circular constraint region, shrinkage estimation)
Constrain the parameters to a region around the origin and find the point that minimizes MSE within that region
- L1 norm: Lasso estimator (diamond-shaped constraint region, sparse estimation)
The tangent point that minimizes MSE tends to lie at a corner of the constraint region, so some parameters (e.g. $w_1$) are estimated to be exactly 0. Unnecessary basis functions and variables can therefore be eliminated at the estimation stage (when there are many explanatory variables, it is useful that variables with little effect on prediction are set to 0 in a single estimation).
- The parameter $\gamma$ determines the size of the constraint region (see the sketch below)
Making $\gamma$ smaller → the constraint region becomes larger
Making $\gamma$ larger → the constraint region becomes smaller
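
A minimal sketch assuming scikit-learn is available: its `alpha` argument plays the role of $\gamma$, and the L1-penalized Lasso drives some coefficients to exactly 0 while Ridge only shrinks them (the data here are synthetic).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 informative variables (assumed)
y = X @ w_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: some coefficients become exactly 0
print("ridge:", np.round(ridge.coef_, 2))
print("lasso:", np.round(lasso.coef_, 2))
```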
- Choosing the right model
- Holdout method
If there is not much data, it does not give a good performance evaluation
- Cross-validation
The data are split into training and evaluation sets for each iteration
Even if the average accuracy (the CV value) is lower than the accuracy of a model validated by the holdout method, this does not necessarily mean that its generalization performance is lower.
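
A minimal sketch assuming scikit-learn: the same model evaluated once by holdout and by 5-fold cross-validation, whose mean score is the CV value (the data and `Ridge(alpha=1.0)` are illustrative assumptions).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=0.3, size=80)

model = Ridge(alpha=1.0)

# Holdout: a single train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("holdout R^2:", model.fit(X_tr, y_tr).score(X_te, y_te))

# Cross-validation: the CV value is the average score over the 5 folds
scores = cross_val_score(model, X, y, cv=5)
print("CV value (mean R^2):", scores.mean())
```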
**Logistic regression**
- Classification problem
**Explanatory variable (input)**
$ x = (x_1, x_2, \cdots, x_m)^T \in ℝ^m $
**Objective variable (output)**
$ y \in \{0, 1\}$
- Logistic regression model
- One of the machine learning models for solving classification problems
- Supervised learning
- The linear combination of the input and the m-dimensional parameters is passed through the sigmoid function
- The output is the value of the probability that y = 1
**Teacher data**
$ \{(x_i, y_i)\,;\, i = 1, \dots, n\} $
**Parameters** (same number of dimensions as the input variables)
$ w = (w_1, w_2, \dots, w_m)^T \in ℝ^m $
**Linear combination** (inner product of the unknown parameters w and the input x)
$\hat{y} = w^Tx + w_0 = \sum_{j = 1}^{m} w_jx_j + w_0 $
- Sigmoid function
- Input is real
- Output is 0-1
- As parameter a increases, the slope of the curve near x = 0 increases.
- Changes in bias change the position of the step
$ \sigma(x) = \frac{1}{1 + exp(-ax)}$
- Properties of the sigmoid function
- Easy differentiation
- In model parameter estimation, it is necessary to find the points that minimize and maximize criteria such as MSE and likelihood → It is advantageous to be able to easily differentiate.
$ \frac{\partial \sigma(x)}{\partial x} = \frac{\partial}{\partial x}\biggl(\frac{1}{1 + \exp(-ax)}\biggr) = (-1) \cdot \{1 + \exp(-ax)\}^{-2} \cdot \exp(-ax) \cdot (-a) = \frac{a\,\exp(-ax)}{\{1 + \exp(-ax)\}^2} = \frac{a}{1 + \exp(-ax)} \cdot \frac{1 + \exp(-ax) - 1}{1 + \exp(-ax)} = a\,\sigma(x)\,(1 - \sigma(x)) $
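
A minimal sketch (NumPy only) that checks the derivative above numerically with a central difference, for a = 1:

```python
import numpy as np

def sigmoid(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))

x, eps = 0.7, 1e-6
numerical = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
analytic = sigmoid(x) * (1.0 - sigmoid(x))                     # a * sigma(x) * (1 - sigma(x)), a = 1
print(numerical, analytic)   # the two values agree closely
```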
- Logistic regression formulation
$ P(Y = 1|x) = \sigma(w_0 + w_1x_1 + ... + w_mx_m) $
(Probability that Y = 1 given the explanatory variable x) = (sigmoid applied to the linear combination of the data and the parameters)
- Bernoulli distribution
- One of various distribution models such as normal distribution
- A discrete probability distribution (takes the value 1 with probability p and 0 with probability 1 - p)
Probability of $y = y_1$ in a single trial
$P(y)=p^y(1-p)^{1-y}$
- Maximum likelihood estimation (what is the most likely P?)
- Simultaneous probability
Each random variable is independent → Multiplication of probabilities
**Probability that $y_1, \dots, y_n$ occur simultaneously in n trials**
$ P(y_1, y_2, \dots, y_n; p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} $ ($p$ and $y_i$ are known)
- Likelihood = the probability of drawing a specific sample $x_k$ from the target population $\mu$
- The joint probability, viewed as a function of the parameter, is the likelihood function
- Choosing the parameter that maximizes the likelihood function is called maximum likelihood estimation
Likelihood function when the data $y_1, \dots, y_n$ have been obtained
$ P(y_1, y_2, \dots, y_n; p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} $ ($p$ is unknown, $y_i$ are known)
- Log-likelihood function
- Addition and subtraction are easier to compute than repeatedly multiplying complicated values
→ Take the logarithm to convert the product into a sum
- The likelihood function is maximized at the same coefficients w as before taking the logarithm (the logarithmic function is monotonically increasing)
→ This optimal w is called the maximum likelihood estimator (a local or global optimal solution)
- Multiply the log-likelihood by minus one and minimize it, which aligns it with least squares minimization
→ Gradient descent method
$E(w_0, w_1, \dots, w_m) = -\log L(w_0, w_1, \dots, w_m) = -\sum_{i=1}^{n}\{y_i \log p_i + (1 - y_i) \log(1 - p_i)\}$
- Cross entropy (negative log-likelihood function)
- The parameters are updated in proportion to the error in the predicted probability ($y_i - p_i$)
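
A minimal sketch (NumPy only; `y` and `p` are made-up values) of computing the cross entropy, i.e. the negative log-likelihood $E(w)$ above, from labels and predicted probabilities:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])            # observed labels (made up)
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # predicted probabilities P(y=1|x) (made up)

cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(cross_entropy)   # smaller when p matches y well
```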
- Gradient descent
- An approach in which the parameters are updated sequentially (by computer) through iterative learning
- $\eta$ is a hyperparameter called the learning rate that adjusts how easily the model parameters converge
- The derivatives with respect to the parameters can be obtained analytically, as with MSE
$w^{(k+1)} = w^{(k)} + \eta \sum_{i=1}^{n}(y_i - p_i)x_i$
In the gradient descent method, the sum over all n data points is needed for each parameter update.
→ This leads to problems such as insufficient memory and huge computation time.
- Stochastic Gradient Descent (SGD)
- Randomly (probabilistically) select data one by one and update parameters
- Since the parameters can be updated n times with the same amount of computation that gradient descent needs for a single update, the optimal solution can be searched for efficiently
- Parameter updates are somewhat irregular (compared to gradient descent), making it harder to get stuck in a local optimum near the initial value (the logistic regression objective has only a single peak, so this matters little here)
$w^{(k+1)} = w^{(k)} + \eta (y_i - p_i)x_i$
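
A minimal sketch (NumPy only, synthetic data) of stochastic gradient descent for logistic regression using the update rule above; `eta` is the learning rate hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # bias column + features
w_true = np.array([-0.5, 2.0, -1.0])                        # assumed true parameters
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(m + 1)
eta = 0.1                                    # learning rate (hyperparameter)
for epoch in range(50):
    for i in rng.permutation(n):             # pick data points one at a time, at random
        p_i = 1 / (1 + np.exp(-X[i] @ w))    # predicted probability for sample i
        w += eta * (y[i] - p_i) * X[i]       # SGD update: w <- w + eta (y_i - p_i) x_i
print("estimated w:", np.round(w, 2))
```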
- Model evaluation
- Confusion matrix
- Validation data label : model prediction
- TP (true positive) = ○:○, correctly predicted positive
- FP (false positive) = ×:○, wrongly predicted positive (a normal case may be judged abnormal)
- FN (false negative) = ○:×, wrongly predicted negative (an abnormal case may be judged normal)
- TN (true negative) = ×:×, correctly predicted negative
- Accuracy (correct answer rate)
$\frac {TP+TN}{TP+FN+FP+TN}$
- Recall
- Percentage of "really positive things" that can be predicted as positive
- Example: Disease screening (oversight is NG)
$\frac{TP}{TP+FN}$
- Precision
- Percentage of the cases the model predicts as Positive that are "really Positive"
- Example: spam email filtering (overlooking some spam is acceptable)
$\frac{TP}{TP+FP}$
- F-measure: harmonic mean of Recall and Precision
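
A minimal sketch (NumPy only; the label vectors are made up) of computing the confusion-matrix counts and the metrics above:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # validation labels (made up)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model predictions (made up)

tp = np.sum((y_true == 1) & (y_pred == 1))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tn = np.sum((y_true == 0) & (y_pred == 0))

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)          # of the truly positive cases, how many were found
precision = tp / (tp + fp)       # of the predicted positives, how many were correct
f_measure = 2 * recall * precision / (recall + precision)
print(accuracy, recall, precision, f_measure)
```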
**Principal component analysis (PCA)**
- Dimensional compression
- Compress the structure of multivariate data into a smaller number of indicators
- I want to reduce information loss as much as possible
- Analysis and visualization are possible with a small number of variables (about 2 or 3 dimensions)
- Learning data
$ x = (x_1, x_2, \cdots, x_m)^T \in ℝ^m $
- Average (vector)
$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $
- Data matrix
$ \bar{X} = (x_1 - \bar{x}, \dots, x_n - \bar{x})^T $ (center the data so that it is distributed around the origin)
- Covariance matrix
$ \Sigma = Var(\bar{X}) = \frac{1}{n} \bar{X}^T \bar{X}$
- Vector after linear transformation
$ s_j = (s_{1j}, ... , s_{nj})^T = \bar{X}a_j$
$a_j \in ℝ^m $
- (The n data points are transformed into $s_j$ by a common coefficient vector $a_j$, which plays the same role as the unknown parameter w in regression)
- (j is the index of the destination axis, e.g. when compressing 3-D data to 1 or 2 dimensions)
- Information loss can be suppressed by maximizing the variance of the converted data $ s_j $
- Search for a mapping axis that maximizes the variance of variables after linear transformation
- Variance after linear transformation
$Var(s_j) = \frac1n s^T_j s_j \\ = \frac1n (\bar{X}a_j)^T (\bar{X}a_j) \\ = \frac1n a^T_j \bar{X}^T \bar{X} a_j \\ = a^T_j Var(\bar{X}) a_j $
- Constraint optimization problem
- Insert a constraint that the norm is 1 (without the constraint, there are infinite solutions)
- Objective function
$ \arg\max_{a_j \in ℝ^m} \; a^T_j Var(\bar{X})\, a_j $ (the variance after the transformation)
- Constraints
$ a^T_j a_j =1 $
- Lagrange function
$ E(a_j) = a^T_j Var(\bar{X}) a_j - \lambda(a^T_j a_j - 1)$
Lagrange function = objective function - (Lagrange multiplier) × (constraint)
(find the point that maximizes it with respect to a)
- Differentiation
$ \frac{\partial E(a_j)}{\partial a_j} = 2\, Var(\bar{X})\, a_j - 2 \lambda a_j = 0 $
- Solution
- This has the form of an eigenvalue problem (eigenvalue and eigenvector)
$ Var(\bar{X})\, a_j = \lambda a_j $
- The variance of the projection destination matches the eigenvalue
$Var(s_1) = a_1^T Var(\bar{X}) a_1 = \lambda_1 a_1^T a_1 = \lambda_1 $
- Contribution rate
- How much information was lost as a result of dimensional compression
- Sum of eigenvalues matches the variance of the data in the original space
$ V_{total} = \sum^{m}_{i=1} \lambda_i $
Total variance of the original data = Total variance of the principal components
- Contribution rate: The ratio of the variance of a certain axis (kth principal component) to the total variance
- Cumulative contribution rate: the proportion of the total variance (information) retained when the data are compressed to principal components 1 through k
$c_k = \frac{\lambda_k}{\sum_{i=1}^{m}\lambda_i} $
= (Variance of the kth principal component) / (Total variance of the principal component)
$c_k = \frac { \sum_{j=1}^{k} \lambda_j}{\sum_{i=1}^{m}\lambda_i} \\ $
= (Variance of the 1st through kth principal components) / (Total variance of the principal components)
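
A minimal sketch (NumPy only, synthetic data) of PCA by eigendecomposition of the covariance matrix of the centered data, including the contribution rates:

```python
import numpy as np

rng = np.random.default_rng(5)
# correlated 3-D data (synthetic, for illustration)
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

X_bar = X - X.mean(axis=0)                  # center the data around the origin
cov = X_bar.T @ X_bar / len(X_bar)          # Var(X_bar)
eigval, eigvec = np.linalg.eigh(cov)        # eigh: symmetric matrix, ascending eigenvalues
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # reorder to descending

S = X_bar @ eigvec[:, :2]                   # scores on the first two principal axes
contribution = eigval / eigval.sum()        # contribution rate of each component
print("contribution rates:", np.round(contribution, 3))
print("cumulative (first 2):", np.round(contribution[:2].sum(), 3))
```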
**k-means clustering**
- Set initial values for the cluster centers
- For each data point, compute the distance to each cluster center and assign it to the closest cluster
- Compute the mean vector (center) of each cluster
- Repeat the previous two steps until convergence
  - If the initial values of the centers are changed, the clustering result also changes; the result also changes if the value of K is changed
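
A minimal sketch (NumPy only, synthetic two-cluster data) of the steps listed above:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initial cluster centers
    for _ in range(n_iter):
        # distance from every point to every center, then nearest-cluster assignment
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recompute each cluster center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers)   # results depend on the initial centers and on the chosen K
```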