Linear Models Let \hat{y} be the predicted value, the vector w = (w_1, w_2, ..., w_p) be the coefficients (coef_), and w_0 be the intercept (intercept_).
\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p
Ordinary Least Squares Find the coefficients w that minimize the residual sum of squares below. The L2 norm here is the ordinary Euclidean distance.
\min_{w} || X w - y||_2^2
sklearn
class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)
Implementation
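A minimal sketch of fitting OLS with scikit-learn; the toy data below is made up for illustration. The fitted attributes coef_ and intercept_ correspond to w and w_0 in the formula above.

```python
# Minimal OLS example on illustrative toy data (roughly y = 1 + 2x).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # single feature
y = np.array([1.0, 3.1, 4.9, 7.2])

reg = LinearRegression(fit_intercept=True)
reg.fit(X, y)

print(reg.coef_)            # w_1, ..., w_p
print(reg.intercept_)       # w_0
print(reg.predict([[4.0]])) # \hat{y} for a new sample
```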
Ridge Regression A regularization term, the squared L2 norm of the coefficients, is added to the loss function. This shrinks the absolute values of the coefficients, which helps prevent overfitting.
\min_{w} || X w - y||_2^2 + \alpha ||w||_2^2
sklearn
class sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
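A minimal sketch of Ridge with a hand-picked alpha; the data is illustrative only.

```python
# Larger alpha -> stronger shrinkage of the coefficients.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.1], [3.0, 2.9]])
y = np.array([0.0, 2.0, 4.1, 6.0])

reg = Ridge(alpha=1.0)
reg.fit(X, y)
print(reg.coef_, reg.intercept_)
```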
Lasso Regression A regularization term, the L1 norm (Manhattan distance) of the coefficients, is added to the loss function. Because some coefficients are driven to exactly 0, it can also reduce the dimensionality of the features.
\min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}
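A minimal sketch of Lasso on made-up data where only the first feature is informative; with a suitable alpha, the coefficient of the other feature can be shrunk to exactly 0.

```python
import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.5]])
y = np.array([0.1, 1.9, 4.2, 5.8])   # mostly driven by the first feature

reg = Lasso(alpha=0.5)
reg.fit(X, y)
print(reg.coef_)   # some entries may be exactly 0
```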
Multi-task Lasso
Elastic-Net A regularization term combining both the L1 norm and the L2 norm is added. It reduces to Ridge regression when ρ = 0 and to Lasso regression when ρ = 1.
\min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 + \frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2}
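A minimal sketch of ElasticNet on illustrative data; l1_ratio corresponds to ρ in the formula above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 2.0], [3.0, 1.5]])
y = np.array([0.1, 1.9, 4.2, 5.8])

reg = ElasticNet(alpha=0.5, l1_ratio=0.7)   # l1_ratio=1 is Lasso, l1_ratio=0 is Ridge
reg.fit(X, y)
print(reg.coef_, reg.intercept_)
```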
Multi-task Elastic-Net
Least Angle Regression (LARS)
Orthogonal Matching Pursuit (OMP) The fit is constrained by the number of non-zero coefficients (the L0 pseudo-norm), which serves as the stopping condition.
\underset{w}{\operatorname{arg\,min\,}} ||y - Xw||_2^2 \text{ subject to } ||w||_0 \leq n_{\text{nonzero\_coefs}}
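A minimal sketch of OMP on synthetic data with a 2-sparse true coefficient vector; fitting stops once n_nonzero_coefs coefficients have been selected.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])   # only 2 non-zero coefficients
y = X @ w_true

reg = OrthogonalMatchingPursuit(n_nonzero_coefs=2)
reg.fit(X, y)
print(reg.coef_)   # at most 2 entries are non-zero
```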
Bayesian Regression
p(y|X,w,\alpha) = \mathcal{N}(y|X w,\alpha)
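A minimal sketch using BayesianRidge, which estimates the noise precision α from the data instead of treating it as fixed; the data is illustrative only.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])

reg = BayesianRidge()
reg.fit(X, y)
print(reg.coef_, reg.intercept_)
print(reg.alpha_)   # estimated noise precision
```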
Logistic Regression Despite the name, this is a classification model: a statistical regression model for variables that follow a Bernoulli distribution, using the logit as the link function.
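A minimal sketch of binary classification with LogisticRegression; the data is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[2.0]]))        # predicted class
print(clf.predict_proba([[2.0]]))  # class probabilities via the logistic (sigmoid) function
```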
Recommendation Predicts user preferences for items such as movies, music, search results, and shopping. Collaborative Filtering makes predictions based on the preferences of similar users, while Content-based Filtering makes predictions based on items the user has liked in the past. A sketch of the collaborative approach follows.
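A minimal sketch of user-based collaborative filtering with cosine similarity; the ratings matrix and the scoring rule are illustrative assumptions, not a library API.

```python
import numpy as np

# rows = users, columns = items, 0 = not rated (made-up ratings)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

target = 0                                   # predict for user 0
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
sims[target] = 0.0                           # exclude the user themselves

# score each item by a similarity-weighted average of the other users' ratings
scores = sims @ ratings / (sims.sum() + 1e-12)
print(scores)   # higher score -> more likely to match user 0's preferences
```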
Q-learning A learning algorithm for the state-action value Q(s, a), where s is the state, a is the action, and r is the reward. In the equations below, α is the learning rate and γ is the discount rate. Q(s_t, a_t) is updated step by step at rate α, using the maximum Q value over actions in the next state s_{t+1}, discounted by γ.
Q(s_t, a_t) \leftarrow (1-\alpha)Q(s_t, a_t) + \alpha(r_{t+1} + \gamma \max_{a_{t+1}}Q(s_{t+1}, a_{t+1}))\\
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha(r_{t+1} + \gamma \max_{a_{t+1}}Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
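A minimal sketch of the Q-learning update above on a hypothetical tabular environment; the states, actions, and rewards are placeholders.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9     # learning rate and discount rate

def q_learning_update(s, a, r, s_next):
    # move Q(s, a) toward r + gamma * max_a' Q(s', a')
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_learning_update(s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])
```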
Sarsa
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha(r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
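A minimal sketch of the Sarsa update; unlike Q-learning, it uses the action a_{t+1} actually taken in the next state rather than the maximum over actions. The values are placeholders.

```python
import numpy as np

Q = np.zeros((5, 2))
alpha, gamma = 0.1, 0.9

def sarsa_update(s, a, r, s_next, a_next):
    # move Q(s, a) toward r + gamma * Q(s', a') for the action a' that was actually taken
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

sarsa_update(s=0, a=1, r=1.0, s_next=2, a_next=0)
print(Q[0, 1])
```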
Monte Carlo Method
Returns(s, a) \leftarrow append(Returns(s, a), r)\\
Q(s, a) \leftarrow average(Returns(s, a))
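A minimal sketch of the update above: store the observed return for (s, a) and re-estimate Q(s, a) as the average of all stored returns. The state, action, and return values are placeholders.

```python
from collections import defaultdict

returns = defaultdict(list)   # Returns(s, a)
Q = defaultdict(float)

def mc_update(s, a, r):
    returns[(s, a)].append(r)                               # append the observed return
    Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)]) # average of all returns so far

mc_update(s=0, a=1, r=1.0)
mc_update(s=0, a=1, r=0.0)
print(Q[(0, 1)])   # 0.5
```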