[PYTHON] What is Reduced Rank Ridge Regression?

Reference link

In a nutshell! :bow_tone1:

"** Reduced rank regression " and " Ridge regression **" are combined to match ** explanatory variable co-linearity ** and ** objective variable low-dimensional structure ** at the same time. Linear model regression method.

What is Reduced Rank Regression?

General linear model

As a problem setting, consider predicting (regressing) the objective variable $Y_{N \times Q}$ from the explanatory variable $X_{N \times P}$.

A commonly used model here is the **general linear model** $Y = XB + E$, where $B_{P \times Q}$ is the coefficient matrix and $E_{N \times Q}$ is the random error matrix.

The ordinary least squares (OLS) solution is $\hat{B}_{OLS} = (X^T X)^{-1} X^T Y$.
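To make the formula concrete, here is a minimal NumPy sketch on randomly generated toy data (the sizes N, P, Q and the noise level are arbitrary choices for illustration, not from the reference):

```python
import numpy as np

# Toy data: N samples, P explanatory variables, Q objective variables (illustrative only)
rng = np.random.default_rng(0)
N, P, Q = 100, 5, 3
X = rng.normal(size=(N, P))
B_true = rng.normal(size=(P, Q))
Y = X @ B_true + 0.1 * rng.normal(size=(N, Q))

# OLS solution: solve (X^T X) B = X^T Y rather than forming the inverse explicitly
B_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(B_ols.shape)  # (5, 3)
```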

Here we focus on **two problems** that arise when the general linear model is fitted by least squares.

- Performance degrades when the true dimension of the objective variable $Y_{N \times Q}$ is less than $Q$ (**rank deficiency of $Y$**).
- Performance degrades when the columns of the explanatory variable $X_{N \times P}$ are highly correlated (**collinearity of $X$**).

Dimensionality reduction

For the first problem, dimensionality reduction is useful when high-dimensional data contains a latent low-dimensional structure. One of the most famous techniques is **principal component analysis (PCA)**. A model that performs regression after dimensionality reduction by PCA is called **principal component regression (PCR)**.

Reference: Principal component analysis that makes sense, [Understanding principal component analysis with Python](https://qiita.com/maskot1977/items/082557fcda78c4cdb41f)

Models such as principal component regression, which extract a small number of latent factors from the explanatory variables and regress the objective variable on them, are called **linear factor models**. Other examples include independent component regression, partial least squares regression, and canonical correlation analysis.
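As a quick illustration of principal component regression, the following sketch chains PCA and a linear regression with scikit-learn, reusing the toy X and Y generated above (the number of components is an arbitrary choice for the example):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Principal component regression: reduce X to a few components, then regress Y on them
pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, Y)                  # toy X, Y from the sketch above
Y_pred_pcr = pcr.predict(X)
```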

Another dimensionality-reduction approach is **reduced rank regression**. By minimizing the error function under a constraint on the rank of the coefficient matrix $B$, reduced rank regression performs the regression while assuming a latent low-dimensional structure.

Reference: Reduced rank regression
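For intuition, here is a minimal sketch of plain reduced rank regression (no regularization yet): compute the OLS fit and project its coefficients onto the top-$r$ right singular vectors of the fitted values. This is a generic textbook construction, not code from the reference; the toy X, Y and the rank r are illustrative assumptions.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Minimal sketch: least squares subject to rank(B) <= r."""
    B_ols = np.linalg.solve(X.T @ X, X.T @ Y)      # unconstrained OLS coefficients
    Y_fit = X @ B_ols                              # OLS fitted values
    _, _, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    P_r = Vt[:r].T @ Vt[:r]                        # projection onto the top-r right singular vectors
    return B_ols @ P_r                             # coefficient matrix of rank at most r

B_rrr = reduced_rank_regression(X, Y, r=2)         # toy X, Y from the sketch above
print(np.linalg.matrix_rank(B_rrr))                # rank is at most r (here 2)
```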

Regularization

For the second problem, the collinearity of the explanatory variables $X$, **regularization** is commonly applied. Well-known methods include **ridge regression** and **LASSO regression**. LASSO is mainly used for estimating sparse solutions and also serves as a feature-selection method. Ridge regression is widely used to deal with the ill-posed problem caused by collinearity of the explanatory variables.
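For reference, the closed-form ridge solution $(X^T X + \lambda I)^{-1} X^T Y$, which reappears in the derivation below, can be written as a small helper (the value of $\lambda$ here is arbitrary, and the toy X, Y are reused from the earlier sketch):

```python
import numpy as np

def ridge_regression(X, Y, lam):
    """Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

B_ridge = ridge_regression(X, Y, lam=1.0)
```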

Extension to Reduced Rank Ridge Regression

Introduce two constraints into the error function

Based on the above, to deal with both problems we add the following two constraints to the squared error.

  1. The regularization term of ridge regression
  2. A rank constraint on $\hat{B}$

The estimator $\hat{B}(\lambda, r)$ is then obtained by minimizing the error as follows.

\underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert Y-XB \Vert_F^2 + \lambda \Vert B \Vert_F^2

Here $r \leq \min \lbrace P, Q \rbrace$, and $\Vert \cdot \Vert_F$ denotes the Frobenius norm (a matrix norm).

To recast this in the framework of reduced rank regression, define the augmented matrices

X_{(N+P)\times P}^* = 
\left(  \begin{array}{c}    X \\    \sqrt{\lambda}I  \end{array}\right), \ 
Y_{(N+P)\times Q}^* = 
\left(  \begin{array}{c}    Y \\    0  \end{array}\right)

By doing so, the error function is expressed as follows.

\underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert Y^*-X^*B \Vert_F^2
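This equivalence is easy to check numerically: without the rank constraint, ordinary least squares on the augmented data $(X^*, Y^*)$ reproduces the ridge solution. A small sketch, reusing the toy X, Y and the ridge_regression helper defined above (λ is arbitrary):

```python
import numpy as np

lam = 1.0
n, p = X.shape

# Augmented data: stack sqrt(lam) * I below X and a zero block below Y
X_star = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_star = np.vstack([Y, np.zeros((p, Y.shape[1]))])

# Unconstrained least squares on (X_star, Y_star) equals ridge regression on (X, Y)
B_star, *_ = np.linalg.lstsq(X_star, Y_star, rcond=None)
print(np.allclose(B_star, ridge_regression(X, Y, lam)))   # True
```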

Also, writing the ridge regression fit as $\hat{Y}_R^* = X^* \hat{B}_R^*$, orthogonality gives

\Vert Y^*-X^*B \Vert_F^2 = \Vert Y^*-\hat{Y}_R^* \Vert_F^2 + \Vert \hat{Y}_R^*-X^*B \Vert_F^2

Since the first term does not depend on $B$, the error function can be expressed as:

\underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert \hat{Y}_R^*-X^*B \Vert_F^2

Deriving the solution as a linear problem

Now suppose the singular value decomposition of $\hat{Y}_R^*$ is given as follows:

\hat{Y}_R^* = \sum_{i=1}^{\tau} \sigma_i u_i v_i^T

where $\sigma_i$ are the singular values, $u_i, v_i$ are the left and right singular vectors, and $\tau$ is the rank of $\hat{Y}_R^*$ (which equals $Q$ thanks to the regularization).

This can be viewed through the Eckart–Young theorem in the Frobenius norm ([Reference](https://ja.wikipedia.org/wiki/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90)), and the best rank-$r$ approximation is:

\hat{Y}_r^* = \sum_{i=1}^{r} \sigma_i u_i v_i^T

Let's use this to represent the optimal coefficient matrix $\hat{B}(\lambda, r)$.

\hat{Y}_r^* = \sum_{i=1}^{r} \sigma_i u_i v_i^T = \left(\sum_{i=1}^{\tau} \sigma_i u_i v_i^T\right) \left(\sum_{j=1}^{r} v_j v_j^T\right) = \hat{Y}_R^* P_r = X^* \hat{B}_R^* P_r = X^* \hat{B}(\lambda,r)

where $P_r = \sum_{j=1}^{r} v_j v_j^T$ is the projection onto the span of the first $r$ right singular vectors. Thus we obtain $\hat{B}(\lambda, r) = \hat{B}_R^* P_r$ as the solution.

Furthermore, using the closed-form ridge regression solution, it can be expressed as follows.

\hat{B}(\lambda,r) = \hat{B}_R^* P_r = (X^T X + \lambda I)^{-1}X^T Y P_r \\
\hat{Y}(\lambda,r) = X \hat{B}(\lambda,r) = X (X^T X + \lambda I)^{-1}X^T Y P_r = \hat{Y}_{\lambda} P_r

Here, $\hat{Y}_{\lambda}$ is the ridge regression fit with regularization parameter $\lambda$, so the estimator can be interpreted as "**the ridge regression fit projected onto an $r$-dimensional subspace**". In particular, for $r = Q$ the plain ridge regression solution is recovered.
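Putting the pieces together, the closed form $\hat{B}(\lambda, r) = (X^T X + \lambda I)^{-1} X^T Y P_r$ can be sketched directly in NumPy. This is a minimal implementation of the formula above, not the RRRR package; the toy X, Y and the values of λ and r are illustrative:

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, r):
    """Sketch of reduced rank ridge regression:
    B(lam, r) = (X^T X + lam * I)^{-1} X^T Y P_r, where P_r projects onto
    the top-r right singular vectors of the augmented ridge fit Y_R^*."""
    p = X.shape[1]
    # Ridge coefficients B_R
    B_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    # Augmented ridge fit Y_R^* = X^* B_R
    X_star = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    Y_fit_star = X_star @ B_ridge
    # Projection P_r = sum_{j<=r} v_j v_j^T from the SVD of Y_R^*
    _, _, Vt = np.linalg.svd(Y_fit_star, full_matrices=False)
    P_r = Vt[:r].T @ Vt[:r]
    return B_ridge @ P_r

B_hat = reduced_rank_ridge(X, Y, lam=1.0, r=2)     # toy X, Y from the sketches above
Y_hat = X @ B_hat                                  # fitted values of rank at most r
```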

Parameter tuning

Tuning the parameters $\lambda$ and $r$ determines the performance of reduced rank ridge regression. In the reference paper, K-fold cross-validation is used to find the optimal parameters.
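A sketch of such tuning with scikit-learn's KFold, using the reduced_rank_ridge helper defined above (the grids over λ and r are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_mse(X, Y, lam, r, n_splits=5):
    """Mean squared prediction error of reduced rank ridge regression under K-fold CV."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        B = reduced_rank_ridge(X[train_idx], Y[train_idx], lam, r)
        residual = Y[test_idx] - X[test_idx] @ B
        errors.append(np.mean(residual ** 2))
    return np.mean(errors)

# Grid search over (lambda, r); the grids are illustrative
lams = [0.01, 0.1, 1.0, 10.0]
ranks = range(1, Y.shape[1] + 1)
best_mse, best_lam, best_r = min((cv_mse(X, Y, lam, r), lam, r) for lam in lams for r in ranks)
print("best:", best_mse, best_lam, best_r)
```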

Try the package

I tried running the demo of the RRRR package from the reference link.

Data

Let's use the demo data provided.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import PredefinedSplit
import reduced_rank_regressor as RRR

import matplotlib.pyplot as plt
%matplotlib inline

N_PARAMETERS_GRID_SEARCH = 20 
Data_path = "data/"

# Load Data
trainX = np.loadtxt(Data_path+"trainX.txt")
testX = np.loadtxt(Data_path+"testX.txt")
validX = np.loadtxt(Data_path+"validX.txt")

trainY = np.loadtxt(Data_path+"trainY.txt")
testY = np.loadtxt(Data_path+"testY.txt")
validY = np.loadtxt(Data_path+"validY.txt")

# Inspection of Data
f,ax = plt.subplots(1,2,figsize=(10,10))
ax[0].imshow(np.corrcoef(trainX), cmap='jet')
ax[0].set_title('trainX')
ax[1].imshow(np.corrcoef(trainY), cmap='jet')
ax[1].set_title('trainY')
plt.show()

From the correlation matrix, we can see that the objective variable $ Y $ has a low-dimensional structure.

fig1.png

Cross-validation

We run a cross-validated grid search over the two hyperparameters, the rank constraint and the regularization strength, to find their optimal values.

# Cross-validation setup. Define search spaces
rank_grid = np.linspace(1,min(min(trainX.shape),min(trainY.shape)), num=N_PARAMETERS_GRID_SEARCH)
rank_grid = rank_grid.astype(int)
reg_grid = np.power(10,np.linspace(-20,20, num=N_PARAMETERS_GRID_SEARCH+1))
parameters_grid_search  = {'reg':reg_grid, 'rank':rank_grid}

valid_test_fold = np.concatenate((np.zeros((trainX.shape[0],))-1,np.zeros((validX.shape[0],))))
ps_for_valid = PredefinedSplit(test_fold=valid_test_fold)

# Model initialisation
rrr = RRR.ReducedRankRegressor()  # rank and reg are set by the grid search below
grid_search = GridSearchCV(rrr, parameters_grid_search, cv=ps_for_valid,
                           scoring='neg_mean_squared_error')

print("fitting...")
grid_search.fit(np.concatenate((trainX,validX)), np.concatenate((trainY,validY)))

# Display the best combination of values found
print(grid_search.best_params_)
means = grid_search.cv_results_['mean_test_score']
means = np.array(means).reshape(N_PARAMETERS_GRID_SEARCH, N_PARAMETERS_GRID_SEARCH+1)
print(grid_search.best_score_)

[output]

fitting...
{'rank': 316, 'reg': 1e-20}
-75126.47521541138

Let's visualize the cross-validation results. You can see that the score is best around the optimal rank (around 316).

# Show CV results
scores = [x for x in grid_search.cv_results_['rank_test_score']]
scores = np.array(scores).reshape(N_PARAMETERS_GRID_SEARCH, N_PARAMETERS_GRID_SEARCH+1)

f,ax = plt.subplots(1,1,figsize=(10,10))
cbar = ax.imshow(scores, cmap='jet')
ax.set_title('Test score Ranking')
ax.set_xlabel('Regularization')
ax.set_ylabel('Rank')
ax.set_xticks(list(range(len(reg_grid))))
ax.set_yticks(list(range(len(rank_grid))))
ax.set_xticklabels([str("%0.*e"%(0,x)) for x in reg_grid],rotation=30)
ax.set_yticklabels([str(x) for x in rank_grid])
f.colorbar(cbar)
plt.show()

fig2.png

Inference

Inference is made based on the optimal hyperparameters obtained by cross-validation.

# Train a model with the best set of hyper-parameters found
rrr.rank                = int(grid_search.best_params_['rank'])
rrr.reg                 = grid_search.best_params_['reg']
rrr.fit(trainX, trainY)


# Testing
Yhat = rrr.predict(testX).real
MSE = (np.power((testY - Yhat),2)/np.prod(testY.shape)).mean()
print("MSE =",MSE)

f,ax = plt.subplots(1,2,figsize=(10,10))
ax[0].imshow(np.corrcoef(testY), cmap='jet')
ax[0].set_title('testY')
ax[1].imshow(np.corrcoef(Yhat), cmap='jet')
ax[1].set_title('Yhat')
plt.show()

[output]

MSE = 0.1508996355795968

fig3.png

The MSE (mean squared error) is small, and $\hat{Y}$ appears to be predicted well.

Summary

Reduced rank ridge regression fits low-dimensional objective variables well and has the potential to be widely applicable to real-world data structures. Also, since the package is implemented to follow the scikit-learn interface, it seems easy to use.
