[PYTHON] What is Reduced Rank Ridge Regression?

Reference link

In a nutshell! :bow_tone1:

"** Reduced rank regression " and " Ridge regression **" are combined to match ** explanatory variable co-linearity ** and ** objective variable low-dimensional structure ** at the same time. Linear model regression method.

What is Reduced Rank Regression?

General linear model

As a problem setting, consider predicting (regressing) the objective variable $Y_{N \times Q}$ from the explanatory variable $X_{N \times P}$.

A commonly used model here is the **general linear model** $Y = XB + E$, where $B_{P \times Q}$ is the coefficient matrix and $E_{N \times Q}$ is the random error matrix.

The ordinary least squares (OLS) solution is $\hat{B}_{OLS} = (X^T X)^{-1} X^T Y$.
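To make the formula concrete, here is a minimal NumPy sketch on randomly generated toy data (the sizes N, P, Q and the noise level are arbitrary choices for illustration, not from the reference):

```python
import numpy as np

# Toy data: N samples, P explanatory variables, Q objective variables (illustrative only)
rng = np.random.default_rng(0)
N, P, Q = 100, 5, 3
X = rng.normal(size=(N, P))
B_true = rng.normal(size=(P, Q))
Y = X @ B_true + 0.1 * rng.normal(size=(N, Q))

# OLS solution: solve (X^T X) B = X^T Y rather than forming the inverse explicitly
B_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(B_ols.shape)  # (5, 3)
```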

Here we focus on **two problems** that arise when the general linear model is fitted by least squares.

- Performance degrades when the true dimension of the objective variable $Y_{N \times Q}$ is less than $Q$ (**rank deficiency of $Y$**).
- Performance degrades when the columns of the explanatory variable $X_{N \times P}$ are highly correlated (**collinearity of $X$**).

Dimensionality reduction

For the first problem, dimensionality reduction is useful when high-dimensional data contains a latent low-dimensional structure. One of the most famous techniques is **principal component analysis (PCA)**. A model that performs regression after dimensionality reduction by PCA is called **principal component regression (PCR)**.

Reference: Principal component analysis that makes sense, [Understanding principal component analysis with Python](https://qiita.com/maskot1977/items/082557fcda78c4cdb41f)

Models such as principal component regression, which extract a small number of latent factors from the explanatory variables and regress the objective variable on them, are called **linear factor models**. Other examples include independent component regression, partial least squares regression, and canonical correlation analysis.
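As a quick illustration of principal component regression, the following sketch chains PCA and a linear regression with scikit-learn, reusing the toy X and Y generated above (the number of components is an arbitrary choice for the example):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Principal component regression: reduce X to a few components, then regress Y on them
pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, Y)                  # toy X, Y from the sketch above
Y_pred_pcr = pcr.predict(X)
```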

Another dimensionality-reduction approach is **reduced rank regression**. By minimizing the error function under a constraint on the rank of the coefficient matrix $B$, reduced rank regression performs the regression while assuming a latent low-dimensional structure.

Reference: Reduced rank regression
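For intuition, here is a minimal sketch of plain reduced rank regression (no regularization yet): compute the OLS fit and project its coefficients onto the top-$r$ right singular vectors of the fitted values. This is a generic textbook construction, not code from the reference; the toy X, Y and the rank r are illustrative assumptions.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Minimal sketch: least squares subject to rank(B) <= r."""
    B_ols = np.linalg.solve(X.T @ X, X.T @ Y)      # unconstrained OLS coefficients
    Y_fit = X @ B_ols                              # OLS fitted values
    _, _, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    P_r = Vt[:r].T @ Vt[:r]                        # projection onto the top-r right singular vectors
    return B_ols @ P_r                             # coefficient matrix of rank at most r

B_rrr = reduced_rank_regression(X, Y, r=2)         # toy X, Y from the sketch above
print(np.linalg.matrix_rank(B_rrr))                # rank is at most r (here 2)
```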

Regularization

For the second problem, the collinearity of the explanatory variables $X$, **regularization** is commonly applied. Well-known methods include **ridge regression** and **LASSO regression**. LASSO is mainly used for estimating sparse solutions and also serves as a feature-selection method. Ridge regression is widely used to deal with the ill-posed problem caused by collinearity of the explanatory variables.
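For reference, the closed-form ridge solution $(X^T X + \lambda I)^{-1} X^T Y$, which reappears in the derivation below, can be written as a small helper (the value of $\lambda$ here is arbitrary, and the toy X, Y are reused from the earlier sketch):

```python
import numpy as np

def ridge_regression(X, Y, lam):
    """Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

B_ridge = ridge_regression(X, Y, lam=1.0)
```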

Extension to Reduced Rank Ridge Regression

Introduce two constraints into the error function

Based on the above, to deal with both problems we add the following two constraints to the squared error.

  1. The regularization term of ridge regression
  2. A rank constraint on $\hat{B}$

The estimator $\hat{B}(\lambda, r)$ is then obtained by minimizing the error as follows.

\underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert Y-XB \Vert_F^2 + \lambda \Vert B \Vert_F^2

Here $r \leq \min \lbrace P, Q \rbrace$, and $\Vert \cdot \Vert_F$ denotes the Frobenius norm (a matrix norm).

To recast this in the framework of reduced rank regression, define the augmented matrices

X_{(N+P)\times P}^* = 
\left(  \begin{array}{c}    X \\    \sqrt{\lambda}I  \end{array}\right), \ 
Y_{(N+P)\times Q}^* = 
\left(  \begin{array}{c}    Y \\    0  \end{array}\right)

By doing so, the error function is expressed as follows.

\underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert Y^*-X^*B \Vert_F^2
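This equivalence is easy to check numerically: without the rank constraint, ordinary least squares on the augmented data $(X^*, Y^*)$ reproduces the ridge solution. A small sketch, reusing the toy X, Y and the ridge_regression helper defined above (λ is arbitrary):

```python
import numpy as np

lam = 1.0
n, p = X.shape

# Augmented data: stack sqrt(lam) * I below X and a zero block below Y
X_star = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_star = np.vstack([Y, np.zeros((p, Y.shape[1]))])

# Unconstrained least squares on (X_star, Y_star) equals ridge regression on (X, Y)
B_star, *_ = np.linalg.lstsq(X_star, Y_star, rcond=None)
print(np.allclose(B_star, ridge_regression(X, Y, lam)))   # True
```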

Also, writing the ridge regression fit as $\hat{Y}_R^* = X^* \hat{B}_R^*$, orthogonality gives

\Vert Y^*-X^*B \Vert_F^2 = \Vert Y^*-\hat{Y}_R^* \Vert_F^2 + \Vert \hat{Y}_R^*-X^*B \Vert_F^2

Since the first term does not depend on $B$, the error function can be expressed as:

\underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert \hat{Y}_R^*-X^*B \Vert_F^2

Deriving the solution as a linear problem

Now suppose the singular value decomposition of $\hat{Y}_R^*$ is given as follows:

\hat{Y}_R^* = \sum_{i=1}^{\tau} \sigma_i u_i v_i^T

where $\sigma_i$ are the singular values, $u_i, v_i$ are the left and right singular vectors, and $\tau$ is the rank of $\hat{Y}_R^*$ (which equals $Q$ thanks to the regularization).

This can be viewed through the Eckart–Young theorem in the Frobenius norm ([Reference](https://ja.wikipedia.org/wiki/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90)), and the best rank-$r$ approximation is:

\hat{Y}_r^* = \sum_{i=1}^{r} \sigma_i u_i v_i^T

Let's use this to represent the optimal coefficient matrix $\hat{B}(\lambda, r)$.

\hat{Y}_r^* = \sum_{i=1}^{r} \sigma_i u_i v_i^T = \left(\sum_{i=1}^{\tau} \sigma_i u_i v_i^T\right) \left(\sum_{j=1}^{r} v_j v_j^T\right) = \hat{Y}_R^* P_r = X^* \hat{B}_R^* P_r = X^* \hat{B}(\lambda,r)

where $P_r = \sum_{j=1}^{r} v_j v_j^T$ is the projection onto the span of the first $r$ right singular vectors. Thus we obtain $\hat{B}(\lambda, r) = \hat{B}_R^* P_r$ as the solution.

Furthermore, using the closed-form ridge regression solution, it can be expressed as follows.

\hat{B}(\lambda,r) = \hat{B}_R^* P_r = (X^T X + \lambda I)^{-1}X^T Y P_r \\
\hat{Y}(\lambda,r) = X \hat{B}(\lambda,r) = X (X^T X + \lambda I)^{-1}X^T Y P_r = \hat{Y}_{\lambda} P_r

Here, $\hat{Y}_{\lambda}$ is the ridge regression fit with regularization parameter $\lambda$, so the estimator can be interpreted as "**the ridge regression fit projected onto an $r$-dimensional subspace**". In particular, for $r = Q$ the plain ridge regression solution is recovered.
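Putting the pieces together, the closed form $\hat{B}(\lambda, r) = (X^T X + \lambda I)^{-1} X^T Y P_r$ can be sketched directly in NumPy. This is a minimal implementation of the formula above, not the RRRR package; the toy X, Y and the values of λ and r are illustrative:

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, r):
    """Sketch of reduced rank ridge regression:
    B(lam, r) = (X^T X + lam * I)^{-1} X^T Y P_r, where P_r projects onto
    the top-r right singular vectors of the augmented ridge fit Y_R^*."""
    p = X.shape[1]
    # Ridge coefficients B_R
    B_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    # Augmented ridge fit Y_R^* = X^* B_R
    X_star = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    Y_fit_star = X_star @ B_ridge
    # Projection P_r = sum_{j<=r} v_j v_j^T from the SVD of Y_R^*
    _, _, Vt = np.linalg.svd(Y_fit_star, full_matrices=False)
    P_r = Vt[:r].T @ Vt[:r]
    return B_ridge @ P_r

B_hat = reduced_rank_ridge(X, Y, lam=1.0, r=2)     # toy X, Y from the sketches above
Y_hat = X @ B_hat                                  # fitted values of rank at most r
```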

Parameter tuning

Tuning the parameters $\lambda$ and $r$ determines the performance of reduced rank ridge regression. In the reference paper, K-fold cross-validation is used to find the optimal parameters.
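A sketch of such tuning with scikit-learn's KFold, using the reduced_rank_ridge helper defined above (the grids over λ and r are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_mse(X, Y, lam, r, n_splits=5):
    """Mean squared prediction error of reduced rank ridge regression under K-fold CV."""
    errors = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        B = reduced_rank_ridge(X[train_idx], Y[train_idx], lam, r)
        residual = Y[test_idx] - X[test_idx] @ B
        errors.append(np.mean(residual ** 2))
    return np.mean(errors)

# Grid search over (lambda, r); the grids are illustrative
lams = [0.01, 0.1, 1.0, 10.0]
ranks = range(1, Y.shape[1] + 1)
best_mse, best_lam, best_r = min((cv_mse(X, Y, lam, r), lam, r) for lam in lams for r in ranks)
print("best:", best_mse, best_lam, best_r)
```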

Try the package

I tried running the demo of the RRRR package from the reference link.

Data

Let's use the demo data provided.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import PredefinedSplit
import reduced_rank_regressor as RRR

import matplotlib.pyplot as plt
%matplotlib inline

N_PARAMETERS_GRID_SEARCH = 20 
Data_path = "data/"

# Load Data
trainX = np.loadtxt(Data_path+"trainX.txt")
testX = np.loadtxt(Data_path+"testX.txt")
validX = np.loadtxt(Data_path+"validX.txt")

trainY = np.loadtxt(Data_path+"trainY.txt")
testY = np.loadtxt(Data_path+"testY.txt")
validY = np.loadtxt(Data_path+"validY.txt")

# Inspection of Data
f,ax = plt.subplots(1,2,figsize=(10,10))
ax[0].imshow(np.corrcoef(trainX), cmap='jet')
ax[0].set_title('trainX')
ax[1].imshow(np.corrcoef(trainY), cmap='jet')
ax[1].set_title('trainY')
plt.show()

From the correlation matrix, we can see that the objective variable $ Y $ has a low-dimensional structure.

fig1.png

Cross-validation

We run a cross-validated grid search over the two hyperparameters, the rank constraint and the regularization strength, to find their optimal values.

# Cross-validation setup. Define search spaces
rank_grid = np.linspace(1,min(min(trainX.shape),min(trainY.shape)), num=N_PARAMETERS_GRID_SEARCH)
rank_grid = rank_grid.astype(int)
reg_grid = np.power(10,np.linspace(-20,20, num=N_PARAMETERS_GRID_SEARCH+1))
parameters_grid_search  = {'reg':reg_grid, 'rank':rank_grid}

valid_test_fold = np.concatenate((np.zeros((trainX.shape[0],))-1,np.zeros((validX.shape[0],))))
ps_for_valid = PredefinedSplit(test_fold=valid_test_fold)

# Model initialisation
rrr = RRR.ReducedRankRegressor()  # rank and reg are set by the grid search below
grid_search = GridSearchCV(rrr, parameters_grid_search, cv=ps_for_valid,
                           scoring='neg_mean_squared_error')

print("fitting...")
grid_search.fit(np.concatenate((trainX,validX)), np.concatenate((trainY,validY)))

# Display the best combination of values found
print(grid_search.best_params_)
means = grid_search.cv_results_['mean_test_score']
means = np.array(means).reshape(N_PARAMETERS_GRID_SEARCH, N_PARAMETERS_GRID_SEARCH+1)
print(grid_search.best_score_)

[output]

fitting...
{'rank': 316, 'reg': 1e-20}
-75126.47521541138

Let's visualize the cross-validation results. You can see that the score is best around the optimal rank (around 316).

# Show CV results
scores = [x for x in grid_search.cv_results_['rank_test_score']]
scores = np.array(scores).reshape(N_PARAMETERS_GRID_SEARCH, N_PARAMETERS_GRID_SEARCH+1)

f,ax = plt.subplots(1,1,figsize=(10,10))
cbar = ax.imshow(scores, cmap='jet')
ax.set_title('Test score Ranking')
ax.set_xlabel('Regularization')
ax.set_ylabel('Rank')
ax.set_xticks(list(range(len(reg_grid))))
ax.set_yticks(list(range(len(rank_grid))))
ax.set_xticklabels([str("%0.*e"%(0,x)) for x in reg_grid],rotation=30)
ax.set_yticklabels([str(x) for x in rank_grid])
f.colorbar(cbar)
plt.show()

fig2.png

Inference

Inference is made based on the optimal hyperparameters obtained by cross-validation.

# Train a model with the best set of hyper-parameters found
rrr.rank                = int(grid_search.best_params_['rank'])
rrr.reg                 = grid_search.best_params_['reg']
rrr.fit(trainX, trainY)


# Testing
Yhat = rrr.predict(testX).real
MSE = (np.power((testY - Yhat),2)/np.prod(testY.shape)).mean()
print("MSE =",MSE)

f,ax = plt.subplots(1,2,figsize=(10,10))
ax[0].imshow(np.corrcoef(testY), cmap='jet')
ax[0].set_title('testY')
ax[1].imshow(np.corrcoef(Yhat), cmap='jet')
ax[1].set_title('Yhat')
plt.show()

[output]

MSE = 0.1508996355795968

fig3.png

The MSE (mean squared error) is small, and $\hat{Y}$ appears to be predicted well.

Summary

Reduced rank ridge regression fits low-dimensional objective variables well and has the potential to be widely applicable to real-world data structures. Also, since the package is implemented to follow the scikit-learn interface, it seems easy to use.
