"** Reduced rank regression " and " Ridge regression **" are combined to match ** explanatory variable co-linearity ** and ** objective variable low-dimensional structure ** at the same time. Linear model regression method.
As a problem setting, consider predicting (regressing) the objective variables $Y_{N \times Q}$ from the explanatory variables $X_{N \times P}$.
The model often used here is the **general linear model** (multivariate linear regression), expressed by the following equation.
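Y = XB + E

Here, $B_{P \times Q}$ is the coefficient matrix and $E_{N \times Q}$ is the error matrix.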
The solution by ordinary least squares (OLS) is as follows.
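\hat{B}_{OLS} = (X^T X)^{-1} X^T Y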
This time, we focus on **two problems** that arise when this linear model is fitted by least squares.
- Performance drops when the true dimension of the objective variables $Y_{N \times Q}$ is less than $Q$ (**rank deficiency of $Y$**).
- Performance drops when the explanatory variables in $X_{N \times P}$ are highly correlated (**collinearity of $X$**).
Thus, dimensionality reduction is useful when high-dimensional data has a latent low-dimensional structure. One of the most famous techniques is **principal component analysis (PCA)**. A model that performs regression after dimensionality reduction by PCA is called **principal component regression (PCR)**.
Reference: Principal component analysis that makes sense, [Understanding principal component analysis with Python](https://qiita.com/maskot1977/items/082557fcda78c4cdb41f)
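As a small illustration of PCR, here is a minimal scikit-learn sketch; the toy data, number of components, and variable names are purely illustrative and unrelated to the demo later in this article.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Toy data: Y depends on X only through a low-dimensional structure
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
Y = X @ rng.normal(size=(10, 3))

# PCR: reduce X to a few principal components, then regress Y on them
pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pcr.fit(X, Y)
print(pcr.predict(X[:5]).shape)  # (5, 3)
```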
A model that regresses the objective variables on a small number of factors extracted from the explanatory variables, such as principal component regression, is called a **linear factor model**. Other examples include independent component regression, partial least squares regression, and canonical correlation analysis.
Another dimensionality reduction technique is **reduced rank regression**. Reduced rank regression captures a latent low-dimensional structure by minimizing the error function while restricting the rank of the coefficient matrix $B$, as written out below.
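In the notation used throughout this article, the rank-constrained least squares objective is:

\hat{B}_{RRR}(r) = \underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert Y-XB \Vert_F^2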
Reference: Reduced rank regression
For the second problem, the **collinearity of the explanatory variables $X$**, **regularization** is commonly used. Well-known methods include **ridge regression** and **LASSO regression**. LASSO regression is mainly used to estimate sparse solutions and also serves as a feature selection method, while ridge regression is widely used to deal with the ill-posedness caused by collinearity of the explanatory variables.
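For reference, the multivariate ridge regression objective and its closed-form solution, which reappears later, are:

\hat{B}_R(\lambda) = \underset{B}{argmin} \left( \Vert Y-XB \Vert_F^2 + \lambda \Vert B \Vert_F^2 \right) = (X^T X + \lambda I)^{-1}X^T Y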
Based on the above, in order to deal with both problems, we combine the squared error with two ingredients: a rank constraint on the coefficient matrix $B$ (for the low-dimensional structure of $Y$) and a ridge penalty (for the collinearity of $X$).
The estimate $\hat{B}(\lambda, r)$ is then obtained by minimizing the following error.
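\hat{B}(\lambda,r) = \underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \left( \Vert Y-XB \Vert_F^2 + \lambda \Vert B \Vert_F^2 \right)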
Here, $r \leq \min \lbrace P, Q \rbrace$, and $\Vert \cdot \Vert_F^2$ denotes the squared Frobenius norm (a matrix norm).
To recast this in the framework of reduced rank regression, define
X_{(N+P)\times P}^* =
\left( \begin{array}{c} X \\ \sqrt{\lambda}I \end{array}\right), \
Y_{(N+P)\times Q}^* =
\left( \begin{array}{c} Y \\ 0 \end{array}\right)
With these definitions, the error function can be rewritten as follows.
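\Vert Y-XB \Vert_F^2 + \lambda \Vert B \Vert_F^2 = \Vert Y^*-X^*B \Vert_F^2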
Also, writing the ridge regression fit as $\hat{Y}_R^* = X^* \hat{B}_R^*$, which is the orthogonal projection of $Y^*$ onto the column space of $X^*$ (and noting that $X^* B$ also lies in that column space), orthogonality gives
\Vert Y^*-X^*B \Vert_F^2 = \Vert Y^*-\hat{Y}_R^* \Vert_F^2 + \Vert \hat{Y}_R^*-X^*B \Vert_F^2
Since the first term does not depend on $B$, the minimization problem reduces to:
\underset{ \lbrace B:rank(B) \leq r \rbrace}{argmin} \Vert \hat{Y}_R^*-X^*B \Vert_F^2
Now suppose the singular value decomposition of $\hat{Y}_R^*$ is given as follows:
\hat{Y}_R^* = \sum_{i=1}^{\tau} \sigma_i u_i v_i^T
Here, $\sigma_i$ are the singular values, $u_i, v_i$ are the left and right singular vectors, and $\tau$ is the rank of $\hat{Y}_R^*$ (which, thanks to the regularization, equals $Q$).
By the Eckart-Young theorem for the Frobenius norm ([reference](https://ja.wikipedia.org/wiki/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90)), the best rank-$r$ approximation is:
\hat{Y}_r^* = \sum_{i=1}^{r} \sigma_i u_i v_i^T
Let's use this to express the optimal coefficient matrix $\hat{B}(\lambda, r)$.
\hat{Y}_r^* = \sum_{i=1}^{r} \sigma_i u_i v_i^T = \left(\sum_{i=1}^{\tau} \sigma_i u_i v_i^T\right) \left(\sum_{j=1}^{r} v_j v_j^T\right) = \hat{Y}_R^* P_r = X^* \hat{B}_R^* P_r = X^* \hat{B}(\lambda,r)

Here, $P_r = \sum_{j=1}^{r} v_j v_j^T$ is the projection onto the span of the first $r$ right singular vectors. We therefore obtain $\hat{B}(\lambda,r) = \hat{B}_R^* P_r$ as the solution.
Furthermore, using the closed-form solution of ridge regression, it can be expressed as follows.
\hat{B}(\lambda,r) = \hat{B}_R^* P_r = (X^T X + \lambda I)^{-1}X^T Y P_r \\
\hat{Y}(\lambda,r) = X \hat{B}(\lambda,r) = X (X^T X + \lambda I)^{-1}X^T Y P_r = \hat{Y}_{\lambda} P_r
Here, $\hat{Y}_{\lambda}$ is the ridge regression fit for regularization parameter $\lambda$, so the optimal solution can be interpreted as **the projection of the ridge regression fit onto an $r$-dimensional subspace**. In particular, for $r = Q$ the optimal solution is simply the ordinary ridge regression solution.
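To make the closed form concrete, here is a minimal NumPy sketch of $\hat{B}(\lambda, r) = \hat{B}_R^* P_r$ on synthetic data; the function name, toy data, and shapes are illustrative, and this is not the implementation used in the demo below.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, r):
    """Closed-form reduced rank ridge estimate: B_hat(lam, r) = B_ridge @ P_r."""
    P = X.shape[1]
    # Ridge solution B_R* = (X^T X + lam I)^{-1} X^T Y
    B_ridge = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)
    # Augmented design X* = [X; sqrt(lam) I], so that Y_R* = X* B_R*
    X_star = np.vstack([X, np.sqrt(lam) * np.eye(P)])
    # Right singular vectors of the augmented ridge fit Y_R*
    _, _, Vt = np.linalg.svd(X_star @ B_ridge, full_matrices=False)
    # P_r projects onto the span of the first r right singular vectors
    P_r = Vt[:r].T @ Vt[:r]
    return B_ridge @ P_r

# Example: Y has a rank-2 structure, recovered with r = 2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
B_true = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 5))  # rank-2 coefficients
Y = X @ B_true + 0.1 * rng.normal(size=(100, 5))
B_hat = reduced_rank_ridge(X, Y, lam=1.0, r=2)
print(np.linalg.matrix_rank(B_hat))  # rank is at most r = 2
```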
The performance of reduced rank ridge regression is determined by tuning the parameters $\lambda$ and $r$. In the reference paper, K-fold cross-validation was performed to determine the optimal parameters.
Let's run the demo from the RRRR package given in the reference link.
Let's use the demo data provided.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import PredefinedSplit
import reduced_rank_regressor as RRR
import matplotlib.pyplot as plt
%matplotlib inline
N_PARAMETERS_GRID_SEARCH = 20
Data_path = "data/"
# Load Data
trainX = np.loadtxt(Data_path+"trainX.txt")
testX = np.loadtxt(Data_path+"testX.txt")
validX = np.loadtxt(Data_path+"validX.txt")
trainY = np.loadtxt(Data_path+"trainY.txt")
testY = np.loadtxt(Data_path+"testY.txt")
validY = np.loadtxt(Data_path+"validY.txt")
# Inspection of Data
f,ax = plt.subplots(1,2,figsize=(10,10))
ax[0].imshow(np.corrcoef(trainX), cmap='jet')
ax[0].set_title('trainX')
ax[1].imshow(np.corrcoef(trainY), cmap='jet')
ax[1].set_title('trainY')
plt.show()
From the correlation matrices, we can see that the objective variable $Y$ has a low-dimensional structure.
We optimize the hyperparameters, namely the rank constraint and the regularization strength, by cross-validation.
# Cross-validation setup. Define search spaces
rank_grid = np.linspace(1,min(min(trainX.shape),min(trainY.shape)), num=N_PARAMETERS_GRID_SEARCH)
rank_grid = rank_grid.astype(int)
reg_grid = np.power(10,np.linspace(-20,20, num=N_PARAMETERS_GRID_SEARCH+1))
parameters_grid_search = {'reg':reg_grid, 'rank':rank_grid}
valid_test_fold = np.concatenate((np.zeros((trainX.shape[0],))-1,np.zeros((validX.shape[0],))))
ps_for_valid = PredefinedSplit(test_fold=valid_test_fold)
# Model initialisation
rrr = RRR.ReducedRankRegressor()#rank, reg)
grid_search = GridSearchCV(rrr, parameters_grid_search, cv=ps_for_valid,
scoring='neg_mean_squared_error')
print("fitting...")
grid_search.fit(np.concatenate((trainX,validX)), np.concatenate((trainY,validY)))
# Display the best combination of values found
print(grid_search.best_params_)
means = grid_search.cv_results_['mean_test_score']
means = np.array(means).reshape(N_PARAMETERS_GRID_SEARCH, N_PARAMETERS_GRID_SEARCH+1)
print(grid_search.best_score_)
[output]
fitting...
{'rank': 316, 'reg': 1e-20}
-75126.47521541138
Let's visualize the cross-validation results. You can see that the test score is best around the optimal rank (around 316).
# Show CV results
scores = [x for x in grid_search.cv_results_['rank_test_score']]
scores = np.array(scores).reshape(N_PARAMETERS_GRID_SEARCH, N_PARAMETERS_GRID_SEARCH+1)
f,ax = plt.subplots(1,1,figsize=(10,10))
cbar = ax.imshow(scores, cmap='jet')
ax.set_title('Test score Ranking')
ax.set_xlabel('Regularization')
ax.set_ylabel('Rank')
ax.set_xticks(list(range(len(reg_grid))))
ax.set_yticks(list(range(len(rank_grid))))
ax.set_xticklabels([str("%0.*e"%(0,x)) for x in reg_grid],rotation=30)
ax.set_yticklabels([str(x) for x in rank_grid])
f.colorbar(cbar)
plt.show()
Finally, we retrain the model with the optimal hyperparameters obtained by cross-validation and make predictions on the test data.
# Train a model with the best set of hyper-parameters found
rrr.rank = int(grid_search.best_params_['rank'])
rrr.reg = grid_search.best_params_['reg']
rrr.fit(trainX, trainY)
# Testing
Yhat = rrr.predict(testX).real
MSE = (np.power((testY - Yhat),2)/np.prod(testY.shape)).mean()
print("MSE =",MSE)
f,ax = plt.subplots(1,2,figsize=(10,10))
ax[0].imshow(np.corrcoef(testY), cmap='jet')
ax[0].set_title('testY')
ax[1].imshow(np.corrcoef(Yhat), cmap='jet')
ax[1].set_title('Yhat')
plt.show()
[output]
MSE = 0.1508996355795968
The MSE (mean squared error) is small, and $\hat{Y}$ appears to be predicted well.
Reduced rank ridge regression fits low-dimensional objective variables well and has the potential to be widely applicable to real-world data structures. Moreover, the package follows the scikit-learn interface, so it seems easy to use.