Overview
Chapter 5 deals with the case where the observed data contain noise. Because of the noise, the equality constraint is relaxed so that the error term only has to be at most $\epsilon$:
$$(P_{0}^{\epsilon}): \min_{\mathbf{x}} \|\mathbf{x}\|_{0} \ \text{subject to} \ \|\mathbf{b}-\mathbf{Ax}\|_{2} \leq \epsilon$$
- In Orthogonal Matching Pursuit (OMP), this case is handled simply by using $\epsilon_{0} = \epsilon$ as the stopping threshold for the iteration.
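For concreteness, here is a minimal NumPy sketch of OMP with this residual-based stopping rule (not the code used in the experiments below; the function name and the least-squares update are illustrative):

```python
import numpy as np

def omp(A, b, eps):
    """Greedy OMP: stop once the residual norm drops to eps or below."""
    m, n = A.shape
    support, residual = [], b.copy()
    x = np.zeros(n)
    while np.linalg.norm(residual) > eps and len(support) < m:
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # least-squares fit restricted to the current support
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        x = np.zeros(n)
        x[support] = coef
        residual = b - A @ x
    return x
```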
Relaxing the $\ell_{0}$ norm to the $\ell_{1}$ norm yields a variant of $(P_{1})$ known as basis pursuit denoising (BPDN):
$$(P_{1}^{\epsilon}): \min_{\mathbf{x}} \|\mathbf{x}\|_{1} \ \text{subject to} \ \|\mathbf{b}-\mathbf{Ax}\|_{2} \leq \epsilon$$
For a suitable Lagrange multiplier $\lambda$, the solution of $(P_{1}^{\epsilon})$ coincides with the solution of the following unconstrained optimization problem:
$$(Q_{1}^{\lambda}): \min_{\mathbf{x}} \lambda \|\mathbf{x}\|_{1} + \frac{1}{2} \|\mathbf{b}-\mathbf{Ax}\|_{2}^{2}$$
A simple way to solve $(Q_{1}^{\lambda})$ is the iterative-reweighted-least-squares (IRLS) method.
Let $\mathbf{X}=\mathrm{diag}(|\mathbf{x}|)$. Then $\|\mathbf{x}\|_{1} = \mathbf{x}^{\mathrm{T}}\mathbf{X}^{-1}\mathbf{x}$, so the $\ell_{1}$ norm can be regarded as a (weighted) squared $\ell_{2}$ norm. Given the current approximate solution $\mathbf{x}_{k-1}$, set $\mathbf{X}_{k-1}=\mathrm{diag}(|\mathbf{x}_{k-1}|)$ and solve the following problem:
$$(M_{k}): \min_{\mathbf{x}} \lambda \mathbf{x}^{\mathrm{T}} \mathbf{X}_{k-1}^{-1} \mathbf{x} + \frac{1}{2} \|\mathbf{b} - \mathbf{Ax}\|_{2}^{2}$$
The IRLS algorithm therefore alternates between using the current weights $\mathbf{X}_{k-1}$ to compute $\mathbf{x}_{k}$ and using $\mathbf{x}_{k}$ to update $\mathbf{X}_{k}$, thereby approximately solving $(Q_{1}^{\lambda})$. Chapter 5 mainly consists of an explanation of IRLS as an approximate solver for $(Q_{1}^{\lambda})$ and an outline of LARS (least angle regression stagewise).
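A quick numerical check of the reweighting identity above (a sketch; the clipping of near-zero entries is only to keep $\mathbf{X}$ invertible for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=5)
x[np.abs(x) < 1e-3] = 1.0          # keep X = diag(|x|) invertible
X_inv = np.diag(1.0 / np.abs(x))   # X^{-1}

print(np.sum(np.abs(x)))           # ||x||_1
print(x @ X_inv @ x)               # x^T X^{-1} x, the same value
```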
IRLS algorithm
Initialization
Set $k = 0$ and choose:
- (arbitrary) initial solution $\mathbf{x}_{0} = \mathbf{1}$
- initial weight matrix $\mathbf{X}_{0} = \mathbf{I}$
Main loop
Set $k \leftarrow k + 1$ and execute the following steps.
- Regularized least squares: solve the linear system $\left(2\lambda \mathbf{X}_{k-1}^{-1} + \mathbf{A}^{\mathrm{T}}\mathbf{A}\right)\mathbf{x} = \mathbf{A}^{\mathrm{T}}\mathbf{b}$ with an iterative method to obtain the approximate solution $\mathbf{x}_{k}$. The textbook states that a few iterations of the conjugate gradient method are sufficient; here, scipy.optimize.minimize is used instead (see the sketch after this list).
- Weight update: using $\mathbf{x}_{k}$, update the diagonal weight matrix as $X_{k}(j,j) = |x_{k}(j)| + \epsilon$.
- Stopping condition: if $\|\mathbf{x}_{k}-\mathbf{x}_{k-1}\|_{2}$ is smaller than a pre-specified threshold, stop; otherwise repeat.
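Here is a minimal sketch of the IRLS loop described above. Unlike the actual experiment, which uses scipy.optimize.minimize, this sketch solves the regularized system directly with numpy.linalg.solve; eps and tol are illustrative values:

```python
import numpy as np

def irls(A, b, lam, n_iter=100, eps=1e-6, tol=1e-6):
    """Approximate (Q_1^lambda): min lam*||x||_1 + 0.5*||b - Ax||_2^2 via IRLS."""
    m, n = A.shape
    x = np.ones(n)                     # x_0 = 1
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(n_iter):
        X_inv = np.diag(1.0 / (np.abs(x) + eps))      # X_{k-1}^{-1}
        # regularized least squares: (2*lam*X^{-1} + A^T A) x = A^T b
        x_new = np.linalg.solve(2.0 * lam * X_inv + AtA, Atb)
        if np.linalg.norm(x_new - x) < tol:           # stopping condition
            x = x_new
            break
        x = x_new
    return x
```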
Experiment
Simulation
- $\mathbf{x}$: 200-dimensional vector
- $\mathbf{b}$: 100-dimensional vector
- $\mathbf{A}$: $100 \times 200$ matrix with entries drawn uniformly from $[-1, 1)$ and columns normalized
- The $k = 4$ nonzero entries of $\mathbf{x}$ are drawn i.i.d. from the uniform distribution on $[-2, -1] \cup [1, 2]$.
- Compute $\mathbf{b} = \mathbf{Ax} + \mathbf{e}$, where the additive noise $\mathbf{e}$ is drawn from a Gaussian distribution with standard deviation $\sigma = 0.1$.
One remaining issue is how to choose the value of $\lambda$. Here, following a rule of thumb, $\lambda$ is set to a value close to the ratio of $\sigma$ to the standard deviation of the nonzero entries. Since the standard deviation of the nonzero entries is about 2, a value near $\sigma / 2 = 0.05$ is used for $\lambda$.
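A sketch of the data generation and the rule-of-thumb $\lambda$ described above (the seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, sigma = 100, 200, 4, 0.1

# A: uniform [-1, 1) entries, columns normalized to unit length
A = rng.uniform(-1, 1, size=(m, n))
A /= np.linalg.norm(A, axis=0)

# x0: k nonzero entries drawn uniformly from [-2, -1] U [1, 2]
x0 = np.zeros(n)
idx = rng.choice(n, size=k, replace=False)
x0[idx] = rng.uniform(1, 2, size=k) * rng.choice([-1, 1], size=k)

# noisy observation and the rule-of-thumb lambda
b = A @ x0 + sigma * rng.standard_normal(m)
lam = sigma / 2                     # ~0.05, per the heuristic above
# x_hat = irls(A, b, lam)           # e.g. using the IRLS sketch above
```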
Results
- The solutions obtained while varying $\lambda$ were evaluated by the relative error $\frac{\|\hat{\mathbf{x}}_{\lambda} - \mathbf{x}_{0}\|_{2}}{\|\mathbf{x}_{0}\|_{2}}$. The number of IRLS iterations was 100.
The dashed line shows the value of $\left|\log\left(\|\mathbf{A}\hat{\mathbf{x}}_{\lambda}-\mathbf{b}\|_{2}^{2}/\left(n\sigma^{2}\right)\right)\right|$, indicating for which value of $\lambda$ the residual $\|\mathbf{A}\hat{\mathbf{x}}_{\lambda}-\mathbf{b}\|_{2}^{2}$ becomes comparable to the noise power $n\sigma^{2}$. This is another empirical rule for choosing $\lambda$ (a helper for this criterion is sketched below).
- Three solutions: one using the best $\lambda$ value, and two using a larger and a smaller value
Compared with the true solution $\mathbf{x}_{0}$ (red circles), the positions of the nonzero entries of $\mathbf{x}_{0}$ are recovered best when the optimal $\lambda$ is used. When $\lambda$ is small the solution is dense, and when it is large the solution is sparse.
- Solutions for all values of $\lambda$ displayed together in one graph
Each curve shows the value of one entry of $\mathbf{x}$ as a function of $\lambda$. The dashed lines are the values of $\mathbf{x}_{0}$.
- Objective value $\left(\lambda\|\mathbf{x}\|_{1}+\frac{1}{2}\|\mathbf{b}-\mathbf{Ax}\|_{2}^{2}\right)$ as a function of the number of IRLS iterations for the optimal $\lambda$
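Small helpers for the residual-based rule and the objective value mentioned above (a sketch; `x_hat` stands for any estimate, e.g. the IRLS output):

```python
import numpy as np

def bpdn_objective(A, b, x, lam):
    """(Q_1^lambda) objective: lam*||x||_1 + 0.5*||b - Ax||_2^2."""
    return lam * np.sum(np.abs(x)) + 0.5 * np.sum((b - A @ x) ** 2)

def discrepancy(A, b, x_hat, sigma):
    """|log(||A x_hat - b||_2^2 / (n sigma^2))|: close to 0 when the
    residual power matches the expected noise power n*sigma^2."""
    n = b.shape[0]
    return abs(np.log(np.sum((A @ x_hat - b) ** 2) / (n * sigma ** 2)))
```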
LARS algorithm
- I could not fully understand the LARS algorithm, so I only experimented with sklearn's LassoLars.
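For reference, a minimal usage sketch of sklearn's LassoLars on toy data shaped like the setup below (alpha is illustrative; note that scikit-learn minimizes $\frac{1}{2m}\|\mathbf{b}-\mathbf{Ax}\|_{2}^{2} + \alpha\|\mathbf{x}\|_{1}$, so alpha is not directly the $\lambda$ used above):

```python
import numpy as np
from sklearn.linear_model import LassoLars

# toy data matching the 30 x 50 setup below, with column-normalized A
rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(30, 50))
A /= np.linalg.norm(A, axis=0)
x0 = np.zeros(50)
x0[rng.choice(50, size=4, replace=False)] = 1.5
b = A @ x0 + 0.1 * rng.standard_normal(30)

# fit_intercept=False because the model b = Ax + e has no bias term
model = LassoLars(alpha=0.01, fit_intercept=False)
model.fit(A, b)
x_hat = model.coef_          # sparse estimate of x
```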
Experiment
Simulation
- $\mathbf{x}$: 50-dimensional vector
- $\mathbf{b}$: 30-dimensional vector
- $\mathbf{A}$: $30 \times 50$ matrix with entries drawn uniformly from $[-1, 1)$ and columns normalized
- Given the number of nonzero entries $k$ of $\mathbf{x}$, the nonzero values are drawn from the uniform distribution on $[-2, -1] \cup [1, 2]$.
- Compute $\mathbf{b} = \mathbf{Ax} + \mathbf{e}$, where the additive noise $\mathbf{e}$ is drawn from a Gaussian distribution with standard deviation $\sigma = 0.1$.
- Run 200 simulations for each $k = 1, \ldots, 15$
- Compare OMP and LARS
Metrics
- The $ L_2 $ norm of the estimated $ \ mathbf {\ hat {x}} $ and the true $ \ mathbf {x} $
- The distance between the support of the estimated $\hat{\mathbf{x}}$ and that of the true $\mathbf{x}$: $\mathrm{dist}(\hat{S},S)=\frac{\max\{|\hat{S}|, |S|\} - |\hat{S}\cap S|}{\max\{|\hat{S}|,|S|\}}$ (both metrics are sketched after this list)
- Take the average of 200 simulations.
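The two metrics can be implemented as follows (a sketch; the support is taken to be the indices of entries that are nonzero up to a small tolerance):

```python
import numpy as np

def l2_error(x_hat, x_true):
    """L2 distance between the estimate and the true vector."""
    return np.linalg.norm(x_hat - x_true)

def support_distance(x_hat, x_true, tol=1e-8):
    """dist(S_hat, S) = (max{|S_hat|,|S|} - |S_hat & S|) / max{|S_hat|,|S|}."""
    S_hat = set(np.flatnonzero(np.abs(x_hat) > tol))
    S = set(np.flatnonzero(np.abs(x_true) > tol))
    m = max(len(S_hat), len(S))
    return (m - len(S_hat & S)) / m if m > 0 else 0.0
```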
Results
Discussion
- When $k$ is small, OMP has the smaller error; when $k$ is large, LARS has the smaller error.
- Support recovery appears to be more accurate with OMP, but this reverses as $k$ increases.