[PYTHON] Speed improvement by self-implementation of numpy.random.multivariate_normal

This is curshey from AI.RL.LYs.

When I implemented Thompson sampling on a logistic regression model with batch learning, processing became very slow. One reason is that the weights following the multivariate normal distribution were sampled with numpy.random.multivariate_normal; as reported in this article, that function becomes slower and less stable as the covariance matrix grows, which is thought to be the bottleneck.

So, in this article, I verify how much the processing speed improves when sampling from the multivariate normal distribution is self-implemented with Cholesky decomposition, compared with using numpy.random.multivariate_normal as-is.

I will not explain the theory and implementation of Thompson sampling on the logistic regression model in detail. For details, see pp. 122-125 of [Bandit Problem Theory and Algorithm](https://www.amazon.co.jp/dp/406152917X). Also, this article only covers the speed improvement and does not evaluate the performance of Thompson sampling in terms of cumulative reward or regret.

Theory of multivariate normal distribution generation

The following theorem is used to generate the multivariate normal distribution.

If $X \sim N(\mu, \Sigma)$ is a multivariate normal distribution with mean $\mu \in \mathbb{R}^n$ and $n \times n$ variance-covariance matrix $\Sigma$, then there exists $B \in \mathbb{R}^{n \times n}$ such that the standard normal vector $Z \sim N(0, I)$ is obtained from the following formula: $Z = B^{-1}(X - \mu)$

You can obtain the multivariate normal random vector $X$ by rearranging this formula as follows: $X = BZ + \mu$

Here $B$ can be obtained by factorizing the variance-covariance matrix $\Sigma$. There are various factorization methods, such as singular value decomposition and Cholesky decomposition. numpy.random.multivariate_normal uses singular value decomposition, and this step becomes heavy as the dimension of the covariance matrix grows. Cholesky decomposition is relatively faster than singular value decomposition, so I chose it for the self-implementation. Note, however, that both have $O(n^3)$ computational complexity, so you should still be careful about the number of dimensions.
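
As a minimal sketch of the theorem above (the 3-dimensional covariance matrix below is an arbitrary example chosen only for illustration), the transformation $X = BZ + \mu$ with a Cholesky factor $B$ can be checked like this:

```python
import numpy as np

# A minimal sketch: factorize Sigma, transform standard normal draws Z,
# and check that the sample covariance is close to Sigma.
n = 3
mu = np.zeros(n)
sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])

B = np.linalg.cholesky(sigma)              # Sigma = B @ B.T
z = np.random.standard_normal((n, 10000))  # Z ~ N(0, I), 10000 draws
x = B @ z + mu[:, None]                    # X = B Z + mu

print(np.cov(x))  # should be close to sigma
```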

Actual implementation

Before actually implementing it, note the difference between singular value decomposition and Cholesky decomposition: sampling based on singular value decomposition only requires the matrix to be positive semi-definite, while Cholesky decomposition requires it to be positive definite. A variance-covariance matrix is always positive semi-definite by definition, but it is not guaranteed to be positive definite.
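
For example (using a toy rank-deficient matrix purely for illustration), Cholesky decomposition raises an error on a covariance matrix that is positive semi-definite but not positive definite, while singular value decomposition still goes through:

```python
import numpy as np

# Positive semi-definite but singular (rank 1), so not positive definite.
cov = np.array([[1.0, 1.0],
                [1.0, 1.0]])

try:
    np.linalg.cholesky(cov)
except np.linalg.LinAlgError:
    # Cholesky decomposition fails here, but SVD can still factorize the matrix.
    u, s, vh = np.linalg.svd(cov)
    print("fell back to SVD, singular values:", s)
```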

Also, Thompson sampling on the logistic regression model is implemented here with batch learning. The variance-covariance matrix is updated by MAP estimation, but since this update happens in batches, there is no need to factorize the matrix every time a weight is sampled. It is enough to decompose the covariance matrix once after each update and reuse the result, so the code is split into a function that performs the matrix factorization and a function that samples from the multivariate normal distribution. The implementation looks like this:

```python
from typing import Tuple, Union

import numpy as np


def get_decomposition_matrix(cov: np.ndarray) -> Tuple[Union[np.ndarray, Tuple], str]:
    # Try the cheaper Cholesky decomposition first and fall back to SVD
    # when the covariance matrix is not positive definite.
    try:
        return np.linalg.cholesky(cov), "cholesky"
    except np.linalg.LinAlgError:
        return np.linalg.svd(cov), "SVD"


def sample_multivariate_normal(mean: np.ndarray,
                               decomposition_matrix: Union[np.ndarray, Tuple],
                               decomposition: str) -> np.ndarray:
    if decomposition == "cholesky":
        # X = B Z + mu with the lower-triangular Cholesky factor B.
        standard_normal_vector = np.random.standard_normal(len(decomposition_matrix))
        return decomposition_matrix @ standard_normal_vector + mean
    elif decomposition == "SVD":
        # X = U sqrt(S) V^H Z + mu, built from the SVD of the covariance matrix.
        u, s, vh = decomposition_matrix
        standard_normal_vector = np.random.standard_normal(len(u))
        return u @ np.diag(np.sqrt(s)) @ vh @ standard_normal_vector + mean
```
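
For example, the factorization can be done once and the result reused across samples (the diagonal covariance below is just a toy setting):

```python
import numpy as np

dim = 30
mean = np.zeros(dim)
cov = np.eye(dim) * 0.1  # diagonal covariance, a toy setting

# Factorize once ...
decomposition_matrix, decomposition = get_decomposition_matrix(cov)

# ... then reuse the factor for as many samples as needed until cov is updated.
samples = [sample_multivariate_normal(mean, decomposition_matrix, decomposition)
           for _ in range(5)]
print(samples[0].shape)  # (30,)
```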

Thompson sampling pseudocode

Below is the pseudocode for Thompson sampling on the logistic regression model, both when using numpy.random.multivariate_normal and when using the self-implemented code.

First is the pseudocode when using numpy.random.multivariate_normal.

Parameters: mean vector $\mu$, covariance matrix $\Sigma$

  1. for $t = 1, 2, \cdots, T$ do
  2. $~~~~$ Generate a random vector $\hat{\theta}$ with numpy.random.multivariate_normal given $\mu, \Sigma$
  3. $~~~~$ Select action $i \leftarrow \mathrm{argmax}_{i} \, a_{i,t}^T \hat{\theta}$ and observe reward $X(t)$
  4. $~~~~$ if $t$ has reached the point at which $\mu, \Sigma$ are periodically updated then
  5. $~~~~~~~~$ Perform MAP estimation from the features of the arms selected since the last batch update and the observed rewards, and update $\mu, \Sigma$
  6. $~~~~$ end if
  7. end for

The following is the pseudocode when using the self-implemented code (a Python sketch of this loop follows the pseudocode).

Parameters: mean vector $\mu$, variance-covariance matrix $\Sigma$

  1. decomposition_matrix, decomposition = get_decomposition_matrix($\Sigma$)
  2. for $t = 1, 2, \cdots, T$ do
  3. $~~~~$ Random vector $\hat{\theta}$ = sample_multivariate_normal($\mu$, decomposition_matrix, decomposition)
  4. $~~~~$ Select action $i \leftarrow \mathrm{argmax}_{i} \, a_{i,t}^T \hat{\theta}$ and observe reward $X(t)$
  5. $~~~~$ if $t$ has reached the point at which $\mu, \Sigma$ are periodically updated then
  6. $~~~~~~~~$ Perform MAP estimation from the features of the arms selected since the last batch update and the observed rewards, and update $\mu, \Sigma$
  7. $~~~~~~~~$ decomposition_matrix, decomposition = get_decomposition_matrix($\Sigma$)
  8. $~~~~$ end if
  9. end for
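
As a rough Python sketch of this loop (map_estimate, features, and observe_reward are hypothetical placeholders for the MAP update, the arm feature matrix $a_{i,t}$, and the environment; they are not part of the original implementation):

```python
import numpy as np

def thompson_sampling_batch(mean, cov, features, observe_reward, map_estimate,
                            total_steps=8000, batch_size=1000):
    # Factorize the initial covariance matrix once.
    decomposition_matrix, decomposition = get_decomposition_matrix(cov)
    history = []
    for t in range(1, total_steps + 1):
        # Sample weights from N(mean, cov) by reusing the stored factor.
        theta_hat = sample_multivariate_normal(mean, decomposition_matrix, decomposition)
        # Choose the arm with the highest score and observe its reward.
        arm = int(np.argmax(features @ theta_hat))
        reward = observe_reward(arm)
        history.append((arm, reward))
        if t % batch_size == 0:
            # Batch update: MAP estimation of mean and covariance, then
            # re-factorize the updated covariance matrix.
            mean, cov = map_estimate(history)
            decomposition_matrix, decomposition = get_decomposition_matrix(cov)
    return mean, cov
```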

Experiment 1

The verification of Thompson sampling was run on AWS SageMaker with the following instance type.

- Notebook instance type: ml.t2.medium

The settings used for the verification are as follows.

- Number of arms: 1000
- Number of feature dimensions: 30
- Batch size: 1000
- Initial covariance matrix: diagonal only, all values 0.1
- Initial mean vector: all 0

In this environment, the processing speed was measured as the average over 8 trials, each run for 8000 time steps. The experimental results are shown in the figure below.

(Figure: Experiment 1 results)

The left bar uses numpy.random.multivariate_normal and the right bar uses the self-implementation. The self-implementation is roughly twice as fast.

Experiment 2

As in Experiment 1, the verification was run on AWS SageMaker with the following instance type.

- Notebook instance type: ml.t2.medium

This time, the verification is done with a larger number of feature dimensions. The settings are as follows.

- Number of arms: 1000
- Number of feature dimensions: 300
- Batch size: 1000
- Initial covariance matrix: diagonal only, all values 0.1
- Initial mean vector: all 0

The number of feature dimensions was increased from 30 to 300. Again, the processing speed was measured as the average over 8 trials, each run for 8000 time steps. The experimental results are shown in the figure below.

(Figure: Experiment 2 results)

As in Experiment 1, the left bar uses numpy.random.multivariate_normal and the right bar uses the self-implementation. In this case, the self-implementation is about 10 times faster than numpy.random.multivariate_normal.
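
The timings above come from the full Thompson sampling runs, but the sampling cost by itself can be compared with a rough micro-benchmark like the following, reusing the two functions defined earlier (this is an illustrative script, not the exact code used for the experiments):

```python
import time

import numpy as np

dim, n_samples = 300, 8000
mean = np.zeros(dim)
cov = np.eye(dim) * 0.1

# numpy.random.multivariate_normal: a factorization is performed on every call.
start = time.time()
for _ in range(n_samples):
    np.random.multivariate_normal(mean, cov)
print("multivariate_normal:", time.time() - start)

# Self-implementation: factorize once, reuse the factor for every sample.
start = time.time()
decomposition_matrix, decomposition = get_decomposition_matrix(cov)
for _ in range(n_samples):
    sample_multivariate_normal(mean, decomposition_matrix, decomposition)
print("self-implementation:", time.time() - start)
```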

Summary of verification results

This speed improvement comes first from the fact that the self-implementation using Cholesky decomposition is faster than numpy.random.multivariate_normal, and the effect becomes especially clear as the number of feature dimensions grows. In addition, the matrix factorization is performed only when the variance-covariance matrix is updated. numpy.random.multivariate_normal performs a singular value decomposition of the covariance matrix every time it generates a sample, even though no new factorization is needed unless the covariance matrix has been updated. The self-implementation separates the factorization from the sampling, factorizes only after the covariance matrix is updated, and reuses the decomposed matrix, which removes the unnecessary work.

To summarize, the self-implementation is thought to be faster because:

  1. Cholesky decomposition is faster than singular value decomposition
  2. Matrix factorization is performed only when the covariance matrix is updated

References

- [Bandit Problem Theory and Algorithm](https://www.amazon.co.jp/dp/406152917X)
- numpy.random.multivariate_normal was strangely slow, so I implemented it in a different way and it was more than 10 times faster
