[PYTHON] Sequential update of covariance to derivation and implementation of expressions

Trigger

A member of the same laboratory asked me if I would like to update the variance-covariance matrix sequentially, but is there any good way? I found this article . Here, we are deriving the formula for the sequential update of the mean and variance. Please refer to the relevant site for details on that part. Here, we derive an expression for the covariance that was not mentioned in detail on this site, and calculate the covariance matrix in Python3.

Derivation of covariance

Confirmation of letters and important expressions

Write down the characters and relational expressions to be used before deriving. It transforms quite a lot, so if you don't understand it while following the formula, please come back here and think about it.

data x=(x_1,x_2,\ ...\ ,x_n),\ y=(y_1,y_2,\ ...\ ,y_n)

Average of data \overline{x_n}=\frac{1}{n}\sum_{i=1}^nx_i\ ,\ \overline{y_n}=\frac{1}{n}\sum_{i=1}^ny_i

Covariance s_{xy}^{n}=\frac{1}{n}\sum_{i=1}^n{\left(x_i\ -\ \overline{x_n}\ \right)\left(y_i\ -\ \overline{y_n}\ \right)}

Recurrence formula for mean \overline{x_{n+1}}\ -\ \overline{x_{n}}\ =\ \frac{1}{n+1}\left(x_{n+1}\ -\ \overline{x_n}\right)\cdots\star

Derived the recurrence formula of covariance

Now the main subject! I'll play with the formula from here, but please be patient! Calculate $ M_ {n + 1} -M_n $ with $ M_n = ns_ {xy} ^ n $. If this is obtained, the recurrence formula of the covariance can be easily obtained.

\begin{align}
M_{n+1}-M_n &= \sum_{i=1}^{n+1}\left(x_i\ -\ \overline{x_{n+1}}\ \right)\left(y_i\ -\ \overline{y_{n+1}}\ \right) -\sum_{i=1}^n\left(x_i\ -\ \overline{x_n}\ \right)\left(y_i\ -\ \overline{y_n}\ \right)\\
            &= \sum_{i=1}^{n+1}\left(x_iy_i\ -\ x_i\overline{y_{n+1}}\ -\ \overline{x_{n+1}}y_i\ +\ \overline{x_{n+1}}\ \overline{y_{n+1}}\right)\\
            &\ -\sum_{i=1}^n\left(x_iy_i\ -\ x_i\overline{y_{n}}\ -\ \overline{x_{n}}y_i\ +\ \overline{x_{n}}\ \overline{y_{n}}\right)\\
            &=x_{n+1}y_{n+1}\ +\ (n+1)\overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\overline{x_n}\ \overline{y_n}\\
            &\ -\underline{\left(\overline{y_{n+1}}\sum_{i=1}^{n+1}x_i\ +\ \overline{x_{n+1}}\sum_{i=1}^{n+1}y_i\ -\ \overline{y_{n}}\sum_{i=1}^nx_i\ -\ \overline{x_{n}}\sum_{i=1}^ny_i\right)}\cdots\ast
\end{align}

By the way, is it all right so far? Let's calculate here with the underlined part as (1). For the sake of simplicity, we will introduce the following two characters. A_n=\sum_{i=1}^nx_i\ ,\ B_n=\sum_{i=1}^ny_i Let's get rid of (1)!

\begin{align}
(1) &=\overline{y_{n+1}}\ A_{n+1}\ +\ \overline{x_{n+1}}\ B_{n+1}\ -\ \overline{y_n}\ A_n\ -\ \overline{x_n}\ B_n\\
    &=2\left(\frac{1}{n+1}A_{n+1}B_{n+1}\ -\ \frac{1}{n}A_nB_n\right)\\
    &=2\left\{(n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\ \overline{x_n}\ \overline{y_n}\right\}
\end{align}

Yes, it's clean! Let's substitute this for the underlined part.

\begin{align}
\ast&=x_{n+1}y_{n+1}\ +\ (n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\ \overline{x_n}\ \overline{y_n}-2\left\{(n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\ \overline{x_n}\ \overline{y_n}\right\}\\
    &=x_{n+1}y_{n+1}\ -(n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ +\ n\ \overline{x_n}\ \overline{y_n}\\
    &=x_{n+1}y_{n+1}\ -(n+1)\underline{\left(\overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ \overline{x_n}\ \overline{y_n}\right)}\ -\ \overline{x_n}\ \overline{y_n}\cdots\ast\ast
\end{align}

The second underline. We will calculate here as (2).

\begin{align}
(2)&=\left(\overline{x_{n+1}}\ -\ \overline{x_n}\right)\left(\overline{y_{n+1}}\ -\ \overline{y_n}\right)+\overline{x_{n+1}}\ \overline{y_n}+\overline{x_{n}}\ \overline{y_{n+1}}\ -2\ \overline{x_{n}}\ \overline{y_n}\\
   &=\left(\overline{x_{n+1}}\ -\ \overline{x_n}\right)\left(\overline{y_{n+1}}\ -\ \overline{y_n}\right)+\overline{y_n}\left(\overline{x_{n+1}}\ -\overline{x_n}\right)+\overline{x_n}\left(\overline{y_{n+1}}\ -\overline{y_n}\right)\\
   &=\frac{x_{n+1}-\overline{x_n}}{n+1}\cdot \frac{y_{n+1}-\overline{y_n}}{n+1}+ \overline{y_n}\ \frac{x_{n+1}-\overline{x_n}}{n+1}+\overline{x_n}\ \frac{y_{n+1}-\overline{y_n}}{n+1}\ (\because\ \star)\\
   &=\frac{1}{n+1}\left\{ \frac{1}{n+1}\left(x_{n+1}-\overline{x_n}\right)\left(y_{n+1}-\overline{y_n}\right)+\overline{y_n}\ \left(x_{n+1}-\overline{x_n}\right)+\overline{x_n}\ \left(y_{n+1}-\overline{y_n}\right)\right\}
\end{align}

The goal is just around the corner!

\begin{align}
\ast\ast&=x_{n+1}y_{n+1}\ -\frac{1}{n+1}\left(x_{n+1}-\overline{x_n}\right)\left(y_{n+1}-\overline{y_n}\right)-\overline{y_n}\ \left(x_{n+1}-\overline{x_n}\right)-\overline{x_n}\ \left(y_{n+1}-\overline{y_n}\right)-\overline{x_n}\ \overline{y_n}\\
        &=\frac{n}{n+1}\ x_{n+1}y_{n+1}\ -\ \frac{n}{n+1}\ x_{n+1}\overline{y_n}\ -\ \frac{n}{n+1}\ \overline{x_n}\ y_{n+1}\ +\frac{n}{n+1}\ \overline{x_n}\ \overline{y_n}\\
        &=\frac{n}{n+1}\ \left(x_{n+1}\ -\ \overline{x_n}\right)\ \left(y_{n+1}\ -\ \overline{y_n}\right)
\end{align}

This completes the main transformation. I will do the final finishing.

M_{n+1}-M_n=\frac{n}{n+1}\left(x_{n+1}\ -\ \overline{x_n}\right)\left(y_{n+1}\ -\ \overline{y_n}\right)\\
\therefore\ M_{n+1} = \frac{n}{n+1}\left(x_{n+1}\ -\ \overline{x_n}\right)\left(y_{n+1}\ -\ \overline{y_n}\right)\ +\ M_n\\
\therefore\ s_{xy}^{n+1}\ =\ \frac{n}{(n+1)^2}\left(x_{n+1}\ -\ \overline{x_n}\right)\left(y_{n+1}\ -\ \overline{y_n}\right)\ +\ \frac{n}{n+1}s_{xy}^n

This completes the recurrence formula.

Implemented in Python

Now that we have a recurrence formula, let's actually implement it in Python. This time, we will implement it under the motivation that "vectors are given sequentially and we want to find the variance-covariance matrix of the vector set".

import numpy as np

def calc(next_val,times,var_cov_mat=None):
    '''
Do the calculation
    '''
    if times > 1:
        #Update the covariance matrix
        var_cov_mat = np.outer(next_val,n_mean)*(times/(times+1)**2) + var_cov_mat*times/(times+1)
    else:
        #Define initial state, variance-covariance matrix = identity matrix
        var_cov_mat = np.identity(len(next_val))

    return next_mean, var_cov_mat

This completes. After that, you can get the variance-covariance matrix sequentially by inserting more and more data.

Verification

This time we will use a 512 dimensional vector. Each component of the vector was randomly given a number greater than or equal to 0 and less than 1. The comparison method is to see the transition of the mean square error for each component of the matrix obtained by the above method and the matrix calculated by numpy once. Click here for the results. MSE_test3.png

The calculation of 500 times is performed in a little over 4 seconds, and it is transitioning to be within about 0.06.

Summary

This time, we derived a recurrence formula of covariance that enables sequential update of covariance and implemented it in Python. I think that the calculation result is written neatly, but if anyone says "I could write it more beautifully!", Please let me know. Regarding the implementation, as I wrote in the first introduction, I started from the place where I received the consultation, and I am implementing it in a form that reflects the content of the consultation I received, so if you want to implement it in another form, the first half I hope you can implement it using the expression part.

From here on, it has nothing to do with the main line, but this formula transformation was quite difficult. If you are interested in formula transformation, it may be a good idea to fill in the holes in the above part. I intended to fill it very carefully, so if you think there is no place to fill it, it may be interesting to calculate it by yourself from the beginning.

Referenced site

Sequential update of variance

Recommended Posts

Sequential update of covariance to derivation and implementation of expressions
Explanation and implementation of SocialFoceModel
Implementation of Bulk Update with mongo-go-driver
Explanation and implementation of PRML Chapter 4
Introduction and Implementation of JoCoR-Loss (CVPR2020)
Explanation and implementation of ESIM algorithm
Introduction and implementation of activation function
Explanation and implementation of simple perceptron
Derivation of multivariate t distribution and implementation of random number generation by python
Implementation of particle filters in Python and application to state space models
Implementation and experiment of convex clustering method
Explanation and implementation of Decomposable Attention algorithm
I tried to notify the update of "Hamelin" using "Beautiful Soup" and "IFTTT"
Build a python environment to learn the theory and implementation of deep learning
I tried to notify slack of Redmine update
Comparison of k-means implementation examples of scikit-learn and pyclustering
Script to tweet with multiples of 3 and numbers with 3 !!
Implementation of TRIE tree with Python and LOUDS
List of Python code to move and remember
Explanation of edit distance and implementation in Python
[Introduction to Python] Basic usage of lambda expressions
I tried to notify the update of "Become a novelist" using "IFTTT" and "Become a novelist API"
I tried to automate the article update of Livedoor blog with Python and selenium.