A member of the same laboratory asked me if I would like to update the variance-covariance matrix sequentially, but is there any good way? I found this article . Here, we are deriving the formula for the sequential update of the mean and variance. Please refer to the relevant site for details on that part. Here, we derive an expression for the covariance that was not mentioned in detail on this site, and calculate the covariance matrix in Python3.
Write down the characters and relational expressions to be used before deriving. It transforms quite a lot, so if you don't understand it while following the formula, please come back here and think about it.
data
Average of data
Covariance
Recurrence formula for mean
Now the main subject! I'll play with the formula from here, but please be patient! Calculate $ M_ {n + 1} -M_n $ with $ M_n = ns_ {xy} ^ n $. If this is obtained, the recurrence formula of the covariance can be easily obtained.
\begin{align}
M_{n+1}-M_n &= \sum_{i=1}^{n+1}\left(x_i\ -\ \overline{x_{n+1}}\ \right)\left(y_i\ -\ \overline{y_{n+1}}\ \right) -\sum_{i=1}^n\left(x_i\ -\ \overline{x_n}\ \right)\left(y_i\ -\ \overline{y_n}\ \right)\\
&= \sum_{i=1}^{n+1}\left(x_iy_i\ -\ x_i\overline{y_{n+1}}\ -\ \overline{x_{n+1}}y_i\ +\ \overline{x_{n+1}}\ \overline{y_{n+1}}\right)\\
&\ -\sum_{i=1}^n\left(x_iy_i\ -\ x_i\overline{y_{n}}\ -\ \overline{x_{n}}y_i\ +\ \overline{x_{n}}\ \overline{y_{n}}\right)\\
&=x_{n+1}y_{n+1}\ +\ (n+1)\overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\overline{x_n}\ \overline{y_n}\\
&\ -\underline{\left(\overline{y_{n+1}}\sum_{i=1}^{n+1}x_i\ +\ \overline{x_{n+1}}\sum_{i=1}^{n+1}y_i\ -\ \overline{y_{n}}\sum_{i=1}^nx_i\ -\ \overline{x_{n}}\sum_{i=1}^ny_i\right)}\cdots\ast
\end{align}
By the way, is it all right so far? Let's calculate here with the underlined part as (1).
For the sake of simplicity, we will introduce the following two characters.
\begin{align}
(1) &=\overline{y_{n+1}}\ A_{n+1}\ +\ \overline{x_{n+1}}\ B_{n+1}\ -\ \overline{y_n}\ A_n\ -\ \overline{x_n}\ B_n\\
&=2\left(\frac{1}{n+1}A_{n+1}B_{n+1}\ -\ \frac{1}{n}A_nB_n\right)\\
&=2\left\{(n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\ \overline{x_n}\ \overline{y_n}\right\}
\end{align}
Yes, it's clean! Let's substitute this for the underlined part.
\begin{align}
\ast&=x_{n+1}y_{n+1}\ +\ (n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\ \overline{x_n}\ \overline{y_n}-2\left\{(n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ n\ \overline{x_n}\ \overline{y_n}\right\}\\
&=x_{n+1}y_{n+1}\ -(n+1)\ \overline{x_{n+1}}\ \overline{y_{n+1}}\ +\ n\ \overline{x_n}\ \overline{y_n}\\
&=x_{n+1}y_{n+1}\ -(n+1)\underline{\left(\overline{x_{n+1}}\ \overline{y_{n+1}}\ -\ \overline{x_n}\ \overline{y_n}\right)}\ -\ \overline{x_n}\ \overline{y_n}\cdots\ast\ast
\end{align}
The second underline. We will calculate here as (2).
\begin{align}
(2)&=\left(\overline{x_{n+1}}\ -\ \overline{x_n}\right)\left(\overline{y_{n+1}}\ -\ \overline{y_n}\right)+\overline{x_{n+1}}\ \overline{y_n}+\overline{x_{n}}\ \overline{y_{n+1}}\ -2\ \overline{x_{n}}\ \overline{y_n}\\
&=\left(\overline{x_{n+1}}\ -\ \overline{x_n}\right)\left(\overline{y_{n+1}}\ -\ \overline{y_n}\right)+\overline{y_n}\left(\overline{x_{n+1}}\ -\overline{x_n}\right)+\overline{x_n}\left(\overline{y_{n+1}}\ -\overline{y_n}\right)\\
&=\frac{x_{n+1}-\overline{x_n}}{n+1}\cdot \frac{y_{n+1}-\overline{y_n}}{n+1}+ \overline{y_n}\ \frac{x_{n+1}-\overline{x_n}}{n+1}+\overline{x_n}\ \frac{y_{n+1}-\overline{y_n}}{n+1}\ (\because\ \star)\\
&=\frac{1}{n+1}\left\{ \frac{1}{n+1}\left(x_{n+1}-\overline{x_n}\right)\left(y_{n+1}-\overline{y_n}\right)+\overline{y_n}\ \left(x_{n+1}-\overline{x_n}\right)+\overline{x_n}\ \left(y_{n+1}-\overline{y_n}\right)\right\}
\end{align}
The goal is just around the corner!
\begin{align}
\ast\ast&=x_{n+1}y_{n+1}\ -\frac{1}{n+1}\left(x_{n+1}-\overline{x_n}\right)\left(y_{n+1}-\overline{y_n}\right)-\overline{y_n}\ \left(x_{n+1}-\overline{x_n}\right)-\overline{x_n}\ \left(y_{n+1}-\overline{y_n}\right)-\overline{x_n}\ \overline{y_n}\\
&=\frac{n}{n+1}\ x_{n+1}y_{n+1}\ -\ \frac{n}{n+1}\ x_{n+1}\overline{y_n}\ -\ \frac{n}{n+1}\ \overline{x_n}\ y_{n+1}\ +\frac{n}{n+1}\ \overline{x_n}\ \overline{y_n}\\
&=\frac{n}{n+1}\ \left(x_{n+1}\ -\ \overline{x_n}\right)\ \left(y_{n+1}\ -\ \overline{y_n}\right)
\end{align}
This completes the main transformation. I will do the final finishing.
M_{n+1}-M_n=\frac{n}{n+1}\left(x_{n+1}\ -\ \overline{x_n}\right)\left(y_{n+1}\ -\ \overline{y_n}\right)\\
\therefore\ M_{n+1} = \frac{n}{n+1}\left(x_{n+1}\ -\ \overline{x_n}\right)\left(y_{n+1}\ -\ \overline{y_n}\right)\ +\ M_n\\
\therefore\ s_{xy}^{n+1}\ =\ \frac{n}{(n+1)^2}\left(x_{n+1}\ -\ \overline{x_n}\right)\left(y_{n+1}\ -\ \overline{y_n}\right)\ +\ \frac{n}{n+1}s_{xy}^n
This completes the recurrence formula.
Now that we have a recurrence formula, let's actually implement it in Python. This time, we will implement it under the motivation that "vectors are given sequentially and we want to find the variance-covariance matrix of the vector set".
import numpy as np
def calc(next_val,times,var_cov_mat=None):
'''
Do the calculation
'''
if times > 1:
#Update the covariance matrix
var_cov_mat = np.outer(next_val,n_mean)*(times/(times+1)**2) + var_cov_mat*times/(times+1)
else:
#Define initial state, variance-covariance matrix = identity matrix
var_cov_mat = np.identity(len(next_val))
return next_mean, var_cov_mat
This completes. After that, you can get the variance-covariance matrix sequentially by inserting more and more data.
This time we will use a 512 dimensional vector. Each component of the vector was randomly given a number greater than or equal to 0 and less than 1. The comparison method is to see the transition of the mean square error for each component of the matrix obtained by the above method and the matrix calculated by numpy once. Click here for the results.
The calculation of 500 times is performed in a little over 4 seconds, and it is transitioning to be within about 0.06.
This time, we derived a recurrence formula of covariance that enables sequential update of covariance and implemented it in Python. I think that the calculation result is written neatly, but if anyone says "I could write it more beautifully!", Please let me know. Regarding the implementation, as I wrote in the first introduction, I started from the place where I received the consultation, and I am implementing it in a form that reflects the content of the consultation I received, so if you want to implement it in another form, the first half I hope you can implement it using the expression part.
From here on, it has nothing to do with the main line, but this formula transformation was quite difficult. If you are interested in formula transformation, it may be a good idea to fill in the holes in the above part. I intended to fill it very carefully, so if you think there is no place to fill it, it may be interesting to calculate it by yourself from the beginning.
Recommended Posts