I implemented Donald Knuth's unbiased sequential calculation algorithm in Python

This article introduces Donald Knuth's algorithm, which is a good variance calculation algorithm. I usually rely on the library to calculate the variance, and I haven't considered the algorithm in particular, so I hope I can take this opportunity to deepen my knowledge.

Problems of calculation by the unbiased variance formula

First, I'll explain why you need such an algorithm. The unbiased variance is expressed by the following formula.

\sigma^2 = \frac{1}{n(n-1)} \biggl(\sum_{i=1}^{n}x_i^2-\Bigl(\sum_{i=1}^{n}x_i\Bigr)^2\biggr) =\frac{1}{n-1}\Bigl(E[X^2]-E[X]^2\Bigr)

Now, here are two cumulative sums. The cumulative sum of 1. $ x $ itself and the cumulative sum of 2. $ x ^ 2 $. There are two main problems at this time. First, the accuracy of numerical calculations is not guaranteed when $ x $ is a very large number. Second, you have to do large calculations again when you have more samples. To solve these challenges, let's look at Donald Knuth's algorithm in the next section.

Donald Knuth's sequential calculation algorithm

To solve the above-mentioned problems, this algorithm was designed to calculate the variance sequentially. The algorithm is shown below. $ M_k $ represents the mean and $ s_k $ is the numerator of the variance.

Initialization: $ M_1 = x_1, S_1 = 0 $ Calculate the following recurrence formula in $ k \ geq 2 $

\begin{align}
M_k &= M_{k-1} + \frac{(x_k - M_{k-1})}{k}\\
S_k &= S_{k-1} + (x_k - M_{k-1})*(x_k - M_k)
\end{align}

The estimator of the unbiased variance to be calculated is $ s ^ 2 = S_k / (k-1) $ in $ k \ geq 2 $.

Implementation in Python

Let's implement it in Python to make sure that the above algorithm actually works. Since it is a recurrence formula, we will implement it with a recursive function. I made it possible to check each print statement so that you can see how it is being calculated sequentially.

def calc_var(x, k=0, M=0, s=0):
    print('k=', k)
    print('M=', M)
    print('s=', s)
    print('-----------------------')
    if k == 0:
        M = x[0]
        s = 0
    delta = x[k] - M
    M += delta/(k+1)
    s += delta*(x[k]-M)
    k += 1
    if k == len(x):
        return M, s/(k-1)
    else:
        return calc_var(x, k=k, M=M, s=s)

x = [3.3, 5, 7.2, 12, 4, 6, 10.3]
print(calc_var(x))

Execution result

k= 0
M= 0
s= 0
-----------------------
k= 1
M= 3.3
s= 0.0
-----------------------
k= 2
M= 4.15
s= 1.4449999999999996
-----------------------
k= 3
M= 5.166666666666667
s= 7.646666666666666
-----------------------
k= 4
M= 6.875
s= 42.6675
-----------------------
k= 5
M= 6.3
s= 49.279999999999994
-----------------------
k= 6
M= 6.25
s= 49.355
-----------------------
(6.828571428571428, 10.56904761904762)

Let's actually calculate and compare the mean and variance with numpy. The keyword argument ddof = 1 is for calculating as unbiased variance.

import numpy as np
x_arr = np.array(x)
print((x_arr.mean(), x_arr.var(ddof=1)))

Output result

(6.828571428571428, 10.569047619047621)

I was able to calculate sequentially like this. Actually, I don't think that python recursion is faster than numpy, so I don't think I have a chance to implement it myself. I received it. I hope it will be helpful to everyone.

Reference link

Accurately computing running variance https://www.johndcook.com/blog/standard_deviation/

Recommended Posts

I implemented Donald Knuth's unbiased sequential calculation algorithm in Python
I implemented Cousera's logistic regression in Python
I implemented Robinson's Bayesian Spam Filter in python
I implemented the inverse gamma function in python
Implemented SimRank in Python
Genetic algorithm in python
Relearn Python (Algorithm I)
Date calculation in python
Date calculation in Python
Implemented Shiritori in Python
Algorithm in Python (Dijkstra's algorithm)
Implemented in Python PRML Chapter 4 Classification by Perceptron Algorithm
I tried to implement GA (genetic algorithm) in Python
Implemented the algorithm of "Algorithm Picture Book" in Python3 (Heapsort)
I implemented a Vim-like replacement command in Slackbot #Python
I wrote python in Japanese
Algorithm in Python (primality test)
Shapley value calculation in Python
Implemented the algorithm of "Algorithm Picture Book" in Python3 (Bubble Sort)
Reproduce Euclidean algorithm in Python
Algorithm in Python (binary search)
Implement Dijkstra's Algorithm in python
I understand Python in Japanese!
What I learned in Python
Implemented the algorithm of "Algorithm Picture Book" in Python3 (selection sort)
6 Ball puzzle implemented in python
I compared the calculation time of the moving average written in Python
Algorithm in Python (breadth-first search, bfs)
Implemented image segmentation in python (Union-Find)
Quantum chemistry calculation in Python (Psi4)
Accelerometer Alan Variance Calculation in Python
[Python] I implemented peripheral Gibbs sampling
Develop an investment algorithm in Python 2
Widrow-Hoff learning rules implemented in Python
I wrote Fizz Buzz in Python
Implemented label propagation method in Python
I learned about processes in Python
I can't install scikit-learn in Python
I wrote the queue in Python
Implementing a simple algorithm in Python 2
Implemented Perceptron learning rules in Python
Run a simple algorithm in Python
I tried Line notification in Python
Implemented in 1 minute! LINE Notify in Python
I wrote the stack in Python
I put Python 2.7 in Sakura VPS 1GB.
I tried to implement PLSA in Python
A simple HTTP client implemented in Python
I made a payroll program in Python!
Ant book in python: Sec.2-5 Dijkstra's algorithm
Algorithm in Python (ABC 146 C Binary Search
Implemented in Python PRML Chapter 7 Nonlinear SVM
I tried to implement PLSA in Python 2
I tried to implement ADALINE in Python
Write a simple greedy algorithm in Python
I wanted to solve ABC159 in Python
I tried to implement PPO in Python
Implemented in Python PRML Chapter 5 Neural Networks
I searched for prime numbers in python
Implemented Stooge sort in Python3 (Bubble sort & Quicksort)
Let's make a combination calculation in Python