[Python] Calculation of mutual information (continuous variables) with NumPy

Motivation

I want to calculate the mutual information $I(X; Y)$ of continuous random variables $X$ and $Y$ in Python:

$I(X;Y) = \int_Y \int_X p(x, y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy$
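In the code below, this integral is approximated by a Riemann sum over histogram bins, where $\hat{p}$ denotes the histogram density estimates and $\Delta x$, $\Delta y$ the bin widths:

$I(X;Y) \approx \sum_{i} \sum_{j} \hat{p}(x_i, y_j) \log \frac{\hat{p}(x_i, y_j)}{\hat{p}(x_i)\, \hat{p}(y_j)} \, \Delta x \, \Delta y$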

Code

import numpy as np

def mutual_information(X, Y, bins=10):
    # Joint probability density p(x, y)
    p_xy, xedges, yedges = np.histogram2d(X, Y, bins=bins, density=True)

    # Marginal densities p(x) and p(y), computed on the same bin edges
    p_x, _ = np.histogram(X, bins=xedges, density=True)
    p_y, _ = np.histogram(Y, bins=yedges, density=True)
    p_x_y = p_x[:, np.newaxis] * p_y

    # Bin widths dx and dy
    dx = xedges[1] - xedges[0]
    dy = yedges[1] - yedges[0]

    # Integrand; np.ma.log masks the bins where p_xy is zero
    elem = p_xy * np.ma.log(p_xy / p_x_y)

    # Return the mutual information together with p(x, y) and p(x)p(y)
    return np.sum(elem * dx * dy), p_xy, p_x_y
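
As a quick sanity check, here is a minimal usage sketch against a case with a known answer: for a bivariate Gaussian with correlation coefficient $\rho$, the mutual information has the closed form $-\frac{1}{2}\log(1-\rho^2)$ (in nats), so the histogram estimate should land in the same ballpark. The values of rho, N, and bins here are only illustrative.

import numpy as np

# Correlated Gaussian samples with correlation coefficient rho
rho = 0.8
N = 100000
cov = [[1.0, rho], [rho, 1.0]]
X, Y = np.random.multivariate_normal([0.0, 0.0], cov, size=N).T

# Histogram-based estimate vs. the closed-form value -0.5 * log(1 - rho^2)
mi, _, _ = mutual_information(X, Y, bins=30)
print(mi, -0.5 * np.log(1 - rho ** 2))  # the two values should be roughly close

Note that the histogram estimate is biased by the binning and the sample size, so only rough agreement should be expected.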

Points to note

If you just want to compute mutual information, the function above will do the job. Below I note a few points that matter for the implementation.

density=True in np.histogram2d

I vaguely assumed that setting density=True would return probabilities, so I was briefly thrown when np.sum(p_xy) did not come out to 1. The thing to keep in mind is that p_xy is a **probability density**, not a probability.

Since $X$ and $Y$ are continuous variables, the histogram approximates a probability density. The values therefore sum to 1 only when each bin is weighted by its width.

np.histogram and np.histogram2d return the probability density together with the bin edges (edges in the code), and dx and dy have to be computed from those edges.

import numpy as np

N = 1000
X = np.random.normal(loc=0, scale=1, size=N)

p_x, edges = np.histogram(X, bins=10, density=True)

# Naively summing the probability densities does not, of course, give 1.
print(np.sum(p_x))  # Output example: 1.580769264599771

# Summing with the bin width taken into account gives 1.
dx = edges[1] - edges[0]
print(np.sum(p_x * dx))  # Output example: 1.0000000000000002
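
The same applies to np.histogram2d: the 2-D density sums to 1 only when weighted by the bin area. A minimal check, continuing from the snippet above:

Y = np.random.normal(loc=0, scale=1, size=N)

p_xy, xedges, yedges = np.histogram2d(X, Y, bins=10, density=True)
dx = xedges[1] - xedges[0]
dy = yedges[1] - yedges[0]

# Weighted by the bin area dx * dy, the 2-D density also sums to (approximately) 1.
print(np.sum(p_xy * dx * dy))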

Calculation of p_x_y

p_x_y in the code is meant to compute $p(x)p(y)$. I actually wrote it as follows at first, and it did not work:

p_x_y = p_x * p_y

The correct version is

p_x_y = p_x[:, np.newaxis] * p_y

is. In the former, p_x_y is the primary array, and in the latter, p_x_y is the secondary array.

Execution examples

Execution example 1 (two sine waves)

Because the two signals are not independent, $p(x, y)$ differs from $p(x)p(y)$ and the mutual information is large.

import matplotlib.pyplot as plt

# A sine wave and a cosine wave
t = np.linspace(-5, 5, num=1000)
X = np.sin(2 * np.pi * t)
Y = np.cos(3 * np.pi * t)

# Calculate the mutual information
mi, p_xy, p_x_y = mutual_information(X, Y, bins=30)

# Plot the results
plt.figure(dpi=100)
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
ax1.set_title(r'$P_{XY}(x, y)$')
ax1.imshow(p_xy)
ax2.set_title(r'$P_{X}(x) P_{Y}(y)$')
ax2.imshow(p_x_y)
plt.suptitle('MI = {}'.format(mi))
plt.show()

(Figure: heatmaps of $P_{XY}(x, y)$ and $P_X(x) P_Y(y)$, with the mutual information shown in the title)

Execution example 2 (independent normal distribution)

When the two variables are independent, $p(x, y)$ and $p(x)p(y)$ (approximately) coincide, and the mutual information is small.

import matplotlib.pyplot as plt
# Two independent normal distributions
N = 10000
X = np.random.normal(size=N)
Y = np.random.normal(size=N)

# Calculate the mutual information
mi, p_xy, p_x_y = mutual_information(X, Y, bins=30)

# Plot the results
plt.figure(dpi=100)
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
ax1.set_title(r'$P_{XY}(x, y)$')
ax1.imshow(p_xy)
ax2.set_title(r'$P_{X}(x) P_{Y}(y)$')
ax2.imshow(p_x_y)
plt.suptitle('MI = {}'.format(mi))
plt.show()

(Figure: heatmaps of $P_{XY}(x, y)$ and $P_X(x) P_Y(y)$, with the mutual information shown in the title)
