[PYTHON] How to find the average amount of information (entropy) of the original probability distribution from a sample

** Notes on this article ** It is an amateur graffiti. I think there are many mistakes, improper terms, and no proof. Thank you for your understanding.

Purpose of this article

I want to find the average amount of information of the original probability distribution from the sample generated by the continuous probability distribution.

theory

The average information amount $ , h (X) , $ of a continuous probability distribution with a probability density function $ , , f , $ is expressed by Equation 1, but the number of samples $ , N , $ is If it is large enough, it can be obtained from the amount of information of each sample $ , , x_i , $ as shown in Equation 2 (should).

\begin{align}
&h(X) = \int_{\chi}^{}f(x)\,log\,f(x)\,dx\qquad\qquad ・ ・ ・ Equation 1\\
&h(X) \approx \frac{1}{N}\sum_{i=1}^{N}\,-logP(\,X = x_i\,)\,\qquad ・ ・ ・ Equation 2
\end{align}

In order to find $ , h (X) $ from Equation 2, it is necessary to find $ , P (, X = x_i ,) , $ for each sample $ , , x_i , $. .. From here, I will explain a little about how to find $ P (, X = x_i ,) , $ and $ , h (X) , $.

First, we define some quantities. -Set the distance between each sample as $ , d (x_i, , x_j) , $ $ ^ * $. -For each sample $ , , x_i , $, the total number of samples whose distance to $ x_i , $ is less than or equal to $ , , d , $ is $ , n_i. Set , $ (including yourself). ・ The volume of the area where the distance from a certain $ , , x , $ is less than $ , r , $ is $ , V (r) , $. .. By determining an appropriate $ , r , $ according to this definition, $ , P (, X = x_i ,) , $ can be approximately obtained from Equation 3.

P(\,X=x_i\,)\, \approx \frac{n_i}{NV(r)}\qquad ・ ・ ・ Equation 3

By substituting Equation 3 into Equation 2, Equation 4 is obtained.

\begin{align}
h(X,r) &\approx \frac{1}{N}\sum_{i=1}^{N}\,-log\frac{n_i}{NV(r)}\\
&= \,logV(r) + logN - \frac{1}{N}\sum_{i=1}^{N}\,log\,n_i\qquad ・ ・ ・ Equation 4
\end{align}

In order to establish the approximation of Equation 3, it is preferable that $ r , $ take as small a value as possible. However, as long as the number of samples is finite, if $ r , $ is extremely small, the law of large numbers cannot be satisfied, and the approximation of Equation 3 breaks down. Therefore, it is necessary to think about how to properly determine $ r , $ while looking at the actual data.     $ ^ * \ lim_ {d (x_i, , x_j) \ to 0} P (, X = x_i ,) = lim_ {d (x_i, , x_j) \ to 0} P (, X = x_j) Anything should be fine as long as it meets ,) , , $

application

Generate a sample from an appropriate two-dimensional Gaussian distribution and find $ h (X) , $.

import numpy as np
from matplotlib import pyplot as plt


def calc_d(x):
    N = len(x)
    x_tiled = np.tile(x, (N, 1, 1))
    d = np.linalg.norm(x_tiled - x_tiled.transpose((1, 0, 2)), axis=2)
    return d


#Apply the area of a circle formula because the number of dimensions is 2.
def calc_v(r):
    v = np.pi * np.power(r, 2)
    return v


def calc_h(d, v, N, r):
    n = np.sum(d <= r, axis=0)
    h = np.log(v) + np.log(N) - np.sum(np.log(n)) / N
    return h


#Generate data from a suitable 2D Gaussian distribution
data = np.random.normal(0, 1, (1000, 2))
#h while changing r(X)To calculate
r_list = [(i + 1) * 0.01 for i in range(10000)]  #The range of r was decided appropriately
d = calc_d(data)
N = len(data)
h_list = [calc_h(d, calc_v(r), N, r) for r in r_list]
#Draw a graph
#Plot the calculated value with a solid blue line
plt.figure(0)
plt.plot(r_list, h_list, color='blue', linestyle='solid')
#Plot the value calculated from the sample variance with a blue dotted line
Z = np.cov(data[:, 0], data[:, 1])
h_s = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * Z))
plt.plot(r_list, [h_s for _ in range(len(r_list))], color='blue', linestyle='dotted')
#Plot the value calculated from the population variance with an orange dotted line
h_u = np.log(2 * np.pi * np.e)
plt.plot(r_list, [h_u for _ in range(len(r_list))], color='orange', linestyle='dotted')
plt.xlim([0, 3])
plt.ylim([0, 5])
plt.show()

When executed, such a graph is obtained.

Figure_0.png

The horizontal axis represents $ , r , $, and the vertical axis represents $ , h (X, r) , $. As explained in theory, the smaller $ r , $, the closer $ , h (X, r) , $ is to the true value, but if it is too small, it will in turn diverge toward negative infinity. I will go. Looking at the graph, it seems that it is somehow good to decide $ , r , $ so that the slope is the smallest. In fact, considering that $ , \ frac {\ partial} {\ partial r} h (X, r) = 0 , $ holds when the approximation of Equation 3 holds, this method of determination is not so strange. I think, but there is no proof, so overconfidence is prohibited.

Recommended Posts

How to find the average amount of information (entropy) of the original probability distribution from a sample
How to calculate the amount of calculation learned from ABC134-D
How to find the scaling factor of a biorthogonal wavelet
How to plot the distribution of bacterial composition from Qiime2 analysis data in a box plot
How to find the memory address of a Pandas dataframe value
How to calculate the volatility of a brand
How to find the area of the Voronoi diagram
[Circuit x Python] How to find the transfer function of a circuit using Lcapy
Steps to calculate the likelihood of a normal distribution
How to post a ticket from the Shogun API
I tried to find the entropy of the image with python
[Ubuntu] How to delete the entire contents of a directory
How to find the optimal number of clusters in k-means
Inherit the standard library to find the average value of Queue
How to quickly count the frequency of appearance of characters from a character string in Python?
How to sample from any probability density function in Python
How to get a list of links from a page from wikipedia
How to connect the contents of a list into a string
How to take a screenshot of the Chrome screen (prevent it from cutting off in the middle)
How to determine the existence of a selenium element in Python
How to check the memory size of a variable in Python
How to check the memory size of a dictionary in Python
How to output the output result of the Linux man command to a file
How to get the vertex coordinates of a feature in ArcPy
How to extract the desired character string from a line 4 commands
How to create a large amount of test data in MySQL? ??
Find all patterns to extract a specific number from the set
[NNabla] How to remove the middle tier of a pre-built network
[Python] A simple function to find the center coordinates of a circle
From the introduction of GoogleCloudPlatform Natural Language API to how to use it
How to find out the number of CPUs without using the sar command
[Introduction to Python] How to sort the contents of a list efficiently with list sort
[NNabla] How to add a quantization layer to the middle layer of a trained model
How to put a line number at the beginning of a CSV file
How to get a sample report from a hash value using VirusTotal's API
How to create a wrapper that preserves the signature of the function to wrap
Get the average salary of a job with specified conditions from indeed.com
How to play a video while watching the number of frames (Mac)
How to create a clone from Github
How to check the version of Django
How to save the feature point information of an image in a file and use it for matching
How to operate Linux from the console
How to create a repository from media
How to access the Datastore from the outside
Try to find the probability that it is a multiple of 3 and not a multiple of 5 when one is removed from a card with natural numbers 1 to 100 using Ruby and Python.
How to pass the execution result of a shell command in a list in Python
How to mention a user group in slack notification, how to check the id of the user group
A programming beginner tried to find out the execution time of sorting etc.
Find out how to divide a file with a certain number of lines evenly
[NNabla] How to get the output (variable) of the middle layer of a pre-built network
How to count the number of elements in Django and output to a template
How to access the contents of a Linux disk on a Mac (but read-only)
A memorandum of how to execute the! Sudo magic command in Jupyter Notebook
SSH login to the target server from Windows with a click of a shortcut
[Numpy, scipy] How to calculate the square root of a semi-fixed definite matrix
How to find the coefficient of the trendline that passes through the vertices in Python
Find the ideal property by scraping! A few minutes walk from the property to the destination
How to make a Raspberry Pi that speaks the tweets of the specified user
How to get a list of files in the same directory with python
I tried to create a model with the sample of Amazon SageMaker Autopilot
[Introduction to Python] How to get the index of data with a for statement