[PYTHON] Steps to calculate the likelihood of a normal distribution

I have stumbled upon the "likelihood" of elementary statistics. Let's observe how the (simultaneous) probability of the normal distribution changes while moving the command.

normal distribution

As you are familiar with every time, the normal distribution is expressed as follows. For the sake of clarity, the left side is expressed as $ P (x) $.

P(x)={1 \over \sqrt{2\pi\sigma^{2}}} \exp \left(-{1 \over 2}{(x-\mu)^2 \over \sigma^2} \right)
Library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy.random as rd
import matplotlib.gridspec as gridspec
%matplotlib inline
plt.rcParams['font.size']=15

def plt_legend_out():
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
code
m = 10
s = 3

min_x = m-4*s
max_x = m+4*s

x = np.linspace(min_x, max_x, 201)
y = (1/np.sqrt(2*np.pi*s**2))*np.exp(-0.5*(x-m)**2/s**2)

plt.xlim(min_x, max_x)
plt.ylim(0,max(y)*1.1)
plt.plot(x,y)
plt.show()

image.png

Let's randomly extract 10 pieces of data from the normal distribution.

image.png

code
plt.figure(figsize=(8,1))
rd.seed(7)
data = rd.normal(10, 3, 10, )
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.tick_params(left=False,labelleft=False)
plt.axhline(y=0,color='gray',lw=0.5)
plt.show()

Simultaneous probability

The probability that the above data will occur at the same time is as follows.

\begin{eqnarray}
\prod_{i=1}^NP(x) &=& P(x_1, x_2,\cdots,x_{10})\\
&=& P(x_1)P(x_2)\cdots P(x_{10})
\end{eqnarray}

However, with the above expression, there are problems when calculating. The product of the probabilities will be a fairly small value. In the calculation after the decimal point, zero increases like 0.01 x 0.01 = 0.001. If this continues 10 times or 100 times, there will be many zeros, which is confusing. So, let's take $ log $.

\begin{eqnarray}
\prod_{i=1}^{10}\log{P(x_i)} &=& \log{P(x_1,x_2,\cdots,x_{10})}\\
&=& \log{(P(x_1)×P(x_2)\cdots×P(x_{10}))}\\
&=& \log{P(x_1)}+\log{P(x_2)}+\cdots+\log{P(x_{10})}\\
&=& \color{red}{\sum_{i=1}^{10}\log{P(x_i)}}
\end{eqnarray}

By taking $ log $, I was able to replace it with the problem of addition. $ P (x) $ in this case is a normal distribution. Therefore, the simultaneous probability of the above data is as follows.

\begin{eqnarray}
\color{red}{\sum_{i=1}^{10}\log{P(x_i)}} &=& \sum_{i=1}^N{1 \over \sqrt{2\pi\sigma^{2}}} \exp \left(-{1 \over 2}{(x_i-\mu)^2 \over \sigma^2} \right)
\end{eqnarray}

Implementation

Now, let's fix $ \ sigma = 3 $ and consider a normal distribution of $ \ mu = 5, 10, 15 $. Let's plot the data extracted earlier as well. Visually, the conditions $ \ mu = 10 $ and $ \ sigma = 3 $ seem to fit well.

image.png

Functions
def norm_dens(x,m,s):
    return (1/np.sqrt(2*np.pi*s**2))*np.exp(-0.5*(x-m)**2/s**2)

def log_likelihood(x,m,s):
    L = np.prod([norm_dens(x_i,m,s) for x_i in x])
    l = np.log(L)
    return l
code
logp_ymin = -10 ;logp_ymax = 0
d_ymin = -0.01 ; d_ymax = 0.2

plt.figure(figsize=(10,2))
plt.subplots_adjust(hspace=0.1, wspace=0.1)

x = np.linspace(0, 20, 100)

###########

plt.subplot(131)
m=5 ; s=3
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.ylabel('density')
plt.axhline(y=0,color='gray',lw=0.5)
plt.ylim(d_ymin,d_ymax)
plt.xlim(0,20)
plt.tick_params(bottom=False,labelbottom=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

###########

plt.subplot(132)
m=10 ; s=3
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.axhline(y=0,color='gray',lw=0.5)
plt.xlim(0,20)
plt.ylim(d_ymin,d_ymax)
plt.tick_params(bottom=False,labelbottom=False,left=False,labelleft=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

###########

plt.subplot(133)
m=15 ; s=3
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.axhline(y=0,color='gray',lw=0.5)
plt.xlim(0,20)
plt.ylim(d_ymin,d_ymax)
plt.tick_params(bottom=False,labelbottom=False,left=False,labelleft=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

plt.show()

Now, let's plot $ \ log {P (x)} $ as well. Under the conditions of $ \ mu = 10 $ and $ \ sigma = 3 $, $ \ sum {\ log {P (x)}} $ was the largest.

image.png

code
logp_ymin = -10 ;logp_ymax = 0
d_ymin = -0.01 ; d_ymax = 0.2

plt.figure(figsize=(10,2))
plt.subplots_adjust(hspace=0.1, wspace=0.1)

x = np.linspace(0, 20, 100)

###########

plt.subplot(131)
m=5 ; s=3
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.ylabel('density')
plt.axhline(y=0,color='gray',lw=0.5)
plt.ylim(d_ymin,d_ymax)
plt.xlim(0,20)
plt.tick_params(bottom=False,labelbottom=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

###########

plt.subplot(132)
m=10 ; s=3
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.axhline(y=0,color='gray',lw=0.5)
plt.xlim(0,20)
plt.ylim(d_ymin,d_ymax)
plt.tick_params(bottom=False,labelbottom=False,left=False,labelleft=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

###########

plt.subplot(133)
m=15 ; s=3
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.axhline(y=0,color='gray',lw=0.5)
plt.xlim(0,20)
plt.ylim(d_ymin,d_ymax)
plt.tick_params(bottom=False,labelbottom=False,left=False,labelleft=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

plt.show()

Next, let's fix $ \ mu = 10 $ and see the normal distribution of $ \ sigma = 2,3,5 $. Although the big difference is not seen, $ \ sum {\ log {P (x)}} $ is the largest under the conditions of $ \ mu = 10 $ and $ \ sigma = 3 $. The distribution of $ \ sigma = 5 $ seems too wide in light of the data. In other words, it can be seen that the data and distribution fit well under the condition that $ \ sum {\ log {P (x)}} $ is large (simultaneous probability is large).

image.png

code
logp_ymin = -6 ;logp_ymax = 0
d_ymin = -0.01 ; d_ymax = 0.25

plt.figure(figsize=(10,4))
plt.subplots_adjust(hspace=0.1, wspace=0.1)

x = np.linspace(0, 20, 100)

###########

plt.subplot(231)
m=10 ; s=2
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.ylabel('density')
plt.axhline(y=0,color='gray',lw=0.5)
plt.ylim(d_ymin,d_ymax)
plt.xlim(0,20)
plt.tick_params(bottom=False,labelbottom=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

plt.subplot(234)
xl = [np.log(norm_dens(x,m,s)) for x in data]
plt.scatter(data,xl,color='red')
plt.xlim(0,20)
plt.ylabel('log p(x)')
plt.ylim(logp_ymin,logp_ymax)
plt.xlabel('x')
for i in range(len(data)):plt.plot([data[i],data[i]],[0,xl[i]],color='gray',lw=0.5,ls='dashed')
plt.text(3,-10,'$\sum{\log{p(x)}}$='+str(np.round(np.sum(xl),1)))

###########

plt.subplot(232)
m=10 ; s=3
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.axhline(y=0,color='gray',lw=0.5)
plt.xlim(0,20)
plt.ylim(d_ymin,d_ymax)
plt.tick_params(bottom=False,labelbottom=False,left=False,labelleft=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

plt.subplot(235)
xl = [np.log(norm_dens(x,m,s)) for x in data]
plt.scatter(data,xl,color='red')
plt.xlim(0,20)
plt.ylim(logp_ymin,logp_ymax)
plt.tick_params(left=False,labelleft=False)
plt.xlabel('x')
for i in range(len(data)):plt.plot([data[i],data[i]],[0,xl[i]],color='gray',lw=0.5,ls='dashed')
plt.text(3,-10,'$\sum{\log{p(x)}}$='+str(np.round(np.sum(xl),1)))

###########

plt.subplot(233)
m=10 ; s=5
y = norm_dens(x,m,s)
plt.plot(x,y,label='$\mu=$'+str(m)+'$, \sigma=$'+str(s))
plt.scatter(data, np.zeros_like(data), c="r", s=50)
plt.axhline(y=0,color='gray',lw=0.5)
plt.xlim(0,20)
plt.ylim(d_ymin,d_ymax)
plt.tick_params(bottom=False,labelbottom=False,left=False,labelleft=False)
plt.title('$\mu=$'+str(m)+', $\sigma=$'+str(s))

plt.subplot(236)
xl = [np.log(norm_dens(x,m,s)) for x in data]
plt.scatter(data,xl,color='red')
plt.xlim(0,20)
plt.ylim(logp_ymin,logp_ymax)
for i in range(len(data)):plt.plot([data[i],data[i]],[0,xl[i]],color='gray',lw=0.5,ls='dashed')
plt.tick_params(left=False,labelleft=False)
plt.xlabel('x')
plt.text(3,-10,'$\sum{\log{p(x)}}$='+str(np.round(np.sum(xl),1)))
plt.show()

Finally, let's explore $ \ mu $ and $ \ sigma $ in detail. Searching is just a comprehensive search. As a result, we were able to estimate a value close to the true value.

image.png

code
mus = np.linspace(8, 12, 50)
ss  = np.linspace(2, 4, 50)
lmu = [] ; ls = [] ; lll = []

for mu in mus:
    for s in ss:
        lmu.append(mu)
        ls.append(s)
        lll.append(log_likelihood(data,mu,s))

plt.scatter(lmu,ls,c=lll,alpha=0.8)
plt.xlabel('$\mu$')
plt.ylabel('$\sigma$')
plt.colorbar()
plt.scatter(10,3,color='r')
plt.text(10.1,3.1,'true',color='r')

pmu,ps,pll = pd.DataFrame([lmu,ls,lll]).T.sort_values(2,ascending=False).reset_index(drop=True).loc[0,:].to_numpy()
plt.scatter(pmu,ps,color='b')
plt.text(pmu+0.1,ps+0.1,'predicted',color='b')

plt.title('Simultaneous probability')
plt.show()

From simultaneous probability to likelihood

This time I knew that "the data came from the normal distribution". But in reality, we will assume the probability distribution behind the data. In that case, the parameter of the distribution is estimated by replacing the expression "likelihood" from "simultaneous probability". What you are doing is the same.

Reference URL

Recommended Posts

Steps to calculate the likelihood of a normal distribution
How to calculate the volatility of a brand
[Python] Note: A self-made function that finds the area of the normal distribution
Explain the nature of the multivariate normal distribution graphically
Defeat the probability density function of the normal distribution
Calculate the probability of outliers on a boxplot
Verification of normal distribution
[Numpy, scipy] How to calculate the square root of a semi-fixed definite matrix
A memo to visually understand the axis of pandas.Panel
Calculate the product of matrices with a character expression?
Python Note: The mystery of assigning a variable to a variable
Calculate the number of changes
[Ubuntu] How to delete the entire contents of a directory
Understanding the meaning of complex and bizarre normal distribution formulas
How to calculate the amount of calculation learned from ABC134-D
How to find the scaling factor of a biorthogonal wavelet
Try to model a multimodal distribution using the EM algorithm
Is there a secret to the frequency of pi numbers?
How to connect the contents of a list into a string
How to find the average amount of information (entropy) of the original probability distribution from a sample
How to plot the distribution of bacterial composition from Qiime2 analysis data in a box plot
Generate a normal distribution with SciPy
Steps to create a Django project
A story that struggled to handle the Python package of PocketSphinx
Conditional branch due to the existence of a shell script file
How to check the memory size of a variable in Python
Test the goodness of fit of the distribution
[Introduction to StyleGAN] I played with "The Life of a Man" ♬
[Go] Create a CLI command to change the extension of the image
How to check the memory size of a dictionary in Python
How to output the output result of the Linux man command to a file
[Python3] Define a decorator to measure the execution time of a function
How to get the vertex coordinates of a feature in ArcPy
I want to create a histogram and overlay the normal distribution curve on it. matplotlib edition
Write a script to calculate the distance with Elasticsearch 5 system painless
A command to easily check the speed of the network on the console
Create a function to get the contents of the database in Go
Carefully derive the interquartile range of the standard normal distribution from the beginning
[python] A note that started to understand the behavior of matplotlib.pyplot
Supplement to the explanation of vscode
The story of writing a program
[NNabla] How to remove the middle tier of a pre-built network
[Python] A simple function to find the center coordinates of a circle
[Python] A program that rotates the contents of the list to the left
I made a calendar that automatically updates the distribution schedule of Vtuber
Calculate the shortest route of a graph with Dijkstra's algorithm and Python
I tried to visualize the age group and rate distribution of Atcoder
[Python] A program that calculates the number of socks to be paired
Try to estimate the parameters of the gamma distribution while simply implementing MCMC
Various methods to numerically create the inverse function of a certain function Introduction
[Introduction to Python] How to sort the contents of a list efficiently with list sort
[Linux] Command to get a list of commands executed in the past
Semi-automatically generate a description of the package to be registered on PyPI
[NNabla] How to add a quantization layer to the middle layer of a trained model
How to put a line number at the beginning of a CSV file
Calculate the probability of being a squid coin with Bayes' theorem [python]
How to create a wrapper that preserves the signature of the function to wrap
Allow Slack to notify you of the end of a time-consuming program process
Python code to determine the monthly signal of a relative strength investment
I made a program to check the size of a file in Python
I tried to display the altitude value of DTM in a graph