[PYTHON] (Machine learning) I tried to understand the EM algorithm for a Gaussian mixture distribution carefully, with an implementation.

Introduction

I'm currently studying Bayesian inference. This time, I would like to write down my understanding of the **EM algorithm for a Gaussian mixture distribution**.

As I continue studying, I feel that **even with simple examples, understanding of formulas and terminology progresses much faster when you illustrate things and make them concrete while thinking**. So I would like to make this article as easy to understand as possible by including implementations.

I referred heavily to the article below. It explains everything from the concept to the formula transformations and the implementation very clearly.

Thorough explanation of EM algorithm https://qiita.com/kenmatsu4/items/59ea3e5dfa3d4c161efb

What is the EM algorithm?

The EM algorithm (Expectation-Maximization algorithm) is an algorithm **used for learning and optimizing models that include hidden (latent) variables**.

Mixing coefficient

As a first piece of terminology, let's deepen our understanding of the mixing coefficient.

Consider a probability model $p(x)$ for $x$ when the following two-dimensional observations $x$ are obtained. The data appear to be generated from two clusters, A and B, so let's consider a model that reflects this.

(Figure: two-dimensional observations that appear to form two clusters, A and B)

Assuming each cluster follows a Gaussian distribution, the model can be expressed as follows.

\begin{align}
p(x) &= \pi_A\mathcal N(x|\mu_A, \Sigma_A) +\pi_B\mathcal N(x|\mu_B, \Sigma_B)\\

\end{align}

where $\pi_A$ and $\pi_B$ are the weights of the two components, satisfying the constraints described below. Generalizing to $K$ components gives:

\begin{align}
p(x) &= \sum_{k=1}^{K} \pi_k\mathcal N(x|\mu_k, \Sigma_k) \hspace{1cm}(Equation 1)

\end{align}

This $\pi_k$ is called the **mixing coefficient** and satisfies the following:

\sum_{k=1}^{K} \pi_k =1\\

0 \leqq \pi_k \leqq 1

Here, $K$ is the number of clusters. In other words, the mixing coefficient is a value that represents **the weight of each cluster (i.e., which cluster a point is most likely to come from)**.
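To make this concrete, here is a minimal sketch (my own toy example, not from the original article) that samples from a two-component mixture: $\pi_k$ is exactly the probability that a point is generated by cluster $k$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-component example: pi_A = 0.6, pi_B = 0.4
pi = np.array([0.6, 0.4])      # mixing coefficients: sum to 1, each in [0, 1]
mu = np.array([0.0, 5.0])      # component means
sigma = np.array([1.0, 1.5])   # component standard deviations

# First choose a cluster for each sample according to pi, then draw from that Gaussian
z = rng.choice(len(pi), size=10_000, p=pi)
x = rng.normal(mu[z], sigma[z])

print(np.mean(z == 0))  # fraction of samples from cluster A: close to pi_A = 0.6
```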

Responsibility

Next, consider the term responsibility (sometimes translated as "burden rate"). Let $\pi_k = p(k)$ be the prior probability of selecting the $k$-th cluster, and let $\mathcal N(x|\mu_k, \Sigma_k) = p(x|k)$ be the conditional probability of $x$ given $k$. Then the marginal density of $x$ is

p(x) = \sum_{k=1}^{K} p(k)p(x|k)\hspace{1cm}(Equation 2)\\

This is equivalent to Equation 1 above. The quantity $p(k|x)$ here is called the **responsibility**. It is also written as $\gamma_k(x)$, and using Bayes' theorem,

\begin{align}
\gamma_k(x) &\equiv p(k|x)\\
&=\frac {p(k)p(x|k)}{\sum_l p(l)p(x|l)}\\
&=\frac {\pi_k\mathcal N(x|\mu_k, \Sigma_k)}{\sum_l \pi_l\mathcal N(x|\mu_l, \Sigma_l)}

\end{align}

What does this responsibility actually mean? Let's confirm it with an implementation.

Implement and check

EM.ipynb


import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st

# ======================================
# Parameters of a 1-D Gaussian mixture with K = 4 components
K = 4
n = 300
xx = np.linspace(-4, 10, n)

mu = [-1.2, 0.4, 2, 4]        # component means
sigma = [1.1, 0.7, 1.5, 2.0]  # component standard deviations
pi = [0.2, 0.3, 0.3, 0.2]     # mixing coefficients (sum to 1)

# Weighted density of each component: pi_k * N(x | mu_k, sigma_k)
pdfs = np.zeros((n, K))
for k in range(K):
    pdfs[:, k] = pi[k]*st.norm.pdf(xx, loc=mu[k], scale=sigma[k])

# =======================================
# Visualization
plt.figure(figsize=(14, 6))
for k in range(K):
    plt.plot(xx, pdfs[:, k])
plt.title("pdfs")
plt.show()

# Stacking the weighted densities gives the mixture density p(x)
plt.figure(figsize=(14, 6))
plt.stackplot(xx, pdfs[:, 0], pdfs[:, 1], pdfs[:, 2], pdfs[:, 3])
plt.title("stacked")
plt.show()

(Figures: "pdfs" — the four weighted component densities; "stacked" — their sum, the mixture density)

As the stacked plot shows, given a point $x$, the responsibility is the fraction of the mixture density at $x$ that each cluster $k$ contributes.
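To see this numerically, the responsibilities can be computed directly from the `pdfs` array in the snippet above by normalizing each row. This normalization step is my addition, not code from the article:

```python
# Responsibility gamma_k(x): each component's share of the total mixture density.
# Reuses the (n, K) array `pdfs` of weighted component densities from above.
gamma = pdfs / pdfs.sum(axis=1, keepdims=True)

# At every x, the K responsibilities sum to 1 (a distribution over clusters)
print(np.allclose(gamma.sum(axis=1), 1.0))  # True

plt.figure(figsize=(14, 6))
plt.stackplot(xx, gamma.T)
plt.title("responsibilities")
plt.show()
```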

A detour on the Gaussian distribution (three types of covariance matrix)

Now, when learning a Gaussian distribution, the covariance matrix $\Sigma$ is sometimes restricted to $\Sigma = \sigma^2 I$. To see what this means, let's summarize the three types of covariance matrix.

General symmetric covariance matrix

Consider a $D$-dimensional Gaussian distribution. Its covariance matrix $\Sigma$ can be expressed as follows.


\Sigma = \begin{pmatrix}
\sigma_{1}^2 & \sigma_{12} & \cdots & \sigma_{1D} \\
\sigma_{12} & \sigma_{2}^2 & & \vdots \\
\vdots & & \ddots & \\
\sigma_{1D} & \sigma_{2D} & \cdots & \sigma_{D}^2\\
\end{pmatrix}\\

Such a covariance matrix is called the general symmetric type. It has $D(D+1)/2$ free parameters (count the distinct variables in the matrix above; for $D = 2$, that is three: two variances and one covariance).

The covariance matrix is diagonal

Next, consider the case where the covariance matrix is diagonal.

\Sigma = \mathrm{diag}(\sigma_i^2) = \begin{pmatrix}
\sigma_{1}^2 & 0 & \cdots & 0 \\
0 & \sigma_{2}^2 & & \vdots \\
\vdots & & \ddots & \\
0 & 0 & \cdots & \sigma_{D}^2\\
\end{pmatrix}\\

In this case, there are $D$ free parameters, equal to the number of dimensions.

The covariance matrix is proportional to the identity matrix (isotropic)

Finally, consider the case where the covariance matrix is proportional to the identity matrix. This is called the isotropic covariance matrix.

\Sigma = \sigma^2\bf I = \sigma^2\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & & \vdots \\
\vdots & & \ddots & \\
0 & 0 & \cdots & 1\\
\end{pmatrix}\\

In this case, there is only one free parameter, $\sigma$. The probability densities for these three cases are shown below.

(Figure: probability densities for the general symmetric, diagonal, and isotropic covariance matrices)

Reducing the number of free parameters makes the computation easier and faster. On the other hand, the expressive power of the probability density decreases: the general symmetric form can also represent the diagonal and isotropic shapes, but not vice versa.

One idea for achieving both fast computation and sufficient expressive power is to introduce **latent (unobserved) variables**. A common approach is to gain expressiveness through latent variables combined with multiple Gaussian distributions, i.e., a Gaussian mixture.
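As a quick check of this trade-off, the following sketch (my own illustration; the covariance values are arbitrary) draws contour plots of a 2-D Gaussian under the three covariance structures:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st

# Three covariance structures for a 2-D Gaussian (values chosen only for illustration)
covs = {
    "general symmetric": np.array([[1.0, 0.8], [0.8, 1.5]]),  # D(D+1)/2 = 3 parameters
    "diagonal":          np.array([[1.0, 0.0], [0.0, 1.5]]),  # D = 2 parameters
    "isotropic":         1.2 * np.eye(2),                     # 1 parameter
}

gx, gy = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
grid = np.dstack((gx, gy))

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (name, cov) in zip(axes, covs.items()):
    pdf = st.multivariate_normal(mean=[0, 0], cov=cov).pdf(grid)
    ax.contour(gx, gy, pdf)  # general: tilted ellipses; diagonal: axis-aligned; isotropic: circles
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()
```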

Applying the EM algorithm to a Gaussian mixture distribution

Now, let's return to the main topic. The EM algorithm, the subject of this article, corresponds to item 3 below.

(Figure: a list of approaches, of which item 3 is the EM algorithm)

Given an $N \times K$ matrix $\bf Z$ whose row vectors are the latent variables $\bf z^T$, the log-likelihood function can be expressed as follows.

\begin{align}
\ln p(\bf X| \pi,\mu,\Sigma) &=\sum_{n=1}^{N}\ln\Bigl\{ \sum_{k=1}^{K} \pi_k\mathcal N(x_n|\mu_k, \Sigma_k) \Bigr\}

\end{align}

By maximizing this **log-likelihood function, we obtain a model that assigns high probability to the data $x$**. In other words, we carry out maximum likelihood estimation.

For the details behind the idea of likelihood, see the following article, which I referred to extensively.

[Statistics] What is likelihood? Let's explain graphically. https://qiita.com/kenmatsu4/items/b28d1b3b3d291d0cc698

However, maximizing this function directly is very difficult (the logarithm of a sum, $\ln \sum$, does not yield a closed-form solution). Therefore, we consider a method called the **EM algorithm**.
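Note that merely *evaluating* this log-likelihood is straightforward even though maximizing it analytically is not. Here is a minimal sketch for the 1-D case (my own helper function, not from the article, using scipy's `logsumexp` for numerical stability):

```python
import numpy as np
from scipy import stats as st
from scipy.special import logsumexp

def gmm_log_likelihood(x, pi, mu, sigma):
    """ln p(X) = sum_n ln { sum_k pi_k N(x_n | mu_k, sigma_k) } for a 1-D mixture."""
    # (N, K) matrix of ln pi_k + ln N(x_n | mu_k, sigma_k)
    log_terms = np.log(pi) + st.norm.logpdf(x[:, None], loc=mu, scale=sigma)
    # logsumexp over k avoids underflow when the densities are tiny
    return logsumexp(log_terms, axis=1).sum()

x = np.random.default_rng(0).normal(size=100)
print(gmm_log_likelihood(x, np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])))
```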

The **EM algorithm for a Gaussian mixture distribution** proceeds as follows.

(Figure: the steps of the EM algorithm for a Gaussian mixture distribution)

Given a Gaussian mixture model (or one you set up yourself), the goal is to adjust the **mean, covariance, and mixing coefficient of each component** so as to maximize the likelihood function.

This struck me as similar to updating the weight parameters of a neural network. There, the weights are updated using the gradient of the loss function in order to minimize it, and since computing that gradient analytically would require a huge amount of work, it is obtained by an algorithm called backpropagation.

Similarly, the EM algorithm computes and updates $\pi$, $\mu$, and $\Sigma$ in turn, repeating until a convergence test is satisfied, at which point the likelihood is (locally) maximized.

Parameter update

The EM algorithm for a Gaussian mixture requires updating the responsibility $\gamma(z_{nk})$, the mixing coefficient $\pi_k$, the mean $\mu_k$, and the covariance matrix $\Sigma_k$. The update formulas are obtained by differentiating the log-likelihood, setting the derivative to zero, and solving. The article below explains how to solve this in great detail, so please take a look.

Thorough explanation of EM algorithm https://qiita.com/kenmatsu4/items/59ea3e5dfa3d4c161efb

As a result, each parameter can be expressed as:


\gamma(z_{nk}) =\frac {\pi_k\mathcal N(x_n|\mu_k, \Sigma_k)}{\sum_{l=1}^{K}\pi_l\mathcal N(x_n|\mu_l, \Sigma_l)}\\

N_k = \sum_{n=1}^{N}\gamma(z_{nk})\\
\pi_k = \frac {N_k}{N}\\
\mu_k = \frac {1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\bf{ x_n}\\
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(\bf x_n -\mu_k)(\bf x_n -\mu_k)^T\\

Here $N_k$ can be read as the effective number of points assigned to cluster $k$.

As you can see, $\pi$, $\mu$, and $\Sigma$ all depend on $\gamma(z_{nk})$. For a single Gaussian distribution, maximum likelihood corresponds to the case where $\gamma(z_{nk}) = 1$, and the updates reduce to simply computing the sample mean and covariance, which makes them easy to interpret.
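Putting the four update formulas together, here is a minimal 1-D EM loop as a sketch. This is my own toy implementation of the formulas above, not the author's repository code linked below:

```python
import numpy as np
from scipy import stats as st

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM for a 1-D Gaussian mixture, following the update formulas above."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # Initialize: uniform weights, means picked from the data, a common spread
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk), shape (N, K)
        dens = pi * st.norm.pdf(x[:, None], loc=mu, scale=sigma)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: update pi, mu, sigma using N_k = sum_n gamma(z_nk)
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
        # Convergence test on the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma

# Example: recover two well-separated components
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x, K=2))
```

Each iteration increases the log-likelihood, which is exactly the convergence criterion described above.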

Try to implement

The program is stored here. https://github.com/Fumio-eisan/EM_20200509/upload

gmm_anim.gif (animation: the estimated means and density contours updating over the EM iterations)

The points are the component means, and the solid lines are contour lines of the probability density. You can see each cluster gradually being fitted.

In closing

This time, I summarized the EM algorithm for the Gaussian mixture distribution. I think building a visual understanding before the mathematical one makes it easier to follow.

I am still weak at formula manipulation and implementation, so I will keep working hands-on to deepen my understanding. I would also like to write an article about the EM algorithm in general.

The program is stored here. https://github.com/Fumio-eisan/EM_20200509/upload
