Modeling a multimodal distribution with the EM algorithm (Python)

Overview

When the distribution of a sample is multimodal, it cannot be modeled well by a single Gaussian distribution. Such a distribution can instead be modeled with a **Gaussian mixture distribution**, which combines several Gaussian distributions. This article shows an example of using the EM algorithm to determine the parameters of a Gaussian mixture.

First, a unimodal distribution

Maximum likelihood estimation

Let the sample be $x_n\ (n = 1, \dots, N)$. By maximum likelihood estimation for a Gaussian distribution, the mean and variance are obtained in the following forms.

\mu_{ML}=\frac{1}{N}\sum_{n=1}^N x_n \\
\sigma^2_{ML}=\frac{1}{N-1}\sum_{n=1}^N (x_n-\mu_{ML})^2

Strictly speaking, the maximum likelihood estimator of the variance divides by $N$; here we use the unbiased estimator instead, which divides by $N-1$.

Procedure

  1. Generate 10000 samples from a Gaussian distribution with mean 3.0 and variance 2.0.
  2. Estimate the mean and variance from the obtained sample using maximum likelihood estimation.

Source code

# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm  # mlab.normpdf was removed from Matplotlib
import math

# Function to plot a normalized histogram of the sample
def draw_hist(xs, bins):
    plt.hist(xs, bins=bins, density=True, alpha=0.5)  # `normed` was removed in Matplotlib 3.x

# Estimate the mean and variance of a sample by maximum likelihood
def predict(data):
    mu = np.mean(data)
    var = np.var(data, ddof=1)  # ddof=1 gives the unbiased estimator
    return mu, var

def main():
    # Generate N samples from a Gaussian distribution with mean mu and standard deviation std
    mu = 3.0
    v = 2.0
    std = math.sqrt(v)
    N = 10000
    data = np.random.normal(mu, std, N)
    # Estimate the mean and variance by maximum likelihood
    mu_predicted, var_predicted = predict(data)
    # Compute the standard deviation from the estimated variance
    std_predicted = math.sqrt(var_predicted)
    print("original: mu={0}, var={1}".format(mu, v))
    print(" predict: mu={0}, var={1}".format(mu_predicted, var_predicted))

    # Plot the results
    draw_hist(data, bins=40)
    xs = np.linspace(min(data), max(data), 200)
    pdf = norm.pdf(xs, mu_predicted, std_predicted)
    plt.plot(xs, pdf, color="red")
    plt.xlim(min(xs), max(xs))
    plt.xlabel("x")
    plt.ylabel("Probability")
    plt.show()


if __name__ == '__main__':
    main()

Execution result

The histogram (blue) shows the sample, and the red line shows the Gaussian distribution fitted with the estimated parameters.

original: mu=3.0, var=2.0
 predict: mu=2.98719564872, var=2.00297779707

single_sample.png

We were able to fit a suitable Gaussian model.

Next, a bimodal distribution

Dataset

We use the Old Faithful geyser dataset; it can be downloaded from the following link: Old Faithful. Each row records the duration of an eruption (first column) and the waiting time until the next eruption (second column).

This time we use only the eruption durations (the first column) as our sample. This sample gives the following bimodal distribution. mult_sample.png

Looking at the distribution, it clearly cannot be modeled by a single Gaussian. Let's instead model it with a Gaussian mixture that combines two Gaussian distributions. To do so, we must determine the means $\mu_1, \mu_2$, the variances $\sigma^2_1, \sigma^2_2$, and the mixing probability $\pi$. Writing the Gaussian density as $\phi(x|\mu, \sigma^2)$, the mixture distribution is given by the following equation.

y=(1-\pi)\phi(x|\mu_1, \sigma^2_1)+\pi\phi(x|\mu_2, \sigma^2_2)
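
As an aside, here is a minimal sketch of how samples could be drawn from such a mixture (the parameter values below are illustrative, not taken from the dataset): for each sample we pick the second component with probability $\pi$ and the first otherwise.

import numpy as np

# Illustrative parameters (not values fitted from the dataset)
pi = 0.6
mu1, var1 = 2.0, 0.06
mu2, var2 = 4.3, 0.19
N = 10000

rng = np.random.default_rng(0)
# For each sample, pick component 2 with probability pi, component 1 otherwise
z = rng.random(N) < pi
samples = np.where(z,
                   rng.normal(mu2, np.sqrt(var2), N),
                   rng.normal(mu1, np.sqrt(var1), N))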

EM algorithm

For a single Gaussian, the mean and variance can be found by maximizing the likelihood function analytically (see PRML §2.3.4). For a Gaussian mixture, however, the likelihood cannot be maximized analytically, so we maximize it with the EM algorithm, an optimization method that alternates between two steps: the E step and the M step.

See §8.5 of The Elements of Statistical Learning for the details of the algorithm; the English PDF can be downloaded for free.
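
For this two-component model, the E step computes, for each sample $x_n$, the responsibility $\gamma_n$ of the second component (the burden rate in the source code below):

\gamma_n=\frac{\pi\phi(x_n|\mu_2,\sigma^2_2)}{(1-\pi)\phi(x_n|\mu_1,\sigma^2_1)+\pi\phi(x_n|\mu_2,\sigma^2_2)}

The M step then re-estimates the parameters, weighting each sample by its responsibility:

\mu_2=\frac{\sum_n \gamma_n x_n}{\sum_n \gamma_n},\quad \sigma^2_2=\frac{\sum_n \gamma_n (x_n-\mu_2)^2}{\sum_n \gamma_n},\quad \pi=\frac{1}{N}\sum_n \gamma_n

with analogous updates for $\mu_1$ and $\sigma^2_1$ using the weights $1-\gamma_n$. The two steps are repeated until the log-likelihood stops improving.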

Source code

# -*- coding: utf-8 -*-

import numpy as np
import math
import random
import matplotlib.pyplot as plt
from scipy.stats import norm  # mlab.normpdf was removed from Matplotlib

# Gaussian probability density with mean m and variance v
def gaussian(x, m, v):
    p = math.exp(-pow(x - m, 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
    return p

# E step: compute the responsibility (burden rate) of the second
# component for each sample
def e_step(xs, ms, vs, p):
    burden_rates = []
    for x in xs:
        d = (1 - p) * gaussian(x, ms[0], vs[0]) + p * gaussian(x, ms[1], vs[1])
        n = p * gaussian(x, ms[1], vs[1])
        burden_rates.append(n / d)
    return burden_rates


# M step: re-estimate the parameters, weighting each sample by its
# responsibility (component 1 gets weight 1 - r, component 2 gets weight r)
def m_step(xs, burden_rates):
    d = sum(1 - r for r in burden_rates)
    n = sum((1 - r) * x for x, r in zip(xs, burden_rates))
    mu1 = n / d

    n = sum((1 - r) * pow(x - mu1, 2) for x, r in zip(xs, burden_rates))
    var1 = n / d

    d = sum(burden_rates)
    n = sum(r * x for x, r in zip(xs, burden_rates))
    mu2 = n / d

    n = sum(r * pow(x - mu2, 2) for x, r in zip(xs, burden_rates))
    var2 = n / d

    # The mixing probability is the average responsibility
    p = sum(burden_rates) / len(xs)

    return [mu1, mu2], [var1, var2], p


# Log-likelihood of the data under the current mixture parameters
def calc_log_likelihood(xs, ms, vs, p):
    s = 0
    for x in xs:
        g1 = gaussian(x, ms[0], vs[0])
        g2 = gaussian(x, ms[1], vs[1])
        s += math.log((1 - p) * g1 + p * g2)
    return s

# Function to plot a normalized histogram of the sample
def draw_hist(xs, bins):
    plt.hist(xs, bins=bins, density=True, alpha=0.5)  # `normed` was removed in Matplotlib 3.x

def main():
    # Read the eruption durations (first column) from the dataset
    data = []
    with open("faithful.txt") as fp:
        for row in fp:
            data.append(float(row.split()[0]))
    # Set initial values for ms, vs, and p
    p = 0.5
    ms = [random.choice(data), random.choice(data)]
    vs = [np.var(data), np.var(data)]
    T = 50  # number of iterations
    ls = []  # log-likelihood after each iteration
    # Run the EM algorithm
    for t in range(T):
        burden_rates = e_step(data, ms, vs, p)
        ms, vs, p = m_step(data, burden_rates)
        ls.append(calc_log_likelihood(data, ms, vs, p))

    print("predict: mu1={0}, mu2={1}, v1={2}, v2={3}, p={4}".format(
        ms[0], ms[1], vs[0], vs[1], p))
    # Plot the results
    plt.subplot(211)
    xs = np.linspace(min(data), max(data), 200)
    norm1 = norm.pdf(xs, ms[0], math.sqrt(vs[0]))
    norm2 = norm.pdf(xs, ms[1], math.sqrt(vs[1]))
    draw_hist(data, 20)
    plt.plot(xs, (1 - p) * norm1 + p * norm2, color="red", lw=3)
    plt.xlim(min(data), max(data))
    plt.xlabel("x")
    plt.ylabel("Probability")

    plt.subplot(212)
    plt.plot(np.arange(len(ls)), ls)
    plt.xlabel("step")
    plt.ylabel("log_likelihood")
    plt.show()

if __name__ == '__main__':
    main()

Execution result

predict: mu1=2.01860781706, mu2=4.27334342119, v1=0.0555176191851, v2=0.191024193785, p=0.651595365985

EM.png

The top panel shows the sample from the dataset modeled with the Gaussian mixture. The bottom panel shows how the log-likelihood increases as the EM algorithm iterates.

Other

Initial values in the EM algorithm

As initial values, we used two randomly chosen samples for the means, the sample variance for both variances, and 0.5 for the mixing probability. There are many ways to choose initial values; the choice of initialization alone could apparently fill a paper.
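
For example, one simple alternative (a sketch of a hypothetical init_params helper, not what the code above does) is to seed the two means from a low and a high quantile of the sample, so the components start on opposite sides of the data:

import numpy as np

# Hypothetical quantile-based initialization (an alternative to random choice)
def init_params(data):
    ms = [np.percentile(data, 25), np.percentile(data, 75)]
    vs = [np.var(data), np.var(data)]
    p = 0.5
    return ms, vs, p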

Distributions with more than two modes

Since this sample has a bimodal distribution, we combined two Gaussian distributions. Of course, three or more Gaussians can be combined, but the number of parameters to determine grows accordingly.

The dimensionality of the dataset

This time we used a one-dimensional sample, but the EM algorithm also works for mixtures of multivariate Gaussians. A multivariate Gaussian has considerably more parameters to determine (a mean vector and a covariance matrix) than a one-dimensional one.
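
As a cross-check, the same kind of model can be fitted with scikit-learn's GaussianMixture, which also handles the multivariate case. A minimal sketch, assuming scikit-learn is installed and that faithful.txt contains two whitespace-separated numeric columns with no header:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.loadtxt("faithful.txt")  # shape (N, 2): eruption duration, waiting time
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)      # mixing probabilities
print(gmm.means_)        # one mean vector per component
print(gmm.covariances_)  # one covariance matrix per component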
