[PYTHON] Mixed Gaussian distribution and logsumexp

Overview

The mixed Gaussian distribution (Gaussian mixture) is a weighted sum of multiple Gaussian distributions and can represent a multimodal distribution. Its parameters can be optimized with the EM algorithm. The E-step of the EM algorithm computes the expected values of the latent variables, but if that computation is done naively, overflow and underflow can occur. This problem can be avoided with the well-known numerical technique called logsumexp.

Overview of mixed Gaussian distribution and logsumexp

Regarding parameter optimization of the mixed Gaussian distribution by the EM algorithm, only the part related to logsumexp is shown briefly here. For more details, please refer to ["Pattern Recognition and Machine Learning"](http://www.amazon.co.jp/%E3%83%91%E3%82%BF%E3%83%BC%E3%83%B3%E8%AA%8D%E8%AD%98%E3%81%A8%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92-%E4%B8%8B-%E3%83%99%E3%82%A4%E3%82%BA%E7%90%86%E8%AB%96%E3%81%AB%E3%82%88%E3%82%8B%E7%B5%B1%E8%A8%88%E7%9A%84%E4%BA%88%E6%B8%AC-CM-%E3%83%93%E3%82%B7%E3%83%A7%E3%83%83%E3%83%97/dp/4621061240/ref=pd_sim_14_1?ie=UTF8&refRID=1GS4SDYG8VEYR40YSADB). (I have also uploaded the slides from a talk I gave on PRML 9.0-9.2 to SlideShare.)

\begin{aligned}
p(x|\theta) &= \sum_k \pi_k N(x|\mu_k,\Sigma_k)\\
\pi_k &: \text{mixing coefficient}
\end{aligned}

The distribution of the observed variable $x$ of the mixed Gaussian distribution is expressed by the above equation. To apply the EM algorithm, we introduce a latent variable $z$ and write the joint distribution of $x$ and $z$ as follows.

\begin{aligned}
p(x,z) &= \prod_k \pi_k^{z_k} N(x|\mu_k,\Sigma_k)^{z_k}\\
z &: \text{1-of-K coded}
\end{aligned}
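
To make the role of the 1-of-K latent variable concrete, here is a minimal sketch of the generative process implied by these equations (the 1-D parameters `pi`, `mu`, `sigma` are made up for illustration): $z$ first selects a component according to $\pi$, then $x$ is drawn from the selected Gaussian.

import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters of a 1-D mixture with two components (illustration only).
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])

# z is 1-of-K: exactly one entry is 1, selecting the component.
k = rng.choice(len(pi), p=pi)
z = np.eye(len(pi), dtype=int)[k]

# x is drawn from the Gaussian of the selected component.
x = rng.normal(mu[k], sigma[k])
print(z, x)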

Using these, the expected value of the latent variable $z$ can be calculated from Bayes' theorem as follows.

\begin{aligned}
\gamma(z_{nk})
&= E[z_{nk}] \\ 
&=\frac{\pi_k N(x_n|\mu_k,\Sigma_k)}{\sum_{k^\prime}\pi_{k^\prime} N(x_n|\mu_{k^\prime},\Sigma_{k^\prime})}
\end{aligned}

The E-step of the EM algorithm for the mixed Gaussian distribution computes this expected value of the latent variable. Looking at the denominator, because the Gaussian densities are summed, the exponentials inside them have to be evaluated even if we would like to work on the log scale, so underflow (or overflow) can occur. Some workaround is therefore needed, and one such method is logsumexp.
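
As a concrete illustration, here is a minimal sketch of the naive responsibility computation with made-up 1-D parameters (`pi`, `mu`, `sigma` and the point `x` are hypothetical, chosen only to trigger underflow): both densities underflow to zero and the division produces NaN.

import numpy as np
from scipy.stats import norm

# Hypothetical 1-D mixture parameters (illustration only).
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 1.0])

x = 60.0  # a point far away from both components

# Naive E-step: evaluate the densities directly.
dens = pi * norm.pdf(x, loc=mu, scale=sigma)   # both terms underflow to 0.0
gamma = dens / dens.sum()                      # 0 / 0 -> nan
print(dens)   # [0. 0.]
print(gamma)  # [nan nan]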

logsumexp

\log(\sum^N_{i=1} \exp(x_i))

Even if the result of this computation itself would neither overflow nor underflow, the individual $\exp(x_i)$ terms may. Therefore, the formula is rewritten as follows.

\begin{aligned}
\log(\sum^N_{i=1} \exp(x_i)) 
&= \log\{\exp(x_{max})\sum^N_{i=1} \exp(x_i - x_{max})\} \\
& = \log\{\sum^N_{i=1} \exp(x_i - x_{max})\}  + x_{max}
\end{aligned}

By computing in this way, the arguments of the exponentials fall within the range $[x_{min} - x_{max}, 0]$, so the possibility of overflow and underflow is greatly reduced.
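
The transformation above translates directly into code. Below is a minimal sketch of a 1-D logsumexp (the function name `logsumexp_1d` and the test values are my own, for illustration), compared with the naive computation on inputs that overflow `np.exp`.

import numpy as np

def logsumexp_1d(x):
    # log(sum(exp(x))) computed by shifting with the maximum (minimal sketch).
    x = np.asarray(x, dtype=float)
    x_max = x.max()
    # exp() now only sees values in [x_min - x_max, 0], so it cannot overflow.
    return np.log(np.sum(np.exp(x - x_max))) + x_max

x = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(x))))  # naive: exp(1000) overflows -> inf
print(logsumexp_1d(x))            # ~1002.41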

This logsumexp is also implemented in Python's machine learning library scikit-learn and is easy to use (see "Utilities for Developers" in the scikit-learn documentation). The implementation lives in sklearn/utils/extmath.py and is shown below. You can see that, apart from the handling of the axis argument, it is exactly the same as the formula shown above.


def logsumexp(arr, axis=0):
    """Computes the sum of arr assuming arr is in the log domain.
    Returns log(sum(exp(arr))) while minimizing the possibility of
    over/underflow.
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.utils.extmath import logsumexp
    >>> a = np.arange(10)
    >>> np.log(np.sum(np.exp(a)))
    9.4586297444267107
    >>> logsumexp(a)
    9.4586297444267107
    """
    arr = np.rollaxis(arr, axis)
    # Use the max to normalize, as with the log this is what accumulates
    # the less errors
    vmax = arr.max(axis=0)
    out = np.log(np.sum(np.exp(arr - vmax), axis=0))
    out += vmax
    return out
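
Note that recent versions of scikit-learn no longer ship this helper; if it is not available in your environment, `scipy.special.logsumexp` provides the same functionality:

import numpy as np
from scipy.special import logsumexp  # plays the same role as sklearn.utils.extmath.logsumexp

a = np.arange(10)
print(np.log(np.sum(np.exp(a))))  # 9.458629744426711
print(logsumexp(a))               # same value, computed safely
print(logsumexp(a, axis=0))       # the axis argument works the same way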

Implementation of mixed Gaussian model

Based on the above, I implemented the mixed Gaussian model myself; the code is available on GitHub (https://github.com/seataK/machine-learning/blob/master/GaussianMixture/GaussianMixture.py) and reproduced below.

import numpy as np
import random
import pylab as plt
from sklearn.utils import extmath
from sklearn.cluster import KMeans
import sys


na = np.newaxis


class DataFormatter:
    def __init__(self, X):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)

    def standarize(self, X):
        return (X - self.mean[na, :]) / self.std[na, :]


def log_gaussian(X, mu, cov):
    # Log of the multivariate Gaussian density N(x | mu, cov), evaluated for every row of X.
    d = X.shape[1]
    det_sig = np.linalg.det(cov)
    A = 1.0 / (2*np.pi)**(d/2.0) * 1.0 / det_sig**(0.5)
    x_mu = X - mu[na, :]
    inv_cov = np.linalg.inv(cov)
    ex = - 0.5 * np.sum(x_mu[:, :, na] * inv_cov[na, :, :] *
                        x_mu[:, na, :], axis=(1, 2))
    return np.log(A) + ex


class GMM:
    def __init__(self,
                 K=2,
                 max_iter=300,
                 diag=False):
        self.K = K
        self.data_form = None
        self.pi = None
        self.mean = None
        self.cov = None
        self.max_iter = max_iter
        self.diag = diag

    def fit(self, _X):
        self.data_form = DataFormatter(_X)
        X = self.data_form.standarize(_X)
        N = X.shape[0]
        D = X.shape[1]
        K = self.K

        # init parameters using K-means
        kmeans = KMeans(n_clusters=self.K)

        kmeans.fit(X)

        self.mean = kmeans.cluster_centers_

        self.cov = np.array([[[1 if i == j else 0
                             for i in range(D)]
                             for j in range(D)]
                             for k in range(K)])

        self.pi = np.ones(K) / K

        # Optimization
        for _ in range(self.max_iter):
            # E-step

            gam_nk = self._gam(X)

            # M-step
            Nk = np.sum(gam_nk, axis=0)

            self.pi = Nk / N

            self.mean = np.sum(gam_nk[:, :, na] * X[:, na, :],
                               axis=0) / Nk[:, na]

            x_mu_nkd = X[:, na, :] - self.mean[na, :, :]

            self.cov = np.sum(gam_nk[:, :, na, na] *
                              x_mu_nkd[:, :, :, na] *
                              x_mu_nkd[:, :, na, :],
                              axis=0) / Nk[:, na, na]

            if(self.diag):
                for k in range(K):
                    var = np.diag(self.cov[k])
                    self.cov[k] = np.array([[var[i] if i == j else 0
                                             for i in range(D)]
                                            for j in range(D)])

    def _gam(self, X):
        # E-step: responsibilities gamma(z_nk), computed in the log domain.
        log_gs_nk = np.array([log_gaussian(X, self.mean[i], self.cov[i])
                              for i in range(self.K)]).T

        # log(pi_k) + log N(x_n | mu_k, Sigma_k)
        log_pi_gs_nk = np.log(self.pi)[na, :] + log_gs_nk

        # Normalize over k with logsumexp to avoid under/overflow.
        log_gam_nk = log_pi_gs_nk - extmath.logsumexp(log_pi_gs_nk, axis=1)[:, na]

        return np.exp(log_gam_nk)

    def predict(self, _X):
        X = self.data_form.standarize(_X)

        gam_nk = self._gam(X)

        return np.argmax(gam_nk, axis=1)
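
As a quick usage check, the class can be applied to synthetic data like this (a minimal sketch; the cluster centers and sample sizes are made up for illustration):

import numpy as np

np.random.seed(0)

# Two synthetic 2-D clusters (made-up parameters, illustration only).
X1 = np.random.randn(200, 2) + np.array([0.0, 0.0])
X2 = np.random.randn(200, 2) + np.array([5.0, 5.0])
X = np.vstack([X1, X2])

gmm = GMM(K=2, max_iter=50)
gmm.fit(X)
labels = gmm.predict(X)

print(gmm.pi)        # mixing coefficients, roughly [0.5, 0.5]
print(labels[:10])   # cluster assignments of the first 10 points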

Other

Many log and exp operations appear in logsumexp, so it tends to be slow (it has even been jokingly called "a dark history of humankind"). If speed becomes a bottleneck, it may be better to use another method.
