One common clustering method is to fit a Gaussian mixture model, which expresses and clusters a dataset with multiple Gaussian distributions. Normally a finite number of Gaussian distributions is assumed, but what kind of clustering would we get if there were infinitely many?
In this article, referring to the Machine Learning Professional Series book "Nonparametric Bayes Point Process and Statistical Machine Learning Mathematics", I introduce the outline and an implementation of the "infinite Gaussian mixture model" and "nonparametric Bayes". I recommend the book because it is very easy to understand.
I will skip the detailed theory and focus on the outline and the overall story; for the implementation, I will only show the source code.
In an ordinary Gaussian mixture model, the dataset is represented by a finite number of Gaussian distributions, and each data point is considered to be associated with one of them. The number of clusters (= Gaussian distributions) must be decided in advance, but the appropriate number varies from dataset to dataset and is not easy to determine.
If the number of clusters were determined automatically during clustering, we would not have to worry about this issue. The **infinite Gaussian mixture model** makes this possible.
The infinite Gaussian mixture model is based on the idea that there are infinitely many Gaussian distributions, of which only a finite number are associated with the data in the dataset. The following figure shows the clustering result that appears again later in this article. (I reuse the same figure here to keep things simple.)
This dataset is represented by three Gaussian distributions. Whereas an ordinary Gaussian mixture model assumes there are exactly three Gaussian distributions in the world, the infinite Gaussian mixture model assumes there are infinitely many potential Gaussian distributions not yet tied to the dataset. With this view, in addition to the probability that a data point connects to one of the three existing clusters, we can also compute the probability that it connects to an as-yet-unknown cluster whose mean we do not know.
In other words, when clustering with an infinite Gaussian mixture model, data points can be assigned to new clusters while existing clusters that end up with no data disappear; the number of clusters grows and shrinks and eventually converges to an appropriate value.
From here on I will show the formulas. In an ordinary Gaussian mixture model, the probability distribution of $z_i$ (before $x_{1:N}$[^4] is observed) is a **categorical distribution**.
```math
P(z_i = k | \pi) = \pi_k
```
Here $\pi := (\pi_1, \cdots, \pi_K)$ is a point on the $(K-1)$-dimensional simplex.[^1] The value of $\pi$ can be determined by assuming a uniform distribution, or by Bayesian estimation with a prior distribution[^2].
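As a concrete illustration, a single $z_i$ can be drawn from a categorical distribution like this (a minimal sketch with a made-up $\pi$):

```python
import numpy as np

rng = np.random.RandomState(0)
pi = np.array([0.5, 0.3, 0.2])  # made-up mixture weights (a point on the 2-dimensional simplex)

one_hot = rng.multinomial(1, pi)    # one-hot draw from Categorical(pi)
z_i = np.where(one_hot == 1)[0][0]  # convert the one-hot vector to a cluster index
print(z_i)
```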
The infinite Gaussian mixture model, on the other hand, uses a stochastic process called the **Chinese restaurant process**. This is a mechanism for computing the probability distribution of the cluster $z_i$ of the $i$-th data point from the cluster assignments $z_{1:N}^{\backslash i}$ of all the other data points.
```math
P(z_i = k | z_{1:N}^{\backslash i}, \alpha) =
\left\{
\begin{array}{ll}
\frac{n_k}{N - 1 + \alpha} & (k = 1, \cdots, K) \\
\frac{\alpha}{N - 1 + \alpha} & (k = K + 1)
\end{array}
\right.
```
$k = 1, \cdots, K$ corresponds to the existing clusters and $k = K + 1$ to a new cluster. Here $n_k$ is the number of indices $j$ with $j \neq i$ and $z_j = k$, and $\alpha$ is a hyperparameter that takes a positive value. As the formula shows, the larger $\alpha$ is, the more likely a data point is to be assigned to a new cluster. From here on I will write this probability distribution as ${\rm CRP}(k | z_{1:N}^{\backslash i}, \alpha)$.
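As a sanity check of the formula, here is a minimal sketch that computes these Chinese restaurant process probabilities for a made-up assignment vector `z` (the helper name `crp_probs` is mine, not from the book):

```python
import numpy as np

def crp_probs(z, i, alpha):
    """P(z_i = k | other assignments, alpha) for existing clusters k = 1..K and a new cluster K+1."""
    z = np.asarray(z)
    others = np.delete(z, i)  # assignments of all points except the i-th
    n_total = len(z)
    n_clusters = z.max() + 1
    counts = np.array([(others == k).sum() for k in range(n_clusters)])  # n_k
    return np.append(counts, alpha) / (n_total - 1 + alpha)  # last entry: probability of a new cluster

z = [0, 0, 1, 2, 2, 2]               # hypothetical current assignments
print(crp_probs(z, i=0, alpha=0.5))  # -> [0.1818..., 0.1818..., 0.5454..., 0.0909...]
```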
Next, consider the probability distribution once $x_{1:N}$ has been observed. In an ordinary Gaussian mixture model it can be computed as follows.
```math
\begin{align}
P(z_i = k | x_{1:N}, \mu_{1:K}, \Sigma_{1:K}, \pi)
&= \frac{P(z_i = k | \pi) N(x_i | \mu_k, \Sigma_k)}{\sum_{l=1}^{K}P(z_i = l | \pi) N(x_i | \mu_l, \Sigma_l)} \\
&= \frac{\pi_k N(x_i | \mu_k, \Sigma_k)}{\sum_{l=1}^{K}\pi_l N(x_i | \mu_l, \Sigma_l)}
\end{align}
```
$N(x | \mu, \Sigma)$ is the probability density function of the multivariate Gaussian distribution. In the infinite Gaussian mixture model, the corresponding distribution is obtained by replacing $P(z_i = k | \pi)$ in the equation above with the Chinese restaurant process.
```math
\begin{align}
P(z_i = k | x_{1:N}, \mu_{1:K}, \Sigma_{1:K}, z_{1:N}^{\backslash i}, \alpha)
&= \frac{{\rm CRP}(k | z_{1:N}^{\backslash i}, \alpha) N(x_i | \mu_k, \Sigma_k)}{\sum_{l=1}^{K+1}{\rm CRP}(l | z_{1:N}^{\backslash i}, \alpha) N(x_i | \mu_l, \Sigma_l)}
\end{align}
```
There is one caveat here. For the existing clusters $k = 1, \cdots, K$ we can estimate the mean $\mu_k$ and the covariance matrix $\Sigma_k$ from the data belonging to each cluster. For the new cluster $k = K + 1$, however, the mean and covariance matrix cannot be estimated directly, because no data belong to it yet. Therefore, this time we adopt the following approach.
- Assume a prior distribution on $\mu_k$, marginalize the joint distribution $P(x_i, \mu_k | \cdots)$, and obtain the probability distribution without explicitly estimating $\mu_k$.
- Let $\Sigma_k$ be the matrix $\sigma^2 I$, common to all clusters ($\sigma^2$ is a hyperparameter).
For the means, Bayesian inference and marginalization let us compute the probability distribution from the prior alone. For the covariance matrix, fixing it as above is sufficient for clustering, so we do not estimate it this time.[^5] Of course, it is also possible to estimate the variances as well and cluster with them.
Since it would be awkward to treat the existing and new clusters separately, this method is applied uniformly to $k = 1, \cdots, K + 1$. Please refer to the previous article for the marginalization itself; the resulting predictive density is sketched below.
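For reference, with a Gaussian prior $N(\mu_{\rm pri}, \sigma_{\rm pri}^2 I)$ on each $\mu_k$ and the fixed covariance $\sigma^2 I$, marginalizing out $\mu_k$ yields a Gaussian predictive density. This is the standard conjugate-Gaussian result that the implementation below computes; the notation $\hat{\mu}_k$, $\hat{\sigma}_k^2$, $\mu_{\rm pri}$, $\sigma_{\rm pri}^2$ is mine.

```math
P(x_i \mid z_i = k, x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i})
= N\!\left(x_i \,\middle|\, \hat{\mu}_k,\ (\hat{\sigma}_k^2 + \sigma^2) I\right),
\qquad
\hat{\sigma}_k^2 = \left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_{\rm pri}^2}\right)^{-1},
\qquad
\hat{\mu}_k = \hat{\sigma}_k^2 \left(\frac{n_k \bar{x}_k}{\sigma^2} + \frac{\mu_{\rm pri}}{\sigma_{\rm pri}^2}\right)
```

Here $n_k$ and $\bar{x}_k$ are the count and mean of the data currently assigned to cluster $k$, excluding $x_i$; for the new cluster $n_k = 0$, so the predictive reduces to $N(x_i \mid \mu_{\rm pri}, (\sigma_{\rm pri}^2 + \sigma^2) I)$.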
To estimate the clusters $z_{1:N}$ of the $N$ data points, we sample them one by one, from $z_1$ to $z_N$, according to this probability distribution. This method is called **Gibbs sampling**, and methods based on an infinite-dimensional probability distribution like this one are in particular called **nonparametric Bayes**.
To run this procedure we need some initial clustering $z_{1:N}$. Here we start with all points in the same cluster $k = 1$ and repeat the sweep from $z_1$ to $z_N$ several times. Although there is only one cluster at first, the picture is that data far from its mean get attached to new clusters, and clustering proceeds while the number of clusters grows.
I implemented clustering with the infinite Gaussian mixture model. It is based on the implementation from last time, but it has been polished throughout, so the differences may be large.
```python
from collections import Counter
from functools import partial
from typing import Sequence, Tuple

import numpy as np
from scipy.special import logsumexp


def _log_gaussian(x: np.ndarray, mu: np.ndarray, var: float) -> np.ndarray:
    # In the previous article the 2 * np.pi factor was omitted, but this time it is included for clarity.
    d = x.shape[0]
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return - d * np.log(2 * np.pi * var) / 2 - norm_squared / (2 * var)


# Array class that stores the cluster assignments
class InfiniteClusterArray(object):

    def __init__(self, array: Sequence[int]):
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)
        self._n_clusters = max(array) + 1

    def update(self, i: int, k: int):
        assert k <= self._n_clusters

        pre_value = self._array[i]
        if pre_value == k:
            return

        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

        if k == self._n_clusters:
            self._n_clusters += 1

        # If the previous cluster became empty, remove it and relabel the remaining clusters.
        if not self._counter[pre_value]:
            self._array -= 1 * (self._array > pre_value)
            self._counter = Counter(self._array)
            self._n_clusters -= 1

    @property
    def array(self) -> np.ndarray:
        return self._array.copy()

    @property
    def n_clusters(self) -> int:
        return self._n_clusters

    def count(self, k: int, excluded=None) -> int:
        # Number of points in cluster k, optionally not counting the data point `excluded`.
        assert (excluded is None) or isinstance(excluded, int)
        if excluded is None:
            return self._counter[k]
        counted = self._counter[k] - bool(self._array[excluded] == k)
        return counted

    def __getitem__(self, i: int) -> int:
        return self._array[i]


# Array class that stores the dataset
class DataArray(object):

    def __init__(self, array: np.ndarray):
        assert array.ndim == 2
        self._array = array

    @property
    def n(self) -> int:
        return self._array.shape[0]

    @property
    def d(self) -> int:
        return self._array.shape[1]

    def mean(self, excluded=None) -> np.ndarray:
        assert (excluded is None) or isinstance(excluded, Sequence)
        if excluded is None:
            return self._array.mean(axis=0)
        else:
            return self._array[excluded].mean(axis=0)

    def __getitem__(self, i):
        return self._array.__getitem__(i)

    def __iter__(self):
        for i in range(self.n):
            yield self._array[i]

    def __str__(self) -> str:
        return f'DataArray({self._array})'


# Gibbs sampling class for the infinite Gaussian mixture model
class IGMMGibbsSampling(object):

    def __init__(self, var: float, var_pri: float, alpha: float, seed=None):
        # The mean of the prior distribution is set to the mean of the entire dataset.
        assert (var > 0) and (var_pri > 0)
        assert alpha > 0

        self.var = var
        self.var_pri = var_pri
        self.alpha = alpha
        self._random = np.random.RandomState(seed)

    def predict_clusters(self, X: np.ndarray, n_iter: int, init_z=None) -> np.ndarray:
        # Method that estimates the cluster assignments
        assert X.ndim == 2
        X = DataArray(X)
        mu_pri = X.mean()

        if init_z is None:
            init_z = np.zeros(X.n, dtype=int)
        z = InfiniteClusterArray(init_z)

        for _ in range(n_iter):
            compute_probs = partial(self._compute_pmf_zi, X, z, mu_pri)
            for i in range(X.n):
                probs = compute_probs(i)
                z_i = self._random.multinomial(1, probs)
                z_i = np.where(z_i)[0][0]  # One-hot to index
                z.update(i, z_i)

        return z.array

    def _compute_pmf_zi(self, X: DataArray, z: InfiniteClusterArray, mu_pri: np.ndarray, i: int) -> np.ndarray:
        mu, var = self._compute_marginal_distribution(X, z, mu_pri, i)

        # CRP weights: n_k / (N - 1 + alpha) for existing clusters, alpha / (N - 1 + alpha) for a new one
        pi = np.array([
            z.count(k, excluded=i) / (X.n - 1 + self.alpha)
            if k < z.n_clusters
            else self.alpha / (X.n - 1 + self.alpha)
            for k in range(z.n_clusters + 1)
        ])

        log_pdf_xi = _log_gaussian(X[i], mu, var)
        with np.errstate(divide='ignore'):
            # np.log(pi) can be -inf if a cluster becomes empty after excluding x_i; suppress the warning
            pmf_zi = np.exp(np.log(pi) + log_pdf_xi - logsumexp(np.log(pi) + log_pdf_xi)[np.newaxis])

        return pmf_zi

    def _compute_marginal_distribution(self, X: DataArray, z: InfiniteClusterArray,
                                       mu_pri: np.ndarray, i: int) -> Tuple[np.ndarray, float]:
        # Parameters of the Gaussian predictive density of each cluster after marginalizing out its mean
        n_vec = np.array([
            z.count(k, excluded=i)
            for k in range(z.n_clusters + 1)
        ], dtype=int)
        tmp_vec = 1 / (n_vec / self.var + 1 / self.var_pri)

        mu = np.tile(mu_pri / self.var_pri, (z.n_clusters + 1, 1))  # Shape (z.n_clusters + 1, X.d)
        for k in range(z.n_clusters):
            excluded_list = [
                j for j in range(X.n)
                if (j != i) and (z[j] == k)
            ]
            if excluded_list:
                mean = X.mean(excluded=excluded_list)
                mu[k] += n_vec[k] * mean / self.var
        mu *= tmp_vec[:, np.newaxis]

        var = tmp_vec + self.var

        return mu, var
```
As a test, let's run clustering with the class implemented above. This time we apply it to the following dataset.
I'm using a Jupyter Notebook.
```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
%matplotlib inline

np.random.seed(0)
cmap = plt.rcParams['axes.prop_cycle'].by_key()['color']

pi = [0.6, 0.2, 0.2]
mu = np.array([[1, 2], [-2, 0], [3, -2]])
var = [0.8, 0.4, 0.2]
size = 100
ndim = 2

def make_data():
    z = np.random.multinomial(1, pvals=pi)
    z = np.where(z == 1)[0][0]
    cov = var[z] * np.eye(2)
    return np.random.multivariate_normal(mu[z], cov)

X = np.array([make_data() for _ in range(100)])

fig = plt.figure()
ax = plt.axes()
ax.scatter(X[:, 0], X[:, 1], color='gray')
ax.set_aspect('equal')
plt.show()
```
This is the dataset shown earlier, before clustering. There are 100 data points, each drawn at random from one of three Gaussian distributions, so splitting them into three clusters is the appropriate result.
Now let's cluster.
```python
model = IGMMGibbsSampling(var=0.5, var_pri=4, alpha=0.1, seed=0)  # Pass hyperparameters to the constructor
clusters = model.predict_clusters(X, n_iter=10)  # Run clustering

# Visualization
fig = plt.figure()
ax = plt.axes()
for k in range(max(clusters) + 1):
    data = np.array([
        X[i, :] for i in range(X.shape[0])
        if clusters[i] == k
    ])
    ax.scatter(data[:, 0], data[:, 1], label=f'cluster {k + 1}', color=cmap[k])
ax.set_aspect('equal')
plt.legend()
plt.show()
```
The within-cluster variance `var` is set to $0.5$, smaller than in the dataset; the variance `var_pri` of the prior on the mean is set to a large $4$ so that it behaves almost like a uniform distribution; and the Chinese restaurant process parameter `alpha` is set to $0.1$. As noted above, the mean of the prior distribution on the means is implemented to be the mean of the entire dataset. We pass the dataset and the number of iterations `n_iter` to the clustering method `predict_clusters`. Here the number of iterations is set to 10.
The result is as follows.
Even though I did not specify the number of clusters and started with every point in the same cluster, the data were divided into three clusters.
While we are at it, let's look at how the clustering evolves with the number of iterations.
The clustering results at `n_iter = 2, 4, 6, 8, 10` are shown below.
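These snapshots can be produced with something like the following sketch, which simply reruns the sampler with different iteration counts (reusing `X`, `cmap`, and the hyperparameters from above; because each model is re-created with the same seed, shorter runs reproduce the early iterations of longer ones):

```python
# Sketch: rerun the sampler for several values of n_iter and plot each clustering result.
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, n_iter in zip(axes, [2, 4, 6, 8, 10]):
    model = IGMMGibbsSampling(var=0.5, var_pri=4, alpha=0.1, seed=0)
    clusters = model.predict_clusters(X, n_iter=n_iter)
    for k in range(max(clusters) + 1):
        data = X[clusters == k]
        ax.scatter(data[:, 0], data[:, 1], color=cmap[k % len(cmap)])
    ax.set_title(f'n_iter = {n_iter}')
    ax.set_aspect('equal')
plt.show()
```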
It seems that a large number of clusters were created from the initial state, and then the number of clusters gradually decreased to three.
Incidentally, even when I increased `n_iter` beyond this, the number of clusters essentially stayed at three.
As mentioned above, the larger the Chinese restaurant process hyperparameter $\alpha$, the easier it is for new clusters to form. The original motivation was to avoid having to estimate the number of clusters, so if we had to struggle to tune $\alpha$, much of the point of introducing it would be lost. When I experimented with different values of $\alpha$, my impression was that simply making it small basically causes no problems. The following figure shows the result with $\alpha = 0.0001$ and 10 iterations.
The data are still clearly divided into three clusters. Even when $\alpha$ is made quite small, clustering seems to work fine without a large impact on the number of iterations required.
On the other hand, with $\alpha = 1$, five clusters remained after 10 iterations and the count did not drop to three afterwards. If $\alpha$ is too large, the number of clusters cannot be determined well. In that sense it behaves like the step size of an optimization algorithm in machine learning.
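One rough way to check this sensitivity is to sweep over a few values of $\alpha$ and count the resulting clusters, as in the following sketch (reusing `X` from above; the exact counts depend on the seed and the data):

```python
# Sketch: number of clusters found after 10 iterations for several alpha values.
for alpha in [0.0001, 0.01, 0.1, 1]:
    model = IGMMGibbsSampling(var=0.5, var_pri=4, alpha=alpha, seed=0)
    clusters = model.predict_clusters(X, n_iter=10)
    print(f'alpha = {alpha}: {max(clusters) + 1} clusters')
```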
In this article I tried clustering with the infinite Gaussian mixture model. It was interesting to see the data split into exactly the intended number of clusters.
[^1]: The set of $(\pi_1, \cdots, \pi_K)$ satisfying $\pi_1, \cdots, \pi_K \ge 0$ and $\sum_{k=1}^K \pi_k = 1$ is called the $(K-1)$-dimensional simplex. Simplices originally come from geometry: the $0$-dimensional simplex is a point, the $1$-dimensional simplex a line segment, and the $2$-dimensional simplex a triangle. If you read books on Bayesian inference, you will see this set used in the context of probability.
[^2]: There is also the option of estimating $\pi$ by maximum likelihood from the observations $z_{1:N}^{\backslash i}$, but note that with this method, once a cluster $k$ disappears we always have $\pi_k = 0$ and the cluster can never be revived.
[^4]: The $N$ variables $x_1, \cdots, x_N$ are abbreviated as $x_{1:N}$. Likewise, $x_{1:N}^{\backslash i}$ denotes the $N-1$ variables excluding the $i$-th.
[^5]: Even with $\Sigma_k$ fixed to a value common to all clusters, I think clustering can be performed with roughly the same accuracy as k-means. This is because the k-means method is equivalent to maximum likelihood estimation of each $\mu_k$ and $z_i$ in a Gaussian mixture model with the variance fixed as above.