[Python] Implementation of clustering using a Gaussian mixture model

I am reading the Machine Learning Professional Series book "Nonparametric Bayes: Point Processes and the Mathematics of Statistical Machine Learning" and implementing the models it describes in Python. Nonparametric Bayes is an extension of the Gaussian mixture model. I ultimately want to implement nonparametric Bayes, but in preparation for that, this article implements a Gaussian mixture model.

-"Introduction to Machine Learning by Bayesian Inference" Approximate Inference of Poisson Mixed Model Implemented Only with Python numpy -Mixed Gaussian distribution and logsumexp

Environment

- Python: 3.7.5
- numpy: 1.18.1
- pandas: 0.25.3
- matplotlib: 3.1.2
- seaborn: 0.9.0

Symbols used in this article

Before getting into the main topic, here are some mathematical symbols used in this article that are rarely seen elsewhere.

- $x_{1:N}$: $x_1, \cdots, x_N$
- $x_{1:N}^{\backslash i}$: $x_{1:N}$ with $x_i$ removed

In addition, the vectors in this article are column vectors and are not shown in bold.

What is a Gaussian mixture model?

A Gaussian mixture model (GMM) is a model of a dataset in which data generated from multiple Gaussian distributions are mixed together. Let's take the Iris dataset as an example. This is a dataset of iris measurements and species, containing a mixture of three species: "setosa", "versicolor", and "virginica". Many readers will be familiar with it, since it is often used in machine learning classification exercises. The dataset has four explanatory variables, but for simplicity we take just two of them, "petal_length" and "petal_width", and draw two scatter plots: one without labels and one color-coded by iris species.

iris_valualization_1.png

Source code

```python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
df = sns.load_dataset('iris')
ax1.scatter(df['petal_length'], df['petal_width'], color='gray')
ax1.set_title('without label')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')
for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax2.scatter(x, y, label=sp)
ax2.legend()
ax2.set_title('with label')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')
plt.show()
```

Looking at the figure on the right, the data can be interpreted as being generated from multivariate Gaussian distributions[^1] whose means differ by iris species. This is a Gaussian mixture model. Rough means are shown in the table below.

| species | petal_length | petal_width |
| --- | --- | --- |
| setosa | about 1.5 | about 0.25 |
| versicolor | about 4.5 | about 1.3 |
| virginica | about 6.0 | about 2.0 |
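
These rough values can be checked directly with pandas (a quick verification snippet I'm adding here; it is not part of the original article's code):

```python
import seaborn as sns

df = sns.load_dataset('iris')
# per-species means of the two variables plotted above
print(df.groupby('species')[['petal_length', 'petal_width']].mean().round(2))
```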

With a Gaussian mixture model, we can estimate the means of the multiple Gaussian distributions from the unlabeled dataset on the left, and cluster the data by associating each point with one of those Gaussians.

Mathematical explanation

The Gaussian mixture model is described as a probabilistic model. Let $z_i$ denote the cluster of data point $x_i$, and let $N(\mu_k, \Sigma_k)$ denote the multivariate Gaussian distribution corresponding to each cluster $k = 1, \cdots, K$. Then, when $x_i$ belongs to cluster $k$, its generation probability can be written as follows, where $D$ is the dimension of $x_i$ and $N$ is the number of data points.

```math
\begin{align}
P(x_i | z_i=k, x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K})
&= N(x_i | \mu_k, \Sigma_k)
\end{align}
```

On the other hand, $z_i$ is also treated probabilistically. Suppose $z_i$ is generated from a categorical distribution with parameter $\pi := (\pi_1, \cdots, \pi_K)^T$, where $K$ is the number of clusters.

```math
P(z_i = k | \pi) = \pi_k
```
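
For instance, a draw from this categorical distribution can be made with numpy as follows (a minimal illustration; the mixing weights are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])   # illustrative mixing weights
z_i = rng.choice(len(pi), p=pi)  # draw a cluster index k with probability pi_k
```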

From the above, the probability that the cluster $z_i$ of data point $x_i$ equals $k$ can be written as follows.[^2]

```math
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
= \frac{\pi_k N(x_i| \mu_k, \Sigma_k)}{\sum_{l=1}^K \pi_l N(x_i| \mu_l, \Sigma_l)} \qquad\cdots (1)
```

Clustering is performed by estimating each $z_i$ from the dataset using this equation.
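
As a concrete illustration, equation $(1)$ can be evaluated directly with numpy. This is a sketch I'm adding for clarity (the function name and shapes are my own; the article's actual implementation uses a simplified form derived later):

```python
import numpy as np

def cluster_probs(x_i, mus, Sigmas, pi):
    """P(z_i = k | ...) of equation (1) for one data point x_i.

    x_i: (D,), mus: (K, D), Sigmas: (K, D, D), pi: (K,).
    """
    D = x_i.shape[0]
    dens = np.array([
        np.exp(-0.5 * (x_i - mu) @ np.linalg.inv(S) @ (x_i - mu))
        / np.sqrt((2 * np.pi) ** D * np.linalg.det(S))
        for mu, S in zip(mus, Sigmas)
    ])
    w = pi * dens       # numerator of equation (1) for each k
    return w / w.sum()  # normalize by the denominator
```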

How to estimate the cluster

In order to estimate the clusters $z_{1:N}$ from equation $(1)$, we need to know the parameters $\mu_{1:K}, \Sigma_{1:K}$ of the multivariate Gaussian distributions corresponding to each cluster, as well as the categorical distribution parameter $\pi$. These are not given explicitly, so they must either be estimated from the dataset or fixed to assumed values. This time, we estimate the means $\mu_{1:K}$ from the dataset and fix the remaining parameters to the following values, where $I_D$ is the $D \times D$ identity matrix and $\sigma^2$ is a hyperparameter.[^6]

```math
\Sigma_1 = \cdots = \Sigma_K = \sigma^2 I_D \\
\pi = (1 / K, \cdots, 1 / K)^T
```

Then, to estimate $\mu_{1:K}$ and $z_{1:N}$, we adopt the approach of sampling them one variable at a time from their conditional distributions. This method is called **Gibbs sampling**.[^9][^10]

Sampling $z_i$ can be done according to equation $(1)$, so let's consider the probability distribution of $\mu_k$. Since we want to estimate a probability distribution, Bayesian estimation applies. We define the mean $\mu_{\rm pri}$ and covariance matrix $\Sigma_{\rm pri}$ of the conjugate prior of $\mu_k$ as follows.[^7] These are common to all $k = 1, \cdots, K$, and $\sigma_{\rm pri}^2$ is a hyperparameter.

```math
\begin{align}
\mu_{\rm pri} &= (0, \cdots, 0)^T \\
\Sigma_{\rm pri} &= \sigma_{\rm pri}^2 I_D
\end{align}
```

Then the posterior distribution of $\mu_k$ is as follows, where $n_k$ is the number of data points in $x_{1:N}$ belonging to cluster $k$ and $\bar{x}_k$ is their mean.

```math
\begin{align}
\mu_{k, {\rm pos}} &= \Sigma_{k, {\rm pos}}(n_k \Sigma_k^{-1} \overline{x}_k + \Sigma_{\rm pri}^{-1}\mu_{\rm pri})  \\
\Sigma_{k, {\rm pos}} &= (n_k \Sigma_{k}^{-1} + \Sigma_{\rm pri}^{-1})^{-1} \\
\mu_k &\sim N(\mu_{k, {\rm pos}}, \Sigma_{k, {\rm pos}}) \qquad\cdots (2)
\end{align}
```

This time, $\Sigma_k$ and $\Sigma_{\rm pri}$ are both constant multiples of $I_D$, so $\mu_{k, {\rm pos}}$ and $\Sigma_{k, {\rm pos}}$ can be rewritten as follows.

```math
\begin{align}
\mu_{k, {\rm pos}} &= \Sigma_{k, {\rm pos}}\left(\frac{n_k}{\sigma^2} \overline{x}_k + \frac{1}{\sigma_{\rm pri}^2}\mu_{\rm pri}\right)  \\
\Sigma_{k, {\rm pos}} &= \left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_{\rm pri}^{2}}\right)^{-1} I_D
\end{align}
```

From the above, sampling $\mu_k$ can be performed according to equation $(2)$.
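
Concretely, under these isotropic assumptions the sampler for $\mu_k$ fits in a few lines of numpy. This is a sketch I'm adding here (the names are my own; it mirrors the `_sample_mu_k` method of the full implementation below):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mu_k(X_k, var, var_pri, mu_pri):
    """Draw mu_k from the posterior of equation (2).

    X_k: (n_k, D) data currently assigned to cluster k.
    var, var_pri: the hyperparameters sigma^2 and sigma_pri^2.
    """
    n_k, D = X_k.shape
    var_pos = 1 / (n_k / var + 1 / var_pri)
    mu_pos = var_pos * (n_k / var * X_k.mean(axis=0) + mu_pri / var_pri)
    return rng.multivariate_normal(mu_pos, var_pos * np.eye(D))
```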

Implementation

Here are some important points in the implementation.

An array class that can count the number of data points in each cluster

From equation $(2)$, we need the number of data points belonging to each cluster. So we implement an array class ClusterArray with a count method that returns this number.[^8] Besides count, the operations we are likely to need are forwarded to an internal numpy.ndarray.

```python
import numpy as np
from collections import Counter

class ClusterArray(object):
    def __init__(self, array):
        # array is a one-dimensional list or array of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # keep self._counter up to date when an element is reassigned
        pre_value = self._array[i]
        if pre_value == k:
            return

        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]
```
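
For example (a small usage check I'm adding, not from the original article):

```python
z = ClusterArray([0, 1, 1, 2])
z[0] = 1            # reassign data point 0 to cluster 1
print(z.count(1))   # -> 3
print(z.count(0))   # -> 0
```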

Removing unnecessary calculations

The probability density function in equation $(1)$ can be computed directly, but the amount of computation can be reduced by dropping the factors that do not depend on $k$.

```math
\begin{align}
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i},  \mu_{1:K}, \Sigma_{1:K}, \pi)
&= \frac{\pi_k \exp\{- \frac{1}{2}\log|\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T\Sigma_k^{-1}(x_i - \mu_k)\}}{\sum_{l=1}^K \pi_l \exp\{- \frac{1}{2}\log|\Sigma_l| - \frac{1}{2} (x_i - \mu_l)^T\Sigma_l^{-1}(x_i - \mu_l)\}}  \\
&= \frac{\pi_k \exp\{- \frac{1}{2 \sigma^2} \| x_i - \mu_k \|_2^2\}}{\sum_{l=1}^{K}\pi_l \exp\{- \frac{1}{2 \sigma^2} \| x_i - \mu_l \|_2^2\}}
\end{align}
```

We implement a method log_deformed_gaussian that computes the content of $\exp$ in this expression and use it in the calculation.

```python
def log_deformed_gaussian(x, mu, var):
    # log of the Gaussian density with the k-independent factors dropped
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -norm_squared / (2 * var)
```
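
Thanks to numpy broadcasting, this works for one point against all $K$ means at once, which is how it is used below (a small usage illustration; the values are the rough means from the table above):

```python
import numpy as np

mus = np.array([[1.5, 0.25], [4.5, 1.3], [6.0, 2.0]])  # (K, D) cluster means
x_i = np.array([5.8, 2.1])                             # (D,) one data point
print(log_deformed_gaussian(x_i, mus, var=0.1))        # (K,) log-weights, one per cluster
```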

About logsumexp

For calculations of the form $\log(\sum_{i=1} \exp f_i(x))$, logsumexp is a technique for preventing overflow and underflow. However, it is not very efficient because it applies $\log$ and $\exp$ many times, so we will not use logsumexp this time.[^5]

For logsumexp, the following article is very helpful: Mixed Gaussian distribution and logsumexp.
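
For reference, the trick itself fits in a few lines; here is a minimal sketch of what we are choosing to skip (the function name is my own):

```python
import numpy as np

def normalize_log_weights(log_w):
    # subtracting the max before exponentiating prevents np.exp from overflowing
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```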

Overall implementation

I implemented the model while keeping the above points in mind. With scikit-learn in mind, clustering can be run with a fit method. The source code is long, so it is folded below.

Source code

```python
import numpy as np
from collections import Counter


def log_deformed_gaussian(x, mu, var):
    # log of the Gaussian density with the k-independent factors dropped
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -norm_squared / (2 * var)


class ClusterArray(object):
    def __init__(self, array):
        # array is a one-dimensional list or array of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # keep self._counter up to date when an element is reassigned
        pre_value = self._array[i]
        if pre_value == k:
            return

        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]


class GaussianMixtureClustering(object):
    def __init__(self, K, D, var=1, var_pri=1, seed=None):
        self.K = K  # number of clusters
        self.D = D  # dimension of the explanatory variables (fixed in the constructor for ease of implementation)
        self.z = None

        # parameters of the probability distributions
        self.mu = np.zeros((self.K, self.D))
        self.var = var  # fixed, common to all clusters
        self.pi = np.full(self.K, 1 / self.K)  # fixed, common to all clusters

        # prior distribution settings
        self.mu_pri = np.zeros(self.D)
        self.var_pri = var_pri

        self._random = np.random.RandomState(seed)

    def fit(self, X, n_iter):
        init_z = self._random.randint(0, self.K, X.shape[0])
        self.z = ClusterArray(init_z)

        for _ in range(n_iter):
            for k in range(self.K):
                self.mu[k] = self._sample_mu_k(X, k)
            for i, x_i in enumerate(X):
                self.z[i] = self._sample_zi(x_i)

    def _sample_zi(self, x_i):
        log_probs_xi = log_deformed_gaussian(x_i, self.mu, self.var)

        probs_zi = np.exp(log_probs_xi) * self.pi
        probs_zi = probs_zi / probs_zi.sum()

        z_i = self._random.multinomial(1, probs_zi)
        z_i = np.where(z_i)[0][0]
        return z_i

    def _sample_mu_k(self, X, k):
        n_k = self.z.count(k)
        if n_k == 0:
            # empty cluster: fall back to sampling from the prior
            return self._random.multivariate_normal(self.mu_pri, self.var_pri * np.eye(self.D))

        xk_bar = X[self.z.array == k].mean(axis=0)
        var_pos = 1 / (n_k / self.var + 1 / self.var_pri)
        mu_pos = var_pos * (xk_bar * n_k / self.var + self.mu_pri / self.var_pri)

        mu_k = self._random.multivariate_normal(mu_pos, var_pos * np.eye(self.D))
        return mu_k
```

Try clustering

Let's cluster the Iris dataset from the opening section with the implemented Gaussian mixture model. The hyperparameters are $\sigma^2 = 0.1$ and $\sigma_{\rm pri}^2 = 1$, and the number of Gibbs sampling iterations is 10. Comparing the actual dataset labels with the clustering result gives the following.

clustering_valualization.png

Source code

```python
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


# load the dataset
df = sns.load_dataset('iris')
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values

# clustering with the Gaussian mixture model
gmc = GaussianMixtureClustering(K=3, D=4, var=0.1, seed=1)
gmc.fit(X, n_iter=10)
df['GMM_cluster'] = gmc.z.array

# visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax1.scatter(x, y, label=sp)
ax1.legend()
ax1.set_title('species')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')
for k in range(gmc.K):
    x = df[df['GMM_cluster'] == k]['petal_length']
    y = df[df['GMM_cluster'] == k]['petal_width']
    ax2.scatter(x, y, label=k)
ax2.legend()
ax2.set_title('GMM cluster')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')
plt.show()
```

The boundary between "versicolor" and "virginica" is a bit shaky, but on the whole the clustering matches the labels well. I also visualized the clustering result using seaborn's pairplot.

clustering_valualization_2.png

Source code

```python
sns.pairplot(
    df.drop(columns=['species']),
    vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    hue='GMM_cluster'
)
```

The clustering looks good. Looking at the diagonal plots of the pairplot, you can see that the dataset is well represented by a Gaussian mixture model.

In conclusion

Following the Machine Learning Professional Series book "Nonparametric Bayes: Point Processes and the Mathematics of Statistical Machine Learning", I implemented a Gaussian mixture model and tried some simple clustering with it. This time we dealt with a Gaussian mixture model, but you can also consider models such as a "Bernoulli mixture model" or a "Poisson mixture model" that can likewise be used for clustering. Next time, I will write about marginalized Gibbs sampling, a technique needed for nonparametric Bayes.

(Added January 21, 2020) I have written the follow-up article: [Python] Implementation of marginalized Gibbs sampling

[^1]: The variances also differ, but for simplicity we only consider the means here.
[^2]: This can be derived using Bayes' theorem.
[^5]: I judged it acceptable to skip logsumexp because overflow and underflow can be avoided by choosing a $\sigma^2$ that is neither too large nor too small for the dataset.
[^6]: $\sigma^2$ is the variance; all covariances are $0$. This may seem like a crude assumption, but with a sensible choice of $\sigma^2$ you can still cluster well enough, and the simplification reduces the amount of computation. Of course, estimating these from the dataset should allow more accurate clustering.
[^7]: The conjugate prior of $\mu_k$ is a multivariate Gaussian distribution.
[^8]: It would also be fine to keep a Counter instance outside instead of creating a class like this; I just like encapsulating such things in classes.
[^9]: Gibbs sampling is a technique for approximating joint distributions that are difficult to compute analytically. Here we approximate $P(\mu_{1:K}, z_{1:N} | \cdots)$.
[^10]: Other methods can also be used to estimate a Gaussian mixture, but my understanding is that Gibbs sampling is needed for the extension to nonparametric Bayes. To be honest, I don't fully understand this, so please let me know if you do.
