[Python] G-means: clustering that automatically determines the number of clusters

This article is the entry for day 6 of the Freee Data People Advent Calendar 2019. I started writing it in the middle of the night before the deadline, giggling to myself as I went.

Introduction

Do you know a library called PyClustering? PyClustering is a clustering-focused library usable from Python and C++. An algorithm called G-means was newly implemented in PyClustering v0.9.2. This was the first time I had seen the name G-means, and I couldn't find an article about it in Japanese, so I looked it up and summarized it here. The algorithm itself is simple, so it may actually be easiest to read the paper directly.

G-means

G-means is an extension of K-means: it automatically determines the number of clusters, which in K-means is a parameter. There is a similar method called X-means, which is introduced clearly, with code, in "I examined the X-means method that automatically estimates the number of clusters". (Incidentally, X-means is also available in PyClustering.)

Algorithm

The algorithm is as follows (a small code sketch of the statistical test follows the list). Like X-means, it starts with a small number of clusters and subdivides them; only the stopping condition differs.

  1. Choose a few initial center points appropriately.
  2. Run K-means clustering using the center points from step 1 as the initial values.
  3. Perform a statistical test (the Anderson-Darling test) to check whether the samples in each cluster from step 2 follow a Gaussian distribution. [^1]
  4. If the null hypothesis is rejected, split the cluster in two; if it is not rejected, fix the cluster as it is. [^2]
  5. Repeat steps 2 through 4 until no more clusters are split.

[^1]: The G in G-means stands for Gaussian.
[^2]: The paper uses $\alpha = 0.0001$, but PyClustering seems to run the test with $\alpha = 0.01$. Since the paper argues that it is better to keep the probability of a Type I error (false positive) small, I do wonder whether this implementation choice is really fine.
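
As a concrete example of the test in step 3, here is a minimal sketch using scipy (my own illustration, not PyClustering's implementation). Note that `scipy.stats.anderson` only tabulates critical values at the 15%, 10%, 5%, 2.5%, and 1% significance levels, so the closest available level to PyClustering's apparent $\alpha = 0.01$ is the last one:

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
samples = rng.normal(size=200)  # a 1-D sample (in G-means, a projection; see below)

result = anderson(samples, dist='norm')
# result.significance_level is [15., 10., 5., 2.5, 1.],
# so index -1 corresponds to alpha = 0.01.
rejected = result.statistic >= result.critical_values[-1]
print(rejected)  # False: a Gaussian sample is (usually) not rejected
```

In G-means this test is applied not to the raw samples but to a one-dimensional projection of them, as described below.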

That's all there is to the algorithm itself, but the paper also covers two more points:

- how to determine the new center points when a cluster is split in step 4, and
- a trick for making the Anderson-Darling test easier to perform.

How to determine the new initial center points when splitting a cluster

- When the center point of the cluster is $\mathbf{c}$, the two points $\mathbf{c} \pm \mathbf{m}$ become the new initial center points.
- Two ways of choosing $\mathbf{m}$ are proposed:
  - choose it at random, or
  - run principal component analysis on the samples in the cluster and, with the first principal component $\mathbf{s}$ and its eigenvalue (the variance along that component) $\lambda$, set $\mathbf{m} = \mathbf{s} \sqrt{2 \lambda / \pi}$. [^3]

[^3]: Intuitively, the idea is to split along the axis in the direction in which the samples are most spread out.

The paper adopts the second method (the one based on the principal component), but PyClustering [appears to use the first, random method](https://github.com/annoviko/pyclustering/blob/3457a2457c9aa385f625d628cec19dd6fada8326/pyclustering/cluster/gmeans.py#L331).
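
As an illustration, computing $\mathbf{m}$ with the principal-component rule might look like the following sketch (a direct transcription of the formula above; the function name is mine, not PyClustering's):

```python
import numpy as np

def split_centers(cluster_samples, center):
    """Return the two new initial centers c + m and c - m, where
    m = s * sqrt(2 * lambda / pi) for the first principal component s
    with eigenvalue (variance) lambda."""
    cov = np.cov(cluster_samples, rowvar=False)  # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    s = eigvecs[:, -1]                           # first principal component
    lam = eigvals[-1]                            # its eigenvalue (variance)
    m = s * np.sqrt(2.0 * lam / np.pi)
    return center + m, center - m
```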

A trick to make the Anderson-Darling test easier

Statistical testing is difficult in high dimensions if the samples are used as they are, so they are projected down to one dimension. Specifically, a sample $\mathbf{x}$ is projected to a one-dimensional value $x^{\prime}$ as follows.

$$
x^{\prime} = \frac{\langle \mathbf{x}, \mathbf{v} \rangle}{\| \mathbf{v} \|^2}
$$

Here $\mathbf{v} = \mathbf{c}_1 - \mathbf{c}_2$, where $\mathbf{c}_1$ and $\mathbf{c}_2$ are the two points determined as in "How to determine the new initial center points when splitting a cluster" above. In other words, the samples are projected onto the axis connecting the two candidate centers (the direction of the first principal component, i.e. the direction of greatest spread) and then tested for normality.
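
In code, this projection might look like the following minimal sketch (again, the names are mine, not PyClustering's):

```python
import numpy as np

def project_onto_v(samples, c1, c2):
    """Project d-dimensional samples onto v = c1 - c2, yielding the
    1-D values x' = <x, v> / ||v||^2 that the test is applied to."""
    v = c1 - c2
    return samples @ v / (v @ v)
```

The Anderson-Darling test from the algorithm section is then run on these one-dimensional values.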

Practice

Now that we understand the algorithm, let's actually run clustering with PyClustering. For now, I used [the same data as this article](https://qiita.com/deaikei/items/8615362d320c76e2ce0b#%E3%81%95%E3%82%89%E3%81%AB%E3%83%A2%E3%83%A4%E3%83%83%E3%81%A8%E3%81%97%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%E3%81%95%E3%81%9B%E3%81%A6%E3%81%BF%E3%82%8B).

Dummy data generation
```python
from sklearn.datasets import make_blobs

# Generate 500 two-dimensional samples drawn from 8 blobs.
X, Y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=8,
                  cluster_std=1.5,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)
```

Correct label

Correct label plot
```python
import seaborn as sns

sns.set(style="whitegrid")
# Plot the samples colored by their true blob label.
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=Y, legend="full"
)
```

Since we set centers=8, eight colors appear, as expected.

(Figure: scatter plot of the correct labels)

G-means

Now for the main topic: what does it look like when we use G-means?

gmeans result plot
```python
from pyclustering.cluster import gmeans
import numpy as np

# Run G-means; the number of clusters is determined automatically.
gmeans_instance = gmeans.gmeans(X).process()

clusters = gmeans_instance.get_clusters()  # list of index lists, one per cluster
centers = gmeans_instance.get_centers()

# Convert the per-cluster index lists into a flat label array for plotting.
labels = np.zeros(len(X), dtype=int)
for cluster_id, cluster in enumerate(clusters):
    labels[cluster] = cluster_id

sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=labels, legend="full"
)
```

Oh! The number of clusters is not eight, but they are grouped in a way that feels natural. Isn't this pretty nice?

(Figure: G-means clustering result)

If you need to cluster data without any idea of how many clusters to expect, G-means seems like a good option.

What about its rival, X-means?

Finally, let's compare the results by running PyClustering's X-means on the same data.

xmeans result plot
```python
from pyclustering.cluster import xmeans
import numpy as np

# Run X-means on the same data.
xmeans_instance = xmeans.xmeans(X).process()

clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()

# Convert the per-cluster index lists into a flat label array for plotting.
labels = np.zeros(len(X), dtype=int)
for cluster_id, cluster in enumerate(clusters):
    labels[cluster] = cluster_id

sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=labels, legend="full"
)
```

Hmm? The result is completely different from the [X-means article I referenced](https://qiita.com/deaikei/items/8615362d320c76e2ce0b#%E3%81%95%E3%82%89%E3%81%AB%E3%83%A2%E3%83%A4%E3%83%83%E3%81%A8%E3%81%97%E3%81%9F%E3%83%87%E3%83%BC%E3%82%BF%E3%82%92%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0%E3%81%95%E3%81%9B%E3%81%A6%E3%81%BF%E3%82%8B).

(Figure: X-means clustering result)

The G-means paper also reports experiments in which X-means overfits and overestimates the number of clusters, so this result is not that surprising... (I haven't dug into the cause here.) At least within PyClustering, G-means seems to be the better choice over X-means.

In conclusion

I have summarized the G-means algorithm that was recently(?) implemented in PyClustering, together with the results of actually trying it. In my quick experiment, G-means gave better results than X-means, so it may be the better first choice when you just want to try clustering something.

Phew. I somehow made the Advent Calendar deadline... Time for bed... no, wait, it's already morning!!!
