--k-means: Minimizes the squared error from each cluster's center of gravity (centroid).
--k-medoids: Runs an EM-style procedure that minimizes the sum of dissimilarities from each cluster's medoid (the member point of the cluster that minimizes the sum of dissimilarities to the other members).
--x-means: Controls cluster splitting based on BIC.
--g-means: Controls cluster splitting with the Anderson-Darling test, assuming the data follow a normal distribution.
--gx-means: Combines the two extensions above.
--etc. (See the pyclustering README; it lists many more.)
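To make the difference between a centroid and a medoid concrete, here is a minimal sketch in plain Python. The data values and the helper name `total_dissimilarity` are made up for illustration, and absolute difference is assumed as the dissimilarity measure.

```python
# Toy 1-D cluster with one outlier (made-up values for illustration).
points = [1.0, 2.0, 3.0, 4.0, 100.0]

# k-means representative: the mean, which minimizes the sum of squared errors.
centroid = sum(points) / len(points)


def total_dissimilarity(candidate, data):
    """Sum of absolute differences from `candidate` to every point (hypothetical helper)."""
    return sum(abs(candidate - p) for p in data)


# k-medoids representative: the member point with the smallest total dissimilarity.
medoid = min(points, key=lambda c: total_dissimilarity(c, points))

print(centroid)  # 22.0
print(medoid)    # 3.0
```

The centroid is pulled toward the outlier and is not a data point, while the medoid is always one of the cluster's own members.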
It would be nice if a human could look at the data and immediately tell the number of clusters, but that is rarely possible, so I want a quantitative way to decide.
According to sklearn cheat sheet
The elbow method is also useful, but in my experience I rarely got a clean elbow (a point where the curve bends sharply), so I was often unsure about the number of clusters.
x-means is a clustering method that determines the number of clusters fully automatically.
Below, I show how to use "pyclustering", a library that contains various clustering methods including x-means.
pyclustering is a library of clustering algorithms implemented in both Python and C++.
Dependent packages: scipy, matplotlib, numpy, PIL
pip install pyclustering
In addition to the EM steps of k-means, x-means adds a new step: it decides whether a cluster is better represented by one normal distribution or by two, and if two is more appropriate, it splits the cluster in two.
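The split decision can be sketched roughly as follows. This is a simplified 1-D illustration under several assumptions, not pyclustering's implementation: the cluster is cut at its mean instead of running a local 2-means, each half gets its own variance, and the standard "lower is better" form of BIC is used. The function names are hypothetical.

```python
import math


def gaussian_loglik(data):
    """Maximized log-likelihood of `data` under a single 1-D normal distribution."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    var = max(var, 1e-12)  # guard against zero variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def bic(loglik, n_params, n):
    """Standard BIC: n_params * ln(n) - 2 * loglik (lower is better)."""
    return n_params * math.log(n) - 2 * loglik


def should_split(data):
    """Return True if two normal distributions describe `data` better than one."""
    n = len(data)
    mean = sum(data) / n
    # Assumption: cut the cluster at its mean (the real algorithm runs 2-means).
    left = [x for x in data if x < mean]
    right = [x for x in data if x >= mean]
    if len(left) < 2 or len(right) < 2:
        return False
    # One distribution: 2 parameters (mean, variance).
    bic_one = bic(gaussian_loglik(data), 2, n)
    # Two distributions: 2 means + 2 variances + 1 mixing weight = 5 parameters.
    loglik_two = sum(
        gaussian_loglik(part) + len(part) * math.log(len(part) / n)
        for part in (left, right)
    )
    bic_two = bic(loglik_two, 5, n)
    return bic_two < bic_one


separated = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
single = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
print(should_split(separated))  # True
print(should_split(single))     # False
```

The extra parameters of the two-distribution model are penalized by the `n_params * ln(n)` term, so a split is accepted only when the likelihood gain outweighs that penalty.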
The code below was run in Jupyter Notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing

# Wine dataset
df_wine_all = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
# Use the variety (column 0, values 1-3), the color (column 10) and the amount of proline (column 13)
df_wine = df_wine_all[[0, 10, 13]].copy()
df_wine.columns = ['class', 'color', 'proline']

# Data shaping: standardize the two features
X = df_wine[['color', 'proline']]
sc = preprocessing.StandardScaler()
sc.fit(X)
X_norm = sc.transform(X)

# Plot the original classes (top panel)
%matplotlib inline
x = X_norm[:, 0]
y = X_norm[:, 1]
z = df_wine['class']
plt.figure(figsize=(10, 10))
plt.subplot(4, 1, 1)
plt.scatter(x, y, c=z)

# x-means, starting from 2 initial centers
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
xm_c = kmeans_plusplus_initializer(X_norm, 2).initialize()
xm_i = xmeans(data=X_norm, initial_centers=xm_c, kmax=20, ccore=True)
xm_i.process()

# Plot the results (bottom panel): one label per data row
z_xm = np.ones(X_norm.shape[0])
for k, members in enumerate(xm_i.get_clusters()):
    z_xm[members] = k + 1
plt.subplot(4, 1, 2)
plt.scatter(x, y, c=z_xm)
centers = np.array(xm_i.get_centers())
plt.scatter(centers[:, 0], centers[:, 1], s=250, marker='*', c='red')
plt.show()  # called once at the end so both panels appear in one figure
The top figure is colored by the original class of each data point, and the bottom figure shows the clustering result of x-means. The ★ marks are the centers of gravity of each cluster.
In the code, xm_c = kmeans_plusplus_initializer(X_norm, 2).initialize() sets the initial number of clusters to 2, but the data are still properly clustered into 3.
x-means itself is run with xm_i.process().
For the x-means instance (xm_i in the code above), comparing the instance variables before and after fitting shows what the fitted result looks like. For example, listing the names of the instance variables gives
dict_keys(['_xmeans__pointer_data', '_xmeans__clusters', '_xmeans__centers', '_xmeans__kmax', '_xmeans__tolerance', '_xmeans__criterion', '_xmeans__ccore'])
and I think it is worth inspecting each of these.
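The `_xmeans__` prefix on these keys looks like the result of Python's name mangling, which renames an attribute written as `self.__name` inside a class to `_ClassName__name`. The sketch below uses a hypothetical stand-in class (`xmeans_like`, not part of pyclustering) and the built-in `vars()` as one way to list instance variables. Whether the original listing was produced with `vars()` is an assumption.

```python
class xmeans_like:
    """Hypothetical stand-in class; not the pyclustering implementation."""

    def __init__(self, kmax):
        self.__kmax = kmax    # stored as _xmeans_like__kmax
        self.__clusters = []  # stored as _xmeans_like__clusters


model = xmeans_like(kmax=20)

# vars() returns the instance's attribute dictionary, keyed by mangled names.
keys = list(vars(model).keys())
print(keys)  # ['_xmeans_like__kmax', '_xmeans_like__clusters']

# The mangled name is still reachable from outside the class.
print(model._xmeans_like__kmax)  # 20
```

Because such attributes are meant to be internal, the public accessor methods of a library are usually the safer choice in code that has to keep working across versions.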
--_xmeans__pointer_data: A copy of the data to be clustered.
--_xmeans__clusters: A list showing which rows of the original data (_xmeans__pointer_data) belong to each cluster. The list has as many elements as there are clusters; each element is itself a list holding the row numbers that belong to that cluster.
--_xmeans__centers: A list of the centroid coordinates (each a list) of every cluster.
--_xmeans__kmax: The maximum number of clusters (a setting).
--_xmeans__tolerance: A constant that defines the stopping condition of the x-means iteration. The algorithm terminates when the maximum change in the cluster centroids falls below this constant.
--_xmeans__criterion: The criterion used to decide whether to split a cluster. Default: BIC.
--_xmeans__ccore: The setting for whether to use the C++ code instead of the Python code.
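Since the cluster result is stored as one list of row numbers per cluster, plotting usually needs it converted into one label per data row. A minimal sketch of that conversion in plain Python follows. The function name `clusters_to_labels` and the example index lists are made up for illustration.

```python
def clusters_to_labels(clusters, n_points):
    """Convert per-cluster lists of row numbers into one cluster id per row.

    Rows that appear in no cluster keep the label -1.
    """
    labels = [-1] * n_points
    for cluster_id, members in enumerate(clusters):
        for row in members:
            labels[row] = cluster_id
    return labels


# Made-up result in the same layout: 3 clusters over 6 data rows.
clusters = [[0, 3], [1, 4, 5], [2]]
labels = clusters_to_labels(clusters, 6)
print(labels)  # [0, 1, 2, 0, 1, 1]
```

The resulting list can be passed directly as the color argument of a scatter plot.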