I previously posted an article about k-means. Since k-means suffers from a dependency on its initial values, an algorithm called k-means++ was developed to overcome this problem. This time, I have summarized what I learned about k-means++.
In learning about k-means++, I referred to the following.
- Introduction to Machine Learning for Language Processing (Natural Language Processing Series), Daiya Takamura (Author), Manabu Okumura (Supervisor), Corona Publishing
- Essence of Machine Learning, Koichi Kato (Author), SB Creative Co., Ltd.
- k-means++ method - Wikipedia
k-means is an algorithm that first assigns the data to tentative clusters and then repeatedly adjusts the assignments, using the mean of each cluster, so that the data ends up well divided. It is called the k-means method because it creates an arbitrarily specified number k of clusters based on their means.
Specifically, k-means follows this process:

1. Randomly assign each data point to one of the k clusters.
2. Compute the mean (centroid) of each cluster.
3. Reassign each data point to the cluster whose centroid is closest.
4. Repeat steps 2 and 3 until the assignments no longer change.
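Written as a formula, this procedure can be seen as (locally) minimizing the within-cluster sum of squared distances, where $C_i$ is the i-th cluster and $\mu_i$ is its mean:

$$
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
$$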
**Since k-means assigns the initial clusters randomly, depending on those initial values the result may be a clustering that is far from optimal. In addition, depending on the initial values it can take a long time for the result to converge.**
k-means++
k-means++ is an algorithm that aims to overcome the initial value dependency problem described above. **k-means++ is designed around the idea that the centers of the initial clusters should be far apart from each other**, and the initial cluster centers are chosen stochastically according to the distances between data points. Concretely, the initial centers are selected as follows:

1. Choose one center uniformly at random from the data points.
2. For each data point, compute the distance to the nearest center chosen so far.
3. Choose the next center from the data points with probability proportional to the square of that distance.
4. Repeat steps 2 and 3 until k centers have been chosen, then run ordinary k-means with these centers as the initial values.

As described above, by determining the initial cluster centers probabilistically based on the distances between data points, k-means++ tries to resolve the initial value dependence problem.
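In other words, if $D(x)$ denotes the distance from a data point $x$ to the nearest center chosen so far, the next center is drawn from the data points with probability

$$
P(x) = \frac{D(x)^2}{\sum_{x'} D(x')^2}
$$

which is exactly the weighting computed in the implementation below.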
The following is an implementation of k-means++ without using a library. It is based on the k-means code described in Essence of Machine Learning, which I modified myself to turn it into k-means++.
import numpy as np

class KMeans_pp:
    def __init__(self, n_clusters, max_iter = 1000, random_seed = 0):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = np.random.RandomState(random_seed)

    def fit(self, X):
        #Randomly choose the first cluster center from the data points
        tmp = self.random_state.choice(np.array(range(X.shape[0])))
        first_cluster = X[tmp]
        first_cluster = first_cluster[np.newaxis,:]

        #Square the distance between the first cluster center and every data point, then divide by the sum to get selection probabilities
        p = ((X - first_cluster)**2).sum(axis = 1) / ((X - first_cluster)**2).sum()

        r = self.random_state.choice(np.array(range(X.shape[0])), size = 1, replace = False, p = p)

        first_cluster = np.r_[first_cluster, X[r]]

        #When three or more clusters are requested
        if self.n_clusters >= 3:
            #Repeat until the specified number of cluster centers has been chosen
            while first_cluster.shape[0] < self.n_clusters:
                #Squared distance between each chosen center and each data point
                dist_f = ((X[:, :, np.newaxis] - first_cluster.T[np.newaxis, :, :])**2).sum(axis = 1)
                #Index of the closest center for each data point
                f_argmin = dist_f.argmin(axis = 1)
                #Keep only the squared distance to the closest center
                for i in range(dist_f.shape[1]):
                    dist_f.T[i][f_argmin != i] = 0
                #Choose the next center with probability proportional to that squared distance
                pp = dist_f.sum(axis = 1) / dist_f.sum()
                rr = self.random_state.choice(np.array(range(X.shape[0])), size = 1, replace = False, p = pp)
                #Add the new center to the initial centers
                first_cluster = np.r_[first_cluster, X[rr]]

        #Initial labeling based on the chosen centers
        dist = (((X[:, :, np.newaxis] - first_cluster.T[np.newaxis, :, :]) ** 2).sum(axis = 1))
        self.labels_ = dist.argmin(axis = 1)
        labels_prev = np.zeros(X.shape[0])
        count = 0
        self.cluster_centers_ = np.zeros((self.n_clusters, X.shape[1]))

        #Stop when no data point changes cluster or the iteration limit is reached
        while (not (self.labels_ == labels_prev).all() and count < self.max_iter):
            #Compute the centroid of each cluster
            for i in range(self.n_clusters):
                XX = X[self.labels_ == i, :]
                self.cluster_centers_[i, :] = XX.mean(axis = 0)
            #Squared distance between every data point and every centroid
            dist = ((X[:, :, np.newaxis] - self.cluster_centers_.T[np.newaxis, :, :]) ** 2).sum(axis = 1)
            #Remember the previous labels; if they no longer change, the loop ends
            labels_prev = self.labels_
            #Reassign each data point to its nearest centroid
            self.labels_ = dist.argmin(axis = 1)
            count += 1
        self.count = count

    def predict(self, X):
        #Assign each data point to the nearest learned centroid
        dist = ((X[:, :, np.newaxis] - self.cluster_centers_.T[np.newaxis, :, :]) ** 2).sum(axis = 1)
        labels = dist.argmin(axis = 1)
        return labels
Next, let's verify whether this algorithm can actually cluster the data.
import matplotlib.pyplot as plt
#Create a sample dataset with three clusters
np.random.seed(0)
points1 = np.random.randn(80, 2)
points2 = np.random.randn(80, 2) + np.array([4,0])
points3 = np.random.randn(80, 2) + np.array([5,8])
points = np.r_[points1, points2, points3]
np.random.shuffle(points)
#Create a model to divide into 3 clusters
model = KMeans_pp(3)
model.fit(points)
print(model.labels_)
The output looks like this. You can see that the points are cleanly divided into three labels.
[2 1 0 2 1 1 0 1 2 0 1 1 0 1 0 0 1 1 0 2 0 1 2 0 1 2 0 2 1 2 1 1 1 0 1 0 1
2 2 1 1 1 1 2 0 1 1 1 0 2 1 0 2 1 0 1 0 2 2 2 2 2 1 0 1 0 0 1 1 1 1 1 0 1
0 0 0 2 1 0 2 0 1 1 0 1 2 0 2 2 2 0 0 0 2 0 0 0 2 0 2 1 1 1 1 1 0 1 2 1 2
0 1 2 2 1 2 0 2 2 2 0 0 2 0 2 1 2 2 0 1 2 1 2 2 2 1 0 2 1 1 2 0 0 0 2 1 1
1 0 0 0 1 1 2 0 1 0 0 0 2 0 0 0 0 0 2 2 1 2 0 2 2 0 1 2 2 2 2 1 0 2 1 2 2
0 2 0 0 0 2 0 1 2 0 0 0 1 1 1 1 2 2 0 2 2 2 0 0 1 1 2 2 0 1 1 2 1 2 2 0 2
1 2 0 2 1 2 0 0 2 2 0 2 0 2 1 2 2 0]
Let's illustrate this with matplotlib.
markers = ["+", "*", "o", '+']
color = ['r', 'b', 'g', 'k']
for i in range(4):
    p = points[model.labels_ == i, :]
    plt.scatter(p[:, 0], p[:, 1], marker = markers[i], color = color[i])
plt.show()
Here is the output. You can see that clustering is completed without any problems.
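For reference, the same k-means++ initialization is also available in scikit-learn's KMeans class (a minimal sketch, assuming scikit-learn is installed; this is the library's API, separate from the implementation above):

from sklearn.cluster import KMeans

#scikit-learn's KMeans uses k-means++ initialization by default
sk_model = KMeans(n_clusters = 3, init = 'k-means++', n_init = 10, random_state = 0)
sk_labels = sk_model.fit_predict(points)
print(sk_labels)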
Let's also output the number of iterations it took for the clustering to converge.
print(model.count)
4
This time, it took k-means++ four iterations to converge. When I counted the iterations in the same way with ordinary k-means, the result was three. k-means++ is generally said to take less time to converge, so I would like to verify this for myself.
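A single run does not tell us much, so one rough way to check (a sketch of my own, not from the references) is to average the iteration count over several random seeds; the plain k-means class from my previous article would be run in the same way for the other side of the comparison.

#Average the number of iterations until convergence over several seeds
#(only KMeans_pp is shown; a plain k-means class would be run the same way for comparison)
counts = []
for seed in range(10):
    m = KMeans_pp(3, random_seed = seed)
    m.fit(points)
    counts.append(m.count)
print(sum(counts) / len(counts))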
Next, I would like to take on clustering with Gaussian mixture models and the EM algorithm.