k-means first chooses initial center positions, then repeats cluster assignment and center recalculation until it converges. The better the initial positions, the higher the accuracy and the fewer iterations until convergence. There are two common initialization methods: random and k-means++. As the name suggests, random picks as many samples as there are clusters and uses them as the initial centers. In k-means++, the first center is chosen at random; from the second onward, a probability distribution proportional to D(x)^2 is built so that samples farther from the existing centers are more likely to be chosen, and centers are selected as far apart as possible until there are as many as clusters. The explanation here is very easy to understand → https://www.medi-08-data-06.work/entry/kmeans
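For reference, the D(x)^2 seeding of k-means++ can be sketched roughly as follows. This is a minimal illustration with a made-up function name (`kmeanspp_init`), not sklearn's actual implementation:

```python
import numpy as np

def kmeanspp_init(x, n_clusters, rng=np.random.default_rng()):
    # First center: a uniformly random sample.
    centers = [x[rng.integers(len(x))]]
    for _ in range(n_clusters - 1):
        # D(x)^2: squared distance from each sample to its nearest chosen center.
        d2 = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        # Pick the next center with probability proportional to D(x)^2,
        # so distant samples are more likely to be chosen.
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.array(centers)
```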
Since I run k-means on a microcontroller, I need an initialization that works like k-means++ but is lightweight and does not consume memory. The procedure I came up with is as follows:
① Calculate the average of all pairwise distances between the centers.
② Estimate the cluster the sample belongs to (its nearest center).
③ If the distance from ② is larger than half of ① (the average distance is treated as a diameter, so half of it is a radius), make the sample a new center.
④ If a center was updated with a new sample in ③, go back to ①; if not, continue from ② with the next sample.
Repeat ① to ④ for all target data.
The idea is that if a sample's distance at assignment time is larger than the average distance between the centers, the sample must belong to a new cluster and the distribution must be wider than the current centers cover, so the centers keep being pushed apart until the average inter-center distance is wide enough. It's hard to explain well in words, so please take a look at the source below.
iris
method | mean iterations | iterations std | mean accuracy | accuracy std |
---|---|---|---|---|
random | 6.418 | 2.174 | 0.763 | 0.1268 |
k-means++ | 5.38 | 1.63 | 0.804 | 0.09 |
New method | 6.104 | 2.088 | 0.801 | 0.09 |
seeds
method | mean iterations | iterations std | mean accuracy | accuracy std |
---|---|---|---|---|
random | 9.109 | 3.237 | 0.921 | 0.0049 |
k-means++ | 7.096 | 2.4284 | 0.921 | 0.0051 |
New method | 7.368 | 2.56 | 0.921 | 0.0046 |
Although it doesn't beat k-means++, the result is clearly better than random: the number of iterations is small, the accuracy is high, and the standard deviation is small, so the results vary little between runs. It would have been nice to also test a dataset with more clusters, around 8, but I couldn't find a convenient one, so I tested only iris and seeds. Incidentally, on the data I actually use at work, this method achieved higher accuracy and fewer iterations than it did here, so its effectiveness seems to depend on how well it matches the data. The source code is posted below.
```python
def Kmeans_Predict(means, x):
    # Return the distance to the nearest center and that center's index.
    distances = []
    for m in means:
        distances.append(np.linalg.norm(m - x))
    predict = np.argmin(distances)
    return distances[predict], predict

# Calculate the average distance between centers, halved (diameter -> radius).
def KmeansInit_CalcRadiusAverageDistance(means):
    length = len(means)
    avrDistance = 0
    cnt = 0
    for i in range(length):
        for j in range(i):
            avrDistance += np.linalg.norm(means[i] - means[j])
            cnt += 1
    return avrDistance / cnt / 2

def KmeansInit_FarawayCentroids(n_clusters, x):
    means = np.zeros((n_clusters, x.shape[1]))
    distanceThreshold = 0
    for _ in range(1):  # a single pass over the data
        for ix in x:
            distance, predict = Kmeans_Predict(means, ix)
            if distance > distanceThreshold:
                # Farther than the radius from every center: the sample is a new cluster.
                means[predict] = ix
                distanceThreshold = KmeansInit_CalcRadiusAverageDistance(means)
            else:
                # Within the radius of its nearest center: a reasonable position, keep the centers.
                pass
    return means
```
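The returned centers can be handed straight to scikit-learn via the `init=` parameter. A small usage sketch with made-up dummy data:

```python
import numpy as np
import sklearn.cluster

x = np.random.rand(100, 4)  # dummy data: 100 samples, 4 features
init = KmeansInit_FarawayCentroids(3, x)
model = sklearn.cluster.KMeans(n_clusters=3, init=init, n_init=1)
model.fit(x)
print(model.n_iter_, model.cluster_centers_)
```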
```python
import numpy as np
import sklearn.cluster
import sklearn.preprocessing

def Kmeans_Predict(means, x):
    distances = []
    for m in means:
        distances.append(np.linalg.norm(m - x))
    predict = np.argmin(distances)
    return distances[predict], predict

def KmeansInit_CalcRadiusAverageDistance(means):
    length = len(means)
    avrDistance = 0
    cnt = 0
    for i in range(length):
        for j in range(i):
            avrDistance += np.linalg.norm(means[i] - means[j])
            cnt += 1
    return avrDistance / cnt / 2

def KmeansInit_FarawayCentroids(n_clusters, x):
    means = np.zeros((n_clusters, x.shape[1]))
    distanceThreshold = 0
    for _ in range(1):
        for ix in x:
            distance, predict = Kmeans_Predict(means, ix)
            if distance > distanceThreshold:
                means[predict] = ix
                distanceThreshold = KmeansInit_CalcRadiusAverageDistance(means)
    return means

def loadIris():
    data = np.loadtxt("./iris_dataset.txt", delimiter="\t", dtype=str)
    x = data[:, 0:4].astype(float)
    names = data[:, 4]
    nameList = np.unique(names)
    y = np.zeros(len(data))
    for i, name in enumerate(nameList):
        y[names == name] = i
    return x, y

def loadSeeds():
    data = np.loadtxt("./seeds_dataset.txt", delimiter="\t", dtype=str)
    x = data[:, 0:7].astype(float)
    y = data[:, 7].astype(float)
    return x, y

def KmeansModel(init, n_clusters):
    return sklearn.cluster.KMeans(
        n_clusters=n_clusters,
        init=init,
        n_init=1,
        max_iter=100,
        tol=1e-5,
        verbose=3)

def CalcAccuracy(y, predicts):
    # Greedily match each true class to its most frequent remaining cluster
    # and score the fraction of that class captured by the matched cluster.
    answers = np.unique(y)
    clusters = np.sort(np.unique(predicts))
    accuracy = []
    ignoreCluster = []
    for ans in answers:
        pred = predicts[y == ans]
        total = []
        totalClusters = []
        for c in clusters:
            if c in ignoreCluster:
                continue
            total.append(np.sum(pred == c))
            totalClusters.append(c)
        maxIdx = np.argmax(total)
        ignoreCluster.append(totalClusters[maxIdx])
        accuracy.append(total[maxIdx] / len(pred))
    return accuracy

def KmeansTestSub(init, n_clusters, x, y):
    model = KmeansModel(init, n_clusters)
    model.fit(x)
    predicts = model.predict(x)
    accuracy = CalcAccuracy(y, predicts)
    return [model.n_iter_, np.mean(accuracy)] + list(accuracy)

def shuffleData(x, y):
    idxs = np.arange(len(x))
    np.random.shuffle(idxs)
    return x[idxs], y[idxs]

def appendStats(report):
    # Append per-column mean and std computed over the raw runs only.
    report = np.array(report)
    return np.vstack([report, np.mean(report, axis=0), np.std(report, axis=0)])

def KmeansTest(dataset, n_clusters, prefix="", test_count=1000):
    x, y = dataset
    scaler = sklearn.preprocessing.StandardScaler()
    x = scaler.fit_transform(x)
    # faraway (proposed method)
    report = []
    for i in range(test_count):
        x, y = shuffleData(x, y)
        init = KmeansInit_FarawayCentroids(n_clusters, x)
        report.append(KmeansTestSub(init, n_clusters, x, y))
    np.savetxt("./{}report_farway.txt".format(prefix), appendStats(report), delimiter="\t")
    # random
    report = []
    for i in range(test_count):
        report.append(KmeansTestSub("random", n_clusters, x, y))
    np.savetxt("./{}report_random.txt".format(prefix), appendStats(report), delimiter="\t")
    # k-means++
    report = []
    for i in range(test_count):
        report.append(KmeansTestSub("k-means++", n_clusters, x, y))
    np.savetxt("./{}report_kmeansPP.txt".format(prefix), appendStats(report), delimiter="\t")

def run():
    KmeansTest(loadIris(), prefix="iris_", n_clusters=3)
    KmeansTest(loadSeeds(), prefix="seeds_", n_clusters=3)

if __name__ == "__main__":
    run()  # python kmeans_test.py
```
That's all.