What is KMEANS initialization processing?

KMEANS first determines an appropriate initial position and repeats the estimation and center value calculation until it converges. The better the initial position, the higher the accuracy and the fewer repetitions until convergence. There are two initialization methods, RANDOM and k-means ++. As the name suggests, RANDOM randomly selects samples for the number of clusters and uses them as the initial position. In k-means ++, the first point is randomly selected, but 2 From the second onward, create a probability distribution with D (x) ^ 2 so that the probability increases as the distance increases, and select as many objects as far as possible as the number of clusters. The explanation here is very easy to understand → https://www.medi-08-data-06.work/entry/kmeans

Motivation

Since I was able to do KMEANS with a microcomputer, it is necessary to perform initialization processing that is as light as KMEANS ++ and does not consume memory.

The way I came up with

① Calculate the average value of all distances between the center values ② Estimate the cluster to which the sample belongs ③ If the distance of ② is larger than half of ① (the distance is considered as the diameter and its radius), it is set as the new center value. ④ If the cluster is updated with a new sample according to ③, go to ①. If not, go to ②. Repeat ① to ④ for all target data

The fact that the distance at the time of cluster estimation of a sample is larger than the distance average between the medians means that the sample is a new cluster and the distribution must be wider than the current center value, so until the distance average between the medians is widened. The idea is to expand it. It's hard to understand, but I can't explain it well, so please take a look at the source.

1000 test results by iris and seeds

iris

processing	Iteration average	Iteration distribution	Correct answer average	Correct answer dispersion
random	6.418	2.174	0.763	0.1268
k-means++	5.38	1.63	0.804	0.09
New method	6.104	2.088	0.801	0.09

seeds

processing	Iteration average	Iteration distribution	Correct answer average	Correct answer dispersion
random	9.109	3.237	0.921	0.0049
k-means++	7.096	2.4284	0.921	0.0051
New method	7.368	2.56	0.921	0.0046

Although it was not as good as k-means ++, I think that the result is clearly better than random, the number of iterations is small, the accuracy is high, and the variance is small, so the variation in the result seems to be small. It would be nice if there was a little more and there was a sample of about 8 clusters, but there was no convenient one, so I decided to test only iris and seeds. By the way, it seems that the method is highly compatible with each sample because it has higher accuracy and reduced number of iterations than the actual test used in the actual work. The source code is posted below.

Source code of initialization process

def Kmeans_Predict(means, x):
    distances = []
    for m in means:
        distances.append(np.linalg.norm(m - x))
    predict = np.argmin(distances)
    return distances[predict], predict

#Calculate the average distance between medians
def KmeansInit_CalcRadiusAverageDistance(means):
    length = len(means)
    avrDistance = 0
    cnt = 0
    for i in range(length):
        for j in range(i):
            if j == i: continue
            avrDistance += np.linalg.norm(means[i] - means[j])
            cnt += 1
    return (avrDistance / cnt/ 2)

def KmeansInit_FarawayCentroids(n_clusters, x):
    means = np.zeros((n_clusters, x.shape[1]))
    distanceThreshold = 0
    for cnt in range(1):
        for ix in x:
            distance, predict = Kmeans_Predict(means, ix)
            if distance > distanceThreshold:
                #If it is larger than the average distance between the medians, then the sample is a new cluster.
                means[predict] = ix
                distanceThreshold = KmeansInit_CalcRadiusAverageDistance(means)
            else:
                #If it is the same as or less than the average distance between the center values, it is a reasonable position and the center value is not updated.
                pass
    return means

Whole test source code

import math
import numpy as np
import sklearn.cluster
import sklearn.preprocessing
import sys

def Kmeans_Predict(means, x):
    distances = []
    for m in means:
        distances.append(np.linalg.norm(m - x))
    predict = np.argmin(distances)
    return distances[predict], predict

def KmeansInit_CalcRadiusAverageDistance(means):
    length = len(means)
    avrDistance = 0
    cnt = 0
    for i in range(length):
        for j in range(i):
            if j == i: continue
            avrDistance += np.linalg.norm(means[i] - means[j])
            cnt += 1
    return (avrDistance / cnt/ 2)

def KmeansInit_FarawayCentroids(n_clusters, x):
    means = np.zeros((n_clusters, x.shape[1]))
    distanceThreshold = 0
    for cnt in range(1):
        for ix in x:
            distance, predict = Kmeans_Predict(means, ix)
            if distance > distanceThreshold:
                means[predict] = ix
                distanceThreshold = KmeansInit_CalcRadiusAverageDistance(means)
    return means

def loadIris():
    data = np.loadtxt("./iris_dataset.txt", delimiter="\t", dtype=str)
    length = len(data)
    x = data[:,0:4].astype(np.float)
    names = data[:,4]
    nameList = np.unique(data[:,4])
    y = np.zeros(length)
    for i, name in enumerate(nameList):
        y[names == name] = i
    return x, y

def loadSeeds():
    data = np.loadtxt("./seeds_dataset.txt", delimiter="\t", dtype=str)
    length = len(data)
    x = data[:,0:7].astype(np.float)
    y = data[:,7].astype(float)
    return x, y


def KmeansModel(init, n_clusters):
    return sklearn.cluster.KMeans(
        n_clusters=n_clusters,
        init=init,
        n_init=1,
        max_iter=100,
        tol=1e-5,
        verbose=3
    )

def CalcAccuracy(y, predicts):
    answers = np.unique(y)
    clusters = np.sort(np.unique(predicts))
    accuracy = []
    ignoreCluster = []
    for ans in answers:
        pred = predicts[y == ans]
        total = []
        totalClusters = []
        for c in clusters:
            if c in ignoreCluster:
                continue
            total.append(np.sum(pred == c))
            totalClusters.append(c)
        maxIdx = np.argmax(total)
        ignoreCluster.append(totalClusters[maxIdx])
        acc = total[maxIdx] / len(pred)
        accuracy.append(acc)
    return accuracy

def KmeansTestSub(init, n_clusters, x, y):
    model = KmeansModel(init, n_clusters)
    model.fit(x)
    predicts = model.predict(x)
    accuracy = CalcAccuracy(y, predicts)
    return [model.n_iter_, np.mean(accuracy)] + list(accuracy)

def shuffleData(x, y):
    idxs = np.arange(len(x))
    np.random.shuffle(idxs)
    x, y = x[idxs], y[idxs]
    return x, y

def KmeansTest(dataset, n_clusters, prefix="", test_count=1000):
    x, y = dataset
    scaler = sklearn.preprocessing.StandardScaler()
    scaler.fit(x)
    x = scaler.transform(x)
    
    # farway
    report = []
    for i in range(test_count):
        x, y = shuffleData(x, y)
        init = KmeansInit_FarawayCentroids(n_clusters, x)
        rep = KmeansTestSub(init, n_clusters, x, y)
        report.append(rep)
    report = np.vstack([report, np.mean(report, axis=0)])
    report = np.vstack([report, np.std(report, axis=0)])
    np.savetxt("./{}report_farway.txt".format(prefix), report, delimiter="\t")
    
    # random
    report = []
    for i in range(test_count):
        rep = KmeansTestSub("random", n_clusters, x, y)
        report.append(rep)
    report = np.vstack([report, np.mean(report, axis=0)])
    report = np.vstack([report, np.std(report, axis=0)])
    np.savetxt("./{}report_random.txt".format(prefix), report, delimiter="\t")
    
    # k-means++
    report = []
    for i in range(test_count):
        rep = KmeansTestSub("k-means++", n_clusters, x, y)
        report.append(rep)
    report = np.vstack([report, np.mean(report, axis=0)])
    report = np.vstack([report, np.std(report, axis=0)])
    np.savetxt("./{}report_kmeansPP.txt".format(prefix), report, delimiter="\t")

def run():
    KmeansTest(loadIris(), prefix="iris_", n_clusters=3)
    KmeansTest(loadSeeds(), prefix="seeds_", n_clusters=3)

if __name__ == "__main__":
    run()
# python kmeans_test.py

[PYTHON] I made a KMEANS initialization process that is faster and more accurate than random.