# [PYTHON] k-means and kernel k-means

I started writing up my study notes on another blog during the stay-at-home period, but its TeX support was awful, so I am moving some articles over to Qiita...

I studied k-means, the most basic [citation needed] algorithm for unsupervised learning, and gave it a quick try.

k-means is one of the methods for separating given data into multiple clusters. Its advantages:

• The algorithm is simple and easy to understand
• Interpretation of results is not difficult

Since points are assigned by their distance from the cluster centers, the result is almost always the intuitive one when plotted.

Its drawbacks:

• The number of clusters must be specified in advance (the algorithm does not infer it from the data)
• Not suitable for data that is not linearly separable

## algorithm

given

• Dataset $x_i$
• Number of clusters $K$

Method of calculation

1. Randomly assign a cluster $k_i$ to every data point $x_i$, giving pairs $(x_i, k_i)$
2. Compute the center of each cluster, $m_k = \frac{1}{|N_k|} \sum_{i \in N_k} x_i$
3. For every data point, compute the distance to every cluster center and reassign the point to the nearest cluster
4. Repeat steps 2-3 until no assignment changes

Where $N_k$ is the set of data contained in the cluster $k$ at that time.
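As a rough illustration, one round of steps 2-3 can be written with plain NumPy as below. This is a minimal sketch; the array layout and the helper name `kmeans_step` are mine, not part of the pandas implementation shown later.

```python
import numpy as np

def kmeans_step(X, labels, k):
    # Step 2: center of each cluster, m_k = (1/|N_k|) * sum_{i in N_k} x_i
    # (assumes every cluster is currently non-empty)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 3: squared distance from every point to every center,
    # then reassign each point to its nearest center
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)
```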

## Example of use

Experiment with the following parameters in sklearn's datasets.make_circles.

• n_samples = 100
• factor = 0.1
• noise = 0.01

With these settings we obtain two clusters, a circle of radius 0.1 and a circle of radius 1, under noise of 0.01.
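For reference, the data set can be generated as follows (a sketch; the original does not fix a random seed):

```python
from sklearn.datasets import make_circles

# Two concentric circles of radius ~1 (outer) and ~0.1 (inner = factor),
# with Gaussian noise of standard deviation 0.01
data, true_labels = make_circles(n_samples=100, factor=0.1, noise=0.01)
```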

### code

```python
import numpy as np
import pandas as pd

# k: number of clusters
def kmeans(data, k):
    df = pd.DataFrame()
    df['x'] = data.T[0]
    df['y'] = data.T[1]
    # Step 1: assign a random cluster to every point
    df['c'] = np.random.randint(0, k, len(df))

    while True:
        before_clusters = list(df.c.copy())
        # Step 2: centers of the current clusters (one column per cluster)
        centers = df.groupby('c').mean().reset_index().drop(columns=['c']).T

        # Step 3: reassign every point to its nearest center
        tmp = df.drop(columns=['c']).T
        new_clusters = [min(centers.columns, key=lambda c: ((tmp[col] - centers[c]) ** 2).sum())
                        for col in tmp.columns]
        if before_clusters == new_clusters:
            break
        df['c'] = new_clusters

    return df.c
```
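Running it might look like this (the plotting code is my sketch, assuming matplotlib, not part of the original):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

data, _ = make_circles(n_samples=100, factor=0.1, noise=0.01)
clusters = kmeans(data, 2)

# Color each point by its assigned cluster
plt.scatter(data.T[0], data.T[1], c=clusters)
plt.show()
```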


### Experimental result

The result of running k-means is shown below. Since the data is clearly not linearly separable, the clusters are not recovered correctly, as expected.

# kernel k-means

Its advantage:

• Non-linear separation is also possible

Its drawback:

• The classification result depends strongly on the choice of kernel function and its parameters

## algorithm

given

• Dataset $x_i$
• Number of clusters $K$
• Kernel function (and required parameters for that kernel function)

Almost the same as ordinary k-means, except that the cluster centers are taken in the feature space. That is, for a cluster $k$, we define its center as the vector $m_k$ that minimizes the following quantity.

\displaystyle{
\begin{aligned}
\sum_{i \in N_k} (\phi(x_i) - m_k)^2.
\end{aligned}
}


The minimizing $m_k$ is obtained as

\displaystyle{
\begin{aligned}
m_k = \frac{1}{|N_k|} \sum_{i \in N_k} \phi(x_i).
\end{aligned}
}

Using this, the distance between a data point $x_i$ and the cluster $k$ becomes

\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 &= K(x_i, x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} K(x_i, x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} K(x_m, x_n),
\end{aligned}
}

where $K(x, y)$ is the kernel function.

## Example of use

Experiment with the following parameters in sklearn's datasets.make_circles.

• n_samples = 100
• factor = 0.1
• noise = 0.01

The kernel function is the following RBF kernel.

\displaystyle{
\begin{aligned}
K(x, y) = \exp \left(-\frac{(x-y)^2}{\sigma^2} \right)
\end{aligned}
}

### code

```python
import numpy as np
import pandas as pd

# kernel: kernel function taking two points
def kernel_kmeans(data, k, kernel):
    df = pd.DataFrame()
    df['x'] = data.T[0]
    df['y'] = data.T[1]
    df['c'] = np.random.randint(0, k, len(df))

    # The within-cluster term is expensive to recompute for every point,
    # so cache it once per iteration
    self_corr = {}
    def compute_self_corr():
        for j in range(0, k):
            sub_df = df[df.c == j].drop(columns=['c']).T
            self_corr[j] = sub_df.apply(
                lambda col1: sub_df.apply(lambda col2: kernel(col1, col2), axis=0).mean(),
                axis=0).mean()

    # Distance to the center of cluster j in the feature space
    def compute_distance_from_center_j(x, j):
        sub_df = df[df.c == j].drop(columns=['c']).T
        ret = kernel(x, x)
        ret += sub_df.apply(lambda col: -2 * kernel(x, col), axis=0).mean()
        ret += self_corr[j]
        return ret

    while True:
        compute_self_corr()
        before_clusters = list(df.c.copy())
        tmp = df.drop(columns=['c']).T
        new_clusters = [min(range(0, k), key=lambda j: compute_distance_from_center_j(tmp[col], j))
                        for col in tmp.columns]
        if before_clusters == new_clusters:
            break
        df['c'] = new_clusters

    return df.c
```
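It can be run the same way as plain k-means; the RBF helper and plotting below are my sketch, with $\sigma = 0.5$ as in the experiment that follows:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

sigma = 0.5

# RBF kernel K(x, y) = exp(-(x - y)^2 / sigma^2); x and y are 2D points
def rbf(x, y):
    return np.exp(-((x - y) ** 2).sum() / sigma ** 2)

data, _ = make_circles(n_samples=100, factor=0.1, noise=0.01)
clusters = kernel_kmeans(data, 2, rbf)
plt.scatter(data.T[0], data.T[1], c=clusters)
plt.show()
```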
### Experimental result

The result of running kernel k-means is shown below. The kernel parameter is $\sigma = 0.5$.

## When kernel k-means works (and when it does not)

Consider the following situation.

• Number of clusters: 2
• A cluster $c_r$ distributed uniformly on the circle of radius $r$, and a cluster $c_R$ distributed uniformly on the circle of radius $R$
• $r \lt R$
• The number of data points is large enough that sums can be replaced by integrals with sufficient accuracy
• The data is correctly classified: points on the radius-$r$ circle belong to cluster $c_r$ and the rest to cluster $c_R$
• The kernel function is the RBF kernel

Now add a new data point $D = (x, y) = (R, 0)$ and classify it. Writing $C(c)$ for the distance from $D$ to the center of class $c$, the distances to classes $c_r$ and $c_R$ are

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right),
\end{aligned}
}

where $I_\alpha(x)$ is the modified Bessel function. When $r \ll \sigma \ll R$, these reduce to

\displaystyle{
\begin{aligned}
C(c_r) &= 2, \\
C(c_R) &= 1 - \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

Clearly $C(c_R) \lt C(c_r)$, so the point is assigned to the correct cluster.

Also, for $\sigma \ll r, R$,

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \sqrt{\frac{\sigma^2}{2\pi r R}} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt{\frac{\sigma^2}{2\pi r^2}}, \\
C(c_R) &= 1 - \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

Examining where the two are equal, there is no solution other than $r = R$; otherwise $C(c_r) \lt C(c_R)$ and the point is misclassified. The important point here is the relative size of $\sigma$ and $\delta r := |R - r|$, not their absolute scale.

Also, when $r, R \ll \sigma$,

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0,
\end{aligned}
}

so the point clearly cannot be classified correctly. Moreover, expanding to leading order in $r/\sigma$ and $R/\sigma$ gives the same expression as ordinary k-means, which is why the classification fails.

Similarly, consider adding the data point $d = (x, y) = (r, 0)$. The distances to the class centers are

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}

When $r \ll \sigma \ll R$, these reduce to

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 1 + \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

Clearly $C(c_r) \lt C(c_R)$, so the point is assigned to the correct cluster.

Also, for $\sigma \ll r, R$,

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \sqrt{\frac{\sigma^2}{2\pi r^2}}, \\
C(c_R) &= 1 - 2 \sqrt{\frac{\sigma^2}{2\pi r R}} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

In this case as well, examining where the two are equal gives only $r = R$; otherwise $C(c_R) \lt C(c_r)$, an incorrect classification.

Also, when $r, R \ll \sigma$,

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0,
\end{aligned}
}

and again it clearly cannot be classified correctly; to leading order in $r/\sigma$ and $R/\sigma$ the formula is the same as that of ordinary k-means.

As these results show, classification works well when the resolution $\sigma$ of the kernel function lies between the scales of the two clusters. Presumably the RBF kernel cannot see direction, only size, and in particular only whether something is larger or smaller than the given scale $\sigma$. So if the kernel scale does not lie between the scales of the classes, it does not work as expected.

Let us run a concrete numerical experiment: vary $\sigma$ under the same parameters $r = 0.1, R = 1$ as before, and look at the behavior in the regimes $\sigma \ll r, R$, then $r \ll \sigma \ll R$, then $r, R \ll \sigma$.

• For $\sigma \ll r, R$: $\sigma = 0.01, r = 0.1, R = 1$
• For $r \ll \sigma \ll R$: $\sigma = 0.5, r = 0.1, R = 1$
• For $r, R \ll \sigma$: $\sigma = 2, r = 0.1, R = 1$
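The closed forms above are also easy to evaluate numerically for these three settings. A quick sketch, assuming scipy is available; `i0e(z) = exp(-z) * I0(z)` is scipy's exponentially scaled Bessel function, which avoids overflow when $\sigma$ is small:

```python
import numpy as np
from scipy.special import i0e  # i0e(z) = exp(-z) * I0(z)

# C(c_r) and C(c_R) for the added point D = (R, 0), rewritten with i0e:
# exp(-a) * I0(b) = exp(b - a) * i0e(b)
def distances_at_R(r, R, sigma):
    c_r = (1
           - 2 * np.exp(-(r - R) ** 2 / sigma ** 2) * i0e(2 * r * R / sigma ** 2)
           + i0e(2 * r ** 2 / sigma ** 2))
    c_R = 1 - i0e(2 * R ** 2 / sigma ** 2)
    return c_r, c_R

for sigma in (0.01, 0.5, 2):
    print(sigma, distances_at_R(0.1, 1.0, sigma))
```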

# Impressions

kernel k-means is not easy to use because it depends heavily on the kernel and its parameters. Then again, ordinary k-means only works on linearly separable data, so it is not easy to use either...

# reference

• Hajipata
• Castella book
• https://research.miidas.jp/2019/07/kernel-kmeans%E3%81%AEnumpy%E5%AE%9F%E8%A3%85/