I started writing up my study notes on another blog during my stay-at-home period, but since its TeX support was terrible, I am moving some of the articles to Qiita...

I studied k-means, the most basic [citation needed] algorithm in unsupervised learning, and gave it a quick try.

k-means

One of the methods for dividing given data into multiple clusters.

Advantages

- The algorithm is simple and easy to understand
- The results are easy to interpret

Since points are assigned by their distance from the cluster centers, plotting the result almost always matches intuition.

Disadvantages

- The number of clusters must be specified in advance (the algorithm does not discover the number of clusters on its own)
- Not suitable for data that is not linearly separable

Given

- Dataset $ x_i $
- Number of clusters $ K $

Method of calculation

1. Randomly assign a cluster $k_i$ to each data point $x_i$, forming pairs $(x_i, k_i)$
2. Find the center $m_k$ of each cluster from $m_k |N_k| = \sum_{i \in N_k} x_i$
3. For every data point, compute the distance to each cluster center and reassign the point to the nearest cluster
4. Repeat steps 2 and 3 until no assignments change

Here $N_k$ denotes the set of data points assigned to cluster $k$ at that step.

We experiment on data from sklearn's datasets.make_circles with the following parameters.

- n_samples = 100
- factor = 0.1
- noise = 0.01

With these settings we obtain two clusters, a circle of radius 0.1 and a circle of radius 1, under noise of 0.01.
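For concreteness, a minimal sketch of this data generation (the variable names `data` and `labels` are my choice; `data` is reused in the snippets below):

```
import numpy as np
from sklearn import datasets

# Two concentric circles, radius 1 and radius 0.1, with small Gaussian noise
data, labels = datasets.make_circles(n_samples=100, factor=0.1, noise=0.01)
```

The k-means implementation itself is as follows.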

```
import numpy as np
import pandas as pd

# k: number of clusters
def kmeans(data, k):
    df = pd.DataFrame()
    df['x'] = data.T[0]
    df['y'] = data.T[1]
    # Start from a random cluster assignment
    df['c'] = np.random.randint(0, k, len(df))
    while True:
        before_clusters = list(df.c.copy())
        # Mean of each current cluster; after transposing, columns index the clusters
        centers = df.groupby('c').mean().reset_index().drop(columns=['c']).T
        tmp = df.drop(columns=['c']).T
        # Reassign every point to the nearest cluster center
        new_clusters = [min(centers.columns, key=lambda c: ((tmp[col] - centers[c]) ** 2).sum())
                        for col in tmp.columns]
        if before_clusters == new_clusters:
            break
        df['c'] = new_clusters
    return df.c
```
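A minimal usage sketch, assuming `data` from the make_circles snippet above and matplotlib for plotting:

```
import matplotlib.pyplot as plt

assigned = kmeans(data, 2)  # cluster into k = 2 groups
plt.scatter(data.T[0], data.T[1], c=assigned)
plt.show()
```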

The result of running k-means is shown below. Since the data is clearly not linearly separable, the clustering does not match the true classes, as expected.

kernel k-means

Advantages

- Non-linear separation is also possible

Disadvantages

- The classification result depends heavily on the choice of kernel function and kernel parameters

Given

- Dataset $ x_i $
- Number of clusters $ K $
- A kernel function (and any parameters it requires)

The procedure is almost the same as ordinary k-means; the difference is that the cluster centers are computed in feature space. That is, for cluster $k$, we define its center as the vector $m_k$ that minimizes the following quantity.

```
\displaystyle{
\begin{aligned}
\sum_{i \in N_k} (\phi(x_i) - m_k)^2.
\end{aligned}
}
```

The $m_k$ that minimizes this is obtained as follows.

```
\displaystyle{
\begin{aligned}
m_k = \frac{1}{|N_k|} \sum_{i \in N_k} \phi(x_i).
\end{aligned}
}
```

Using this, the squared distance between a data point $x_i$ and the center of cluster $k$ becomes:

```
\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 &= K(x_i, x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} K(x_i, x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} K(x_m, x_n)
\end{aligned}
}
```

where $K(x, y) = \phi(x) \cdot \phi(y)$ is the kernel function.
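To spell out the intermediate step: expanding the square and substituting the expression for $m_k$ gives

```
\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 = \phi(x_i) \cdot \phi(x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} \phi(x_i) \cdot \phi(x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} \phi(x_m) \cdot \phi(x_n),
\end{aligned}
}
```

and replacing every inner product $\phi(x) \cdot \phi(y)$ with $K(x, y)$ yields the formula above. This is the kernel trick: the distance is computable without ever evaluating $\phi$ itself.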

As before, we experiment on data from sklearn's datasets.make_circles with the following parameters.

- n_samples = 100
- factor = 0.1
- noise = 0.01

For the kernel function we use the following RBF kernel.

```
\displaystyle{
\begin{aligned}
K(x, y) = \exp \left(-\frac{(x-y)^2}{\sigma^2} \right)
\end{aligned}
}
```
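As a concrete function to pass into the implementation below, here is a sketch of this RBF kernel (the factory name `make_rbf_kernel` is my choice; `x` and `y` are assumed to be pandas Series or numpy arrays holding the coordinates of a point):

```
def make_rbf_kernel(sigma):
    # Returns an RBF kernel with scale parameter sigma
    def rbf_kernel(x, y):
        return np.exp(-((x - y) ** 2).sum() / sigma ** 2)
    return rbf_kernel
```

The kernel k-means implementation is as follows.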

```
import numpy as np
import pandas as pd

# kernel: a kernel function of two arguments
def kernel_kmeans(data, k, kernel):
    df = pd.DataFrame()
    df['x'] = data.T[0]
    df['y'] = data.T[1]
    # Start from a random cluster assignment
    df['c'] = np.random.randint(0, k, len(df))
    self_corr = {}
    # The within-cluster term of the distance formula is expensive to
    # recompute for every point, so cache it once per iteration
    def compute_self_corr():
        for j in range(0, k):
            sub_df = df[df.c == j].drop(columns=['c']).T
            self_corr[j] = sub_df.apply(
                lambda col1: sub_df.apply(lambda col2: kernel(col1, col2), axis=0).mean(),
                axis=0).mean()
    # Squared distance to the center of cluster j in feature space
    def compute_distance_from_center_j(x, j):
        sub_df = df[df.c == j].drop(columns=['c']).T
        ret = kernel(x, x)
        ret += sub_df.apply(lambda col: -2 * kernel(x, col), axis=0).mean()
        ret += self_corr[j]
        return ret
    while True:
        compute_self_corr()
        before_clusters = list(df.c.copy())
        tmp = df.drop(columns=['c']).T
        # Reassign every point to the cluster whose center is nearest
        new_clusters = [min(range(0, k), key=lambda j: compute_distance_from_center_j(tmp[col], j))
                        for col in tmp.columns]
        if before_clusters == new_clusters:
            break
        df['c'] = new_clusters
    return df.c
```
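A minimal sketch of running it, reusing `data` and `make_rbf_kernel` from the earlier snippets:

```
assigned = kernel_kmeans(data, 2, make_rbf_kernel(0.5))
plt.scatter(data.T[0], data.T[1], c=assigned)
plt.show()
```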

The result of running kernel k-means is shown below. The kernel parameter $\sigma$ is $0.5$.

Consider the following situation.

- The number of clusters is 2
- Cluster $c_r$ is uniformly distributed on a circle of radius $r$, and cluster $c_R$ is uniformly distributed on a circle of radius $R$, with $r \lt R$
- The number of data points is large enough that sums can be replaced by integrals to sufficient accuracy
- The data is already correctly classified: points on the radius-$r$ circle belong to cluster $c_r$, and all others belong to cluster $c_R$
- The kernel function is the RBF kernel

Under this setup, suppose a new data point $D = (x, y) = (R, 0)$ is added and classified. Writing the squared feature-space distance from $D$ to the center of class $c$ as $C(c)$, the distances to classes $c_r$ and $c_R$ are:

```
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}
```
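These expressions follow by replacing the sums over cluster members with angular averages over the circles. For example, for the added point at radius $R$ and the cluster spread over the circle of radius $r$,

```
\displaystyle{
\begin{aligned}
\frac{1}{2\pi} \int_0^{2\pi} \exp \left( -\frac{r^2 + R^2 - 2 r R \cos \theta}{\sigma^2} \right) d\theta = \exp \left( -\frac{r^2 + R^2}{\sigma^2} \right) I_0 \left( \frac{2 r R}{\sigma^2} \right),
\end{aligned}
}
```

by the identity $\frac{1}{2\pi} \int_0^{2\pi} e^{z \cos \theta} \, d\theta = I_0(z)$.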

Here $I_\alpha(x)$ denotes the modified Bessel function of the first kind. Using $I_0(x) \approx 1$ for $x \ll 1$ and the asymptotic form $I_0(x) \sim e^x / \sqrt{2\pi x}$ for $x \gg 1$, in the regime $r \ll \sigma \ll R$ these reduce to

```
\displaystyle{
\begin{aligned}
C(c_r) &= 2, \\
C(c_R) &= 1 - \sqrt \frac{\sigma^2}{4\pi R^2}.
\end{aligned}
}
```

Obviously $C(c_R) \lt C(c_r)$, so the point is classified into the correct cluster. Also, for $\sigma \ll r, R$,

```
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \sqrt \frac{\sigma^2}{4\pi r R} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt \frac{\sigma^2}{4\pi r^2}, \\
C(c_R) &= 1 - \sqrt \frac{\sigma^2}{4\pi R^2}.
\end{aligned}
}
```

Examining where the two are equal in this regime, they coincide only at $r = R$. Also, when $r, R \ll \sigma$, both distances vanish at leading order:

```
\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}
```

Obviously this degenerate case cannot be classified correctly. Indeed, expanding to leading order in $r / \sigma$ and $R / \sigma$, the distance takes the same form as in ordinary k-means, so the data is not classified correctly.

Similarly, consider adding the data point $d = (x, y) = (r, 0)$. The distances to the class centers are:

```
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}
```

In the regime $r \ll \sigma \ll R$, these reduce to

```
\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 1 + \sqrt \frac{\sigma^2}{4\pi R^2}.
\end{aligned}
}
```

Obviously $C(c_r) \lt C(c_R)$, so the point is classified into the correct cluster. Also, for $\sigma \ll r, R$,

```
\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \sqrt \frac{\sigma^2}{4\pi r^2}, \\
C(c_R) &= 1 - 2 \sqrt \frac{\sigma^2}{4\pi r R} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt \frac{\sigma^2}{4\pi R^2}.
\end{aligned}
}
```

In this case as well, examining where the two are equal, the only solution is $r = R$; otherwise $C(c_r) \lt C(c_R)$, so $d$ is assigned to the correct cluster. Also, when $r, R \ll \sigma$,

```
\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}
```

Obviously it cannot be classified correctly. As before, expanding to leading order in $r / \sigma$ and $R / \sigma$ gives the same expression as ordinary k-means.

As these results show, classification works well when the resolution of the kernel function, $\sigma$, lies between the scales of the two clusters. Perhaps the RBF kernel can see only magnitudes, not directions: essentially whether a distance is larger or smaller than its scale $\sigma$. So if the kernel scale does not sit between the scales of the classes, the method does not behave as expected.

Let's run a concrete numerical experiment: vary $\sigma$ with the same parameters as before, $r = 0.1, R = 1$, and see how the algorithm behaves in each of the regimes below (a code sketch of the sweep follows the list).

- For $\sigma \ll r, R$

- For $r \ll \sigma \ll R$

- For $r, R \ll \sigma$
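A sketch of this sweep, where the three $\sigma$ values are my choice of representatives for the three regimes:

```
# sigma << r,R ; r << sigma << R ; r,R << sigma, with r = 0.1 and R = 1
for sigma in [0.01, 0.5, 10.0]:
    assigned = kernel_kmeans(data, 2, make_rbf_kernel(sigma))
    plt.scatter(data.T[0], data.T[1], c=assigned)
    plt.title(f'sigma = {sigma}')
    plt.show()
```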

Kernel k-means depends heavily on the kernel and its parameters, which makes it tricky to use. Then again, ordinary k-means only works on linearly separable data, so it is not exactly easy to use either...

