[PYTHON] k-means and kernel k-means

I started writing up my study notes on another blog during my time waiting at home, but the TeX support there was terrible, so I am moving some of the articles to Qiita.

I studied k-means, arguably the most basic algorithm for unsupervised learning [citation needed], and gave it a quick try.

k-means

k-means is one of the methods for separating given data into multiple clusters.

Advantages, etc.

Advantages

Since each point is assigned by its distance from the cluster centers, plotting the result almost always matches intuition.

Disadvantages

As the experiment below shows, it only works well when the clusters are roughly linearly separable.

algorithm

given

Data $x_i$ $(i = 1, \dots, N)$ and the number of clusters $k$.

Method of calculation

  1. Randomly assign a cluster $k_i$ to every data point $x_i$, giving pairs $(x_i, k_i)$.
  2. Compute the center $m_k$ of each cluster from $m_k |N_k| = \sum_{i \in N_k} x_i$.
  3. For every data point, compute the distance to every cluster center and reassign the point to the nearest one.
  4. Repeat steps 2-3 until no assignment changes.

Where $ N_k $ is the set of data contained in the cluster $ k $ at that time.
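As a minimal illustration of steps 2-3 above, one iteration can be written in NumPy as follows (a sketch with illustrative names only; it assumes every cluster is non-empty and is separate from the pandas implementation used later):

import numpy as np

def kmeans_step(data, labels, k):
  # step 2: center of each cluster, m_k = (1/|N_k|) * sum of the x_i assigned to it
  centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
  # step 3: squared distance of every point to every center, then take the nearest
  dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
  return dists.argmin(axis=1)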

Example of use

The experiment uses data generated with sklearn's datasets.make_circles. The settings are chosen so that two clusters are obtained: a circle of radius 0.1 and a circle of radius 1, each with noise 0.01.
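The exact parameter values are not reproduced here, but a call consistent with the description above (an inner circle of radius 0.1, an outer circle of radius 1, and noise 0.01; n_samples and random_state are my own choices) would look like this:

from sklearn.datasets import make_circles

# factor is the ratio of the inner radius to the outer radius (which is 1)
data, _ = make_circles(n_samples=200, factor=0.1, noise=0.01, random_state=0)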

code

import numpy as np
import pandas as pd

# k: number of clusters
def kmeans(data, k):
  df = pd.DataFrame()
  df['x'] = data.T[0]
  df['y'] = data.T[1]
  # start from a random cluster assignment
  clusters = np.random.randint(0, k, len(df))
  df['c'] = clusters

  while True:
    before_clusters = list(df.c.copy())
    # center of each cluster = mean of its members
    centers = df.groupby('c').mean().reset_index().drop(columns=['c']).T

    # reassign every point to the nearest cluster center
    tmp = df.drop(columns=['c']).T
    new_clusters = [min(centers.columns, key=lambda c: ((tmp[col] - centers[c]) ** 2).sum()) for col in tmp.columns]
    if before_clusters == new_clusters:
      break
    else:
      df['c'] = new_clusters

  return df.c
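For example, running it on the make_circles data above and plotting the result (the plotting part is just a sketch):

import matplotlib.pyplot as plt

labels = kmeans(data, 2)
plt.scatter(data.T[0], data.T[1], c=labels)
plt.show()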

Experimental result

The result of running k-means is shown below. Since the data are clearly not linearly separable, the clusters are not recovered as one would hope. (figure: k-means.png)

kernel k-means

Advantages, etc.

Advantages

With a suitable kernel, it can separate clusters that are not linearly separable.

Disadvantages

The result depends heavily on the kernel and its parameters (see the discussion below).

algorithm

given

Data $x_i$ $(i = 1, \dots, N)$, the number of clusters $k$, and a kernel function $K(x, y)$.

Method of calculation

Almost the same as ordinary k-means; the only difference is that the cluster center is taken in the feature space. That is, for cluster $k$, we define its center as the vector $m_k$ that minimizes the following quantity.

\displaystyle{
\begin{aligned}
\sum_{i \in N_k} (\phi(x_i) - m_k)^2.
\end{aligned}
}

The $m_k$ that minimizes this is obtained as follows.

\displaystyle{
\begin{aligned}
m_k = \frac{1}{|N_k|} \sum_{i \in N_k} \phi(x_i).
\end{aligned}
}

Using this, the squared distance between a data point $x_i$ and the center of cluster $k$ can be written entirely in terms of the kernel:

\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 &= K(x_i, x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} K(x_i, x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} K(x_m, x_n)
\end{aligned}
}

Where $ K (x, y) $ is a kernel function.
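For reference, this follows by expanding the square and substituting the expression for $m_k$; every inner product $\phi(x) \cdot \phi(y)$ is then replaced by $K(x, y)$, so $\phi$ itself never has to be computed explicitly:

\displaystyle{
\begin{aligned}
(\phi(x_i) - m_k)^2 &= \phi(x_i) \cdot \phi(x_i) - \frac{2}{|N_k|} \sum_{j \in N_k} \phi(x_i) \cdot \phi(x_j) + \frac{1}{|N_k|^2} \sum_{m, n \in N_k} \phi(x_m) \cdot \phi(x_n).
\end{aligned}
}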

Example of use

The experiment again uses data generated with sklearn's datasets.make_circles, with the same settings as before (circles of radius 0.1 and 1, noise 0.01).

The kernel function uses the following RBF kernel.

\displaystyle{
\begin{aligned}
K(x, y) = \exp \left(-\frac{(x-y)^2}{\sigma^2} \right)
\end{aligned}
}
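A minimal sketch of this kernel in the two-argument form expected by the code below (the helper name make_rbf_kernel and fixing $\sigma$ via a closure are my own choices):

import numpy as np

# returns K(x, y) = exp(-(x - y)^2 / sigma^2) as a function of two points
def make_rbf_kernel(sigma):
  def kernel(x, y):
    d = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(d, d) / sigma ** 2)
  return kernel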

code

# kernel: kernel function that takes two data points
def kernel_kmeans(data, k, kernel):
  df = pd.DataFrame()
  df['x'] = data.T[0]
  df['y'] = data.T[1]
  clusters = np.random.randint(0, k, len(df))
  df['c'] = clusters

  self_corr = {}
  # the within-cluster term is expensive to recompute for every point, so cache it once per iteration
  def compute_self_corr():
    for j in range(0, k):
      sub_df = df[df.c == j].drop(columns=['c']).T
      ret = sub_df.apply(lambda col1: sub_df.apply(lambda col2: kernel(col1, col2), axis=0).mean(), axis=0).mean()
      self_corr[j] = ret

  # squared distance from x to the center of cluster j in feature space
  def compute_distance_from_center_j(x, j):
    sub_df = df[df.c == j].drop(columns=['c']).T
    ret = kernel(x, x)
    ret += sub_df.apply(lambda col: -2 * kernel(x, col), axis=0).mean()
    ret += self_corr[j]
    return ret

  while True:
    compute_self_corr()
    before_clusters = list(df.c.copy())
    tmp = df.drop(columns=['c']).T
    new_clusters = [min(range(0, k), key=lambda j: compute_distance_from_center_j(tmp[col], j)) for col in tmp.columns]
    if before_clusters == new_clusters:
      break
    else:
      df['c'] = new_clusters
  return df.c
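Usage is analogous to plain k-means; for example, with the make_rbf_kernel sketch above and matplotlib as before:

labels = kernel_kmeans(data, 2, make_rbf_kernel(0.5))
plt.scatter(data.T[0], data.T[1], c=labels)
plt.show()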

Experimental result

The result of running kernel k-means is shown below, with kernel parameter $\sigma = 0.5$. (figure: kernel-k-means-sig0.5.png)

When kernel k-means works (and when it does not)

Consider the following situation: data points are distributed uniformly on two concentric circles of radii $r$ and $R$ ($r < R$) centered at the origin, and each circle is correctly assigned to its own cluster, $c_r$ and $c_R$ respectively.

Under this situation, a new data point $D = (x, y) = (R, 0)$ is added and classified. Writing the squared distance from a point to the center of class $c$ as $C(c)$, the distances to the classes $c_r$ and $c_R$ are as follows.

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}

Here $I_\alpha(x)$ is the modified Bessel function of the first kind. In the regime $r \ll \sigma \ll R$, these become

\displaystyle{
\begin{aligned}
C(c_r) &= 2, \\
C(c_R) &= 1 - \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

Clearly $C(c_R) \lt C(c_r)$, so the point is classified into the correct cluster. Next, for $\sigma \ll r, R$:

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - 2 \sqrt{\frac{\sigma^2}{2\pi r R}} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt{\frac{\sigma^2}{2\pi r^2}}, \\
C(c_R) &= 1 - \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

Examining where the two are equal, the only solution is $r = R$; otherwise $C(c_r) \lt C(c_R)$, i.e. the point is misclassified. The important point here is that this holds regardless of the relative size of $\sigma$ and $\delta r := |R - r|$. Finally, when $r, R \ll \sigma$:

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}

Clearly the point cannot be classified correctly. Moreover, expanding to the next order in $r/\sigma$ and $R/\sigma$ gives the same expression as ordinary k-means, so correct classification cannot be expected from it either.

Similarly, consider adding a data point $d = (x, y) = (r, 0)$. The distances to the class centers are:

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \exp \left(-\frac{2r^2}{\sigma^2}\right) I_0 \left(\frac{2r^2}{\sigma^2}\right), \\
C(c_R) &= 1 - 2 \exp \left(-\frac{r^2 + R^2}{\sigma^2}\right) I_0 \left(\frac{2rR}{\sigma^2}\right) + \exp \left(-\frac{2R^2}{\sigma^2}\right) I_0 \left(\frac{2R^2}{\sigma^2}\right).
\end{aligned}
}

In the regime $r \ll \sigma \ll R$, these become

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 1 + \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

Clearly $C(c_r) \lt C(c_R)$, so the point is classified into the correct cluster. Next, for $\sigma \ll r, R$:

\displaystyle{
\begin{aligned}
C(c_r) &= 1 - \sqrt{\frac{\sigma^2}{2\pi r^2}}, \\
C(c_R) &= 1 - 2 \sqrt{\frac{\sigma^2}{2\pi r R}} \exp \left(-\frac{(r - R)^2}{\sigma^2}\right) + \sqrt{\frac{\sigma^2}{2\pi R^2}}.
\end{aligned}
}

In this case as well, examining where the two are equal, the only solution is $r = R$; otherwise $C(c_r) \lt C(c_R)$ and the classification is incorrect. Finally, when $r, R \ll \sigma$:

\displaystyle{
\begin{aligned}
C(c_r) &= 0, \\
C(c_R) &= 0.
\end{aligned}
}

Clearly the point cannot be classified correctly. As before, expanding to the next order in $r/\sigma$ and $R/\sigma$ gives the same expression as ordinary k-means.

As these results show, the classification works well when the resolution of the kernel function, $\sigma$, lies between the scales of the two clusters. My impression is that the RBF kernel cannot see direction, only scale: it mainly tells whether distances are larger or smaller than the given scale $\sigma$. So unless the kernel scale sits between the scales of the two classes, it does not behave as expected.

Let us check this with a concrete numerical experiment, varying $\sigma$ with the same parameters as before, $r = 0.1, R = 1$. In particular, let us look at the regimes $\sigma \ll r, R$, then $r \ll \sigma \ll R$, then $r, R \ll \sigma$.
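The sweep itself just reuses the functions defined above, for example (a sketch; the figure handling is my own choice):

import matplotlib.pyplot as plt

for sigma in [0.01, 0.5, 2.0]:
  labels = kernel_kmeans(data, 2, make_rbf_kernel(sigma))
  plt.figure()
  plt.title('sigma = {}'.format(sigma))
  plt.scatter(data.T[0], data.T[1], c=labels)
plt.show()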

$\sigma = 0.01,\ r = 0.1,\ R = 1$ (figure: kernel-k-means-sig0.01.png)

$\sigma = 0.5,\ r = 0.1,\ R = 1$ (figure: kernel-k-means-sig0.5.png)

$\sigma = 2,\ r = 0.1,\ R = 1$ (figure: kernel-k-means-sig2.0.png)

Impressions

kernel k-means depends heavily on the kernel and its parameters, so it is not exactly easy to use. Then again, ordinary k-means only works on linearly separable data, so it is not easy to use either...

reference

- Hajipata
- Castella book
