[Python] I tried to classify MNIST digits by unsupervised learning [PCA, t-SNE, k-means]

Introduction

Unsupervised learning is generally less accurate than supervised learning, but it offers many benefits in return. Specifically, unsupervised learning is useful in situations such as:

- Data whose patterns are not well understood
- Data that changes over time
- Unlabeled data

Unsupervised learning learns the structure behind the data from the data itself. This lets you take advantage of the abundance of unlabeled data, which may open the way to new applications.

In this article, I will introduce an implementation example of classifying MNIST data by unsupervised learning. The methods used are principal component analysis (PCA), t-SNE, and k-means. In a follow-up article, I would like to take on classification using an autoencoder.

What to do in this article

- Classify MNIST data by unsupervised learning
- Implement and evaluate classification by PCA + k-means
- Implement and evaluate classification by PCA + t-SNE + k-means

Library import

python


import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.manifold import TSNE

from tensorflow import keras  # needed below for keras.datasets.mnist

Data preparation

python


mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Normalize the pixel values to the [0, 1] range
train_images = (train_images - train_images.min()) / (train_images.max() - train_images.min())
test_images = (test_images - test_images.min()) / (test_images.max() - test_images.min())

train_images.shape,test_images.shape

Let's visualize a few samples first. These are the familiar digits.

python


# Visualize 20 randomly chosen training images
fig, ax = plt.subplots(4,5,figsize=[8,6])
ax_f = ax.flatten()
for ax_i in ax_f:
    idx = random.choice(range(train_images.shape[0]))
    ax_i.imshow(train_images[idx,:,:])
    ax_i.grid(False)
    ax_i.tick_params(labelbottom=False,labelleft=False)

[Figure: random sample of MNIST training images]

Classification of images by principal component analysis

The MNIST images are 28 * 28 = 784-dimensional data. The k-means method could be applied to them as they are, but it is effective to reduce the dimensionality before clustering. Doing so reduces the amount of computation.

Principal component analysis (PCA) is used as the dimensionality reduction method. The k-means method is then applied to the PCA-reduced data to classify it.

Perform principal component analysis

python


df_train = pd.DataFrame(train_images.reshape(train_images.shape[0],28*28))

pca = PCA()
pca.fit(df_train)
feature = pca.transform(df_train)

#Visualization in two dimensions
plt.scatter(feature[:,0],feature[:,1],alpha=0.8,c=train_labels)

Here is the visualization of the first and second principal components, color-coded by the 10 digit labels from 0 to 9. It looks like they could be separated to some extent, but there are many overlapping regions.

[Figure: scatter plot of the first two principal components, colored by digit label]

Next, let's consider how many components to use for k-means. To do this, the explained variance ratio of each component is visualized.

python


# Cumulative explained variance ratio of the principal components
ev_ratio = pca.explained_variance_ratio_
ev_ratio = np.hstack([0, ev_ratio.cumsum()])

df_ratio = pd.DataFrame({"components":range(len(ev_ratio)), "ratio":ev_ratio})

plt.plot(ev_ratio)
plt.xlabel("components")
plt.ylabel("explained variance ratio")
plt.xlim([0,50])

plt.scatter(range(len(ev_ratio)),ev_ratio)

Looking at the result, we can see that 90% or more of the variance is recovered by the first 100 components. Through some experimentation (using the evaluation method described below), 10 components turned out to be enough for classification. Based on this, let's classify by applying the k-means method to the first 10 components. The fewer the dimensions, the lower the computational cost, so fewer is preferable.
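As a side note, the number of components needed to reach a given cumulative explained variance ratio can also be read off programmatically. Here is a minimal sketch that reuses the `ev_ratio` array computed above; the 90% threshold is just an example.

python


# Smallest number of components whose cumulative explained variance ratio
# reaches 90% (ev_ratio[k] holds the cumulative ratio of the first k components)
n_components_90 = int(np.argmax(ev_ratio >= 0.9))
print(n_components_90)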

Classification by k-means method

First, apply the k-means method. Since we know there are 10 digit classes, let's start by classifying into 10 clusters.

python


KM = KMeans(n_clusters=10)
result = KM.fit(feature[:, :10])  # cluster the first 10 principal components

Evaluation of classification results

First, the classification result is displayed as a confusion matrix.

python


df_eval = pd.DataFrame(confusion_matrix(train_labels, result.labels_))
df_eval.columns = df_eval.idxmax()   # name each cluster after its most frequent true label
df_eval = df_eval.sort_index(axis=1)

df_eval

Each cluster is assigned, as its predicted label, the true label that appears most often within it. For example, if cluster 0 contains one hundred 0s, twenty 1s, and ten 4s, that cluster predicts 0.
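For reference, here is a minimal sketch of that majority-vote assignment and the resulting accuracy, written against the `df_eval` confusion matrix above (rows are true labels, columns are clusters):

python


# Majority true label and its count for each cluster (column)
majority_label = df_eval.idxmax()   # predicted label per cluster
majority_count = df_eval.max()      # number of samples of that label per cluster

# Accuracy if every sample is assigned its cluster's majority label
accuracy = majority_count.sum() / df_eval.values.sum()
print(majority_label)
print(accuracy)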

Looking at the result, you can see that two clusters are predicted to be 0, so not every digit ends up with a cluster of its own. In other words, it seems that principal component analysis and k-means alone had difficulty separating the data into the 10 digits.

[Figure: confusion matrix for PCA + k-means with 10 clusters]

Now, let's think about how many clusters are best. Obviously, with many clusters each cluster becomes more homogeneous, but the result is harder to interpret. Conversely, with few clusters the result is easy to interpret, but each cluster will contain a mixture of different digits.

So we want a small number of clusters that still groups similar data together as much as possible. Since the correct labels are available here, each cluster is assigned the label that appears most often within it, and we look for the number of clusters that maximizes the number of correctly labeled samples.

The evaluation metric is therefore: number of correctly labeled samples / total number of samples.

python


# Evaluate the accuracy for different numbers of clusters, treating the
# majority label in each cluster as that cluster's prediction
eval_acc_list = []

for i in range(5,15):
    KM = KMeans(n_clusters=i)
    result = KM.fit(feature[:, :10])
    df_eval = pd.DataFrame(confusion_matrix(train_labels, result.labels_))
    eval_acc = df_eval.max().sum() / df_eval.sum().sum()
    eval_acc_list.append(eval_acc)

plt.plot(range(5,15), eval_acc_list)
plt.xlabel("Number of clusters")
plt.ylabel("accuracy")

This is the result when the number of clusters is varied from 5 to 14. As the number of clusters increases, the clusters become more homogeneous and the accuracy increases. Which is best depends on the purpose, but about 10 clusters seems to be a good choice considering interpretability.

In other words, it was difficult to separate the data into 10 labels with PCA alone. So next, let's combine it with a method called t-SNE.

Classification by PCA + t-SNE

Execution of t-SNE

It was difficult to classify into 10 labels with PCA alone, so let's try combining PCA and t-SNE. t-SNE is a method whose principle seems to be difficult (I don't fully understand it myself); I'll list sites with detailed explanations in the references.

Since t-SNE is computationally expensive, it is applied to 10,000 samples of the 10-dimensional data reduced by PCA. Looking at the visualization, the digits seem to be reasonably well separated.

python


tsne = TSNE(n_components=2).fit_transform(feature[:10000, :10])

# Visualize the 2-D t-SNE embedding, colored by the true digit label
for i in range(10):
    idx = np.where(train_labels[:10000] == i)
    plt.scatter(tsne[idx, 0], tsne[idx, 1], label=i)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))

[Figure: t-SNE embedding of the first 10,000 samples, colored by digit label]

Classification and evaluation by k-means

Next, classify by the k-means method and display the confusion matrix.

python


# Apply k-means to the 2-D t-SNE embedding
KM = KMeans(n_clusters=10)
result = KM.fit(tsne)

df_eval = pd.DataFrame(confusion_matrix(train_labels[:10000], result.labels_))
df_eval.columns = df_eval.idxmax()   # name each cluster after its most frequent true label
df_eval = df_eval.sort_index(axis=1)

df_eval

[Figure: confusion matrix for k-means on the t-SNE embedding]

Perhaps partly by luck, this time the data is neatly divided into 10 labels. Looking at the table, you can see that "4" and "9" are often confused with each other. In the scatter plot of the t-SNE result as well, 4 and 9 sit close to each other.

In other words, this method learns 4 and 9 as being similar. I don't know why it considers them similar, but it's interesting.
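As a quick check, the samples behind this 4/9 confusion can be pulled out and visualized. This is a minimal sketch using the same variables as above (the `cluster_to_label` and `confused_idx` names are introduced here just for illustration): it shows a few images whose true label is 4 but whose cluster's majority label is 9.

python


# Majority (predicted) label for each cluster id of the t-SNE + k-means result
cluster_to_label = pd.DataFrame(
    confusion_matrix(train_labels[:10000], result.labels_)
).idxmax()

# Samples whose true label is 4 but whose cluster is predicted as 9
mask = (train_labels[:10000] == 4) & (cluster_to_label[result.labels_].values == 9)
confused_idx = np.where(mask)[0]

# Show a few of them
fig, ax = plt.subplots(1, 5, figsize=[8, 2])
for ax_i, idx in zip(ax, confused_idx[:5]):
    ax_i.imshow(train_images[idx, :, :])
    ax_i.axis("off")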

Finally, the accuracy is evaluated for each number of clusters. With 10 clusters, the accuracy is about 0.6, which is a little higher than with PCA alone.

python


# Evaluate the accuracy for different numbers of clusters on the t-SNE embedding,
# treating the majority label in each cluster as that cluster's prediction
eval_acc_list = []

for i in range(5,15):
    KM = KMeans(n_clusters=i)
    result = KM.fit(tsne)
    df_eval = pd.DataFrame(confusion_matrix(train_labels[:10000], result.labels_))
    eval_acc = df_eval.max().sum() / df_eval.sum().sum()
    eval_acc_list.append(eval_acc)

plt.plot(range(5,15), eval_acc_list)
plt.xlabel("Number of clusters")
plt.ylabel("accuracy")

[Figure: accuracy vs. number of clusters for k-means on the t-SNE embedding]

At the end

Using MNIST as the subject, we implemented dimensionality reduction and visualization with PCA and t-SNE, and classification and evaluation with k-means. Unlabeled data is abundant in the real world, so these seem like very useful methods.

If you found this article helpful, an LGTM would be much appreciated.

References

Book: ["Learning without teacher by python"](url https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81 % E3% 82% 8B% E6% 95% 99% E5% B8% AB% E3% 81% AA% E3% 81% 97% E5% AD% A6% E7% BF% 92-% E2% 80% 95% E6% A9% 9F% E6% A2% B0% E5% AD% A6% E7% BF% 92% E3% 81% AE% E5% 8F% AF% E8% 83% BD% E6% 80% A7% E3% 82% 92% E5% BA% 83% E3% 81% 92% E3% 82% 8B% E3% 83% A9% E3% 83% 99% E3% 83% AB% E3% 81% AA% E3% 81% 97% E3% 83% 87% E3% 83% BC% E3% 82% BF% E3% 81% AE% E5% 88% A9% E7% 94% A8-Ankur-Patel / dp / 4873119103)

Introduction to dimensionality reduction with t-SNE: https://blog.albert2005.co.jp/2015/12/02/tsne/

Understanding t-SNE for better visualization: https://qiita.com/g-k/items/120f1cf85ff2ceae4aba
