[Python] I tried to classify MNIST digits by unsupervised learning [PCA, t-SNE, k-means]

Introduction

Unsupervised learning is generally less accurate than supervised learning, but it offers many benefits in return. Specifically, unsupervised learning is useful in situations such as:

- Data whose patterns are not well understood
- Data that changes over time
- Unlabeled data

Unsupervised learning learns the structure behind the data from the data itself. This lets you take advantage of the abundance of unlabeled data, which may open the way to new applications.

In this article, I will introduce an implementation example of classifying MNIST data by unsupervised learning. The methods used are principal component analysis (PCA), t-SNE, and k-means. In a follow-up article, I would like to take on classification using an autoencoder.

What to do in this article

- Classify MNIST data by unsupervised learning
- Implement and evaluate classification by PCA + k-means
- Implement and evaluate classification by PCA + t-SNE + k-means

Library import

python


import random
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.manifold import TSNE

from tensorflow import keras  # needed below for keras.datasets.mnist

Data preparation

python


mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# Normalize the pixel values to the [0, 1] range
train_images = (train_images - train_images.min()) / (train_images.max() - train_images.min())
test_images = (test_images - test_images.min()) / (test_images.max() - test_images.min())

train_images.shape,test_images.shape

Let's visualize a few samples first. These are the familiar digits.

python


# Visualize 20 randomly chosen training images
fig, ax = plt.subplots(4,5,figsize=[8,6])
ax_f = ax.flatten()
for ax_i in ax_f:
    idx = random.choice(range(train_images.shape[0]))
    ax_i.imshow(train_images[idx,:,:])
    ax_i.grid(False)
    ax_i.tick_params(labelbottom=False,labelleft=False)

[Figure: random sample of MNIST training images]

Classification of images by principal component analysis

The MNIST images are 28 * 28 = 784-dimensional data. The k-means method could be applied to them as they are, but it is effective to reduce the dimensionality before clustering. Doing so reduces the amount of computation.

Principal component analysis (PCA) is used as the dimensionality reduction method. The k-means method is then applied to the PCA-reduced data to classify it.

Perform principal component analysis

python


df_train = pd.DataFrame(train_images.reshape(train_images.shape[0],28*28))

pca = PCA()
pca.fit(df_train)
feature = pca.transform(df_train)

#Visualization in two dimensions
plt.scatter(feature[:,0],feature[:,1],alpha=0.8,c=train_labels)

Here is the visualization of the first and second principal components, color-coded by the 10 digit labels from 0 to 9. It looks like they could be separated to some extent, but there are many overlapping regions.

[Figure: scatter plot of the first two principal components, colored by digit label]

Next, let's consider how many components to use for k-means. To do this, the explained variance ratio of each component is visualized.

python


# Cumulative explained variance ratio of the principal components
ev_ratio = pca.explained_variance_ratio_
ev_ratio = np.hstack([0, ev_ratio.cumsum()])

df_ratio = pd.DataFrame({"components":range(len(ev_ratio)), "ratio":ev_ratio})

plt.plot(ev_ratio)
plt.xlabel("components")
plt.ylabel("explained variance ratio")
plt.xlim([0,50])

plt.scatter(range(len(ev_ratio)),ev_ratio)

Looking at the result, we can see that 90% or more of the variance is recovered by the first 100 components. Through some experimentation (using the evaluation method described below), 10 components turned out to be enough for classification. Based on this, let's classify by applying the k-means method to the first 10 components. The fewer the dimensions, the lower the computational cost, so fewer is preferable.
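As a side note, the number of components needed to reach a given cumulative explained variance ratio can also be read off programmatically. Here is a minimal sketch that reuses the `ev_ratio` array computed above; the 90% threshold is just an example.

python


# Smallest number of components whose cumulative explained variance ratio
# reaches 90% (ev_ratio[k] holds the cumulative ratio of the first k components)
n_components_90 = int(np.argmax(ev_ratio >= 0.9))
print(n_components_90)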

Classification by k-means method

First, apply the k-means method. Since we know there are 10 digit classes, let's start by classifying into 10 clusters.

python


KM = KMeans(n_clusters=10)
result = KM.fit(feature[:, :10])  # cluster the first 10 principal components

Evaluation of classification results

First, the classification result is displayed as a confusion matrix.

python


df_eval = pd.DataFrame(confusion_matrix(train_labels, result.labels_))
df_eval.columns = df_eval.idxmax()   # name each cluster after its most frequent true label
df_eval = df_eval.sort_index(axis=1)

df_eval

Each cluster is assigned, as its predicted label, the true label that appears most often within it. For example, if cluster 0 contains one hundred 0s, twenty 1s, and ten 4s, that cluster predicts 0.
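For reference, here is a minimal sketch of that majority-vote assignment and the resulting accuracy, written against the `df_eval` confusion matrix above (rows are true labels, columns are clusters):

python


# Majority true label and its count for each cluster (column)
majority_label = df_eval.idxmax()   # predicted label per cluster
majority_count = df_eval.max()      # number of samples of that label per cluster

# Accuracy if every sample is assigned its cluster's majority label
accuracy = majority_count.sum() / df_eval.values.sum()
print(majority_label)
print(accuracy)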

Looking at the result, you can see that two clusters are predicted to be 0, so not every digit ends up with a cluster of its own. In other words, it seems that principal component analysis and k-means alone had difficulty separating the data into the 10 digits.

[Figure: confusion matrix for PCA + k-means with 10 clusters]

Now, let's think about how many clusters are best. Obviously, with many clusters each cluster becomes more homogeneous, but the result is harder to interpret. Conversely, with few clusters the result is easy to interpret, but each cluster will contain a mixture of different digits.

So we want a small number of clusters that still groups similar data together as much as possible. Since the correct labels are available here, each cluster is assigned the label that appears most often within it, and we look for the number of clusters that maximizes the number of correctly labeled samples.

The evaluation metric is therefore: number of correctly labeled samples / total number of samples.

python


# Evaluate the accuracy for different numbers of clusters, treating the
# majority label in each cluster as that cluster's prediction
eval_acc_list = []

for i in range(5,15):
    KM = KMeans(n_clusters=i)
    result = KM.fit(feature[:, :10])
    df_eval = pd.DataFrame(confusion_matrix(train_labels, result.labels_))
    eval_acc = df_eval.max().sum() / df_eval.sum().sum()
    eval_acc_list.append(eval_acc)

plt.plot(range(5,15), eval_acc_list)
plt.xlabel("Number of clusters")
plt.ylabel("accuracy")

This is the result when the number of clusters is varied from 5 to 14. As the number of clusters increases, the clusters become more homogeneous and the accuracy increases. Which is best depends on the purpose, but about 10 clusters seems to be a good choice considering interpretability.

In other words, it was difficult to separate the data into 10 labels with PCA alone. So next, let's combine it with a method called t-SNE.

Classification by PCA + t-SNE

Execution of t-SNE

It was difficult to classify into 10 labels with PCA alone, so let's try combining PCA and t-SNE. t-SNE is a method whose principle seems to be difficult (I don't fully understand it myself); I'll list sites with detailed explanations in the references.

Since t-SNE is computationally expensive, it is applied to 10,000 samples of the 10-dimensional data reduced by PCA. Looking at the visualization, the digits seem to be reasonably well separated.

python


tsne = TSNE(n_components=2).fit_transform(feature[:10000, :10])

# Visualize the 2-D t-SNE embedding, colored by the true digit label
for i in range(10):
    idx = np.where(train_labels[:10000] == i)
    plt.scatter(tsne[idx, 0], tsne[idx, 1], label=i)
plt.legend(loc='upper left', bbox_to_anchor=(1.05, 1))

[Figure: t-SNE embedding of the first 10,000 samples, colored by digit label]

Classification and evaluation by k-means

Next, classify by the k-means method and display the confusion matrix.

python


# Apply k-means to the 2-D t-SNE embedding
KM = KMeans(n_clusters=10)
result = KM.fit(tsne)

df_eval = pd.DataFrame(confusion_matrix(train_labels[:10000], result.labels_))
df_eval.columns = df_eval.idxmax()   # name each cluster after its most frequent true label
df_eval = df_eval.sort_index(axis=1)

df_eval

[Figure: confusion matrix for k-means on the t-SNE embedding]

Perhaps partly by luck, this time the data is neatly divided into 10 labels. Looking at the table, you can see that "4" and "9" are often confused with each other. In the scatter plot of the t-SNE result as well, 4 and 9 sit close to each other.

In other words, this method learns 4 and 9 as being similar. I don't know why it considers them similar, but it's interesting.
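As a quick check, the samples behind this 4/9 confusion can be pulled out and visualized. This is a minimal sketch using the same variables as above (the `cluster_to_label` and `confused_idx` names are introduced here just for illustration): it shows a few images whose true label is 4 but whose cluster's majority label is 9.

python


# Majority (predicted) label for each cluster id of the t-SNE + k-means result
cluster_to_label = pd.DataFrame(
    confusion_matrix(train_labels[:10000], result.labels_)
).idxmax()

# Samples whose true label is 4 but whose cluster is predicted as 9
mask = (train_labels[:10000] == 4) & (cluster_to_label[result.labels_].values == 9)
confused_idx = np.where(mask)[0]

# Show a few of them
fig, ax = plt.subplots(1, 5, figsize=[8, 2])
for ax_i, idx in zip(ax, confused_idx[:5]):
    ax_i.imshow(train_images[idx, :, :])
    ax_i.axis("off")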

Finally, the accuracy is evaluated for each number of clusters. With 10 clusters, the accuracy is about 0.6, which is a little higher than with PCA alone.

python


# Evaluate the accuracy for different numbers of clusters on the t-SNE embedding,
# treating the majority label in each cluster as that cluster's prediction
eval_acc_list = []

for i in range(5,15):
    KM = KMeans(n_clusters=i)
    result = KM.fit(tsne)
    df_eval = pd.DataFrame(confusion_matrix(train_labels[:10000], result.labels_))
    eval_acc = df_eval.max().sum() / df_eval.sum().sum()
    eval_acc_list.append(eval_acc)

plt.plot(range(5,15), eval_acc_list)
plt.xlabel("Number of clusters")
plt.ylabel("accuracy")

[Figure: accuracy vs. number of clusters for k-means on the t-SNE embedding]

At the end

Using MNIST as the subject, we implemented dimensionality reduction and visualization with PCA and t-SNE, and classification and evaluation with k-means. Unlabeled data is abundant in the real world, so these seem like very useful methods.

If you found this article helpful, an LGTM would be much appreciated.

References

Book: ["Learning without teacher by python"](url https://www.amazon.co.jp/Python%E3%81%A7%E3%81%AF%E3%81%98%E3%82%81 % E3% 82% 8B% E6% 95% 99% E5% B8% AB% E3% 81% AA% E3% 81% 97% E5% AD% A6% E7% BF% 92-% E2% 80% 95% E6% A9% 9F% E6% A2% B0% E5% AD% A6% E7% BF% 92% E3% 81% AE% E5% 8F% AF% E8% 83% BD% E6% 80% A7% E3% 82% 92% E5% BA% 83% E3% 81% 92% E3% 82% 8B% E3% 83% A9% E3% 83% 99% E3% 83% AB% E3% 81% AA% E3% 81% 97% E3% 83% 87% E3% 83% BC% E3% 82% BF% E3% 81% AE% E5% 88% A9% E7% 94% A8-Ankur-Patel / dp / 4873119103)

Introduction to dimensionality reduction with t-SNE: https://blog.albert2005.co.jp/2015/12/02/tsne/

Understanding t-SNE for better visualization: https://qiita.com/g-k/items/120f1cf85ff2ceae4aba
