[PYTHON] Machine Learning: Image Recognition on MNIST Using PCA and Gaussian Naive Bayes

Hello. I would like to record here on Qiita, as a memorandum, what I learned in a university class. I have posted the sample code on GitHub; I'm not sure how helpful it will be, but please take a look if you are interested. https://github.com/tkshim/MNIST/blob/master/BayesMNIST.py

  1. Contents: PCA (Principal Component Analysis) is used to reduce the dimensionality of the feature vectors, and a Gaussian naive Bayes classifier is used to perform image recognition of handwritten digits.

  2. Purpose: Share a code example (Python) of applying machine learning.

  3. Target audience: People who understand the basic theory of machine learning and would like to see a sample of how others implement it in code.

  4. Environment: MacBook Air ・ OSX 10.11.6 ・ Python 3.x ・ Numpy ・ Pandas ・ Sklearn

  5. Summary: Using sklearn's built-in GaussianNB, I was able to reach an accuracy in the upper 80% range.

■step1 The dataset to be analyzed is downloaded from the homepage of Professor Yann LeCun of New York University: http://yann.lecun.com/exdb/mnist/index.html Set the variables for the data location according to your environment.

DATA_PATH = '/Users/takeshi/MNIST_data'
TRAIN_IMG_NAME = 'train-images.idx3-ubyte'
TRAIN_LBL_NAME = 'train-labels.idx1-ubyte'
TEST_IMG_NAME = 't10k-images.idx3-ubyte'
TEST_LBL_NAME = 't10k-labels.idx1-ubyte'

■step2, step3 Read the training and test datasets into numpy arrays (see the loading sketch below). You can then use imshow to check which digit each sample represents, as shown below.
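
The full loading code is in the GitHub repository; as a minimal sketch, assuming the raw IDX file format described on the MNIST page (16-byte header for image files, 8-byte header for label files), it could look like this. The helper names load_idx_images and load_idx_labels are mine; labels are returned as column vectors so that the later np.vstack works.

import os
import numpy as np

def load_idx_images(path):
    # IDX image files: 16-byte header (magic, count, rows, cols), then pixels
    with open(path, 'rb') as f:
        raw = np.frombuffer(f.read(), dtype=np.uint8)
    n = int.from_bytes(raw[4:8].tobytes(), 'big')
    return raw[16:].reshape(n, 28 * 28).astype(np.float64)

def load_idx_labels(path):
    # IDX label files: 8-byte header (magic, count), then one byte per label
    with open(path, 'rb') as f:
        raw = np.frombuffer(f.read(), dtype=np.uint8)
    return raw[8:].reshape(-1, 1).astype(np.int64)

Xtr = load_idx_images(os.path.join(DATA_PATH, TRAIN_IMG_NAME))
Ttr = load_idx_labels(os.path.join(DATA_PATH, TRAIN_LBL_NAME))
Xte = load_idx_images(os.path.join(DATA_PATH, TEST_IMG_NAME))
Tte = load_idx_labels(os.path.join(DATA_PATH, TEST_LBL_NAME))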

print("The shape of matrix is : ", Xtr.shape)
print("Label is : ", Ttr.shape)
plt.imshow(Xte[0].reshape(28, 28),interpolation='None', cmap=cm.gray)
show()

digit7.png

■step4 This is the heart of PCA. Each image is represented by 28x28 = 784 pixel values. We compute the covariance matrix of the centered data and solve the eigenvalue problem for these 784 features using numpy's eigh function.

X = np.vstack((Xtr, Xte))   # all 70,000 images, one row per 784-pixel image
T = np.vstack((Ttr, Tte))   # the corresponding labels
print(X.shape)
print(T.shape)

import numpy.linalg as LA

# Center the data and compute the covariance matrix of the 784 features
μ = np.mean(X, axis=0)
Z = X - μ
C = np.cov(Z, rowvar=False)

# eigh returns eigenvalues in ascending order, with eigenvectors as columns
λ, V = LA.eigh(C)

# Sanity check: for an eigenpair, C @ v equals λ * v,
# so this elementwise ratio should be a vector of ones
v = V[:, 0]
ratio = np.dot(C, v) / (λ[0] * v)

# Reorder so the largest eigenvalues come first; after this,
# the rows of V are the eigenvectors (principal components)
λ = np.flipud(λ)
V = np.flipud(V.T)

# Project the centered data onto the principal components
P = np.dot(Z, V.T)
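
As a quick side check (not in the original post), the eigenvalues themselves show how much of the total variance the leading components capture:

variance_ratio = λ / λ.sum()            # variance fraction per component
cumulative = np.cumsum(variance_ratio)
print(cumulative[1], cumulative[69])    # variance captured by the first 2 and first 70 components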

■step5 There are 784 eigenvectors (principal components), but instead of using all of them we can keep only a few. For example, using just the first two (= dimensionality reduction) and feeding those two component scores to GaussianNB completes the recognition model.

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# Fit the model on the training portion of the projected data
# A: the number of training samples
# B: the number of dimensions (principal components) to use
A = 60000
B = 2
model.fit(P[0:A, 0:B], T[0:A])
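
As an aside, sklearn's own PCA class computes the same centering and projection (internally via SVD, so the sign of each component may differ from the eigh result). The manual computation in step 4 could be replaced by a sketch like:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
P2 = pca.fit_transform(X)    # equivalent projection, up to the sign of each component
model.fit(P2[0:A], T[0:A])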

■step6 With only two eigenvectors, the accuracy on the test data is 44.7%, which is a very poor number.

from sklearn import metrics
predicted = model.predict(P[A:, 0:B])
expected = T[A:]
print('The accuracy is : ', metrics.accuracy_score(expected, predicted) * 100, '%')

■step7 Here, the classification report and confusion matrix are displayed so that you can check how well each digit is recognized.

import matplotlib.pyplot as plt
import seaborn as sns
print('          === Classification Report ===')
print(metrics.classification_report(expected, predicted))

# Use a name other than `cm`, which is matplotlib's colormap module above
conf_mat = metrics.confusion_matrix(expected, predicted)
plt.figure(figsize=(9, 6))
sns.heatmap(conf_mat, linewidths=.9, annot=True, fmt='g')
plt.suptitle('MNIST Confusion Matrix (Gaussian Naive Bayes)')
plt.show()

With two eigenvectors, you can see that the digit "1" does reasonably well at 83%, while "2" and "5" are hardly ever recognized correctly.

report.png

cm.png

Why? Because with only two eigenvectors, the feature distributions of some digits overlap, making it hard to determine which digit an image shows.

Let's look at an easy-to-understand example. In the matrix above, the digit 4 is misrecognized as the digit 1 zero times, but it is misrecognized as the digit 9 374 times. Below is a three-dimensional plot of the principal-component scores for the digits 1 and 4. For 1 and 4, you can see that the two sets of points are neatly separated. But what about 4 and 9? They almost completely overlap.

■ Digits 1 and 4: image.png

■ Digits 4 and 9: image.png
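
The plotting code for these figures is not shown above. A minimal sketch that produces this kind of 3-D scatter of the first three principal-component scores might look like this (the helper name plot_pair is mine):

from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, enables the 3-D projection
import matplotlib.pyplot as plt

def plot_pair(d1, d2):
    # Scatter the first three principal-component scores for two digits
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    for digit, color in ((d1, 'blue'), (d2, 'red')):
        mask = (T.ravel() == digit)
        ax.scatter(P[mask, 0], P[mask, 1], P[mask, 2], s=2, c=color, label=str(digit))
    ax.legend()
    plt.show()

plot_pair(1, 4)  # clearly separated clusters
plot_pair(4, 9)  # heavily overlapping clusters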

Therefore, increasing the number of eigenvectors used will improve accuracy. In this environment, using around 70 eigenvectors seems to maximize the accuracy (upper 80% range).

from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# Fit the model again with more principal components
# A: the number of training samples
# B: the number of dimensions (principal components) to use
A = 60000
B = 70  # <- gradually increase
model.fit(P[0:A, 0:B], T[0:A])

cm70.png
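
Rather than editing B by hand, a small loop (not in the original post) can sweep the number of components and print the test accuracy for each:

# Try several dimensionalities and report the test accuracy for each
for B in (2, 10, 30, 50, 70, 100, 200):
    m = GaussianNB().fit(P[0:A, 0:B], T[0:A].ravel())
    acc = metrics.accuracy_score(T[A:].ravel(), m.predict(P[A:, 0:B]))
    print(B, 'components:', round(acc * 100, 2), '%')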

■ Comparison of 2 vs. 70 eigenvectors The following shows the difference when the digit 0 is reconstructed using 2 and then 70 eigenvectors. Increasing the number to 70 makes the outline much clearer.

2.png

70.png

Xrec2 = np.dot(P[:, 0:2], V[0:2, :]) + μ     # reconstruction using 2 components
Xrec70 = np.dot(P[:, 0:70], V[0:70, :]) + μ  # reconstruction using 70 components
plt.imshow(Xrec2[1].reshape(28, 28), interpolation='None', cmap=cm.gray)
plt.show()
plt.imshow(Xrec70[1].reshape(28, 28), interpolation='None', cmap=cm.gray)
plt.show()

■ Summary ・ I was able to obtain accuracy in the upper 80% range using this machine learning method. ・ This time I used sklearn's naive Bayes classifier, but next time I would like to implement the classifier from scratch in Python and aim for the 90% range.
