Coursera's Machine Learning course (Stanford, Dr. Andrew Ng) is a classic first step in learning machine learning. This series reimplements the course's Matlab/Octave programming exercises in Python. This time we cover Principal Component Analysis (PCA), the second half of ex7 on unsupervised learning.
Import various libraries.
import numpy as np
import scipy.io as scio
import matplotlib.pyplot as plt
from sklearn import decomposition
Load the Matlab .mat-format data with scipy.io.loadmat(). The data consists of 5000 face images, each 32x32 pixels in 256-level grayscale, stored as a 5000x1024 2D matrix with one flattened image per row.
Let's display this as it is (only the first 100 images).
data = scio.loadmat('ex7faces.mat')
X = data['X']  # X is a 5000x1024 2D matrix
fig = plt.figure()
plt.subplots_adjust(wspace=0.05, hspace=0.05)  # tighten the spacing between tiles
for i in range(100):
    ax = fig.add_subplot(10, 10, i + 1)
    ax.axis('off')
    # the pixels are stored in column-major (Matlab) order, hence the transpose
    ax.imshow(X[i].reshape(32, 32).T, cmap=plt.get_cmap('gray'))
plt.show()
The output is a 10x10 grid of the first 100 face images.
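Incidentally, since this same display pattern comes up twice more below, it could be factored into a small helper. A sketch (show_grid is a name introduced here, not from the original exercise):
def show_grid(images, rows, cols):
    # display the first rows*cols images, each a flattened 32x32 array, as a grid
    fig = plt.figure()
    plt.subplots_adjust(wspace=0.05, hspace=0.05)
    for i in range(rows * cols):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.axis('off')
        ax.imshow(images[i].reshape(32, 32).T, cmap=plt.get_cmap('gray'))
    plt.show()
The blocks below are kept spelled out to match the original, but show_grid(X, 10, 10) reproduces the figure above.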
Applying principal component analysis to the original image data, 32x32 pixels = 1024 dimensions per image, reduces it to 100 dimensions. PCA is a one-liner with the sklearn.decomposition.PCA class; the n_components parameter specifies how many principal components to keep.
pca = decomposition.PCA(n_components=100)  # keep the top 100 principal components
pca.fit(X)  # fit on all 5000 images; sklearn centers the data internally
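As an aside, the fitted object also exposes explained_variance_ratio_, so you can check how much of the data's variance the 100 retained components capture (a quick sanity check, not part of the original exercise):
# fraction of the total variance captured by the 100 components
print(pca.explained_variance_ratio_.sum())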
The result of the principal component analysis is stored in pca.components_, a 100x1024 2D matrix with one principal component vector per row. Since each row is itself a 1024-dimensional vector, it can be displayed as a 32x32 image just like the input data. Let's display only the first 36 principal components.
fig = plt.figure()
plt.subplots_adjust(wspace=0.05, hspace=0.05)
for i in range(36):
    ax = fig.add_subplot(6, 6, i + 1)
    ax.axis('off')
    ax.imshow(pca.components_[i].reshape(32, 32).T, cmap=plt.get_cmap('gray'))
plt.show()
The result is a 6x6 grid showing the first 36 principal components rendered as 32x32 images.
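For reference, the original Octave exercise computes these principal components itself via a singular value decomposition. The numpy equivalent takes only a few lines (a sketch for comparison; up to sign flips and solver tolerance, the rows of Vt correspond to pca.components_):
Xc = X - X.mean(axis=0)  # center the data, as sklearn's PCA does internally
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# each row of Vt is a principal component; Vt[:100] corresponds to pca.components_
print(Vt[:100].shape)  # (100, 1024)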
Principal component analysis has reduced the image information, originally a 1024-dimensional vector per image, to 100 dimensions. The dimensionally reduced dataset is obtained with pca.transform(X) (a 5000x100 2D matrix). Multiplying this by the principal component matrix, and adding back the mean that transform() subtracted (stored in pca.mean_), restores a 5000x1024 2D matrix; pca.inverse_transform() does the same thing in one call. The restored data is the original data compressed into 100 principal components and then reconstructed so it can be displayed again. Let's display the first 100 images of the reconstruction.
Xreduce = pca.transform(X)  # dimensionality reduction; the result is a 5000x100 matrix
# reconstruction; the result is a 5000x1024 matrix
# (equivalently: Xrecon = pca.inverse_transform(Xreduce))
Xrecon = np.dot(Xreduce, pca.components_) + pca.mean_
fig = plt.figure()
plt.subplots_adjust(wspace=0.05, hspace=0.05)
for i in range(100):
    ax = fig.add_subplot(10, 10, i + 1)
    ax.axis('off')
    ax.imshow(Xrecon[i].reshape(32, 32).T, cmap=plt.get_cmap('gray'))
plt.show()
The result is a 10x10 grid of reconstructed faces. Compared with the original images above, you can see that the broad features are recovered while the fine details have been lost.
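To put a rough number on that loss, you can compare X with Xrecon directly (a minimal check using only the arrays defined above):
# mean squared reconstruction error per pixel, over all 5000 images
print(np.mean((X - Xrecon) ** 2))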
This time, too, the code is simple.