[Python] Principal component analysis: Analyzing handwritten digits with PCA. Part 1

Introduction

Analyzing handwritten digits has turned into a series. This time I would like to analyze them using Principal Component Analysis (PCA) and, in the next part, Linear Discriminant Analysis (LDA), which can be thought of as its supervised counterpart. I am using Python's machine learning library scikit-learn.

Past articles on handwritten digit analysis:

- Playing with handwritten numbers in python, Part 1
- [Playing with handwritten numbers in python, Part 2 (identify)](http://qiita.com/kenmatsu4/items/2d21466078917c200033)
- [Machine learning] Writing the k-nearest neighbor method in python by yourself to recognize handwritten digits

Principal Component Analysis (PCA)

First, principal component analysis. Roughly speaking, it is an analysis that takes data with many elements and extracts the so-called principal components. In machine learning it is mainly used to reduce the dimension of the target data. The handwritten digits handled here are 28x28 = 784-pixel images, so each one is a 784-dimensional vector, and we will try to reduce this dimension.
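As a minimal sketch of what this reduction looks like with scikit-learn (the array `X` below is a random placeholder standing in for the flattened 28x28 images; the real data loading appears later in this article):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder: 1000 flattened 28x28 images -> a (1000, 784) array.
# In the real analysis this comes from the CSV file loaded further below.
X = np.random.rand(1000, 784)

pca = PCA(n_components=30)        # keep 30 principal components
X_reduced = pca.fit_transform(X)  # shape becomes (1000, 30)
print(X_reduced.shape)
```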

An intuitive picture of principal component analysis

I will explain using an example of reducing two-dimensional information to one dimension, as shown in the figure below. A commonly used illustration is extracting broad components such as "science ability" and "liberal-arts ability" from grades in individual subjects like Japanese, math, science, and social studies, and that is the picture I want to convey here. Since I want to show it in a figure, I use test data with just two subjects: the horizontal axis is the math grade and the vertical axis is the science grade. Both math and science reflect science ability, so the two should be correlated. The blue straight line is the principal component: science ability. As the arrows show, dropping a perpendicular from each data point onto this line lets us express each point as a single number, its science ability.

PCA-annex-11-compressor.png

This is the graph after reducing the data to one dimension. PCA-annex-8-compressor.png

The length of the perpendicular dropped from each data point onto the principal-component line, shown in the figure below, is the amount of information loss: it is the part that is thrown away when the dimension is reduced. The principal-component line passes through the mean of the data, and its slope is chosen to minimize this information loss. This is a least-squares calculation: the sum of the squared information losses is minimized.

PCA-annex-9-compressor.png
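To make this concrete, here is a small NumPy sketch on invented math/science scores (the numbers are made up purely for illustration). The principal direction is the leading eigenvector of the covariance matrix of the centered data, the one-dimensional scores are the projections onto that direction, and the information loss is the sum of squared perpendicular distances:

```python
import numpy as np

# Invented 2-D data: column 0 = math score, column 1 = science score
scores = np.array([[70, 65], [55, 60], [80, 78], [60, 58], [90, 85],
                   [45, 50], [75, 72], [65, 68]], dtype=float)

centered = scores - scores.mean(axis=0)          # the line passes through the mean
cov = np.cov(centered, rowvar=False)             # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns eigenvalues in ascending order
direction = eigvecs[:, -1]                       # first principal direction ("science ability")

scores_1d = centered @ direction                 # 1-D representation (projections onto the line)
reconstructed = np.outer(scores_1d, direction)   # the projected points, back in 2-D
information_loss = ((centered - reconstructed) ** 2).sum()  # sum of squared perpendicular distances

print(scores_1d)
print("information loss:", information_loss)
```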

How many dimensions should we reduce to?

I mentioned the amount of information loss above; closely tied to it is the concept of the contribution rate. Roughly speaking, it is the proportion of the data that a principal component can explain. In the example of summarizing math and science as science ability:

1 = contribution rate of the first principal component (science ability) + information loss rate (everything else) = 0.89997 + 0.10003

so about 90% of the information could be explained with a single dimension.
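In scikit-learn the contribution rate is exposed as `explained_variance_ratio_`. A minimal sketch, again on invented two-subject scores, so the numbers will not match the 0.89997 above:

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented 2-D "math vs. science" scores, as in the sketch above
scores = np.array([[70, 65], [55, 60], [80, 78], [60, 58], [90, 85],
                   [45, 50], [75, 72], [65, 68]], dtype=float)

pca = PCA(n_components=1)
pca.fit(scores)

contribution = pca.explained_variance_ratio_[0]    # contribution rate of the 1st component
print("contribution rate :", contribution)
print("information loss  :", 1.0 - contribution)   # with 2-D data these two sum to 1
```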

Now let's run principal component analysis on the handwritten digit data in Python. First, instead of analyzing the whole dataset at once, let's split the data by digit and see how the data for each digit varies. The main part of the code is below. (The entire code is here.)


# Load the data (the DigitDataSet helper class is defined in the full code linked above)
import numpy as np
import sklearn.decomposition as decomp

raw_data = np.loadtxt('train_master.csv', delimiter=',', skiprows=1)
dataset = DigitDataSet(raw_data)
data = [None for i in range(10)]
for i in range(10):
    data[i] = dataset.getByLabel(i, 'all')       # collect the images of each digit separately

# Perform principal component analysis
# Compare the cumulative contribution rate for several numbers of dimensions after reduction
comp_items = [5, 10, 20, 30]                     # list of dimensions after reduction
cumsum_explained = np.zeros((10, len(comp_items)))
for i, n_comp in enumerate(comp_items):
    for num in range(10):                        # analyze each digit separately
        pca = decomp.PCA(n_components=n_comp)    # create the PCA object
        pca.fit(data[num])                       # perform principal component analysis
        transformed = pca.transform(data[num])   # project the data onto the reduced space
        E = pca.explained_variance_ratio_        # contribution rate of each component
        cumsum_explained[num, i] = np.cumsum(E)[-1]  # cumulative contribution rate

print("| label |explained n_comp:5|explained n_comp:10|explained n_comp:20|explained n_comp:30|")
print("|:-----:|:-----:|:-----:|:-----:|:-----:|")
for i in range(10):
    print("|%d|%.1f%%|%.1f%%|%.1f%%|%.1f%%|" % (i, cumsum_explained[i, 0] * 100, cumsum_explained[i, 1] * 100,
                                                cumsum_explained[i, 2] * 100, cumsum_explained[i, 3] * 100))

The table below shows the cumulative contribution rate when the number of dimensions after reduction is 5, 10, 20, and 30. For example, with 30 dimensions the data is reduced from the original 784 dimensions to 30, and the value shows what percentage of the variance that 30-dimensional vector explains. It varies from digit to digit, but keeping around 30 dimensions explains roughly 70-80%. In other words, the dimension drops from 784 to 30, only about 4% of the original, yet 70-80% of the information can still be explained, which is quite convenient.

| label | explained n_comp:5 | explained n_comp:10 | explained n_comp:20 | explained n_comp:30 |
|:-----:|:-----:|:-----:|:-----:|:-----:|
| 0 | 48.7% | 62.8% | 75.9% | 82.0% |
| 1 | 66.6% | 76.6% | 84.8% | 88.7% |
| 2 | 36.5% | 51.9% | 67.2% | 75.3% |
| 3 | 39.7% | 53.7% | 68.3% | 75.8% |
| 4 | 39.4% | 56.3% | 70.7% | 77.9% |
| 5 | 42.3% | 55.5% | 69.7% | 77.0% |
| 6 | 44.5% | 59.7% | 74.0% | 80.9% |
| 7 | 45.9% | 61.0% | 74.2% | 80.6% |
| 8 | 36.3% | 49.6% | 65.5% | 74.1% |
| 9 | 43.2% | 58.5% | 73.4% | 80.4% |
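Incidentally, instead of fixing the number of dimensions in advance, scikit-learn also accepts a target cumulative contribution rate as `n_components`. A minimal sketch, reusing `data` from the code above and assuming a reasonably recent scikit-learn:

```python
import sklearn.decomposition as decomp

# Ask PCA for however many components are needed to reach an 80% cumulative
# contribution rate, rather than fixing the number of dimensions in advance.
pca = decomp.PCA(n_components=0.8, svd_solver='full')
pca.fit(data[0])  # e.g. the images of the digit 0 from the code above
print(pca.n_components_, "components explain",
      pca.explained_variance_ratio_.sum() * 100, "% of the variance")
```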

The images below visualize the 10 principal components obtained for each digit after the reduction, arranged from the upper left in descending order of contribution rate. (The "exp:" value is each component's contribution rate.)

PCAForEachDigits_0-compressor.png PCAForEachDigits_1-compressor.png PCAForEachDigits_2-compressor.png PCAForEachDigits_3-compressor.png PCAForEachDigits_4-compressor.png PCAForEachDigits_5-compressor.png PCAForEachDigits_6-compressor.png PCAForEachDigits_7-compressor.png PCAForEachDigits_8-compressor.png PCAForEachDigits_9-compressor.png
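The images above were produced with the full code linked earlier; as a rough sketch, they can be drawn from a fitted PCA object like this (assuming `pca` has been fitted on one digit's data with at least 10 components, as in the loop above):

```python
import matplotlib.pyplot as plt

# Visualize the top 10 principal components of a fitted PCA as 28x28 images.
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for k, ax in enumerate(axes.ravel()):
    ax.imshow(pca.components_[k].reshape(28, 28), cmap='gray')
    ax.set_title("exp: %.3f" % pca.explained_variance_ratio_[k])
    ax.axis('off')
plt.show()
```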

Applying principal component analysis to the entire dataset

Earlier we applied principal component analysis to each digit separately; now we apply it to all 43,000 digit images at once and reduce the dimension all the way down to two. Dropping from 784 dimensions to 2 may seem like too much, but with two dimensions we can draw a graph. The graph below plots all the data in two dimensions, and it is quite cluttered. That is because PCA computes the principal components over all the points together, without considering which digit each point represents. LDA is an analysis that does take each point's digit into account, but I would like to explain that next time. In any case, being able to view the data in a graph by reducing it to two dimensions is a big advantage.

PCA_ALL2-compressor.png **Graph plotting all the points**

PCA_ALL_reps-compressor.png **Graph plotting the mean point of each digit, drawn at a size proportional to its variance**

Here is the main part of the Python code.


# PCA on the entire dataset, reduced to 2 dimensions
import numpy as np
import matplotlib.pyplot as plt
import sklearn.decomposition as decomp

pca = decomp.PCA(n_components=2)
pca.fit(dataset.getData())
transformed = pca.transform(dataset.getData())
colors = [plt.cm.hsv(0.1 * i, 1) for i in range(10)]
plt.figure(figsize=(16, 11))
for i in range(10):
    plt.scatter(0, 0, alpha=1, c=colors[i], label=str(i))  # dummy points used only to build the legend
plt.legend()

for l, d in zip(dataset.getLabel(), transformed):
    plt.scatter(d[0], d[1], c=colors[int(l)], alpha=0.3)   # color each point by its digit label

plt.title("PCA(Principal Component Analysis)")
plt.show()

# Draw a representative point for each digit
transformed = [pca.transform(dataset.getByLabel(label=i, num='all')) for i in range(10)]

ave = [np.average(transformed[i], axis=0) for i in range(10)]  # mean of each digit in the 2-D space
var = [np.var(transformed[i], axis=0) for i in range(10)]      # variance of each digit in the 2-D space
plt.clf()
plt.figure(figsize=(14, 10))
for j in range(10):
    plt.scatter(100, 100, alpha=1, c=colors[j], label=str(j))  # dummy points used only to build the legend
plt.legend()
plt.xlim(-1500, 1500)
plt.ylim(-1500, 1500)

for i, a, v in zip(range(10), ave, var):
    print(i, a[0], a[1])
    plt.scatter(a[0], a[1], c=colors[i], alpha=0.6, s=v.mean() / 4, linewidth=1)  # marker area proportional to the variance
    plt.scatter(a[0], a[1], c="k", s=10)
    plt.text(a[0], a[1], "digit: %d" % i, fontsize=12)

plt.title("PCA Representative Vector for each digit.")
plt.savefig("PCA_RepVec.png")
plt.show()
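Since the fitted two-dimensional `pca` object is reusable, any new 784-dimensional digit image could be projected into the same plane with `transform`. A minimal sketch (picking the first row of the dataset here is only a placeholder, and assumes `getData()` returns the image matrix as above):

```python
# Project one more digit image into the same 2-D plane as the plot above.
sample = dataset.getData()[0]                 # placeholder: any flattened 28x28 image would do
point = pca.transform(sample.reshape(1, -1))  # shape (1, 2): its (x, y) position in the plot
print(point)
```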

Continued in Part 2.
