[Python] Principal component analysis: analyzing handwritten digits with PCA, Part 2

This continues the PCA work from yesterday's article. I had planned to explain LDA, but I changed the schedule a little, so today is PCA only.

### In chaos... ###

Below are graphs of 43,000 handwritten digit samples dropped into two dimensions by PCA.

PCA_ALL2-compressor.png PCA_ALL_reps-compressor.png

This is the result of the rather unreasonable attempt to compress 784-dimensional handwritten digits into just two dimensions. "1" and "0" separate cleanly, but "2", "3", "5", "6", and "8" overlap as one group, and "4", "7", and "9" overlap as another. The upper graph in particular is chaos...

However, if we increase the number of principal components from 2 to **30**, the cumulative contribution ratio exceeds 70% at **0.731**, so I have a feeling the results will be surprisingly good.

Here is the Python code that computes the contribution ratios. It's easy with the library ^^

import numpy as np
from sklearn import decomposition as decomp

n_comp = 30

pca = decomp.PCA(n_components=n_comp)
pca.fit(dataset.getData())        # apply PCA to all 43,000 handwritten digits
E = pca.explained_variance_ratio_
print("explained", E)             # contribution ratio of each principal component
print("cumsum E", np.cumsum(E))   # cumulative contribution ratio

### Visualization of principal components ###

Now, let's test how well these principal components and the dimension-reduced data can express the originals.

In this example the data is compressed from 784 dimensions down to 30, so there are 30 principal component vectors:

{\bf a_1} = \{ a_{1,1},a_{1,2},a_{1,3},...,a_{1,784} \}    \\
...                             \\
{\bf a_{30}} = \{ a_{30,1},a_{30,2},a_{30,3},...,a_{30,784} \}

With these, the digit data "0" can be expressed as

{\bf x_0} = \{ x_{0,1},x_{0,2},x_{0,3},....x_{0,30} \}

that is, by just 30 coefficients. The point of principal component analysis is that data which could not be expressed without a large 784-component vector can now be expressed with 30 components. So, to turn this back into image data, we compute

ImageData_0 = x_{0,1}{\bf a_1} + x_{0,2}{\bf a_2} + ... + x_{0,30}{\bf a_{30}}

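In numpy this sum is just a dot product between the coefficient vector and the matrix of components. A minimal sketch, where `x_0` is an assumed 30-dimensional coefficient vector (e.g. one row of `pca.transform`'s output); note that `pca.transform` centers the data first, so a strictly faithful inverse would also add back `pca.mean_`:

# x_0: the 30 coefficients for one digit "0" (assumed, e.g. from pca.transform)
# pca.components_: the 30 principal component vectors a_1 ... a_30 as rows
image_0 = np.dot(x_0, pca.components_)   # = x_{0,1}*a_1 + ... + x_{0,30}*a_30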

First, let's check graphically what each principal component ${\bf a_i}$ looks like as an image. From the top left: ${\bf a_1}, {\bf a_2}, \dots$

PCA-comps-compressor.png

Wow, these turned out to be rather unsettling images, beyond mere chaos... Is this really okay?... I'm worried...

Here is the Python code that plots these principal components.

import matplotlib.pyplot as plt

n_comp = 30

fig = plt.figure(figsize=(10, 12))
for i in range(n_comp):
    # draw component i as an image, with its contribution ratio in the title
    plot_digits(pca.components_[i], size, 6, 5, i+1, "comp:%d exp:%.3f" % (i+1, E[i]), fontsize=9)

plt.savefig("PCA components.png")
plt.show()
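`plot_digits` and `size` are helpers carried over from Part 1 and are not shown in this article. A minimal version consistent with how they are called here might look like the following, assuming `size` is the image side length (28 for 28x28 digits):

def plot_digits(data, size, rows, cols, idx, title, fontsize=9):
    ax = plt.subplot(rows, cols, idx)    # panel position in the rows x cols grid
    ax.imshow(np.array(data).reshape(size, size), cmap="gray")
    ax.set_title(title, fontsize=fontsize)
    ax.axis("off")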

### Displaying the digit data reduced to 30 dimensions ###

By the way, each digit's data is now one of the following 30-dimensional vectors.

0 [-1701.4516848    360.5515562    501.80559391   335.42365557  -442.37893255
   738.40404869   653.87543763  -176.60067741     7.52017489    67.8462729
   -34.2218036    -46.55184177   -70.43577469  -342.69209695   377.83995173
    -5.66582709   317.76574823   -87.61261823   -94.53116795   175.02827
  -213.08659782   272.41196629     7.16761158   -22.635149     -34.60858894
  -264.48697639   -76.62192789    14.02612973   -80.42733958    87.6849867 ] 

1 [ 661.59577975  699.31132821 -183.28203965  120.61139445  -81.08181052
  489.46188551 -683.47083797   85.55938661 -348.5480522   202.97854522
  364.55994931  -21.26575592  404.44144851  -97.05254548   61.83993555
  -86.78002717   17.65814358 -285.48469649   18.82730277 -207.64273128
   44.24360034 -221.89436971   57.22745918 -148.67496175   14.34358893
   41.55603106 -333.7236588   208.97888078   59.81363057  -84.55446472] 

2 [   2.61889858  667.83425383  623.25708606 -240.73842216  807.87987427
  448.08932462  809.91470435 -532.39654183 -541.55909038  172.17476512
   -9.56195501  282.15421246  219.11044719 -220.93747327   43.0973319
  146.33386437  181.93836014  116.40958486   13.92428748 -105.6101748
  185.89765605 -291.55100581   87.49435262   84.58855469  145.02361174
   51.72930638   47.85132163  261.13345514  -24.44843863   50.79510831] 

3 [ 114.38181469   20.72714258  504.58355599  -89.64933421 -253.97294532
  325.980776   -360.69326214   66.35769716  -14.68477165 -130.43479691
 -447.40395968  111.99175081  -31.50682548  183.41780399 -519.83792854
 -256.85478577 -113.73387925 -342.03579127 -252.46793099   42.67143142
 -127.42356394  186.64626798  181.90229759  219.77068914 -163.18068948
  135.98266763  131.31762106  264.38488399  133.3078287     7.35507795] 

4 [ 165.75560243 -300.18276053   64.14548517  759.70626076 -425.8443787
  157.39033697 -304.0991401   276.40898204   45.86721541 -295.47758088
    2.74648031  256.88429711  -87.73418977 -175.36126677   40.05170784
  -87.53632407   54.27888133 -199.84899771  -11.82620089 -298.09170974
 -232.16000555   89.85484106  292.73288896  125.82278044  -68.7010304
  193.42367936 -184.23850425   82.89710955  214.44949617 -191.17837477] 

5 [-350.22936554 -141.01297399  389.03065738 -619.26138386  288.79058105
 -500.15719527 -538.72021671 -205.96174636  365.50575542  -60.49472136
   44.4873806  -135.66792908 -112.30051758  592.00954779  211.90849699
 -222.04781047   76.68101573 -173.22893185  -74.82330789 -328.13687912
   54.20947384   29.24886881  -54.30828897  109.31639119 -148.5643377
  231.27705194  -56.10174144  104.02362596   -5.79036367  127.80551682] 

6 [-187.86580218   90.04418067 -744.54442254  350.31041481 -332.06715871
 -180.95934671 -162.1086696    16.39830485 -374.48172442   83.73143967
  130.89870535   80.7921533    -8.58842498   84.6122807  -146.77018343
 -138.92568721  158.65298533  103.19544849 -212.53071491 -278.06266361
  176.32032658  318.61200636  -25.04615495 -331.00041428  -68.16511766
   -8.6657172   131.68031183  163.86737242   80.71525      17.82871763] 

7 [ 672.32316444 -464.80397448  313.66005881 -136.13073047 -325.54440893
  352.67672269  333.35838571  149.05717471  110.00701405  233.97083611
  202.69128282 -211.99648309 -121.59390141 -235.10689307 -183.46670371
 -262.11364747  164.16735123 -102.47910648  257.00098676 -242.83922531
 -205.03185461   85.07389889  159.82922651 -153.57362576   15.46211732
 -282.58840157   38.47973654   40.80311292  -28.29748845  -68.67282582] 

8 [ -2.63682307e+02  -3.92782426e+02  -4.82817090e+02  -1.08732309e+03
   3.16636550e+02   9.40608014e+01   9.62990473e+00  -3.44484518e+02
  -3.42655267e+02   2.44390233e+02  -3.28347935e+02   1.00384644e+00
   4.64562686e+02  -3.92400283e+02  -1.99550884e+01   2.25400966e+02
  -3.93818241e+02  -1.72862773e+02   6.53585372e+01  -2.10524955e+02
   3.71808324e+02   7.97760442e+01   1.02030888e+02  -2.09960791e+02
  -1.65077930e+02   3.51481675e+02   1.99475148e+02  -3.71063678e+02
   6.48734732e+01  -1.68703356e+02] 

9 [ 306.52927212 -351.81955452 -469.10137206 -647.33110927  112.86683552
  215.02556985  -54.15071035 -363.34083069 -211.26631969 -706.20430501
    8.02149574  249.34094422 -166.86167523  211.93010084 -226.5338007
   85.19449155  185.16142027  148.70012599 -141.61039606  200.31039523
  116.4385829   282.81894609   40.49368393   49.44160736 -236.56002282
   42.07184296 -210.24357152 -226.30974581  166.15458766  175.76515424] 
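A dump like the one above could be produced with something along these lines (a sketch using the author's `dataset.getByLabel` helper, not the original printing code):

for i in range(10):
    # take one sample per digit and print its 30 coefficients
    print(i, pca.transform(dataset.getByLabel(i, num=1))[0])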

Now the data is restored by multiplying each element of these vectors by the corresponding principal component vector and summing up. Will it really be okay, with those principal components we just saw...

Here is the Python code that performs the calculation.

# Get the dimension-reduced data, one sample per digit
transformed = [pca.transform(dataset.getByLabel(i, num=1)) for i in range(10)]
label_data_10dim = []
for l in range(10):
    vec = []
    for i, t in enumerate(transformed[l][0]):
        vec.append(t * np.array(pca.components_[i]))    # weight principal component i by its coefficient
    S = np.sum(vec, axis=0)       # sum the weighted components element-wise
    label_data_10dim.append(S)

# Display the reconstructed images
fig = plt.figure(figsize=(10, 5))
for i in range(10):
    plot_digits(label_data_10dim[i], size, 2, 5, i+1, "label:%d" % i, fontsize=9)

plt.show()
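Incidentally, scikit-learn can do this reconstruction in one call. `pca.inverse_transform` computes the same weighted sum and also adds back the mean that `transform` subtracted, so the following should give an equivalent (strictly, slightly more faithful) result:

# equivalent to np.dot(transformed[l][0], pca.components_) + pca.mean_
restored = [pca.inverse_transform(transformed[l][0]) for l in range(10)]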

So, here is the result.

PCA-result-compressor.png

**It expresses far more than I imagined!** You can read "0", "1", "2", "5", "7", "8", and "9" without any problem. "3" is faint, but if you look closely it is properly there! "4" and "6" are somewhat rough results. Still, I think the digit images can be restored from those mysterious principal components.
