I tried to follow the derivation of principal component analysis, but I didn't really understand how to use it, so I decided to deepen my understanding by implementing Example 4 of Chapter 9 of "Introduction to Multivariate Analysis" (published by Science) from scratch. This article is that note to self.
Environment:

- VSCode (well, any editor is fine)
- Python 3.7.0
- Libraries
  - NumPy (matrix handling)
  - Matplotlib (scatter plots)
- Introduction to Multivariate Analysis Method (Science); in the following, this book is referred to as "the textbook"
A quick look at the plan for the from-scratch implementation:

1. Read in the data
2. Standardize the data
3. Compute the correlation coefficient matrix
4. Compute the eigenvalues and eigenvectors
5. Run principal component analysis and compress to two dimensions
6. Compute the factor loadings as well
7. Draw scatter plots of the factor loadings and the principal component scores
The data from the example, typed in by hand.
ex9_4.txt

```
86 79 67 68
71 75 78 84
42 43 39 44
62 58 98 95
96 97 61 63
39 33 45 50
50 53 64 72
78 66 52 47
51 44 76 72
89 92 93 91
```
Here is the whole code.
scratch.py
```python
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

# Enter sample size (number of rows)
n = int(input())
# Enter data, one row of scores per line
a = [input().split() for l in range(n)]
data = np.array(a, dtype=float)
line = int(data.shape[0])
row = int(data.shape[1])
print(line, row)
means = np.empty(row)
sds = np.empty(row)

# Calculate the mean of each column
# (for ... else: the else block runs once the inner loop has finished)
sum = 0
for i in range(row):
    for j in range(line):
        sum += data[j][i]
    else:
        means[i] = sum / line
        sum = 0
print("average:")
print(means)

# Calculate the standard deviation of each column (n - 1 divisor, as in the textbook)
for i in range(row):
    for j in range(line):
        sum += (data[j][i] - means[i]) ** 2
    else:
        sds[i] = (sum / (line - 1)) ** 0.5
        sum = 0
print("standard deviation:")
print(sds)

# Standardization
standardized_data = np.zeros((line, row))
for i in range(row):
    for j in range(line):
        standardized_data[j][i] = (data[j][i] - means[i]) / sds[i]
print(standardized_data)

# Calculate the correlation coefficient matrix,
# using its symmetry to skip entries that are already filled in
cc_matrix = np.zeros((row, row))
for i in range(row):
    for j in range(row):
        if cc_matrix[i][j] == 0:
            for k in range(line):
                sum += standardized_data[k][i] * standardized_data[k][j]
            else:
                cc_matrix[i][j] = sum / (line - 1)
                cc_matrix[j][i] = cc_matrix[i][j]
                sum = 0

# Eigenvalue problem
w, v = LA.eig(cc_matrix)
# eig returns the eigenvalues in no guaranteed order;
# sort them (and their eigenvectors) into descending order
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]
# print(w)
# print(v)

# Compress to 2D by principal component analysis
main_component = np.zeros((line, 2))
z_1, z_2 = 0, 0
for i in range(line):
    for j in range(row):
        z_1 += standardized_data[i][j] * v[j][0]
        z_2 += standardized_data[i][j] * v[j][1]
    else:
        main_component[i][0] = z_1
        main_component[i][1] = z_2
        z_1, z_2 = 0, 0
print("Main component:")
print(main_component)

# Factor loadings
factor_load = np.zeros((2, row))
for i in range(2):
    for j in range(row):
        factor_load[i][j] = (w[i] ** 0.5) * v[j][i]
print("Factor loading:")
print(factor_load)

# Scatter plot of factor loadings
for label in range(row):
    plt.scatter(factor_load[0][label], factor_load[1][label],
                marker="$" + str(label + 1) + "$")
plt.title('factor_load')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()

# Scatter plot of principal component scores
for label in range(line):
    plt.scatter(main_component[label][0], main_component[label][1],
                marker="$" + str(label + 1) + "$")
plt.title('principal component')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()
```
Each process of the code is explained below.
First, the number of rows of the matrix, that is, the sample size of the original data, is read from the keyboard with input(). Next, the matrix data itself is typed in and handed over to NumPy. There is probably a better way to do this input step, but compared with the principal component analysis itself it seemed like a minor matter, and I didn't want to spend too much time on it, so I shelved it and moved on. The number of rows and columns of the created matrix is then extracted; these two values are used constantly below. The mean vector and the standard deviation vector are also initialized at this stage.
```python
# Enter sample size (number of rows)
n = int(input())
# Enter data, one row of scores per line
a = [input().split() for l in range(n)]
data = np.array(a, dtype=float)
line = int(data.shape[0])
row = int(data.shape[1])
print(line, row)
means = np.empty(row)
sds = np.empty(row)
```
From the data matrix, the mean and standard deviation of each column are computed and used to build a standardized version of the original data (standardized_data). From here on it is basically a storm of for loops, and the subscripts took serious thought. Also note that the standard deviation must divide by n - 1 rather than n, or the values will not match the textbook.
```python
# Calculate the mean of each column
sum = 0
for i in range(row):
    for j in range(line):
        sum += data[j][i]
    else:
        means[i] = sum / line
        sum = 0
print("average:")
print(means)

# Calculate the standard deviation of each column
for i in range(row):
    for j in range(line):
        sum += (data[j][i] - means[i]) ** 2
    else:
        sds[i] = (sum / (line - 1)) ** 0.5
        sum = 0
print("standard deviation:")
print(sds)

# Standardization
standardized_data = np.zeros((line, row))
for i in range(row):
    for j in range(line):
        standardized_data[j][i] = (data[j][i] - means[i]) / sds[i]
print(standardized_data)
```
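As a cross-check (my addition, not part of the textbook's procedure), the hand-rolled mean, standard deviation, and standardization can be reproduced with NumPy's vectorized operations; ddof=1 selects the same n - 1 divisor the textbook uses.

```python
import numpy as np

# The ex9_4.txt data again, so this check runs on its own
data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

means = data.mean(axis=0)           # column means
sds = data.std(axis=0, ddof=1)      # ddof=1 -> divide by n - 1, as in the textbook
standardized_data = (data - means) / sds   # broadcasting standardizes every column at once

# every column should now have mean ~0 and standard deviation 1
print(standardized_data.mean(axis=0).round(10))
print(standardized_data.std(axis=0, ddof=1))
```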
The correlation coefficient matrix is computed from the textbook's equations (9.2) and (9.3), after confirming their derivation on paper, of course. The symmetry of the correlation coefficient matrix is exploited to cut down the amount of computation.
```python
# Calculate the correlation coefficient matrix,
# using its symmetry to skip entries that are already filled in
cc_matrix = np.zeros((row, row))
for i in range(row):
    for j in range(row):
        if cc_matrix[i][j] == 0:
            for k in range(line):
                sum += standardized_data[k][i] * standardized_data[k][j]
            else:
                cc_matrix[i][j] = sum / (line - 1)
                cc_matrix[j][i] = cc_matrix[i][j]
                sum = 0
```
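The loop above can be sanity-checked (again my addition) against np.corrcoef, which computes the same matrix directly; rowvar=False tells it that the variables are the columns, not the rows.

```python
import numpy as np

data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

# np.corrcoef uses the same (n - 1)-divisor definition,
# so this should match cc_matrix from the loops above
cc_matrix = np.corrcoef(data, rowvar=False)
print(cc_matrix.round(3))
```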
Next come the eigenvalues and eigenvectors... but I couldn't come up with a way to implement this part from scratch, so I decided to throw it at the linalg.eig function and move on, putting the eigenvalues in w and the eigenvectors in v. One caveat: eig does not guarantee any particular ordering of the eigenvalues, so they are sorted into descending order here before the first two components are used. If you know a good way to do this part from scratch, please let me know.

```python
# Eigenvalue problem
w, v = LA.eig(cc_matrix)
# eig returns the eigenvalues in no guaranteed order;
# sort them (and their eigenvectors) into descending order
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]
# print(w)
# print(v)
```
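For reference, one standard way to compute the leading eigenpairs from scratch is the power method with deflation. This sketch is my own (the names power_method and top_eigenpairs are made up, not from the textbook) and assumes a symmetric matrix with non-negative eigenvalues, which a correlation matrix satisfies; a fuller from-scratch route would be the Jacobi rotation method.

```python
import numpy as np

def power_method(A, iters=1000, seed=0):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix A."""
    # random start vector so it is not accidentally orthogonal to the answer
    x = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x = x / np.linalg.norm(x)     # renormalize each step
    return x @ A @ x, x               # Rayleigh quotient, eigenvector

def top_eigenpairs(A, k):
    """Top-k eigenpairs of symmetric A via the power method plus deflation."""
    A = A.astype(float).copy()
    vals, vecs = [], []
    for _ in range(k):
        lam, x = power_method(A)
        vals.append(lam)
        vecs.append(x)
        A = A - lam * np.outer(x, x)  # deflate out the component just found
    return np.array(vals), np.array(vecs).T

# quick check on a small symmetric matrix with known eigenvalues 3 and 1
M = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, vecs = top_eigenpairs(M, 2)
print(vals)   # approximately [3. 1.]
```

For the 4x4 correlation matrix here, top_eigenpairs(cc_matrix, 2) would give the two leading eigenpairs needed below, up to the usual sign ambiguity of eigenvectors.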
Now for the main subject. Since this time I want to follow the flow all the way to the scatter plots, I decided to compress the data down to two dimensions, that is, two principal components. The for loops are tough going...
```python
# Compress to 2D by principal component analysis
main_component = np.zeros((line, 2))
z_1, z_2 = 0, 0
for i in range(line):
    for j in range(row):
        z_1 += standardized_data[i][j] * v[j][0]
        z_2 += standardized_data[i][j] * v[j][1]
    else:
        main_component[i][0] = z_1
        main_component[i][1] = z_2
        z_1, z_2 = 0, 0
print("Main component:")
print(main_component)
```
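Incidentally, the double loop above is a matrix product in disguise: each row of scores is the standardized data row multiplied by the first two eigenvector columns. A vectorized sketch (my addition; it recomputes everything so it can stand alone):

```python
import numpy as np

data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

standardized_data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
w, v = np.linalg.eig(np.corrcoef(data, rowvar=False))
order = np.argsort(w)[::-1]          # eig gives no order guarantee: sort descending
w, v = w[order], v[:, order]

# one matrix product replaces the whole double loop:
# row i of main_component is (z_1, z_2) for sample i
main_component = standardized_data @ v[:, :2]
print(main_component.shape)          # (10, 2)
```

A nice property to check: the sample variance of each score column equals the corresponding eigenvalue.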
Next come the factor loadings.
```python
# Factor loadings
factor_load = np.zeros((2, row))
for i in range(2):
    for j in range(row):
        factor_load[i][j] = (w[i] ** 0.5) * v[j][i]
print("Factor loading:")
print(factor_load)
```
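The factor loadings can also be produced in one broadcasted expression, since the loading of variable j on component i is sqrt(w_i) * v[j][i]. Another self-contained sketch of mine, not the textbook's code:

```python
import numpy as np

data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

R = np.corrcoef(data, rowvar=False)
w, v = np.linalg.eig(R)
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]

# sqrt(w_i) scales eigenvector column i; broadcasting builds the 2 x 4 table at once
factor_load = np.sqrt(w[:2])[:, None] * v[:, :2].T
print(factor_load.round(3))
```

Since a factor loading is the correlation between a component and a variable, every entry should lie in [-1, 1], and the squared loadings of each component should sum to its eigenvalue.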
Finally, the principal component scores and factor loadings I worked so hard for are drawn on scatter plots, with the first principal component on the x-axis and the second principal component on the y-axis.
```python
# Scatter plot of factor loadings
for label in range(row):
    plt.scatter(factor_load[0][label], factor_load[1][label],
                marker="$" + str(label + 1) + "$")
plt.title('factor_load')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()

# Scatter plot of principal component scores
for label in range(line):
    plt.scatter(main_component[label][0], main_component[label][1],
                marker="$" + str(label + 1) + "$")
plt.title('principal component')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()
```
The plots came out almost the same as the textbook's. Not bad, I think.
This is my first Qiita article, so I'm sure there are many parts that are hard to read. And since I'm still inexperienced, the code above probably invites plenty of critique as well. All comments and corrections are welcome; in fact, getting that feedback is the main reason I posted this to Qiita.
Yasushi Nagata and Masahiko Munechika, Introduction to Multivariate Analysis Method, Science, April 10, 2001