I tried to follow the derivation of principal component analysis, but I didn't really understand how to use it, so I decided to deepen my understanding by implementing Example 4 of Chapter 9 of "Introduction to Multivariate Analysis" (published by Science) from scratch. This article is that note to self.
Environment:

- VSCode (well, any editor is fine)
- Python 3.7.0
- Libraries
  - NumPy (matrix handling)
  - Matplotlib (scatter plots)
- Introduction to Multivariate Analysis Method (Science); in the following, this book is referred to as "the textbook"
A quick look at the plan for the from-scratch implementation:

1. Read in the data
2. Standardize the data
3. Compute the correlation coefficient matrix
4. Compute the eigenvalues and eigenvectors
5. Run principal component analysis and compress to two dimensions
6. Compute the factor loadings as well
7. Draw scatter plots of the factor loadings and the principal component scores
The data from the example, typed in by hand.
ex9_4.txt

```
86 79 67 68
71 75 78 84
42 43 39 44
62 58 98 95
96 97 61 63
39 33 45 50
50 53 64 72
78 66 52 47
51 44 76 72
89 92 93 91
```
Here is the whole code.
scratch.py
```python
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

# Enter sample size (number of rows)
n = int(input())
# Enter data, one row of scores per line
a = [input().split() for l in range(n)]
data = np.array(a, dtype=float)
line = int(data.shape[0])
row = int(data.shape[1])
print(line, row)
means = np.empty(row)
sds = np.empty(row)

# Calculate the mean of each column
# (for ... else: the else block runs once the inner loop has finished)
sum = 0
for i in range(row):
    for j in range(line):
        sum += data[j][i]
    else:
        means[i] = sum / line
        sum = 0
print("average:")
print(means)

# Calculate the standard deviation of each column (n - 1 divisor, as in the textbook)
for i in range(row):
    for j in range(line):
        sum += (data[j][i] - means[i]) ** 2
    else:
        sds[i] = (sum / (line - 1)) ** 0.5
        sum = 0
print("standard deviation:")
print(sds)

# Standardization
standardized_data = np.zeros((line, row))
for i in range(row):
    for j in range(line):
        standardized_data[j][i] = (data[j][i] - means[i]) / sds[i]
print(standardized_data)

# Calculate the correlation coefficient matrix,
# using its symmetry to skip entries that are already filled in
cc_matrix = np.zeros((row, row))
for i in range(row):
    for j in range(row):
        if cc_matrix[i][j] == 0:
            for k in range(line):
                sum += standardized_data[k][i] * standardized_data[k][j]
            else:
                cc_matrix[i][j] = sum / (line - 1)
                cc_matrix[j][i] = cc_matrix[i][j]
                sum = 0

# Eigenvalue problem
w, v = LA.eig(cc_matrix)
# eig returns the eigenvalues in no guaranteed order;
# sort them (and their eigenvectors) into descending order
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]
# print(w)
# print(v)

# Compress to 2D by principal component analysis
main_component = np.zeros((line, 2))
z_1, z_2 = 0, 0
for i in range(line):
    for j in range(row):
        z_1 += standardized_data[i][j] * v[j][0]
        z_2 += standardized_data[i][j] * v[j][1]
    else:
        main_component[i][0] = z_1
        main_component[i][1] = z_2
        z_1, z_2 = 0, 0
print("Main component:")
print(main_component)

# Factor loadings
factor_load = np.zeros((2, row))
for i in range(2):
    for j in range(row):
        factor_load[i][j] = (w[i] ** 0.5) * v[j][i]
print("Factor loading:")
print(factor_load)

# Scatter plot of factor loadings
for label in range(row):
    plt.scatter(factor_load[0][label], factor_load[1][label],
                marker="$" + str(label + 1) + "$")
plt.title('factor_load')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()

# Scatter plot of principal component scores
for label in range(line):
    plt.scatter(main_component[label][0], main_component[label][1],
                marker="$" + str(label + 1) + "$")
plt.title('principal component')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()
```
Each process of the code is explained below.
First, the number of rows of the matrix, that is, the sample size of the original data, is read from the keyboard with input(). Next, the matrix data itself is typed in and handed over to NumPy. There is probably a better way to do this input step, but compared with the principal component analysis itself it seemed like a minor matter, and I didn't want to spend too much time on it, so I shelved it and moved on. The number of rows and columns of the created matrix is then extracted; these two values are used constantly below. The mean vector and the standard deviation vector are also initialized at this stage.
```python
# Enter sample size (number of rows)
n = int(input())
# Enter data, one row of scores per line
a = [input().split() for l in range(n)]
data = np.array(a, dtype=float)
line = int(data.shape[0])
row = int(data.shape[1])
print(line, row)
means = np.empty(row)
sds = np.empty(row)
```
From the data matrix, the mean and standard deviation of each column are computed and used to build a standardized version of the original data (standardized_data). From here on it is basically a storm of for loops, and the subscripts took serious thought. Also note that the standard deviation must divide by n - 1 rather than n, or the values will not match the textbook.
```python
# Calculate the mean of each column
sum = 0
for i in range(row):
    for j in range(line):
        sum += data[j][i]
    else:
        means[i] = sum / line
        sum = 0
print("average:")
print(means)

# Calculate the standard deviation of each column
for i in range(row):
    for j in range(line):
        sum += (data[j][i] - means[i]) ** 2
    else:
        sds[i] = (sum / (line - 1)) ** 0.5
        sum = 0
print("standard deviation:")
print(sds)

# Standardization
standardized_data = np.zeros((line, row))
for i in range(row):
    for j in range(line):
        standardized_data[j][i] = (data[j][i] - means[i]) / sds[i]
print(standardized_data)
```
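As a cross-check (my addition, not part of the textbook's procedure), the hand-rolled mean, standard deviation, and standardization can be reproduced with NumPy's vectorized operations; ddof=1 selects the same n - 1 divisor the textbook uses.

```python
import numpy as np

# The ex9_4.txt data again, so this check runs on its own
data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

means = data.mean(axis=0)           # column means
sds = data.std(axis=0, ddof=1)      # ddof=1 -> divide by n - 1, as in the textbook
standardized_data = (data - means) / sds   # broadcasting standardizes every column at once

# every column should now have mean ~0 and standard deviation 1
print(standardized_data.mean(axis=0).round(10))
print(standardized_data.std(axis=0, ddof=1))
```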
The correlation coefficient matrix is computed from the textbook's equations (9.2) and (9.3), after confirming their derivation on paper, of course. The symmetry of the correlation coefficient matrix is exploited to cut down the amount of computation.
```python
# Calculate the correlation coefficient matrix,
# using its symmetry to skip entries that are already filled in
cc_matrix = np.zeros((row, row))
for i in range(row):
    for j in range(row):
        if cc_matrix[i][j] == 0:
            for k in range(line):
                sum += standardized_data[k][i] * standardized_data[k][j]
            else:
                cc_matrix[i][j] = sum / (line - 1)
                cc_matrix[j][i] = cc_matrix[i][j]
                sum = 0
```
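The loop above can be sanity-checked (again my addition) against np.corrcoef, which computes the same matrix directly; rowvar=False tells it that the variables are the columns, not the rows.

```python
import numpy as np

data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

# np.corrcoef uses the same (n - 1)-divisor definition,
# so this should match cc_matrix from the loops above
cc_matrix = np.corrcoef(data, rowvar=False)
print(cc_matrix.round(3))
```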
Next come the eigenvalues and eigenvectors... but I couldn't come up with a way to implement this part from scratch, so I decided to throw it at the linalg.eig function and move on, putting the eigenvalues in w and the eigenvectors in v. One caveat: eig does not guarantee any particular ordering of the eigenvalues, so they are sorted into descending order here before the first two components are used. If you know a good way to do this part from scratch, please let me know.

```python
# Eigenvalue problem
w, v = LA.eig(cc_matrix)
# eig returns the eigenvalues in no guaranteed order;
# sort them (and their eigenvectors) into descending order
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]
# print(w)
# print(v)
```
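For reference, one standard way to compute the leading eigenpairs from scratch is the power method with deflation. This sketch is my own (the names power_method and top_eigenpairs are made up, not from the textbook) and assumes a symmetric matrix with non-negative eigenvalues, which a correlation matrix satisfies; a fuller from-scratch route would be the Jacobi rotation method.

```python
import numpy as np

def power_method(A, iters=1000, seed=0):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix A."""
    # random start vector so it is not accidentally orthogonal to the answer
    x = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x = x / np.linalg.norm(x)     # renormalize each step
    return x @ A @ x, x               # Rayleigh quotient, eigenvector

def top_eigenpairs(A, k):
    """Top-k eigenpairs of symmetric A via the power method plus deflation."""
    A = A.astype(float).copy()
    vals, vecs = [], []
    for _ in range(k):
        lam, x = power_method(A)
        vals.append(lam)
        vecs.append(x)
        A = A - lam * np.outer(x, x)  # deflate out the component just found
    return np.array(vals), np.array(vecs).T

# quick check on a small symmetric matrix with known eigenvalues 3 and 1
M = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, vecs = top_eigenpairs(M, 2)
print(vals)   # approximately [3. 1.]
```

For the 4x4 correlation matrix here, top_eigenpairs(cc_matrix, 2) would give the two leading eigenpairs needed below, up to the usual sign ambiguity of eigenvectors.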
Now for the main subject. Since this time I want to follow the flow all the way to the scatter plots, I decided to compress the data down to two dimensions, that is, two principal components. The for loops are tough going...
```python
# Compress to 2D by principal component analysis
main_component = np.zeros((line, 2))
z_1, z_2 = 0, 0
for i in range(line):
    for j in range(row):
        z_1 += standardized_data[i][j] * v[j][0]
        z_2 += standardized_data[i][j] * v[j][1]
    else:
        main_component[i][0] = z_1
        main_component[i][1] = z_2
        z_1, z_2 = 0, 0
print("Main component:")
print(main_component)
```
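Incidentally, the double loop above is a matrix product in disguise: each row of scores is the standardized data row multiplied by the first two eigenvector columns. A vectorized sketch (my addition; it recomputes everything so it can stand alone):

```python
import numpy as np

data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

standardized_data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
w, v = np.linalg.eig(np.corrcoef(data, rowvar=False))
order = np.argsort(w)[::-1]          # eig gives no order guarantee: sort descending
w, v = w[order], v[:, order]

# one matrix product replaces the whole double loop:
# row i of main_component is (z_1, z_2) for sample i
main_component = standardized_data @ v[:, :2]
print(main_component.shape)          # (10, 2)
```

A nice property to check: the sample variance of each score column equals the corresponding eigenvalue.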
Next come the factor loadings.
```python
# Factor loadings
factor_load = np.zeros((2, row))
for i in range(2):
    for j in range(row):
        factor_load[i][j] = (w[i] ** 0.5) * v[j][i]
print("Factor loading:")
print(factor_load)
```
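The factor loadings can also be produced in one broadcasted expression, since the loading of variable j on component i is sqrt(w_i) * v[j][i]. Another self-contained sketch of mine, not the textbook's code:

```python
import numpy as np

data = np.array([[86, 79, 67, 68],
                 [71, 75, 78, 84],
                 [42, 43, 39, 44],
                 [62, 58, 98, 95],
                 [96, 97, 61, 63],
                 [39, 33, 45, 50],
                 [50, 53, 64, 72],
                 [78, 66, 52, 47],
                 [51, 44, 76, 72],
                 [89, 92, 93, 91]], dtype=float)

R = np.corrcoef(data, rowvar=False)
w, v = np.linalg.eig(R)
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]

# sqrt(w_i) scales eigenvector column i; broadcasting builds the 2 x 4 table at once
factor_load = np.sqrt(w[:2])[:, None] * v[:, :2].T
print(factor_load.round(3))
```

Since a factor loading is the correlation between a component and a variable, every entry should lie in [-1, 1], and the squared loadings of each component should sum to its eigenvalue.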
Finally, the principal component scores and factor loadings I worked so hard for are drawn on scatter plots, with the first principal component on the x-axis and the second principal component on the y-axis.
```python
# Scatter plot of factor loadings
for label in range(row):
    plt.scatter(factor_load[0][label], factor_load[1][label],
                marker="$" + str(label + 1) + "$")
plt.title('factor_load')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()

# Scatter plot of principal component scores
for label in range(line):
    plt.scatter(main_component[label][0], main_component[label][1],
                marker="$" + str(label + 1) + "$")
plt.title('principal component')
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show()
```
The plots came out almost the same as the textbook's. Not bad, I think.
This is my first Qiita article, so I'm sure there are many parts that are hard to read. And since I'm still inexperienced, the code above probably invites plenty of critique as well. All comments and corrections are welcome; in fact, getting that feedback is the main reason I posted this to Qiita.
Yasushi Nagata and Masahiko Munechika, Introduction to Multivariate Analysis Method, Science, April 10, 2001