[Python] PCA scratch in the example of "Introduction to multivariate analysis"

Introduction

I tried to follow the proof of principal component analysis, but I didn't understand how to use it, so I decided to deepen my understanding by scratching Chapter 9 Example 4 of the book "Introduction to Multivariate Analysis" published by Science. This article is meant to be that reminder.

What to use

・ VSCode (... well, any editor is fine) · Python version 3.7.0 ·Library -Numpy (for matrix management) -Matplotlib (used in scatter plots) ・ Introduction to multivariate analysis method (Science) ← In the following, this book will be referred to as a textbook.

flow

Easily check your scratch policy. ① Data entry ② Data standardization ③ Calculate the correlation coefficient matrix ④ Calculate eigenvalues and eigenvectors ⑤ Perform principal component analysis and compress to two dimensions ⑥ In addition, the factor load is also given. ⑦ Draw a scatter plot for each factor loading and main component score

Matrix of data

The data in the example. I entered it by hand.

ex9_4.txt


86 79 67 68
71 75 78 84
42 43 39 44
62 58 98 95
96 97 61 63
39 33 45 50
50 53 64 72
78 66 52 47
51 44 76 72
89 92 93 91

code

I will put the code.

scratch.py


import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

#Enter sample size(line)
n = int(input())

#Enter data
a = [input().split() for l in range(n)]
data = np.array(a, dtype=float)
line = int(data.shape[0])
row = int(data.shape[1])
print(line, row)

means = np.empty(row)
sds = np.empty(row)

#Calculate average
sum = 0
for i in range(row):
    for j in range(line):
        sum +=  data[j][i]
    else:
        means[i] = sum / line
        sum = 0
print("average:")
print(means)

#Calculate standard deviation
for i in range(row):
    for j in range(line):
        sum += (data[j][i]-means[i]) ** 2
    else:
        sds[i] = (sum / (line - 1)) ** 0.5
        sum = 0
print("standard deviation:")
print(sds)

#Standardization
standardized_data = np.zeros((line,row))
for i in range(row):
    for j in range(line):
        standardized_data[j][i] = (data[j][i] - means[i]) / sds[i]  
print(standardized_data)

#Calculate the correlation coefficient matrix
cc_matrix = np.zeros((row, row))
for i in range(row):
    for j in range(row):
        if(cc_matrix[i][j] == 0):
            for k in range(line):
                sum += standardized_data[k][i] * standardized_data[k][j]
            else:
                cc_matrix[i][j] = sum / (line - 1)
                cc_matrix[j][i] = cc_matrix[i][j]
                sum = 0

#Eigenvalue problem
w,v = LA.eig(cc_matrix)
#print(w)
#print(v)

#Compressed to 2D by principal component analysis
main_component = np.zeros((line, 2))
z_1, z_2 = 0, 0
for i in range(line):
    for j in range(row):
        z_1 += standardized_data[i][j] * v[j][0]
        z_2 += standardized_data[i][j] * v[j][1]
    else:
        main_component[i][0] = z_1
        main_component[i][1] = z_2
        z_1, z_2 = 0, 0
print("Main component:")
print(main_component)

#Factor loading
factor_load = np.zeros((2, row))
for i in range(2):
    for j in range(row):
        factor_load[i][j] = (w[i] ** 0.5) * v[j][i]
print("Factor loading:")
print(factor_load)

#Draw a scatter plot

#Scatter plot of factor loading
for label in range(row):
    plt.scatter(factor_load[0][label], factor_load[1][label], marker="$"+str(label+1)+"$")
    plt.title('factor_load')
    plt.xlabel('pc1')
    plt.ylabel('pc2')
plt.show()

#Scatter plot of main component scores
for label in range(line):
    plt.scatter(main_component[label][0], main_component[label][1], marker="$"+str(label+1)+"$")
    plt.title('principal component')
    plt.xlabel('pc1')
    plt.ylabel('pc2')
plt.show()

Each process of the code is explained below.

Data entry

First, input the number of rows of the matrix to be created with input () from the keyboard. This is the sample size in the original data. Next, input the matrix data with the keyboard and manage it with numpy. I think there is another good method for this input method. (But this time, this part is compared with the target principal component analysis. I thought it was a trivial matter, and I didn't want to spend too much time, so I decided to put it on the shelf and proceed.) Then, get the number of rows and columns of the created matrix. These guys use it a lot. Also, with this code, the average vector and standard deviation vector are initialized at this stage.


#Enter sample size(line)
n = int(input())

#Enter data
a = [input().split() for l in range(n)]
data = np.array(a, dtype=float)
line = int(data.shape[0])
row = int(data.shape[1])
print(line, row)

means = np.empty(row)
sds = np.empty(row)

Data standardization

From the matrix of data, find the mean and standard deviation for each column, and use them to create a matrix that standardizes the original data. (Standardized_data) From here, it's basically a for sentence storm. I used my head very much in the subscript. Also note that when calculating the standard deviation, the number to be divided must be n-1 instead of n, otherwise the value will not be as per the textbook.


#Calculate average
sum = 0
for i in range(row):
    for j in range(line):
        sum +=  data[j][i]
    else:
        means[i] = sum / line
        sum = 0
print("average:")
print(means)

#Calculate standard deviation
for i in range(row):
    for j in range(line):
        sum += (data[j][i]-means[i]) ** 2
    else:
        sds[i] = (sum / (line - 1)) ** 0.5
        sum = 0
print("standard deviation:")
print(sds)

#Standardization
standardized_data = np.zeros((line,row))
for i in range(row):
    for j in range(line):
        standardized_data[j][i] = (data[j][i] - means[i]) / sds[i]  
print(standardized_data)

Calculate the correlation coefficient matrix

We appreciate the textbook equations (9.2) and (9.3) and calculate the correlation coefficient matrix. Of course, we confirm the proof on paper before using the equation. In addition, we are trying to reduce the amount of calculation by utilizing the fact that the correlation coefficient matrix is a symmetric matrix.

#Calculate the correlation coefficient matrix
cc_matrix = np.zeros((row, row))
for i in range(row):
    for j in range(row):
        if(cc_matrix[i][j] == 0):
            for k in range(line):
                sum += standardized_data[k][i] * standardized_data[k][j]
            else:
                cc_matrix[i][j] = sum / (line - 1)
                cc_matrix[j][i] = cc_matrix[i][j]
                sum = 0

Eigenvalue problem

Calculate the eigenvalues and eigenvectors ... but I couldn't think of a scratching method here, so I decided to throw it into the linalg.eig function and move on. I put the eigenvalues in w and the eigenvectors in v. Masu. If you have any good way to scratch here, please let me know.

#Eigenvalue problem
w,v = LA.eig(cc_matrix)
#print(w)
#print(v)

Principal component analysis

This is the main subject. This time, I want to check the flow up to the scatter plot, so I decided to drop two dimensions, that is, two main components, and wrote the code. The for statement is hard ...


#Compressed to 2D by principal component analysis
main_component = np.zeros((line, 2))
z_1, z_2 = 0, 0
for i in range(line):
    for j in range(row):
        z_1 += standardized_data[i][j] * v[j][0]
        z_2 += standardized_data[i][j] * v[j][1]
    else:
        main_component[i][0] = z_1
        main_component[i][1] = z_2
        z_1, z_2 = 0, 0
print("Main component:")
print(main_component)

Factor loading

By the way, it is a factor loading.


#Factor loading
factor_load = np.zeros((2, row))
for i in range(2):
    for j in range(row):
        factor_load[i][j] = (w[i] ** 0.5) * v[j][i]
print("Factor loading:")
print(factor_load)

Scatter plot

I will plot the principal component scores and factor loadings that I worked hard on on a scatter plot. The x-axis is the first principal component and the y-axis is the second principal component.

#Draw a scatter plot

#Scatter plot of factor loading
for label in range(row):
    plt.scatter(factor_load[0][label], factor_load[1][label], marker="$"+str(label+1)+"$")
    plt.title('factor_load')
    plt.xlabel('pc1')
    plt.ylabel('pc2')
plt.show()

#Scatter plot of main component scores
for label in range(line):
    plt.scatter(main_component[label][0], main_component[label][1], marker="$"+str(label+1)+"$")
    plt.title('principal component')
    plt.xlabel('pc1')
    plt.ylabel('pc2')
plt.show()

Scatter plot

load.png main_component.png I was able to draw almost the same thing as a textbook. Is it good

At the end

I wrote an article about qiita for the first time. I think there are many things that are difficult to read. Also, because I am immature, I think there are a lot of tsukkomi in the above code. I welcome tsukkomi to them. Rather, I posted it to qiita to get tsukkomi.

References

Yasushi Nagata, Masahiko Muchinaka Introduction to Multivariate Analysis Method Science, April 10, 2001

Recommended Posts

[Python] PCA scratch in the example of "Introduction to multivariate analysis"
From the introduction of JUMAN ++ to morphological analysis of Japanese with Python
[Introduction to Python] Thorough explanation of the character string type used in Python!
An example of the answer to the reference question of the study session. In python.
How to get the number of digits in Python
Reproduce the execution example of Chapter 4 of Hajipata in Python
[Introduction to Python] Basic usage of the library matplotlib
Reproduce the execution example of Chapter 5 of Hajipata in Python
To do the equivalent of Ruby's ObjectSpace._id2ref in Python
[Python] How to do PCA in Python
In the python command python points to python3.8
Introduction to image analysis opencv python
How to determine the existence of a selenium element in Python
How to know the internal structure of an object in Python
[Cloudian # 9] Try to display the metadata of the object in Python (boto3)
Introduction to Python Basics of Machine Learning (Unsupervised Learning / Principal Component Analysis)
How to check the memory size of a variable in Python
[Introduction to Python] I compared the naming conventions of C # and Python.
Output the contents of ~ .xlsx in the folder to HTML with Python
Feel free to change the label of the legend in Seaborn in python
I wrote the code to write the code of Brainf * ck in python
[Introduction to Python] How to use the in operator in a for statement?
How to check the memory size of a dictionary in Python
[Introduction to Python] How to use class in Python?
Check the behavior of destructor in Python
General Theory of Relativity in Python: Introduction
"Book to train programming skills to fight in the world" Python code answer example --1.4 Permutation of sentences
Example of 3D skeleton analysis by Python
The result of installing python in Anaconda
The basics of running NoxPlayer in Python
Chapter 1 Introduction to Python Cut out only the good points of deep learning made from scratch
In search of the fastest FizzBuzz in Python
Practical example of Hexagonal Architecture in Python
Introduction to Vectors: Linear Algebra in Python <1>
Introduction to Effectiveness Verification Chapter 1 in Python
[Introduction to Data Scientists] Basics of Python ♬
[Introduction to Python] How to sort the contents of a list efficiently with list sort
I want to batch convert the result of "string" .split () in Python
I want to explain the abstract class (ABCmeta) of Python in detail.
I made a program to check the size of a file in Python
[Introduction to Python] What is the method of repeating with the continue statement?
"A book to train programming skills to fight in the world" Python code answer example --1.9 Rotation of strings
Introduction of Python
How to use the C library in Python
Output the number of CPU cores in Python
[Introduction to Udemy Python 3 + Application] 26. Copy of dictionary
[Python] Sort the list of pathlib.Path in natural sort
Introduction to effectiveness verification Chapter 3 written in Python
Summary of how to import files in Python 3
Get the caller of a function in Python
tse --Introduction to Text Stream Editor in Python
Match the distribution of each group in Python
[Introduction to Udemy Python 3 + Application] 19. Copy of list
View the result of geometry processing in Python
To dynamically replace the next method in python
Make a copy of the list in Python
I wrote "Introduction to Effect Verification" in Python
Summary of how to use MNIST in Python
Draw graphs in Julia ... Leave the graphs to Python
Find the divisor of the value entered in python
The trick to write flatten concisely in python