Challenging principal component analysis of text data with Python

About this article

In my previous post, I worked through the principal component analysis in Chapter 4 of "The Manga Guide to Statistics [Factor Analysis]" with Python.

This time, I will try principal component analysis of text data with Python.

Reference

- Text Analytics

Background and purpose

Originally, it was this Text Analytics book by Mr. Akitetsu Kim that made me want to learn about principal component analysis. When I wanted to cluster text data, I found the principal component analysis in this book interesting, and that led me to study principal component analysis.

About the analysis

The analysis target is essay data written on three themes (friends, cars, Japanese food). There are 33 essays in total: 3 themes × 11 people.

The data can be obtained from the source code download on the book's support page.

The data is not raw text but is already in Bag of Words (BoW) format, so preprocessing such as morphological analysis and BoW conversion is not covered this time.
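As a quick sanity check before the analysis, the shape of the BoW table can be confirmed after loading (a minimal sketch; the file name, encoding, and column layout follow the code in the next section):

import pandas as pd

# Load the BoW table: each row is one essay, each column is a word count
df = pd.read_csv('./sakubun3f.csv', encoding='cp932')

# Expect 33 rows (3 themes x 11 writers); the first column holds the essay
# labels and the last column is "OTHERS", so the word columns lie in between
print(df.shape)
print(df.columns[:5].tolist())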

Code (principal component analysis)

The code is adapted from the previous article.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
from matplotlib import rcParams
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Hiragino Maru Gothic Pro', 'Yu Gothic', 'Meirio', 'Takao', 'IPAexGothic', 'IPAPGothic', 'Noto Sans CJK JP']

#Read the essay data (adjust the file path to your environment)
df = pd.read_csv('./sakubun3f.csv',encoding='cp932')
data = df.values
# "Words"Column,"OTHERS"Exclude columns
d = data[:,1:-1].astype(np.int64)

#Standardize the data (the standard deviation is the unbiased standard deviation)
X = (d - d.mean(axis=0)) / d.std(ddof=1,axis=0)

#Find the correlation matrix
XX = np.round(np.dot(X.T,X) / (len(X) - 1), 2)

#Find the eigenvalues and eigenvectors of the correlation matrix
w, V = np.linalg.eig(XX)

print('-------eigenvalue-------')
print(np.round(w,3))
print('')

#Find the first principal component
z1 = np.dot(X,V[:,0])

#Find the second principal component
z2 = np.dot(X,V[:,1])

##############################################################
#Plot the first and second principal component scores computed above
##############################################################

#Generating objects for graphs
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-6.0, 6.0]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-6.0, 6.0, 2.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

#Data plot
for (i,j,k) in zip(z1,z2,data[:,0]):
    ax.plot(i,j,'o')
    ax.annotate(k, xy=(i, j),fontsize=16)

#drawing
plt.show()

Execution result (principal component analysis)

-------eigenvalue-------
[ 5.589e+00  4.433e+00  2.739e+00  2.425e+00  2.194e+00  1.950e+00
  1.672e+00  1.411e+00  1.227e+00  1.069e+00  9.590e-01  9.240e-01
  7.490e-01  6.860e-01  5.820e-01  5.150e-01  4.330e-01  3.840e-01
  2.970e-01  2.200e-01  1.620e-01  1.080e-01  8.800e-02  7.800e-02
  4.600e-02  3.500e-02 -7.000e-03 -2.000e-03  4.000e-03  1.700e-02
  1.300e-02]
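To supplement the eigenvalues, the contribution ratio (proportion of variance explained) of each principal component can be computed by dividing each eigenvalue by their sum (a minimal sketch reusing w from the code above; note that np.linalg.eig does not guarantee descending order, so the eigenvalues are sorted first):

# Sort the eigenvalues in descending order and compute contribution ratios
idx = np.argsort(w)[::-1]
ratio = w[idx] / w.sum()
print(np.round(ratio[:2], 3))             # contribution ratios of PC1 and PC2
print(np.round(np.cumsum(ratio)[:2], 3))  # cumulative contribution ratio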

Scatter plot (principal component analysis)

(Image: テキスト主成分分析1.png)

Consideration from the result

According to the book's explanation, labels ending in 9 are "Japanese food" essays, labels ending in 2 are "friend" essays, and labels ending in 5 are "car" essays.

The scatter plot comes out oriented opposite to the book, but the three themes are clearly separated, with "Japanese food" toward the upper left, "friends" toward the upper right, and "cars" toward the lower right. (The orientation being opposite to the book is probably due to the sign arbitrariness of the eigenvectors, which are only determined up to a constant multiple.)
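If you want the plot oriented the same way as the book, one option is to flip the sign of the eigenvectors before computing the scores, since an eigenvector multiplied by -1 is still an eigenvector (a minimal sketch; which axes actually need flipping depends on the eigenvectors returned in your environment):

# Negate the eigenvectors and recompute the principal component scores
z1 = np.dot(X, -V[:, 0])
z2 = np.dot(X, -V[:, 1])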

Code (factor loading)

#Use the eigenvector for the largest eigenvalue as the horizontal axis and the eigenvector for the second largest eigenvalue as the vertical axis
V_ = np.array([(V[:,0]),V[:,1]]).T
V_ = np.round(V_,2)

#Data for graph drawing
data_name=df.columns[1:-1]
z1 = V_[:,0]
z2 = V_[:,1]

#Generating objects for graphs
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111)

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-0.4, 0.4]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-0.4, 0.4, 0.2)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

#Data plot
for (i,j,k) in zip(z1,z2,data_name):
    ax.plot(i,j,'o')
    ax.annotate(k, xy=(i, j),fontsize=14)
    
#drawing
plt.show()

Scatter plot (factor loading)

(Image: テキスト主成分分析2.png)

Consideration from the result

The factor loadings are also flipped in orientation, but otherwise the result is almost the same as in the book.

Toward the upper left are words likely related to the "Japanese food" theme, such as "Japanese" and "rice"; toward the upper right are words likely related to the "friends" theme, such as "best friend" and "friend"; and toward the lower right are words likely related to the "car" theme, such as "traffic" and "accident".

Comparing this with the scatter plot of the principal component scores, you can see that the words likely related to each theme point in the same directions as the corresponding essays.
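To check this correspondence numerically rather than visually, one way is to list the words with the largest positive and negative weights on each axis (a minimal sketch reusing V_ and data_name from the code above):

# Put the weights into a DataFrame indexed by word and sort along each axis
loadings = pd.DataFrame(V_, index=data_name, columns=['Z1', 'Z2'])
print(loadings.sort_values('Z1').head())   # words pulling strongly toward negative Z1
print(loadings.sort_values('Z1').tail())   # words pulling strongly toward positive Z1
print(loadings.sort_values('Z2').head())   # words pulling strongly toward negative Z2
print(loadings.sort_values('Z2').tail())   # words pulling strongly toward positive Z2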

Impressions

Using the code from the previous article, I was able to run principal component analysis on text data more easily than I had expected.

This time the data was already cleanly preprocessed, so the results came out well. Next time, I would like to see whether news articles can be classified as neatly.

end
