Principal Component Analysis with the Livedoor News Corpus (Practice)

About this article

In this article, we use the Livedoor News Corpus to try principal component analysis on text data. Last time, as preparation for the analysis, the text was decomposed into morphemes and summarized in tabular form.

We will use that table to perform the principal component analysis. Using every word as a feature would make the result hard to handle, so I narrow the features down to the top 5 most frequent general nouns for each article category.

The full code is also available at https://github.com/torahirod/TextDataPCA

First, let's check the top 5 frequent words for each article category.

Checking the top 5 frequent words for each article category


import pandas as pd
import numpy as np

#Read the text data that was collected into a single file during preparation
df = pd.read_csv('c:/temp/livedoor_corpus.csv')

#Narrow the part of speech down to general nouns only
df = df[df['Part of speech'].str.startswith('noun,General')].reset_index(drop=True)

#Aggregate the frequency of appearance of words for each article classification
gdf = pd.crosstab([df['Article classification'],df['word'],df['Part of speech']],
                  'number',
                  aggfunc='count',
                  values=df['word']
                 ).reset_index()

#Ranking by descending frequency of word appearance for each article classification
gdf['Ranking'] = gdf.groupby(['Article classification'])['number'].rank('dense', ascending=False)
gdf.sort_values(['Article classification','Ranking'], inplace=True)

#Narrow down to the top 5 frequently-used words for each article classification
gdf = gdf[gdf['Ranking'] <= 5]

#Confirmation of Top 5 Frequent Words for Each Article Classification
for k in gdf['Article classification'].unique():
    display(gdf[gdf['Article classification']==k])

**・dokujo-tsushin** As the category name ("single women's news") suggests, words such as "woman" and "female" rank high.

**・it-life-hack** For IT life hacks, words such as "people" and "app" certainly seem highly relevant. What about "product"? It probably refers to gadgets and similar items.

**・kaden-channel** In the home-appliance channel, words such as "topic", "selling", and "video" appear frequently. What about "person"? It is a little strange that it appears so often in a home-appliance category.

**・livedoor-homme** livedoor-homme is a category aimed at men. Is "golf" a gentleman's pastime? It is also interesting that "annual income" makes the top 5.

**・movie-enter** As the name suggests, the words "movie" and "work" rank high in the movie and entertainment category.

**・peachy** This is also a category aimed at women. It seems hard to separate it from dokujo-tsushin.

**・smax** smax seems to be a category about smartphones and mobile devices. It appears to overlap somewhat with it-life-hack in content.

**・sports-watch** For sports-watch, the top words are what you would expect: "player", "soccer", "baseball". I do not know what the single letter "T" is, so I will check the surrounding text later.

**・topic-news** I had the impression that topic-news covers a wide range of articles, but judging from words such as "net", "bulletin board", and "voice", many of the articles may describe the public reaction to the news.

Checking the text around a word of interest

Earlier, a seemingly unrelated word ranked high for sports-watch, so let's look into it.

#Set the word you are interested in
word = 'T'

df = pd.read_csv('c:/temp/livedoor_corpus.csv')

#Get the appearance position of the word you are interested in
idxes = df[(df['word'] == word)
          &(df['Part of speech'].str.startswith('noun,General'))].index.values.tolist()

#Window size (setting how many words before and after the word you are interested in)
ws = 20

#Get the surrounding sentences of the word you care about
l = []
for i, r in df.loc[idxes, :].iterrows():
    s = i - ws
    e = i + ws
    tmp = df.loc[s:e, :]
    tmp = tmp[tmp['file name']==r['file name']]
    lm = list(map(str, tmp['word'].values.tolist()))
    ss = ''.join(lm)
    l.append([r['Article classification'],r['file name'],r['word'],ss])
rdf = pd.DataFrame(np.array(l))
rdf.columns = ['Article classification','file name','word','Surrounding text']

rdf.head(5)

Apparently, the single letter "T" comes from the URL and timestamp text at the top of each article. If the URLs are excluded, the 4th and 5th words are likely to change, so let's remove the URLs and check the top 5 again.

Text data processing

Modify the preparation code slightly to remove spaces, line breaks, and URLs before counting the morphemes for each file. Judging from the surrounding-text results above, the URL format is fairly fixed, so I exclude it with a rough rule; this is not a clean process that removes every URL completely.
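As a quick illustration of that rough rule, here is a minimal sketch of how the regex used below behaves. The sample string is made up and only imitates the shape of a corpus header (a URL followed by a timestamp ending in a +HHMM offset); it is not actual corpus text.

import re

#Made-up example: URL followed by a timestamp, with the article body after it
s = 'http://news.livedoor.com/article/detail/0000000/2012-01-20T10:00:00+0900Article body text'
print(re.sub(r'http://.*\+[0-9]{4}', '', s))
#-> 'Article body text'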

import pandas as pd
import numpy as np
import pathlib
import glob
import re

from janome.tokenizer import Tokenizer
tnz = Tokenizer()

pth = pathlib.Path('c:/temp/text')

l = []
for p in pth.glob('**/*.txt') :
    #Skip other than article data
    if p.name in ['CHANGES.txt','README.txt','LICENSE.txt']:
        continue
    
    #Open each article, run morphological analysis with janome, and keep the result in a list (one row per word)
    with open(p,'r',encoding='utf-8-sig') as f :
        s = f.read()
        #Remove whitespace (half-width and full-width), line breaks, and URLs (rough rule)
        s = s.replace(' ', '')
        s = s.replace('　', '')
        s = s.replace('\n', '')
        s = re.sub(r'http://.*\+[0-9]{4}', '', s)
        l.extend([[p.parent.name, p.name, t.surface, t.part_of_speech] for t in tnz.tokenize(s)])

#Convert list to dataframe
df = pd.DataFrame(np.array(l))

#Set the column names
df.columns = ['Article classification','file name','word','Part of speech']

#Write the dataframe to CSV
df.to_csv('c:/temp/livedoor_corpus.csv', index=False)

Rechecking the top 5 frequent words for sports-watch

**・sports-watch** Compared with before the processing, the 4th and 5th words have changed. "Team" is clearly related to sports.

We have now confirmed the top 5 frequent words for each article category. The principal component analysis will be carried out using these words as features.

Principal component analysis (2D)

#Keep the top 5 frequent words per article category (obtained in the cell above) as a list
words = gdf['word'].unique().tolist()

df = pd.read_csv('c:/temp/livedoor_corpus.csv')
df = df[df['Part of speech'].str.startswith('noun,General')].reset_index(drop=True)
df = df[df['word'].isin(words)]

#Get a crosstab of the Top 5 Frequent Words for Each File and Article Classification
xdf = pd.crosstab([df['Article classification'],df['file name']],df['word']).reset_index()
#Keep as a list for later output as a factor loading label
cls = xdf.columns.values.tolist()[2:]

#A classification number is assigned to each article classification for later graph display.
ul = xdf['Article classification'].unique()
def _fnc(x):
    return ul.tolist().index(x)
xdf['Class number'] = xdf['Article classification'].apply(lambda x : _fnc(x))

#Preparation for finding the main component
data = xdf.values
labels = data[:,0]
d = data[:, 2:-1].astype(np.int64)
k = data[:, -1].astype(np.int64)

#Standardize the data (the standard deviation is the unbiased standard deviation)
X = (d - d.mean(axis=0)) / d.std(ddof=1,axis=0)

#Find the correlation matrix
XX = np.round(np.dot(X.T,X) / (len(X) - 1), 2)

#Find the eigenvalues and eigenvectors of the correlation matrix
w, V = np.linalg.eig(XX)

#np.linalg.eig does not guarantee any ordering, so sort the eigenvalues
#(and the corresponding eigenvectors) in descending order
idx = np.argsort(w)[::-1]
w = w[idx]
V = V[:, idx]

#Find the first principal component
z1 = np.dot(X,V[:,0])

#Find the second principal component
z2 = np.dot(X,V[:,1])

#Generating objects for graphs (matplotlib is needed from here on)
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-10.0, 10.0]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-10.0, 10.0, 1.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

from matplotlib.colors import ListedColormap
colors = ['red','blue','gold','olive','green','dodgerblue','brown','black','grey']
cmap = ListedColormap(colors)
df = pd.DataFrame({'z1':pd.Series(z1, dtype='float'),
                   'z2':pd.Series(z2, dtype='float'),
                   'k':pd.Series(k, dtype='int'),
                   'labels':pd.Series(labels, dtype='str'),
                  })

#Plot with different colors for each article category
for l in df['labels'].unique():
    d = df[df['labels']==l]
    ax.scatter(d['z1'],d['z2'],c=cmap(d['k']),label=l)
    ax.legend()

#drawing
plt.show()

Looking at the results, points from the same article category do gather in similar positions, but there is a lot of overlap, and the boundary between categories is hard to see.
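As a rough check of how much these two axes actually capture (this is an addition of mine, not in the original article), the eigenvalues w computed above can be turned into contribution ratios; a minimal sketch, assuming w is sorted in descending order as in the cell above:

#Contribution ratio (explained variance) of each principal component
ratio = w / w.sum()
print('Z1:', round(ratio[0], 3))
print('Z2:', round(ratio[1], 3))
print('Z1+Z2 cumulative:', round(ratio[:2].sum(), 3))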

Next, let's check the factor loading.

#Plot each word using the eigenvector for the largest eigenvalue as the horizontal axis and the eigenvector for the second-largest eigenvalue as the vertical axis.
V_ = np.array([(V[:,0]),V[:,1]]).T
V_ = np.round(V_,2)

#Data for graph drawing
z1 = V_[:,0]
z2 = V_[:,1]

#Generating objects for graphs
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-0.4, 0.4]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-0.4, 0.4, 0.05)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

#Data plot
for (i,j,k) in zip(z1,z2,cls):
    ax.plot(i,j,'o')
    ax.annotate(k, xy=(i, j),fontsize=10)
    
#drawing
plt.show()

Comparing this with the earlier principal component plot, the words related to each article category appear at roughly the same positions as that category's cluster of points.
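One note I will add here (not in the original article): what is plotted above is the raw eigenvectors. If you want factor loadings in the stricter sense, each eigenvector is usually scaled by the square root of its eigenvalue; a minimal sketch using the w, V, and cls already computed above:

#Factor loadings in the stricter sense: eigenvector * sqrt(eigenvalue)
#Each row corresponds to a word in cls, each column to a principal component
loadings = V[:, :2] * np.sqrt(w[:2])
print(pd.DataFrame(np.round(loadings, 2), index=cls, columns=['Z1', 'Z2']))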

Principal component analysis (3D)

#Keep the top 5 frequent words per article category (obtained in the cell above) as a list
words = gdf['word'].unique().tolist()

df = pd.read_csv('c:/temp/livedoor_corpus.csv')
df = df[df['Part of speech'].str.startswith('noun,General')].reset_index(drop=True)
df = df[df['word'].isin(words)]

#Get a crosstab of the Top 5 Frequent Words for Each File and Article Classification
xdf = pd.crosstab([df['Article classification'],df['file name']],df['word']).reset_index()
#Keep as a list for later output as a factor loading label
cls = xdf.columns.values.tolist()[2:]

#A classification number is assigned to each article classification for later graph display.
ul = xdf['Article classification'].unique()
def _fnc(x):
    return ul.tolist().index(x)
xdf['Class number'] = xdf['Article classification'].apply(lambda x : _fnc(x))

#Preparation for finding the main component
data = xdf.values
labels = data[:,0]
d = data[:, 2:-1].astype(np.int64)
k = data[:, -1].astype(np.int64)

#Standardize the data (the standard deviation is the unbiased standard deviation)
X = (d - d.mean(axis=0)) / d.std(ddof=1,axis=0)

#Find the correlation matrix
XX = np.round(np.dot(X.T,X) / (len(X) - 1), 2)

#Find the eigenvalues and eigenvectors of the correlation matrix
w, V = np.linalg.eig(XX)

#Sort the eigenvalues (and corresponding eigenvectors) in descending order,
#since np.linalg.eig does not guarantee any ordering
idx = np.argsort(w)[::-1]
w = w[idx]
V = V[:, idx]

#Find the first principal component
z1 = np.dot(X,V[:,0])

#Find the second principal component
z2 = np.dot(X,V[:,1])

#Find the third principal component
z3 = np.dot(X,V[:,2])

#Generating objects for graphs
from mpl_toolkits.mplot3d import Axes3D  #needed on older matplotlib versions for the 3d projection

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-10.0, 10.0]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-10.0, 10.0, 1.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis labels, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.set_zlabel('Z3', fontsize=16)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

from matplotlib.colors import ListedColormap
colors = ['red','blue','gold','olive','green','dodgerblue','brown','black','grey']
cmap = ListedColormap(colors)

df = pd.DataFrame({'z1':pd.Series(z1, dtype='float'),
                   'z2':pd.Series(z2, dtype='float'),
                   'z3':pd.Series(z3, dtype='float'),
                   'k':pd.Series(k, dtype='int'),
                   'labels':pd.Series(labels, dtype='str'),
                  })

for l in df['labels'].unique():
    d = df[df['labels']==l]
    ax.scatter(d['z1'],d['z2'],d['z3'],c=cmap(d['k']),label=l)
    ax.legend()

#drawing
plt.show()

Even in three dimensions the points still overlap, and it seems difficult to draw clean boundaries. Still, being able to rotate the graph and view it from different angles is interesting.

You can also rotate the plot to look at it from the other side.
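In a Jupyter notebook, an interactive backend such as %matplotlib notebook lets you drag the figure to rotate it. If you prefer to set the viewing angle in code, Axes3D also has a view_init method; a minimal sketch (the angle values here are arbitrary, my own choice):

#Set the camera angle before calling plt.show() (elevation and azimuth in degrees)
ax.view_init(elev=30, azim=120)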

Factor loading (3D)

#Use the eigenvectors for the three largest eigenvalues as the three coordinate axes.
V_ = np.array([(V[:,0]),V[:,1],V[:,2]]).T
V_ = np.round(V_,2)

#Data for graph drawing
z1 = V_[:,0]
z2 = V_[:,1]
z3 = V_[:,2]

#Generating objects for graphs
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-0.4, 0.4]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-0.4, 0.4, 0.05)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16)
ax.set_zlabel('Z3', fontsize=16)

ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

#Data plot (label each point with its word)
for label, x, y, z in zip(cls, z1, z2, z3):
    ax.scatter(x, y, z)
    ax.text(x, y, z, label)

#drawing
plt.show()


Source code

https://github.com/torahirod/TextDataPCA

Impressions

It was interesting that, just by running principal component analysis on the top 5 frequent words for each article category, a certain degree of clustering could be confirmed visually.

It may also be interesting to try the following and see the results.

・Expand to the top 10 words.
・Of the top 5, exclude words that appear in multiple article categories.
・Use features such as TF-IDF instead of the simple appearance frequency (a rough sketch follows below).
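For the TF-IDF idea, one possible approach (my own sketch, assuming scikit-learn is available; it is not part of the original code) is to apply TfidfTransformer to the per-file word-count matrix built from the crosstab, then run the same standardization and eigen decomposition on the result:

from sklearn.feature_extraction.text import TfidfTransformer

#Rebuild the file x word count matrix from the crosstab (xdf and cls from above)
counts = xdf[cls].values
d_tfidf = TfidfTransformer().fit_transform(counts).toarray()

#From here, the standardization and eigen decomposition can be applied to
#d_tfidf exactly as was done for the raw counts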

From here, I would like to try document classification by combining the 2D and 3D coordinates obtained from the principal component analysis with k-means.
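As a preview of that direction, here is a minimal sketch (my own, not something done in this article) of clustering the documents in the space of the first two principal components with scikit-learn's KMeans, using 9 clusters to match the number of article categories:

from sklearn.cluster import KMeans

#Document coordinates on the first two principal components
#(X and V are the standardized data and sorted eigenvectors from the PCA cell above)
Z = np.dot(X, V[:, :2])

#Cluster into 9 groups and compare the assignments with the actual categories
km = KMeans(n_clusters=9, random_state=0)
clusters = km.fit_predict(Z)
print(pd.crosstab(labels, clusters))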
