Principal Component Analysis with the Livedoor News Corpus (Practice)

About this article

In this article, we use the Livedoor News Corpus to try principal component analysis on text data. Last time, as preparation for the analysis, the text was decomposed into morphemes and summarized in tabular form.

We will use that table to perform the principal component analysis. Using every word as a feature would make the result hard to handle, so I narrow the features down to the top 5 most frequent general nouns for each article category.

The full code is also available at https://github.com/torahirod/TextDataPCA

First, let's check the top 5 frequent words for each article category.

Checking the top 5 frequent words for each article category


import pandas as pd
import numpy as np

#Read the text data that was collected into a single file during preparation
df = pd.read_csv('c:/temp/livedoor_corpus.csv')

#Narrow the part of speech down to general nouns only
df = df[df['Part of speech'].str.startswith('noun,General')].reset_index(drop=True)

#Aggregate the frequency of appearance of words for each article classification
gdf = pd.crosstab([df['Article classification'],df['word'],df['Part of speech']],
                  'number',
                  aggfunc='count',
                  values=df['word']
                 ).reset_index()

#Ranking by descending frequency of word appearance for each article classification
gdf['Ranking'] = gdf.groupby(['Article classification'])['number'].rank('dense', ascending=False)
gdf.sort_values(['Article classification','Ranking'], inplace=True)

#Narrow down to the top 5 frequently-used words for each article classification
gdf = gdf[gdf['Ranking'] <= 5]

#Confirmation of Top 5 Frequent Words for Each Article Classification
for k in gdf['Article classification'].unique():
    display(gdf[gdf['Article classification']==k])

**・dokujo-tsushin** As the category name ("single women's news") suggests, words such as "woman" and "female" rank high.

**・it-life-hack** For IT life hacks, words such as "people" and "app" certainly seem highly relevant. What about "product"? It probably refers to gadgets and similar items.

**・kaden-channel** In the home-appliance channel, words such as "topic", "selling", and "video" appear frequently. What about "person"? It is a little strange that it appears so often in a home-appliance category.

**・livedoor-homme** livedoor-homme is a category aimed at men. Is "golf" a gentleman's pastime? It is also interesting that "annual income" makes the top 5.

**・movie-enter** As the name suggests, the words "movie" and "work" rank high in the movie and entertainment category.

**・peachy** This is also a category aimed at women. It seems hard to separate it from dokujo-tsushin.

**・smax** smax seems to be a category about smartphones and mobile devices. It appears to overlap somewhat with it-life-hack in content.

**・sports-watch** For sports-watch, the top words are what you would expect: "player", "soccer", "baseball". I do not know what the single letter "T" is, so I will check the surrounding text later.

**・topic-news** I had the impression that topic-news covers a wide range of articles, but judging from words such as "net", "bulletin board", and "voice", many of the articles may describe the public reaction to the news.

Checking the text around a word of interest

Earlier, a seemingly unrelated word ranked high for sports-watch, so let's look into it.

#Set the word you are interested in
word = 'T'

df = pd.read_csv('c:/temp/livedoor_corpus.csv')

#Get the appearance position of the word you are interested in
idxes = df[(df['word'] == word)
          &(df['Part of speech'].str.startswith('noun,General'))].index.values.tolist()

#Window size (setting how many words before and after the word you are interested in)
ws = 20

#Get the surrounding sentences of the word you care about
l = []
for i, r in df.loc[idxes, :].iterrows():
    s = i - ws
    e = i + ws
    tmp = df.loc[s:e, :]
    tmp = tmp[tmp['file name']==r['file name']]
    lm = list(map(str, tmp['word'].values.tolist()))
    ss = ''.join(lm)
    l.append([r['Article classification'],r['file name'],r['word'],ss])
rdf = pd.DataFrame(np.array(l))
rdf.columns = ['Article classification','file name','word','Surrounding text']

rdf.head(5)

Apparently, the single letter "T" comes from the URL and timestamp text at the top of each article. If the URLs are excluded, the 4th and 5th words are likely to change, so let's remove the URLs and check the top 5 again.

Text data processing

Modify the preparation code slightly to remove spaces, line breaks, and URLs before counting the morphemes for each file. Judging from the surrounding-text results above, the URL format is fairly fixed, so I exclude it with a rough rule; this is not a clean process that removes every URL completely.
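As a quick illustration of that rough rule, here is a minimal sketch of how the regex used below behaves. The sample string is made up and only imitates the shape of a corpus header (a URL followed by a timestamp ending in a +HHMM offset); it is not actual corpus text.

import re

#Made-up example: URL followed by a timestamp, with the article body after it
s = 'http://news.livedoor.com/article/detail/0000000/2012-01-20T10:00:00+0900Article body text'
print(re.sub(r'http://.*\+[0-9]{4}', '', s))
#-> 'Article body text'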

import pandas as pd
import numpy as np
import pathlib
import glob
import re

from janome.tokenizer import Tokenizer
tnz = Tokenizer()

pth = pathlib.Path('c:/temp/text')

l = []
for p in pth.glob('**/*.txt') :
    #Skip other than article data
    if p.name in ['CHANGES.txt','README.txt','LICENSE.txt']:
        continue
    
    #Open each article, run morphological analysis with janome, and keep the result in a list (one row per word)
    with open(p,'r',encoding='utf-8-sig') as f :
        s = f.read()
        #Remove whitespace (half-width and full-width), line breaks, and URLs (rough rule)
        s = s.replace(' ', '')
        s = s.replace('　', '')
        s = s.replace('\n', '')
        s = re.sub(r'http://.*\+[0-9]{4}', '', s)
        l.extend([[p.parent.name, p.name, t.surface, t.part_of_speech] for t in tnz.tokenize(s)])

#Convert list to dataframe
df = pd.DataFrame(np.array(l))

#Set the column names
df.columns = ['Article classification','file name','word','Part of speech']

#Write the dataframe to CSV
df.to_csv('c:/temp/livedoor_corpus.csv', index=False)

Rechecking the top 5 frequent words for sports-watch

**・sports-watch** Compared with before the processing, the 4th and 5th words have changed. "Team" is clearly related to sports.

We have now confirmed the top 5 frequent words for each article category. The principal component analysis will be carried out using these words as features.

Principal component analysis (2D)

#Keep the top 5 frequent words per article category (obtained in the cell above) as a list
words = gdf['word'].unique().tolist()

df = pd.read_csv('c:/temp/livedoor_corpus.csv')
df = df[df['Part of speech'].str.startswith('noun,General')].reset_index(drop=True)
df = df[df['word'].isin(words)]

#Get a crosstab of the Top 5 Frequent Words for Each File and Article Classification
xdf = pd.crosstab([df['Article classification'],df['file name']],df['word']).reset_index()
#Keep as a list for later output as a factor loading label
cls = xdf.columns.values.tolist()[2:]

#A classification number is assigned to each article classification for later graph display.
ul = xdf['Article classification'].unique()
def _fnc(x):
    return ul.tolist().index(x)
xdf['Class number'] = xdf['Article classification'].apply(lambda x : _fnc(x))

#Preparation for finding the main component
data = xdf.values
labels = data[:,0]
d = data[:, 2:-1].astype(np.int64)
k = data[:, -1].astype(np.int64)

#Standardize the data (the standard deviation is the unbiased standard deviation)
X = (d - d.mean(axis=0)) / d.std(ddof=1,axis=0)

#Find the correlation matrix
XX = np.round(np.dot(X.T,X) / (len(X) - 1), 2)

#Find the eigenvalues and eigenvectors of the correlation matrix
w, V = np.linalg.eig(XX)

#np.linalg.eig does not guarantee any ordering, so sort the eigenvalues
#(and the corresponding eigenvectors) in descending order
idx = np.argsort(w)[::-1]
w = w[idx]
V = V[:, idx]

#Find the first principal component
z1 = np.dot(X,V[:,0])

#Find the second principal component
z2 = np.dot(X,V[:,1])

#Generating objects for graphs (matplotlib is needed from here on)
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-10.0, 10.0]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-10.0, 10.0, 1.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

from matplotlib.colors import ListedColormap
colors = ['red','blue','gold','olive','green','dodgerblue','brown','black','grey']
cmap = ListedColormap(colors)
df = pd.DataFrame({'z1':pd.Series(z1, dtype='float'),
                   'z2':pd.Series(z2, dtype='float'),
                   'k':pd.Series(k, dtype='int'),
                   'labels':pd.Series(labels, dtype='str'),
                  })

#Plot with different colors for each article category
for l in df['labels'].unique():
    d = df[df['labels']==l]
    ax.scatter(d['z1'],d['z2'],c=cmap(d['k']),label=l)
    ax.legend()

#drawing
plt.show()

Looking at the results, points from the same article category do gather in similar positions, but there is a lot of overlap, and the boundary between categories is hard to see.
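As a rough check of how much these two axes actually capture (this is an addition of mine, not in the original article), the eigenvalues w computed above can be turned into contribution ratios; a minimal sketch, assuming w is sorted in descending order as in the cell above:

#Contribution ratio (explained variance) of each principal component
ratio = w / w.sum()
print('Z1:', round(ratio[0], 3))
print('Z2:', round(ratio[1], 3))
print('Z1+Z2 cumulative:', round(ratio[:2].sum(), 3))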

Next, let's check the factor loading.

#Plot each word using the eigenvector for the largest eigenvalue as the horizontal axis and the eigenvector for the second-largest eigenvalue as the vertical axis.
V_ = np.array([(V[:,0]),V[:,1]]).T
V_ = np.round(V_,2)

#Data for graph drawing
z1 = V_[:,0]
z2 = V_[:,1]

#Generating objects for graphs
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-0.4, 0.4]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-0.4, 0.4, 0.05)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

#Data plot
for (i,j,k) in zip(z1,z2,cls):
    ax.plot(i,j,'o')
    ax.annotate(k, xy=(i, j),fontsize=10)
    
#drawing
plt.show()

Comparing this with the earlier principal component plot, the words related to each article category appear at roughly the same positions as that category's cluster of points.
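One note I will add here (not in the original article): what is plotted above is the raw eigenvectors. If you want factor loadings in the stricter sense, each eigenvector is usually scaled by the square root of its eigenvalue; a minimal sketch using the w, V, and cls already computed above:

#Factor loadings in the stricter sense: eigenvector * sqrt(eigenvalue)
#Each row corresponds to a word in cls, each column to a principal component
loadings = V[:, :2] * np.sqrt(w[:2])
print(pd.DataFrame(np.round(loadings, 2), index=cls, columns=['Z1', 'Z2']))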

Principal component analysis (3D)

#Keep the top 5 frequent words per article category (obtained in the cell above) as a list
words = gdf['word'].unique().tolist()

df = pd.read_csv('c:/temp/livedoor_corpus.csv')
df = df[df['Part of speech'].str.startswith('noun,General')].reset_index(drop=True)
df = df[df['word'].isin(words)]

#Get a crosstab of the Top 5 Frequent Words for Each File and Article Classification
xdf = pd.crosstab([df['Article classification'],df['file name']],df['word']).reset_index()
#Keep as a list for later output as a factor loading label
cls = xdf.columns.values.tolist()[2:]

#A classification number is assigned to each article classification for later graph display.
ul = xdf['Article classification'].unique()
def _fnc(x):
    return ul.tolist().index(x)
xdf['Class number'] = xdf['Article classification'].apply(lambda x : _fnc(x))

#Preparation for finding the main component
data = xdf.values
labels = data[:,0]
d = data[:, 2:-1].astype(np.int64)
k = data[:, -1].astype(np.int64)

#Standardize the data (the standard deviation is the unbiased standard deviation)
X = (d - d.mean(axis=0)) / d.std(ddof=1,axis=0)

#Find the correlation matrix
XX = np.round(np.dot(X.T,X) / (len(X) - 1), 2)

#Find the eigenvalues and eigenvectors of the correlation matrix
w, V = np.linalg.eig(XX)

#Sort the eigenvalues (and corresponding eigenvectors) in descending order,
#since np.linalg.eig does not guarantee any ordering
idx = np.argsort(w)[::-1]
w = w[idx]
V = V[:, idx]

#Find the first principal component
z1 = np.dot(X,V[:,0])

#Find the second principal component
z2 = np.dot(X,V[:,1])

#Find the third principal component
z3 = np.dot(X,V[:,2])

#Generating objects for graphs
from mpl_toolkits.mplot3d import Axes3D  #needed on older matplotlib versions for the 3d projection

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-10.0, 10.0]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-10.0, 10.0, 1.0)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis labels, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16, rotation=0)
ax.set_zlabel('Z3', fontsize=16)
ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

from matplotlib.colors import ListedColormap
colors = ['red','blue','gold','olive','green','dodgerblue','brown','black','grey']
cmap = ListedColormap(colors)

df = pd.DataFrame({'z1':pd.Series(z1, dtype='float'),
                   'z2':pd.Series(z2, dtype='float'),
                   'z3':pd.Series(z3, dtype='float'),
                   'k':pd.Series(k, dtype='int'),
                   'labels':pd.Series(labels, dtype='str'),
                  })

for l in df['labels'].unique():
    d = df[df['labels']==l]
    ax.scatter(d['z1'],d['z2'],d['z3'],c=cmap(d['k']),label=l)
    ax.legend()

#drawing
plt.show()

Even in three dimensions the points still overlap, and it seems difficult to draw clean boundaries. Still, being able to rotate the graph and view it from different angles is interesting.

You can also rotate the plot to look at it from the other side.
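In a Jupyter notebook, an interactive backend such as %matplotlib notebook lets you drag the figure to rotate it. If you prefer to set the viewing angle in code, Axes3D also has a view_init method; a minimal sketch (the angle values here are arbitrary, my own choice):

#Set the camera angle before calling plt.show() (elevation and azimuth in degrees)
ax.view_init(elev=30, azim=120)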

Factor loading (3D)

#Use the eigenvectors for the three largest eigenvalues as the three coordinate axes.
V_ = np.array([(V[:,0]),V[:,1],V[:,2]]).T
V_ = np.round(V_,2)

#Data for graph drawing
z1 = V_[:,0]
z2 = V_[:,1]
z3 = V_[:,2]

#Generating objects for graphs
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')

#Insert grid lines
ax.grid()

#Boundary of data to draw
lim = [-0.4, 0.4]
ax.set_xlim(lim)
ax.set_ylim(lim)

#Bring the left and bottom axes to the middle
ax.spines['bottom'].set_position(('axes', 0.5))
ax.spines['left'].set_position(('axes', 0.5))
#Erase the right and top axes
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

#Adjust the axis scale spacing
ticks = np.arange(-0.4, 0.4, 0.05)
ax.set_xticks(ticks)
ax.set_yticks(ticks)

#Add axis label, adjust position
ax.set_xlabel('Z1', fontsize=16)
ax.set_ylabel('Z2', fontsize=16)
ax.set_zlabel('Z3', fontsize=16)

ax.xaxis.set_label_coords(1.02, 0.49)
ax.yaxis.set_label_coords(0.5, 1.02)

#Data plot (label each point with its word)
for label, x, y, z in zip(cls, z1, z2, z3):
    ax.scatter(x, y, z)
    ax.text(x, y, z, label)

#drawing
plt.show()


Source code

https://github.com/torahirod/TextDataPCA

Impressions

It was interesting that, just by running principal component analysis on the top 5 frequent words for each article category, a certain degree of clustering could be confirmed visually.

It may also be interesting to try the following and see the results.

・Expand to the top 10 words.
・Of the top 5, exclude words that appear in multiple article categories.
・Use features such as TF-IDF instead of the simple appearance frequency (a rough sketch follows below).
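For the TF-IDF idea, one possible approach (my own sketch, assuming scikit-learn is available; it is not part of the original code) is to apply TfidfTransformer to the per-file word-count matrix built from the crosstab, then run the same standardization and eigen decomposition on the result:

from sklearn.feature_extraction.text import TfidfTransformer

#Rebuild the file x word count matrix from the crosstab (xdf and cls from above)
counts = xdf[cls].values
d_tfidf = TfidfTransformer().fit_transform(counts).toarray()

#From here, the standardization and eigen decomposition can be applied to
#d_tfidf exactly as was done for the raw counts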

From here, I would like to try document classification by combining the 2D and 3D coordinates obtained from the principal component analysis with k-means.
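As a preview of that direction, here is a minimal sketch (my own, not something done in this article) of clustering the documents in the space of the first two principal components with scikit-learn's KMeans, using 9 clusters to match the number of article categories:

from sklearn.cluster import KMeans

#Document coordinates on the first two principal components
#(X and V are the standardized data and sorted eigenvectors from the PCA cell above)
Z = np.dot(X, V[:, :2])

#Cluster into 9 groups and compare the assignments with the actual categories
km = KMeans(n_clusters=9, random_state=0)
clusters = km.fit_predict(Z)
print(pd.crosstab(labels, clusters))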
