[PYTHON] Principal component analysis with Livedoor News Corpus --Preparation--

About this article

Last time, I tried principal component analysis on text data. I wanted to try it on a different data set, so this time I will apply principal component analysis to the Livedoor News Corpus published by RONDHUIT Co., Ltd.

As preprocessing, I read the text files (one file per article) in sequence, run morphological analysis on each of them, and combine the results into a single csv file.

I used janome as the morphological analysis library.

Reference

- Livedoor News Corpus

Livedoor News Corpus Directory Structure

If you download the file from the link above and unzip it, the text folder contains 9 category folders such as it-life-hack, and the articles of each category are stored under its folder, one file per article.
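To check that the unzipped corpus matches the layout described above, a minimal sketch like the following counts the article files in each category folder. The function name `count_articles` and the constant `NON_ARTICLE_FILES` are hypothetical names introduced for illustration; the three excluded file names are the same ones the preprocessing program skips.

```python
import pathlib
from collections import Counter

# File names that are documentation rather than articles
# (the same names the preprocessing program skips)
NON_ARTICLE_FILES = {'CHANGES.txt', 'README.txt', 'LICENSE.txt'}


def count_articles(root):
    """Return {category folder name: number of article files} under root.

    Assumes the layout root/<category>/<article>.txt.
    """
    counts = Counter()
    # '*/*.txt' only matches files exactly one folder below root
    for p in pathlib.Path(root).glob('*/*.txt'):
        if p.name in NON_ARTICLE_FILES:
            continue
        counts[p.parent.name] += 1
    return dict(counts)
```

For example, `count_articles('c:/temp/text')` should return one entry per category folder if the corpus was unzipped to the location assumed in this article.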

Preprocessing program

python


import pathlib

import pandas as pd
from janome.tokenizer import Tokenizer

# Create the tokenizer once and reuse it for every file
tnz = Tokenizer()

# Folder where the corpus was unzipped (one subfolder per category)
pth = pathlib.Path('c:/temp/text')

# Files that are not article data
skip_files = {'CHANGES.txt', 'README.txt', 'LICENSE.txt'}

rows = []
for p in pth.glob('**/*.txt'):
    # Skip anything other than article data
    if p.name in skip_files:
        continue

    # Open the article and run morphological analysis with janome,
    # keeping one row per word: category, file name, surface form, part of speech
    with open(p, 'r', encoding='utf-8-sig') as f:
        rows.extend([p.parent.name, p.name, t.surface, t.part_of_speech]
                    for s in f for t in tnz.tokenize(s))

# Convert the list to a DataFrame and name the columns
df = pd.DataFrame(rows, columns=['Article classification', 'file name', 'word', 'Part of speech'])

# Write the DataFrame to a csv file
df.to_csv('c:/temp/livedoor_corpus.csv', index=False)

Output result

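The output looks like this: one row per word, with the article classification, file name, word, and part of speech as columns. (Image: 出力結果.png)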
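The part of speech that janome returns is a single comma-separated string (for example `名詞,一般,*,*`), so it can be handy to pull out just the first field when working with the csv later. The following is a minimal sketch under that assumption; the function name `add_major_pos` and the column name `major_pos` are hypothetical, and the sample rows are hand-made for illustration, not taken from the corpus.

```python
import pandas as pd


def add_major_pos(df, pos_column='Part of speech'):
    """Return a copy of df with a 'major_pos' column holding the first
    comma-separated field of the part-of-speech string."""
    out = df.copy()
    out['major_pos'] = out[pos_column].str.split(',').str[0]
    return out


# Tiny hand-made sample in the same column layout as the csv (not real corpus rows)
sample = pd.DataFrame(
    [
        ['it-life-hack', 'sample.txt', '猫', '名詞,一般,*,*'],
        ['it-life-hack', 'sample.txt', 'が', '助詞,格助詞,一般,*'],
        ['it-life-hack', 'sample.txt', '歩く', '動詞,自立,*,*'],
    ],
    columns=['Article classification', 'file name', 'word', 'Part of speech'],
)
print(add_major_pos(sample)['major_pos'].tolist())
```

With the real output, the same function could be applied to `pd.read_csv('c:/temp/livedoor_corpus.csv')`, for example to keep only the rows whose `major_pos` is a noun.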
