Quick batch text formatting + preprocessing of Aozora Bunko data for natural language processing in Python

Introduction

Aozora Bunko is a website where volunteers digitize and publish literary works whose copyright has expired. For engineers who want to work with Aozora Bunko data, **the text and HTML versions of all works are now uploaded daily to Aozora Bunko's GitHub repository, so the data can be downloaded in bulk**.

If you want to use Aozora Bunko data for natural language processing, the text files follow a fixed format, so it takes a bit of work to go from data collection to preprocessing for just a specific author. So I wrote **a one-line bulk download for a specific author plus a Python script that makes the preprocessing easy**, and decided to leave it here on Qiita.

Goal

- Batch-download only the data of a specific author from Aozora Bunko
- Preprocess the downloaded text files and save them as UTF-8 TSV files
- Save as TSV with the body text in the first column and the work title in the second column
- Apply the text formatting below

| Item | Before formatting (example) | After formatting (example) |
|:--|:--|:--|
| Notes and bibliographic information | "底本：" lines, transcriber's notes such as "※［＃「勹＜夕」、第3水準1-14-76］" in the text | (deleted) |
| Section delimiters | 「―――」「×××」「***」 | (deleted) |
| Typographic symbols | dash "―", ellipsis "…", reference mark "※" | (deleted) |
| Lines of one character or less, blank lines | "三" (section numbers, etc.) | (deleted) |
| Ruby notation | text with 《...》 ruby readings, e.g. 「東京行《ゆき》の列車」 | ruby removed: 「東京行の列車」 |
| Indentation (full-width space at the start of a line) | "　In the study, ..." | "In the study, ..." |
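Most of these rules boil down to simple regular-expression substitutions. Here is a minimal sketch of the idea (the sample line and the note text are made up for illustration; the actual script below does the same thing with pandas' replace):


import re

# A made-up Aozora-style line containing a leading full-width space,
# a ruby reading in 《 》 and a transcriber's note in ※［＃ ... ］
line = '　桃《もも》の木※［＃注記の例］'
line = re.sub('《.*?》', '', line)   # strip ruby readings
line = re.sub('［.*?］', '', line)   # strip transcriber's notes
line = re.sub('※|｜|　', '', line)   # strip reference marks, ruby-start markers, full-width spaces
print(line)  # -> 桃の木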

Batch-download only the data of a specific author

The easiest way to bulk-download the data of a particular author is the **svn command**. The URL is https://github.com/aozorabunko/aozorabunko/trunk/cards/{author ID}. The {author ID} is the 6-digit number that appears in the URL when you open a specific author's page on the Aozora Bunko website. For Ryunosuke Akutagawa, for example, it is "000879".

qiita_01.jpg

You can download all files of a specific author (here, Ryunosuke Akutagawa) by passing that URL to svn export as shown below. By the way, svn is available out of the box on Linux and Mac. I have not checked Windows, but even as a Windows user you can **easily start Ubuntu on WSL and run the command there**.

Download all data of Ryunosuke Akutagawa


svn export https://github.com/aozorabunko/aozorabunko/trunk/cards/000879/

This will create a local ./000879/ directory.
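If you want to double-check the result from Python, something like the following (a hypothetical sanity check, not part of the original workflow) lists the downloaded ZIP files; the ./000879/files/ layout is the one the preprocessing script below relies on:


from pathlib import Path

# The zipped text files end up under ./{author ID}/files/
for zip_path in sorted(Path('./000879/files/').glob('*.zip')):
    print(zip_path.name)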

Pre-processing (text formatting + saving)

The following Python script batch-formats the downloaded ZIP files and saves the results as TSV. The processing outline is listed below, and each step is also described in the code comments. It could be organized as a class, but for now it is just a set of functions.

- Find all ZIP files under a specific directory and store them in a list
- Create the output directories
- Loop over the files in list order (for)
- Read the ZIP-compressed txt as a pandas DataFrame: save_cleanse_text()
- Convert the original data to UTF-8 and save it as a text file
- Text formatting: text_cleanse_df()
- Save as TSV with the work title in the second column

aozora_preprocess.py


import pandas as pd
from pathlib import Path

author_id = '000879'  # Aozora Bunko author ID
author_name = '芥川竜之介'  # Author name in Aozora Bunko notation (must match the author line in the text files)

write_title = True  # Whether to put the work title in the second column
write_header = True  # Whether to write the column names ("text", "title") as the first row
save_utf8_org = True  # Whether to also save the original data converted to UTF-8

out_dir = Path(f'./out_{author_id}/')  # Output destination
tx_org_dir = Path(out_dir / './org/')  # Save destination for the UTF-8-converted original text
tx_edit_dir = Path(out_dir / './edit/')  # Save destination for the formatted text


def text_cleanse_df(df):
    # Find the start of the body (assumes the body begins right after the '---…' delimiter)
    head_tx = list(df[df['text'].str.contains(
        '-------------------------------------------------------')].index)
    # Find the end of the body (assumes the body ends right before the '底本：' bibliographic line)
    atx = list(df[df['text'].str.contains('底本：')].index)
    if head_tx == []:
        # If there is no '---…' delimiter, assume the body starts right after the author-name line
        head_tx = list(df[df['text'].str.contains(author_name)].index)
        head_tx_num = head_tx[0]+1
    else:
        # The body starts right after the second '---…' delimiter
        head_tx_num = head_tx[1]+1
    df_e = df[head_tx_num:atx[0]]

    # Remove Aozora Bunko markup
    df_e = df_e.replace({'text': {'《.*?》': ''}}, regex=True)  # ruby readings
    df_e = df_e.replace({'text': {'［.*?］': ''}}, regex=True)  # transcriber's notes ［＃...］
    df_e = df_e.replace({'text': {'｜': ''}}, regex=True)  # ruby-start markers

    # Remove indentation (full-width space at the start of a line)
    df_e = df_e.replace({'text': {'　': ''}}, regex=True)

    # Remove section delimiters (single-character lines and '―――' / '***' / '×××' lines)
    df_e = df_e.replace({'text': {'^.$': ''}}, regex=True)
    df_e = df_e.replace({'text': {'^―――.*$': ''}}, regex=True)
    df_e = df_e.replace({'text': {r'^\*\*\*.*$': ''}}, regex=True)
    df_e = df_e.replace({'text': {'^×××.*$': ''}}, regex=True)

    # Remove leftover symbols and the empty 「」 pairs created by the deletions above
    df_e = df_e.replace({'text': {'―': ''}}, regex=True)
    df_e = df_e.replace({'text': {'…': ''}}, regex=True)
    df_e = df_e.replace({'text': {'※': ''}}, regex=True)
    df_e = df_e.replace({'text': {'「」': ''}}, regex=True)

    #Delete lines consisting of one character or less
    df_e['length'] = df_e['text'].map(lambda x: len(x))
    df_e = df_e[df_e['length'] > 1]

    # Reset the index, since rows have been dropped
    df_e = df_e.reset_index().drop(['index'], axis=1)

    # Remove blank lines (just in case)
    df_e = df_e[~(df_e['text'] == '')]

    # Reset the index again and drop the character-length column
    df_e = df_e.reset_index().drop(['index', 'length'], axis=1)
    return df_e


def save_cleanse_text(target_file):
    try:
        #Read file
        print(target_file)
        # Read as a pandas DataFrame (cp932 is needed so extended Shift_JIS characters are decoded correctly)
        df_tmp = pd.read_csv(target_file, encoding='cp932', names=['text'])
        # Convert the original data to UTF-8 and save it as a text file
        if save_utf8_org:
            out_org_file_nm = Path(target_file.stem + '_org_utf-8.txt')
            df_tmp.to_csv(Path(tx_org_dir / out_org_file_nm), sep='\t',
                          encoding='utf-8', index=None)
        #Text formatting
        df_tmp_e = text_cleanse_df(df_tmp)
        if write_title:
            # Add a title column (the first line of the original file is the work title)
            df_tmp_e['title'] = df_tmp['text'][0]
        out_edit_file_nm = Path(target_file.stem + '_clns_utf-8.tsv')
        df_tmp_e.to_csv(Path(tx_edit_dir / out_edit_file_nm), sep='\t',
                        encoding='utf-8', index=None, header=write_header)
    except Exception:
        print(f'ERROR: {target_file}')


def main():
    tx_dir = Path(f'./{author_id}/files/')
    #Create a list of zip files
    zip_list = list(tx_dir.glob('*.zip'))
    #Create a save directory
    tx_edit_dir.mkdir(exist_ok=True, parents=True)
    if save_utf8_org:
        tx_org_dir.mkdir(exist_ok=True, parents=True)

    for target_file in zip_list:
        save_cleanse_text(target_file)


if __name__ == '__main__':
    main()
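To process a different author, download that author's data with svn as shown above and change the two variables at the top of the script; the directory layout and file names all follow from author_id. A hypothetical example (look up the actual 6-digit ID on the author's Aozora Bunko page):


author_id = '012345'  # hypothetical 6-digit author ID taken from the author's page URL
author_name = '〇〇〇〇'  # the matching author name as it appears in the text files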

Execution result

100_ruby_1154_org_utf-8.txt (original data)


Momotaro
Ryunosuke Akutagawa
-------------------------------------------------------
[About the symbols that appear in the text]
《》: ruby
(Example) Peach《もも》
｜: marks the start of the character string that a ruby reading applies to
(Example) Heaven and Earth｜at the time《ころ》 of creation
［＃］: transcriber's note, mainly explanations of gaiji (external characters) and placement of emphasis marks
(the numbers are JIS X 0213 plane-row-cell codes or Unicode, plus the source book's page and line numbers)
(Example) ※［＃"word radical + the right half of 墟", level 4 2-88-74］
-------------------------------------------------------
［＃8-character indent］1［＃"1" is a middle heading］
Once upon a time, once upon a time, there was a large peach tree in the depths of a deep mountain. (...)
(…)
What kind of person picked up this baby《あかじ》 after it left the depths of the deep mountains? ――There is no need to tell that story anymore. At the end of a mountain stream, an old woman was washing the kimono or something of an old man who had gone out to cut brushwood, as children all over Japan know. ……
(…)

100_ruby_1154_clns_utf-8.tsv (preprocessed data)


text	title
Once upon a time, once upon a time, there was a large peach tree in the depths of a deep mountain. (...)	Momotaro
(…)
What kind of person picked up this baby after it left the depths of the deep mountains? There is no need to tell that story anymore. At the end of a mountain stream, an old woman was washing the kimono or something of an old man who had gone out to cut brushwood, as children all over Japan know.	Momotaro
(…)

The parts that are not needed for natural language processing have been removed successfully. Feel free to turn this into a class or add whatever other formatting steps you need.
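To feed the result into a downstream NLP pipeline, the saved TSV can be read straight back into pandas (a minimal sketch, using the file name from the example above):


import pandas as pd

# First column: body text, second column: work title
df = pd.read_csv('./out_000879/edit/100_ruby_1154_clns_utf-8.tsv', sep='\t')
print(df.head())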
