[2020 version] Scraping and processing the text from Aozora Bunko

Overview

I scraped the text of works published on Aozora Bunko with Python and cleaned it up into a usable form. I got tripped up in quite a few places along the way, so this is a memo of those pitfalls.

Environment

Acquisition of text

First, get the text of a work from Aozora Bunko. The basic procedure is described in this article (https://qiita.com/icy_mountain/items/011c9f56151b9832b54d): hit the Aozora Bunko API (https://qiita.com/ksato9700/items/626cc82c007ba8337034) to fetch the HTML of the body, with a variable book_id substituted into the work-ID part of the URL. However, this didn't work as-is in my environment. First, I was working in a Jupyter Notebook rather than a terminal, so I couldn't use the !wget command. Looking up how to send a GET request to an API in Python turns up urllib2.urlopen() and requests.get(), but urllib2 is a Python 2 library and can't be used here; it was apparently reorganized into urllib in Python 3 (urllib3 is a separate package), but I wasn't sure about the details, so I decided to use the requests library. The API fetch with the GET method then looks like this:

import requests

book_id = 930  # Aozora Bunko work ID
res = requests.get('http://pubserver2.herokuapp.com/api/v0.1/books/{book_id}/content?format=html'.format(book_id=book_id))
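As a defensive check (my addition, not in the original article), it can be worth verifying the response before parsing; requests also sometimes misdetects the encoding of older Japanese pages, and apparent_encoding lets it re-guess from the content:

if res.status_code != 200:
    raise RuntimeError('fetch failed: HTTP {}'.format(res.status_code))
res.encoding = res.apparent_encoding  # re-detect the charset from the response body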

Extraction of text

Next, convert the fetched HTML data into a form that BeautifulSoup4 can work with.

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')

To get the text inside the title tag from here, write:

soup.find('title').get_text()
# -> Yumeno Kyusaku Blue Narcissus, Red Narcissus

The title is conveniently separated by a half-width space, so calling split() on it divides it into the author name and the title. In some cases, such as foreign authors whose names themselves contain spaces, the split doesn't come out cleanly, so also store a flag for whether any leftover pieces remain.

title = soup.find('title').get_text()
title_list = title.split(' ')
book_author = title_list[0]  # author name
book_title = title_list[1]  # title
book_title_info = len(title_list) > 2  # True if the title split into extra pieces
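To illustrate the edge case the flag guards against (a hypothetical example, not from the original), any title string with more than one space produces extra pieces:

title_list = 'Edgar Allan Poe The Black Cat'.split(' ')
print(title_list[0], title_list[1])  # -> Edgar Allan: the naive author/title split breaks down
print(len(title_list) > 2)  # -> True, so book_title_info flags it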

On the other hand, the body text (in the true sense of the word) lives in a div tag with the main_text class, so it is extracted like this:

soup.find('div', {'class': 'main_text'}).get_text()
# -> \n\n\n\r\n\u3000 Utako was taught by a friend to cut the roots of daffodils, put red and blue paints in them, and buried them in the corner of the garden.[...]

This time I want to split the text into sentences at the Japanese full stop (。), so I turn it into a list with one sentence per element as follows. To get the first sentence, just take element 0.

text_list = soup.find('div', {'class': 'main_text'}).get_text().split('。')
text_first = text_list[0] + "。"  # the first sentence
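Note that split('。') discards the delimiter and can leave an empty trailing element. If you want every sentence with its full stop restored (a small extension of my own, not in the original), a list comprehension does it:

sentences = [s + '。' for s in text_list if s]  # re-append the 。 and drop empty fragments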

Refinement of text

Left as-is the text is messy, so we remove the unnecessary elements and refine it. First, the newline codes \r and \n and the full-width space \u3000 are mixed in, so drop them with strip() right after get_text().

text_list = soup.find('div', {'class': 'main_text'}).get_text().strip('\r\n\u3000').split('。')
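One caveat (my addition): strip() only removes these characters from the ends of the string. If \u3000 also appears mid-text, for example as paragraph indentation, a regular expression removes every occurrence:

import re
text = soup.find('div', {'class': 'main_text'}).get_text()
text_list = re.sub(r'[\r\n\u3000]', '', text).split('。')  # drop the characters everywhere, not just at the ends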

The ruby (furigana) readings, which appear enclosed in parentheses in the extracted text, are removed by adding the following immediately after the BeautifulSoup conversion:

for tag in soup.find_all(['rt', 'rp']):
    tag.decompose()  # delete the tags and their contents
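To see what this does (a toy example of my own), ruby markup wraps the reading in rt tags and the fallback parentheses in rp tags, so decomposing both leaves only the base text:

from bs4 import BeautifulSoup

demo = BeautifulSoup('<ruby>水仙<rp>（</rp><rt>すいせん</rt><rp>）</rp></ruby>', 'html.parser')
for tag in demo.find_all(['rt', 'rp']):
    tag.decompose()
print(demo.get_text())  # -> 水仙 (the reading and its parentheses are gone)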

Sometimes the text is not delimited by kuten (。) at all and the "first sentence" runs on endlessly, so if it is too long I replace it with a placeholder string. The length threshold of 100 is arbitrary.

text_first = text_list[0] + "。" if (len(text_list[0]) < 100) else "too long"  # the beginning

Finally, when wrapping the above steps into a function: if no work exists for the given ID, the fetch fails and soup.find() returns None, which causes a NoneType error during extraction. Guard against this by checking whether the tags and classes are present. (There is probably a cleaner way to write this.)

if (soup.find('title') is None) or (soup.find('div') is None) or (soup.find('div', {'class': 'main_text'}) is None):
    return [book_id, '', '', '', '']
else:
    title = soup.find('title').get_text()
    [...]

bookInfo.py


def bookInfo(book_id):
    import requests
    from bs4 import BeautifulSoup

    res = requests.get(f'http://pubserver2.herokuapp.com/api/v0.1/books/{book_id}/content?format=html')

    soup = BeautifulSoup(res.text, 'html.parser')
    for tag in soup.find_all(['rt', 'rp']):
        tag.decompose()  # delete ruby tags and their contents

    if (soup.find('title') is None) or (soup.find('div') is None) or (soup.find('div', {'class': 'main_text'}) is None):
        return [book_id, '', '', '', '']
    else:
        title = soup.find('title').get_text()
        title_list = title.split(' ')
        book_author = title_list[0]  # author name
        book_title = title_list[1]  # title
        book_title_info = len(title_list) > 2  # True if the title split into extra pieces

        text_list = soup.find('div', {'class': 'main_text'}).get_text().strip('\r\n\u3000').split('。')
        text_first = text_list[0] + "。" if (len(text_list[0]) < 100) else "too long"  # the beginning

        result = [book_id, book_author, book_title, book_title_info, text_first]
        print(result)
        return result

bookInfo(930)
# -> [930,
# 'Yumeno Kyusaku',
# 'The miracle of Oshie',
# False,
# 'When I see the nurse's sleeping gap, I run a poor female character, so it's hard to read and it's hard to understand, but I'll forgive you many times as I hurry. Please.']

CSV output of refined data

Now append the lists returned by the function above to build a two-dimensional list, and write it out as a CSV file. First import csv and open the file. By the way, if the file already exists and you want to append after its contents instead of overwriting it, change 'w' to 'a'.

import csv
f = open('output.csv', 'w')
writer = csv.writer(f, lineterminator='\n')

Create an empty list and run a for loop, appending each call's return value to it. Incidentally, the Aozora Bunko API took a few seconds per work; I suspect that sending too many requests could get you cut off, so it seems safer to run it in small batches of around 10 to 100 works.

csvlist = []
for i in range(930, 940):
    csvlist.append(bookInfo(i))
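To be gentle on the API (my addition; the original only advises small batches), a short pause between requests with time.sleep is a simple safeguard:

import time

csvlist = []
for i in range(930, 940):
    csvlist.append(bookInfo(i))
    time.sleep(1)  # wait one second between requests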

Finally, write out the rows, close the file, and you're done.

writer.writerows(csvlist)
f.close()
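As an aside (my preferred form, not the original's), the same write can be done with a with block so the file is closed automatically; newline='' is the csv module's recommended setting:

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(csvlist)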

A CSV file has been output. It is convenient to load it into a Google Spreadsheet.

930,Yumeno Kyusaku,The miracle of Oshie,False,When I see the nurse's sleeping gap, I run a poor female character, so it's hard to read and it's hard to understand, but I'll forgive you many times as I hurry. Please.

Digression

- Originally I wanted to write the values directly into Google Sheets with a GAS function, but the HTML returned by the Aozora Bunko API seems to contain a malformed meta tag, so the XML parsing fails; I tried various workarounds without success. Hence the detour of converting to CSV with Python first.
- When the data finally came out nicely I thought "I got some good dashi (broth) out of this", and then realized that the "soup" in BeautifulSoup is exactly that kind of dashi...
