[Python] I tried to extract and illustrate the setting of a story using COTOHA

Overview

I'm the type of reader who will pick up a story just because it is set in my local area. So I tried to extract and illustrate the setting of a story using COTOHA's named entity recognition.

Process flow

  1. Get any novel from Aozora Bunko
  2. Extract only the place names (LOC) with the COTOHA API (named entity recognition)
  3. Illustrate the extracted place names with WordCloud

Environment

Google Colaboratory

Library

wordcloud is the only library that needs to be installed separately. (Google Colaboratory is convenient because many libraries come preinstalled.) Run the following commands to complete the setup.

Install wordcloud & download fonts


!pip install wordcloud
!apt-get -y install fonts-ipafont-gothic
!rm /root/.cache/matplotlib/fontlist-v300.json

Clone Aozora Bunko


!git clone --branch master --depth 1 https://github.com/aozorabunko/aozorabunko.git

Code

1. Getting an arbitrary novel from Aozora Bunko


from bs4 import BeautifulSoup

def get_word():

  #Specify the path from the cloned html(The sample is Osamu Dazai's Good Bye)
  path_to_html='aozorabunko/cards/000035/files/258_20179.html'
  
  #HTML parsing with BeautifulSoup
  with open(path_to_html, 'rb') as html:
    soup = BeautifulSoup(html, 'lxml')
  main_text = soup.find("div", class_='main_text')
  for yomigana in main_text.find_all(["rp","h4","rt"]):
    yomigana.decompose()
  sentences = [line.strip() for line in main_text.text.strip().splitlines()]
  aozora_text=','.join(sentences)

  #Split by number of characters for cotoha api call(Every 1800 characters)
  aozora_text_list = [aozora_text[i: i+1800] for i in range(0, len(aozora_text), 1800)]
  return aozora_text_list
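
This function reads the full text of a novel by specifying its path (path_to_html) inside the Aozora Bunko repository cloned from Git, and parses it with Beautiful Soup. Ruby annotations (rp, rt) and h4 headings are removed before the text is extracted. (The sample is Osamu Dazai's Good Bye.)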

In addition, the text is split into chunks of 1,800 characters so that each chunk can be sent to COTOHA. (I haven't checked the actual limit properly, but 2,000 characters failed and 1,800 characters worked ... ~~I should look it up properly~~)
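
One side effect of slicing at fixed positions is that a place name can be cut in half at a chunk boundary. As a rough sketch of an alternative (not what the code above does), the text could be packed into chunks along sentence boundaries instead. The function name `split_by_sentence` and the choice of "。" as the delimiter are assumptions made for this illustration, and the 1,800-character default is only the value that happened to work above.

```python
def split_by_sentence(text, max_len=1800, delimiter="。"):
    """Greedily pack whole sentences into chunks of at most max_len characters.

    Hypothetical helper for illustration; max_len=1800 mirrors the value used
    in this article, not a documented API limit.
    """
    # Split while keeping the delimiter attached to each sentence.
    parts = text.split(delimiter)
    pieces = [p + delimiter for p in parts[:-1]]
    if parts[-1]:
        pieces.append(parts[-1])

    chunks = []
    current = ""
    for piece in pieces:
        # Fallback: a single sentence longer than max_len is hard-split.
        while len(piece) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_len])
            piece = piece[max_len:]
        # Start a new chunk when the next sentence would not fit.
        if current and len(current) + len(piece) > max_len:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks gives back the original text, so it could be used in place of the list comprehension in get_word() if cutting words at chunk boundaries turns out to matter.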


2. Calling the COTOHA API


import urllib.request
import json
import time


client_id = "Your client ID"
client_secret = "Your own secret key"

developer_api_base_url = "https://api.ce-cotoha.com/api/dev/nlp/"
access_token_publish_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"

def cotoha_call(sentence):
    #Get access token
    def getAccessToken():     
        url = access_token_publish_url
        headers={
            "Content-Type": "application/json;charset=UTF-8"
        }
        data = {
            "grantType": "client_credentials",
            "clientId": client_id,
            "clientSecret": client_secret
        }
        data = json.dumps(data).encode()
        req = urllib.request.Request(url, data, headers)
        res = urllib.request.urlopen(req)
        res_body = res.read()
        res_body = json.loads(res_body)
        access_token = res_body["access_token"]
        return access_token

    #API URL specification(Named entity recognition)
    base_url_footer = "v1/ne" 
    url = developer_api_base_url + base_url_footer
    headers={
        "Authorization": "Bearer " + getAccessToken(), #access_token,
        "Content-Type": "application/json;charset=UTF-8",
    }
    data = {
        "sentence": sentence
    }
    data = json.dumps(data).encode()
    time.sleep(0.5)
    req = urllib.request.Request(url, data, headers)
        
    try:
        res = urllib.request.urlopen(req)
    #What to do if an error occurs in the request
    except urllib.request.HTTPError as e:
        #If the status code is 401 Unauthorized or 500 Internal Server Error, reacquire the access token and request again.
        #("e.code == 401 or 500" would always be true, so test membership instead)
        if e.code in (401, 500):
            access_token = getAccessToken()
            headers["Authorization"] = "Bearer " + access_token
            time.sleep(0.5)
            req = urllib.request.Request(url, data, headers)
            res = urllib.request.urlopen(req)
        #For errors other than 401 or 500, show the cause and re-raise, because there is no response to read
        else:
            print("<Error> " + str(e.reason))
            raise

    res_body = res.read()
    res_body = json.loads(res_body)
    return res_body

This is the part that calls COTOHA (named entity recognition). The request is retried only when the error is 401 or 500.
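
The retry rule can also be separated from the request itself. The following is a minimal sketch under assumptions, not the code used in this article: `call_with_retry`, `request_fn`, and `refresh_token_fn` are hypothetical names, where `request_fn(token)` performs one request and `refresh_token_fn()` returns a fresh access token.

```python
import urllib.error


def call_with_retry(request_fn, refresh_token_fn, retry_codes=(401, 500), max_retries=1):
    """Call request_fn(token); on an HTTP error listed in retry_codes,
    get a new token and try again, up to max_retries times.

    Hypothetical helper for illustration only.
    """
    token = refresh_token_fn()
    attempts = 0
    while True:
        try:
            return request_fn(token)
        except urllib.error.HTTPError as e:
            # Any other status code, or too many attempts: give up and
            # let the caller see the original error.
            if e.code not in retry_codes or attempts >= max_retries:
                raise
            attempts += 1
            token = refresh_token_fn()
```

Written this way, the set of retryable status codes is checked with `in`, and an error that is not retried always reaches the caller instead of being silently skipped.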


3. Drawing the place names with WordCloud


from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image

def get_wordcrowd_mask(text):
  
  #Japanese font specification
  f_path = '/usr/share/fonts/opentype/ipafont-gothic/ipagp.ttf'

  #wc parameter specification
  wc = WordCloud(background_color="white",
                  width=500,
                  height=500,
                  font_path=f_path,
                  collocations=False,
                  ).generate( text )

  #Screen depiction
  plt.figure(figsize=(5,5), dpi=200)
  plt.imshow(wc, interpolation="bilinear")
  plt.axis("off")
  plt.show()

This is the part that illustrates the text using WordCloud.


4. Main

The part that executes the process


aozora_text_list = get_word()
json_list = []
loc_str = ''
cnt = 0
for i in aozora_text_list:
  cnt+=1
  print( str(cnt) + '/' + str(len(aozora_text_list)) )
  json_list.append(cotoha_call(i))


for i in json_list:
  for j in i['result']:
    if(j['class'] == 'LOC'):
      loc_str = loc_str + j['form'] + ","

get_wordcrowd_mask(loc_str)

This part simply runs steps 1 to 3 in order. The progress of the API calls is printed as shown below, in the form n / N, where N is the number of chunks the text was split into.


1/9
2/9
3/9
4/9
5/9
6/9
7/9
8/9
9/9
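
The second loop in the main part collects every entity whose class is LOC into one comma-separated string. As a rough sketch of the same filtering step, the place names could instead be counted with collections.Counter, which makes it easy to check which names appear most often. The function name `count_locations` and the sample responses are made up for illustration; only the keys 'result', 'class', and 'form' are taken from the code above.

```python
from collections import Counter


def count_locations(responses, target_class="LOC"):
    """Count how often each entity of target_class appears in the responses.

    Hypothetical helper; each response is assumed to be a dict shaped like
    the ones the main part reads: {'result': [{'class': ..., 'form': ...}]}.
    """
    counts = Counter()
    for response in responses:
        for entity in response.get("result", []):
            if entity.get("class") == target_class:
                counts[entity["form"]] += 1
    return counts


# Made-up sample responses, not real API output.
sample = [
    {"result": [{"class": "LOC", "form": "東京"}, {"class": "PSN", "form": "田島"}]},
    {"result": [{"class": "LOC", "form": "東京"}, {"class": "LOC", "form": "新宿"}]},
]
print(count_locations(sample).most_common())
```

A mapping of word to count like this should also be usable with WordCloud's generate_from_frequencies in place of generate, though that variation was not tried in this article.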

Output result

・ Good Bye (Osamu Dazai)

goodbye.png

Some of the extracted strings do not look like place names at all, but on the whole the place names seem to be extracted reasonably well. Below are the results for some other works.

・ The Setting Sun (Osamu Dazai)

syayou.png

・ Night on the Galactic Railroad (Kenji Miyazawa)

gingatetudou.png

・ Lemon (Motojiro Kajii)

REMON.png

It's fun...

Summary

Both COTOHA and Colab are free to use, so together they make a great environment for easily trying out language processing!

That's all, thank you for reading!

Finally, a quiz: which work is the image below from? (I'll stop, because this is getting annoying ...)

gongitune.png
