[Python] Anaphora resolution of sentences with the COTOHA API and saving to a file

This article is day 3 of Kronos Co., Ltd.'s "~Spring 2020~ Solo Advent Calendar"!

Introduction

I wanted to try the COTOHA API (a natural language processing / speech processing API platform), the kind of thing cutting-edge engineers use. There are already plenty of interesting articles about the COTOHA API, so I intended to do something fun and flashy, but this time I ended up just writing a plain article instead!! (contradiction) Think of it as groundwork for something flashy later.

What this article covers / does not cover

**Covered**

- Detailed usage of anaphora (coreference) resolution in the COTOHA API
- Behavior I noticed while experimenting that is not written in the COTOHA API reference
- How to map the JSON returned in an API response onto classes

**Not covered**

- Anything about Python versions before 3.7
- Smart ways of coding (I wrote this as I went)

What is anaphora resolution, and what do I want to do with it?

First, here is a quote from the Official Page about what anaphora resolution is.

A RESTful API that extracts the antecedents (including antecedents consisting of multiple words) of demonstratives such as "there" and "it", pronouns such as "he" and "she", and anaphoric expressions such as "the same 〇〇", and outputs them all as the same entity.

Hmmm, for example? (Further quote)

For example, when analyzing the dialogue log between a dialogue engine and a user, you can extract the word a pronoun points to from the sentence containing the pronoun and the context around it. This makes it possible to replace pronouns such as "he" and "she", which are not very meaningful for log analysis on their own, with the words they refer to, achieving more precise log analysis.

In other words (this is also the official example sentence), when "Taro is a friend. He ate yakiniku." is run through anaphora resolution, "**Taro**" and "**he**" are returned as the same entity.

Having checked this far, I thought: **"If you preprocess text with anaphora resolution before doing flashy things, won't the results of other natural language processing change (get more accurate)?"** So I decided to do what the title says: "Anaphora resolution of sentences with the COTOHA API and saving to a file" (I got sidetracked from the flashy thing I set out to do first). Maybe there is demand for this, too; that was also a factor.

Code

This time, the plan is to:

- Scrape any text from Aozora Bunko
- Send the sentences to COTOHA API anaphora resolution
- Map the JSON response onto classes
- Unify each coreference chain to a single expression and save the result to a file

Consider the implementation: in the earlier example, the input "Taro is a friend. He ate yakiniku." would be saved to a text file as "Taro is a friend. Taro ate yakiniku.".

Overall picture

See here for the full source. It also includes some processing unrelated to anaphora resolution. The folder structure looks like this:

├── aozora_scraping.py
├── config.ini
├── cotoha_function.py
├── json_to_obj.py
├── main.py
├── respobj
│   ├── __init__.py
│   └── coreference.py
└── result
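
For reference, config.ini should look something like the sketch below. The section and key names come from the config.get calls in main.py; the values are placeholders, and the URLs are assumptions based on the COTOHA developer portal (check your own portal page).

config.ini


[COTOHA API]
Developer API Base URL: https://api.ce-cotoha.com/api/dev/
Developer Client id: YOUR_CLIENT_ID
Developer Client secret: YOUR_CLIENT_SECRET
Access Token Publish URL: https://api.ce-cotoha.com/v1/oauth/accesstokens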

Scraping any text in Aozora Bunko

Contents of aozora_scraping.py

aozora_scraping.py


# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

def get_aocora_sentence(aozora_url):
    res = requests.get(aozora_url)
    # Initialize BeautifulSoup
    soup = BeautifulSoup(res.content, 'lxml')
    # Get the main text block of the Aozora Bunko page
    main_text = soup.find("div", class_="main_text")
    # Remove ruby annotations and headings
    for script in main_text(["rp", "rt", "h4"]):
        script.decompose()
    sentences = [line.strip() for line in main_text.text.splitlines()]
    # Remove empty lines
    sentences = [line for line in sentences if line != '']
    return sentences

If you pass an Aozora Bunko URL to the get_aocora_sentence method, it returns the main text of that page as a list, one element per line, with ruby annotations and blank lines removed.

    main_text = soup.find("div", class_="main_text")

This works because the main text of an Aozora Bunko page is wrapped in <div class="main_text"></div>. For how to process Aozora Bunko text, I referred to the following: I tried to extract and illustrate the stage of the story using COTOHA.
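
As a quick sanity check, here is a minimal usage sketch. The URL is the "Kokoro" one that appears in main.py later; the comments describe what I would expect to see, not verified output.


from aozora_scraping import get_aocora_sentence

sentences = get_aocora_sentence('https://www.aozora.gr.jp/cards/000155/files/832_16016.html')
print(len(sentences))  # number of non-empty lines in the main text
print(sentences[0])    # first line of the text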

Send sentences to COTOHA API anaphora resolution

Contents of cotoha_function.py

cotoha_function.py


# -*- coding:utf-8 -*-
import os
import urllib.request
import json
import configparser
import codecs

#COTOHA API operation class
class CotohaApi:
    #Initialization
    def __init__(self, client_id, client_secret, developer_api_base_url, access_token_publish_url):
        self.client_id = client_id
        self.client_secret = client_secret
        self.developer_api_base_url = developer_api_base_url
        self.access_token_publish_url = access_token_publish_url
        self.getAccessToken()

    #Get access token
    def getAccessToken(self):
        # Specify the access token acquisition URL
        url = self.access_token_publish_url

        #Header specification
        headers={
            "Content-Type": "application/json;charset=UTF-8"
        }

        #Request body specification
        data = {
            "grantType": "client_credentials",
            "clientId": self.client_id,
            "clientSecret": self.client_secret
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()

        #Request generation
        req = urllib.request.Request(url, data, headers)

        #Send a request and receive a response
        res = urllib.request.urlopen(req)

        #Get response body
        res_body = res.read()

        #Decode the response body from JSON
        res_body = json.loads(res_body)

        #Get an access token from the response body
        self.access_token = res_body["access_token"]

    # Anaphora resolution API
    def coreference(self, document):
        # Specify the anaphora resolution API URL
        url = self.developer_api_base_url + "v1/coreference"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "document": document
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, reacquire the access token and retry once
            if e.code == 401:
                print("Reacquiring access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, show the cause and re-raise
            else:
                print("<Error> " + e.reason)
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body

The functions for using the COTOHA API were borrowed from I tried using the COTOHA API, rumored to make natural language processing easy to handle, in Python. Note, however, that the URL for anaphora resolution has changed from beta/coreference to v1/coreference. (The current version may change again someday, too.)
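
A minimal call sketch, assuming CLIENT_ID and the URLs have already been read from config.ini as in main.py below (the two English sentences stand in for the official Japanese example):


import cotoha_function as cotoha

cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)
# main.py below always passes a list of sentences as "document"
response = cotoha_api.coreference(["Taro is a friend.", "He ate yakiniku."])
# The return value is the decoded response dict
print(response["result"]["coreference"][0]["referents"])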

The first half of main.py, which feeds the sentences to anaphora resolution, is as follows. (I show it as-is because there are parts that need explaining.)

main.py


# -*- coding:utf-8 -*-
import os
import json
import configparser
import datetime
import codecs
import cotoha_function as cotoha
from aozora_scraping import get_aocora_sentence
from respobj.coreference import Coreference
from json_to_obj import json_to_coreference

if __name__ == '__main__':
    #Get the location of the source file
    APP_ROOT = os.path.dirname(os.path.abspath( __file__)) + "/"

    # Read configuration values
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")

    # Constants
    max_word = 1800
    max_call_api_count = 150
    max_elements_count = 20
    #URL of Aozora Bunko
    aozora_html = 'Any'
    #Current time
    now_date = datetime.datetime.today().strftime("%Y%m%d%H%M%S")
    #The path of the file to save the original text
    origin_txt_path = './result/origin_' + now_date + '.txt'
    #Path of the file to save the result
    result_txt_path = './result/converted_' + now_date + '.txt'

    #COTOHA API instantiation
    cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)

    #Get text from Aozora Bunko
    sentences = get_aocora_sentence(aozora_html)
    #Save original text for comparison
    with open(origin_txt_path, mode='a') as f:
        for sentence in sentences:
            f.write(sentence + '\n')

    # Initial values
    start_index = 0
    end_index = 0
    call_api_count = 1
    temp_sentences = sentences[start_index:end_index]
    elements_count = end_index - start_index
    limit_index = len(sentences)
    result = []
    print("Total number of list elements: " + str(limit_index))
    while(end_index <= limit_index and call_api_count <= max_call_api_count):
        length_sentences = len(''.join(temp_sentences))
        if(length_sentences < max_word and elements_count < max_elements_count and end_index < limit_index):
            end_index += 1
        else:
            if end_index == limit_index:
                input_sentences = sentences[start_index:end_index]
                print('index: ' + str(start_index) + ' to ' + str(end_index))
                # Termination condition
                end_index += 1
            else:
                input_sentences = sentences[start_index:end_index - 1]
                print('index: ' + str(start_index) + ' to ' + str(end_index - 1))
            print('API call #' + str(call_api_count))
            response = cotoha_api.coreference(input_sentences)
            result.append(json_to_coreference(response))
            call_api_count += 1
            start_index = end_index - 1
        temp_sentences = sentences[start_index:end_index]
        elements_count = end_index - start_index

To begin with, you cannot send the entire text in a single request (which is perhaps only natural).

Behavior discovered while experimenting

I couldn't find any mention of these limits in the reference:

- The maximum text length is around 1,800 characters (per I tried to extract and illustrate the stage of the story using COTOHA)
- The list must contain fewer than 20 elements

(This is unverified, but I suspect the other COTOHA APIs have the same or similar limits.) The if statement inside the while loop implements "pack as much line-broken text into the list as possible and throw it at anaphora resolution".

I was particular about keeping the list of lines, rather than using the raw text length alone, because I thought accuracy might suffer if anaphora resolution were not run on line breaks in the text. (A guess, unverified.)

As for call_api_count <= max_call_api_count: the free plan allows 1,000 calls per API per day, so this is a crude guard to keep the number of API calls under control.
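
For reference, here is the same packing idea as a standalone generator. This is a sketch with my own naming (batch_sentences does not exist in the repository), using the two limits observed above:


def batch_sentences(sentences, max_word=1800, max_elements=20):
    """Yield lists of lines that stay under both observed API limits."""
    batch = []
    for sentence in sentences:
        # Start a new batch if adding this line would break either limit
        if batch and (len(''.join(batch)) + len(sentence) >= max_word
                      or len(batch) + 1 >= max_elements):
            yield batch
            batch = []
        batch.append(sentence)
    if batch:
        yield batch

# Usage: one API call per batch
# for input_sentences in batch_sentences(sentences):
#     response = cotoha_api.coreference(input_sentences)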

Map the JSON response onto classes

I think this is a matter of taste, but isn't it easier to map the API response onto classes than to keep using it as a raw dictionary? That is my pitch, anyway.

In the case of the COTOHA API, the raw-dictionary approach seems to be the majority in Qiita articles, so I'll post this as one reference for anaphora resolution.

First, let's look at the official example of the JSON format the anaphora resolution response comes in. (As usual, the input is "Taro is a friend. He ate yakiniku.")

coreference.json


{
  "result" : {
    "coreference" : [ {
      "representative_id" : 0,
      "referents" : [ {
        "referent_id" : 0,
        "sentence_id" : 0,
        "token_id_from" : 0,
        "token_id_to" : 0,
        "form" : "Taro"
      }, {
        "referent_id" : 1,
        "sentence_id" : 1,
        "token_id_from" : 0,
        "token_id_to" : 0,
        "form" : "he"
      } ]
    } ],
    "tokens" : [ [ "Taro", "Is", "friend", "is" ], [ "he", "Is", "Roasted meat", "To", "eat", "Ta" ] ]
  },
  "status" : 0,
  "message" : "OK"
}

The class definitions this can be mapped onto are as follows. You can work them out by staring at the Official Reference. Define them starting from the deepest level of the JSON hierarchy.

respobj/coreference.py


# -*- coding: utf-8; -*-
from dataclasses import dataclass, field
from typing import List

# Referent (entity) object
@dataclass
class Referent:
    # Referent ID
    referent_id: int
    # Index of the sentence that contains the referent
    sentence_id: int
    # Index of the referent's first morpheme
    token_id_from: int
    # Index of the referent's last morpheme
    token_id_to: int
    # Surface form of the referent
    form: str

# Coreference chain object
@dataclass
class Representative:
    # Coreference chain ID
    representative_id: int
    # Array of referent objects
    referents: List[Referent] = field(default_factory=list)

# Anaphora resolution result object
@dataclass
class Result:
    # Array of coreference chain objects
    coreference: List[Representative] = field(default_factory=list)
    # Token arrays from the morphological analysis of each input sentence
    tokens: List[List[str]] = field(default_factory=list)

# Response
@dataclass
class Coreference:
    # Anaphora resolution result object
    result: Result
    # Status code 0:OK, >0:error
    status: int
    # Error message
    message: str

Where I got stuck: for some reason, the form field of the Referent class is not explained in the Official Reference, and it took me a while to notice that the tokens field of the Result class is List[List[str]].

The method that maps the JSON onto the classes (json_to_coreference is also used in main.py):

json_to_obj.py


# -*- coding:utf-8 -*-
import json
import codecs
import marshmallow_dataclass
from respobj.coreference import Coreference

def json_to_coreference(jsonstr):
    # Re-encode the response dict as JSON, then undo unicode escapes
    json_formated = codecs.decode(json.dumps(jsonstr), 'unicode-escape')
    # Generate a marshmallow schema from the dataclass and load the JSON into it
    result = marshmallow_dataclass.class_schema(Coreference)().loads(json_formated)
    return result

It is implemented with dataclasses and marshmallow_dataclass. marshmallow_dataclass is often not installed by default (PyPI Page).

This is the main reason Python 3.7 is required this time. I recommend this approach because even if the API specification changes, the affected parts are easy to spot and quick to fix. (That may just be because I'm not used to Python dictionaries, so take it as one option.)

Reference site: JSONize Python classes
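
A minimal usage sketch of json_to_coreference (the sample dict is the official JSON from earlier; install the library with pip install marshmallow-dataclass first):


from json_to_obj import json_to_coreference

# The official sample response from earlier, as a Python dict
sample_response = {
    "result": {
        "coreference": [{
            "representative_id": 0,
            "referents": [
                {"referent_id": 0, "sentence_id": 0, "token_id_from": 0,
                 "token_id_to": 0, "form": "Taro"},
                {"referent_id": 1, "sentence_id": 1, "token_id_from": 0,
                 "token_id_to": 0, "form": "he"},
            ],
        }],
        "tokens": [["Taro", "Is", "friend", "is"],
                   ["he", "Is", "Roasted meat", "To", "eat", "Ta"]],
    },
    "status": 0,
    "message": "OK",
}

obj = json_to_coreference(sample_response)
print(obj.result.coreference[0].referents[0].form)  # -> Taro
print(obj.result.tokens[1])                         # -> the second sentence's tokens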

Unify each coreference chain to a single expression and save the result to a file

The question here is which expression to unify each chain on. This time, on the prediction that whatever appears first in the text is the main form ~~because it's easy~~, I unify on the expression that appeared first.

main.py


    #Second half
    for obj in result:
        coreferences = obj.result.coreference
        tokens = obj.result.tokens
        for coreference in coreferences:
            anaphor = []
            # Use the first expression in the coreference chain as the base
            anaphor.append(coreference.referents[0].form)
            for referent in coreference.referents:
                sentence_id = referent.sentence_id
                token_id_from = referent.token_id_from
                token_id_to = referent.token_id_to
                #Rewrite so that the number of elements in list is not changed for subsequent processing.
                anaphor_and_empty = anaphor + ['']*(token_id_to - token_id_from)
                tokens[sentence_id][token_id_from: (token_id_to + 1)] = anaphor_and_empty
        #Save the modified text to a file
        with open(result_txt_path, mode='a') as f:
            for token in tokens:
                line = ''.join(token)
                f.write(line + '\n')

sentence_id indicates which sentence (element) in tokens the referent belongs to, and token_id_from and token_id_to mean that the token_id_from-th through token_id_to-th morphemes of that sentence make up the anaphoric expression.

The expression used for rewriting is obtained with coreference.referents[0].form, and when rewriting,

                #Rewrite so that the number of elements in list is not changed for subsequent processing.
                anaphor_and_empty = anaphor + ['']*(token_id_to - token_id_from)
                tokens[sentence_id][token_id_from: (token_id_to + 1)] = anaphor_and_empty

I apply a small trick like the above (forcibly matching the number of replacement elements to the number of elements replaced). Without it, subsequent token_id_from and token_id_to values would point at the wrong positions. (Please tell me if there is a better way.)
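
Here is a tiny standalone illustration of that padding trick, with hypothetical index values (a two-morpheme span is replaced by a single word):


# Tokens of the second sentence from the sample response
tokens = [["he", "Is", "Roasted meat", "To", "eat", "Ta"]]
sentence_id, token_id_from, token_id_to = 0, 2, 3  # hypothetical two-morpheme span
anaphor = ["Taro"]

# Pad the replacement so the list keeps the same number of elements
anaphor_and_empty = anaphor + [''] * (token_id_to - token_id_from)
tokens[sentence_id][token_id_from:(token_id_to + 1)] = anaphor_and_empty

print(tokens[0])       # ['he', 'Is', 'Taro', '', 'eat', 'Ta']
print(len(tokens[0]))  # still 6, so later token_id_from/token_id_to stay valid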

The whole picture of main.py

main.py


# -*- coding:utf-8 -*-
import os
import json
import configparser
import datetime
import codecs
import cotoha_function as cotoha
from aozora_scraping import get_aocora_sentence
from respobj.coreference import Coreference
from json_to_obj import json_to_coreference

if __name__ == '__main__':
    #Get the location of the source file
    APP_ROOT = os.path.dirname(os.path.abspath( __file__)) + "/"

    # Read configuration values
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")

    # Constants
    max_word = 1800
    max_call_api_count = 150
    max_elements_count = 20
    #URL of Aozora Bunko
    aozora_html = 'https://www.aozora.gr.jp/cards/000155/files/832_16016.html'
    #Current time
    now_date = datetime.datetime.today().strftime("%Y%m%d%H%M%S")
    #The path of the file to save the original text
    origin_txt_path = './result/origin_' + now_date + '.txt'
    #Path of the file to save the result
    result_txt_path = './result/converted_' + now_date + '.txt'

    #COTOHA API instantiation
    cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)

    #Get text from Aozora Bunko
    sentences = get_aocora_sentence(aozora_html)
    #Save original text for comparison
    with open(origin_txt_path, mode='a') as f:
        for sentence in sentences:
            f.write(sentence + '\n')

    # Initial values
    start_index = 0
    end_index = 0
    call_api_count = 1
    temp_sentences = sentences[start_index:end_index]
    elements_count = end_index - start_index
    limit_index = len(sentences)
    result = []
    print("Total number of list elements: " + str(limit_index))
    while(end_index <= limit_index and call_api_count <= max_call_api_count):
        length_sentences = len(''.join(temp_sentences))
        if(length_sentences < max_word and elements_count < max_elements_count and end_index < limit_index):
            end_index += 1
        else:
            if end_index == limit_index:
                input_sentences = sentences[start_index:end_index]
                print('index: ' + str(start_index) + ' to ' + str(end_index))
                # Termination condition
                end_index += 1
            else:
                input_sentences = sentences[start_index:end_index - 1]
                print('index: ' + str(start_index) + ' to ' + str(end_index - 1))
            print('API call #' + str(call_api_count))
            response = cotoha_api.coreference(input_sentences)
            result.append(json_to_coreference(response))
            call_api_count += 1
            start_index = end_index - 1
        temp_sentences = sentences[start_index:end_index]
        elements_count = end_index - start_index
    
    for obj in result:
        coreferences = obj.result.coreference
        tokens = obj.result.tokens
        for coreference in coreferences:
            anaphor = []
            # Use the first expression in the coreference chain as the base
            anaphor.append(coreference.referents[0].form)
            for referent in coreference.referents:
                sentence_id = referent.sentence_id
                token_id_from = referent.token_id_from
                token_id_to = referent.token_id_to
                #Rewrite so that the number of elements in list is not changed for subsequent processing.
                anaphor_and_empty = anaphor + ['']*(token_id_to - token_id_from)
                tokens[sentence_id][token_id_from: (token_id_to + 1)] = anaphor_and_empty
        #Save the modified text to a file
        with open(result_txt_path, mode='a') as f:
            for token in tokens:
                line = ''.join(token)
                f.write(line + '\n')

Processing result

Natsume Soseki, "Kokoro"

Image comparing the original text and part of the converted text in FileMerge (original on the left, converted on the right): [screenshot]

Amazing COTOHA API

before:

I had a friend from Kagoshima and learned naturally while imitating that person, so I was good at playing this turf flute.
As I continued to blow it, the teacher looked away and walked away.

↓ after:

I had a friend from Kagoshima and learned naturally while imitating a person from Kagoshima, so I was good at playing this turf flute.
As I continued to blow this turf flute, the teacher looked away and walked away.

"That person" has become "a person from Kagoshima", and "as I continued to blow it" has become "as I continued to blow this turf flute".

How about this? COTOHA API

before:

I immediately went to return the money to the teacher. I also brought the shiitake mushrooms with me.
~About 6 sentences omitted~
The teacher knew a lot of things I didn't know about kidney disease.
"The characteristic of the illness is that you are sick by yourself, but you don't notice it.
An officer I knew was finally killed by it, but he died like a lie.
~

after:

I immediately went to return the money to the teacher. I also brought the shiitake mushrooms with me.
~About 6 sentences omitted~
The teacher knew a lot of things I didn't know about kidney disease.
"The characteristic of illness is that you are sick by yourself, but you don't notice it.
An officer I knew was finally killed by the shiitake mushrooms, but he died like a lie.
~

An example where "it" was resolved to "the shiitake mushrooms"... There are many other painful cases like this. Symbols are not handled well, either.

Bonus

Edogawa Ranpo, "The Fiend with Twenty Faces"

[screenshot]

Ki no Tsurayuki "Tosa Nikki"

I didn't really expect classical texts to be resolved much; this one is half a joke (included just because the result is amusing).

[screenshot]

Both are the original on the left and the converted one on the right.

Finally

In the end, I can't really say from these results whether the assumption **"If you preprocess text with anaphora resolution before doing flashy things, won't the results of other natural language processing change (get more accurate)?"** holds. I think more verification is needed.

To improve conversion accuracy, it seems better to think more carefully about which expression each chain should be unified on, and taking sentence structure into account may also be important.

Japanese is a parade of demonstratives and pronouns, so perfection looks difficult, but some passages convert quite smoothly, and I feel the COTOHA API has considerable potential. That's all!
