[Python] Anaphora resolution of sentences with the COTOHA API and saving to a file

This article is day 3 of Kronos Co., Ltd.'s "~Spring 2020~ Solo Advent Calendar"!

Introduction

I wanted to try the COTOHA API (a natural language processing / speech processing API platform), the kind of thing cutting-edge engineers use. There are already plenty of interesting articles about the COTOHA API, so I intended to do something fun and flashy, but this time I ended up just writing a plain article instead!! (contradiction) Think of it as groundwork for something flashy later.

What this article covers / does not cover

**Covered**

- Detailed usage of anaphora (coreference) resolution in the COTOHA API
- Behavior I noticed while experimenting that is not written in the COTOHA API reference
- How to map the JSON returned in an API response onto classes

**Not covered**

- Anything about Python versions before 3.7
- Smart ways of coding (I wrote this as I went)

What is anaphora resolution, and what do I want to do with it?

First, here is a quote from the Official Page about what anaphora resolution is.

A RESTful API that extracts the antecedents (including antecedents consisting of multiple words) of demonstratives such as "there" and "it", pronouns such as "he" and "she", and anaphoric expressions such as "the same 〇〇", and outputs them all as the same entity.

Hmmm, for example? (Further quote)

For example, when analyzing the dialogue log between a dialogue engine and a user, you can extract the word a pronoun points to from the sentence containing the pronoun and the context around it. This makes it possible to replace pronouns such as "he" and "she", which are not very meaningful for log analysis on their own, with the words they refer to, achieving more precise log analysis.

In other words (this is also the official example sentence), when "Taro is a friend. He ate yakiniku." is run through anaphora resolution, "**Taro**" and "**he**" are returned as the same entity.

Having checked this far, I thought: **"If you preprocess text with anaphora resolution before doing flashy things, won't the results of other natural language processing change (get more accurate)?"** So I decided to do what the title says: "Anaphora resolution of sentences with the COTOHA API and saving to a file" (I got sidetracked from the flashy thing I set out to do first). Maybe there is demand for this, too; that was also a factor.

Code

This time, the plan is to:

- Scrape any text from Aozora Bunko
- Send the sentences to COTOHA API anaphora resolution
- Map the JSON response onto classes
- Unify each coreference chain to a single expression and save the result to a file

Consider the implementation: in the earlier example, the input "Taro is a friend. He ate yakiniku." would be saved to a text file as "Taro is a friend. Taro ate yakiniku.".

Overall picture

See here for the full source. It also includes some processing unrelated to anaphora resolution. The folder structure looks like this:

├── aozora_scraping.py
├── config.ini
├── cotoha_function.py
├── json_to_obj.py
├── main.py
├── respobj
│   ├── __init__.py
│   └── coreference.py
└── result
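
For reference, config.ini should look something like the sketch below. The section and key names come from the config.get calls in main.py; the values are placeholders, and the URLs are assumptions based on the COTOHA developer portal (check your own portal page).

config.ini


[COTOHA API]
Developer API Base URL: https://api.ce-cotoha.com/api/dev/
Developer Client id: YOUR_CLIENT_ID
Developer Client secret: YOUR_CLIENT_SECRET
Access Token Publish URL: https://api.ce-cotoha.com/v1/oauth/accesstokens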

Scraping any text in Aozora Bunko

Contents of aozora_scraping.py

aozora_scraping.py


# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

def get_aocora_sentence(aozora_url):
    res = requests.get(aozora_url)
    # Initialize BeautifulSoup
    soup = BeautifulSoup(res.content, 'lxml')
    # Get the main text block of the Aozora Bunko page
    main_text = soup.find("div", class_="main_text")
    # Remove ruby annotations and headings
    for script in main_text(["rp", "rt", "h4"]):
        script.decompose()
    sentences = [line.strip() for line in main_text.text.splitlines()]
    # Remove empty lines
    sentences = [line for line in sentences if line != '']
    return sentences

If you pass an Aozora Bunko URL to the get_aocora_sentence method, it returns the main text of that page as a list, one element per line, with ruby annotations and blank lines removed.

    main_text = soup.find("div", class_="main_text")

This works because the main text of an Aozora Bunko page is wrapped in <div class="main_text"></div>. For how to process Aozora Bunko text, I referred to the following: I tried to extract and illustrate the stage of the story using COTOHA.
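
As a quick sanity check, here is a minimal usage sketch. The URL is the "Kokoro" one that appears in main.py later; the comments describe what I would expect to see, not verified output.


from aozora_scraping import get_aocora_sentence

sentences = get_aocora_sentence('https://www.aozora.gr.jp/cards/000155/files/832_16016.html')
print(len(sentences))  # number of non-empty lines in the main text
print(sentences[0])    # first line of the text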

Send sentences to COTOHA API anaphora resolution

Contents of cotoha_function.py

cotoha_function.py


# -*- coding:utf-8 -*-
import os
import urllib.request
import json
import configparser
import codecs

#COTOHA API operation class
class CotohaApi:
    #Initialization
    def __init__(self, client_id, client_secret, developer_api_base_url, access_token_publish_url):
        self.client_id = client_id
        self.client_secret = client_secret
        self.developer_api_base_url = developer_api_base_url
        self.access_token_publish_url = access_token_publish_url
        self.getAccessToken()

    #Get access token
    def getAccessToken(self):
        # Specify the access token acquisition URL
        url = self.access_token_publish_url

        #Header specification
        headers={
            "Content-Type": "application/json;charset=UTF-8"
        }

        #Request body specification
        data = {
            "grantType": "client_credentials",
            "clientId": self.client_id,
            "clientSecret": self.client_secret
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()

        #Request generation
        req = urllib.request.Request(url, data, headers)

        #Send a request and receive a response
        res = urllib.request.urlopen(req)

        #Get response body
        res_body = res.read()

        #Decode the response body from JSON
        res_body = json.loads(res_body)

        #Get an access token from the response body
        self.access_token = res_body["access_token"]

    # Anaphora resolution API
    def coreference(self, document):
        # Specify the anaphora resolution API URL
        url = self.developer_api_base_url + "v1/coreference"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "document": document
        }
        # Encode the request body as JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        # Send the request and receive the response
        try:
            res = urllib.request.urlopen(req)
        # Handle request errors
        except urllib.request.HTTPError as e:
            # On 401 Unauthorized, reacquire the access token and retry once
            if e.code == 401:
                print("Reacquiring access token")
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            # For errors other than 401, show the cause and re-raise
            else:
                print("<Error> " + e.reason)
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body

The functions for using the COTOHA API were borrowed from I tried using the COTOHA API, rumored to make natural language processing easy to handle, in Python. Note, however, that the URL for anaphora resolution has changed from beta/coreference to v1/coreference. (The current version may change again someday, too.)
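
A minimal call sketch, assuming CLIENT_ID and the URLs have already been read from config.ini as in main.py below (the two English sentences stand in for the official Japanese example):


import cotoha_function as cotoha

cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)
# main.py below always passes a list of sentences as "document"
response = cotoha_api.coreference(["Taro is a friend.", "He ate yakiniku."])
# The return value is the decoded response dict
print(response["result"]["coreference"][0]["referents"])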

The first half of main.py, which feeds the sentences to anaphora resolution, is as follows. (I show it as-is because there are parts that need explaining.)

main.py


# -*- coding:utf-8 -*-
import os
import json
import configparser
import datetime
import codecs
import cotoha_function as cotoha
from aozora_scraping import get_aocora_sentence
from respobj.coreference import Coreference
from json_to_obj import json_to_coreference

if __name__ == '__main__':
    #Get the location of the source file
    APP_ROOT = os.path.dirname(os.path.abspath( __file__)) + "/"

    # Read configuration values
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")

    # Constants
    max_word = 1800
    max_call_api_count = 150
    max_elements_count = 20
    #URL of Aozora Bunko
    aozora_html = 'Any'
    #Current time
    now_date = datetime.datetime.today().strftime("%Y%m%d%H%M%S")
    #The path of the file to save the original text
    origin_txt_path = './result/origin_' + now_date + '.txt'
    #Path of the file to save the result
    result_txt_path = './result/converted_' + now_date + '.txt'

    #COTOHA API instantiation
    cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)

    #Get text from Aozora Bunko
    sentences = get_aocora_sentence(aozora_html)
    #Save original text for comparison
    with open(origin_txt_path, mode='a') as f:
        for sentence in sentences:
            f.write(sentence + '\n')

    # Initial values
    start_index = 0
    end_index = 0
    call_api_count = 1
    temp_sentences = sentences[start_index:end_index]
    elements_count = end_index - start_index
    limit_index = len(sentences)
    result = []
    print("Total number of list elements: " + str(limit_index))
    while(end_index <= limit_index and call_api_count <= max_call_api_count):
        length_sentences = len(''.join(temp_sentences))
        if(length_sentences < max_word and elements_count < max_elements_count and end_index < limit_index):
            end_index += 1
        else:
            if end_index == limit_index:
                input_sentences = sentences[start_index:end_index]
                print('index: ' + str(start_index) + ' to ' + str(end_index))
                # Termination condition
                end_index += 1
            else:
                input_sentences = sentences[start_index:end_index - 1]
                print('index: ' + str(start_index) + ' to ' + str(end_index - 1))
            print('API call #' + str(call_api_count))
            response = cotoha_api.coreference(input_sentences)
            result.append(json_to_coreference(response))
            call_api_count += 1
            start_index = end_index - 1
        temp_sentences = sentences[start_index:end_index]
        elements_count = end_index - start_index

To begin with, you cannot send the entire text in a single request (which is perhaps only natural).

Behavior discovered while experimenting

I couldn't find any mention of these limits in the reference:

- The maximum text length is around 1,800 characters (per I tried to extract and illustrate the stage of the story using COTOHA)
- The list must contain fewer than 20 elements

(This is unverified, but I suspect the other COTOHA APIs have the same or similar limits.) The if statement inside the while loop implements "pack as much line-broken text into the list as possible and throw it at anaphora resolution".

I was particular about keeping the list of lines, rather than using the raw text length alone, because I thought accuracy might suffer if anaphora resolution were not run on line breaks in the text. (A guess, unverified.)

As for call_api_count <= max_call_api_count: the free plan allows 1,000 calls per API per day, so this is a crude guard to keep the number of API calls under control.
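
For reference, here is the same packing idea as a standalone generator. This is a sketch with my own naming (batch_sentences does not exist in the repository), using the two limits observed above:


def batch_sentences(sentences, max_word=1800, max_elements=20):
    """Yield lists of lines that stay under both observed API limits."""
    batch = []
    for sentence in sentences:
        # Start a new batch if adding this line would break either limit
        if batch and (len(''.join(batch)) + len(sentence) >= max_word
                      or len(batch) + 1 >= max_elements):
            yield batch
            batch = []
        batch.append(sentence)
    if batch:
        yield batch

# Usage: one API call per batch
# for input_sentences in batch_sentences(sentences):
#     response = cotoha_api.coreference(input_sentences)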

Map the JSON response onto classes

I think this is a matter of taste, but isn't it easier to map the API response onto classes than to keep using it as a raw dictionary? That is my pitch, anyway.

In the case of the COTOHA API, the raw-dictionary approach seems to be the majority in Qiita articles, so I'll post this as one reference for anaphora resolution.

First, let's look at the official example of the JSON format the anaphora resolution response comes in. (As usual, the input is "Taro is a friend. He ate yakiniku.")

coreference.json


{
  "result" : {
    "coreference" : [ {
      "representative_id" : 0,
      "referents" : [ {
        "referent_id" : 0,
        "sentence_id" : 0,
        "token_id_from" : 0,
        "token_id_to" : 0,
        "form" : "Taro"
      }, {
        "referent_id" : 1,
        "sentence_id" : 1,
        "token_id_from" : 0,
        "token_id_to" : 0,
        "form" : "he"
      } ]
    } ],
    "tokens" : [ [ "Taro", "Is", "friend", "is" ], [ "he", "Is", "Roasted meat", "To", "eat", "Ta" ] ]
  },
  "status" : 0,
  "message" : "OK"
}

The class definitions this can be mapped onto are as follows. You can work them out by staring at the Official Reference. Define them starting from the deepest level of the JSON hierarchy.

respobj/coreference.py


# -*- coding: utf-8; -*-
from dataclasses import dataclass, field
from typing import List

# Referent (entity) object
@dataclass
class Referent:
    # Referent ID
    referent_id: int
    # Index of the sentence that contains the referent
    sentence_id: int
    # Index of the referent's first morpheme
    token_id_from: int
    # Index of the referent's last morpheme
    token_id_to: int
    # Surface form of the referent
    form: str

# Coreference chain object
@dataclass
class Representative:
    # Coreference chain ID
    representative_id: int
    # Array of referent objects
    referents: List[Referent] = field(default_factory=list)

# Anaphora resolution result object
@dataclass
class Result:
    # Array of coreference chain objects
    coreference: List[Representative] = field(default_factory=list)
    # Token arrays from the morphological analysis of each input sentence
    tokens: List[List[str]] = field(default_factory=list)

# Response
@dataclass
class Coreference:
    # Anaphora resolution result object
    result: Result
    # Status code 0:OK, >0:error
    status: int
    # Error message
    message: str

Where I got stuck: for some reason, the form field of the Referent class is not explained in the Official Reference, and it took me a while to notice that the tokens field of the Result class is List[List[str]].

The method that maps the JSON onto the classes (json_to_coreference is also used in main.py):

json_to_obj.py


# -*- coding:utf-8 -*-
import json
import codecs
import marshmallow_dataclass
from respobj.coreference import Coreference

def json_to_coreference(jsonstr):
    # Re-encode the response dict as JSON, then undo unicode escapes
    json_formated = codecs.decode(json.dumps(jsonstr), 'unicode-escape')
    # Generate a marshmallow schema from the dataclass and load the JSON into it
    result = marshmallow_dataclass.class_schema(Coreference)().loads(json_formated)
    return result

It is implemented with dataclasses and marshmallow_dataclass. marshmallow_dataclass is often not installed by default (PyPI Page).

This is the main reason Python 3.7 is required this time. I recommend this approach because even if the API specification changes, the affected parts are easy to spot and quick to fix. (That may just be because I'm not used to Python dictionaries, so take it as one option.)

Reference site: JSONize Python classes
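
A minimal usage sketch of json_to_coreference (the sample dict is the official JSON from earlier; install the library with pip install marshmallow-dataclass first):


from json_to_obj import json_to_coreference

# The official sample response from earlier, as a Python dict
sample_response = {
    "result": {
        "coreference": [{
            "representative_id": 0,
            "referents": [
                {"referent_id": 0, "sentence_id": 0, "token_id_from": 0,
                 "token_id_to": 0, "form": "Taro"},
                {"referent_id": 1, "sentence_id": 1, "token_id_from": 0,
                 "token_id_to": 0, "form": "he"},
            ],
        }],
        "tokens": [["Taro", "Is", "friend", "is"],
                   ["he", "Is", "Roasted meat", "To", "eat", "Ta"]],
    },
    "status": 0,
    "message": "OK",
}

obj = json_to_coreference(sample_response)
print(obj.result.coreference[0].referents[0].form)  # -> Taro
print(obj.result.tokens[1])                         # -> the second sentence's tokens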

Unify each coreference chain to a single expression and save the result to a file

The question here is which expression to unify each chain on. This time, on the prediction that whatever appears first in the text is the main form ~~because it's easy~~, I unify on the expression that appeared first.

main.py


    #Second half
    for obj in result:
        coreferences = obj.result.coreference
        tokens = obj.result.tokens
        for coreference in coreferences:
            anaphor = []
            # Use the first expression in the coreference chain as the base
            anaphor.append(coreference.referents[0].form)
            for referent in coreference.referents:
                sentence_id = referent.sentence_id
                token_id_from = referent.token_id_from
                token_id_to = referent.token_id_to
                #Rewrite so that the number of elements in list is not changed for subsequent processing.
                anaphor_and_empty = anaphor + ['']*(token_id_to - token_id_from)
                tokens[sentence_id][token_id_from: (token_id_to + 1)] = anaphor_and_empty
        #Save the modified text to a file
        with open(result_txt_path, mode='a') as f:
            for token in tokens:
                line = ''.join(token)
                f.write(line + '\n')

sentence_id indicates which sentence (element) in tokens the referent belongs to, and token_id_from and token_id_to mean that the token_id_from-th through token_id_to-th morphemes of that sentence make up the anaphoric expression.

The expression used for rewriting is obtained with coreference.referents[0].form, and when rewriting,

                #Rewrite so that the number of elements in list is not changed for subsequent processing.
                anaphor_and_empty = anaphor + ['']*(token_id_to - token_id_from)
                tokens[sentence_id][token_id_from: (token_id_to + 1)] = anaphor_and_empty

I apply a small trick like the above (forcibly matching the number of replacement elements to the number of elements replaced). Without it, subsequent token_id_from and token_id_to values would point at the wrong positions. (Please tell me if there is a better way.)
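
Here is a tiny standalone illustration of that padding trick, with hypothetical index values (a two-morpheme span is replaced by a single word):


# Tokens of the second sentence from the sample response
tokens = [["he", "Is", "Roasted meat", "To", "eat", "Ta"]]
sentence_id, token_id_from, token_id_to = 0, 2, 3  # hypothetical two-morpheme span
anaphor = ["Taro"]

# Pad the replacement so the list keeps the same number of elements
anaphor_and_empty = anaphor + [''] * (token_id_to - token_id_from)
tokens[sentence_id][token_id_from:(token_id_to + 1)] = anaphor_and_empty

print(tokens[0])       # ['he', 'Is', 'Taro', '', 'eat', 'Ta']
print(len(tokens[0]))  # still 6, so later token_id_from/token_id_to stay valid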

The whole picture of main.py

main.py


# -*- coding:utf-8 -*-
import os
import json
import configparser
import datetime
import codecs
import cotoha_function as cotoha
from aozora_scraping import get_aocora_sentence
from respobj.coreference import Coreference
from json_to_obj import json_to_coreference

if __name__ == '__main__':
    #Get the location of the source file
    APP_ROOT = os.path.dirname(os.path.abspath( __file__)) + "/"

    # Read configuration values
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")

    # Constants
    max_word = 1800
    max_call_api_count = 150
    max_elements_count = 20
    #URL of Aozora Bunko
    aozora_html = 'https://www.aozora.gr.jp/cards/000155/files/832_16016.html'
    #Current time
    now_date = datetime.datetime.today().strftime("%Y%m%d%H%M%S")
    #The path of the file to save the original text
    origin_txt_path = './result/origin_' + now_date + '.txt'
    #Path of the file to save the result
    result_txt_path = './result/converted_' + now_date + '.txt'

    #COTOHA API instantiation
    cotoha_api = cotoha.CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)

    #Get text from Aozora Bunko
    sentences = get_aocora_sentence(aozora_html)
    #Save original text for comparison
    with open(origin_txt_path, mode='a') as f:
        for sentence in sentences:
            f.write(sentence + '\n')

    # Initial values
    start_index = 0
    end_index = 0
    call_api_count = 1
    temp_sentences = sentences[start_index:end_index]
    elements_count = end_index - start_index
    limit_index = len(sentences)
    result = []
    print("Total number of list elements: " + str(limit_index))
    while(end_index <= limit_index and call_api_count <= max_call_api_count):
        length_sentences = len(''.join(temp_sentences))
        if(length_sentences < max_word and elements_count < max_elements_count and end_index < limit_index):
            end_index += 1
        else:
            if end_index == limit_index:
                input_sentences = sentences[start_index:end_index]
                print('index: ' + str(start_index) + ' to ' + str(end_index))
                # Termination condition
                end_index += 1
            else:
                input_sentences = sentences[start_index:end_index - 1]
                print('index: ' + str(start_index) + ' to ' + str(end_index - 1))
            print('API call #' + str(call_api_count))
            response = cotoha_api.coreference(input_sentences)
            result.append(json_to_coreference(response))
            call_api_count += 1
            start_index = end_index - 1
        temp_sentences = sentences[start_index:end_index]
        elements_count = end_index - start_index
    
    for obj in result:
        coreferences = obj.result.coreference
        tokens = obj.result.tokens
        for coreference in coreferences:
            anaphor = []
            # Use the first expression in the coreference chain as the base
            anaphor.append(coreference.referents[0].form)
            for referent in coreference.referents:
                sentence_id = referent.sentence_id
                token_id_from = referent.token_id_from
                token_id_to = referent.token_id_to
                #Rewrite so that the number of elements in list is not changed for subsequent processing.
                anaphor_and_empty = anaphor + ['']*(token_id_to - token_id_from)
                tokens[sentence_id][token_id_from: (token_id_to + 1)] = anaphor_and_empty
        #Save the modified text to a file
        with open(result_txt_path, mode='a') as f:
            for token in tokens:
                line = ''.join(token)
                f.write(line + '\n')

Processing result

Natsume Soseki, "Kokoro"

Image comparing the original text and part of the converted text in FileMerge (original on the left, converted on the right): [screenshot]

Amazing COTOHA API

before:

I had a friend from Kagoshima and learned naturally while imitating that person, so I was good at playing this turf flute.
As I continued to blow it, the teacher looked away and walked away.

↓ after:

I had a friend from Kagoshima and learned naturally while imitating a person from Kagoshima, so I was good at playing this turf flute.
As I continued to blow this turf flute, the teacher looked away and walked away.

"That person" has become "a person from Kagoshima", and "as I continued to blow it" has become "as I continued to blow this turf flute".

How about this? COTOHA API

before:

I immediately went to return the money to the teacher. I also brought the shiitake mushrooms with me.
~About 6 sentences omitted~
The teacher knew a lot of things I didn't know about kidney disease.
"The characteristic of the illness is that you are sick by yourself, but you don't notice it.
An officer I knew was finally killed by it, but he died like a lie.
~

after:

I immediately went to return the money to the teacher. I also brought the shiitake mushrooms with me.
~About 6 sentences omitted~
The teacher knew a lot of things I didn't know about kidney disease.
"The characteristic of illness is that you are sick by yourself, but you don't notice it.
An officer I knew was finally killed by the shiitake mushrooms, but he died like a lie.
~

An example where "it" was resolved to "the shiitake mushrooms"... There are many other painful cases like this. Symbols are not handled well, either.

Bonus

Edogawa Ranpo, "The Fiend with Twenty Faces"

[screenshot]

Ki no Tsurayuki "Tosa Nikki"

I didn't really expect classical texts to be resolved much; this one is half a joke (included just because the result is amusing).

[screenshot]

Both are the original on the left and the converted one on the right.

Finally

In the end, I can't really say from these results whether the assumption **"If you preprocess text with anaphora resolution before doing flashy things, won't the results of other natural language processing change (get more accurate)?"** holds. I think more verification is needed.

To improve conversion accuracy, it seems better to think more carefully about which expression each chain should be unified on, and taking sentence structure into account may also be important.

Japanese is a parade of demonstratives and pronouns, so perfection looks difficult, but some passages convert quite smoothly, and I feel the COTOHA API has considerable potential. That's all!
