[Python] Let the COTOHA API do the difficult things: an introduction to natural language processing you "learn by using"

Natural language processing (NLP) has been developing remarkably in recent years, and I suspect many people want to step into that world. However, "natural language processing" is a single phrase that covers an enormous number of tasks.

"What should I start with after all !?"

This article is my answer to that question.

1 Introduction

The goal of this article is to take complete beginners in natural language processing to the level of "Natural language processing? I know the various tasks and I have actually used them!". That said, explaining every task one by one would never fit in a single article. Worse, the more detailed the explanation becomes, the harder it gets, and you might end up retaining nothing at all.

1.1 Aim of this article

Therefore, this article sets the difficult algorithms aside and aims to let anyone easily "use" natural language processing. This does not mean simply handing you something to copy and paste: we keep the implementation simple by using a convenient API, and we lower the difficulty by explaining the details carefully.

Let the COTOHA API do the difficult things.

The COTOHA API is an API built on NTT Group research that makes natural language processing easy to use [^1]. By calling this API, we avoid having to implement the difficult algorithms ourselves.

The word "API" may sound intimidating, but this one is very easy to use, and this article explains how to use it carefully and in detail (or so I intend). All the sample code is written in Python, so it should also help if you ever want to implement a harder algorithm yourself later. No prior knowledge of APIs is required, so rest assured.

1.2 The natural language processing tasks this article will let you use

**Sentence summarization, parsing, named entity extraction, anaphora analysis, keyword extraction, similarity calculation between two sentences, sentence type estimation, user attribute estimation (age, occupation, etc.), filler removal ("ah", "uh", etc.), speech recognition error detection, sentiment analysis**

These are all the natural language processing features available to free users (the "for Developers" plan) of the COTOHA API. This article introduces these tasks and walks through code that actually runs. (However, since ending with copy-and-paste would defeat the purpose of this article, I do not provide a ready-made implementation for every feature. The parts that work as-is when copied are three: summarization, parsing, and named entity recognition.)

1.3 Final result

As an example, let's look at the result of sentence summarization. The input text is a Qiita article that I (@MonaCat) wrote earlier.

> python cotoha_test.py -f data/input.txt -s 3

<Summary>

Sentence summaries are divided into extraction type and generation type (abstract type), but currently the generation type (combination of extraction type) is the mainstream.
Introduction of neural sentence summary model-M3 Tech Blog seq2seq based automatic summarization method is summarized.
Paper commentary Attention Is All You Need(Transformer) -Deep Learning Blog If you read this first, there is no doubt.

Let me explain briefly.

The program is run with `python cotoha_test.py`. By passing various options as arguments, you can choose which API to call. In the output above, the part `-f data/input.txt -s 3` is the options: they say to read a file called `input.txt` and to call the automatic summarization API with the number of summary sentences set to 3. The output should therefore be a three-sentence summary.

By changing these options you can call other APIs or change settings (for example, summarizing in 5 sentences instead of 3). The option handling makes the code a little harder, so I added a supplementary explanation of how it is implemented. It is not essential to the main flow, so feel free to read it only if you have the time.
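For example, assuming the same input file, asking for a five-sentence summary instead of three just means changing the value passed to -s:

> python cotoha_test.py -f data/input.txt -s 5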

2 Getting to know natural language processing

Before implementing anything, let's learn a little about natural language processing itself. A natural language is a language we use in daily life, such as Japanese or English; natural language processing is the technical field that "processes" such "natural language."

This chapter briefly describes the natural language processing technologies that can be used with the COTOHA API. (Partially omitted)

- Parsing

Parsing is the process of splitting a sentence (a character string) into morphemes and analyzing the structural relationships between them. For example, the (Japanese) sentence "I saw" breaks down into "noun" + "particle" + "verb". Parsing makes the structural relationships of an otherwise unannotated string explicit.

- Named entity recognition

Named entity extraction automatically extracts proper nouns (personal names, place names, and so on). There are countless proper nouns in the world, and registering all of them in a dictionary is not realistic; automating the extraction is what makes processing large amounts of text possible.

- Anaphora analysis (coreference resolution)

Demonstratives and pronouns such as "this" and "he" refer to the same things as their antecedents; we can read them because we naturally resolve these relationships while understanding a sentence. Such a relationship is called an anaphoric relationship, and the technique for analyzing it is called anaphora analysis.

- Keyword extraction

"Keyword" can mean various things, but the COTOHA API scores characteristic phrases and words and extracts them as keywords. In other words, you can think of them as the representative phrases or words of a text.

- Attribute estimation

This technology estimates a person's attributes (age, gender, hobbies, occupation, and so on). The COTOHA API targets estimating the attributes of Twitter users.

- Filler removal

Fillers are words such as "ah" and "uh" uttered between phrases while speaking. A sentence transcribed by speech recognition, for example, may contain fillers that are unnecessary once the speech is turned into text. Removing them turns the text into data that is easier to work with.

- Sentiment analysis

Sentiment analysis (negative/positive judgment) classifies the emotion expressed in a text into two values, negative and positive. Deep-learning-based analysis is even more active here than in the fields introduced above.

- Sentence summarization

This technology automatically summarizes a text into one or several sentences. Recently it has become common for news sites to ship articles with a three-sentence summary. It is similar to keyword extraction in that it identifies the important phrases and sentences in a long text.

If you are interested in text summarization, I recommend this article. Although it is a paper introduction, it should be helpful because it covers everything from the basics to the latest techniques: [Introducing the paper "BERTSUM" that automatically summarizes with BERT + α - Qiita](https://qiita.com/MonaCat/items/5123d2f970cba1acd897)

3 Let's actually use it

I'm sorry to have kept you waiting. Let's experience natural language processing using the COTOHA API.

3.1 Common parts

First, the common parts that are needed regardless of which API you want to use.

The COTOHA API can be registered for and used free of charge, so sign up from the COTOHA API site.

Registration asks for your name and affiliation, but no credit card information. Once registered, you will be given the information shown below.

[Image: account information screen showing your Client ID and Client secret]

Of these, the "Client ID" and "Client secret" differ from person to person, so be prepared to copy them later.

3.1.1 Obtaining an access token

From here on we implement things in Python. Setting up the Python environment is up to you; just install the necessary modules with pip or conda as needed.

First, define a function to get an access token.

import requests
import json


CLIENT_ID = 'Rewrite as appropriate'
CLIENT_SECRET = 'Rewrite as appropriate'


def auth(client_id, client_secret):
    token_url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
    headers = {
        'Content-Type': 'application/json',
        'charset': 'UTF-8'
    }
    data = {
        'grantType': 'client_credentials',
        'clientId': client_id,
        'clientSecret': client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))

    return r.json()['access_token']


if __name__ == "__main__":

    access_token = auth(CLIENT_ID, CLIENT_SECRET)  #Get access token

This may suddenly look difficult, but it is actually very simple. Let's go through it in order.

First, auth() is called with the authentication information above as arguments.

Inside auth(), the access token is obtained using that authentication information. In the Start Guide it is obtained with a command, but here the same thing is implemented in Python. On success the response is JSON, but we only want the access token, so the function returns r.json()['access_token'].

With this access token you can now call the APIs.
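As a preview of how the token is used (the function base_api() defined later does exactly this), every request attaches it as a Bearer token in the Authorization header:

headers = {
    'Content-Type': 'application/json; charset=UTF-8',
    'Authorization': 'Bearer {}'.format(access_token)
}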

3.1.2 Specifying arguments (optional)

In this sample code, command-line arguments can be specified using the `argparse` module. As touched on in Section 1.3, the reason for supporting arguments is that we expect to use multiple APIs, and switching between them via options is convenient.

This section briefly explains `argparse`; it is not essential to the main flow, so skip it if you are short on time.

So first, just before obtaining the access token, write the following:

if __name__ == "__main__":

    document = 'The text used when no input text is specified. In this implementation, two patterns of "specify text by argument" and "specify text file by argument" can be used. Please give it a try.'
    args = get_args(document)  #Get arguments

    access_token = auth(CLIENT_ID, CLIENT_SECRET)  #Get access token

Here we call get_args(). The contents of this function are as follows.

def get_args(document):
    argparser = argparse.ArgumentParser(description='Code to play natural language processing using the Cotoha API')
    argparser.add_argument('-t', '--text', type=str, default=document, help='The sentence you want to analyze')
    argparser.add_argument('-f', '--file_name', type=str, help='Path of the .txt file you want to analyze')
    argparser.add_argument('-p', '--parse', action='store_true', help='Specify if parsing')
    argparser.add_argument('-s', '--summarize', type=int, help='Specify the number of summary sentences for automatic summarization')

    return argparser.parse_args()

The first line gives a description of the code via `description`. This text is shown when -h is passed as a command-line argument.

Here is an example of specifying -h:

> python cotoha_test.py -h      
usage: cotoha_test.py [-h] [-t TEXT] [-f FILE_NAME] [-p] [-n] [-s SUMMARIZE]

Code to play natural language processing using the Cotoha API

optional arguments:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT  The text you want to parse
  -f FILE_NAME, --file_name FILE_NAME
                        Path of the text file you want to parse
  -p, --parse           Parsing
  -n, --ne              Named entity recognition
  -s SUMMARIZE, --summarize SUMMARIZE
                        Number of summary sentences in automatic summarization

The subsequent lines define the arguments one by one with `add_argument`.

Of course, defining an argument does nothing by itself; you also have to implement what happens when it is specified. Here we want to run parsing when -p is given. That alone can be handled with a simple if statement, so let's do it right away.

if (args.parse):
    parse(doc, access_token)  #Function call

If `args.parse` is True, we call the function parse. parse has not appeared yet, so you cannot know what it does; for now just think of it as a function that performs parsing. The details are explained in a later chapter.

`args.parse` corresponds to the option we added earlier with `argparser.add_argument`. Note that it is `args.parse`, not `args.p`: the attribute name comes from the long option. In the same way, `args.summarize` is available for summarization.
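If you want to confirm this behavior without calling any API, argparse can be tried on its own. The snippet below is a standalone toy example (the option values are made up) showing that the attribute names follow the long option names:

import argparse

argparser = argparse.ArgumentParser()
argparser.add_argument('-p', '--parse', action='store_true')
argparser.add_argument('-s', '--summarize', type=int)

#parse_args() also accepts an explicit list of arguments, handy for quick experiments
args = argparser.parse_args(['-p', '-s', '3'])
print(args.parse)      # True (named after --parse, not -p)
print(args.summarize)  # 3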

Incidentally, once we branch on many tasks besides parsing and summarization, a pile of if statements starts to look ugly, so let's simplify a little.

#API call
l = [doc, access_token]  #Common arguments
parse(*l) if (args.parse) else None  #Parsing

What this does is the same as before, only written as a one-line if expression; since several arguments are shared, they are kept in a list and unpacked into the function call.
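In case the *l notation is unfamiliar: the asterisk simply unpacks the list into positional arguments, so the two calls below are equivalent.

l = [doc, access_token]  #Common arguments
parse(*l)                #Same as parse(doc, access_token)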

Finally, let's handle the case where a file is specified with -f. A specified file takes priority, but we first check that it actually exists so that a missing file does not cause an error.

#If you specify a file, that is prioritized
if (args.file_name != None and os.path.exists(args.file_name)):
    with open(args.file_name, 'r', encoding='utf-8') as f:
        doc = f.read()
else:
    doc = args.text

If you follow this far, the implementation should be as follows.

import argparse
import requests
import json
import os


BASE_URL = 'https://api.ce-cotoha.com/api/dev/'
CLIENT_ID = ''  #Rewrite as appropriate
CLIENT_SECRET = ''  #Rewrite as appropriate


def get_args(document):
    argparser = argparse.ArgumentParser(description='Code to play natural language processing using the Cotoha API')
    argparser.add_argument('-t', '--text', type=str, default=document, help='Text you want to parse')
    argparser.add_argument('-f', '--file_name', type=str, help='Path of the text file you want to parse')
    argparser.add_argument('-p', '--parse', action='store_true', help='Parsing')
    argparser.add_argument('-n', '--ne', action='store_true', help='Named entity recognition')
    argparser.add_argument('-c', '--coreference', action='store_true', help='Resolution analysis')
    argparser.add_argument('-k', '--keyword', action='store_true', help='Keyword extraction')
    argparser.add_argument('-s', '--summarize', type=int, help='Number of summary sentences in automatic summarization')

    return argparser.parse_args()


def auth(client_id, client_secret):
    """
Function to get an access token
    """
    token_url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
    headers = {
        'Content-Type': 'application/json',
        'charset': 'UTF-8'
    }
    data = {
        'grantType': 'client_credentials',
        'clientId': client_id,
        'clientSecret': client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))

    return r.json()['access_token']


def base_api(data, document, api_url, access_token):
    """
Header etc. common to all APIs
    """
    base_url = BASE_URL
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
        'Authorization': 'Bearer {}'.format(access_token)
    }
    r = requests.post(base_url + api_url,
                      headers=headers,
                      data=json.dumps(data))

    return r.json()


if __name__ == "__main__":

    doc = 'I met Michael at Tokyo station yesterday. I started dating him a month ago.'
    args = get_args(doc)  #Get arguments

    #If you specify a file, that is prioritized
    if (args.file_name != None and os.path.exists(args.file_name)):
        with open(args.file_name, 'r', encoding='utf-8') as f:
            doc = f.read()
    else:
        doc = args.text

    access_token = auth(CLIENT_ID, CLIENT_SECRET)  #Get access token
    
    #API call
    l = [doc, access_token]  #Common arguments

    parse(*l) if (args.parse) else None  #Parsing
    ne(*l) if (args.ne) else None  #Named entity recognition
    coreference(*l) if (args.coreference) else None  #Resolution analysis
    keyword(*l) if (args.keyword) else None  #Keyword extraction
    summarize(*l, args.summarize) if (args.summarize) else None  #Summarize

Of course, this program does not run yet: functions such as parse() and ne() are not defined. We implement them in the following sections.

3.2 Different parts for each task

For those who skipped 3.1.2, we will use a program that only does parsing, like the one below.

import argparse
import requests
import json
import os


BASE_URL = 'https://api.ce-cotoha.com/api/dev/'
CLIENT_ID = 'Rewrite as appropriate'
CLIENT_SECRET = 'Rewrite as appropriate'

def auth(client_id, client_secret):
    """
Function to get an access token
    """
    token_url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
    }
    data = {
        'grantType': 'client_credentials',
        'clientId': client_id,
        'clientSecret': client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))

    return r.json()['access_token']


if __name__ == "__main__":

    doc = 'I met Michael at Tokyo station yesterday. I started dating him a month ago.'

    access_token = auth(CLIENT_ID, CLIENT_SECRET)  #Get access token
    parse(doc, access_token)  #Parsing

If you followed 3.1.2, just keep using that implementation as-is. Either way, what we implement from here are the functions such as parse() and ne() that call the APIs and do the natural language processing.

3.2.1 Parsing

If you read the parsing API reference, you will see that you need to specify keys in a request header, a request body, and so on. The request header, however, is actually common to all COTOHA APIs. So the common parts such as the request header are implemented in a function called base_api(), and the parts unique to parsing in a function called parse().

def base_api(data, document, api_url, access_token):
    """
Header etc. common to all APIs
    """
    base_url = BASE_URL
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
        'Authorization': 'Bearer {}'.format(access_token)
    }
    r = requests.post(base_url + api_url,
                      headers=headers,
                      data=json.dumps(data))

    return r.json()


def parse(sentence, access_token):
    """
Parsing
    """
    data = {'sentence': sentence}
    result = base_api(data, sentence, 'nlp/v1/parse', access_token)

    print('\n<Parsing>\n')
    result_list = list()
    
    for chunks in result['result']:
        for token in chunks['tokens']:
            result_list.append(token['form'])

    print(' '.join(result_list))

Let's start with base_api(). The request header we just saw is specified in headers. The request is then sent using this header together with the request body, the text to be analyzed, and the URL of the API, all passed in as arguments. Finally, the function returns the JSON it received.

In parse(), the request body is first built in the variable data. Here the value of the key 'sentence' is the variable sentence (the names happen to be identical, which may read oddly, but one is the JSON key and the other is a Python variable).

The variable result then holds the return value of base_api(), i.e. the JSON response.

This JSON contains the parsing result, so we read it out. Its format is described in the response sample of the same reference. Looking at the parsing response sample, the key 'form' of each morpheme information object is what we want, so we append those values to a list.
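To make the nesting concrete, here is a tiny sketch with a made-up, heavily simplified response; the real response contains many more fields, so treat it only as an illustration of the loop above.

#Hypothetical, simplified shape of the parse response (illustration only)
sample = {
    'result': [
        {'tokens': [{'form': 'I'}, {'form': 'met'}]},
        {'tokens': [{'form': 'Michael'}]},
    ]
}

forms = []
for chunks in sample['result']:
    for token in chunks['tokens']:
        forms.append(token['form'])

print(' '.join(forms))  #-> I met Michael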

Running the program gives the following result:

> python cotoha_test.py

<Parsing>

I met up with Michael at Tokyo station yesterday. I started dating him a month ago.

3.2.2 Named entity recognition

Next is named entity extraction, but I will skip the detailed explanation because it would just repeat the previous one. In fact, most of the other APIs can be used in the same way.

def ne(sentence, access_token):
    """
Named entity recognition
    """
    data = {'sentence': sentence}
    result = base_api(data, sentence, 'nlp/v1/ne', access_token)
    
    print('\n<Named entity recognition>\n')
    result_list = list()

    for chunks in result['result']:
        result_list.append(chunks['form'])

    print(', '.join(result_list))

> python cotoha_test.py

<Named entity recognition>

yesterday, Michael, Tokyo Station, A month

3.2.3 Summary

Since we have come this far, let's do something a little different. If you read the summarization API reference, you will find sent_len in the request body, which lets you specify the number of summary sentences. Let's make use of it as well.

def summarize(document, access_token, sent_len):
    """
Summarization
    """
    data = {
        'document': document,
        'sent_len': sent_len
    }
    result = base_api(data, document, 'nlp/beta/summary', access_token)

    print('\n <Summary>\n')
    result_list = list()

    for result in result['result']:
        result_list.append(result)
    
    print(''.join(result_list))

Now you can get a summary of any length by passing a concrete value for sent_len. If you implemented the options from 3.1.2, it is convenient to pass sent_len as a command-line argument, as shown below.
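For reference, assuming the -s option from 3.1.2 is in place, the wiring is the single line that already appears in the full listing:

summarize(*l, args.summarize) if (args.summarize) else None  #Summarize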

> python cotoha_test.py -s 1

<Summary>

I met Michael at Tokyo station yesterday.

3.2.4 Other natural language processing

The other APIs can be implemented in the same way, so I will not cover them one by one; a single extra sketch follows below.
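As one hedged example, a keyword-extraction function in the same style might look like the sketch below. The endpoint path ('nlp/v1/keyword'), the 'document' request key, and the 'form' key of each result entry are my assumptions here, so check them against the keyword extraction reference before relying on this.

def keyword(document, access_token):
    """
Keyword extraction (sketch; endpoint and response keys are assumed, see the reference)
    """
    data = {'document': document}
    result = base_api(data, document, 'nlp/v1/keyword', access_token)

    print('\n<Keyword extraction>\n')
    result_list = list()

    for chunks in result['result']:
        result_list.append(chunks['form'])

    print(', '.join(result_list))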

4 Summary

Finally, here is the complete program that performs parsing, named entity recognition, and summarization. Once again, the other APIs can be implemented in the same way, so if natural language processing interests you, please try implementing them yourself.

import argparse
import requests
import json
import os


BASE_URL = 'https://api.ce-cotoha.com/api/dev/'
CLIENT_ID = ''  #Rewrite as appropriate
CLIENT_SECRET = ''  #Rewrite as appropriate


def get_args(document):
    argparser = argparse.ArgumentParser(description='Code to play natural language processing using the Cotoha API')
    argparser.add_argument('-t', '--text', type=str, default=document, help='Text you want to parse')
    argparser.add_argument('-f', '--file_name', type=str, help='Path of the text file you want to parse')
    argparser.add_argument('-p', '--parse', action='store_true', help='Parsing')
    argparser.add_argument('-n', '--ne', action='store_true', help='Named entity recognition')
    argparser.add_argument('-s', '--summarize', type=int, help='Number of summary sentences in automatic summarization')

    return argparser.parse_args()


def auth(client_id, client_secret):
    """
Function to get an access token
    """
    token_url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
    headers = {
        'Content-Type': 'application/json',
        'charset': 'UTF-8'
    }
    data = {
        'grantType': 'client_credentials',
        'clientId': client_id,
        'clientSecret': client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))

    return r.json()['access_token']


def base_api(data, document, api_url, access_token):
    """
Header etc. common to all APIs
    """
    base_url = BASE_URL
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
        'Authorization': 'Bearer {}'.format(access_token)
    }
    r = requests.post(base_url + api_url,
                      headers=headers,
                      data=json.dumps(data))

    return r.json()


def parse(sentence, access_token):
    """
Parsing
    """
    data = {'sentence': sentence}
    result = base_api(data, sentence, 'nlp/v1/parse', access_token)

    print('\n<Parsing>\n')
    result_list = list()
    
    for chunks in result['result']:
        for token in chunks['tokens']:
            result_list.append(token['form'])

    print(' '.join(result_list))


def ne(sentence, access_token):
    """
Named entity recognition
    """
    data = {'sentence': sentence}
    result = base_api(data, sentence, 'nlp/v1/ne', access_token)
    
    print('\n<Named entity recognition>\n')
    result_list = list()

    for chunks in result['result']:
        result_list.append(chunks['form'])

    print(', '.join(result_list))


def summarize(document, access_token, sent_len):
    """
Summarization
    """
    data = {
        'document': document,
        'sent_len': sent_len
    }
    result = base_api(data, document, 'nlp/beta/summary', access_token)

    print('\n <Summary>\n')
    result_list = list()

    for result in result['result']:
        result_list.append(result)
    
    print(''.join(result_list))


if __name__ == "__main__":

    doc = 'I met Michael at Tokyo station yesterday. I started dating him a month ago.'
    args = get_args(doc)  #Get arguments

    #If you specify a file, that is prioritized
    if (args.file_name != None and os.path.exists(args.file_name)):
        with open(args.file_name, 'r', encoding='utf-8') as f:
            doc = f.read()
    else:
        doc = args.text

    access_token = auth(CLIENT_ID, CLIENT_SECRET)  #Get access token
    
    #API call
    l = [doc, access_token]  #Common arguments

    parse(*l) if (args.parse) else None  #Parsing
    ne(*l) if (args.ne) else None  #Named entity recognition
    summarize(*l, args.summarize) if (args.summarize) else None  #Summarize

[^1]: With the commercial plan (the "for Enterprise" plan), the restrictions on dictionary types and the number of calls are relaxed, and APIs for speech recognition and speech synthesis also become available. It is not covered in this article because it is not realistic for personal use.
