[PYTHON] Create "Typoglycemia" sentences using COTOHA

Overview of this article

To start, please take a quick look at the following sentences.

Hlelo eevyrnoe, how are you? I am fnie. Tihs txet is bsaed on rseaecrh at Cmabrigde Uinervtisy in Egnalnd, wihch fuond taht wehn poelpe rcegoinze wrods, tehy can siltl raed tehm eevn if the oredr of the lteetrs is a mses, as lnog as the frist and lsat lteetrs are crorcet.

I hvae dleibreatley sbcrmaled the oredr of the lteetrs hree. How is it? You can raed it prpoelry, rghit?

How was it? Wasn't it surprisingly easy to read? This text is (in my personal opinion) a relatively famous copy-paste meme, and the phenomenon is formally called [Typoglycemia](https://ja.wikipedia.org/wiki/%E3%82%BF%E3%82%A4%E3%83%9D%E3%82%B0%E3%83%AA%E3%82%BB%E3%83%9F%E3%82%A2).

Roughly speaking, when humans recognize a word, they **do not read it character by character but visually recognize it as a group of characters**. In doing so, the brain instantly interprets and predicts the word, so **even if some of the letters that make up the word are swapped, it can still be corrected and read**. *These corrections depend on each person's knowledge and vocabulary, so there are individual differences.*
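A minimal sketch of the effect, assuming the usual rule of keeping the first and last characters of each word fixed and shuffling only the interior (the same rule the script later in this article applies). The names `scramble_word` and `scramble_sentence` and the optional `seed` parameter are illustrative choices, not part of the original script.

```python
import random


def scramble_word(word, rng):
    """Shuffle the interior characters of a word, keeping first and last fixed."""
    # Words shorter than 4 characters have at most one interior character,
    # so shuffling would change nothing.
    if len(word) < 4:
        return word
    interior = list(word[1:-1])
    rng.shuffle(interior)
    return word[0] + "".join(interior) + word[-1]


def scramble_sentence(sentence, seed=None):
    """Apply scramble_word to every whitespace-separated word."""
    rng = random.Random(seed)
    return " ".join(scramble_word(w, rng) for w in sentence.split())


if __name__ == "__main__":
    print(scramble_sentence("humans recognize words as a whole"))
```

Because the shuffle is random, the output differs from run to run unless a `seed` is passed; a shuffle can also happen to return the original order, especially for short words.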

This time, I use the Parsing API provided by the COTOHA API to parse the input text and output it as typoglycemia text.
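To make the parsing step concrete, the sketch below walks a response shaped like the one the script in this article reads: a `result` list of chunks, each with a `tokens` list whose entries carry `form` (surface form) and `kana` (reading). The sample dictionary is hand-written for illustration, not real API output, and real responses contain more fields; `extract_tokens` is a hypothetical helper name.

```python
def extract_tokens(parse_result):
    """Return (form, kana) pairs for every token in a parse response."""
    pairs = []
    for chunk in parse_result["result"]:
        for token in chunk["tokens"]:
            pairs.append((token["form"], token["kana"]))
    return pairs


# Hand-written sample in the assumed shape (NOT real API output).
SAMPLE_RESPONSE = {
    "result": [
        {"tokens": [{"form": "Python", "kana": "パイソン"},
                    {"form": "と", "kana": "ト"}]},
        {"tokens": [{"form": "文章", "kana": "ブンショウ"}]},
    ]
}

if __name__ == "__main__":
    for form, kana in extract_tokens(SAMPLE_RESPONSE):
        print(form, kana)
```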

It looks like this.

before :
Post typoglycemia sentences output using Python and COTOHA to Qiita
after :
Let's use pothyn and choota, and let's send the taposi miriguabushon to qiita.

That is the kind of result you get!!

What kind of API is CHOOTA?

COTOHA is a **natural language processing / speech processing API platform** provided by NTT Communications. In addition to the parsing introduced in this article, it provides various functions such as named entity extraction, coreference resolution, keyword extraction, similarity calculation, sentence type determination, user attribute estimation, sentiment analysis, and summarization.

User registration is easy, and each API can be called **1000 times per day** even on the free tier, so you can play around with it casually. COTOHA is currently running a joint campaign with Qiita, so please join in!!

You can register as a user for free from the COTOHA API Portal. After you enter some basic information, a client ID and secret for using the API are issued, so make a note of them if you want to try the scripts below yourself.

Implementing it in Pothyn

I referred to the following articles. Both are very easy to understand and highly recommended!

- I tried using the COTOHA API rumored to be easy to handle natural language processing in Python
- The result of having COTOHA summarize "Mentos and the memories of Go". COTOHA with the fastest tutorial

The code is based on the articles above, but I tweaked the API endpoint part a bit. Originally, BASE_URL included nlp, but I removed it from the base URL to follow the official COTOHA format, so each method now appends its own nlp/v1/... path.
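The effect of that change can be sketched as follows: the base URL stops at the environment prefix and each call appends its own path. The base URL and paths are the ones used in this article's config.ini and script; `endpoint_url` and the `ENDPOINT_PATHS` table are hypothetical helpers introduced here for illustration.

```python
# Base URL as written in this article's config.ini (no "nlp" segment).
BASE_URL = "https://api.ce-cotoha.com/api/dev/"

# Paths as they appear in this article's script.
ENDPOINT_PATHS = {
    "parse": "nlp/v1/parse",
    "ne": "nlp/v1/ne",
    "keyword": "nlp/v1/keyword",
    "similarity": "nlp/v1/similarity",
    "sentence_type": "nlp/v1/sentence_type",
}


def endpoint_url(base_url, name):
    """Join the base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + ENDPOINT_PATHS[name]


if __name__ == "__main__":
    print(endpoint_url(BASE_URL, "parse"))
```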

Main program

cotoha_api.py


import os
import urllib.request
import json
import configparser
import codecs
import re
import jaconv
import random


#COTOHA API operation class
class CotohaApi:
    #Initialization
    def __init__(self, client_id, client_secret, developer_api_base_url, access_token_publish_url):
        self.client_id = client_id
        self.client_secret = client_secret
        self.developer_api_base_url = developer_api_base_url
        self.access_token_publish_url = access_token_publish_url
        self.getAccessToken()

    #Get access token
    def getAccessToken(self):
        #Access token acquisition URL specification
        url = self.access_token_publish_url

        #Header specification
        headers={
            "Content-Type": "application/json;charset=UTF-8"
        }

        #Request body specification
        data = {
            "grantType": "client_credentials",
            "clientId": self.client_id,
            "clientSecret": self.client_secret
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()

        #Request generation
        req = urllib.request.Request(url, data, headers)

        #Send a request and receive a response
        res = urllib.request.urlopen(req)

        #Get response body
        res_body = res.read()

        #Decode the response body from JSON
        res_body = json.loads(res_body)

        #Get an access token from the response body
        self.access_token = res_body["access_token"]


    #Parsing API
    def parse(self, sentence):
        #Parsing API URL specification
        url = self.developer_api_base_url + "nlp/v1/parse"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "sentence": sentence
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        #Send a request and receive a response
        try:
            res = urllib.request.urlopen(req)
        #What to do if an error occurs in the request
        except urllib.request.HTTPError as e:
            #If the status code is 401 Unauthorized, get the access token again and request again
            if e.code == 401:
                print ("get access token")
                #getAccessToken() stores the new token in self.access_token itself
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            #Show cause for errors other than 401, then re-raise because there is no response to read
            else:
                print ("<Error> " + str(e.reason))
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body


    #Named entity recognition API
    def ne(self, sentence):
        #Named entity extraction API URL specification
        url = self.developer_api_base_url + "nlp/v1/ne"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "sentence": sentence
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        #Send a request and receive a response
        try:
            res = urllib.request.urlopen(req)
        #What to do if an error occurs in the request
        except urllib.request.HTTPError as e:
            #If the status code is 401 Unauthorized, get the access token again and request again
            if e.code == 401:
                print ("get access token")
                #getAccessToken() stores the new token in self.access_token itself
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            #Show cause for errors other than 401, then re-raise because there is no response to read
            else:
                print ("<Error> " + str(e.reason))
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body


    #Coreference resolution API
    def coreference(self, document):
        #Coreference resolution API URL specification
        url = self.developer_api_base_url + "beta/coreference"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "document": document
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        #Send a request and receive a response
        try:
            res = urllib.request.urlopen(req)
        #What to do if an error occurs in the request
        except urllib.request.HTTPError as e:
            #If the status code is 401 Unauthorized, get the access token again and request again
            if e.code == 401:
                print ("get access token")
                #getAccessToken() stores the new token in self.access_token itself
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            #Show cause for errors other than 401, then re-raise because there is no response to read
            else:
                print ("<Error> " + str(e.reason))
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body


    #Keyword extraction API
    def keyword(self, document):
        #Keyword extraction API URL specification
        url = self.developer_api_base_url + "nlp/v1/keyword"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "document": document
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        #Send a request and receive a response
        try:
            res = urllib.request.urlopen(req)
        #What to do if an error occurs in the request
        except urllib.request.HTTPError as e:
            #If the status code is 401 Unauthorized, get the access token again and request again
            if e.code == 401:
                print ("get access token")
                #getAccessToken() stores the new token in self.access_token itself
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            #Show cause for errors other than 401, then re-raise because there is no response to read
            else:
                print ("<Error> " + str(e.reason))
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body


    #Similarity calculation API
    def similarity(self, s1, s2):
        #Similarity calculation API URL specification
        url = self.developer_api_base_url + "nlp/v1/similarity"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "s1": s1,
            "s2": s2
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        #Send a request and receive a response
        try:
            res = urllib.request.urlopen(req)
        #What to do if an error occurs in the request
        except urllib.request.HTTPError as e:
            #If the status code is 401 Unauthorized, get the access token again and request again
            if e.code == 401:
                print ("get access token")
                #getAccessToken() stores the new token in self.access_token itself
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            #Show cause for errors other than 401, then re-raise because there is no response to read
            else:
                print ("<Error> " + str(e.reason))
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body


    #Statement type determination API
    def sentenceType(self, sentence):
        #Statement type determination API URL specification
        url = self.developer_api_base_url + "nlp/v1/sentence_type"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "sentence": sentence
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        #Send a request and receive a response
        try:
            res = urllib.request.urlopen(req)
        #What to do if an error occurs in the request
        except urllib.request.HTTPError as e:
            #If the status code is 401 Unauthorized, get the access token again and request again
            if e.code == 401:
                print ("get access token")
                #getAccessToken() stores the new token in self.access_token itself
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            #Show cause for errors other than 401, then re-raise because there is no response to read
            else:
                print ("<Error> " + str(e.reason))
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body


    #User attribute estimation API
    def userAttribute(self, document):
        #User attribute estimation API URL specification
        url = self.developer_api_base_url + "beta/user_attribute"
        #Header specification
        headers={
            "Authorization": "Bearer " + self.access_token,
            "Content-Type": "application/json;charset=UTF-8",
        }
        #Request body specification
        data = {
            "document": document
        }
        #Encode request body specification to JSON
        data = json.dumps(data).encode()
        #Request generation
        req = urllib.request.Request(url, data, headers)
        #Send a request and receive a response
        try:
            res = urllib.request.urlopen(req)
        #What to do if an error occurs in the request
        except urllib.request.HTTPError as e:
            #If the status code is 401 Unauthorized, get the access token again and request again
            if e.code == 401:
                print ("get access token")
                #getAccessToken() stores the new token in self.access_token itself
                self.getAccessToken()
                headers["Authorization"] = "Bearer " + self.access_token
                req = urllib.request.Request(url, data, headers)
                res = urllib.request.urlopen(req)
            #Show cause for errors other than 401, then re-raise because there is no response to read
            else:
                print ("<Error> " + str(e.reason))
                raise

        #Get response body
        res_body = res.read()
        #Decode the response body from JSON
        res_body = json.loads(res_body)
        #Get analysis result from response body
        return res_body



if __name__ == '__main__':
    #Get the location of the source file
    APP_ROOT = os.path.dirname(os.path.abspath( __file__)) + "/"

    #Get set value
    config = configparser.ConfigParser()
    config.read(APP_ROOT + "config.ini")
    CLIENT_ID = config.get("COTOHA API", "Developer Client id")
    CLIENT_SECRET = config.get("COTOHA API", "Developer Client secret")
    DEVELOPER_API_BASE_URL = config.get("COTOHA API", "Developer API Base URL")
    ACCESS_TOKEN_PUBLISH_URL = config.get("COTOHA API", "Access Token Publish URL")

    #COTOHA API instantiation
    cotoha_api = CotohaApi(CLIENT_ID, CLIENT_SECRET, DEVELOPER_API_BASE_URL, ACCESS_TOKEN_PUBLISH_URL)

    #Analysis target sentence
    sentence = "Post typoglycemia sentences output using Python and COTOHA to Qiita"

    #Display before shaping
    print('before :')
    print(sentence)

    #Parsing API execution
    result_json = cotoha_api.parse(sentence)

    #String list before formatting
    word_list_base = []
    #Character string list after formatting
    word_list = []

    #Regular expression for alphanumeric judgment
    alnumReg = re.compile(r'^[a-zA-Z0-9 ]+$')

    #Loop processing of analysis results
    for i in range(len(result_json['result'])):
        for j in range(len(result_json['result'][i]['tokens'])):
            #Use the value of 'form' for alphanumeric tokens and the value of 'kana' for Japanese tokens
            word = result_json['result'][i]['tokens'][j]['form']
            kana = result_json['result'][i]['tokens'][j]['kana']
            #Judgment whether it is half-width alphanumeric characters
            if alnumReg.match(word) is not None:
                #Determine if it is one word or not
                if ' ' in word:
                    #If it consists of several words, split it further
                    word_list_base.extend(word.split(' '))
                else :
                    word_list_base.append(word)
            #Japanese
            else :
                #Convert katakana to hiragana and add to list
                word_list_base.append(jaconv.kata2hira(kana))
            
    #For each word of 4 or more characters, shuffle the characters other than the first and last
    for i in range(len(word_list_base)):
        #4 characters or more
        if len(word_list_base[i]) > 3:
            #First break down into a character-by-character list
            wl_all = list(word_list_base[i])
            #Keep the first and last characters
            first_word = wl_all[0]
            last_word = wl_all[len(wl_all) - 1]
            #Get the characters inside in list format
            wl = wl_all[1:len(wl_all) - 1]
            #Shuffle
            random.shuffle(wl)
            word_list.append(first_word + ''.join(wl) + last_word)
        #If it is less than 4 characters, leave it as it is
        else :
            word_list.append(word_list_base[i])

    #Display formatting results
    print('after :')
    print(' '.join(word_list))
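
The API methods in the class above differ only in their path and request body, so the request and retry logic could be factored into one helper. The following is a sketch under that assumption, not the original author's code: `post_json`, `get_token`, `refresh_token`, and the injectable `opener` parameter are hypothetical names introduced here so that the retry path can be exercised without network access.

```python
import json
import urllib.error
import urllib.request


def post_json(url, body, get_token, refresh_token, opener=urllib.request.urlopen):
    """POST a JSON body with a bearer token, retrying once after a 401."""

    def send():
        # Build the headers on every attempt so a refreshed token is picked up.
        headers = {
            "Authorization": "Bearer " + get_token(),
            "Content-Type": "application/json;charset=UTF-8",
        }
        data = json.dumps(body).encode()
        req = urllib.request.Request(url, data, headers)
        return opener(req)

    try:
        res = send()
    except urllib.error.HTTPError as e:
        # Only an expired or invalid token is worth retrying; re-raise the rest.
        if e.code != 401:
            raise
        refresh_token()
        res = send()
    return json.loads(res.read())
```

With a helper like this, a method such as `parse` would reduce to building its URL and body and delegating the rest.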
Configuration file

config.ini


[COTOHA API]
Developer API Base URL: https://api.ce-cotoha.com/api/dev/
Developer Client id:[Client ID]
Developer Client secret:【secret】
Access Token Publish URL: https://api.ce-cotoha.com/v1/oauth/accesstokens
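
As a quick check of this file format, the sketch below parses a config with the same section and key names using `configparser.read_string`, so no file on disk is needed. The credential values are placeholders made up for this example, and `load_settings` is a hypothetical helper name.

```python
import configparser

# Same section and key names as config.ini; the credential values are placeholders.
SAMPLE_CONFIG = """
[COTOHA API]
Developer API Base URL: https://api.ce-cotoha.com/api/dev/
Developer Client id: your-client-id
Developer Client secret: your-client-secret
Access Token Publish URL: https://api.ce-cotoha.com/v1/oauth/accesstokens
"""


def load_settings(text):
    """Parse the config text and return the four values the script needs."""
    config = configparser.ConfigParser()
    config.read_string(text)
    section = config["COTOHA API"]
    return {
        "client_id": section["Developer Client id"],
        "client_secret": section["Developer Client secret"],
        "base_url": section["Developer API Base URL"],
        "token_url": section["Access Token Publish URL"],
    }


if __name__ == "__main__":
    print(load_settings(SAMPLE_CONFIG)["base_url"])
```

By default `configparser` matches option names case-insensitively and treats only the first `:` on a line as the delimiter, so the URLs survive intact.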

To use it, enter your client ID and secret in config.ini and place the file in the same directory as cotoha_api.py. Run it as follows.

python cotoha_api.py

Execution result


before :
Post typoglycemia sentences output using Python and COTOHA to Qiita
after :
Let's use pothyn and choota, and let's send the taposi miriguabushon to qiita.

Summary

Even though I had no knowledge of Python or natural language processing, I went ahead and implemented it anyway (**the script is almost entirely other people's work**). Even so, I got results I am personally satisfied with, so I think COTOHA is very easy to handle and makes an excellent introduction.

I hope this article gets you at least a little interested in COTOHA.

References

- COTOHA API Reference
- I tried using the COTOHA API rumored to be easy to handle natural language processing in Python
- The result of having COTOHA summarize "Mentos and the memories of Go". COTOHA with the fastest tutorial

Aside

As an aside, **"Typoglycemia"** is a portmanteau of **"typographical error"** (a typo) and **"hypoglycemia"** (low blood sugar). A portmanteau is a word made by blending multiple words, such as "Netsusama-to" and "Nespresso".

Huh? This looks like it would be interesting to handle with COTOHA, too...

If I were to do it, the title would be **"I tried to create trendy portmanteaus using COTOHA"** (laughs)
