[PYTHON] Play with puns using the COTOHA API

Introduction

[Qiita x COTOHA API present campaign] Let's analyze text with the COTOHA API!

I decided to try the COTOHA API after hearing rumors of the present campaign above. As a lover of puns, I wondered whether this API could be used to analyze puns, so I gave it a go. I don't expect to win a present at all.

Preparation

First, register for the "for Developer" plan on the COTOHA API page. You can start calling the API right away.

As long as you're dealing with puns, it's bound to be a trial-and-error process. However, the for Developer plan limits each API to 1,000 calls per day. So let's create a class that caches responses and doesn't waste API calls on input it has already seen.

COTOHA.py (maybe a bit verbose)
import os
import sys
import pathlib
import time
import requests
import json
import hashlib


class COTOHA:
    __BASE_URL = 'https://api.ce-cotoha.com/api/dev/'

    def __init__(self, id, secret, cache_dir='./COTOHA_cache'):
        self.id = id
        self.secret = secret
        self.cache_dir = cache_dir
        self._get_token()

    def _create_cache_path(self, func, key):
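        # Shard the cache path by the key's MD5 (e.g. ab/cd/ef012...) so that
        # no single directory accumulates too many files.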
        hash = hashlib.md5(key.encode()).hexdigest()
        hashpath = "{}/{}/{}".format(hash[:2], hash[2:4], hash[4:])
        return self.cache_dir + '/' + func + '/' + hashpath

    def _save_cache(self, path, content):
        pathlib.Path(os.path.dirname(path)).mkdir(exist_ok=True, parents=True)
        with open(path, mode="w") as c:
            c.write(content)
        return

    def _load_cache(self, path):
        content = None
        if os.path.exists(path):
            with open(path, mode="r") as c:
                content = c.read()
        return content

    def _get_token(self):
        token_cache = self.cache_dir + '/token'  # format: token expired_time
        cache = self._load_cache(token_cache)
        if cache:
            token, expired_time = cache.split()
            if int(expired_time) > int(time.time()):
                self.token = token
                return

        # get new token
        token_url = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
        headers = {'content-type': 'application/json'}
        payload = {"grantType": "client_credentials",
                   "clientId": self.id,
                   "clientSecret": self.secret}
        res = requests.post(token_url, headers=headers, data=json.dumps(payload))
        res.raise_for_status()
        res_json = json.loads(res.text)

        self.token = res_json['access_token']
        expired_time = int(time.time()) + int(res_json['expires_in'])
        self._save_cache(token_cache, self.token + ' ' + str(expired_time))
        return

    # Speech recognition error detection (β)
    def detect_misrecognition(self, data):
        func = sys._getframe().f_code.co_name
        cache_path = self._create_cache_path(func, data)
        cache = self._load_cache(cache_path)
        if cache:
            print("[INFO] use '"+func+"' api cache", file=sys.stderr)
            return cache

        api_url = self.__BASE_URL + 'nlp/beta/detect_misrecognition'
        payload = {'sentence': data}
        headers = {'content-type': 'application/json;charset=UTF8',
                   'Authorization': 'Bearer ' + self.token}

        res = requests.post(api_url, headers=headers, data=json.dumps(payload))
        res.raise_for_status()

        self._save_cache(cache_path, res.text)
        return res.text

    # Summarization (β)
    def summary(self, data, sent_len=1):
        func = sys._getframe().f_code.co_name
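        # Fold the parameters into the cache key so the same text called with
        # different options gets its own cache entry.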
        cache_path = self._create_cache_path(func, data+str(sent_len))
        cache = self._load_cache(cache_path)
        if cache:
            print("[INFO] use '"+func+"' api cache", file=sys.stderr)
            return cache

        api_url = self.__BASE_URL + 'nlp/beta/summary'
        payload = {'document': data, 'sent_len': sent_len}
        headers = {'content-type': 'application/json;charset=UTF8',
                   'Authorization': 'Bearer ' + self.token}

        res = requests.post(api_url, headers=headers, data=json.dumps(payload))
        res.raise_for_status()

        self._save_cache(cache_path, res.text)
        return res.text

    def keyword(self, data, type='default', max_keyword_num=5):
        func = sys._getframe().f_code.co_name
        cache_path = self._create_cache_path(func, data+type+str(max_keyword_num))
        cache = self._load_cache(cache_path)
        if cache:
            print("[INFO] use '"+func+"' api cache", file=sys.stderr)
            return cache

        if type != 'kuzure' and type != 'default':
            print("[ERROR] type must be default or kuzure! :" + type)
            return

        api_url = self.__BASE_URL + 'nlp/v1/keyword'
        payload = {'document': data, 'type': type, 'max_keyword_num': max_keyword_num}
        headers = {'content-type': 'application/json;charset=UTF8',
                   'Authorization': 'Bearer ' + self.token}

        res = requests.post(api_url, headers=headers, data=json.dumps(payload))
        res.raise_for_status()

        self._save_cache(cache_path, res.text)
        return res.text

    def parse(self, data, type='default'):
        func = sys._getframe().f_code.co_name
        cache_path = self._create_cache_path(func, data+type)
        cache = self._load_cache(cache_path)
        if cache:
            print("[INFO] use '"+func+"' api cache", file=sys.stderr)
            return cache

        if type != 'kuzure' and type != 'default':
            print("[ERROR] type must be default or kuzure! :" + type)
            return

        api_url = self.__BASE_URL + 'nlp/v1/parse'
        payload = {'sentence': data, 'type': type}
        headers = {'content-type': 'application/json;charset=UTF8',
                   'Authorization': 'Bearer ' + self.token}

        res = requests.post(api_url, headers=headers, data=json.dumps(payload))
        res.raise_for_status()

        self._save_cache(cache_path, res.text)
        return res.text

    # ... (the remaining APIs, such as sentiment, are omitted)

Now let's wrap this in a command so it can be called however we like. I'm new to Python, so I'm not sure argparse is the best choice, but it was quick to get working. I haven't added any options.

coto.py
#!/usr/bin/env python
import sys
import argparse
from COTOHA import COTOHA

parser = argparse.ArgumentParser()
parser.add_argument('api', choices=['summary', 'keyword', 'parse',
                    'detect_misrecognition', 'sentiment'])
parser.add_argument('infile', nargs='?', type=argparse.FileType('r'),
                    default=sys.stdin)
args = parser.parse_args()

id = ''  # Enter your Client ID!
secret = ''  # Enter your Client Secret!
coto = COTOHA(id, secret)

data = args.infile.read()
if args.api == "summary":
    sent_len = 1
    res = coto.summary(data, sent_len)
elif args.api == "keyword":
    type = 'default'
    max_keyword_num = 5
    res = coto.keyword(data, type, max_keyword_num)
elif args.api == "parse":
    type = 'default'
    res = coto.parse(data, type)
elif args.api == "detect_misrecognition":
    res = coto.detect_misrecognition(data)
elif args.api == "sentiment":
    res = coto.sentiment(data)
else:
    print("unexpected api:" + args.api, file=sys.stderr)
    sys.exit(1)

print(res)

As a first test, for some reason I'll throw The Pillow Book at it (lol).

$ cat makurano.txt 
Spring is Akebono. The mountains, which are finally turning white, are a little lighter, and the purple clouds are fluttering.
Summer is night.... (The following is omitted, until winter.)
$ ./coto.py summary makurano.txt 
{"result":"In the daytime, if you relax warmly, the fire in the brazier will tend to be white ash.","status":0}
$ ./coto.py keyword makurano.txt 
{
  "result" : [ {
    "form" : "See",
    "score" : 21.3012
  }, {
    "form" : "Itotsuki",
    "score" : 20.0
  }, {
    "form" : "fire",
    "score" : 17.12786
  }, {
    "form" : "Where to sleep",
    "score" : 11.7492
  }, {
    "form" : "Sunset",
    "score" : 11.4835
  } ],
  "status" : 0,
  "message" : ""
}

Yeah, throwing classical Japanese at it right off the bat is a bit mean-spirited. I can't really judge the validity of the results, lol. Let's try something more straightforward.

$ echo "I'm sukiyaki today, I'm looking forward to it." | ./coto.py sentiment
{"result":{"sentiment":"Positive","score":0.6113335958534332,"emotional_phrase":[{"form":"I'm looking forward to it","emotion":"P"}]},"status":0,"message":"OK"}

$ echo "Today is sukiyaki, I don't want to eat it." | ./coto.py sentiment
{"result":{"sentiment":"Neutral","score":0.2837920794741674,"emotional_phrase":[]},"status":0,"message":"OK"}

$ echo "Today is sukiyaki, I don't want to eat it." | ./coto.py sentiment
{"result":{"sentiment":"Negative","score":0.7608419653662558,"emotional_phrase":[{"form":"Spicy","emotion":"N"}]},"status":0,"message":"OK"}

$ echo "Today is sukiyaki. Why do you eat that kind of food?" | ./coto.py sentiment
{"result":{"sentiment":"Neutral","score":0.3482213983910368,"emotional_phrase":[]},"status":0,"message":"OK"}

$ echo "Today is sukiyaki. Alright." | ./coto.py sentiment
{"result":{"sentiment":"Positive","score":0.0976613052132079,"emotional_phrase":[{"form":"Yosha","emotion":"Rejoice"}]},"status":0,"message":"OK"}

Hmm, sentiment analysis doesn't seem to pick anything up unless there's an explicit emotional expression. Plenty to think about.

With that, I can now call the API freely. Though having built it, I realize that if I never get anywhere near 1,000 calls a day, maybe I didn't need the cache (hey).

Scraping to fetch puns

Next, I'm going to fetch some playful puns from the crappy blog I used to write. For scraping, I use the familiar (?) Beautiful Soup from "Automate the Boring Stuff with Python" (there's also an article on it you can understand in 10 minutes).

This is cached too, out of frugality. I do love caching. I pored over the HTML source and extracted the article part. Let's display the content of one article, extracted with get_text.

scrape.py
import os
import re
import requests
import bs4

# Cache the blog's top page so repeated runs don't hit the server again.
if not os.path.exists("hijili/top"):
    res = requests.get("https://ameblo.jp/hijili/")
    res.raise_for_status()
    os.makedirs("./hijili/")
    with open("./hijili/top", mode="w") as f:
        f.write(res.text)

with open("hijili/top", mode="r") as f:
    top_soup = bs4.BeautifulSoup(f, 'html.parser')

# The article bodies live in <div id="entryBody"> elements.
bodies = [n.get_text() for n in top_soup.select('div#entryBody')]

print(bodies[1])
$ python scrape.py
When I was talking about a katana, I thought it was good that I could quickly reply, "I bought it a long time ago." (1 slip) I thought I'd won, though I didn't say so out loud. (2 slips) Even if I say I bought it, it's not serious. (3 seriously-appealing slips)

There are plenty of puns like this, so I can parse to my heart's content! I picked one that seems relatively easy to follow (though I'm not sure how the "serious" part gets involved...). (The pun: 刀 katana "sword" echoes 買ったな katta na "I bought it", and 真剣 shinken means both "serious" and "real sword".) However, the "(N slips)" markers that are this blog's trademark would seem to interfere with the analysis, so I'll cut them out.

# bodies = [n.get_text() for n in top_soup.select('div#entryBody')]
bodies = [re.sub(r'\([^)]*slips?\)', '', n.get_text()) for n in top_soup.select('div#entryBody')]
$ python scrape.py
When I was talking about a katana, I thought it was good that I could quickly reply, "I bought it a long time ago." I thought I'd won, though I didn't say so out loud. Even if I say I bought it, it's not serious.

Alright, this should be easier for COTOHA to read.

Playing with puns using the COTOHA API

For starters, let's crank the text through one API after another using the COTOHA class. My hope is to find some clue for judging that a pun is a pun...

# Summarization (β)
{"result":"When I was talking about a katana, I thought it was good that I could quickly reply, 'I bought it a long time ago.'","status":0}

# Keyword extraction
{
  "result" : [ {
    "form" : "Serious",
    "score" : 22.027
  }, {
    "form" : "mouth",
    "score" : 8.07787
  }, {
    "form" : "Talk",
    "score" : 7.23882
  } ],
  "status" : 0,
  "message" : ""
}

# Sentiment analysis
{"result":{"sentiment":"Positive","score":0.06420815417815495,"emotional_phrase":[{"form":"Was good","emotion":"P"},{"form":"I won","emotion":"P"},{"form":"Not serious","emotion":"PN"}]},"status":0,"message":"OK"}

# User attribute estimation (β)
{
  "result" : {
    "age" : "20-29-year-old",
    "civilstatus" : "married",
    "hobby" : [ "CAMERA", "COOKING", "INTERNET", "MUSIC", "SHOPPING" ],
    "location" : "Kanto",
    "moving" : [ "WALKING" ],
    "occupation" : "employee"
  },
  "status" : 0,
  "message" : "OK"
}

Hmm, why are some responses returned pretty-printed and others not...? Well, it doesn't really matter, since Python will process them anyway.
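(An aside of my own, not from the original article: the responses can be normalized on our side in a couple of lines, assuming res holds one of the JSON strings returned above.)

import json

# re-serialize any response string into a consistent, pretty-printed form
print(json.dumps(json.loads(res), ensure_ascii=False, indent=2))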

The summarization API doesn't really summarize, does it? It just returns whichever sentence in a long text looks most important.

As for keyword extraction, I wanted it to pull out "katana" from this text... It seems it can't find the pun keyword...

Sentiment analysis also works! The author is just writing puns for fun, so it's always positive.

User attribute estimation... hmm, it infers the kind of profile that would make puns, but sorry, I'm older than that and single.

So overall, I couldn't confirm that any of these are effective for puns... (sorry).

However, speech recognition error detection gives me some hope! It returns words that sound similar to "shinken" (serious)!

# Speech recognition error detection (β)
{"result":{"score":0.7298996038673021,"candidates":[{"begin_pos":76,"end_pos":78,"form":"Serious","detect_score":0.7298996038673021,"correction":[{"form":"test","correct_score":0.7367702251579388},{"form":"Literature","correct_score":0.7148903980845676},{"form":"Nervous","correct_score":0.6831886720211694},{"form":"Shinken","correct_score":0.6443806737641236},{"form":"Ren County","correct_score":0.6403453079473043}]}]},"status":0,"message":"OK"}

Take the test (shiken) seriously (shinken)! (1 slip) Concentrate your nerves (shinkei) on the test!! (2 slips) Ren County (Renken) seriously investigated something (no, what even is Ren County!?)!!! (3 slips)

It's as if COTOHA is proposing pun candidates! I see, so this is how you're supposed to use it! (wrong)

Detecting puns properly

So, I got some pun-ish results, but actually discovering puns still looks difficult. That said, I had a feeling the output of the parsing API could be put to good use.

The parsing API's output is long, so I'll omit it... You can try it out on the demo page.

My idea: between the clauses this parser separates, find the longest stretches whose vowel sequences match; wouldn't that be a pun? Something like that.
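As a toy illustration (my own example, not the author's code), stripping everything but the vowels from romanized text makes such a shared run visible:

import re

# katana no ('of the katana') vs. katta naa to ('that I bought it')
print(re.sub('[^aiueo]', '', 'katanano'))    # -> aaao
print(re.sub('[^aiueo]', '', 'kattanaato'))  # -> aaaao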

I figured I'd just try it, but how do you "find the longest stretch with the same vowels"? Maybe convert everything to romaji, extract [aiueo], and brute-force it... Let me search... I see, there's pykakasi, a Python implementation of KAKASI. Alright, let's try the conversion!

Try using pykakasi
import json
from pykakasi import kakasi

kakasi = kakasi()  # generate a kakasi instance
kakasi.setMode("H", "a")  # Hiragana to ascii
kakasi.setMode("K", "a")  # Katakana to ascii
kakasi.setMode("J", "a")  # Japanese (kanji) to ascii
kakasi.setMode("r", "Kunrei")  # use Kunrei-shiki romanization
conv = kakasi.getConverter()

res = coto.parse(data, 'default')
j = json.loads(res)

org_sentence = []
ascii_sentence = []
tmp_org_sentence = []
tmp_ascii_sentence = []
org_chunk = ""
kana_chunk = ""
for chunk_info in j['result']:
    is_end = False
    for token in chunk_info['tokens']:
        if token['form'] in ('。', '.', '!'):
            is_end = True
            break
        # accumulate each chunk's surface form and its kana reading
        org_chunk += token['form']
        kana_chunk += token['kana']

    tmp_org_sentence.append(org_chunk)
    tmp_ascii_sentence.append(conv.do(kana_chunk))
    org_chunk = ""
    kana_chunk = ""
    if is_end:
        org_sentence.append(tmp_org_sentence)
        ascii_sentence.append(tmp_ascii_sentence)
        tmp_org_sentence = []
        tmp_ascii_sentence = []

print("org")
print(*org_sentence, sep='\n')
print("ascii")
print(*ascii_sentence, sep='\n')

The result:

org
['Katana', 'Talk', 'Was', 'sometimes', '"Old times', 'I bought it. "', 'In a hurry', 'I was able to return', 'It was good', 'thought']
['I won', 'thought,', 'In the mouth', 'I didn't put it out']
['I bought it', 'After all', 'Not serious']
ascii
['katanano', 'hanasio', 'siteita', 'tokini', 'mukasi', 'kattanaato', 'tossani', 'kaesetanoha', 'yokattato', 'omoimasita']
['kattanato', 'omoimasita', 'kutiniha', 'dasanakattakedo']
['kattanaato', 'ittemo', 'sinkendehaarimasen']

Oh, it decomposed nicely. Now let's try this.

for i, osen in enumerate(org_sentence):
    c = find_dajare_candidate(ascii_sentence[i], osen)  # not really fit to be shown; maybe once it shapes up...
    if not c:
        continue
    dump_candidate(c)
    print('----')

find_dajare_candidate assumes, for now, that every sentence contains a pun; it compares the vowel sequences of the phrases within a sentence and returns the pair with the most matching vowels as candidate data.
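The function itself isn't shown in the article, but based on that description, a minimal sketch might look like this (hypothetical code: it assumes the Kunrei-shiki vowel strings from above and scores by longest common vowel subsequence, which happens to reproduce the scores below; the author's actual scoring may differ, and dump_candidate is likewise guessed from the output format):

import re

def _vowels(chunk):
    # keep only the vowels of a romanized chunk
    return re.sub('[^aiueo]', '', chunk)

def _common_vowels(a, b):
    # longest common subsequence of two vowel strings (simple DP)
    dp = [[''] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + x
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[-1][-1]

def find_dajare_candidate(ascii_chunks, org_chunks, keyword=None):
    # Return the pair of chunks whose vowel sequences overlap the most;
    # with a keyword, compare every chunk against the keyword instead.
    vs = [_vowels(c) for c in ascii_chunks]
    if keyword:
        pairs = [([i], _common_vowels(_vowels(keyword), v))
                 for i, v in enumerate(vs)]
    else:
        pairs = [([i, j], _common_vowels(vs[i], vs[j]))
                 for i in range(len(vs)) for j in range(i + 1, len(vs))]
    best = None
    for idx, common in pairs:
        if best is None or len(common) > best['score']:
            best = {'chunks': [org_chunks[k] for k in idx],
                    'vowel': common, 'score': len(common)}
    return best

def dump_candidate(c):
    for chunk in c['chunks']:
        print(chunk)
    print('vowel: {}  score:{}'.format(c['vowel'], c['score']))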

Katana
I bought it. "
vowel: aaao  score:4
----
I won
I didn't put it out
vowel: aaao  score:4
----
I bought it
Not serious
vowel: aaa  score:3
----

Well, the first one is correct, but the rest are iffy. I mean, the second sentence ['I won', 'I thought', 'I didn't say it'] contains no pun on its own; it only works as a pun through "katana" in the first sentence.

So let's allow passing in a comparison keyword, so we can search while specifying "katana"...

c = find_dajare_candidate(ascii_sentence[1], org_sentence[1], "katana")
dump_candidate(c)
print('----')
c = find_dajare_candidate(ascii_sentence[2], org_sentence[2], "katana")
dump_candidate(c)
I won
vowel: aaa  score:3
----
I bought it
vowel: aaa  score:3

With a keyword, it can be guided to some extent! If I can also compare consonant similarity properly, accuracy should go up. Alright, I'll keep at it!!!!
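As one possible direction (my own sketch, not something from the article), the consonant skeletons of two chunks could be compared with difflib from the standard library:

import difflib
import re

def consonant_similarity(a, b):
    # strip the vowels, then measure how similar the consonant skeletons are
    ca = re.sub('[aiueo]', '', a)
    cb = re.sub('[aiueo]', '', b)
    return difflib.SequenceMatcher(None, ca, cb).ratio()

print(consonant_similarity('katanano', 'kattanaato'))  # ktnn vs. kttnt -> ~0.67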








that…?





**Hang on, COTOHA isn't involved anymore! The crucial part has almost nothing to do with COTOHA!!** (Haraichi style)



End Production / writing          ━━━━━           ⓃⒽⓀ

In conclusion

So, how was the article "Playing with puns ~~using the COTOHA API~~"? Discovering puns seems impossible for COTOHA alone, but it can yield hints about puns and serve as a foothold for analysis.

I hope someone becomes a kid who uses COTOHA. (1 slip)

I've written about my obsession with puns in a crappy Mach Shinsho e-book. Take a look if you're so bored you could die.

Come to think of it, I entered because the present campaign caught my eye, but I'm grateful, because this turned out to be a lot of fun. I'll keep this pun analysis going for a while (though I may be getting tired of it).
