[PYTHON] Let's use COTOHA and get a good message from Pokemon ▼

Welcome to COTOHA's World! ▼

This article was written
after seeing the project below
               ▼
I gave it a try,
and this is the result
               ▼

[Qiita x COTOHA API present plan] Let's analyze text with COTOHA API!

What is a Pokemon nice message? ▼

This is the message display used on the screen of the famous game "Pokemon" (published by The Pokémon Company); the screenshot is from the game screen of Pokemon Red and Green. The point is that I want to reproduce this. (Why? Just because I like Pokemon. Congratulations on the 24th anniversary on February 27, 2020!)

This Pokemon-like message display has two features [^1]:

--Kanji is not used (with some exceptions)
--Words are separated by spaces

This time, we aim to use COTOHA to convert arbitrary Japanese sentences into messages with this kind of atmosphere.

[^1]: In recent titles kanji may be displayed, but this article tries to match the specifications of "Pokemon Red and Green", the so-called first generation.

Result sample

$ python3 poke_msg.py "Pokemon, shortened to Pokemon. The mysterious and mysterious creature of this star. You can see Pokemon in the sky, mountains, and the sea everywhere." 
Pokemon
               ▼
Small Pokemon
               ▼
This hoshino mysterious mysterious
living things
               ▼
Sora ni Yamani Umi
Pokemon is everywhere
               ▼
You can see that
it can
               ▼

I will make something like this.

Source code

The source code is listed in the GitHub repository below. mkuriki1990/poke_msg - GitHub The license is the MIT License.

What is COTOHA API? ▼

COTOHA API is a natural language processing and speech recognition API developed by NTT Communications that draws on a Japanese dictionary. https://api.ce-cotoha.com/ It can analyze Japanese text and perform parsing, keyword extraction, sentiment extraction, speech recognition, and more. The "for Developers" plan has usage limits, but it is free. I set it up for use from Python by referring to the Qiita article "I tried using the COTOHA API rumored to be easy to handle natural language processing in Python".

At COTOHA, first of all, Kaiseki ▼

I referred to a Qiita article about COTOHA, "Ore Program Ugokas Omae Genshijin Naru".

The code in that article parses the sentence given as an argument and outputs it as caveman-like (?) speech, with particles and similar words dropped. So, as a first step, I removed the part that drops the particles. Also, output consisting only of katakana has no atmosphere, so I use the jaconv library to forcibly convert it to hiragana.

Execution environment

First code

Code lightly modified from the original article:

pokemon_msg.py



import requests
import json
import sys
import jaconv

BASE_URL = "https://api.ce-cotoha.com/api/dev/nlp/"
CLIENT_ID = "Enter the COTOHA Client ID"
CLIENT_SECRET = "Insert COTOHA Client Secret"


def auth(client_id, client_secret):
    token_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8"
    }
    
    data = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))
    return r.json()["access_token"]


def parse(sentence, access_token):
    base_url = BASE_URL
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "sentence": sentence,
        "type": "default"
    }
    r = requests.post(base_url + "v1/parse",
                      headers=headers,
                      data=json.dumps(data))
    return r.json()


if __name__ == "__main__":
    document = "You have now taken the first step to the Kanto region!"
    args = sys.argv
    if len(args) >= 2:
        document = str(args[1])
   
    access_token = auth(CLIENT_ID, CLIENT_SECRET)
    parse_document = parse(document, access_token)
    result_list = list()
    for chunks in parse_document['result']:
        for token in chunks["tokens"]:
            result_list.append(jaconv.kata2hira(token["kana"]))

    print(' '.join(result_list))

result

$ python3 pokemon_msg.py "You have now taken the first step to the Kanto region!"
You are now Fumida, a good luck to Kanto Chiho.

Something like that came out.

More Pokemon, nice fun, ▼

However, actual Pokemon text messages do not have this many space delimiters. I want the text to be split less finely than word by word, so that the final verb stays a single block instead of being broken into pieces. In addition, katakana can be displayed in the game, so I would like words originally written in katakana, such as "Kanto", to be shown as they are.

Change the location of the delimiter space

As mentioned above, there are too many delimiters at the moment. In the game, there basically seems to be a delimiter space after each clause, and often after a proper noun as well. Fortunately, according to the API reference, the COTOHA API returns each clause as a chunk (chunk_info together with its tokens), so the strings within a clause can be joined together. Furthermore, the features of each token identify a proper noun as a part-of-speech subcategory, so I changed the code to add a full-width space only after proper nouns.

result_list = list()
for chunks in parse_document['result']:
    text = ""  # Start each clause with empty text
    for token in chunks["tokens"]:
        word = jaconv.kata2hira(token["kana"])
        # "固有" is the proper-noun feature; the API returns it in Japanese
        if "固有" in token["features"]:
            text += word + "\u3000"  # Add a full-width space
        else:
            text += word
    result_list.append(text)
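
As a rough illustration of the clause-joining rule, here is a sketch that runs on a hand-written stand-in for a parse response, so no API call is needed. The mock data, the helper name `join_clauses`, and the assumption that the proper-noun feature arrives as the string "固有" are mine for illustration and are not taken from the API reference.

```python
FULL_WIDTH_SPACE = "\u3000"


def join_clauses(parse_result):
    """Join the kana of each clause, adding a full-width space after proper nouns."""
    texts = []
    for chunk in parse_result:
        text = ""
        for token in chunk["tokens"]:
            text += token["kana"]
            # Assumption: proper nouns carry "固有" in their features list.
            if "固有" in token["features"]:
                text += FULL_WIDTH_SPACE
        texts.append(text)
    return texts


# Hypothetical, heavily trimmed response: two clauses, one proper noun.
mock_result = [
    {"tokens": [
        {"kana": "カントー", "features": ["固有"]},
        {"kana": "チホウ", "features": []},
        {"kana": "ヘノ", "features": []},
    ]},
    {"tokens": [
        {"kana": "ダイイッポ", "features": []},
        {"kana": "ヲ", "features": []},
    ]},
]

clauses = join_clauses(mock_result)
print(clauses)
```

Each clause becomes one string, and only the proper noun is followed by a space inside a clause.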

Convert only words other than katakana to hiragana

I also want to keep katakana words as they are, so I defined the following function. In COTOHA, each token holds the original word in form and its kana reading in kana. If form and kana match, the word is a katakana word and is returned unchanged; otherwise its reading is returned in hiragana. Passing every token through this function leaves katakana words intact.

def conv_kana(token):
    if token["form"] != token["kana"]:
        word = jaconv.kata2hira(token["kana"])
    else:
        word = token["kana"]
    return word
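
To see the rule in isolation, the sketch below applies it to two hand-written tokens. So that it runs without installing anything, it swaps `jaconv.kata2hira` for a small standard-library stand-in (`kata2hira` here is my own approximation, and the sample tokens are hypothetical rather than real API output).

```python
def kata2hira(text):
    """Approximate stand-in for jaconv.kata2hira using only the standard library."""
    # Katakana ァ (U+30A1) to ヶ (U+30F6) sit exactly 0x60 above their hiragana.
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def conv_kana(token):
    # Identical form and kana means the word was written in katakana.
    if token["form"] != token["kana"]:
        return kata2hira(token["kana"])
    return token["kana"]


# Hypothetical tokens shaped like entries of "tokens" in a parse response.
sample_tokens = [
    {"form": "カントー", "kana": "カントー"},  # written in katakana
    {"form": "地方", "kana": "チホウ"},        # written in kanji
]

converted = [conv_kana(token) for token in sample_tokens]
print(converted)
```

The katakana word passes through untouched, while the kanji word comes back as its hiragana reading.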

For the time being, the code so far

The entire code so far:

pokemon_msg.py



import requests
import json
import sys
import jaconv

BASE_URL = "https://api.ce-cotoha.com/api/dev/nlp/"
CLIENT_ID = "Enter the COTOHA Client ID"
CLIENT_SECRET = "Insert COTOHA Client Secret"


def auth(client_id, client_secret):
    token_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8"
    }

    data = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))
    return r.json()["access_token"]


def parse(sentence, access_token):
    base_url = BASE_URL
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "sentence": sentence,
        "type": "default"
    }
    r = requests.post(base_url + "v1/parse",
                      headers=headers,
                      data=json.dumps(data))
    return r.json()

#Convert only words other than katakana to hiragana
def conv_kana(token):
    if token["form"] != token["kana"]:
        word = jaconv.kata2hira(token["kana"])
    else:
        word = token["kana"]
    return word


if __name__ == "__main__":
    document = "You have now taken the first step to the Kanto region!" #Sample text
    args = sys.argv
    if len(args) >= 2:
        document = str(args[1]) #Replace with sample if there is an argument

    access_token = auth(CLIENT_ID, CLIENT_SECRET)
    parse_document = parse(document, access_token)
    result_list = list()
    for chunks in parse_document['result']:
        text = "" #Have an empty text ready
        for token in chunks["tokens"]:

            word = conv_kana(token)
            if "固有" in token["features"]:  # proper noun
                text += word + "\u3000"  # Add a full-width space
            else:
                text += word

        result_list.append(text)

    print(' '.join(result_list))

result

$ python3 pokemon_msg.py "You have now taken the first step to the Kanto region!"
You have just started to give a good idea to Can Tho Chiho

It has become quite like that.

Make a long bunsho a nice Pokemon message ▼

A character in the original game says the following (I have added the kanji and punctuation myself).

Strong Pokemon, weak Pokemon, such people's selfishness. If you are a really strong trainer, you should do your best to win with your favorite Pokemon.

When this is converted, it becomes as follows.

$ python3 pokemon_msg.py "Strong Pokemon, weak Pokemon, such people's selfishness. If you are a really strong trainer, you should do your best to win with your favorite Pokemon."
Strong Pokemon Yowai Pokemon That kind of person should do his best to be a favorite Pokemon if he is a really strong trainer.

It comes out as one long line. In the Pokemon games the text is wrapped to fit the screen width (16 characters in the Red / Green versions). There also seems to be a line break wherever a punctuation mark occurs. I will adjust for both.

Punctuation processing

Punctuation can also be found by looking inside tokens. If pos is a sentence-ending mark ("句点", kuten) or a comma ("読点"), insert a line break. A message may also contain exclamation marks and question marks such as "!" and "?". Unlike "." and ",", these are displayed in Pokemon messages, so I check features and insert them together with the line break.

result_list = list()
for chunks in parse_document['result']:
    text = ""  # Start each clause with empty text
    for token in chunks["tokens"]:
        word = jaconv.kata2hira(token["kana"])
        if "固有" in token["features"]:  # proper noun
            text += word + "\u3000"  # Add a full-width space
        elif token["pos"] == "句点" or token["pos"] == "読点":  # period or comma
            if "疑問符" in token["features"]:  # question mark
                text += "?\n"
            elif "感嘆符" in token["features"]:  # exclamation mark
                text += "!\n"
            else:
                text += "\n"
        else:
            text += word
    result_list.append(text)
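
The punctuation rule can be tried on its own with mock tokens. This is only a sketch: the helper name `render_clause`, the sample tokens, and the exact pos and features strings ("句点", "読点", "疑問符", "感嘆符") are assumptions about what the API returns.

```python
def render_clause(tokens):
    """Build the text of one clause, turning punctuation into line breaks."""
    text = ""
    for token in tokens:
        if token["pos"] in ("句点", "読点"):
            if "疑問符" in token["features"]:
                text += "?\n"   # question mark is shown, then a break
            elif "感嘆符" in token["features"]:
                text += "!\n"   # exclamation mark is shown, then a break
            else:
                text += "\n"    # plain period or comma: break only
        else:
            text += token["kana"]
    return text


# Hypothetical clauses: one ending in a comma, one ending in a question mark.
comma_clause = [
    {"pos": "名詞", "features": [], "kana": "ぽけもん"},
    {"pos": "読点", "features": [], "kana": ""},
]
question_clause = [
    {"pos": "動詞語幹", "features": [], "kana": "かう"},
    {"pos": "句点", "features": ["疑問符"], "kana": ""},
]

comma_text = render_clause(comma_clause)
question_text = render_clause(question_clause)
print(repr(comma_text), repr(question_text))
```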

Wrap string

The first Pokemon games ran on the Game Boy. Its screen resolution is only 160x144 dots, and Pokemon apparently displays at most 16 characters per line. Text longer than 16 characters is therefore wrapped onto the next line. Rewrite the final join of result_list as follows.

# print(' '.join(result_list))
line = ""
for word in result_list:
    if len(line) == 0:
        line = word
        newLine = line
    else:
        newLine = line + ' ' + word

    if '\n' in newLine:
        if len(newLine) > 16:
            print(line)
            print(word)
        else:
            print(newLine)
        line = ""
    elif len(newLine) <= 16:
        line = newLine
    else:
        print(line)
        line = word

print(line, end='') #Excluding the last line break
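
The same idea can be written as a function that returns the wrapped lines instead of printing them, which makes it easier to check. This is a restructured sketch, not the code from the repository: `wrap_chunks` is a hypothetical name, it measures width without the trailing newline, and a clause longer than the width is left on a line of its own rather than split.

```python
def wrap_chunks(chunks, width=16):
    """Greedy wrap: fill each line up to `width`, break after a trailing newline."""
    lines = []
    line = ""
    for chunk in chunks:
        forced_break = chunk.endswith("\n")
        word = chunk.rstrip("\n")

        if not line:
            candidate = word
        elif not word:
            candidate = line
        else:
            candidate = line + " " + word

        if len(candidate) <= width:
            line = candidate
        else:
            if line:
                lines.append(line)  # current line is full, start a new one
            line = word

        if forced_break:
            lines.append(line)      # punctuation always ends the line
            line = ""
    if line:
        lines.append(line)
    return lines


sample = ["つよい", "ポケモン\n", "よわい", "ポケモン\n",
          "ほんとうに", "つよい", "トレーナーなら", "がんばるべき\n"]
wrapped = wrap_chunks(sample)
for row in wrapped:
    print(row)
```

Returning a list also means a later step, such as paging, can work on lines rather than on printed output.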

Code up to this point

The entire code so far:

pokemon_msg.py



import requests
import json
import sys
import jaconv

BASE_URL = "https://api.ce-cotoha.com/api/dev/nlp/"
CLIENT_ID = "Enter the COTOHA Client ID"
CLIENT_SECRET = "Insert COTOHA Client Secret"


def auth(client_id, client_secret):
    token_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8"
    }

    data = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))
    return r.json()["access_token"]


def parse(sentence, access_token):
    base_url = BASE_URL
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "sentence": sentence,
        "type": "default"
    }
    r = requests.post(base_url + "v1/parse",
                      headers=headers,
                      data=json.dumps(data))
    return r.json()

#Convert only words other than katakana to hiragana
def conv_kana(token):
    if token["form"] != token["kana"]:
        word = jaconv.kata2hira(token["kana"])
    else:
        word = token["kana"]
    return word


if __name__ == "__main__":
    document = "You have now taken the first step to the Kanto region!" #Sample text
    document = "Strong Pokemon, weak Pokemon, such people's selfishness. If you are a really strong trainer, you should do your best to win with your favorite Pokemon." #Sample text
    args = sys.argv
    if len(args) >= 2:
        document = str(args[1]) #Replace with sample if there is an argument

    access_token = auth(CLIENT_ID, CLIENT_SECRET)
    parse_document = parse(document, access_token)
    result_list = list()
    for chunks in parse_document['result']:
        text = "" #Have an empty text ready
        for token in chunks["tokens"]:

            word = conv_kana(token)
            if "固有" in token["features"]:  # proper noun
                text += word + "\u3000"  # Add a full-width space
            elif token["pos"] == "句点" or token["pos"] == "読点":  # period or comma
                if "疑問符" in token["features"]:  # question mark
                    text += "?\n"
                elif "感嘆符" in token["features"]:  # exclamation mark
                    text += "!\n"
                else:
                    text += "\n"
            else:
                text += word

        result_list.append(text)

    line = ""
    for word in result_list:
        if len(line) == 0:
            line = word
            newLine = line
        else:
            newLine = line + ' ' + word

        if '\n' in newLine:
            if len(newLine) > 16:
                print(line)
                print(word)
            else:
                print(newLine)
            line = ""
        elif len(newLine) <= 16:
            line = newLine
        else:
            print(line)
            line = word

    print(line, end='') #Excluding the last line break

result

$ python3 pokemon_msg.py "Strong Pokemon, weak Pokemon, such people's selfishness. If you are a really strong trainer, you should do your best to win with your favorite Pokemon."
Strong Pokemon

Good Pokemon

That kind of person

Really strong
If you are a trainer

I like Pokemon
You should do your best

Oops!

Somehow it has become a message that looks the part!

Advancing the message

Up to this point, the output looks about right. In the original game, however, the text is not shown on the screen all at once: about two lines are displayed at a time, and pressing a button advances the message. I will try to reproduce this. I also display ▼ at the end of each page to indicate that the message can be advanced. I call `input()` so that the program waits for the Enter key before advancing.

line = ""
lineCounter = 0
for word in result_list:
    if len(line) == 0:
        line = word
        newLine = line
    else:
        newLine = line + ' ' + word

    if '\n' in newLine:
        if len(newLine) > 16:
            print(line)
            print(word)
        else:
            print(newLine);
        lineCounter = 2
        line = ""
    elif len(newLine) <= 16:
        line = newLine
    else:
        print(line); lineCounter += 1
        line = word

    if lineCounter >= 2:
        print("               ▼"); input()
        lineCounter = 0

print(line, end='') #Excluding the last line break
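
Paging can also be separated from wrapping. The sketch below assumes the lines have already been wrapped and simply groups them two at a time; unlike the loop above, it does not close a page early at punctuation. The names `paginate` and `show` are hypothetical, and the `wait` and `out` parameters exist only so the function can be exercised without a keyboard.

```python
MARKER = " " * 15 + "▼"  # the prompt shown at the end of each page


def paginate(lines, lines_per_page=2):
    """Group already-wrapped lines into pages of at most `lines_per_page`."""
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]


def show(lines, wait=input, out=print):
    """Print one page at a time and call `wait` (Enter by default) after each."""
    for page in paginate(lines):
        for line in page:
            out(line)
        out(MARKER)
        wait()


# Non-interactive demo: collect the output and skip the key press.
shown = []
show(["ほんとうに つよい", "トレーナーなら", "がんばるべき"],
     wait=lambda: None, out=shown.append)
print(shown)
```

Calling `show(lines)` with the defaults waits for Enter after every page, as in the loop above.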

Code up to this point

The entire code so far:

pokemon_msg.py



import requests
import json
import sys
import jaconv

BASE_URL = "https://api.ce-cotoha.com/api/dev/nlp/"
CLIENT_ID = "Enter the COTOHA Client ID"
CLIENT_SECRET = "Insert COTOHA Client Secret"

def auth(client_id, client_secret):
    token_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8"
    }

    data = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))
    return r.json()["access_token"]


def parse(sentence, access_token):
    base_url = BASE_URL
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "sentence": sentence,
        "type": "default"
    }
    r = requests.post(base_url + "v1/parse",
                      headers=headers,
                      data=json.dumps(data))
    return r.json()

#Convert only words other than katakana to hiragana
def conv_kana(token):
    if token["form"] != token["kana"]:
        word = jaconv.kata2hira(token["kana"])
    else:
        word = token["kana"]
    return word


if __name__ == "__main__":
    document = "You have now taken the first step to the Kanto region!" #Sample text
    document = "Strong Pokemon, weak Pokemon, such people's selfishness. If you are a really strong trainer, you should do your best to win with your favorite Pokemon." #Sample text
    args = sys.argv
    if len(args) >= 2:
        document = str(args[1]) #Replace with sample if there is an argument

    access_token = auth(CLIENT_ID, CLIENT_SECRET)
    parse_document = parse(document, access_token)
    result_list = list()
    for chunks in parse_document['result']:
        text = "" #Have an empty text ready
        for token in chunks["tokens"]:

            word = conv_kana(token)
            if "固有" in token["features"]:  # proper noun
                text += word + "\u3000"  # Add a full-width space
            elif token["pos"] == "句点" or token["pos"] == "読点":  # period or comma
                if "疑問符" in token["features"]:  # question mark
                    text += "?\n"
                elif "感嘆符" in token["features"]:  # exclamation mark
                    text += "!\n"
                else:
                    text += "\n"
            else:
                text += word

        result_list.append(text)

    line = ""
    lineCounter = 0
    for word in result_list:
        if len(line) == 0:
            line = word
            newLine = line
        else:
            newLine = line + ' ' + word

        if '\n' in newLine:
            if len(newLine) > 16:
                print(line)
                print(word)
            else:
                print(newLine);
            lineCounter = 2
            line = ""
        elif len(newLine) <= 16:
            line = newLine
        else:
            print(line); lineCounter += 1
            line = word

        if lineCounter >= 2:
            print("               ▼"); input()
            lineCounter = 0


    print(line, end='') #Excluding the last line break

result

$ python3 pokemon_msg.py "Strong Pokemon, weak Pokemon, such people's selfishness. If you are a really strong trainer, you should do your best to win with your favorite Pokemon."
Strong Pokemon
               ▼
Good Pokemon
               ▼
That kind of person
               ▼
Really strong
If you are a trainer
               ▼
I like Pokemon
You should do your best
               ▼

If you actually run this on the command line, you can advance the message one page at a time with the Enter key, which adds to the atmosphere.

Reigaishori ▼

As I wrote at the beginning, in the Pokemon-like message display

--Kanji is not used (with some exceptions)

There is one exception, however: the "円" (yen) in price displays. Kanji is used only there. Simply exempting the character "円" from conversion would also affect words that happen to contain it, such as "円滑" (smooth), which is a problem. The COTOHA API can identify classifiers (counter words) by part of speech, so I decided to use that to tell them apart.

If you want to use classifiers to judge 〇〇 yen ...?

However, this did not work. Sending the phrase "500 yen" to the parse endpoint, following the COTOHA API reference, returns the following.

$ curl -X POST -H "Content-Type:application/json;charset=UTF-8" -H "Authorization:Bearer [Access Token]" -d '{"sentence":"500 yen","type": "default"}' "[API Base URL]/nlp/v1/parse"
{
  "result" : [ {
    "chunk_info" : {
      "id" : 0,
      "head" : -1,
      "dep" : "O",
      "chunk_head" : 0,
      "chunk_func" : 0,
      "links" : [ ]
    },
    "tokens" : [ {
      "id" : 0,
      "form" : "500 yen",
      "kana" : "Gohyakun",
      "lemma" : "500 yen",
      "pos" : "noun",
      "features" : [ ],
      "dependency_labels" : [ ],
      "attributes" : { }
    } ]
  } ],
  "status" : 0,
  "message" : ""
}

It's a noun ...?

What I expected was for the input to be split into "500" and "yen", with the pos of "yen" being a classifier. With "151 animals" or "10 years old", the counters for animals and for age are properly separated as classifiers, but a price apparently becomes a single noun. I am not well versed in linguistics, so I do not know the reasoning behind this classification, but it was a disappointment.

$ python3 pokemon_msg.py "The secret Pokemon, Magikarp, costs only 500 yen! How do you buy it?"
Secret Pokemon
               ▼
Magikarp
What a hell!
               ▼
How is it?
               ▼

~~Unfortunately, I will give up on the amount.~~ Well, it isn't all that strange as game text, is it?

Use named entity extraction

(Added on 2020/03/12) In the comments, I received the advice that named entity extraction should be used (thanks to @hanamizuno). Indeed, it looks possible to check whether the class in this result is MNY.

$ curl -X POST -H "Content-Type:application/json;charset=UTF-8" -H "Authorization:Bearer [Access Token]" -d '{"sentence":"The secret Pokemon, Magikarp, costs only 500 yen! How do you buy it?","type": "default"}' "[API Base URL]/nlp/v1/ne"
{
  "result" : [ {
    "begin_pos" : 3,
    "end_pos" : 7,
    "form" : "Pokémon",
    "std_form" : "Pokémon",
    "class" : "ART",
    "extended_class" : "",
    "source" : "basic"
  }, {
    "begin_pos" : 8,
    "end_pos" : 13,
    "form" : "Magikarp",
    "std_form" : "Magikarp",
    "class" : "ART",
    "extended_class" : "",
    "source" : "basic"
  }, {
    "begin_pos" : 21,
    "end_pos" : 25,
    "form" : "500 yen",
    "std_form" : "500 yen",
    "class" : "MNY",
    "extended_class" : "",
    "source" : "basic"
  } ],
  "status" : 0,
  "message" : ""
}

First, store every MNY ("monetary expression") element found in the text in a list.

#List words that contain monetary expressions
def make_pricelist(ne_document):

    pricelist = list()

    for result in ne_document['result']:
        if result['class'] == 'MNY':
            pricelist.append(result['form'])

    return pricelist
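
For a quick check without calling the API, the same filtering can be run on a hand-written stand-in for the named entity response. The mock below is hypothetical and trimmed to the two fields the function reads; the list comprehension is just a compact rewrite of the loop above.

```python
def make_pricelist(ne_document):
    """Collect the surface forms of all monetary expressions (class MNY)."""
    return [entity["form"]
            for entity in ne_document["result"]
            if entity["class"] == "MNY"]


# Hypothetical response, reduced to the fields used here.
mock_ne_document = {
    "result": [
        {"form": "ポケモン", "class": "ART"},
        {"form": "コイキング", "class": "ART"},
        {"form": "500円", "class": "MNY"},
    ],
    "status": 0,
    "message": "",
}

prices = make_pricelist(mock_ne_document)
print(prices)
```

Entities of other classes, such as ART, are ignored, so only the price survives.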

I rewrote conv_kana, the function created above that converts non-katakana words to hiragana, as conv_word so that it also handles monetary expressions: it checks the strings stored in this list in order and, for a monetary expression, returns the original word. However, returning a monetary expression written in kanji numerals as it is (fifty yen written as "五十円", for example) would ruin the atmosphere, so such expressions are converted to Arabic numerals. For this I used the `kanjize` library, introduced in a Qiita article about "Kanjize", a Python library for converting between numbers and kanji numerals. The numbers are also output as full-width rather than half-width digits.

# Convert only words other than katakana to hiragana.
# For a monetary expression, convert it while keeping only the kanji "円" (yen).
def conv_word(token, pricelist):

    if len(pricelist) > 0:
        price = pricelist[0]
        if token["form"] == price:
            price = pricelist.pop(0)
            # If it is written in kanji numerals, change it to Arabic numerals.
            if not re.search('[0-9].+', price):
                price = str(kanji2int(price.replace("円", ""))) + "円"

            #Return half-width numbers to full-width numbers
            return jaconv.h2z(price, digit=True, ascii=True)

    if token["form"] != token["kana"]:
        word = jaconv.kata2hira(token["kana"])
    else:
        word = token["kana"]
    return word
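
The two small decisions inside conv_word, namely whether a price already contains Arabic digits and how to widen those digits, can be sketched with the standard library alone. The helper names below are hypothetical, and `to_fullwidth_digits` covers only the ten digits, whereas `jaconv.h2z` with `ascii=True` also widens ASCII letters and symbols. The kanji-numeral conversion itself is left to `kanjize` and is not reproduced here.

```python
import re

# Translation table from half-width to full-width digits.
TO_FULLWIDTH = str.maketrans("0123456789", "０１２３４５６７８９")


def has_arabic_digits(price):
    """True if the price contains at least one half-width Arabic digit."""
    return re.search(r"[0-9]", price) is not None


def to_fullwidth_digits(text):
    """Widen half-width digits, leaving every other character untouched."""
    return text.translate(TO_FULLWIDTH)


print(has_arabic_digits("500円"), has_arabic_digits("五百円"))
print(to_fullwidth_digits("500円"))
```

A price that fails `has_arabic_digits` is the case that would be handed to `kanji2int` before widening.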

Code up to this point

The entire code so far:

pokemon_msg.py


import requests
import json
import sys
import jaconv
import re
from kanjize import int2kanji, kanji2int

BASE_URL = "https://api.ce-cotoha.com/api/dev/nlp/"
CLIENT_ID = "Enter the COTOHA Client ID"
CLIENT_SECRET = "Insert COTOHA Client Secret"


def auth(client_id, client_secret):
    token_url = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8"
    }

    data = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret
    }
    r = requests.post(token_url,
                      headers=headers,
                      data=json.dumps(data))
    return r.json()["access_token"]


def parse(sentence, access_token):
    base_url = BASE_URL
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "sentence": sentence,
        "type": "default"
    }
    r = requests.post(base_url + "v1/parse",
                      headers=headers,
                      data=json.dumps(data))
    return r.json()

def ne(sentence, access_token):
    base_url = BASE_URL
    headers = {
        "Content-Type": "application/json",
        "charset": "UTF-8",
        "Authorization": "Bearer {}".format(access_token)
    }
    data = {
        "sentence": sentence,
        "type": "default"
    }
    r = requests.post(base_url + "v1/ne",
                      headers=headers,
                      data=json.dumps(data))
    return r.json()

# Convert only words other than katakana to hiragana.
# For a monetary expression, convert it while keeping only the kanji "円" (yen).
def conv_word(token, pricelist):

    if len(pricelist) > 0:
        price = pricelist[0]
        if token["form"] == price:
            price = pricelist.pop(0)
            # If it is written in kanji numerals, change it to Arabic numerals.
            if not re.search('[0-9].+', price):
                price = str(kanji2int(price.replace("円", ""))) + "円"

            #Return half-width numbers to full-width numbers
            return jaconv.h2z(price, digit=True, ascii=True)

    if token["form"] != token["kana"]:
        word = jaconv.kata2hira(token["kana"])
    else:
        word = token["kana"]
    return word

#List words that contain monetary expressions
def make_pricelist(ne_document):

    pricelist = list()

    for result in ne_document['result']:
        if result['class'] == 'MNY':
            pricelist.append(result['form'])

    return pricelist
    

if __name__ == "__main__":
    document = "You have now taken the first step to the Kanto region!" #Sample text
    document = "Strong Pokemon, weak Pokemon, such people's selfishness. If you are a really strong trainer, you should do your best to win with your favorite Pokemon." #Sample text
    document = "The secret Pokemon, Magikarp, costs only 500 yen! How do you buy it?" #Sample text
    args = sys.argv
    if len(args) >= 2:
        document = str(args[1]) #Replace with sample if there is an argument

    access_token = auth(CLIENT_ID, CLIENT_SECRET)
    parse_document = parse(document, access_token)
    ne_document = ne(document, access_token)
    pricelist = make_pricelist(ne_document)
    result_list = list()
    for chunks in parse_document['result']:
        text = "" #Have an empty text ready
        for token in chunks["tokens"]:

            word = conv_word(token, pricelist)
            if "固有" in token["features"]:  # proper noun
                text += word + "\u3000"  # Add a full-width space
            elif token["pos"] == "句点" or token["pos"] == "読点":  # period or comma
                if "疑問符" in token["features"]:  # question mark
                    text += "?\n"
                elif "感嘆符" in token["features"]:  # exclamation mark
                    text += "!\n"
                else:
                    text += "\n"
            else:
                text += word

        result_list.append(text)

    line = ""
    lineCounter = 0
    for word in result_list:
        if len(line) == 0:
            line = word
            newLine = line
        else:
            newLine = line + ' ' + word

        if '\n' in newLine:
            if len(newLine) > 16:
                print(line)
                print(word)
            else:
                print(newLine);
            lineCounter = 2
            line = ""
        elif len(newLine) <= 16:
            line = newLine
        else:
            print(line); lineCounter += 1
            line = word

        if lineCounter >= 2:
            print("               ▼"); input()
            lineCounter = 0


    print(line, end='') #Excluding the last line break

result

$ python3 pokemon_msg.py "The secret Pokemon, Magikarp, costs only 500 yen! How do you buy it?"
Secret Pokemon
               ▼
Magikarp
What a mere 500 yen!
               ▼
How is it?
               ▼

It's done.

Summary ▼

Kagaku no Chikara is amazing!
               ▼
Now with PC communication
Send Japanese
               ▼
The result of the analysis is
You can see it
               ▼

To make it more like that

It looks the part now, but there were a few more things I wanted to do.

Processing of classifier "yen"

~~As mentioned above. I think it could be handled by conditional branching in combination with numbers, but I gave up for the time being.~~ (Added on 2020/03/12): This can now be handled using named entity extraction.

Alphanumeric processing

The COTOHA API is capable, so it does a good job of giving Japanese readings. For example, even "COTOHA" is converted into a kana reading. If anything, though, I think alphabetic letters and Arabic numerals are better displayed as they are.
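
One possible way to do this, offered only as an untested idea against the real API, is to return form unchanged whenever it consists solely of ASCII letters and digits and to fall back to the kana conversion otherwise. The helper name `keep_alnum`, the sample tokens, and the assumption that form holds the original alphanumeric spelling are all mine.

```python
import re

ASCII_ALNUM = re.compile(r"[A-Za-z0-9]+")


def keep_alnum(token, fallback):
    """Return form as-is for ASCII alphanumerics, else defer to `fallback`."""
    if ASCII_ALNUM.fullmatch(token["form"]):
        return token["form"]
    return fallback(token)


# Hypothetical tokens; the kana values are illustrative, not real API output.
tokens = [
    {"form": "COTOHA", "kana": "コトハ"},
    {"form": "151", "kana": "ヒャクゴジュウイチ"},
    {"form": "地方", "kana": "チホウ"},
]

kept = [keep_alnum(token, fallback=lambda t: t["kana"]) for token in tokens]
print(kept)
```

In the full script, the fallback would be the existing kana conversion function rather than the lambda used here.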

Katsuyohoho ▼

Combined with speech recognition, it should be possible to convert spoken content into retro-game-style text and overlay it on video. Since the output is basically hiragana and katakana, it might also be useful for children's programs (?).

Link ▼

-COTOHA API --NTT Communications
-Ore Program Ugokas Omae Genshijin Naru --Qiita
-I tried using the COTOHA API rumored to be easy to handle natural language processing in Python --Qiita
-Python number <-> Kanjize mutual conversion library "Kanjize" --Qiita
-Pokemon Red / Green / Blue / Pikachu Digest Video --YouTube: I used it as a reference for displaying the message.
-mkuriki1990 / poke_msg --GitHub: The source code is under the MIT License.

License ▼

This sentence

This pheasant's inyobubunto
Boyto's game is noodles
               ▼
Nozoku
Kantori-Kurabubaiyonten Zerode
Licensed
               ▼

(The text of this article, excluding the quoted parts and the game screen at the beginning, is licensed under CC BY 4.0.) [^2]

[^2]: In the COTOHA API, "CC" seems to be read as "country club". Here, of course, "CC" means "Creative Commons".

Source code

The source code is licensed under the MIT License as described in the GitHub repository. mkuriki1990/poke_msg - GitHub