[Python] I tried scoring the "too XX, became XX" construction using the COTOHA API

Introduction

Recently, under the influence of a certain video series [^tokkun], I have become addicted to the expression

"It was too XX, and I became XX."

For example: "It was too cold (samu) when I left home this morning, so I became samgyeopsal."

However, it is surprisingly hard to come up with a good word for the "became XX" part in the latter half.

"It was so dazzling (mabushi) that I became a marble chocolate."

There are times when you settle for a near-miss word like this one. As a warning to myself, I wrote a program that uses the COTOHA API to score the "too XX, became XX" construction (I call it a construction out of respect).

Scoring method

The "too XX, became XX" construction has the form

"(adjective stem) too much, ... became (noun)"

The higher the similarity between the adjective stem and the noun, the better the sentence. However, the noun must not be a meaningless word: if the noun part is not a general word, I want to give it 0 points.

For the similarity between the adjective stem and the noun, I use the Levenshtein distance for the time being (see "Levenshtein distance implementation" below). I don't want to deduct points just because the noun is much longer than the stem, as in "It was too cold, so I became samgyeopsal" from the introduction. Therefore, only the first characters of the noun, as many as the adjective stem has, are compared [^tukkomi].
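
As a minimal sketch of the scoring rule described above (not the program itself, which appears later), the following assumes the adjective stem and the noun are already available as katakana readings. The function names `edit_distance` and `score` are hypothetical and exist only for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(stem: str, noun: str, noun_is_unknown: bool = False) -> float:
    """Score from 0 to 100; only the first len(stem) characters of the noun count."""
    if noun_is_unknown or not stem:
        return 0.0
    dist = edit_distance(stem, noun[:len(stem)])
    return 100 * (len(stem) - dist) / len(stem)


print(f"{score('ウマ', 'ウマ'):.1f}")              # 100.0
print(f"{score('マブシ', 'マーブルチョコ'):.1f}")  # 33.3 (only 'マーブ' is compared)
```

Because the noun is cut to the length of the stem, the distance can never exceed the stem length, so the score stays between 0 and 100.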

Example of use

$ echo "I became a horse because I was too horse" | python orochimaru.py
100.0 points
I'm too horsey ... I'm a horse ...

$ echo "It was so dazzling that it became a marble chocolate" | python orochimaru.py 
33.3 points
It's too mabushi, it's become marble chocolate ...

$ echo "It was so funny that it became funny" | python orochimaru.py 
0 points
It's too funny, it's become funny ...

Preparation

Use the COTOHA API to parse the input text and determine whether the noun is a general noun. Create a Developers account on the COTOHA API Portal (https://api.ce-cotoha.com/) and make a note of your Client ID and Client secret. A Python library is published on GitHub, although it is not registered on PyPI; install it as shown below. The library is of course unnecessary if you call the API directly. Note that it currently seems to support only parsing and similarity calculation.

I am using Python 3.6. If you use pyenv or a similar tool, set up the environment accordingly.

$ git clone https://github.com/obilixilido/cotoha-nlp.git
$ cd cotoha-nlp/
$ pip install -e .

You can now use the COTOHA API parsing.
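
If you would rather call the API directly, the requests could be built roughly as follows. This is a sketch under assumptions, not a description of what the library does internally: the two URLs are the ones passed to `Parser` in this article, while the `/v1/parse` path, the JSON field names (`grantType`, `clientId`, `clientSecret`, `sentence`, `result`, `tokens`) and all helper names are assumptions to check against the official reference. Nothing is sent over the network here; the functions only build the requests.

```python
import json
import urllib.request

# Both URLs appear in this article; everything else below is assumed.
TOKEN_URL = "https://api.ce-cotoha.com/v1/oauth/accesstokens"
BASE_URL = "https://api.ce-cotoha.com/api/dev/nlp"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the access-token request. Field names are assumed."""
    body = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret,
    }
    return urllib.request.Request(
        TOKEN_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_parse_request(access_token: str, sentence: str) -> urllib.request.Request:
    """Build (but do not send) a parse request. The /v1/parse path is assumed."""
    return urllib.request.Request(
        BASE_URL + "/v1/parse",
        data=json.dumps({"sentence": sentence}).encode("utf-8"),
        headers={
            "Content-Type": "application/json;charset=UTF-8",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


def flatten_tokens(response: dict) -> list:
    """Collect the tokens of every chunk, assuming {"result": [{"tokens": [...]}]}."""
    return [tok for chunk in response.get("result", [])
            for tok in chunk.get("tokens", [])]
```

To actually send a request, you would pass it to `urllib.request.urlopen` and decode the JSON body of the response.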

Implementation

Levenshtein distance implementation

This simply implements the algorithm described on [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3%E8%B7%9D%E9%9B%A2), so I will omit the details.

levenshtein.py


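def levenshtein_distance(s1, s2):
    """Return the minimum number of single-character edits that turn s1 into s2."""
    l1 = len(s1)
    l2 = len(s2)
    # dp[i][j] is the distance between s1[:i] and s2[:j].
    dp = [[0 for j in range(l2+1)] for i in range(l1+1)]
    for i in range(l1+1):
        dp[i][0] = i
    for j in range(l2+1):
        dp[0][j] = j

    for i in range(1, l1+1):
        for j in range(1, l2+1):
            cost = 0 if (s1[i-1] == s2[j-1]) else 1
            # Cheapest of deletion, insertion, and substitution (or match).
            dp[i][j] = min(dp[i-1][j]+1, dp[i][j-1]+1, dp[i-1][j-1]+cost)

    return dp[l1][l2]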

Parsing the "too XX, became XX" construction

It is a very rough implementation, but if you enter a text that matches "It was too XX, and I became XX", the score and the analyzed text are printed.

orochimaru.py


from cotoha_nlp.parse import Parser
from levenshtein import levenshtein_distance

def find_orochi_sentence(tokens):
    """Print the score and return 0 if tokens match the pattern, otherwise return 1."""
    # Expected surface forms and parts of speech for "(adjective stem) すぎて (noun) になった".
    # COTOHA returns Japanese forms and part-of-speech labels, so the pattern is in Japanese.
    # Index 0 (adjective stem) and index 3 (noun) are checked separately below.
    form_list = ["", "すぎ", "て", "", "に", "な", "っ", "た"]
    pos_list = ["", "形容詞接尾辞", "動詞接尾辞", "", "格助詞", "動詞語幹", "動詞活用語尾", "動詞接尾辞"]
    i = 0
    s1 = ""  # katakana reading of the adjective stem
    s2 = ""  # katakana reading of the noun (possibly several noun tokens)
    is_unknown = False
    for token in tokens:
        if i > 7:
            return 1
        if i == 0:
            if token.pos != "形容詞語幹":
                return 1
            s1 = token.kana
        elif i == 3:
            if token.pos != "名詞":
                return 1
            s2 = token.kana
            if "Undef" in token.features:
                is_unknown = True
        else:
            # A compound noun arrives as consecutive noun tokens; append them to s2.
            if i == 4 and token.pos == "名詞":
                s2 += token.kana
                if "Undef" in token.features:
                    is_unknown = True
                continue
            if not (token.pos == pos_list[i] and token.form == form_list[i]):
                return 1
        i += 1

    # The input ended before the whole pattern was matched.
    if i != 8:
        return 1

    if is_unknown:
        print("0 points")
    else:
        dist = levenshtein_distance(s1, s2[:len(s1)])
        print(f"{(100 * (len(s1) - dist) / len(s1)):.1f} points")
    print(f"It's too {s1} ... it's become {s2} ...")
    return 0

parser = Parser("YOUR_CLIENT_ID",
    "YOUR_CLIENT_SECRET",
    "https://api.ce-cotoha.com/api/dev/nlp",
    "https://api.ce-cotoha.com/v1/oauth/accesstokens"
)
s = parser.parse(input())
if find_orochi_sentence(s.tokens) == 1:
    print('This is not a "too XX, became XX" sentence')

COTOHA API parsing returns morpheme information [^morpheme] for each word; for an unknown word, "Undef" is added to its "features". The program uses this to judge whether the noun part of a "too XX, became XX" sentence is a general noun.
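
The unknown-word check can be isolated as a small sketch. The `Token` class below is a hypothetical stand-in holding only the attributes this check reads (`kana`, `features`); it is not the library's class, the sample tokens are made up, and the "Undef" marker is the behavior observed in this article rather than something the official reference documents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    """Hypothetical stand-in for a parsed token (not the library's class)."""
    kana: str
    features: List[str] = field(default_factory=list)


def noun_is_unknown(noun_tokens: List[Token]) -> bool:
    """True if any token of the noun part carries the "Undef" feature."""
    return any("Undef" in tok.features for tok in noun_tokens)


# Made-up example data for illustration only.
known = [Token("マーブル"), Token("チョコ")]
unknown = [Token("オモシロ", ["Undef"])]
print(noun_is_unknown(known))    # False
print(noun_is_unknown(unknown))  # True
```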

Also, comparing strings that contain kanji runs into the problem of notational variation, so the comparison uses the katakana readings. Consequently, if the COTOHA API assigns a reading different from the one you intended, the sentence is not scored correctly. (Example: It was too spicy and became a face)
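
To illustrate why readings are compared instead of surface forms, here is a small sketch. The readings used in this article are already katakana, so the program does not need this step; the hypothetical `to_katakana` helper only shows how the same word written in hiragana and in katakana fails a naive comparison until both are normalized.

```python
def to_katakana(text: str) -> str:
    """Map hiragana (U+3041..U+3096) to katakana by shifting code points by 0x60."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


print("うま" == "ウマ")                            # False: same reading, different script
print(to_katakana("うま") == to_katakana("ウマ"))  # True after normalization
```

Kanji cannot be normalized this way, which is why a reading supplied by the parser is needed in the first place.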

Some masters of the construction sidestep the problem of not finding a good word by saying "it was too XX, so I became a cedar" (in Japanese, "sugi" means cedar and also appears in "-sugite", "too much"), but that is cheating, so I do not score it.

Conclusion

Now, whenever I come up with a "too XX, became XX" sentence, I can get an objective evaluation.

This time I tried the COTOHA API for morphological analysis, and I found it convenient: it is easy to use and covers quite a lot of words. I also think it is great that, in the "became XX" part, XX is inferred to be a noun even when it is an unknown word. The free version limits the number of API requests (1000 per day), but I think that is no problem for playing around.

Everyone, please try using the "too XX, became XX" construction. Thank you very much.

References

[^tokkun]: Link to YouTube channel.

[^tukkomi]: This is an obvious weak point, and I think there is a more proper way to implement it.

[^morpheme]: Official Reference. There is no mention of "Undef" there, so this approach may eventually stop working ...
