[PYTHON] "Obake" can be sung by "you"

Introduction

sample

result



"Obake" can be sung by "you"[Degree of similarity:0.24651645]

Main subject

External APIs used to create the tools

Source code

What I made

- ① Rhyming word search tool
  - APIs used
    - COTOHA Similarity Judgment API
- ② Word pool generation tool
  - APIs used
    - COTOHA Parsing API
    - Qiita article list API

① Rhyming word search tool

Role

- A tool that extracts, from a CSV file, the words that rhyme with a specified word
- Optionally judges the similarity between words using the COTOHA API

Figure

1.png

How it works

- Convert the specified word and the words in the CSV file to romaji using pykakasi

converter.py


    def convert_hiragana_to_roma(self, target_word_hiragana):
        # Sokuon (geminate consonant) case:
        # "つ" and "っ" are both converted to the same "tsu",
        # so the special character "x" is used for the sokuon "っ".
        if target_word_hiragana == "っ":
            return "x"
        else:
            kakasi_lib = kakasi()
            # Convert hiragana to romaji
            kakasi_lib.setMode('H', 'a')
            conv = kakasi_lib.getConverter()

            target_word_roma = conv.do(target_word_hiragana)
            return target_word_roma
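
For reference, this is roughly what the conversion produces (a minimal usage sketch; "Converter" is my assumed name for the class holding the method above):


converter = Converter()  # assumed class name for the code above
print(converter.convert_hiragana_to_roma("おばけ"))  # -> "obake"
print(converter.convert_hiragana_to_roma("っ"))      # -> "x" (sokuon placeholder)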

- Extract the vowel pattern from the romaji and check whether the patterns are identical

| Condition | Original word | Before conversion | After conversion |
| --- | --- | --- | --- |
| Vowels only | obake (ghost) | obake | oae |
| Contains a sokuon | ippai (full) | ippai | ixai |
| Contains "n" | sanma (saury) | sanma | ana |
| Contains "-" | sanda- (thunder) | sanda- | anaa |

converter.py


    # Convert the romaji reading into a phoneme pattern
    def convert_roma_to_phoneme_pattern(self, target_char_roma_list):
        pre_phoneme = None
        hit_list = []
        for target_char_roma in target_char_roma_list:
            # Vowel case:
            # any of "a", "i", "u", "e", "o"
            vowel_char = self.__find_vowel_char(
                target_char_roma
            )
            specific_char = self.__find_specific_char(
                pre_phoneme,
                target_char_roma
            )

            if vowel_char:
                hit_list.append(vowel_char)
                pre_phoneme = vowel_char
            elif specific_char:
                # Not a vowel, but still a target character:
                # "っ" (sokuon, represented as "x")
                # "ん" ("n")
                # "ー" (long-vowel mark, "-")
                hit_list.append(specific_char)
                pre_phoneme = specific_char
            else:
                continue

        phoneme_pattern = "".join(hit_list)
        return phoneme_pattern

    def __find_vowel_char(self, char_roma):
        # Check against each vowel
        vowel_list = ["a", "i", "u", "e", "o"]
        for vowel in vowel_list:
            if char_roma.find(vowel) > -1:
                return vowel
            else:
                continue
        # Not a vowel
        return None

    def __find_specific_char(self, pre_phoneme, char_roma):
        # "n" (ん) or "x" (the sokuon っ)
        if char_roma == "n" or char_roma == "x":
            return char_roma
        # "-" (long-vowel mark) is treated the same as the
        # preceding vowel, e.g. "da-" -> "aa"
        elif pre_phoneme != None and char_roma == "-":
            return pre_phoneme
        else:
            return None
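
Putting the pieces together, the rhyme check itself is just an equality comparison of the two phoneme patterns. Below is a minimal end-to-end sketch under that assumption (simplified: the sokuon / "n" / "-" handling above is omitted, and the function is my own):


from pykakasi import kakasi

VOWELS = "aiueo"

# Simplified sketch: hiragana -> romaji -> vowel pattern.
# The sokuon / "n" / "-" handling from the converter above is omitted.
def phoneme_pattern(word_hiragana):
    kakasi_lib = kakasi()
    kakasi_lib.setMode('H', 'a')  # hiragana to romaji (legacy pykakasi API)
    roma = kakasi_lib.getConverter().do(word_hiragana)
    return "".join(char for char in roma if char in VOWELS)

print(phoneme_pattern("おばけ"))  # -> "oae"
print(phoneme_pattern("おまえ"))  # -> "oae"
# Identical patterns, so the two words are judged to rhyme
print(phoneme_pattern("おばけ") == phoneme_pattern("おまえ"))  # True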

Execution example

execute


$cd src
$python main.py おばけ

result


"Obake" can be rhymed with "answer"
"Obake" can linger with "you"

Similarity judgment

- After extracting the rhyming word pairs, the specified word is passed as base_word and the word extracted from the CSV as pool_word for analysis

cotoha_client.py


    def check_score(self, base_word, pool_word, access_token):
        headers = {
            "Content-Type": COTOHA_CONTENT_TYPE,
            "charset": COTOHA_CHAR_SET,
            "Authorization": "Bearer {}".format(access_token)
        }
        # s1 / s2 are the two texts whose similarity is judged
        data = {
            "s1": base_word,
            "s2": pool_word,
            "type": "default"
        }
        req = urllib.request.Request(
            f"{COTOHA_BASE_URL}/{COTOHA_SIMILARITY_API_NAME}",
            json.dumps(data).encode(),
            headers
        )
        # Wait between requests so as not to hammer the API
        time.sleep(COTOHA_REQUEST_SLEEP_TIME)
        with urllib.request.urlopen(req) as res:
            body = res.read()
            return json.loads(body.decode())["result"]["score"]
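
Note that check_score assumes an access_token is already available. COTOHA issues tokens from a per-account token endpoint; the sketch below shows roughly how one could be obtained (the helper name is my own, and the token URL and credentials come from the COTOHA portal):


import json
import urllib.request

# Hypothetical helper: obtain an access token from COTOHA's token
# endpoint. The token URL, client ID, and client secret are issued
# per account on the COTOHA portal.
def get_access_token(token_url, client_id, client_secret):
    data = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret
    }
    req = urllib.request.Request(
        token_url,
        json.dumps(data).encode(),
        {"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as res:
        return json.loads(res.read().decode())["access_token"]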

Execution example

execute


$cd src
$python main.py おばけ

result


"Obake" can be sung with "answer"[Degree of similarity:0.063530244]
"Obake" can be sung by "you"[Degree of similarity:0.24651645]

Issues

- The source CSV file is fixed

2.png

- During development I originally used the noun list bundled with MeCab as the word pool
- I thought it would be more interesting to have a mechanism that grows the number and variety of words, which is how I came up with the word pool generation tool

② Word pool generation tool

Role

- A tool that generates the word pool CSV used by the rhyming word search tool in ①

Schematic

3.png

How it works

- Get the titles of posted articles via Qiita's article list API

qiita_client.py


    def list_articles(self):
        # Request one page of recent articles from the Qiita API
        req = urllib.request.Request(
            f"{QIITA_BASE_URL}/{QIITA_API_NAME}?page={QIITA_PAGE_NUMBERS}&per_page={QIITA_ITEMS_PAR_PAGE}"
        )
        with urllib.request.urlopen(req) as res:
            body = res.read()
            return json.loads(body.decode())
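
The Qiita API returns a JSON array of article objects, each of which has a "title" field, so extracting the titles looks roughly like this ("QiitaClient" is my assumed name for the class above):


client = QiitaClient()  # assumed class name for the code above
articles = client.list_articles()
titles = [article["title"] for article in articles]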

- Split the acquired titles into parts of speech with COTOHA's parsing API

cotoha_client.py


    # Pass the Qiita article title as target_sentence
    def parse(self, target_sentence, access_token):
        headers = {
            "Content-Type": COTOHA_CONTENT_TYPE,
            "charset": COTOHA_CHAR_SET,
            "Authorization": "Bearer {}".format(access_token)
        }
        data = {
            "sentence": target_sentence,
        }
        req = urllib.request.Request(
            f"{COTOHA_BASE_URL}/{COTOHA_PARSE_API_NAME}",
            json.dumps(data).encode(),
            headers
        )
        time.sleep(COTOHA_REQUEST_SLEEP_TIME)
        with urllib.request.urlopen(req) as res:
            body = res.read()
            return json.loads(body.decode())["result"]
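
For reference, the parse result is a list of chunks, each holding a "tokens" list; the only token fields used later are "form", "kana", and "pos". The sample values below are illustrative, not actual API output:


# Illustrative shape of the parse result (not real API output);
# only the fields consumed by finder.py below are shown.
sample_result = [
    {"tokens": [
        {"form": "Python", "kana": "パイソン", "pos": "名詞"},
        {"form": "入門", "kana": "ニュウモン", "pos": "名詞"}
    ]}
]
for chunk in sample_result:
    for token in chunk["tokens"]:
        print(token["form"], token["kana"], token["pos"])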

- Extract only the nouns from the parts of speech and output them to a CSV file

finder.py


    # Extract only the nouns from the parse results and return them as a list
    def find_noun(self, target_sentence_element):
        noun_list = []
        for element in target_sentence_element:
            for token in element["tokens"]:
                target_form = token["form"]
                target_kana = token["kana"]
                target_pos = token["pos"]

                # Store nouns in the list
                if target_pos == TARGET_CLASS:
                    # For English words, numbers, and symbols,
                    # store the reading kana instead of the surface form
                    # TODO: this judgment has room for improvement
                    if re.match(FINDER_REGEX, target_form):
                        noun_list.append(target_kana)
                    else:
                        noun_list.append(target_form)

        return noun_list
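
The extracted noun list is then written out as the word pool CSV. The article mentions pandas was used for this step; below is a minimal sketch under that assumption (the exact column layout in the repository may differ):


import pandas as pd

# Minimal sketch of the CSV output step; pandas is what the author
# mentions using, but the exact layout is assumed.
noun_list = ["backup", "tool", "ABC"]
pd.DataFrame(noun_list).to_csv("word_pool.csv", index=False, header=False)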

Execution example

execute


$cd tool
$python word_pool_generator.py

word_pool.csv


backup
tool
ABC
string
visual
studio
code
Note
management
Expansion
Summary
paper
Commentary

Issues

- Honestly, it's slow: even 40 posted articles take about 5 minutes to process
  - Only about 2 to 5 nouns can be extracted from a single article title
  - That said, it was my first time using pandas for the CSV output, so I think the logic can still be improved
  - For now it's at the level of "something that works"

- Improve the judgment of English words
  - With the current logic, the reading kana stored for "Raspberry Pi" is not the intended one
  - For example, if you could pass just "Raspberry" to the parsing API and have it judged on its own, the results might improve with some care in how words are passed
  - Incidentally, "Google" came out fine

- Increase the variety of words
  - It seems words from other fields could be collected by scraping other sites

Articles that I used as a reference

Other

Why I made this

- The reason I made this in the first place is that about half a year ago I saw this article and had a conversation with a friend along the lines of "Could natural language processing be used to find rhyming words?"
- However, I knew nothing about the field of natural language processing at the time (and still don't), and when I happened to see this project I thought I could make something close to it, so I decided to build this tool.
