[PYTHON] I want to handle the rhyme part2

__ Content __

Last time I started to wonder whether the way I divide the input data needed improvement, so I tried several division methods. As before, the input data is the lyrics of a certain rapper. For verification, I also have some rhyme material of my own that I had been saving up.

__ Word segmentation __

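import MeCab

# "-Owakati" makes MeCab output the text as space-separated words
mecab = MeCab.Tagger("-Owakati")
# data is the lyrics text (a str); split() turns the output into a list of words
mecab_text = mecab.parse(data).split()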

Some rhymes can be extracted just as they appear in the lyrics, but the rhyme placed on "tens of thousands of yen" cannot be recognized, because word segmentation splits it into "tens of thousands" and "yen".

As an aside, it seems you cannot pass a large amount of data to conv.do of kakasi at once, so I convert one word at a time: text_data = [conv.do(text) for text in mecab_text].

By the way, after word segmentation the longest word converted into vowels was 8 characters, and the average was 2.16 characters. The text is cut too finely, so word segmentation is not well suited to this purpose. ~~I gave up many times before I got MeCab working. I may have overlooked various articles. Thanks to the one that finally helped.~~
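The length figures above can be reproduced with a few lines. The following is only a minimal sketch: the romanized tokens are hypothetical stand-ins for the output of conv.do, the helper names to_vowels and length_stats are mine, and skipping empty strings when averaging is an assumption.

```python
import re

def to_vowels(romaji):
    """Keep only the vowels a, e, i, o, u of a romanized word."""
    return re.sub(r"[^aeiou]+", "", romaji)

def length_stats(vowel_words):
    """Return (max length, average length) over the non-empty vowel strings."""
    lengths = [len(w) for w in vowel_words if w]  # assumption: ignore empty strings
    return max(lengths), sum(lengths) / len(lengths)

# Hypothetical romanized tokens, standing in for [conv.do(text) for text in mecab_text]
romaji_words = ["suuman", "en", "kotoba", "wo", "fumu"]
vowel_words = [to_vowels(w) for w in romaji_words]

print(vowel_words)                # ['uua', 'e', 'ooa', 'o', 'uu']
print(length_stats(vowel_words))  # (3, 2.0)
```

With tokens this short, a rhyme that spans a word boundary ("suuman" + "en") never appears as a single candidate, which is the problem described above.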

__ N-gram __

So what if we simply cut the text into windows of N characters? I will try N from 4 upward.

def make_ngram(words, N):
    ngram = []
    for i in range(len(words)-N+1):
        # Remove full-width spaces and line breaks
        row = "".join(words[i:i+N]).replace("\u3000","").replace("\n","")
        ngram.append(row)
    return ngram
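To see what make_ngram returns, here is a small usage sketch. The function body is copied from above; the sample string is a hypothetical input, not the actual lyrics.

```python
def make_ngram(words, N):
    ngram = []
    for i in range(len(words) - N + 1):
        # Remove full-width spaces and line breaks
        row = "".join(words[i:i+N]).replace("\u3000", "").replace("\n", "")
        ngram.append(row)
    return ngram

# Hypothetical input: 8 characters including one line break
sample = "aiueo\nka"
grams = make_ngram(sample, 4)
print(grams)  # ['aiue', 'iueo', 'ueo', 'eok', 'oka']
```

Note that spaces and line breaks are removed after slicing, so a window that contained one of them ends up shorter than N characters.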

From experience, N seems to work well at 5 or more. (At 4 or less, the scores show no difference.) Rhymes can be detected just as they appear in the lyrics. When I fed in my verification data as a trial, it even detected a rhyme I had not noticed myself. Since the words are now cut out in many different ways, I will also change the way they are scored.

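def make_score_ngram(word_a, word_b):
    # Add i to the score for every suffix length i at which the two vowel strings match
    score = 0
    # i = 0 would compare the whole strings and add nothing, so start from 1
    for i in range(1, len(word_a)):
        if word_a[-i:] == word_b[-i:]:
            score += i
    return score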

Looking at how far the vowels match from the end of the word makes the output easy to read. For the value of N, len(target_word_vo) (the length of the vowel string of the word whose rhymes are being searched for) should work well. I feel I have managed to express what I wanted to do. Still, I went to a lot of trouble to get MeCab working, and I worked out the "quantification of rhyme" in my own way, so I want to make use of both. Let's combine the two.
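Putting the N-gram split and the suffix score together with N = len(target_word_vo) might look like the sketch below. The vowel strings are hypothetical, and ranking only the candidates with a positive score is my own choice for illustration, not necessarily how the original script presents its results.

```python
def make_ngram(words, N):
    ngram = []
    for i in range(len(words) - N + 1):
        row = "".join(words[i:i+N]).replace("\u3000", "").replace("\n", "")
        ngram.append(row)
    return ngram

def make_score_ngram(word_a, word_b):
    score = 0
    # i = 0 contributes nothing, so start from 1
    for i in range(1, len(word_a)):
        if word_a[-i:] == word_b[-i:]:
            score += i
    return score

# Hypothetical vowel strings: the target word and the lyrics after vowel conversion
target_word_vo = "ouae"
lyric_vo = "aiaeouuaeie"

N = len(target_word_vo)
scored = [(gram, make_score_ngram(target_word_vo, gram))
          for gram in make_ngram(lyric_vo, N)]
# Keep candidates that share at least the final vowel, best match first
ranked = sorted([p for p in scored if p[1] > 0], key=lambda p: p[1], reverse=True)
print(ranked)  # [('uuae', 6), ('aiae', 3), ('aeie', 1)]
```

A candidate that shares the last three vowels scores 1 + 2 + 3 = 6, one that shares the last two scores 3, and one that shares only the final vowel scores 1, so longer matches from the end rank higher.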

__ Can target_word be narrowed down?! __

In "quantification of rhyme", I searched for the parts where the vowels match and used the matching length len(word[i:j]) as the score. Each word[i:j] is a string such as "eoi", so counting how often each one occurs should reveal the vowel sequences that appear most often in the input data. The idea is that if you choose a target_word containing such a sequence, you can expect many recommendations. Apologies for going back to word segmentation and the text prepared for verification here.

from pykakasi import kakasi
import re
from collections import defaultdict
import MeCab

with open("./test.txt","r",encoding="utf-8") as f:
    data = f.read()

mecab = MeCab.Tagger("-Owakati")
mecab_text = mecab.parse(data).split()

# Use a separate name so the imported kakasi class is not shadowed
kks = kakasi()

kks.setMode('H', 'a')
kks.setMode('K', 'a')
kks.setMode('J', 'a')

conv = kks.getConverter()
text_data = [conv.do(text) for text in mecab_text]
vowel_data = [re.sub(r"[^aeiou]+","",text) for text in text_data]
dic_vo = {k:v for k,v in enumerate(vowel_data)}
# Build a dictionary so that the word before vowel conversion can be looked up by its index in vowel_data
dic = {k:v for k,v in enumerate(mecab_text)}

# Use defaultdict so new keys need no initialization: {vowel sequence: number of appearances}
dic_rhyme = defaultdict(int)
for word_a in vowel_data:
    for word_b in vowel_data:
        if len(word_a) > len(word_b):
            word_len = len(word_b)
            for i in range(word_len):
                for j in range(word_len + 1):
                    #Only count 2 or more characters
                    if word_b[i:j] in word_a and not len(word_b[i:j])<2:
                        dic_rhyme[word_b[i:j]] += 1
        else:
            word_len = len(word_a)
            for i in range(word_len):
                for j in range(word_len + 1):
                    if word_a[i:j] in word_b and not len(word_a[i:j])<2:
                        dic_rhyme[word_a[i:j]] += 1
#Sort in descending order of count
dic_rhyme = sorted(dic_rhyme.items(), key=lambda x:x[1], reverse=True)
print(dic_rhyme)
# Search for words that contain a sequence near the top of dic_rhyme; here "ai" is used
bool_index = ["ai" in text for text in vowel_data]

for i in range(len(vowel_data)):
    if bool_index[i]:
        print(dic[i])

I was able to get the vowel sequences that appear frequently and output the words where they are used. However, the results were hard to interpret because word segmentation cuts the words so finely. Perhaps there were slightly longer rhymes as well.
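The counting step can also be tried on toy data without MeCab or pykakasi. This is a condensed sketch of the same nested loops: the helper name count_shared_vowel_runs and the romanized words are hypothetical, and the two branches of the original are merged by picking the shorter word first.

```python
import re
from collections import defaultdict

def count_shared_vowel_runs(vowel_data, min_len=2):
    """Count vowel substrings (length >= min_len) of the shorter word found in the longer one."""
    dic_rhyme = defaultdict(int)
    for word_a in vowel_data:
        for word_b in vowel_data:
            # Slice the shorter word and look for each slice in the other word
            if len(word_a) > len(word_b):
                short, other = word_b, word_a
            else:
                short, other = word_a, word_b
            for i in range(len(short)):
                for j in range(i + min_len, len(short) + 1):
                    if short[i:j] in other:
                        dic_rhyme[short[i:j]] += 1
    # Sort in descending order of count
    return sorted(dic_rhyme.items(), key=lambda x: x[1], reverse=True)

# Hypothetical romanized words standing in for the converted lyrics
words = ["ai", "kurai", "konai", "kure"]
vowel_data = [re.sub(r"[^aeiou]+", "", w) for w in words]

ranking = count_shared_vowel_runs(vowel_data)
top = ranking[0][0]
hits = [w for w, v in zip(words, vowel_data) if top in v]
print(ranking[0])  # ('ai', 9)
print(hits)        # ['ai', 'kurai', 'konai']
```

Because every word is also compared with itself and each pair is visited in both orders, the counts are relative indicators rather than the number of distinct rhyming pairs.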

__ Improvement __

I don't feel much need to narrow down target_word (I want to specify whatever I most want to say), but it may be useful to be able to check which vowel sequences are frequent. Word segmentation did not work out this time, but I still want to improve things using MeCab (~~I struggled many times until it became usable~~). Also, since adopting N-grams let me simplify the "quantification of rhyme", I will consider whether "rhyme" can be redefined in a slightly more elaborate way. (Currently, the small "tsu" is not taken into account.)

That said, I took a detour. The "quantification of rhyme" was something I devised in my own way to fit the input data, so that no rhyme would slip through. I never noticed that simply slicing the input data in various ways (the wording may be off) would solve the problem. The basics are important. Still, doesn't it feel as if N-grams can produce meaningless Japanese? Then again, there are ways of delivering a line that emphasize only the rhyming part. In any case, it is important to try things out on simple data first.
