[PYTHON] I want to handle the rhyme part1

__ Introduction __

I like rap, so I have always wanted to share it. Rhyme is an interesting and deep part of it. I tried writing lyrics several times, but they never came out right (they turned into puns, and the story got messy because I was focused on rhyming). So I wondered: if I feed in source material for lyrics, could a program output something "lyric-like"? I will give it a try. ~~ (Did you write love songs or lyrics in your youth? lol) ~~

__ Judgment of rhyme __

Rhyme is not a simple thing: pronunciation changes the sound (a short "so" versus a long "soo"), and so on. This time, two strings are regarded as a "rhyme" if their vowels [a, i, u, e, o] appear in the same sequence. This requires conversion to romaji, which kakasi makes possible (I referred to [this article][link]). For the time being, the input data is a text file containing the lyrics of one song by a certain rapper. ~~ (Speaking of "kakashi" (scarecrow), "Robo or Scarecrow average by KICK THE CAN CREW" comes to mind) ~~

[link]: https://crimnut.hateblo.jp/entry/2018/08/29/180455

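from pykakasi import kakasi
import re

# Read the lyrics; state the encoding explicitly because the text is Japanese
with open("gennama.txt", "r", encoding="utf-8") as f:
    data = f.read()

# Convert hiragana (H), katakana (K), and kanji (J) to romaji (a)
kks = kakasi()  # a separate name avoids shadowing the imported class
kks.setMode('H', 'a')
kks.setMode('K', 'a')
kks.setMode('J', 'a')

conv = kks.getConverter()
romaji = conv.do(data)
# Remove every character that is not a vowel
data_re = re.sub(r"[^aeiou]+", "", romaji)

print(data_re)  # Output: oeauiaau...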
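
The rule above (same vowel sequence means rhyme) can be tried without kakasi on strings that are already romaji. This is only a minimal sketch: `vowel_sequence` and `is_rhyme` are hypothetical helper names that do not appear in the original code, and the sample words are assumptions chosen for illustration.

```python
import re

def vowel_sequence(romaji):
    """Keep only the vowels a, i, u, e, o of a romaji string, in order."""
    return re.sub(r"[^aeiou]+", "", romaji)

def is_rhyme(word_a, word_b):
    """Treat two romaji words as a rhyme when their vowel sequences are identical."""
    return vowel_sequence(word_a) == vowel_sequence(word_b)

print(vowel_sequence("kakashi"))       # aai
print(is_rhyme("kakashi", "hanashi"))  # True  (aai == aai)
print(is_rhyme("kakashi", "sakura"))   # False (aai != aua)
```

An exact match like this is strict, which is one reason a graded score is more useful than a yes/no judgment.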

__ Quantification of rhyme __

I converted the text to romaji with kakasi and extracted only the target vowels [a, i, u, e, o] with re.sub(). Next, I wanted to split the input data by some rule and process the parts where the vowels match, but this is where I got lost. It is hard to pin down what I want to do and what counts as "lyric-like". I kept worrying about the shape of the input data and how to transform it without getting anywhere, so for the time being I decided to think about "quantifying the rhyme". ~~ ("Everyone said that the evolved rhyme is good. In the Ginza Line by YOSHI (Gaki Ranger)" [In] sounds comfortable. I have not taken [n] into account this time) ~~

# Slice the shorter word; every slice that also appears in the other word counts as a "rhyme", and its length is added to the score
def make_score(word_a, word_b):
    # When the lengths are equal, word_a is the one that gets sliced
    if len(word_a) > len(word_b):
        short, long_ = word_b, word_a
    else:
        short, long_ = word_a, word_b
    score = 0
    for i in range(len(short)):
        # Starting j at i + 1 skips empty slices, which never added to the score
        for j in range(i + 1, len(short) + 1):
            if short[i:j] in long_:
                score += j - i
    return score

I think this score expresses the "quantification of rhyme". Scoring only a match at the end of the words would also have been acceptable, but I decided to slice the word instead. ~~ All Japanese verbs end with the sound "u". Rhyming with nothing but verbs is not good. ~~
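
To make the difference between the two ideas concrete, here is a small worked comparison on vowel-only strings. It is a sketch under assumptions: `ending_score` is a hypothetical alternative written for this comparison (it is not in the original code), and `make_score` is restated compactly so the block runs on its own.

```python
def make_score(word_a, word_b):
    """Sum the lengths of all slices of the shorter word found in the other word."""
    short, long_ = (word_b, word_a) if len(word_a) > len(word_b) else (word_a, word_b)
    score = 0
    for i in range(len(short)):
        for j in range(i + 1, len(short) + 1):
            if short[i:j] in long_:
                score += j - i
    return score

def ending_score(word_a, word_b):
    """Hypothetical alternative: length of the common ending of the two words."""
    n = 0
    for ch_a, ch_b in zip(reversed(word_a), reversed(word_b)):
        if ch_a != ch_b:
            break
        n += 1
    return n

# Identical vowel sequences: both scores are at their maximum
print(make_score("aai", "aai"), ending_score("aai", "aai"))      # 10 3
# The shared part sits in the middle of the longer word: only slicing finds it
print(make_score("aai", "oaaiu"), ending_score("aai", "oaaiu"))  # 10 0
# Two words that share only the final "u", as two verbs would
print(make_score("aeu", "aiu"), ending_score("aeu", "aiu"))      # 2 1
```

In the second pair the ending-only score is 0 although the whole sequence "aai" occurs inside the other word, which is the case that slicing is meant to catch.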

__ What to output __

The input data is still undecided, but let's think about the output based on what we have so far. The output is only loosely defined as something "lyric-like". The "quantification of rhyme" can be seen as a similarity between words: given a single word of your choice, the program recommends the words from the data that are most similar to it. The input data is the lyrics of a certain rapper, split at the spaces as they appear on the lyric sheet, and the code below puts everything so far together under these conditions. ~~ I think catchphrases and puns are also rhymes. Most of them are homonyms. A rhyme that lands in the middle of a rap also feels good. I think rhyming strengthens the feeling that something good is being said. ~~

from pykakasi import kakasi
import re

with open("./gennama.txt","r", encoding="utf-8") as f:
    data = f.read()
# The lyrics are separated only by spaces ("hoge hoge hoge ..."), so split() is the only preprocessing
data_sp = data.split()
target_word_origin = "Gennama"

kks = kakasi()  # a separate name avoids shadowing the imported class

kks.setMode('H', 'a')
kks.setMode('K', 'a')
kks.setMode('J', 'a')

conv = kks.getConverter()
# Convert to romaji
target_word = conv.do(target_word_origin)
# Convert word by word so that every index lines up with data_sp
text_data = [conv.do(word) for word in data_sp]
# Leave only vowels
target_word_vo = re.sub(r"[^aeiou]+", "", target_word)
vowel_data = [re.sub(r"[^aeiou]+", "", text) for text in text_data]
# Map each index of vowel_data back to the word before vowel conversion
dic = {k: v for k, v in enumerate(data_sp)}

# Slice the shorter word; every slice that also appears in the other word counts as a "rhyme", and its length is added to the score
def make_score(word_a, word_b):
    # When the lengths are equal, word_a is the one that gets sliced
    if len(word_a) > len(word_b):
        short, long_ = word_b, word_a
    else:
        short, long_ = word_a, word_b
    score = 0
    for i in range(len(short)):
        # Starting j at i + 1 skips empty slices, which never added to the score
        for j in range(i + 1, len(short) + 1):
            if short[i:j] in long_:
                score += j - i
    return score
# Pass the vowel-only data and an arbitrary word. Return [index, score] pairs so the original words can be looked up later.
def get_idx_score(vowel_data, target_word):
    ranking = []
    for i, word_b in enumerate(vowel_data):
        score = make_score(target_word, word_b)
        ranking.append([i, score])
    # Highest score first
    return sorted(ranking, key=lambda x: -x[1])

ranking = get_idx_score(vowel_data, target_word_vo)
print(target_word_origin)
for idx, score in ranking:
    print("Score:" + str(score))
    print("word:" + dic[idx])

I used finished lyrics as the input data, but even when I set target_word to a word from the lyrics, the part that actually rhymes with it does not always come out on top. That is to be expected, since the way the input data is split is rough. Still, having a rough outline in place let me think more concretely about what does not work, and the problems became clear.
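
The ranking behavior can be reproduced on a tiny in-memory list, without kakasi or the lyrics file. This is a sketch under assumptions: the candidate words are already romaji and were picked for illustration, and `vowels` and `recommend` are hypothetical helper names that are not in the original code. It is meant as one way to observe the scoring, not as a diagnosis of the actual lyrics data.

```python
import re

def vowels(romaji):
    """Keep only the vowels a, i, u, e, o, in order."""
    return re.sub(r"[^aeiou]+", "", romaji)

def make_score(word_a, word_b):
    """Sum the lengths of all slices of the shorter word found in the other word."""
    short, long_ = (word_b, word_a) if len(word_a) > len(word_b) else (word_a, word_b)
    score = 0
    for i in range(len(short)):
        for j in range(i + 1, len(short) + 1):
            if short[i:j] in long_:
                score += j - i
    return score

def recommend(target, candidates, top_n=3):
    """Return the top_n (score, word) pairs, highest score first."""
    target_vo = vowels(target)
    scored = [(make_score(target_vo, vowels(word)), word) for word in candidates]
    scored.sort(key=lambda pair: -pair[0])  # stable sort: ties keep input order
    return scored[:top_n]

words = ["hanashi", "takaikabe", "sakura", "neko"]
print(recommend("kakashi", words))
# [(10, 'hanashi'), (10, 'takaikabe'), (2, 'sakura')]
```

Here "takaikabe" (vowels "aaiae") ties with "hanashi" (vowels "aai") because the score does not look at where in the word the match occurs, so a word whose ending does not rhyme can still rank at the top.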

__ Impressions, future development __

I can now see the direction. The input will be a list of my own words, and the output will recommend words from that list that rhyme with a given word, as I did this time. Doing so lets me discover rhymes within my own vocabulary, which I think is interesting. Also, if you want to write lyrics that sound like yourself, I think the core of the lyrics will take shape if you first write down what you want to say without worrying about making it lyrics. Going forward, I need to improve how the input data is split into words and how the rhyme is scored. I want to keep the freedom of using my own words as input (I want to keep dialect and unique phrases), so I will try various things. (Simply extracting words by part of speech does not seem to produce the desired result.)

__ In conclusion __

* The following is a personal memo.

I'm a beginner in programming. After taking an online course, when I tried to make something, I tried to use everything I had learned: scraping the input data, running morphological analysis on it, and, for the output, using finished lyrics, an LSTM, and so on. I got nowhere. As mentioned in the text, the input and output were ill-defined and what I wanted to do was not concrete, so I gave up. I wanted to have at least something I had made, so I tried to borrow someone's code, but I couldn't find anyone doing the same thing. (Someone who wants to write original rhymes was trying to plagiarize ...)

What changed things this time was "content-based filtering" for recommendations, which I came across while reviewing the online course. It suddenly occurred to me that if I used the "quantification of rhyme" as the similarity, I could make recommendations. Actually, I had already been paying attention to similarity, but I couldn't think of how to use it. It seemed easy, so I tried it, but it didn't produce the desired output. That, however, made clear what the problems were. I also realized that the input data could be a list of words, like the lyrics used this time.

What I want to say is that it is important to first build a rough framework like this one, that the data should be something whose output you can roughly predict, and that how to use the finished product can be considered later. Things that seem unnecessary can also turn out to be connected. I want to remember this when my motivation to study drops.
