[PYTHON] I want to handle the rhyme part8 (finished once)

Content

This post covers a fix to the previous code, how to use the result, and a count representation. The part that treats vowel sequences as words produced duplicates, so I corrected it to remove them. After that, as last time, I represent each sentence in the text as a binary vector and display the pairs with high cosine similarity. I then do the same with the count representation.

Modifications

from pykakasi import kakasi
import re
import numpy as np
import pandas as pd
import itertools

with open("./test.txt","r", encoding="utf-8") as f:
    data = f.read()

# Word list: every 2- to 4-letter "word" made only of vowels (775 in total)
word_list2 = [i[0]+i[1] for i in itertools.product("aiueo", repeat=2)]
word_list3 = [i[0]+i[1]+i[2] for i in itertools.product("aiueo", repeat=3)]
word_list4 = [i[0]+i[1]+i[2]+i[3] for i in itertools.product("aiueo", repeat=4)]
word_list = word_list2 + word_list3 + word_list4

text_data = re.split("\u3000|\n", data)
kakasi = kakasi()
kakasi.setMode('J', 'a')
kakasi.setMode('H', 'a')
kakasi.setMode('K', 'a')
conv = kakasi.getConverter()
vowel_text_list = [conv.do(d) for d in text_data]
vowel_text_list = [re.sub(r"[^aeiou]+","",text) for text in vowel_text_list]

`itertools` is used to build the word list without duplicates. It is also used in the Usage section to avoid computing both (0, 1) and (1, 0) when comparing sentences by cosine similarity.
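As a quick check of the two points above, here is a minimal sketch that uses only the standard library. The names `vowels`, `words`, and `pairs` are made up for this illustration; it rebuilds the word list in one expression and shows what `itertools.combinations` yields for three row indices.

```python
import itertools

# Every string of 2 to 4 vowels, in the same order as word_list above.
vowels = "aiueo"
words = ["".join(p) for n in (2, 3, 4) for p in itertools.product(vowels, repeat=n)]

# product never repeats a tuple, so the list contains no duplicates.
n_words = len(words)
n_unique = len(set(words))

# combinations yields each unordered pair once: (0, 1) appears, (1, 0) does not.
pairs = list(itertools.combinations(range(3), 2))

print(n_words, n_unique)
print(pairs)
```

The total is 25 + 125 + 625 = 775, which matches the comment in the code.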

Binary representation

df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# Binary representation: a column such as "aa" is 1 if that vowel sequence appears in the sentence, otherwise 0.
binali_dic = {}
for word in word_list:
    # One 0/1 flag per sentence for this vowel sequence
    binali_dic[word] = [1 if word in vowel else 0 for vowel in vowel_text_list]

for k, v in binali_dic.items():
    df[k] = v

The third and subsequent columns indicate whether each vowel sequence treated as a word, such as "aa", appears in the sentence.
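To make the column layout concrete, here is a small sketch with hand-written inputs. The vowel strings and the three-word list are hypothetical stand-ins for `vowel_text_list` and `word_list`, not data from the test text.

```python
# Hypothetical vowel strings, standing in for vowel_text_list.
vowel_texts = ["aiueo", "oeoe"]
# A tiny stand-in for word_list.
words = ["ai", "oe", "ue"]

# One column per vowel sequence, one 0/1 flag per sentence.
binary = {w: [1 if w in v else 0 for v in vowel_texts] for w in words}

for word, flags in binary.items():
    print(word, flags)
```

Each list in `binary` would become one column of the data frame, with one entry per sentence.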

Usage

#Cosine similarity
def cosine_similarity(v1, v2):
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos_sim

# Given two row indices of df, return the vowel sequences common to both rows
def common_vowel(index1, index2):
    idx = df.iloc[index1, 2:].values + df.iloc[index2, 2:].values
    vowel_word = df.columns[2:]
    common_list = [vowel_word[i] for i in range(len(idx)) if idx[i] == 2]
    return common_list

# Cosine similarity ranking: a list of (index, index, cos_sim, common vowel list)
def cos_sim_ranking(df, threshold):
    ranking = []
    idx = itertools.combinations(df.index, 2)
    for i in idx:
        cos_sim = cosine_similarity(df.iloc[i[0]][2:].values, df.iloc[i[1]][2:].values)
        if cos_sim > threshold:
            com_list = common_vowel(i[0], i[1])
            ranking.append((i[0],i[1],cos_sim,com_list))
    return sorted(ranking, key=lambda x:-x[2])

ranking = cos_sim_ranking(df, 0.4)
for r in ranking:
    print(df["Sentence"][r[0]] + ":" + df["Sentence"][r[1]])
    print("Common vowels:{}".format(r[3]))
    print()

For pairs whose cosine similarity exceeds the threshold (an arbitrary value), the original sentences and their common vowel sequences are printed in descending order of similarity. The rhyme can be emphasized by using inversion on the original sentence to move the common vowels to its beginning or end.
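One edge case may be worth guarding against. Splitting on newlines can produce empty strings, and an empty string gives an all-zero row, so the division in `cosine_similarity` is by zero. The sketch below is a plain-Python variant, not the NumPy version used above, and `safe_cosine_similarity` is a hypothetical helper name. It returns 0.0 for a zero vector, which is an assumption about how such rows should be treated.

```python
import math

def safe_cosine_similarity(v1, v2):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: a sentence with no vowel sequences is similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)

# Two binary rows that share two of their vowel sequences.
a = [1, 1, 0, 1]
b = [1, 0, 0, 1]
sim = safe_cosine_similarity(a, b)   # 2 / (sqrt(3) * sqrt(2))
zero = safe_cosine_similarity([0, 0, 0, 0], b)
print(sim, zero)
```

With the 0.4 threshold used above, the pair `a`, `b` would be reported and the all-zero row would be skipped.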

Count representation

df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# Count representation: a column such as "aa" holds the number of occurrences in the sentence.
count_dic = {}
for word in word_list:
    # str.count counts non-overlapping occurrences
    count_dic[word] = [vowel.count(word) for vowel in vowel_text_list]

for k, v in count_dic.items():
    df[k] = v

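# Given two row indices of df, return the common vowel sequences and their total number of occurrences
# Note: a total of 2 or more also matches a sequence that occurs twice in only one of the sentences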
def common_vowel(index1, index2):
    idx = df.iloc[index1, 2:].values + df.iloc[index2, 2:].values
    vowel_word = df.columns[2:]
    common_list = [(vowel_word[i], idx[i]) for i in range(len(idx)) if idx[i] >= 2]
    return common_list

The differences from the binary version are how the data frame is built and that common_vowel also returns the number of occurrences. The output differs even with the same threshold, and I felt that the count representation, which shows the number of occurrences, works better.
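The difference between the two representations can be seen on a toy input. This is only a sketch: the vowel strings are hypothetical, and the last two checks illustrate behaviours of `str.count` and of the `>= 2` test that follow from the code above.

```python
# Hypothetical vowel strings, standing in for vowel_text_list.
vowel_texts = ["aiaiue", "aiue"]
word = "ai"

# Binary: only whether the sequence appears at all.
binary_row = [1 if word in v else 0 for v in vowel_texts]
# Count: how many times it appears.
count_row = [v.count(word) for v in vowel_texts]

# str.count counts non-overlapping matches, so "aaa" holds "aa" once, not twice.
overlap = "aaa".count("aa")

# A total of 2 can come from one sentence alone: counts 2 and 0 still pass ">= 2".
single_total = "aiai".count(word) + "ue".count(word)

print(binary_row, count_row, overlap, single_total)
```

In the binary case both sentences look identical for "ai", while the count row keeps the fact that the first sentence repeats it.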

Summary

The output on the test data was quite satisfactory. A vowel sequence with a total count of 2 or more is treated as common, but in some cases the count of 2 comes from a single sentence. This showed that a sentence can rhyme within itself, which was an unexpected gain. At first I tried to handle the text in units as long as possible, but I remember giving up because I could not capture "a sentence that rhymes within itself." After that I worried about how to divide the text, so it is interesting that I ended up handling the sentences without dividing them. That does not mean what I did before was wasted, and I am glad I came to understand the advantages and disadvantages of each approach. There may still be minor corrections and improvements, but I found it interesting to represent a sentence by "what kind of vowel sequences it has," so "I want to handle the rhyme" ends here for now.

What I want to do in the future

I would like to try this count representation on actual rappers' lyrics and see whether it leads to any discoveries.
