[PYTHON] I want to handle the rhyme part8 (finished once)

__ Content __

This time I revisit the previous fix, the usage, and the count expression. The list of vowel sequences treated as words contained duplicates, so I corrected it so that each sequence appears only once. After that, as last time, I represent each sentence in the text as a binary vector and display the pairs with high cosine similarity. Then I do the same with the count expression.

__ Modifications __

from pykakasi import kakasi
import re
import numpy as np
import pandas as pd
import itertools

with open("./test.txt","r", encoding="utf-8") as f:
    data = f.read()

# Word list: every 2- to 4-letter word made only of vowels (25 + 125 + 625 = 775 words)
word_list2 = ["".join(i) for i in itertools.product("aiueo", repeat=2)]
word_list3 = ["".join(i) for i in itertools.product("aiueo", repeat=3)]
word_list4 = ["".join(i) for i in itertools.product("aiueo", repeat=4)]
word_list = word_list2 + word_list3 + word_list4

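# Split the text into sentences on full-width spaces and newlines
text_data = re.split("\u3000|\n", data)
kks = kakasi()  # use a separate name so the imported kakasi class is not overwritten
kks.setMode('J', 'a')  # kanji -> romaji
kks.setMode('H', 'a')  # hiragana -> romaji
kks.setMode('K', 'a')  # katakana -> romaji
conv = kks.getConverter()
# Convert each sentence to romaji, then keep only the vowels
vowel_text_list = [conv.do(d) for d in text_data]
vowel_text_list = [re.sub(r"[^aeiou]+","",text) for text in vowel_text_list]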

`itertools` is used to build the word list without duplicates. It is also used in the usage part to avoid computing the cosine similarity twice for the same pair, such as (0, 1) and (1, 0).
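As a quick check of the two uses of itertools described above, here is a minimal sketch that uses only the standard library. The sentence count `n_sentences` is a made-up value for illustration and does not come from the test data.

```python
import itertools

# All 2- to 4-letter strings over the five vowels; product never repeats a tuple.
word_list = [
    "".join(p)
    for n in (2, 3, 4)
    for p in itertools.product("aiueo", repeat=n)
]
print(len(word_list))       # 5**2 + 5**3 + 5**4 = 775
print(len(set(word_list)))  # also 775, so there are no duplicates

# combinations yields each unordered pair once: (0, 1) appears, (1, 0) does not.
n_sentences = 4  # hypothetical number of sentences
pairs = list(itertools.combinations(range(n_sentences), 2))
print(pairs)
```

With 4 sentences this gives 6 pairs instead of the 12 ordered pairs a double loop would visit.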

__ Binary representation __

df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# Each column is named after a vowel word such as "aa": 1 if it appears in the sentence, otherwise 0.
binali_dic = {}
for word in word_list:
    binali_dic[word] = [1 if word in vowel else 0 for vowel in vowel_text_list]

for k, v in binali_dic.items():
    df[k] = v

The third and subsequent columns indicate whether each vowel sequence treated as a word, such as "aa", appears in the sentence.
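To make the binary representation concrete, here is a small sketch using plain dictionaries instead of a data frame. The three vowel strings are hypothetical, and only 2-letter vowel words are used to keep the output short.

```python
import itertools

# Hypothetical vowel strings, as if already extracted from three sentences.
vowels = ["aiueo", "aiai", "oe"]
words = ["".join(p) for p in itertools.product("aiueo", repeat=2)]

# One row per sentence: 1 if the vowel word occurs in it, else 0.
rows = {v: {w: int(w in v) for w in words} for v in vowels}

# Show which columns are set to 1 for each sentence.
for v, row in rows.items():
    present = [w for w, flag in row.items() if flag]
    print(v, present)
```

Each row has 25 entries here, and most of them are 0, which is also what the 775-column data frame looks like for short sentences.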

__ How to use __

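# Cosine similarity (returns 0.0 when either vector is all zeros, to avoid dividing by zero)
def cosine_similarity(v1, v2):
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm == 0:
        return 0.0
    return np.dot(v1, v2) / norm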

# Take two row indices of df and return the vowel words that appear in both sentences
def common_vowel(index1, index2):
    idx = df.iloc[index1, 2:].values + df.iloc[index2, 2:].values
    vowel_word = df.columns[2:]
    common_list = [vowel_word[i] for i in range(len(idx)) if idx[i] == 2]
    return common_list

# Cosine similarity ranking: a list of (index, index, cos_sim, common vowel list)
def cos_sim_ranking(df, threshold):
    ranking = []
    idx = itertools.combinations(df.index, 2)
    for i in idx:
        cos_sim = cosine_similarity(df.iloc[i[0]][2:].values, df.iloc[i[1]][2:].values)
        if cos_sim > threshold:
            com_list = common_vowel(i[0], i[1])
            ranking.append((i[0],i[1],cos_sim,com_list))
    return sorted(ranking, key=lambda x:-x[2])

ranking = cos_sim_ranking(df, 0.4)
for r in ranking:
    print(df["Sentence"][r[0]] + ":" + df["Sentence"][r[1]])
    print("Common vowels:{}".format(r[3]))
    print()

For pairs whose cosine similarity exceeds the threshold (an arbitrary value), the original sentences and their common vowel sequences are printed in descending order of similarity. The rhyme can be emphasized by reordering the original sentence, for example by inversion, so that the common vowels come at the beginning or end of the sentence.
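As a worked example of the similarity score, here is a pure-Python sketch of the same formula, the dot product divided by the product of the vector lengths. The three rows and the four column names are hypothetical, and the helper name `cosine` is only for this illustration.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two equal-length numeric sequences (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Hypothetical binary rows over the columns ["ai", "ia", "ue", "oe"].
s1 = [1, 0, 1, 0]
s2 = [1, 1, 0, 0]
s3 = [0, 0, 0, 1]

print(cosine(s1, s2))  # one shared column, two set columns each: about 0.5
print(cosine(s1, s3))  # no shared column: 0.0
```

With a threshold of 0.4, the pair (s1, s2) would be reported and the pair (s1, s3) would not.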

__ Count expression __

df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# Each column is named after a vowel word such as "aa"; the value is the number of occurrences
count_dic = {}
temp = []
for word in word_list:
    for vowel in vowel_text_list:
        temp.append(vowel.count(word))
    count_dic[word] = temp
    temp = []

for k, v in count_dic.items():
    df[k] = v

# Take two row indices of df and return (vowel word, total number of occurrences) pairs
def common_vowel(index1, index2):
    idx = df.iloc[index1, 2:].values + df.iloc[index2, 2:].values
    vowel_word = df.columns[2:]
    common_list = [(vowel_word[i], idx[i]) for i in range(len(idx)) if idx[i] >= 2]
    return common_list

The differences from the binary version are how the data frame is built and that common_vowel now also returns the number of occurrences. The output differs even with the same threshold, and I felt that the count expression, which shows the number of occurrences, is the better of the two.
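One detail of the count expression is worth seeing on a small input: `str.count` counts non-overlapping occurrences. The sketch below compares the binary value, the `str.count` value, and an overlapping count. The helper `count_overlapping` is a hypothetical alternative written for this illustration, not something the code above uses.

```python
import re

def count_overlapping(text, word):
    """Count occurrences of word in text, allowing matches to overlap."""
    return len(re.findall("(?=" + re.escape(word) + ")", text))

vowel = "aaaa"  # hypothetical vowel string
print(int("aa" in vowel))              # binary expression: 1
print(vowel.count("aa"))               # count expression (non-overlapping): 2
print(count_overlapping(vowel, "aa"))  # overlapping variant: 3
```

Which count suits rhyme detection better is a judgment call; the non-overlapping count is what the data frame above contains.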

__ Summary __

The output for the test data was quite satisfactory. A vowel sequence with a total count of 2 or more is treated as a common vowel, but in some cases both occurrences come from a single sentence. This showed that a sentence can rhyme within itself, which was an unexpected gain. At first I tried to handle the text in units that were as long as possible, but I remember giving up because I thought I could not capture sentences that rhyme within themselves. After that I struggled with how to split the text, so it is interesting that I ended up handling sentences without splitting them. That does not mean what I did before was bad, and I'm glad I came to see the advantages and disadvantages of each approach. There may still be minor corrections and improvements, but I found it interesting to represent a sentence by which vowel sequences it contains, so "I want to handle the rhyme" ends here for now.
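If the two cases mentioned above need to be told apart, one possible approach is to look at the per-sentence counts rather than their sum. This is only a sketch under assumptions: `split_common` and the two count dictionaries are hypothetical and are not part of the code in this article.

```python
def split_common(counts1, counts2):
    """Split vowel words into those shared by both sentences and those repeated within one."""
    shared, within = [], []
    for word in counts1:
        c1, c2 = counts1[word], counts2.get(word, 0)
        if c1 >= 1 and c2 >= 1:
            # The word occurs in both sentences: a rhyme across the pair.
            shared.append((word, c1 + c2))
        elif c1 + c2 >= 2:
            # The word occurs twice or more, but only in one sentence.
            within.append((word, c1 + c2))
    return shared, within

# Hypothetical per-sentence counts for three vowel words.
a = {"ai": 1, "ue": 2, "oe": 0}
b = {"ai": 1, "ue": 0, "oe": 1}
shared, within = split_common(a, b)
print(shared)  # [('ai', 2)]
print(within)  # [('ue', 2)]
```

Here "ai" links the two sentences, while "ue" marks a sentence that rhymes within itself.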

__ What I want to do in the future __

I would like to try this count expression on the lyrics of actual rappers and see whether it leads to any discoveries.
