[PYTHON] I want to handle the rhyme part6 (organize once)

__ Content __

This post takes stock of what I have done so far, organizes my thoughts, and sets out the direction from here, introducing the approach that currently seems most suitable. The theme is to bring "rhyme" closer to everyday life, so I thought it would be fun to find rhymes lurking in the words I normally use. I considered something like a rhyme-search function, but rejected the idea: everyone would end up riding the same rhymes and lose their individuality. Instead, this is an attempt to look at whatever is entered, things you usually think about, things you talk about, even a bulleted list of such things, from the perspective of "lyrics".

__ Data preprocessing __

from pykakasi import kakasi
import re
import numpy as np

with open("./gennama.txt","r", encoding="utf-8") as f:
    data = f.read()
    
kakasi = kakasi()
kakasi.setMode('J', 'K')
kakasi.setMode('H', 'K')
conv = kakasi.getConverter()
#Convert everything to katakana. (Mainly to get readings for the kanji; it also makes the e-i / o-u expansion below easier.)
text_data = conv.do(data)

#Get text with long vowels expanded: e-i → ee, o-u → oo
def expansion(text_data):
    text_data_len = len(text_data)
    #Put a space between doubled kana so each one gets processed below
    text_data = text_data.replace("イイ", "イ イ").replace("ウウ", "ウ ウ")
    text_data = text_data.split("イ")
    new_text_data = []
    kakasi.setMode('K', 'a')
    conv = kakasi.getConverter()
    #A "イ" after an e-row kana is a long vowel (ei → ee); otherwise keep i
    for i in range(len(text_data)):
        if len(text_data[i]) > 0:
            if ("e" in conv.do(text_data[i][-1])):
                new_text_data.append(text_data[i] + "e")
            else:
                new_text_data.append(text_data[i] + "i")

    #Likewise, a "ウ" after an o-row kana is a long vowel (ou → oo); otherwise keep u
    text_data = "".join(new_text_data).split("ウ")
    new_text_data = []
    for i in range(len(text_data)):
        if len(text_data[i]) > 0:
            if ("o" in conv.do(text_data[i][-1])):
                new_text_data.append(text_data[i] + "o")
            else:
                new_text_data.append(text_data[i] + "u")
    return "".join(new_text_data)[:text_data_len]

#Whitespace is assumed to mark meaningful boundaries, so split on it before taking N-grams.
def ngram(text_data, N):
    #Split on full-width spaces and line breaks
    text_data = re.split("\u3000|\n", text_data)
    ngram = []
    for text in text_data:
        if len(text) < N:
            ngram.append(text)
        else:
            for i in range(len(text)-N+1):
                row = "".join(text[i:i+N])
                ngram.append(row)
    return ngram
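To make the splitting concrete, here is what `ngram` does, replayed step by step on a toy ASCII string (the real input is katakana and N = 5):

```python
import re

text = "abcde\nfg"   # stand-in for the katakana text
N = 3
#Same procedure as ngram above: split on full-width spaces / newlines,
#then take overlapping N-character windows within each chunk
chunks = re.split("\u3000|\n", text)
grams = []
for chunk in chunks:
    if len(chunk) < N:
        grams.append(chunk)   # short chunks are kept whole
    else:
        grams.extend(chunk[i:i+N] for i in range(len(chunk) - N + 1))
print(grams)   # → ['abc', 'bcd', 'cde', 'fg']
```

Note that a chunk shorter than N survives as-is rather than being dropped, so short interjections still become nodes later.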

#Splitting and transforming text
ex_text_data = expansion(text_data)
n = 5
n_text_data = ngram(text_data, n)
n_ex_text_data = ngram(ex_text_data, n)

dic_text_data = {k:v for k,v in enumerate(n_text_data)}
kakasi.setMode('K', 'a')
conv = kakasi.getConverter()
#Convert to romaji, keep only the vowels, and key each entry by a serial number to make a dictionary.
vowel_text_data = [conv.do(t) for t in n_text_data]
vowel_text_data = [re.sub(r"[^aeiou]+","",text) for text in vowel_text_data]
vowel_ex_data = [conv.do(t) for t in n_ex_text_data]
vowel_ex_data = [re.sub(r"[^aeiou]+","",text) for text in vowel_ex_data]
dic_vo_t = {k:v for k,v in enumerate(vowel_text_data)}
dic_vo_ex = {k:v for k,v in enumerate(vowel_ex_data)}

Here, all the text is first converted to katakana. Reading the kanji (the reading depends on the dictionary's ability) increases the variety of possible splits, and the `expansion` step stretches the rhymes, making them easier to catch. I think N-character division is the best way to split: a pair like "many (ooi)" and "far (ooi)" is easy to come up with, but pairs such as "technology (~~eu~~ooi)" and "the view (ooi~~aa~~)" should be hard to think of by hand, and N-grams can cover parts that part-of-speech-based segmentation might overlook. After the division, the text is converted to romaji with only the vowels kept; since every entry is keyed by a serial number, that number acts like a word ID through which all three dictionaries can be accessed.
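As a toy illustration of the vowel-ID step (the romaji readings below are invented for the example, not produced by pykakasi):

```python
import re

#Hypothetical romaji readings for three N-grams (illustrative only)
readings = ["toritai", "nokoshitai", "kokoro"]

#Keep only the vowels, as in the preprocessing above
vowels = [re.sub(r"[^aeiou]+", "", r) for r in readings]
print(vowels)      # → ['oiai', 'ooiai', 'ooo']

#Serial numbers become the word IDs shared across the dictionaries
dic_vo = {k: v for k, v in enumerate(vowels)}
print(dic_vo[1])   # → 'ooiai'
```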

__ How to catch the rhyme __

#Slice the shorter word; if a slice (length >= 2) appears in the other word,
#treat it as a "rhyme" and add its length to the score.
def create_weight(word_a, word_b):
    if word_a == word_b:
        return 0
    short, long_ = (word_b, word_a) if len(word_a) > len(word_b) else (word_a, word_b)
    weight = 0
    for i in range(len(short)):
        for j in range(i + 2, len(short) + 1):
            if short[i:j] in long_:
                weight += j - i
    return weight
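To make the scoring concrete, here is the same matching logic repackaged as a standalone function (so the example can be run by itself); the vowel strings are invented:

```python
def rhyme_weight(word_a, word_b):
    #Same scoring as create_weight above: every substring of the shorter
    #word with length >= 2 that also occurs in the longer word adds its
    #length to the score.
    if word_a == word_b:
        return 0
    short, long_ = sorted((word_a, word_b), key=len)
    return sum(j - i
               for i in range(len(short))
               for j in range(i + 2, len(short) + 1)
               if short[i:j] in long_)

#"oi", "oia", "oiai", "ia", "iai", "ai" all occur in "ooiai": 2+3+4+2+3+2
print(rhyme_weight("oiai", "ooiai"))   # → 16
print(rhyme_weight("ooo", "aiu"))      # → 0 (no shared substring)
```

Longer shared vowel runs contribute quadratically (every sub-slice of a match also matches), which is what makes long rhymes dominate the score.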

#Pass in the three dictionaries and build (node, node, weight) triples: the index is the node, create_weight supplies the edge weight.
def create_edge_list(dic_text_data, dic_vo_t, dic_vo_ex):
    edge_list = []
    for i in dic_text_data.keys():
        for j in dic_text_data.keys():
            text_weight = create_weight(dic_text_data[i],dic_text_data[j])
            vo_weight = create_weight(dic_vo_t[i],dic_vo_t[j])
            ex_weight = create_weight(dic_vo_ex[i],dic_vo_ex[j])
            #The mixing ratio here could be tuned.
            weight = text_weight*1.0 + vo_weight*1.0 + ex_weight*1.0
            #A weight threshold might help here; skipping pairs whose indices i and j are close might help too.
            if weight != 0:
                edge_list.append((i,j,weight))
    return edge_list

edge_list = create_edge_list(dic_text_data, dic_vo_t, dic_vo_ex)

I think the vowel-matching approach I used first is still the best one. For example, to pull "iu", "ue", and "iue" out of "aiueo", it is not enough to check vowel matches only from the front or the back. Initially I planned to compare the input data against an arbitrary target_word, but decided it was better to stay within the input data. Since what we are doing here is examining the relationships and similarity between the divided words, a graph is the natural view. By using all three dictionaries for the weight, surface matches (consonants included), vowel matches, and vowel matches after rhyme expansion are added together.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_weighted_edges_from(edge_list)
plt.figure(figsize=(20,20))  #Create the figure before drawing, or an empty one is shown
pos = nx.spring_layout(G)
nx.draw_networkx_edges(G, pos)
plt.show()

(Figure: spring-layout drawing of the rhyme graph's edges)

As you can see, it is impossible to tell what is going on. Because the text is split into N characters, adjacent words (nodes) are inevitably connected. What I have in mind now is what the comments inside create_edge_list hint at: thresholding the weight and skipping near-neighbour pairs. Trying clustering on this graph for the time being:

from networkx.algorithms.community import greedy_modularity_communities

com = list(greedy_modularity_communities(G, weight="weight"))
for i in range(len(com)):
    if len(com[i]) > 2:
        p = com[i]
        print("community:"+str(i))
        for k in p:
            print(dic_text_data[k])

There are some good groupings, but it is frustrating to see others that are slightly off. With modularity-based clustering, connections within each community are strong (they can rhyme with each other) and connections between communities are weak (they cannot rhyme with other communities), so it feels like each community could supply the phrases for one rhyming passage. For exactly that reason, I cannot shake the feeling that there are too many edges.
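To see what modularity clustering does on a clean case, here is a toy graph: two tight triangles joined by one weak bridge, with invented weights standing in for create_weight scores:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_weighted_edges_from([
    (0, 1, 5), (1, 2, 5), (0, 2, 5),   # community A: strong internal edges
    (3, 4, 5), (4, 5, 5), (3, 5, 5),   # community B: strong internal edges
    (2, 3, 1),                          # weak bridge between them
])
com = list(greedy_modularity_communities(G, weight="weight"))
#The two triangles come out as separate communities
print([sorted(c) for c in com])
```

This is the ideal the text above describes: strong within, weak between. The frustration with the real data is that overlapping N-grams manufacture strong edges that have nothing to do with rhyme.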

__ Summary __

Matching against dic_text_data is probably unnecessary; if it is kept, it would be better to add weight when the vowels match and again when the consonants match. Clustering and N-character division seem to be a poor fit, yet I felt that clustering may still be the ultimate goal. As future policy, I will reconsider the division method, look for improvements such as only connecting nodes that are within a certain distance, and, since division keeps causing trouble, also think about a method that compares the text without dividing it at all. Either way, I cannot shake the feeling of taking one step forward and one step back...
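One of the improvement ideas above, keeping an edge only when it clears a weight threshold and the two N-grams are not near-neighbours, can be sketched as follows (the edge list and threshold are invented; real entries come from create_edge_list as (i, j, weight) triples):

```python
n = 5            # N-gram length: windows with |i - j| < n overlap in the text
threshold = 4.0  # hypothetical minimum weight for an edge to survive
sample_edges = [(0, 1, 9.0), (0, 7, 6.0), (2, 40, 3.0), (10, 30, 12.0)]

#Drop overlapping-neighbour edges and weak edges in one pass
pruned = [(i, j, w) for i, j, w in sample_edges
          if abs(i - j) >= n and w >= threshold]
print(pruned)   # → [(0, 7, 6.0), (10, 30, 12.0)]
```

The (0, 1) edge is strong only because the two windows share four of their five characters; pruning by index distance removes exactly that kind of artifact before clustering.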
