[PYTHON] I want to handle the rhyme part4

__ Content __

Continuing the flow so far, this part performs a network analysis: each segment of the divided input text becomes a node, and the degree of vowel matching between segments becomes the edge weight. The goal is to draw the graph and look at node centrality.

__ Divide the data and make a dictionary __

from pykakasi import kakasi
import re
from collections import defaultdict
from janome.tokenizer import Tokenizer

with open("./gennama.txt","r") as f:
    data = f.read()
    
tokenizer = Tokenizer()
tokens = tokenizer.tokenize(data)

surface_list = []
part_of_speech_list = []
for token in tokens:
    surface_list.append(token.surface)
    part_of_speech_list.append(token.part_of_speech.split(",")[0])
    
segment_text = []
for i in range(len(surface_list)):
    # janome returns Japanese part-of-speech tags
    if part_of_speech_list[i] == "記号":  # symbol
        continue
    elif part_of_speech_list[i] in ("助詞", "助動詞"):  # particle / auxiliary verb
        # attach particles and auxiliaries to the preceding segment
        row = segment_text.pop(-1) + surface_list[i]
    else:
        row = surface_list[i]
    segment_text.append(row)

kakasi = kakasi()  # note: this rebinds the name kakasi from the class to an instance

# legacy pykakasi API: convert hiragana (H), katakana (K), and kanji (J) to romaji (a)
kakasi.setMode('H', 'a')
kakasi.setMode('K', 'a')
kakasi.setMode('J', 'a')

conv = kakasi.getConverter()
text_data = [conv.do(text) for text in segment_text]
vowel_data = [re.sub(r"[^aeiou]+","",text) for text in text_data]
# vowel_data maps index to vowel string, e.g. {0: "oea"}
dic_vo = {k: v for k, v in enumerate(vowel_data)}
# dictionary to look up the pre-conversion text by the same index, e.g. {0: "I am"}
dic = {k: v for k, v in enumerate(segment_text)}
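As a quick sanity check, the vowel-extraction step alone can be run on already-romanized strings (pykakasi is skipped here; the romaji samples below are assumed, stand-in output of `conv.do()`):

```python
import re

# assumed romanized segments (stand-ins for what conv.do() might return)
samples = ["orehaomaeno", "arigatou", "kanashimi"]

# strip everything except vowels, exactly as in vowel_data above
vowels = [re.sub(r"[^aeiou]+", "", s) for s in samples]
print(vowels)  # ['oeaoaeo', 'aiaou', 'aaii']
```

Each segment is reduced to its vowel sequence, which is what the matching below operates on.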

This reuses what was done in part 3; N-grams were judged unsuitable this time. There is one node per key of dic_vo, and two nodes are connected when their vowel strings share a matching substring; the longer the match, the larger the weight. The scoring method from part 1 is reused, but modified so that identical strings (including a node matched against itself) produce no edge, and only matches of two or more characters count.

__ Make a graph __

#Pass dic_vo; each index becomes a node and the matching score becomes the edge weight, producing (node, node, score) tuples.
def create_edge(dic_vo):
    node_len = len(dic_vo)

    edge_list = []
    for i in range(node_len):
        for j in range(i + 1, node_len):  # undirected graph, so visit each pair once
            score = create_weight(dic_vo[i], dic_vo[j])
            if score != 0:
                edge_list.append((i, j, score))
    return edge_list
            
def create_weight(word_a, word_b):
    # sum the lengths of all substrings (length >= 2) of the shorter
    # string that also occur in the longer one; identical strings score 0
    weight = 0
    if word_a == word_b:
        return weight
    short, long_ = (word_b, word_a) if len(word_a) > len(word_b) else (word_a, word_b)
    max_len = len(short)
    for i in range(max_len):
        for j in range(i + 2, max_len + 1):  # substrings of length >= 2 only
            if short[i:j] in long_:
                weight += len(short[i:j])
    return weight

edge_list = create_edge(dic_vo)            
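To see what the scoring actually computes, here is a standalone sketch of the same substring-weighting idea (the function name `substring_score` is mine; identical strings score 0, and only common substrings of length ≥ 2 count):

```python
def substring_score(word_a, word_b):
    # sum of lengths of all substrings (length >= 2) of the shorter
    # string that also occur in the longer one; identical strings -> 0
    if word_a == word_b:
        return 0
    short, long_ = sorted((word_a, word_b), key=len)
    score = 0
    for i in range(len(short)):
        for j in range(i + 2, len(short) + 1):
            if short[i:j] in long_:
                score += len(short[i:j])
    return score

print(substring_score("aiueo", "aiu"))  # "ai", "iu", "aiu" all match -> 2 + 2 + 3 = 7
print(substring_score("aaa", "aaa"))    # identical strings -> 0
```

Longer shared vowel runs contribute quadratically more than short ones, so segments that rhyme over many syllables get heavy edges.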

After that, draw a graph based on this edge_list. Then find the nodes with the highest eigenvector centrality and betweenness centrality, and display their original text.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_weighted_edges_from(edge_list)
pos = nx.spring_layout(G)
nx.draw_networkx_edges(G, pos)

plt.show()

#Eigenvector centrality
cent = nx.eigenvector_centrality_numpy(G)
max_cent_node = max(list(cent.keys()), key=lambda val: cent[val])
#Betweenness centrality (communicability-based; this function takes no weight argument)
between_cent = nx.communicability_betweenness_centrality(G)
max_betw_node = max(list(between_cent.keys()), key=lambda val: between_cent[val])

print("High eigenvector centrality:" + dic[max_cent_node])
print("High betweenness centrality:" + dic[max_betw_node])
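Eigenvector centrality itself is just power iteration on the (weighted) adjacency matrix; networkx's `eigenvector_centrality_numpy` does this with better numerics, but a minimal stdlib-only sketch on an assumed 3-node toy graph shows the idea:

```python
import math

# toy weighted adjacency matrix: node 0 is strongly tied to both 1 and 2
A = [[0, 3, 3],
     [3, 0, 1],
     [3, 1, 0]]
n = len(A)

x = [1.0] * n
for _ in range(100):  # power iteration: x <- A x / ||A x||
    x = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]

central = max(range(n), key=lambda i: x[i])
print(central)  # node 0 has the largest eigenvector component
```

A node scores high not just by having many edges but by being connected, with heavy weights, to other high-scoring nodes, which is why it surfaces the segment that rhymes with the most "rhyme-rich" segments.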

As expected, the result is the same as the "narrowing down target_word" done in part 2. That is natural, since it is doing the same thing, but networkx seems to offer more that can be done with this graph, so I will keep pursuing it.

__ Plan from now on __

When scoring, I want to pay attention to "i" and "u": if the preceding vowel is "e" or "o", that is, for the sequences "ei" and "ou", convert them to "ee" and "oo" before checking vowel matches. These pairs are hard to distinguish even in Japanese (as with the pronunciation of loanwords), and can be treated as the same sound. It may be considered NG in how the rap world treats rhymes, but I will try it. ~~By the way, have you ever heard the real "ABC song"? It runs "LMNOP" together as "elemeno-P" in one breath. That sense of rhythm has been familiar to me since childhood. Without forgetting my respect for Japanese rap, I will try to loosen the rhyme a little.~~
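The planned normalization could look like this (the helper name `normalize_long_vowels` is mine, not from the series yet): rewrite "ei" to "ee" and "ou" to "oo" in the vowel strings before matching.

```python
def normalize_long_vowels(vowel_str):
    # treat e+i as a long "ee" and o+u as a long "oo"
    return vowel_str.replace("ei", "ee").replace("ou", "oo")

print(normalize_long_vowels("aiaou"))  # vowels of "arigatou" -> "aiaoo"
print(normalize_long_vowels("eia"))    # -> "eea"
```

Applying this to vowel_data before create_edge would make, e.g., "-ou" endings rhyme with "-oo" endings that the current exact matching misses.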
