Before moving on, I want to organize what I have done so far, sort out my thoughts, and set a direction for what comes next, and then introduce the approach that currently seems most suitable. The theme is to make "rhyme" feel more familiar, so I thought it would be fun to be able to rhyme with the words I normally use. A rhyme-search function came to mind at first, but with that, everyone would end up following the same rhymes and lose their individuality. Instead, this is an attempt to take whatever is entered, the things you usually think about, the things you talk about, or even just a bulleted list of them, and look at that data from the perspective of "lyrics".
from pykakasi import kakasi
import re
import numpy as np

with open("./gennama.txt", "r", encoding="utf-8") as f:
    data = f.read()

kakasi = kakasi()
kakasi.setMode('J', 'K')  # kanji -> katakana
kakasi.setMode('H', 'K')  # hiragana -> katakana
conv = kakasi.getConverter()

# Convert everything to katakana. (The main purpose is to get readings for the
# kanji; it also makes the e-i / o-u long-vowel conversion easier.)
text_data = conv.do(data)
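Just to make the first step tangible, here is what the katakana normalization does to a made-up sentence (the actual input comes from gennama.txt, and the exact kanji readings depend on the dictionary):

```python
# Kanji and hiragana are both normalized to katakana.
print(conv.do("今日はいい天気"))  # roughly "キョウハイイテンキ"
```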
# Get the converted text where "e-row + イ" becomes "ee" and "o-row + ウ" becomes "oo".
def expansion(text_data):
    text_data_len = len(text_data)
    # Keep doubled イ/ウ from disappearing in the splits below.
    text_data = text_data.replace("イイ", "イi").replace("ウウ", "ウu")
    text_data = text_data.split("イ")
    new_text_data = []
    kakasi.setMode('K', 'a')
    conv = kakasi.getConverter()
    for i in range(len(text_data)):
        if len(text_data[i]) > 0:
            if ("e" in conv.do(text_data[i][-1])):
                new_text_data.append(text_data[i] + "e")
            else:
                new_text_data.append(text_data[i] + "i")
    text_data = "".join(new_text_data).split("ウ")
    new_text_data = []
    for i in range(len(text_data)):
        if len(text_data[i]) > 0:
            if ("o" in conv.do(text_data[i][-1])):
                new_text_data.append(text_data[i] + "o")
            else:
                new_text_data.append(text_data[i] + "u")
    return "".join(new_text_data)[:text_data_len]
# Whitespace is treated as a meaningful boundary, so the text is split on it
# before building N-grams.
def ngram(text_data, N):
    # Split on full-width spaces and line breaks.
    text_data = re.split("\u3000|\n", text_data)
    ngram = []
    for text in text_data:
        if len(text) < N:
            ngram.append(text)
        else:
            for i in range(len(text)-N+1):
                row = "".join(text[i:i+N])
                ngram.append(row)
    return ngram
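Again a quick check with a made-up string (not the article's data): a line shorter than N is kept as is, and a longer one is slid over one character at a time.

```python
print(ngram("コトシ", 5))
# ['コトシ']
print(ngram("コトシハコウイウトシ", 5))
# ['コトシハコ', 'トシハコウ', 'シハコウイ', 'ハコウイウ', 'コウイウト', 'ウイウトシ']
```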
# Split and transform the text.
ex_text_data = expansion(text_data)
n = 5
n_text_data = ngram(text_data, n)
n_ex_text_data = ngram(ex_text_data, n)
dic_text_data = {k: v for k, v in enumerate(n_text_data)}

kakasi.setMode('K', 'a')
conv = kakasi.getConverter()
# Convert to romaji, keep only the vowels, and build dictionaries keyed by a
# serial number so each N-gram gets an ID.
vowel_text_data = [conv.do(t) for t in n_text_data]
vowel_text_data = [re.sub(r"[^aeiou]+", "", text) for text in vowel_text_data]
vowel_ex_data = [conv.do(t) for t in n_ex_text_data]
vowel_ex_data = [re.sub(r"[^aeiou]+", "", text) for text in vowel_ex_data]
dic_vo_t = {k: v for k, v in enumerate(vowel_text_data)}
dic_vo_ex = {k: v for k, v in enumerate(vowel_ex_data)}
Here, the whole text is first converted to katakana. Getting readings for the kanji (how well depends on the dictionary) increases the number of possible splits, and the part handled by `expansion` stretches the rhymes so they are easier to catch. For the split itself, I think N-character division works best. For example, "many (ooi)" and "far (ooi)" are easy to think of as a rhyme, but pairs like "technology (~~eu~~ooi)" and "the view (ooi~~aa~~)", where only part of the vowel sequence matches, are hard to come up with. I think N-character division can cover the parts that splitting by part of speech or word segmentation would overlook. After the split, the N-grams are converted to romaji and only the vowels are kept; since the keys are serial numbers, they act like word IDs, and all three dicts can be accessed through the same ID.
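As a small check of what those dictionaries end up holding (my own example, reusing the "many"/"far" pair from above):

```python
# Both words reduce to the same vowel skeleton, which is the kind of string
# the dic_vo_t entries store.
for w in ["オオイ", "トオイ"]:  # "many" (多い) and "far" (遠い) in katakana
    print(w, re.sub(r"[^aeiou]+", "", conv.do(w)))  # both print "ooi"
```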
# Slice the shorter word; every slice that also appears in the other word is
# treated as a "rhyme", and its length is added to the score.
def create_weight(word_a, word_b):
    weight = 0
    if len(word_a) > len(word_b):
        for i in range(len(word_b)):
            for j in range(len(word_b) + 1):
                if word_b[i:j] in word_a and len(word_b[i:j]) > 1 and not word_a == word_b:
                    weight += len(word_b[i:j])
    else:
        for i in range(len(word_a)):
            for j in range(len(word_a) + 1):
                if word_a[i:j] in word_b and len(word_a[i:j]) > 1 and not word_a == word_b:
                    weight += len(word_a[i:j])
    return weight
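As a small illustration with made-up inputs, scoring the vowel skeletons "ooi" and "tooi":

```python
# Shared substrings of length >= 2 are "oo", "ooi", and "oi", so the score is 2+3+2.
print(create_weight("ooi", "tooi"))  # 7
```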
# Pass the three dicts; each index becomes a node, and (node, node, weight)
# tuples become the weighted edges.
def create_edge_list(dic_text_data, dic_vo_t, dic_vo_ex):
    edge_list = []
    for i in dic_text_data.keys():
        for j in dic_text_data.keys():
            text_weight = create_weight(dic_text_data[i], dic_text_data[j])
            vo_weight = create_weight(dic_vo_t[i], dic_vo_t[j])
            ex_weight = create_weight(dic_vo_ex[i], dic_vo_ex[j])
            # The ratio between the three could be tuned.
            weight = text_weight*1.0 + vo_weight*1.0 + ex_weight*1.0
            # A weight threshold might be worth adding here. Adding an i < j
            # condition might also be fine.
            if weight != 0:
                edge_list.append((i, j, weight))
    return edge_list

edge_list = create_edge_list(dic_text_data, dic_vo_t, dic_vo_ex)
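As a quick sanity check on the result (my own snippet, not part of the original article), the heaviest edges and the N-grams behind them can be printed:

```python
# Show the five strongest pairs and the text fragments behind them.
# Each pair appears in both orders because (i, j) and (j, i) are both added.
for i, j, w in sorted(edge_list, key=lambda e: e[2], reverse=True)[:5]:
    print(dic_text_data[i], dic_text_data[j], w)
```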
For checking vowel matches, I think the method I have used from the start is the best. For example, to pick up "iu", "ue", and "iue" from "aiueo", it is not enough to check matches only from the front or the back. At first I was thinking of comparing the input data against some arbitrary target_word, but I decided it is better to complete everything within the input data itself. And since what we are doing here is looking at the relationships and similarity between the divided words, the natural thing is to look at a graph. By using the three kinds of dicts for the weight, matches that include consonants, matches of the vowels, and matches of the vowels after the rhyme expansion are all added together.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_weighted_edges_from(edge_list)
pos = nx.spring_layout(G)
plt.figure(figsize=(20, 20))
nx.draw_networkx_edges(G, pos)
plt.show()
As you can see, it is hard to tell what is going on. Because the text is split into runs of N characters, adjacent words (nodes) are inevitably connected. What I am considering now is what the comments inside create_edge_list hint at. For the time being, trying clustering on this as it is:
from networkx.algorithms.community import greedy_modularity_communities

com = list(greedy_modularity_communities(G, weight="weight"))
for i in range(len(com)):
    if len(com[i]) > 2:
        p = com[i]
        print("community:" + str(i))
        for k in p:
            print(dic_text_data[k])
There are some good results, but it is frustrating to see others that are slightly off. With clustering that maximizes modularity, the connections inside each community are strong (the members rhyme with each other) while the connections between communities are weak (they do not rhyme with other communities), so it feels like one phrase could be built per community. Precisely because of that, I cannot shake the feeling that there are too many edges.
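Since the "too many edges" feeling points back at the comments inside create_edge_list, one possible pruning step would look something like this. This is only a sketch under my own assumptions: the function name and the threshold value of 4 are mine, not from the original.

```python
# Sketch: keep only pairs whose combined score clears a threshold, and use
# i < j so each pair appears once and self-loops are dropped.
def create_edge_list_thresholded(dic_text_data, dic_vo_t, dic_vo_ex, threshold=4):
    edge_list = []
    for i in dic_text_data.keys():
        for j in dic_text_data.keys():
            if i >= j:
                continue
            weight = (create_weight(dic_text_data[i], dic_text_data[j])
                      + create_weight(dic_vo_t[i], dic_vo_t[j])
                      + create_weight(dic_vo_ex[i], dic_vo_ex[j]))
            if weight >= threshold:
                edge_list.append((i, j, weight))
    return edge_list
```

G could then be rebuilt from this smaller edge list and clustered again in the same way.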
The match against dic_text_data may not be needed at all; if it is kept, it would be better to add it only when the vowels match and the consonants match as well. Clustering and N-character division seem to be a poor fit for each other, yet I also felt that clustering may be where this ultimately ends up. As future policy, I will reconsider the division method, look for improvements such as connecting nodes only when the distance between them stays within a certain range, and, since division keeps causing trouble, also think about a way to compare without dividing at all. In any case, I cannot shake the feeling that it is one step forward, one step back ...