[PYTHON] I want to handle the rhyme part3

__ Content __

Continuing from last time, I want to make better use of MeCab. Plain word segmentation split the text too finely, so this time I would like to do something closer to phrase segmentation with MeCab. I'm not familiar with CaboCha or KNP (I'll say it again, ~~yellow B-boy~~ I want to use MeCab). I haven't used morphological analysis so far, so I will try it out and attach function words (particles, auxiliary verbs) to the word that precedes them.

__ Concatenation of attached words __

In the end, I didn't use MeCab for this step. Since I only needed the surface form and the part of speech, janome seemed simpler. ~~That said, I did try various things with MeCab: I followed an article on loading its output into a DataFrame, it churned for minutes and then ended in a memory error, so I had plenty of fun.~~ After concatenating the function words with the code below, the average length of the resulting chunks was 2.96 characters (2.16 with plain word segmentation).

from janome.tokenizer import Tokenizer

with open("./gennama.txt", "r") as f:
    data = f.read()

tokenizer = Tokenizer()

# Collect the surface form and the top-level part of speech of every token.
surface_list = []
part_of_speech_list = []
for token in tokenizer.tokenize(data):
    surface_list.append(token.surface)
    part_of_speech_list.append(token.part_of_speech.split(",")[0])

# janome reports part-of-speech tags in Japanese:
# 記号 = symbol, 助詞 = particle, 助動詞 = auxiliary verb.
text_data = []
for surface, pos in zip(surface_list, part_of_speech_list):
    if pos == "記号":
        continue
    elif pos in ("助詞", "助動詞") and text_data:
        # Attach the function word to the word before it.
        row = text_data.pop() + surface
    else:
        row = surface
    text_data.append(row)
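
To make the effect on the average length concrete, here is a minimal sketch that runs the same merging rule on a tiny hand-written token list instead of real janome output. The sample sentence, its part-of-speech tags, and the helper names `merge_function_words` and `average_length` are assumptions made for illustration; the tags are written in Japanese because that is how janome reports them.

```python
# Hand-written (surface, part of speech) pairs for "今日は雨だ。".
# These are illustrative assumptions, not actual janome output.
tokens = [
    ("今日", "名詞"),    # noun
    ("は", "助詞"),      # particle
    ("雨", "名詞"),      # noun
    ("だ", "助動詞"),    # auxiliary verb
    ("。", "記号"),      # symbol
]


def merge_function_words(pairs):
    """Drop symbols and glue particles / auxiliary verbs onto the previous chunk."""
    chunks = []
    for surface, pos in pairs:
        if pos == "記号":
            continue
        if pos in ("助詞", "助動詞") and chunks:
            chunks[-1] += surface
        else:
            chunks.append(surface)
    return chunks


def average_length(chunks):
    """Mean number of characters per chunk (0.0 for an empty list)."""
    return sum(len(c) for c in chunks) / len(chunks) if chunks else 0.0


word_split = [surface for surface, pos in tokens if pos != "記号"]
merged = merge_function_words(tokens)

print(merged)                      # ['今日は', '雨だ']
print(average_length(word_split))  # 1.25
print(average_length(merged))      # 2.5
```

On this toy input the chunks get longer in the same direction as the 2.16 to 2.96 change reported above, though the numbers themselves are unrelated to the real data.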

With this, as last time, I ranked the vowel sequences that occur in the input data and output the words that contain them. There is more output than with plain word segmentation, but I still haven't found it practical. For now, N-grams remain the best fit: word order is preserved, and even where the text is split at a place it shouldn't be, the pieces can be joined back up, so the rhyme can still be detected. That is when I noticed something: why not convert kanji to hiragana before taking N-grams? If kanji are converted to their readings first, the variety of N-grams is likely to increase. (~~kakasi can convert kanji~~) Incidentally, when I ran morphological analysis with MeCab earlier, its output had a "reading" field. Let's use MeCab.
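
The ranking step described above can be sketched roughly as follows. This is not the code from the previous post: the function names and the sample dictionary `word_vowels` (chunk mapped to its vowel string) are hypothetical, and the chunks are romanized only to keep the example readable.

```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping n-character windows of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def rank_patterns(word_vowels, n, top=3):
    """Count vowel n-grams inside each chunk and return the most frequent ones."""
    counter = Counter()
    for vowels in word_vowels.values():
        counter.update(char_ngrams(vowels, n))
    return counter.most_common(top)


def words_containing(pattern, word_vowels):
    """Chunks whose vowel string contains the given pattern."""
    return [word for word, vowels in word_vowels.items() if pattern in vowels]


# Hypothetical sample: chunk -> vowel string (assumed to be prepared beforehand).
word_vowels = {"shitamachi": "iaai", "katamari": "aaai", "shunkan": "ua"}

best_pattern, count = rank_patterns(word_vowels, n=2, top=1)[0]
print(best_pattern, count)                          # aa 3
print(words_containing(best_pattern, word_vowels))  # ['shitamachi', 'katamari']
```

Counting n-grams per chunk, rather than over one long joined string, avoids patterns that only exist across chunk boundaries; whether that is desirable for rhyme detection is a design choice left open here.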

__ Convert to reading data __

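import MeCab

with open("./gennama.txt", "r") as f:
    data = f.read()

# "-Ochasen" prints one token per line as tab-separated fields:
# surface form, reading (katakana), base form, part of speech, ...
mecab = MeCab.Tagger("-Ochasen")

yomi_data = ""
for line in mecab.parse(data).split("\n"):
    fields = line.split("\t")
    if fields[0] == "EOS":
        break
    if len(fields) > 1:
        # The second field is the reading.
        yomi_data += fields[1]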

"Shitamachi"-> "Shitamachi"-> "iaai", and 4 vowels can be represented by 4 letters. But what about the case of "moment"-> "shunkan"-> "ua"? Two vowels are to be represented by five letters. If you divide it into N characters according to how you read it, you can do things like "Tama / Chi" instead of "Shita / Machi". However, "Shunkan" becomes uselessly long. Then, do you use N-gram after making vowel data? The surface layer cannot be retrieved even if the index is assigned to the data. It cannot be good that the sequence of "aa" in "Shitamachi" is not detected. I can't think of an improvement plan right away, but I want to think about it here.

__ Plan from now on __

First, it may be worth adding several ways of scoring. For example, consonants could be kept and extra points added when they match as well, and multiple scores could be summed. In that case, it seems possible to treat "refrigerator" like (Reizoko, Reizoko). Second, I could draw a graph with the split words as nodes and the scores on the edges. I actually tried this back when the only way of splitting was on spaces, but it didn't come out as expected (I guess betweenness centrality is the thing to look at?). I will try studying networkx again from scratch. (~~I'm worried about when I'll post next. However, when I was studying search engines and saw a score that summed PageRank and word-distance scores, I thought it might be useful, so I will keep studying.~~)
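
As a rough illustration of the first idea, summing a vowel score and a consonant bonus, here is a minimal sketch. Everything in it is an assumption made for the example: the inputs are romanized readings, the names `to_morae` and `rhyme_score` and the `consonant_bonus` weight are hypothetical, and comparing from the end of the word is just one possible alignment.

```python
def to_morae(romaji):
    """Split a romanized reading into (consonant, vowel) pairs.

    A syllabic "n" (an "n" followed by another consonant, or at the end)
    carries no vowel and is dropped in this simplified parser.
    """
    morae, onset = [], ""
    for ch in romaji.lower():
        if ch in "aiueo":
            morae.append((onset, ch))
            onset = ""
        elif ch.isalpha():
            if onset == "n" and ch != "y":
                onset = ch  # the previous "n" was syllabic; start a new onset
            else:
                onset += ch
    return morae


def rhyme_score(a, b, consonant_bonus=0.5):
    """Score matching morae from the end: 1 per equal vowel, plus a bonus per equal consonant."""
    score = 0.0
    for (ca, va), (cb, vb) in zip(reversed(to_morae(a)), reversed(to_morae(b))):
        if va != vb:
            break
        score += 1.0
        if ca == cb:
            score += consonant_bonus
    return score


print(to_morae("shunkan"))                    # [('sh', 'u'), ('k', 'a')]
print(rhyme_score("shitamachi", "katamari"))  # 4.0
print(rhyme_score("shunkan", "junban"))       # 2.0
```

Scores like these could later serve as edge weights between word nodes, but how to weight and combine them is left open.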
