[PYTHON] I want to handle the rhyme part7 (BOW)

__ Content __

This time I try something a little different. So far I have worried about how to split the text, but if the goal is to find matching runs of the vowels "aiueo", the sentences can be compared directly: line up every possible vowel sequence and record whether each one appears in a sentence. I will try that idea here. In other words, this is the binary representation of bag-of-words, which "does not care about the frequency of appearance and focuses only on whether or not each word appears", with each "word" being a sequence of vowels.
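To make the idea concrete before touching the real data, here is a minimal sketch. The helper names `vowels_only` and `binary_bow` and the hand-romanized inputs are my own assumptions for illustration; they are not part of the script that follows.

```python
import re

def vowels_only(romaji):
    """Keep only the vowels a, i, u, e, o of a romanized string."""
    return re.sub(r"[^aiueo]+", "", romaji)

def binary_bow(vowel_string, words):
    """Return 1 for each word that occurs in the vowel string, else 0."""
    return [1 if w in vowel_string else 0 for w in words]

# Hypothetical inputs, romanized by hand (not taken from gennama.txt).
s1 = vowels_only("toukyou")
s2 = vowels_only("koukou")
words = ["ou", "uo", "ai"]

print(s1, binary_bow(s1, words))
print(s2, binary_bow(s2, words))
```

Two sentences that end up with the same 0/1 pattern share the same vowel sequences, which is what I want to detect as a rhyme candidate.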

__ Creating a word_list that treats vowel sequences as words __

from pykakasi import kakasi
import re
import numpy as np
import pandas as pd

with open("./gennama.txt","r", encoding="utf-8") as f:
    data = f.read()

vowel_list = ["a","i","u","e","o"]
# Word list: every 2- to 4-letter word that can be made using only vowels.
# 25 + 125 + 625 = 775 types, each appended exactly once.
word_list = []
for i in vowel_list:
    for j in vowel_list:
        word_list.append(i+j)
        for k in vowel_list:
            word_list.append(i+j+k)
            for l in vowel_list:
                word_list.append(i+j+k+l)

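# Split the text into sentences on full-width spaces and newlines.
text_data = re.split("\u3000|\n", data)
# Convert kanji (J), hiragana (H) and katakana (K) to romaji (a).
kks = kakasi()  # a separate name, so the imported kakasi class is not shadowed
kks.setMode('J', 'a')
kks.setMode('H', 'a')
kks.setMode('K', 'a')
conv = kks.getConverter()
vowel_text_list = [conv.do(d) for d in text_data]
# Keep only the vowels of each romanized sentence.
vowel_text_list = [re.sub(r"[^aeiou]+","",text) for text in vowel_text_list]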

The rest is simple. There are not many kinds of words, only 775. As a rule of thumb from the previous parts, vowel matches of 5 or more characters are extremely rare, so I limited the words to 4 characters. Until now I have created various dictionaries, but this time I collect everything in a DataFrame.
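As a quick sanity check on the size of the vocabulary, the same set of words can be counted with `itertools.product`. This is only a sketch for verification, not the code used in this article, and it lists the words grouped by length rather than in the order of the nested loops.

```python
from itertools import product

vowels = "aiueo"
# All vowel-only strings of length 2, 3 and 4.
words = ["".join(p) for n in (2, 3, 4) for p in product(vowels, repeat=n)]

print(len(words))       # 5**2 + 5**3 + 5**4
print(len(set(words)))  # same number if there are no duplicates
```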

__ DataFrame creation __

df = pd.DataFrame({"Sentence": text_data, "vowel": vowel_text_list})
# One column per word (e.g. "aa"): 1 if the word appears in the sentence's vowel string, otherwise 0.
binary_dic = {}
for word in word_list:
    binary_dic[word] = [1 if word in vowel else 0 for vowel in vowel_text_list]

# Attach all word columns at once instead of inserting 775 columns one by one.
df = pd.concat([df, pd.DataFrame(binary_dic)], axis=1)
df.to_csv("df_test.csv")

The columns are "Sentence", "vowel", and one column per word. "Sentence" holds each sentence obtained by splitting the original text, "vowel" holds that sentence reduced to its vowels only, and each word column is 1 if the word appears in the sentence's vowel string and 0 if it does not.

__ Usage example __

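# Cosine similarity
def cosine_similarity(v1, v2):
    cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos_sim

# Columns 2 onward are the 0/1 word columns; convert each row to a float array first.
# Note: the result is nan if a sentence contains none of the words (zero vector).
v0 = df.iloc[0, 2:].to_numpy(dtype=float)
v3 = df.iloc[3, 2:].to_numpy(dtype=float)
print(cosine_similarity(v0, v3))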

In this way you can, for example, display the similarity between sentence 0 and sentence 3. You can also use sum to quickly find out which words are used most often.
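The sum idea could look roughly like this. The tiny `df_small` below is a hypothetical stand-in with the same layout as the DataFrame in this article (two text columns followed by 0/1 word columns), so the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature DataFrame: 3 sentences, 3 word columns.
df_small = pd.DataFrame({
    "Sentence": ["s0", "s1", "s2"],
    "vowel": ["aiao", "aia", "ouo"],
    "ai": [1, 1, 0],
    "ia": [1, 1, 0],
    "ou": [0, 0, 1],
})

# Summing each word column counts the sentences that contain the word.
counts = df_small.iloc[:, 2:].sum().sort_values(ascending=False)
print(counts.head())
```

Because the values are binary, each sum is the number of sentences containing that word, not the total number of occurrences.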

__ Summary __

Because I focused only on vowels, the number of words stayed at 775 even when taking all combinations of 2 to 4 letters. Until now I had always tried to split the text before handling it, but it turns out there are things I can do with the sentences as they are. That was a big discovery for me, so I wrote this article even though the content is thin. Next, I will try building on the created DataFrame, for example by seeing what more can be done with the similarity between sentences.
