[PYTHON] Rap analysis part1 (data preparation)

Purpose

This is an open-ended attempt to find something by analyzing the lyrics of actual rappers; there is no particular goal. We will analyze what kind of vowel sequences the text contains, continuing what was done in "I want to handle the rhyme". The prepared data is 8 albums' worth of lyrics (about 90 songs) by a group with two contrasting rappers. One has a freaky flow (good at riding the sound?!) and the other rhymes hard (difficult to explain), and I hope to find out how each of them rhymes, how often the rhymes appear, and which vowels they favor. ~~"123" can be read as "hi-fu-mi" or "un, deux, trois", so the data was prepared by hand with that in mind~~

Text data to DataFrame

from pykakasi import kakasi
import re
import pandas as pd
import itertools

with open("./data/yoshi.txt","r", encoding="utf-8") as f:
    data = f.read()

#Word list: every 2- to 4-character string made only of vowels (25 + 125 + 625 = 775 kinds)
word_list2 = [i[0]+i[1] for i in itertools.product("aiueo", repeat=2)]
word_list3 = [i[0]+i[1]+i[2] for i in itertools.product("aiueo", repeat=3)]
word_list4 = [i[0]+i[1]+i[2]+i[3] for i in itertools.product("aiueo", repeat=4)]
word_list = word_list2 + word_list3 + word_list4
#Split into individual songs. {song number: list of phrases split at full-width spaces and line breaks}
text_data = data.split("！")
text_data_dic = {k:re.split("\u3000|\n", v) for k,v in enumerate(text_data)}

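#Use a separate name for the instance so the imported kakasi class is not shadowed
kks = kakasi()
#Convert kanji (J), hiragana (H) and katakana (K) to romaji (a)
kks.setMode('J', 'a')
kks.setMode('H', 'a')
kks.setMode('K', 'a')
conv = kks.getConverter()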

#{song number: list of phrases reduced to their vowels}
vowel_text_dic = {}
for k,v in text_data_dic.items():
    #Convert each phrase to romaji
    vowel_text_dic[k] = [conv.do(d) for d in v]
for k,v in vowel_text_dic.items():
    #Drop every character that is not a vowel
    vowel_text_dic[k] = [re.sub(r"[^aeiou]+","",d) for d in v]

#Columns are vowel sequences such as "aa"; values are occurrence counts. One row per song
count_dic = {}
for word in word_list:
    #Count this sequence in each song by summing over the song's phrases
    count_dic[word] = [sum(vowel.count(word) for vowel in v) for v in vowel_text_dic.values()]

df = pd.DataFrame(count_dic)
df["label"] = 0
df.to_csv("./data/yoshi.csv", index=False)

	aa	ai	au	…	ooou	oooe	oooo	label
0	4	9	7	…	1	1	0	0
1	21	18	7	…	1	1	2	0
2	8	18	18	…	1	0	0	0
3	19	26	23	…	0	0	0	0
…	…	…	…	…	…	…	…	…
88	12	14	2	…	0	0	0	0
89	17	17	10	…	1	0	1	0
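One detail worth keeping in mind when reading these counts: Python's str.count counts non-overlapping occurrences, so a run such as "aaaa" adds 2 to the "aa" column, not 3. The sketch below contrasts the two ways of counting. count_overlapping is a hypothetical helper written only for illustration and is not part of the code above.

```python
import re

def count_overlapping(text: str, word: str) -> int:
    """Count occurrences of word in text, allowing overlaps (hypothetical helper)."""
    # A zero-width lookahead matches at every start position without consuming text
    return len(re.findall(f"(?={re.escape(word)})", text))

line = "aaaa"
non_overlapping = line.count("aa")            # the definition used by str.count
overlapping = count_overlapping(line, "aa")   # alternative definition
print(non_overlapping, overlapping)
```

Either definition is defensible for rhyme analysis; what matters is that the same one is applied to both rappers.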

There are two prepared text files, one for each rapper. In the text, "！" marks the point where the song changes, and the lyrics keep the full-width spaces and line breaks as they appear on the lyric sheet. I make sure that vowel sequences are not counted across those separators. The other text file was saved with df["label"] = 1 so the two can be told apart easily.
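The splitting rule can be sketched on a toy input. The sample string below is made up and already romanized (an assumption, so that the sketch runs without pykakasi); only the separators match the real files.

```python
import re

# Made-up, already romanized sample: "！" separates songs,
# "\u3000" (full-width space) and "\n" separate phrases within a song
sample = "konomachi\u3000de\nrappusuru！oreno\u3000banda"

songs = sample.split("！")
phrases_per_song = {k: re.split("\u3000|\n", v) for k, v in enumerate(songs)}

# Keep only the vowels of each phrase, so no vowel sequence spans a separator
vowels_per_song = {
    k: [re.sub(r"[^aeiou]+", "", p) for p in phrases]
    for k, phrases in phrases_per_song.items()
}
print(vowels_per_song)
```

Because "de" and "rappusuru" end up as separate list items, the sequence "ea" is never counted even though the two phrases are adjacent in the lyrics.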

Look at the contents of the data

import pandas as pd

df1 = pd.read_csv("./data/pochomkin.csv")
df2 = pd.read_csv("./data/yoshi.csv")

#Look at the mean count per song of the two-character sequences (first 25 columns)
df1_2vowel = df1.describe().iloc[:, :25]
df1_2vowel = df1_2vowel.loc["mean", :]
print(df1_2vowel.sort_values(ascending=False))
df2_2vowel = df2.describe().iloc[:, :25]
df2_2vowel = df2_2vowel.loc["mean", :]
print(df2_2vowel.sort_values(ascending=False))

There are 776 columns, so let's look at them in groups. For each sequence length, I checked which columns had the highest mean (iloc[:, 25:150] for 3 characters, 5 × 5 × 5 = 125 kinds, and iloc[:, 150:775] for 4 characters). For 2 characters, the top 4 were the same for both df1 and df2, "ai, ia, ou, aa", and for 3 characters the top 2, "aia, aai", were the same. In every case the mean was higher for df2.
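The positional slices rely on the column order produced when the CSV was built. As a minimal sketch of an alternative, the columns can be selected by the length of their names. The miniature DataFrame and the helper name mean_by_length are made up for illustration.

```python
import pandas as pd

# Made-up miniature of the real table: vowel-sequence columns plus "label"
toy = pd.DataFrame({
    "aa": [4, 21], "ai": [9, 18],
    "aia": [3, 5], "aai": [2, 4],
    "aaaa": [1, 0],
    "label": [0, 0],
})

def mean_by_length(df: pd.DataFrame, n: int) -> pd.Series:
    """Mean count per song of every n-character vowel sequence, highest first."""
    cols = [c for c in df.columns if c != "label" and len(c) == n]
    return df[cols].mean().sort_values(ascending=False)

print(mean_by_length(toy, 2))
print(mean_by_length(toy, 3))
```

Selecting by name length gives the same groups as the iloc ranges as long as the columns really are the 775 vowel sequences followed by "label".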

#Sum each column to get the total count of every sequence (over about 90 songs)
value_count_1 = df1.sum(axis=0).values
value_count_2 = df2.sum(axis=0).values
#Boolean arrays: True where the sequence is counted fewer than 10 times
bool_1 = value_count_1 < 10
bool_2 = value_count_2 < 10
#Print the vowel sequences counted fewer than 10 times for both rappers
print(df1.columns[bool_1 & bool_2])

Looking for vowel sequences that are infrequent in both datasets, I found 37, most of which were 4-character sequences containing "ee".
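The check on "ee" can be made explicit rather than read off the printed list. The sketch below uses made-up counts for two rappers; the column names and numbers are illustrative only and do not come from the real data.

```python
import pandas as pd

# Made-up per-song counts for two rappers (rows are songs)
toy1 = pd.DataFrame({"ai": [9, 18], "eea": [1, 0], "oeea": [0, 2], "uo": [3, 4], "uoe": [1, 1]})
toy2 = pd.DataFrame({"ai": [12, 20], "eea": [0, 3], "oeea": [1, 1], "uo": [8, 9], "uoe": [2, 0]})

# Total count of each sequence over all songs, per rapper
total1 = toy1.sum(axis=0)
total2 = toy2.sum(axis=0)

# Sequences counted fewer than 10 times for both rappers
mask = (total1 < 10) & (total2 < 10)
rare_in_both = list(mask[mask].index)

# Of those, the ones that contain "ee"
with_ee = [c for c in rare_in_both if "ee" in c]
print(rare_in_both, with_ee)
```

On the real tables, the "label" column would need to be dropped first, or it will be treated as one more sequence.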

Summary and future policy

The fact that the top-ranked sequences match may mean that these vowel sequences are simply common in Japanese, or that the rappers like them. Also, because of the group name "Ai Ranger", "ai" may appear frequently in the prepared data. I expected "aa" to appear most often, because many Japanese words (for example, the words for "sloppy" and "different bodies") consist only of "a" vowels, i.e. "aaaa", but the result was different, which is interesting. However, even when the data consists of rappers' lyrics, not every vowel is part of a rhyme, so I cannot draw a firm conclusion.

Another thing I expected was that each rapper would have favorite vowel sequences, but the difference in characteristics is not as marked as I expected, so perhaps the rappers do not focus on that as much as I assumed. What I can say this time is that the rapper on the df2 side (hard rhymes) seems to use the same vowel sequences more frequently than the df1 rapper. This is as expected. The low frequency of "ee" is a new discovery. There may be a reason rappers avoid it, such as the sound being hard to pick up.

I had a vague feeling that, with the data divided into two, the songs could be classified by rapper, but it does not seem to be that easy. I will look at the data a little more to see whether there is a difference between the two.
