[PYTHON] Rap analysis part 1 (data preparation)

Purpose

The idea is to analyze the lyrics of real rappers without looking for anything in particular. As in "I want to handle the rhyme", I look at what kinds of vowel sequences the text contains. The prepared data covers 8 releases (about 90 songs) by a group with two contrasting rappers. One has a freaky flow (good at riding the sound?!) and the other favors hard rhymes (difficult to explain), and I hope to find out how each of them rhymes, how often each vowel sequence appears, and which vowels each of them likes. ~~"123" can be read as "Hi-Fumi" or "Andu Troyes", so the data was prepared by hand with that in mind~~

__ Text data to DataFrame __

from pykakasi import kakasi
import re
import pandas as pd
import itertools

with open("./data/yoshi.txt","r", encoding="utf-8") as f:
    data = f.read()

# Word list: every 2- to 4-letter string made only of vowels (25 + 125 + 625 = 775 types)
word_list = ["".join(p) for n in (2, 3, 4) for p in itertools.product("aiueo", repeat=n)]
# Split into individual songs on "!". {song number: list of lyric segments split on full-width spaces and line breaks}
text_data = data.split("!")
text_data_dic = {k: re.split("\u3000|\n", v) for k, v in enumerate(text_data)}

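# Romanize kanji (J), hiragana (H), and katakana (K).
# The instance is named kks so it does not shadow the imported kakasi class.
# Note: this is the older pykakasi interface (setMode/getConverter).
kks = kakasi()
kks.setMode("J", "a")
kks.setMode("H", "a")
kks.setMode("K", "a")
conv = kks.getConverter()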

# {song number: list of segments reduced to their vowels only}
vowel_text_dic = {
    k: [re.sub(r"[^aeiou]+", "", conv.do(d)) for d in v]
    for k, v in text_data_dic.items()
}

# One column per vowel sequence ("aa" etc.), one row per song.
# Each value is how many times the sequence appears in that song,
# counted segment by segment so a match never spans a space or line break.
count_dic = {
    word: [sum(seg.count(word) for seg in v) for v in vowel_text_dic.values()]
    for word in word_list
}
    
df = pd.DataFrame(count_dic)
df["label"] = 0
df.to_csv("./data/yoshi.csv", index=False)

    aa  ai  au  ...  ooou  oooe  oooo  label
0    4   9   7  ...     1     1     0      0
1   21  18   7  ...     1     1     2      0
2    8  18  18  ...     1     0     0      0
3   19  26  23  ...     0     0     0      0
..  ..  ..  ..  ...    ..    ..    ..     ..
88  12  14   2  ...     0     0     0      0
89  17  17  10  ...     1     0     1      0

Two text files were prepared, one per rapper. In the text, "!" marks the point where one song ends and the next begins, and the lyrics keep the full-width spaces and line breaks of the lyric card. Vowel sequences are not allowed to continue across those breaks. The other text file was saved with df["label"] = 1 so the two can be told apart easily.
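To illustrate why the lyrics are split before counting, here is a minimal sketch that uses only the standard library. It assumes the text has already been romanized (so pykakasi is not needed), and the sample lyrics and the helper names `vowel_segments` and `count_sequence` are made up for this example. It also shows that `str.count` counts non-overlapping matches.

```python
import re

# Hypothetical, already-romanized lyrics: "!" separates songs,
# a full-width space (\u3000) or a line break separates segments.
romaji = "kaze ga\u3000fuku\nasa!yoru no\u3000machi"


def vowel_segments(song):
    """Split one song into segments and keep only the vowels of each."""
    return [re.sub(r"[^aeiou]+", "", seg) for seg in re.split("\u3000|\n", song)]


def count_sequence(segments, word):
    """Count a vowel sequence segment by segment, never across a boundary."""
    return sum(seg.count(word) for seg in segments)


songs = [vowel_segments(s) for s in romaji.split("!")]
print(songs)  # [['aea', 'uu', 'aa'], ['ouo', 'ai']]

# "au" never occurs inside a single segment of the first song...
print(count_sequence(songs[0], "au"))  # 0
# ...but it would be counted if the segments were glued together.
print("".join(songs[0]).count("au"))  # 1

# str.count does not count overlapping matches.
print("aaa".count("aa"))  # 1
```

Because of the last point, a run such as "aaaa" contributes 2 to the "aa" column rather than 3, which is worth keeping in mind when reading the counts.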

__ Look at the contents of the data __

import pandas as pd

df1 = pd.read_csv("./data/pochomkin.csv")
df2 = pd.read_csv("./data/yoshi.csv")

# Look at the mean count per song for the 2-vowel sequences (the first 25 columns)
df1_2vowel = df1.iloc[:, :25].mean()
print(df1_2vowel.sort_values(ascending=False))
df2_2vowel = df2.iloc[:, :25].mean()
print(df2_2vowel.sort_values(ascending=False))

There are 776 columns, so let's look at them in groups. For each sequence length, I checked which columns had the highest mean (iloc[:, 25:150] for the 3-vowel sequences, 125 types from 5 × 5 × 5, and iloc[:, 150:775] for the 4-vowel sequences). For 2 vowels, the top 4 were the same in df1 and df2: "ai, ia, ou, aa". For 3 vowels, the top 2, "aia, aai", matched. In every case, the mean was higher for df2.
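The slice positions follow directly from the order in which the word list was built. As a rough check, the sketch below rebuilds the column order with itertools and derives the start and stop position of each length group; the names `columns` and `bounds` are only for this illustration.

```python
import itertools

VOWELS = "aiueo"

# Same ordering as the word list: all 2-letter sequences, then 3, then 4.
columns = ["".join(p) for n in (2, 3, 4) for p in itertools.product(VOWELS, repeat=n)]

# Start/stop positions of each length group, usable as iloc[:, start:stop].
bounds = {}
start = 0
for n in (2, 3, 4):
    stop = start + len(VOWELS) ** n
    bounds[n] = (start, stop)
    start = stop

print(len(columns))  # 775 (the "label" column makes it 776)
print(bounds)        # {2: (0, 25), 3: (25, 150), 4: (150, 775)}
print(columns[24], columns[25], columns[150], columns[774])  # oo aaa aaaa oooo
```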

# Sum each column to get the total count over all (about 90) songs
value_count_1 = df1.sum(axis=0).values
value_count_2 = df2.sum(axis=0).values
# Boolean masks: True where the total count is less than 10
bool_1 = value_count_1 < 10
bool_2 = value_count_2 < 10
# Print the vowel sequences that appear fewer than 10 times for both rappers
print(df1.columns[bool_1 & bool_2])

Looking at the vowel sequences that are infrequent in both datasets, 37 qualified, and most of them were 4-vowel sequences containing "ee".
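The same filtering idea can be sketched without pandas. The totals below are toy numbers invented for the example, not the real counts, and the variable names are hypothetical; the point is only to show "rare for both rappers" followed by a check for "ee".

```python
# Hypothetical total counts per vowel sequence for the two rappers (toy numbers).
totals_1 = {"ai": 1200, "aa": 950, "eeee": 3, "ueee": 12, "oeee": 7, "uoiu": 2}
totals_2 = {"ai": 1500, "aa": 1100, "eeee": 5, "ueee": 4, "oeee": 9, "uoiu": 6}

THRESHOLD = 10

# Sequences that appear fewer than THRESHOLD times for both rappers.
rare_in_both = [
    seq for seq in totals_1
    if totals_1[seq] < THRESHOLD and totals_2[seq] < THRESHOLD
]

# Of those, the ones that contain "ee".
rare_with_ee = [seq for seq in rare_in_both if "ee" in seq]

print(rare_in_both)   # ['eeee', 'oeee', 'uoiu']
print(rare_with_ee)   # ['eeee', 'oeee']
```

Note that "ueee" is excluded because it is rare for only one of the two rappers, which mirrors combining the two boolean masks with a logical AND.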

__ Summary and future policy __

The top-ranked sequences may match because those vowel sequences are simply common in Japanese, or because the rappers like them. Also, because the group's name is "Ai Ranger", "ai" may just appear often in the prepared data. I expected "aa" to be the most frequent, since there are many words whose vowels are "aaaa" (such as the words for "sloppy" and "different bodies"), but the results were different, which is interesting. However, no matter how many rap lyrics are used as data, not every vowel takes part in a rhyme, so I can't conclude anything.

I also expected each rapper to have a favorite vowel sequence, but the difference is not as marked as I expected, so perhaps that is not the place to focus. What I can say this time is that the rapper on the df2 side (hard rhymes) seems to use the same vowel sequences more often than the df1 rapper, which is as expected. The low frequency of "ee" is a new discovery. The rappers may have a reason to avoid it, such as the sound being hard to pick up. I had a feeling that splitting the data in two would make it possible to classify the rappers, but it does not seem that easy. I'll look at the data a little more to see whether there is a difference between the two.
