I made AI think about the lyrics of Kenshi Yonezu (pre-processing)

Introduction

Every song Kenshi Yonezu composes becomes a hit. The lyrics he spins out seem to have the power to fascinate people. This time, I decided to have deep learning learn that charm.


This article covers the steps up to "data preprocessing". The general flow is:

  1. Scrape and collect all the lyrics written by Mr. Yonezu
  2. Format the data to match the deep learning problem setting
  3. Split the data into training and test sets at a ratio of 8:2

Generally, "preprocessing" refers to data transformations that improve accuracy, such as normalization, but here "preprocessing" means shaping the raw lyrics so that they can serve as the input and output of the deep learning model.


Model used

Framework: PyTorch
Model: seq2seq with Attention


seq2seq and Attention background

seq2seq is one of the methods used for machine translation. Below is an image of seq2seq.

seq2seq.png

Quoted article: [Encoder-decoder model and Teacher Forcing, Scheduled Sampling, Professor Forcing](https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8teacher-forcingscheduled-samplingprofessor-forcing/)


This makes it possible for the Decoder to generate sentences based on the information encoded by the Encoder, but there is a problem: the information passed to the Decoder can only be represented by a fixed-length vector. The Encoder's output is its hidden state $h$, whose size is fixed. As a result, an input sequence that is too long cannot be properly compressed into $h$, while an input sequence that is too short fills $h$ with wasteful information. Hence, you will want to use **not only the state of the Encoder's last hidden layer, but also the states of the hidden layers along the way**.
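To make the bottleneck concrete, here is a minimal sketch (with hypothetical vocabulary and layer sizes, not the model used in this article) showing that a GRU Encoder compresses any input length into the same fixed-size $h$:

```python
# Minimal sketch: sequences of any length are compressed into the same fixed-size h
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # hypothetical sizes

embedding = nn.Embedding(vocab_size, embed_dim)
encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)

short_seq = torch.randint(0, vocab_size, (1, 5))   # 5-token passage
long_seq = torch.randint(0, vocab_size, (1, 500))  # 500-token passage

_, h_short = encoder(embedding(short_seq))
_, h_long = encoder(embedding(long_seq))
print(h_short.shape, h_long.shape)  # both torch.Size([1, 1, 128])
```

Both passages end up as the same 128-dimensional vector, no matter how much information they contain.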

This is the background behind the invention of Attention.


Attention mechanism

Attention is a method for focusing on (= attending to) the important points in past time steps when dealing with time-series data. Here we predict the "next passage" from a given "passage of lyrics", so the question becomes: **which parts of the previous passage should we attend to in order to predict the next one?** Below is an image of Attention.

attention.png

source: Effective Approaches to Attention-based Neural Machine Translation


According to the reference paper, this model is more accurately called Global Attention. By collecting all of the Encoder's hidden states as vectors and taking their inner products with the Decoder's output, we obtain the **similarity between every Encoder hidden state and the Decoder output**. Measuring this similarity with the inner product is why the mechanism is described as Attention that "focuses on important factors".
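As a minimal sketch (hypothetical tensor sizes, not this article's implementation), the dot-product form of Global Attention looks like this:

```python
# Minimal sketch of Global (dot-product) Attention with hypothetical sizes
import torch
import torch.nn.functional as F

seq_len, hidden_dim = 10, 128
encoder_states = torch.randn(1, seq_len, hidden_dim)  # all Encoder hidden states
decoder_output = torch.randn(1, 1, hidden_dim)        # one Decoder time step

# Inner products = similarity between the Decoder output and each Encoder state
scores = torch.bmm(decoder_output, encoder_states.transpose(1, 2))  # (1, 1, seq_len)
weights = F.softmax(scores, dim=-1)           # attention weights over input positions
context = torch.bmm(weights, encoder_states)  # weighted sum, (1, 1, hidden_dim)
```

The softmax turns the similarities into weights, so the Encoder states most similar to the current Decoder output dominate the resulting context vector.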


Implementation

After uploading the required self-made modules to Google Colab, copy and run the main.py described later.

**Required self-made modules**

スクリーンショット 2020-05-09 18.28.25.png


Problem setting

As shown below, the model predicts the "next passage" from a given "passage" of the songs Kenshi Yonezu has released so far.

|Input text|Output text|
|-------|-------|
|I'm really happy to see you|_All of them are sad as a matter of course|
|All of them are sad as a matter of course|_I have painfully happy memories now|
|I have painfully happy memories now|_Raise and walk the farewell that will come someday|
|Raise and walk the farewell that will come someday|_It's already enough to take someone's place|

This was created by scraping the lyrics site Uta-Net.
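As a toy illustration (hypothetical passages, not the scraped data), each passage is paired with the one that follows it, and the output side is prefixed with "_":

```python
# Toy illustration: consecutive passages become (input, output) pairs,
# with the output side prefixed by "_"
passages = ["passage A", "passage B", "passage C"]
pairs = [(passages[i], "_" + passages[i + 1]) for i in range(len(passages) - 1)]
print(pairs)  # [('passage A', '_passage B'), ('passage B', '_passage C')]
```

The leading "_" appears to serve as a start-of-sequence marker for the Decoder.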


Data preparation

Scraping

Get the lyrics by scraping with the code below. Note that this code is run on Google Colab.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.select import Select
import requests
from bs4 import BeautifulSoup
import re
import time

# Settings
# Chrome options so Selenium can launch in any environment
options = Options()
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')
options.add_argument('--proxy-server="direct://"')
options.add_argument('--proxy-bypass-list=*')
options.add_argument('--start-maximized')
options.add_argument('--headless')

class DriverControl():
    def __init__(self, driver):
        self.driver = driver
        
    def get(self, url):
        self.driver.get(url)
        
    def get_text(self, selector):
        element = self.driver.find_element_by_css_selector(selector)
        return element.text
        
    def get_text_by_attribute(self, selector, attribute='value'):
        element = self.driver.find_element_by_css_selector(selector)
        return element.get_attribute(attribute)
    
    def input_text(self, selector, text):
        element = self.driver.find_element_by_css_selector(selector)
        element.clear()
        element.send_keys(text)
        
    def select_option(self, selector, text):
        element = self.driver.find_element_by_css_selector(selector)
        Select(element).select_by_visible_text(text)
        
    def click(self, selector):
        element = self.driver.find_element_by_css_selector(selector)
        element.click()
        
    def get_lyric(self, url):
        self.get(url)
        time.sleep(2)
        element = self.driver.find_element_by_css_selector('#kashi_area')
        lyric = element.text
        return lyric
    
    def get_url(self):
        return self.driver.current_url
        
    def quit(self):
        self.driver.quit()

BASE_URL = 'https://www.uta-net.com/'
search_word = 'Kenshi Yonezu'
search_genre = 'Lyricist name'
driver = webdriver.Chrome(chrome_options=options)
dc = DriverControl(driver)
dc.get(BASE_URL)  # Access the top page

# Search
dc.input_text('#search_form > div:nth-child(1) > input.search_input', search_word)
dc.select_option('#search_form > div:nth-child(2) > select', search_genre)
dc.click('#search_form > div:nth-child(1) > input.search_submit')
time.sleep(2)

# Get the result page at once with requests
response = requests.get(dc.get_url())
response.encoding = response.apparent_encoding  # Guard against garbled characters
soup = BeautifulSoup(response.text, "html.parser")
side_td1s = soup.find_all(class_="side td1")  # Get all td elements with class "side td1"
lyric_urls = [side_td1.find('a', href=re.compile('song')).get('href') for side_td1 in side_td1s]  # href of each a tag whose href contains 'song'
music_names = [side_td1.find('a', href=re.compile('song')).text for side_td1 in side_td1s]  # Get all song titles

# Get the lyrics and append them to lyric_lis
lyric_lis = list()
for lyric_url in lyric_urls:
    lyric_lis.append(dc.get_lyric(BASE_URL + lyric_url))
with open(search_word + '_lyrics.txt', 'wt') as f_lyric, open(search_word + '_musics.txt', 'wt') as f_music:
    for lyric, music in zip(lyric_lis, music_names):
        f_lyric.write(lyric + '\n\n')
        f_music.write(music + '\n')
```

**Excerpt from the acquired lyrics**

 I'm really happy to see you
 All of them are sad as a matter of course
 I have painfully happy memories now
 Raise and walk the farewell that will come someday

 It's enough to take someone's place and live
 I wish I could be a stone
 If so, there is no misunderstanding or confusion
 That way without even knowing you

...

Data shaping

The data as it stands is far from the form shown in [Problem setting], so we now "format the data".

In other words, we do this:

スクリーンショット 2020-05-09 19.01.02.png

Format the data with the following code. The code is a little long, but this completes the preprocessing.

```python
from datasets import LyricDataset
import torch
import torch.optim as optim
from modules import *
from device import device
from utils import *
from dataloaders import SeqDataLoader
import math
import os

# ==========================================
# Data preparation
# ==========================================
# Path to Kenshi Yonezu_lyrics.txt
file_path = "lyric/Kenshi Yonezu_lyrics.txt"
edited_file_path = "lyric/Kenshi Yonezu_lyrics_edit.txt"

yonedu_dataset = LyricDataset(file_path, edited_file_path)
yonedu_dataset.prepare()
# Check
print(yonedu_dataset[0])

# Split into train and test at 8:2
train_rate = 0.8
data_num = len(yonedu_dataset)
train_set = yonedu_dataset[:math.floor(data_num * train_rate)]
test_set = yonedu_dataset[math.floor(data_num * train_rate):]
```

Below is datasets.py, which defines the LyricDataset class used above:

```python
from sklearn.model_selection import train_test_split
from janome.tokenizer import Tokenizer
import torch
from utils import *

class LyricDataset(torch.utils.data.Dataset):
    def __init__(self, file_path, edited_file_path, transform=None):
        self.file_path = file_path
        self.edited_file_path = edited_file_path
        self.tokenizer = Tokenizer(wakati=True)

        self.input_lines = []   # NN input array (each element is text)
        self.output_lines = []  # Correct-answer array for the NN (each element is text)
        self.word2id = {}       # e.g.) {'word0': 0, 'word1': 1, ...}

        self.input_data = []    # A lyric passage with each word converted to an ID
        self.output_data = []   # The next passage with each word converted to an ID

        self.word_num_max = None
        self.transform = transform

        self._no_blank()

    def prepare(self):
        # Build the arrays of input text and correct-answer text for the NN
        self.get_text_lines()

        # Assign an ID to every word that appears in the lyrics
        for line in self.input_lines + self.output_lines:  # First passages and the passages that follow
            self.get_word2id(line)

        # Find the maximum number of words in a passage
        self.get_word_num_max()

        # Register the space used for padding in the vocabulary
        self.word2id.setdefault(" ", len(self.word2id))

        # Build the ID arrays for the NN input and correct answers,
        # padding each passage with spaces up to the maximum length
        for input_line, output_line in zip(self.input_lines, self.output_lines):
            self.input_data.append([self.word2id[word] for word in self.line2words(input_line)]
                + [self.word2id[" "] for _ in range(self.word_num_max - len(self.line2words(input_line)))])
            self.output_data.append([self.word2id[word] for word in self.line2words(output_line)]
                + [self.word2id[" "] for _ in range(self.word_num_max - len(self.line2words(output_line)))])

    def _no_blank(self):
        # Remove blank lines and alphabet-only lines from the lyrics file
        with open(self.file_path, "r") as fr, open(self.edited_file_path, "w") as fw:
            for line in fr.readlines():
                if isAlpha(line) or line == "\n":
                    continue  # Skip alphabet-only lines and blank lines
                fw.write(line)

    def get_text_lines(self, to_file=True):
        """
        Takes the path of the lyrics file with blank lines removed and
        builds the input (text) and correct-answer (text) arrays.
        """
        # Read the lyrics file line by line and split it into
        # ("lyric passage", "next passage") = (input, output) pairs
        with open(self.edited_file_path, "r") as f:
            line_list = f.readlines()  # One lyric passage per line
            line_num = len(line_list)
            for i, line in enumerate(line_list):
                if i == line_num - 1:
                    continue  # The last passage has no "next passage"
                self.input_lines.append(line.replace("\n", ""))
                self.output_lines.append("_" + line_list[i+1].replace("\n", ""))

        if to_file:
            with open(self.edited_file_path, "w") as f:
                for input_line, output_line in zip(self.input_lines, self.output_lines):
                    f.write(input_line + " " + output_line + "\n")


    def line2words(self, line: str) -> list:
        word_list = [token for token in self.tokenizer.tokenize(line)]
        return word_list

    def get_word2id(self, line: str) -> dict:
        word_list = self.line2words(line)
        for word in word_list:
            if word not in self.word2id:
                self.word2id[word] = len(self.word2id)

    def get_word_num_max(self):
        # Find the longest passage in number of words
        word_num_list = []
        for line in self.input_lines + self.output_lines:
            word_num_list.append(len([self.word2id[word] for word in self.line2words(line)]))
        self.word_num_max = max(word_num_list)

    def __len__(self):
        return len(self.input_data)

    def __getitem__(self, idx):
        out_data = self.input_data[idx]
        out_label = self.output_data[idx]

        if self.transform:
            out_data = self.transform(out_data)

        return out_data, out_label
```
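As a quick sanity check (using the paths from the main.py above), every item the dataset returns is a pair of ID lists padded to the same maximum length:

```python
# Sanity check: items are equal-length, padded ID lists
dataset = LyricDataset("lyric/Kenshi Yonezu_lyrics.txt",
                       "lyric/Kenshi Yonezu_lyrics_edit.txt")
dataset.prepare()
x, y = dataset[0]
print(len(x) == len(y) == dataset.word_num_max)  # True
```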

Up to preprocessing this time

The code turned out to be longer than I expected, so this article stops at "data preprocessing". The implementation is covered in the sequel.
