[PYTHON] I tried web scraping to analyze the lyrics.


I wanted to analyze song lyrics, but collecting them by hand was difficult, so I tried web scraping for the first time. To be honest, I was a little worried because I had never written HTML properly, but I managed to do what I set out to do, so I would like to summarize it here. I would appreciate any advice or corrections.

This is the article that I referred to this time.

[I tried to find out where I want to go by using word2vec and lyrics for "Kenshi Yonezu's theory that I can't go anywhere"](https://qiita.com/k_eita/items/456895942c3dda4dc059#%E6%AD%8C%E8%A9%9E%E3%81%AE%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0)

In that article, the scraped lyrics are saved as text files. This time, I rewrote that code to suit my own purpose.

What we will get this time

  1. Song title
  2. Artist name
  3. Lyricist
  4. Composer
  5. Lyrics

Those are the five items. The output format is CSV.

The site scraped this time is "Uta-Net: Lyrics Search Service".


import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

The contents of the code

# Fetch a web page and return its body as text
def load(url):
    res = requests.get(url)
    # Raise HTTPError if the HTTP request returned a failed status code
    res.raise_for_status()
    # Return the response body as text
    return res.text

# Get html tag
def get_tag(html, find_tag):
    soup = BeautifulSoup(str(html), 'html.parser')
    tag = soup.find_all(find_tag)
    return tag
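
`get_tag` returns every tag with a given name. BeautifulSoup's `find` can also filter by attribute in the same call, which avoids converting tags to strings and searching them as text. The sketch below runs on a small hand-written HTML snippet. The snippet and the helper name `first_text` are assumptions for illustration; only the attribute values (`prev_pad`, `byArtist`, `lyricist`, `composer`, `text`) come from the code in this article, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a song page (an assumption, not real Uta-Net markup).
SAMPLE_HTML = """
<h2 class="prev_pad">Sample Song</h2>
<h3 itemprop="byArtist">Sample Artist</h3>
<h4 itemprop="lyricist">Writer A</h4>
<h4 itemprop="composer">Composer B</h4>
<div><div itemprop="text">La la la</div></div>
"""


def first_text(soup, name, attrs):
    """Return the stripped text of the first matching tag, or '' if none matches."""
    tag = soup.find(name, attrs=attrs)
    return tag.get_text(strip=True) if tag else ''


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
row = [
    first_text(soup, 'h2', {'class': 'prev_pad'}),
    first_text(soup, 'h3', {'itemprop': 'byArtist'}),
    first_text(soup, 'h4', {'itemprop': 'lyricist'}),
    first_text(soup, 'h4', {'itemprop': 'composer'}),
    first_text(soup, 'div', {'itemprop': 'text'}),
]
print(row)
```

Because a missing element yields an empty string, every row keeps exactly five entries, which keeps the CSV columns aligned.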

# Convert the HTML into plain text that the program can handle
def parse(html):
    soup = BeautifulSoup(str(html), 'html.parser')
    # Remove HTML tags
    simple_row = soup.get_text()
    simple_row = simple_row.replace('\n', '')
    simple_row = simple_row.replace(' ', '')
    # Delete alphanumeric characters (if needed)
    #simple_row = re.sub(r'[a-zA-Z0-9]', '', simple_row)
    # Delete symbols (a few extra characters plus the ASCII punctuation ranges)
    simple_row = re.sub(r'[<>♪`‘’“”・…_!?!-/:-@\[-`{-~]', '', simple_row)
    # Delete the notice text
    simple_row = re.sub(r'Note:.+', '', simple_row)
    return simple_row
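
To see what the symbol-removal step does, the character class can be tried on a short string. This is a minimal sketch using only `re`. The function name `strip_symbols` and the sample text are made up, and the pattern is a simplified variant (the four ASCII punctuation ranges plus three extra symbols), not necessarily the exact one used above.

```python
import re

# ASCII punctuation in four ranges (!-/, :-@, [-`, {-~) plus a few non-ASCII symbols.
SYMBOLS = re.compile(r'[!-/:-@\[-`{-~♪・…]')


def strip_symbols(text):
    """Remove all whitespace, then every character matched by SYMBOLS."""
    text = re.sub(r'\s+', '', text)
    return SYMBOLS.sub('', text)


print(strip_symbols('Hello, world! ♪ (la la…)'))
```

Letters and digits fall outside all four ranges, so they pass through unchanged.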

# Collect the information for each song
def get_info(url):
    base_url = 'https://www.uta-net.com/'
    html = load(url)
    # URLs of the individual song pages
    song_url = []
    # Information for one song, and for all songs
    song_info = []
    songs_info = []
    # Collect the song URLs
    # Look inside each td element
    for td in get_tag(html, 'td'):
        # Get the a elements
        for a in get_tag(td, 'a'):
            href = a.get('href')
            # Keep only links whose href attribute contains 'song'
            # (a.get returns None when the tag has no href, so check that first)
            if href and 'song' in href:
                # Add the URL to the list
                song_url.append(base_url + href)
    # Get the information for each song
    for i, page in enumerate(song_url):
        print('Song {}: {}'.format(i + 1, page))
        html = load(page)
        song_info = []
        # Song title
        for h2 in get_tag(html, 'h2'):
            # Cast to str once so the attribute can be searched as text
            h2 = str(h2)
            # Check whether this is the element with the target class
            if r'class="prev_pad"' in h2:
                # Remove unnecessary data
                simple_row = parse(h2)
                song_info.append(simple_row)
                #print(simple_row, end = '\n')

        # Artist
        for h3 in get_tag(html, 'h3'):
            h3 = str(h3)
            if r'itemprop="byArtist"' in h3:
                simple_row = parse(h3)
                song_info.append(simple_row)

        # Lyricist
        for h4 in get_tag(html, 'h4'):
            h4 = str(h4)
            if r'itemprop="lyricist"' in h4:
                simple_row = parse(h4)
                song_info.append(simple_row)

        # Composer
        for h4 in get_tag(html, 'h4'):
            h4 = str(h4)
            if r'itemprop="composer"' in h4:
                simple_row = parse(h4)
                song_info.append(simple_row)

        # Lyrics
        for div in get_tag(html, 'div'):
            div = str(div)
            # Check only the opening tag: the string of an enclosing div
            # also contains the markup of the divs nested inside it
            if r'itemprop="text"' in div.split('>', 1)[0]:
                simple_row = parse(div)
                song_info.append(simple_row)

        songs_info.append(song_info)
        # Wait 1 second (reduces server load)
        time.sleep(1)
    return songs_info
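
One detail in the URL-building step: `base_url` ends with `/`, so if an `href` also starts with `/`, plain string concatenation produces a double slash. The standard library's `urllib.parse.urljoin` resolves a link against a base URL the way a browser would. The following is a minimal sketch; the function name `absolute_song_urls` is hypothetical and the `href` values are invented examples, not values taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.uta-net.com/'


def absolute_song_urls(hrefs, base_url=BASE_URL):
    """Keep hrefs that mention 'song' and resolve them against base_url."""
    return [urljoin(base_url, href) for href in hrefs if href and 'song' in href]


# Invented hrefs; None stands for an <a> tag without an href attribute.
hrefs = ['/song/12345/', 'song/67890/', '/artist/26099/', None]
print(absolute_song_urls(hrefs))
```

Both the absolute-path and the relative form resolve to a URL with a single slash after the host name.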

def create_df(file_name, url):
    # Create a data frame
    df = pd.DataFrame(get_info(url))
    df = df.rename(columns={0:'Song_Title', 1:'Artist', 2:'Lyricist', 3:'Composer', 4:'Lyric'})
    # Write the CSV file (the csv/ directory must already exist)
    df.to_csv("csv/{}.csv".format(file_name))
    # to_csv returns None when it is given a path, so return the data frame instead
    return df
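
For reference, here is a small sketch of the DataFrame-to-CSV step on made-up rows, writing to an in-memory buffer so that no file or `csv/` directory is needed. Passing `columns=` when building the frame is one alternative to renaming the columns afterwards. `index=False` is an optional choice that leaves out the row-number column; the code above does not use it.

```python
import io

import pandas as pd

COLUMNS = ['Song_Title', 'Artist', 'Lyricist', 'Composer', 'Lyric']

# Made-up rows, one list per song, in the same order as COLUMNS.
rows = [
    ['Song A', 'Artist X', 'Writer 1', 'Composer 1', 'lalala'],
    ['Song B', 'Artist X', 'Writer 2', 'Composer 2', 'nanana'],
]

df = pd.DataFrame(rows, columns=COLUMNS)

# Write to a text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that `columns=` raises an error if a row does not have exactly five entries, which can be a useful early warning that a field was missing on some page.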

Once the code above has been run, everything is ready for scraping. Executing the code below actually fetches the lyrics and the other fields. This time, I collected the songs of Minami. I also made the file name and URL easy to change.

file_name = 'sample'
url = 'https://www.uta-net.com/artist/26099/'
# Other list pages can be used instead, for example:
# url = 'https://www.uta-net.com/user/ranking/daily.html'
# url = 'https://www.uta-net.com/user/ranking/monthly.html'
create_df(file_name, url)
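
If several list pages are scraped in one go, one option is to derive the file name from the URL instead of typing it by hand. The helper `file_name_from_url` below is hypothetical (it is not part of the code above) and uses only the standard library. The call to `create_df` is left as a comment because it needs network access.

```python
from urllib.parse import urlparse


def file_name_from_url(url):
    """Build a file name such as 'artist_26099' from the path part of a URL."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    if not segments:
        return 'index'
    # Drop a trailing extension such as '.html' from the last segment.
    segments[-1] = segments[-1].rsplit('.', 1)[0]
    return '_'.join(segments)


urls = [
    'https://www.uta-net.com/artist/26099/',
    'https://www.uta-net.com/user/ranking/daily.html',
    'https://www.uta-net.com/user/ranking/monthly.html',
]
for url in urls:
    name = file_name_from_url(url)
    print(name)
    # create_df(name, url)  # uncomment to actually scrape each page
```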

Output result

Here is the song data I collected this time. Now I can analyze as many songs as I like.

Summary from a digression (?)

It was fun to build something that works the way I intended. This article ended up being mostly for my own satisfaction, so I would like to update it later (for one thing, the only explanation of the code is in the comments). I also want to settle on a consistent writing style for my Qiita articles. Next, I think I'll try natural language processing.

