[PYTHON] I tried web scraping to analyze the lyrics.

Purpose

I wanted to analyze the lyrics, but I tried scraping for the first time because it was difficult to collect the lyrics. To be honest, I was a little worried because I had never written HTML properly, but I was able to do what I wanted to do, so I would like to summarize it. I would appreciate it if you could give me some advice and mistakes.

This is the article that I referred to this time.

[I tried to find out where I want to go by using word2vec and lyrics for "Kenshi Yonezu's theory that I can't go anywhere"](https://qiita.com/k_eita/items/456895942c3dda4dc059#%E6%AD % 8C% E8% A9% 9E% E3% 81% AE% E3% 82% B9% E3% 82% AF% E3% 83% AC% E3% 82% A4% E3% 83% 94% E3% 83% B3 % E3% 82% B0)

The lyrics are available as text files in this article. This time, I rewrote it with reference to this code.

What we will get this time

  1. Song title
  1. Artist name
  2. Lyricist
  3. Composer
  4. Lyrics

These are the above five. The output format is csv.

The site scraped this time is "Uta-Net: Lyrics Search Service".

Library

import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

The contents of the code

# Get the website and output it in text format
def load(url):
    res = requests.get(url)
 #HTTPError throws HTTPError if the HTTP request returns a failed status code
    res.raise_for_status()
 #Get response body in text format
    return res.text

# Get html tag
def get_tag(html, find_tag):
    soup = BeautifulSoup(str(html), 'html.parser')
    tag = soup.find_all(find_tag)
    return tag

# Convert to a data structure that can be handled by the program
def parse(html):
    soup = BeautifulSoup(str(html), 'html.parser')
 Remove #html tag
    simple_row = soup.getText()
    simple_row = simple_row.replace('\n', '')
    simple_row = simple_row.replace(' ', '')
    
 #Delete alphanumeric characters (if needed)
    #simple_row = re.sub(r'[a-zA-Z0-9]', '', music_row)
 #Delete sign
 simple_row = re.sub (r'[<> ♪ `''" "・… _!?!-/:-@ [-` {-~]','', simple_row)
 #Delete notice
 simple_row = re.sub (r'Note:. +','', Simple_row)
    return simple_row

# Acquisition of song information for each
def get_info(url):
    base_url = 'https://www.uta-net.com/'
    html = load(url)
 #Store url for each song
    song_url = []
 #Store song
    song_info = []
    songs_info = []
    
 #Get song url
 Store url of #td
    for td in get_tag(html, 'td'):
 Get #a element
        for a in get_tag(td, 'a'):
 Whether the #href attribute contains song
            if 'song' in a.get ('href'):
 Add #url to array
                song_url.append(base_url + a.get('href'))
    
 #Get song information
    for i, page in enumerate(song_url):
 print ('{} song: {}'. format (i + 1, page))
        html = load(page)
        song_info = []
        
        #Song_Title
        for h2 in get_tag(html, 'h2'):
 Cast to str once to do #id search
            h2 = str(h2)
 #Whether or not it is a class element that stores lyrics
            if r'class="prev_pad"' in h2:
 #Remove unnecessary data
                simple_row = parse(h2)
                #print(simple_row, end = '\n')
                song_info.append(simple_row)   
            else:
                for h2 in get_tag(html, 'h2'):
                    h2 = str(h2)
                    simple_row = parse(h2)
                    song_info.append(simple_row)

        #Artist
        for h3 in get_tag(html, 'h3'):
            h3 = str(h3)
            if r'itemprop="byArtist"' in h3:
                simple_row = parse(h3)
                song_info.append(simple_row)

        #Lyricist
        for h4 in get_tag(html, 'h4'):
            h4 = str(h4)
            if r'itemprop="lyricist"' in h4:
                music = parse(h4)
                song_info.append(simple_row)

        #Composer
        for h4 in get_tag(html, 'h4'):
            h4 = str(h4)
            if r'itemprop="composer"' in h4:
                simple_row = parse(h4)
                song_info.append(simple_row)

        #Lyric
        for div in get_tag(html, 'div'):
            div = str(div)
            if r'itemprop="text"' in div:
                simple_row = parse(div)
                song_info.append(simple_row)
                songs_info.append(song_info)
 # 1 second wait (reduces server load)
                time.sleep(1)
                break
    return songs_info

def create_df(file_name, url):
 #Create a data frame
    #df = pd.DataFrame('Song_Title', 'Artist', 'Lyricist', 'Composer', 'Lyric')
    df = pd.DataFrame(get_info(url))
    df = df.rename(columns={0:'Song_Title', 1:'Artist', 2:'Lyricist', 3:'Composer', 4:'Lyric'})
 # CSV file output
    csv = df.to_csv("csv/{}.csv".format(file_name))
    return csv

By running the above code, you are ready for scraping. You can actually get the lyrics etc. by executing the code below. This time, I got the music of Minami-san. I also tried to make it easier to change the file name and url.

file_name = 'sample'
url = 'https://www.uta-net.com/artist/26099/'
url = 'https://www.uta-net.com/user/ranking/daily.html'
url = 'https://www.uta-net.com/user/ranking/monthly.html'
create_df(file_name, url)

Output result

Here is the data of the music acquired this time. Now you can analyze as many songs as you like. Screen Shot 2020-05-13 at 5.45.19.png

Summary from a digression (?)

I found it fun to make something that works as I intended. It has become an article with a strong self-satisfaction element, so I would like to update it later. (Since the explanation of the code is only commented out, ...) I also want to unify the writing style of Qiita in my own way. Next, I think I'll try natural language processing.

Recommended Posts

I tried web scraping to analyze the lyrics.
Qiita Job I tried to analyze the job offer
I tried to vectorize the lyrics of Hinatazaka46!
I tried web scraping with python.
I tried to move the ball
I tried to estimate the interval.
I tried to analyze the whole novel "Weathering with You" ☔️
I tried scraping
I tried to summarize the umask command
I tried to recognize the wake word
I tried to summarize the graphical modeling.
I tried to estimate the pi stochastically
I tried to touch the COTOHA API
I tried to make a Web API
I tried web scraping using python and selenium
I tried to optimize while drying the laundry
I tried to get an image by scraping
I tried to save the data with discord
I started to analyze
Introduction to Web Scraping
I tried to touch the API of ebay
[Python] I tried to analyze the pitcher who achieved no hit no run
I tried to correct the keystone of the image
Stock price plummeted with "new corona"? I tried to get the Nikkei Stock Average by web scraping
I tried to debug.
(Python) I tried to analyze 1 million hands ~ I tried to estimate the number of AA ~
I tried to verify and analyze the acceleration of Python by Cython
I tried to paste
I tried to analyze the negativeness of Nono Morikubo. [Compare with Posipa]
LeetCode I tried to summarize the simple ones
I tried to analyze the New Year's card by myself using python
I tried to implement the traveling salesman problem
I tried to predict the price of ETF
I tried to learn the sin function with chainer
I tried to graph the packages installed in Python
I tried to detect the iris from the camera image
I tried to summarize the basic form of GPLVM
I tried to touch the CSV file with Python
I tried to predict the J-League match (data analysis)
I tried to solve the soma cube with python
I tried to approximate the sin function using chainer
I tried to put pytest into the actual battle
[Python] I tried to graph the top 10 eyeshadow rankings
I tried to linguistically analyze Karen Takizawa's incomprehensible sentences.
I tried to visualize the spacha information of VTuber
I tried to erase the negative part of Meros
I tried to solve the problem with Python Vol.1
I tried scraping the advertisement of the pirated cartoon site
I tried to analyze J League data with Python
I tried to simulate the dollar cost averaging method
I tried to redo the non-negative matrix factorization (NMF)
I tried to identify the language using CNN + Melspectogram
I tried to notify the honeypot report on LINE
I tried to complement the knowledge graph using OpenKE
I tried to classify the voices of voice actors
I tried to compress the image using machine learning
I tried to summarize the string operations of Python
I tried scraping with Python
I tried to learn PredNet
I tried to organize SVM.
I tried to implement PCANet