Introduction

This time I'm going to do some simple scraping. In the heyday of subscription services, I doubt many people still collect music files locally and go as far as tagging the lyricist, composer, and arranger, but since the tagging turns out to be easy to automate, I would like to introduce it.

Let's try it

First, the structure of the Tower Records site. When I looked up the XPath of the element that holds the information we need, it looked like this:

//*[@id="RelationArtist_0_1_sub"]/div/div[3]/div[2]/a/text()

This points to the lyrics information for the first song on Disc 1. Reading the numbers from the front, they are the disc number (starting from 0), the track number, and the choice of lyrics (3), composition (4), or arrangement (5). Leave the last number alone.

Let's actually write the code. The libraries used are lxml (scraping), requests (fetching the page), and mutagen (music tags).
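To make the numbering in that XPath concrete, here is a minimal sketch that builds the expression from a disc index, a track number, and an item code. The helper name `build_xpath` and the `ITEM_DIV` table are hypothetical and are not part of tagget.py; the div positions are simply the ones observed in the example above, so they may change if the site layout changes.

```python
# Div positions taken from the article's example XPath:
# 3 = lyrics (W), 4 = composition (C), 5 = arrangement (A).
ITEM_DIV = {"W": 3, "C": 4, "A": 5}


def build_xpath(disc_index, track_no, item):
    """Return the XPath for one credit field of one track.

    disc_index is 0-based (Disc 1 -> 0) and track_no is 1-based.
    Hypothetical helper for illustration only.
    """
    div = ITEM_DIV[item]  # KeyError for anything other than W, C, A
    return (
        f'//*[@id="RelationArtist_{disc_index}_{track_no}_sub"]'
        f"/div/div[{div}]/div[2]/a/text()"
    )


# Lyrics for the first song on Disc 1, matching the XPath shown above.
print(build_xpath(0, 1, "W"))
```

Only the first three numbers vary; the trailing `div[2]` stays fixed, which is why the helper does not expose it as a parameter.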

`tagget.py`


from mutagen.flac import FLAC
from lxml import html
import os
import requests

class Net():
    def Tower(self, no, html2, disc, item):
        content = list()
        if item=="W": #Pick the div number for lyrics (W), composition (C), or arrangement (A).
            i = "3"
        elif item=="C":
            i = "4"
        elif item=="A":
            i = "5"
        else:
            raise ValueError("item must be 'W', 'C' or 'A'") #Otherwise i would be undefined below
        contentr = html2.xpath('//*[@id="RelationArtist_'+str(disc)+'_'+str(no)+'_sub"]/div/div['+i+']/div[2]/a/text()') #Specify location

        for c in contentr: #Handles any number of values instead of a fixed four
            content.append(c.strip('\'').strip())
        print(content) #Output which tags were acquired
        return content

class Main():
    def Towerget(self,files,url):
        n = Net()
        r = requests.get(url) #Load the page
        html2 = html.fromstring(r.content) #Parse the page
        for f in files:
            tag = FLAC(f) #Loading tags
            no = tag['tracknumber'][0].lstrip("0") #I entered one-digit track numbers as 0x, so strip the leading zero to match Tower Records.
            disc = int(tag['discnumber'][0].lstrip("0")) - 1 #The number representing the disc starts from 0, so adjust it.
            print(no)
            tag['word'] = n.Tower(no, html2, disc, item="W") #Lyrics tag input
            tag['composer'] = n.Tower(no, html2, disc, item='C') #Input composition tag
            tag['arranger'] = n.Tower(no, html2, disc, item="A") #Arrangement tag input
            print(tag.pprint()) #pprint() returns a string, so print it to see the tags
            tag.save() #Save tag

os.chdir(r"E:\music\Unorganized\Uchikubigokumon Club-Prison fifteen") #The path of the folder containing the files to tag (raw string so the backslashes are not treated as escapes)
files0 = os.listdir(os.getcwd()) #Get a list of files in a folder
files = list()

for f in files0: #Since the same folder also contains Google Drive management files, jacket photos, etc., only the flac files are taken out.
    if f.endswith(".flac"):
        files.append(f)
        print(f)
    else:
        print("not "+f)

m = Main()
url = "https://tower.jp/item/4936516/Prison fifteen" #The URL of the Tower Records page
m.Towerget(files, url)

It's not very clean code, but it gets the tags for the time being.
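The conversion from tag values to the numbering used on the page can also be pulled out into a small function. The following is only a sketch under the assumption that the tags hold plain numbers such as "03" or "1"; the name `site_indices` is hypothetical, and values in other formats (for example "3/12") would need extra handling.

```python
def site_indices(tracknumber, discnumber):
    """Convert FLAC tag values to the numbering used on the page.

    Returns (track, disc_index): the track number as a string without
    leading zeros, and the disc number as a 0-based integer.
    Hypothetical helper; assumes both values are plain digit strings.
    """
    track = str(int(tracknumber))      # "03" -> "3"
    disc_index = int(discnumber) - 1   # Disc 1 -> 0
    return track, disc_index


print(site_indices("03", "01"))
```

Going through `int()` instead of `lstrip("0")` gives the same result for values like "03", and it fails loudly with a ValueError when a tag is not a plain number.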

Improvement points

・ Tracks such as an Overture, which have no vocals and therefore no lyrics, end up out of sync.
・ Tower Records may not have entered the arrangement.
・ I want to get the URL of the Tower Records page automatically (this seems difficult).
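For the case where a credit is missing on the page, one possible approach is to write a tag only when something was actually scraped. This is a rough sketch rather than a fix I have verified against the site: `apply_credits` is a hypothetical helper, and the demo uses a plain dict and placeholder names in place of a mutagen FLAC object.

```python
def apply_credits(tag, credits):
    """Write only the non-empty credit lists into tag.

    tag is any dict-like object that accepts tag[key] = list_of_names.
    Returns the keys that were skipped because nothing was scraped.
    Hypothetical helper for illustration only.
    """
    skipped = []
    for key, names in credits.items():
        if names:
            tag[key] = names
        else:
            skipped.append(key)  # Leave any existing tag untouched
    return skipped


# Demo with a plain dict and placeholder names.
tag = {}
skipped = apply_credits(
    tag,
    {"word": ["Name A"], "composer": ["Name A", "Name B"], "arranger": []},
)
print(tag)
print(skipped)
```

This does not solve the out-of-sync problem for tracks without lyrics, but it avoids overwriting an existing tag with an empty value.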

Pulling songwriting, composition and arrangement information from the Tower Records site with Python
