[PYTHON] I vectorized the chords of songs with word2vec and visualized them with t-SNE

Overview

A song is made up of chords. The order in which they are arranged is very important, and it changes the emotion of the song. A sequence of several chords is called a chord progression, and there are typical ones such as the canon progression, [I-V-VIm-IIIm-IV-I-IV-V]. **Since order matters so much, I tested the hypothesis that if you treat a song as a sentence and a chord as a word, you can see the relationships between chords by vectorizing them with word2vec and compressing them into two dimensions with t-SNE.** That was a stiff way of putting it, but I would be grateful if you could sympathize with the feeling that **it would be cool to analyze chords (music) with code (programming)**. (In Japanese, "chord" and "code" are written the same way: コード.)

(This is pure speculation, but there is a well-known book on how to write code called "Readable Code," and I suspect this wordplay is why its cover shows musical notes.)

[Click here for Readable Code] https://www.oreilly.co.jp/books/9784873115658/

Main technology

・ Scraping (selenium == 3.141.0)

・ Word2Vec (gensim == 3.7.3)

・ t-SNE (scikit-learn == 0.20.3)

Prerequisite knowledge

・ Basic Python syntax

・ Understanding chord progressions (If you don't understand, skip Chapter 2)

Target audience

・ People who know chords and chord progressions

・ People who want to replace chord progressions with Roman numerals

・ People studying Python

・ People who want to know how to scrape

・ People who are interested in machine learning (natural language processing)

Chapter structure

I will write it in three chapters.

**Chapter 1: Collecting data by scraping using selenium** (about 100 lines)

**Chapter 2: Replacing chord progressions with Roman numerals** (about 150 lines)

**Chapter 3: Vectorize the chords with word2vec and show them with t-SNE** (about 50 lines)

If you want a reference for scraping with Selenium, see Chapter 1; if you are interested in music, see Chapter 2; and if you are interested in what kind of results word2vec produces, see Chapter 3.

Now let's move on to the code content.

Chapter 1: Collecting data by scraping using selenium

This time, the data comes from the U-FRET site. It has a large number of songs, accurate chords, and lyrics, so anyone who sings while playing an instrument has probably used it at least once.

[Click here for U-FRET] https://www.ufret.jp/

Here, we specify an artist and write code that outputs the chord progressions and lyrics of all of that artist's songs to CSV. (Please work out from context whether コード means "chord" or "code." lol)

Selenium and Beautiful Soup are probably the best-known tools for scraping with Python. Selenium drives a real browser through a driver and actually navigates between pages to get elements, whereas Beautiful Soup only extracts elements from the HTML of a URL you specify. Beautiful Soup is easy to write and approachable, but it is limited in what it can do, and some elements cannot be retrieved without Selenium, so I used Selenium this time.

scraping.py


from time import sleep
from selenium import webdriver
import chromedriver_binary
import os
import csv
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.select import Select

#Artist input
artist = "aiko"

#Output directory for the per-song CSV files
fdir = "chord_data/" + artist + "/"
os.makedirs(fdir,exist_ok=True)
#URL to access
TARGET_URL ='https://www.ufret.jp/search.php?key=' + artist


#Browser options (sets the User-Agent string)
options = webdriver.ChromeOptions()
options.add_argument('--user-agent=hogehoge')
#Launch browser
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(5)
driver.get(TARGET_URL)
url_list= []
urls = driver.find_elements_by_css_selector(".list-group-item.list-group-item-action")
for url in urls:
    text = url.text
    #NOTE: on the actual site these labels are in Japanese; the strings below are English renderings
    if ("Easy code for beginners" not in text) and ("Video Plus" not in text):
        url = url.get_attribute("href")
        if type(url) is str:
            if "song." in url:
                url_list.append(url)

for url in url_list:
    #sleep(3)
    #driver.implicitly_wait(3)
    driver.get(url)
    sleep(5)
    #Change to the original song key
    elem = driver.find_element_by_name('keyselect')
    select = Select(elem)
    select.select_by_value('0')
    sleep(1)
    #Get song title
    title_elem = driver.find_element_by_class_name('show_name')
    title = title_elem.text
    print(title)
    sleep(1)
    #Get chords
    chord_list = []
    chord_elems = driver.find_elements_by_tag_name("rt")
    for chord in chord_elems:
        chord = chord.text
        chord_list.append(chord)

    #Get lyrics
    lyric_list = []
    lyric_elems = driver.find_elements_by_class_name("chord")
    for lyric in lyric_elems:
        lyric = lyric.text.replace('\n',"")
        lyric_list.append(lyric)
    #Get the lyric fragments that have no chord attached
    no_chord_lyric_list = []
    no_chord_lyric_elems = driver.find_elements_by_class_name("no-chord")
    for no_chord_lyric in no_chord_lyric_elems:
        no_chord_lyric = no_chord_lyric.text
        #Remove the chordless fragment from the lyrics list and merge it into a neighboring entry so that lyrics stay aligned with the chords
        idx = lyric_list.index(no_chord_lyric)
        lyric_list.remove(no_chord_lyric)
        if idx==0:
            lyric_list[0] = no_chord_lyric + lyric_list[0]
        else:
            lyric_list[idx-1] += no_chord_lyric

    #Remove the chord name from each lyric entry, leaving only the lyrics
    lyric_list = [lyric.replace(chord_list[idx],"") if chord_list[idx] in lyric else lyric for idx,lyric in enumerate(lyric_list)]
    
    #Output scraping result to csv (newline="" is what the csv module expects for files it writes)
    fname = fdir + title + ".csv"
    with open(fname, "w", encoding="cp932", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([])
        writer.writerow(chord_list)
        writer.writerow([])
        writer.writerow(lyric_list)

Here, the artist is aiko. (The final results include the analysis results of other artists.)

If you are using a Mac, you can find out where a given element is by pressing Option + Command + I to open the developer tools.

Getting the elements themselves is easy because you only have to specify the target class or tag. Matching the lyrics to the chord progression is a little more involved, but it is not a problem if you do not follow that part.

The most important thing is to insert waits with sleep, both **to avoid putting load on U-FRET's server and to allow for page load time**. In the example above, the server is accessed at intervals of about 5 seconds. Even so, it sometimes fails to get an element, so retry a few times or increase the wait. (I had the impression that implicitly_wait was not working. I would be grateful if anyone who knows the details could let me know.)
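As one alternative to fixed sleeps, a page can be polled until the element actually appears. The following is only a minimal sketch in plain Python: `wait_until` is a hypothetical helper name that does not appear in scraping.py, and the timeout and interval values are arbitrary. Selenium's own WebDriverWait, which scraping.py already imports, is meant for the same job.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within the timeout")
        time.sleep(interval)


# Hypothetical usage inside the scraping loop (needs a live Selenium driver):
# chord_elems = wait_until(lambda: driver.find_elements_by_tag_name("rt"), timeout=15)
```

Because find_elements_by_tag_name returns an empty list when nothing is found, the lambda stays falsy until the chords are on the page. A polite pause between songs is still needed, since polling only deals with page load time and not with server load.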

By the way, the output here looks like this. [Screenshot 2020-11-01 16.51.10.png]

A CSV named after the song is output. The chord progression is on the second line and the corresponding lyrics are on the fourth line. The first part is the intro, so it has no lyrics. Don't worry about the blank lines for now; they will be filled in in Chapter 2.
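For reference, the four-row layout described above can be read back with the csv module. This is a sketch under my own assumptions: `read_song_rows` is a hypothetical helper, and the sample text is made up for illustration rather than taken from a real song.

```python
import csv
import io


def read_song_rows(f):
    """Return (chords, lyrics) from a file in the four-row layout:
    row 1 blank, row 2 chords, row 3 blank, row 4 lyrics."""
    rows = list(csv.reader(f))
    return rows[1], rows[3]


# Made-up sample in the same layout (rows 1 and 3 are blank).
sample = '\r\nC,G,Am\r\n\r\n,la la,"oh, yeah"\r\n'
chords, lyrics = read_song_rows(io.StringIO(sample, newline=""))
print(chords)
print(lyrics)
```

For a real file, pass something like open(fname, encoding="cp932", newline="") instead of the StringIO object. Using csv.reader rather than splitting on commas keeps lyric fragments that contain a comma intact.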

Chapter 2: Replacing chord progressions with Roman numerals

This chapter will only make sense to people who are somewhat familiar with music, so skip it if you are not. If you know machine learning, it is enough to think of it as **standardization performed as data preprocessing**. Put simply, it **turns absolute notation into relative notation so that all the data can be treated on the same basis**.

To explain a little: every piece of music has a central note, which is called the key. Some of you have probably lowered the key at karaoke because a song was too high. For example, I wrote the canon progression at the beginning as [I-V-VIm-IIIm-IV-I-IV-V], but you cannot play it from that alone because it does not tell you the actual chords. This Roman-numeral notation is the result of the standardization we want to perform in this chapter. An actual progression would be something like [D-A-Bm-F#m-G-D-G-A], which is the canon progression in key = D. The D chord is I, the A chord is V, and so on: the Roman numerals indicate the position relative to the key on the keyboard.

Now, how do we estimate the key? **We compare the diatonic chords of all 12 keys with the triads of all the chords in the song, and pick the key with the highest match rate.** If you are familiar with music, you know what I mean, right? lol
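To make the relative notation concrete, here is a small sketch that converts the key = D example into Roman numerals. It is a simplification for illustration, not the script used in this article: `to_roman` is a hypothetical helper, it uses ASCII Roman numerals, and it assumes sharp-only root names and no slash chords.

```python
KEYS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
ROMAN = ["I", "#I", "II", "#II", "III", "IV", "#IV", "V", "#V", "VI", "#VI", "VII"]


def to_roman(chord, key):
    """Rewrite the root of `chord` as a degree relative to `key`."""
    # The root is one letter, optionally followed by a sharp.
    root = chord[:2] if len(chord) > 1 and chord[1] == "#" else chord[:1]
    # Distance in semitones from the key note, wrapped into 0..11.
    degree = (KEYS.index(root) - KEYS.index(key)) % 12
    # Keep whatever follows the root (m, 7, ...) unchanged.
    return ROMAN[degree] + chord[len(root):]


canon_in_d = ["D", "A", "Bm", "F#m", "G", "D", "G", "A"]
print([to_roman(chord, "D") for chord in canon_in_d])
# ['I', 'V', 'VIm', 'IIIm', 'IV', 'I', 'IV', 'V']
```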
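The idea can be sketched compactly as follows. This is a condensed illustration under my own assumptions rather than the exact script: the names `to_triad`, `diatonic_chords`, and `estimate_key` are hypothetical, the regular expression is one possible way to reduce a chord to its triad, and flats are assumed to have been converted to sharps already. Because a major key and its relative minor share the same diatonic chords, this kind of count always reports the major key name.

```python
import re
from collections import Counter

KEYS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]                # whole and half steps of the major scale
QUALITIES = ["", "m", "m", "", "", "m", "m7-5"]    # I, IIm, IIIm, IV, V, VIm, VIIm7-5
TRIAD_RE = re.compile(r"^([A-G]#?)(m7-5|m(?!aj))?")


def to_triad(chord):
    """Reduce e.g. 'F#m7' to 'F#m'; 'Bm7-5' is kept so it can match the VII chord."""
    match = TRIAD_RE.match(chord)
    return match.group(1) + (match.group(2) or "") if match else chord


def diatonic_chords(key):
    """Return the seven diatonic chords of a major key."""
    pos = KEYS.index(key)
    chords = []
    for step, quality in zip(MAJOR_STEPS, QUALITIES):
        chords.append(KEYS[pos % 12] + quality)
        pos += step
    return chords


def estimate_key(chords):
    """Return (key, match rate) for the key whose diatonic chords match most often."""
    counts = Counter(to_triad(chord) for chord in chords)
    total = sum(counts.values())
    scores = {key: sum(counts[c] for c in diatonic_chords(key)) for key in KEYS}
    best = max(scores, key=scores.get)
    return best, scores[best] / total


print(diatonic_chords("C"))
print(estimate_key(["D", "A7", "Bm", "F#m7", "G", "D", "G", "A"]))
```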

So, it's a complicated process, but let's take a look at the code.

scraping.py


import csv
import glob
import collections
import re
import os

artist = "aiko"
#Data directory
fdir = "chord_data/" + artist + "/"
os.makedirs(fdir,exist_ok=True)
#Key names
key_list = ["C","C#","D","D#","E","F","F#","G","G#","A","A#","B"]
#Roman numerals
rome_list = ["Ⅰ","#Ⅰ","Ⅱ","#Ⅱ","Ⅲ","Ⅳ","#Ⅳ","Ⅴ","#Ⅴ","Ⅵ","#Ⅵ","Ⅶ"]
#Diatonic chords
dtnc_chord_list_num = ["Ⅰ","Ⅱm","Ⅲm","Ⅳ","Ⅴ","Ⅵm","Ⅶm7-5"]
#Major scale steps: whole, whole, half, whole, whole, whole, half
dtnc_step = [2,2,1,2,2,2,1]
#Convert ♭ to #
r_dict = {'D♭':'C#', 'E♭':'D#', 'G♭':'F#','A♭':'G#','B♭':'A#'}
#Chord progression loading
flist = glob.glob(fdir+"*")

dtnc_chord_list_arr = []

for idx,key in enumerate(key_list):
    pos = idx
    dtnc_chord_list = []
    for num,step in enumerate(dtnc_step):
        #Root note + chord quality (major, minor, or m7-5)
        dtnc_chord_list.append(key_list[pos]+dtnc_chord_list_num[num][1:])
        pos += step
        #Wrap pos around so that it stays below 12
        if pos >= len(key_list):
            pos -= len(key_list)
    #Store the diatonic chords of each of the 12 keys
    dtnc_chord_list_arr.append(dtnc_chord_list)


for fname in flist:
    with open(fname,encoding='cp932',mode='r+') as f:
        f.readline().rstrip('\n')
        chord_list = f.readline().rstrip('\n')
        f.readline().rstrip('\n')
        lyric_list = f.readline().rstrip('\n')
        #Convert ♭ to #
        chord_list = re.sub('({})'.format('|'.join(map(re.escape, r_dict.keys()))), lambda m: r_dict[m.group()], chord_list)
        chord_list = chord_list.split(',')
        chord_list_origin = chord_list.copy()
        #Remove N.C. (no chord)
        chord_list = [chord for chord in chord_list if "N" not in chord]
        #Reduce a chord to its triad (m7-5 is kept because it is a diatonic chord)
        def get_triad(chord):
            split_chord = list(chord)
            triad = split_chord[0]
            if len(split_chord )>=2 and split_chord[1]=='#':
                triad += split_chord[1]
                if len(split_chord )>=3 and split_chord[2]=='m':
                    triad += split_chord[2]
                    #Keep "7-5" (e.g. C#m7-5); length 6 is needed because index 5 is read
                    if len(split_chord )>=6 and split_chord[4]=='-':
                        triad += split_chord[3]
                        triad += split_chord[4]
                        triad += split_chord[5]

            elif len(split_chord )>=2 and split_chord[1]=='m':
                triad += split_chord[1]
                #Keep "7-5" (e.g. Bm7-5); length 5 is needed because index 4 is read
                if len(split_chord )>=5 and split_chord[3]=='-':
                    triad += split_chord[2]
                    triad += split_chord[3]
                    triad += split_chord[4]

            return triad
        #Change to triad
        chord_list_triad = [get_triad(chord) for chord in chord_list]
        length = len(chord_list)
        #Count occurrences of each distinct chord
        chord_unique = collections.Counter(chord_list_triad)
        #print(chord_unique)

        #################
        ###Determine the key###
        #################

        match_cnt_arr = []
        #Count the matches against each of the 12 keys
        for dtnc_chord_list in dtnc_chord_list_arr:
            match_cnt = 0
            #Compare each of the 7 diatonic chords with the keys of chord_unique;
            #if there is a match, add its count to match_cnt
            for dtnc_chord in dtnc_chord_list:
                if dtnc_chord in chord_unique.keys():
                    match_cnt += chord_unique[dtnc_chord]
            match_cnt_arr.append(match_cnt)
        #Highest match count among the 12 keys
        max_cnt = max(match_cnt_arr)
        #Number of matched chords / total number of chords (%)
        match_prb = int((max_cnt/length)*100)

        #Key determination
        key_pos = match_cnt_arr.index(max_cnt)
        key = key_list[key_pos]
        dtnc_chord_list = dtnc_chord_list_arr[key_pos]
        file_name = os.path.basename(fname).replace('.csv','')
        print('Song title:{0} , key:{1} , prob:{2}'.format(file_name,key,match_prb))
        print(dtnc_chord_list)
        key_list_chromatic = key_list[key_pos:] + key_list[:key_pos]
        # key_list_chromatic.extend(key_list[key_pos:])
        # key_list_chromatic.extend(key_list[:key_pos])
        print(key_list_chromatic)

        #Convert the chords to Roman numerals and write them to the file
        #Conversion function
        def convert_num_chord(chord_list):
            s_list = []
            n_list = []
            for idx, root in enumerate(key_list_chromatic):
                #Roots with a sharp must be replaced first, so split the roots by whether they contain a sharp
                if '#' in root:
                    s_list.append([idx,root])
                else:
                    n_list.append([idx,root])
            chord_list = ['*' if "N" in chord else chord for chord in chord_list]
            for idx,root in s_list:
                chord_list = [chord.replace(root,rome_list[idx]) if root in chord else chord for chord in chord_list]
            for idx,root in n_list:
                chord_list = [chord.replace(root,rome_list[idx]) if root in chord else chord for chord in chord_list]
            chord_list = ['N.C.' if "*" in chord else chord for chord in chord_list]
            
            return chord_list

        chord_list_converted = convert_num_chord(chord_list_origin)
        print(chord_list_origin)
        print(chord_list_converted)
    with open(fname, "w", encoding="cp932", newline="") as f:
        writer = csv.writer(f)
        writer.writerow('key:{0},prob:{1}'.format(key,match_prb).split(','))
        writer.writerow(chord_list_origin)
        writer.writerow(chord_list_converted)
        writer.writerow(lyric_list.split(','))

How was that? It is long and hard to follow, but it is enough to recognize that the data has been standardized. The processing is light, so it finishes in an instant. Running this code gives the following result: [Screenshot 2020-11-01 17.43.45.png]

On the first line you can see that key = D# and that the match rate is 67%, and on the third line the standardized chord progression in Roman numerals has been added. (aiko's chords are quite complex and many non-diatonic chords appear, which is why only 67% of the chords are diatonic. Even so, the key is estimated correctly, which I think supports the validity of this approach.)

Now, in the last chapter, let's vectorize and plot these standardized chords.

Chapter 3: Vectorize the chords with word2vec and show them with t-SNE

Now, let's feed the data for about 200 aiko songs into word2vec and vectorize it. I set the number of dimensions to 10 because there are at most about 100 distinct chords. For window, I specified 4, because the role of a chord can be judged well enough from the 4 chords before and after it. I also set sg = 0 to use CBOW (Continuous Bag-of-Words), which predicts the center word from its surroundings. This turns each chord into a 10-dimensional vector. To view the result as a figure, I compressed it to two dimensions with t-SNE and plotted it. The perplexity setting is a tricky one, but I set it to 3 because smaller values separate the clusters more clearly.

Some of you may be thinking: if the dimensions are going to be compressed anyway, why not make the chord representation two-dimensional from the start? When I did that, there was almost no difference between the chords, so I used 10 dimensions to give the representation more expressive power.
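To picture what window = 4 and CBOW mean for chords, the sketch below lists, for each chord in a progression, the surrounding chords that would be used to predict it. It is only a conceptual illustration: `cbow_examples` is a hypothetical helper, and it ignores the implementation details of gensim's actual training.

```python
def cbow_examples(chords, window=4):
    """Return (context, target) pairs: the chords within `window`
    positions before and after each target chord."""
    examples = []
    for i, target in enumerate(chords):
        left = chords[max(0, i - window):i]
        right = chords[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples


progression = ["I", "V", "VIm", "IIIm", "IV", "I", "IV", "V"]
for context, target in cbow_examples(progression, window=4):
    print(context, "->", target)
```

Chords that tend to appear in similar contexts receive similar vectors, which is why the order of the progression matters here.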

Now let's look at the code.

analyze.py


from gensim.models import Word2Vec
import glob
import itertools
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

artist = "aiko"
#Data directory
fdir = "chord_data/" + artist + "/"
#Chord progression loading
flist = glob.glob(fdir+"*")
music_list = []
for fname in flist:
    with open(fname,encoding='cp932',mode='r+') as f:
        f.readline().rstrip('\n')
        f.readline().rstrip('\n')
        chord_list = f.readline().rstrip('\n').split(',')
        f.readline().rstrip('\n')
        lyric_list = f.readline()
        chord_list = [chord for chord in chord_list if "N" not in chord]
        music_list.append(chord_list)

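#Vectorize the chords
#(size and iter are the gensim 3.x argument names; gensim 4 renamed them to vector_size and epochs)
model = Word2Vec(music_list,sg=0,window=4,min_count=0,iter=100,size=10)
chord_unique = list(set(itertools.chain.from_iterable(music_list)))
#Look up vectors through model.wv, which works in both gensim 3.x and 4.x
data = [model.wv[chord] for chord in chord_unique]
print(model.wv.most_similar('Ⅱ'))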


#Compress to 2D and draw
tsne = TSNE(n_components=2, random_state = 0, perplexity = 3, n_iter = 1000)
data_tsne = tsne.fit_transform(data)

fig=plt.figure(figsize=(50,25),facecolor='w')

plt.rcParams["font.size"] = 10

for i,chord in enumerate(data_tsne):
    #Plot each chord as a point and label it
    plt.plot(data_tsne[i][0], data_tsne[i][1], ms=5.0, zorder=2, marker="x",color="red")
    plt.annotate(chord_unique[i],(data_tsne[i][0], data_tsne[i][1]), size=10)

#plt.show()
plt.savefig("chord_data/" + artist + ".jpg")
 

The result obtained with this code is as follows. [Figure: aiko.jpg]

aiko uses many kinds of chords, so it is hard to find clusters, but, for example, in the circle to the lower left of center, the triads I, IV, and V sit close together. This agrees with music theory: these are the most frequently occurring chords, and their vectors are probably close because they are so closely related. The same circle contains II, which is a non-diatonic chord, but I think it ended up nearby because it is often placed before V as a secondary dominant (the "double dominant"). #V and #VI are also there; they are often used in the well-known progression #V → #VI → I (that is, ♭VI → ♭VII → I), which is probably why these two chords are close to each other. In addition, the upper circle contains a group of seventh chords, so I found the result interesting to look at.

While I'm at it, I will post the results for other artists as well. First, here is Official髭男dism (Official Hige Dandism). [Figure: Official髭男dism.jpg]

The other is an artist called andymori. This one may be easier to read because they use fewer chords. [Figure: andymori.jpg]
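Closeness can also be checked in the original 10-dimensional space rather than in the 2D plot. gensim's most_similar ranks by cosine similarity, and the sketch below shows that calculation in plain Python. The helper names are hypothetical, and the 2D vectors are made-up numbers for illustration only, not results from this article.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(vectors, query):
    """Return the other chord names, most similar to `query` first."""
    others = [name for name in vectors if name != query]
    return sorted(
        others,
        key=lambda name: cosine_similarity(vectors[query], vectors[name]),
        reverse=True,
    )


# Made-up 2D vectors, NOT values produced by the model in this article.
toy_vectors = {"I": [1.0, 0.1], "IV": [0.9, 0.2], "#V": [-0.2, 1.0]}
ranking = rank_by_similarity(toy_vectors, "I")
print(ranking)
```

With the trained model, the dictionary could be built as {chord: model.wv[chord] for chord in chord_unique} instead of the toy values.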

Conclusion and future prospects

In conclusion, I found that feeding chords into word2vec yields data that is meaningful to some extent.

As a future prospect, one could register well-known chord progressions in advance and split each song into progressions of about four chords. It may be interesting to build a sparse matrix with the chord progressions as features, extract topics with LDA, and measure the similarity of songs within an artist. It would be even better if this could be extended to recommendations across artists. Also, for people who are only interested in music, I would like to build a tool that applies the processing from Chapter 2 to make it easy to see where a song modulates.

If you also write articles that combine music and programming, please be my friend...
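As a first step toward that idea, a song can be cut into overlapping four-chord progressions and counted. This is only a sketch of one possible approach: `progression_counts` is a hypothetical helper, and using a sliding window (rather than fixed bars or a dictionary of known progressions) is my own assumption.

```python
from collections import Counter


def progression_counts(chords, n=4):
    """Count every run of n consecutive chords in a song."""
    return Counter(tuple(chords[i:i + n]) for i in range(len(chords) - n + 1))


song = ["IV", "V", "IIIm", "VIm", "IV", "V", "IIIm", "VIm"]
counts = progression_counts(song, n=4)
for progression, count in counts.most_common(3):
    print("-".join(progression), count)
```

Stacking such counts for every song, one row per song and one column per distinct progression, would give the sparse matrix that a topic model such as LDA takes as input.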

Thank you for reading.
