A story about everything from data collection to AI development and Web application release in Python (3. AI development)


This article is part of a series. The other articles are listed below.

  0. Design (What is an AI that determines the key?)
  1. Data collection (crawling)
  2. Data shaping (scraping)
  4. Web application development using Django

It has been three to four months since I developed the application. At the time, my goal was simply to deploy it, so I did not spend much time on the most important part, choosing the model.

So this time I would like to try various approaches and compare their accuracy. I will also measure how accurate a rule-based approach is without any machine learning.

Task confirmation

The task this time is to determine the key of a song from its chord progression. If you are wondering what a chord progression or a key is, please see the article below.

A story about everything from data collection to AI development and Web application release in Python (0. Design)

Data confirmation

I scraped the number of appearances of each chord and the key of each song from a chord chart site called J-Total Music. Of the songs posted there, the key could be obtained for 20848 songs.

However, some songs have very few chords. Example: https://music.j-total.net/data/012si/071_shinohara_tomoe/001.html Since such songs are not suitable as training data, songs whose total number of chord appearances is 20 or fewer are dropped. In addition, as outlier removal, songs whose total number of chord appearances is 250 or more are also dropped. Everything below is based on the remaining **20481 songs**.
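
As a rough illustration of this filtering step, here is a minimal sketch with pandas. The toy table and its column names (`C`, `G`, `Am`) are assumptions made for the example; the real data has one column per chord.

```python
import pandas as pd

# Toy data: one row per song, one column per chord (hypothetical column names).
songs = pd.DataFrame(
    {
        "C": [10, 3, 120, 15],
        "G": [8, 2, 100, 5],
        "Am": [5, 1, 40, 0],
    },
    index=["song_a", "song_b", "song_c", "song_d"],
)
chord_cols = ["C", "G", "Am"]

# Total number of chord appearances in each song.
total = songs[chord_cols].sum(axis=1)

# Drop songs with 20 or fewer appearances and songs with 250 or more.
filtered = songs[(total > 20) & (total < 250)]
print(filtered.index.tolist())
```

Only the song with 23 chord appearances survives here: the ones with 6 and 20 are too short, and the one with 260 is treated as an outlier.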

Image of data

It looks like this. key is the target variable, and the columns to the right of key are the explanatory variables. To feed the data into a machine learning model, the key is label-encoded into integers from 0 to 23. The other columns are used as they are.
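
The label encoding mentioned here could look like the following sketch, which uses scikit-learn's LabelEncoder on a handful of made-up key names. With all 24 keys present, the codes would run from 0 to 23.

```python
from sklearn.preprocessing import LabelEncoder

# A few example key labels (the real data has 24 distinct keys).
keys = ["C_Major", "A_minor", "G_Major", "A_minor", "C_Major"]

le = LabelEncoder()
# Classes are sorted, so A_minor -> 0, C_Major -> 1, G_Major -> 2.
encoded = le.fit_transform(keys)

# inverse_transform maps the integers back to the key names.
decoded = le.inverse_transform(encoded)
print(list(encoded), list(decoded))
```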

Data (class) bias

There are 24 keys in all, but of course the songs are not evenly distributed among them. There is a nearly tenfold difference between the most common key and the least common key. It is not as extreme as what is usually called imbalanced data, but it is a little worrying. Incidentally, the ratio of major-key to minor-key songs is **14955:5526**, so 73% of the songs are in a major key. If you look at the graph above, you can see that the minor keys are clustered at the bottom.

What I want to try

Try different approaches to see how accurate they are.

  1. Classify by rule
  2. Classify by machine learning
     2-1. 24-class classification
     2-2. Use domain knowledge to change the problem design

Indicators and test data for comparison

The evaluation indicators for multi-class classification are described in detail on the following pages. https://analysis-navi.com/?p=553

The following three metrics are calculated, along with the correct answer rate (recall) for each class.

  • Correct answer rate (number of correct answers / number of data)
  • Macro recall rate (average of correct answer rate for each class)
  • Macro F1 score (average of F1 scores for each class)
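
As a sketch of how these three metrics can be computed, the snippet below uses scikit-learn's metric functions on a tiny hand-made example. The labels and predictions are invented for illustration and are not taken from the song data.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Invented true and predicted labels for six songs.
y_true = ["C", "C", "C", "Am", "Am", "G"]
y_pred = ["C", "C", "Am", "Am", "C", "G"]

# Correct answer rate: number of correct answers / number of data.
accuracy = accuracy_score(y_true, y_pred)

# Macro recall: unweighted mean of the per-class recall.
macro_recall = recall_score(y_true, y_pred, average="macro")

# Macro F1: unweighted mean of the per-class F1 score.
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Per-class recall, in the order given by labels.
per_class_recall = recall_score(
    y_true, y_pred, average=None, labels=["Am", "C", "G"]
)
print(accuracy, macro_recall, macro_f1, per_class_recall)
```

Because the macro averages give every class the same weight, a rarely occurring key affects them as much as a common one, which is why they are reported alongside the plain correct answer rate.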

For test data, 25% of the whole dataset (about 5000 songs) is held out with sklearn's train_test_split. To account for the variation in accuracy caused by the split, each metric is averaged over 5 runs. As a second test set, I also check the correct answer rate on 90 flumpool songs posted on U-fret. This is because I like flumpool and know the keys of all their songs. Hereinafter, this data is referred to as fp data.


Since it will be long, I will summarize the results first.

| Metric | Rule base 1 | Rule base 2 | Logistic regression | Support vector machine | LightGBM | LGBM × LGBM |
| --- | --- | --- | --- | --- | --- | --- |
| Correct answer rate | 0.528 | 0.613 | 0.843 | 0.854 | 0.857 | 0.839 |
| Macro recall | 0.566 | 0.626 | 0.836 | 0.832 | 0.836 | 0.826 |
| Macro F1 score | 0.567 | 0.581 | 0.82 | 0.827 | 0.833 | 0.812 |
| fp data correct answer rate | 0.178 | 0.611 | 0.889 | 0.889 | 0.911 | 0.867 |

LightGBM is the most accurate!

1. Rule-based classification

When I built the application, I used machine learning because my motivation was "I want to do something with machine learning for now!", but the problem might be solvable with rules in the first place. I will try the following two approaches.

  1-1. Use the most frequently used chord in the song as the key
  1-2. Calculate the total number of diatonic chords for each key and output the key with the largest total

1-1. Use the most frequent chord as the key

Let's say the most frequently used chord in the song is the key. For example, if "Dm" is used most often in a song, the key of that song is judged to be Dm. It's simple. If several chords tie for most frequent, one of them is chosen at random.

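import random

def mode_pred(data):
    # Find the most frequent chord(s) and keep their names.
    # num_cols is the list of chord-count columns.
    tmp_max = -1
    ans = []
    for c in num_cols:
        if tmp_max < data[c]:
            # New maximum: start a fresh list of candidates
            ans = [c]
            tmp_max = data[c]
        elif tmp_max == data[c]:
            # Tie with the current maximum: add to the candidates
            ans.append(c)
    # If several chords tie, choose one at random
    if len(ans) == 1:
        return ans[0]
    else:
        return random.choice(ans)

df['mode_pred'] = df.apply(mode_pred, axis=1)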


  • Average of 5 runs: Correct answer rate: 0.528, Macro recall: 0.556, Macro F1 score: 0.567

  • fp data Correct answer rate: 0.178

Correct answer rate (recall) for each class

| Key | Correct answer rate (recall) |
| --- | --- |
| C_minor | 0.763 |
| F_minor | 0.747 |
| G_minor | 0.699 |
| D_minor | 0.684 |
| B_minor | 0.681 |
| A_minor | 0.676 |
| D#/E♭_minor | 0.668 |
| C#/D♭_minor | 0.663 |
| E_minor | 0.663 |
| A#/B♭_minor | 0.654 |
| F#/G♭_minor | 0.641 |
| G#/A♭_minor | 0.611 |
| E_Major | 0.522 |
| G_Major | 0.504 |
| A_Major | 0.496 |
| A#/B♭_Major | 0.494 |
| D_Major | 0.485 |
| C_Major | 0.483 |
| F_Major | 0.433 |
| F#/G♭_Major | 0.425 |
| B_Major | 0.412 |
| C#/D♭_Major | 0.408 |
| D#/E♭_Major | 0.402 |
| G#/A♭_Major | 0.379 |

Looking at each class, the correct answer rate for minor keys is higher than for major keys. This agrees with my own domain knowledge, so I find it convincing. On the fp data, this method is practically useless. This is probably because most of those songs are in a major key.

1-2. Calculate the total number of diatonic chords for each key

As I wrote in the previous article, when I determine the key from a chord chart myself, I look at the chords used and decide that "it must be this key, because the diatonic chords of this key are used a lot." Let's implement this method. Specifically, the flow is as follows.

  1. Obtain the total number of occurrences of the diatonic chord for each key from the song data.
  2. Use the key with the largest total obtained in 1. as the song key (if there are multiple keys, randomly decide)

Since the for loop runs 24 times (the number of keys) for every row, the process takes a considerable amount of time. Calculating the average over 5 runs took about 20 minutes in total.

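import random
from tqdm import tqdm

tqdm.pandas()  # enables df.progress_apply

def diatonic_pred(data):
    tmp_max = -1
    ans = []
    # Sum the occurrences of the diatonic chords of each key.
    # diatonic_dict maps each key name to its diatonic chord columns.
    for key, cols in diatonic_dict.items():
        sum_value = data[cols].sum()
        if tmp_max < sum_value:
            ans = [key]
            tmp_max = sum_value
        elif tmp_max == sum_value:
            ans.append(key)
    # If several keys tie, choose one at random
    if len(ans) == 1:
        return ans[0]
    else:
        return random.choice(ans)

df['diatonic_pred'] = df.progress_apply(diatonic_pred, axis=1)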
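
If the row-by-row loop is too slow, one possible alternative is to compute the per-key totals column-wise. The sketch below is only an illustration under assumptions: the toy table and the abbreviated two-key dictionary are made up, and ties are resolved by taking the first key with `idxmax` rather than at random, so it is not identical to the function above.

```python
import pandas as pd

# Toy chord counts (hypothetical columns; the real data has many more chords).
df_toy = pd.DataFrame(
    {"C": [10, 0], "F": [6, 1], "G": [8, 9], "Am": [4, 0], "D": [0, 7], "Em": [1, 5]},
    index=["song_a", "song_b"],
)

# Abbreviated stand-in for diatonic_dict: key name -> diatonic chord columns
# (limited to the chords that exist in the toy table).
diatonic_dict_toy = {
    "C_Major": ["C", "F", "G", "Am", "Em"],
    "G_Major": ["G", "Am", "C", "D", "Em"],
}

# One column per key, holding the summed counts of that key's diatonic chords.
scores = pd.DataFrame(
    {key: df_toy[cols].sum(axis=1) for key, cols in diatonic_dict_toy.items()}
)

# The predicted key is the column with the largest total in each row.
pred = scores.idxmax(axis=1)
print(scores)
print(pred.tolist())
```

This replaces the Python-level loop over rows with 24 column sums, one per key, which pandas can evaluate for all songs at once.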


  • Average of 5 runs: Correct answer rate: 0.613, Macro recall: 0.626, Macro F1 score: 0.581

  • fp data Correct answer rate: 0.611

Correct answer rate (recall) for each class

| Key | Correct answer rate (recall) |
| --- | --- |
| F_minor | 0.711 |
| G_minor | 0.702 |
| C_minor | 0.688 |
| A#/B♭_minor | 0.688 |
| A_minor | 0.67 |
| D_minor | 0.667 |
| G_Major | 0.651 |
| F#/G♭_minor | 0.649 |
| B_minor | 0.649 |
| E_minor | 0.633 |
| C#/D♭_minor | 0.632 |
| G#/A♭_minor | 0.615 |
| F_Major | 0.614 |
| G#/A♭_Major | 0.614 |
| A#/B♭_Major | 0.61 |
| B_Major | 0.61 |
| D#/E♭_Major | 0.607 |
| F#/G♭_Major | 0.604 |
| E_Major | 0.596 |
| D_Major | 0.586 |
| D#/E♭_minor | 0.579 |
| A_Major | 0.572 |
| C_Major | 0.566 |
| C#/D♭_Major | 0.504 |

The average accuracy is higher than with the most frequent chord. The maximum correct answer rate is around 70% for both methods, but the minimum differs considerably: 38% for the most frequent chord versus 50% for this method. As before, songs in minor keys are classified more accurately.

Let's take a look at the confusion matrix here. Interestingly, for each key there is one particular other key that about 20% of its songs are misclassified as. This is the relative key (平行調 in Japanese, sometimes translated literally as "parallel key"), which uses almost the same chords. Since this method relies only on how often each chord is used, I think confusing a key with its relative key is a natural result.
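
For reference, a row-normalized confusion matrix makes this kind of "20% goes to one specific key" pattern easy to read. The following is a minimal sketch with scikit-learn on invented labels for a single pair of keys that share almost the same chords (C major and A minor). It is not the matrix from the actual data.

```python
from sklearn.metrics import confusion_matrix

labels = ["C_Major", "A_minor"]

# Invented example: 5 songs in each key.
y_true = ["C_Major"] * 5 + ["A_minor"] * 5
y_pred = (
    ["C_Major"] * 4 + ["A_minor"] * 1      # one C major song mistaken for A minor
    + ["A_minor"] * 3 + ["C_Major"] * 2    # two A minor songs mistaken for C major
)

# normalize="true" divides each row by the number of true samples in that class,
# so the diagonal holds the per-class recall.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
print(cm)
```

Each row then reads as "of the songs truly in this key, what fraction was assigned to each key", so an off-diagonal value of 0.2 corresponds to the 20% misclassification described above.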

The correct answer rate on the fp data has risen significantly from the previous method, but it is still only about 60%. When I looked at the misclassified songs, most had been assigned to the relative key. It turns out that this method cannot separate relative keys.

2. Classification by machine learning

Now that the rule-based accuracy is known, it is time to try machine learning. I will try several methods here as well.

  2-1. Treat the task as a 24-class classification
  2-2. Use domain knowledge to change the problem design

2-1. 24-class classification

I simply treat the task as a 24-class classification problem. The methods are **logistic regression** as a representative classical linear classifier, the **support vector machine** as a representative nonlinear classifier, and **LightGBM**, which has a reputation for high accuracy, fast training, and little need for preprocessing. Since I want to compare the methods, I will refrain from extensive parameter tuning.

Split the data with train_test_split and feed it into the model as it is. I do not tune the hyperparameters, but I do set class_weight = balanced. With this setting, each class is weighted by total number of samples / (number of classes * number of samples in the corresponding class), so rarer classes receive larger weights.
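
To make the weighting concrete: scikit-learn documents the "balanced" weights as n_samples / (n_classes * np.bincount(y)). The sketch below checks this on a small made-up label array with `compute_class_weight`. The label counts are arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 10 samples in 3 classes with 6, 3 and 1 samples.
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y)

# Weights that class_weight="balanced" would assign to each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same values computed by hand from the documented formula.
manual = len(y) / (len(classes) * np.bincount(y))
print(weights, manual)
```

The class with a single sample gets a weight of 10 / 3, while the class with six samples gets 10 / 18, so mistakes on rare keys cost more during training.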

Logistic regression

I did try adjusting the regularization parameter, but it made no big difference, so I train with the defaults.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

for seed in [1, 4, 9, 16, 25]:
    X_train, X_test, y_train, y_test = train_test_split(df[num_cols], df['target_key'], random_state=seed)
    lr = LogisticRegression(class_weight='balanced')
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)

Support vector machine

As a first test, I trained with the rbf kernel and one-vs-rest classification. However, training took almost an hour and the average accuracy was only about 30%, which was disappointing. Since support vector machines work better with standardized variables, I decided to measure the accuracy with standardized variables + rbf kernel + one-vs-rest classification. Standardizing made the execution time an order of magnitude shorter, which impressed me. Even so, a single training run takes about 2-3 minutes, so it is slower than the other two methods.

from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = df[num_cols]
y = df['target_key']

# One-vs-rest classification with an RBF-kernel SVM
svc = SVC(kernel='rbf', class_weight='balanced', verbose=True)
ovr = OneVsRestClassifier(svc)
sc = StandardScaler()

for seed in [1, 4, 9, 16, 25]:
    # Note: the scaler is fitted on all the data before the split
    X_train, X_test, y_train, y_test = train_test_split(sc.fit_transform(X), y, random_state=seed)
    ovr.fit(X_train, y_train)
    y_pred = ovr.predict(X_test)
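
The code above standardizes the whole dataset before splitting it. A common alternative is to fit the scaler on the training split only, which a scikit-learn Pipeline does automatically. The sketch below shows that variant on synthetic data from `make_classification`, used here purely as a stand-in for the chord counts. It is not the setup that produced the results in this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 300 samples, 10 features, 3 classes.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=1)

# The pipeline fits StandardScaler on X_train only and reuses its
# mean and variance when transforming X_test.
pipe = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(SVC(kernel="rbf", class_weight="balanced")),
)
pipe.fit(X_train, y_train)

y_pred_toy = pipe.predict(X_test)
score = pipe.score(X_test, y_test)  # accuracy on the held-out split
print(score)
```

With this arrangement the test split never influences the scaling statistics, and the same pipeline object can be passed directly to tools such as cross-validation or grid search.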

LightGBM

Executed with default parameters.

import lightgbm as lgbm

for seed in [1, 4, 9, 16, 25]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    clf = lgbm.LGBMClassifier(class_weight='balanced')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)


| Metric | Logistic regression | Support vector machine | LightGBM |
| --- | --- | --- | --- |
| Correct answer rate | 0.843 | 0.854 | 0.857 |
| Macro recall | 0.836 | 0.832 | 0.836 |
| Macro F1 score | 0.820 | 0.827 | 0.833 |

The result is that LightGBM is the most accurate, although the differences are small. In terms of execution speed, however, the order was LightGBM > logistic regression >>> support vector machine. Five training runs took less than a minute with LightGBM, but about 10 minutes with the support vector machine. Because a single run takes so long, I feel that the support vector machine's parameters cannot be tuned casually.

Next, let's look at the percentage of correct answers for each class.

Correct answer rate (recall) for each class

| Key | Logistic regression | Support vector machine | LightGBM |
| --- | --- | --- | --- |
| C_Major | 0.838 | 0.866 | 0.856 |
| C_minor | 0.883 | 0.885 | 0.837 |
| C#/D♭_Major | 0.825 | 0.859 | 0.878 |
| C#/D♭_minor | 0.809 | 0.748 | 0.755 |
| D_Major | 0.84 | 0.875 | 0.871 |
| D_minor | 0.851 | 0.814 | 0.827 |
| D#/E♭_Major | 0.841 | 0.842 | 0.869 |
| D#/E♭_minor | 0.808 | 0.782 | 0.761 |
| E_Major | 0.871 | 0.897 | 0.9 |
| E_minor | 0.844 | 0.84 | 0.842 |
| F_Major | 0.851 | 0.857 | 0.87 |
| F_minor | 0.881 | 0.827 | 0.836 |
| F#/G♭_Major | 0.805 | 0.828 | 0.847 |
| F#/G♭_minor | 0.793 | 0.751 | 0.791 |
| G_Major | 0.857 | 0.872 | 0.872 |
| G_minor | 0.861 | 0.849 | 0.832 |
| G#/A♭_Major | 0.86 | 0.865 | 0.866 |
| G#/A♭_minor | 0.773 | 0.704 | 0.725 |
| A_Major | 0.849 | 0.874 | 0.887 |
| A_minor | 0.826 | 0.83 | 0.833 |
| A#/B♭_Major | 0.822 | 0.853 | 0.867 |
| A#/B♭_minor | 0.823 | 0.796 | 0.777 |
| B_Major | 0.815 | 0.847 | 0.855 |
| B_minor | 0.847 | 0.815 | 0.804 |

It can be seen that logistic regression has a higher percentage of correct answers for minor keys than the other two.

Next, let's plot the correct answer rate for each key with a box plot for each method.

The minimum correct answer rate is 70% or higher for every method. However, the spread (range) of per-class accuracy is smaller for logistic regression than for the other methods. LightGBM ranked first on every metric, but its per-class accuracy varies quite a lot.

By the way, on the fp data every method got about 80 of the 90 songs right. Most of the mistakes were relative keys (the rest were songs that change key partway through, and so on).

2-2. Change problem design by utilizing domain knowledge

Here I change the problem design by using domain knowledge about the goal of determining the key.

There are 24 keys to distinguish, but not all 24 have completely different characteristics. Each key has exactly one key with similar characteristics (specifically, the same key signature and nearly the same notes and chords). This is called the relative key (平行調 in Japanese, sometimes translated literally as "parallel key"). For example, the relative key of C (C major) is Am (A minor). The pairing is always between one major key and one minor key. Let's actually look at the chord charts.

The chords used are pretty similar, aren't they? The 24 keys form 12 such pairs.
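
The 12 major/minor pairs that share a key signature can be generated mechanically, because the tonic of the minor key lies three semitones below the tonic of the major key. The sketch below builds the pairs using the key naming seen in the result tables. The `NOTES` list and the `relative_minor` helper are hypothetical names introduced only for this example.

```python
# The 12 pitch classes, written the same way as in the result tables.
NOTES = ["C", "C#/D♭", "D", "D#/E♭", "E", "F",
         "F#/G♭", "G", "G#/A♭", "A", "A#/B♭", "B"]

def relative_minor(major_tonic):
    """Return the tonic of the minor key sharing the major key's signature."""
    # Three semitones down is the same as nine semitones up, modulo 12.
    return NOTES[(NOTES.index(major_tonic) + 9) % 12]

# Map every major key to the minor key that uses almost the same chords.
pairs = {f"{n}_Major": f"{relative_minor(n)}_minor" for n in NOTES}

for major, minor in pairs.items():
    print(major, "<->", minor)
```

Each of the 24 keys appears in exactly one pair, which is what allows the 24-class problem to be regrouped into 12 classes.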

Based on the above, the judgment is made as follows.

  1. Classify into 12 classes, where each class is a pair of relative keys
  2. Classify into two classes, major or minor, and determine the key by combining this with the result of 1.

That is hard to follow, so here are two examples.

  • The result of step 1 is "C (C major) or Am (A minor)" and the result of step 2 is "major" → **the key is C (C major)**
  • The result of step 1 is "F (F major) or Dm (D minor)" and the result of step 2 is "minor" → **the key is Dm (D minor)**

The image is that the discrimination is divided into two parts like this. Both models use LightGBM.

model_1 = lgbm.LGBMClassifier(class_weight='balanced')
model_2 = lgbm.LGBMClassifier(class_weight='balanced')

key_answer = df['key']
diatonic_answer = df['diatonic_type']
type_answer = df['key_type']
X = df[num_cols]
y1 = df['diatonic_type_int']

for seed in [1, 4, 9, 16, 25]:
    # Stage 1: 12-class classification (pairs of relative keys)
    X_train, X_test, y1_train, y1_test = train_test_split(X, y1, random_state=seed)
    model_1.fit(X_train, y1_train)
    y1_pred = model_1.predict(X_test)
    # Convert the integer labels back to strings of two key names joined by '@'
    # (le_d is the fitted label encoder for diatonic_type, defined elsewhere)
    y1_pred_str = le_d.inverse_transform(y1_pred)
    # Stage 2: binary classification (major or minor) on the same split
    train_index = y1_train.index
    test_index = y1_test.index
    y2_train = type_answer[train_index]
    y2_test = type_answer[test_index]
    model_2.fit(X_train, y2_train)
    y2_pred = model_2.predict(X_test)
    # Combine the results of the 12-class and binary classifications
    y_pred = []
    for y1_, y2_ in zip(y1_pred_str, y2_pred):
        # y2_ == 1 selects the key before '@', otherwise the key after it
        if y2_ == 1:
            ans = y1_.split('@')[0]
        else:
            ans = y1_.split('@')[1]
        y_pred.append(ans)
    y_test = key_answer[test_index]


  • Average of 5 runs: Correct answer rate: 0.839, Macro recall: 0.826, Macro F1 score: 0.812

  • fp data Correct answer rate: 0.867

Correct answer rate (recall) for each class

| Key | Correct answer rate (recall) |
| --- | --- |
| C_Major | 0.848 |
| C_minor | 0.843 |
| C#/D♭_Major | 0.858 |
| C#/D♭_minor | 0.853 |
| D_Major | 0.83 |
| D_minor | 0.825 |
| D#/E♭_Major | 0.84 |
| D#/E♭_minor | 0.836 |
| E_Major | 0.82 |
| E_minor | 0.815 |
| F_Major | 0.797 |
| F_minor | 0.787 |
| F#/G♭_Major | 0.811 |
| F#/G♭_minor | 0.803 |
| G_Major | 0.746 |
| G_minor | 0.686 |
| G#/A♭_Major | 0.775 |
| G#/A♭_minor | 0.764 |
| A_Major | 0.884 |
| A_minor | 0.875 |
| A#/B♭_Major | 0.909 |
| A#/B♭_minor | 0.89 |
| B_Major | 0.869 |
| B_minor | 0.864 |

The 5-run average is slightly lower than that of LightGBM with 24-class classification. The first stage (the 12-class classification over pairs of relative keys) was accurate at about 93%, but the overall correct answer rate dropped at the second stage, the major-or-minor decision. On the fp data, confusing relative keys still accounted for most of the errors. However, a few songs that the 24-class classifier had gotten right were misclassified here.


I tried various things, but in the end, simply predicting the key as a 24-class classification with LightGBM worked best. LightGBM really is impressive, and not only for its accuracy: **training is very fast**. That makes it possible to run many more trials than with other models, for example for parameter tuning, which I think is a big advantage.

That does not mean the other methods are bad. For example, Rule Base 1 had a high correct answer rate for minor keys, which shows that the most frequent chord is useful for identifying minor keys. Rule Base 2 confirmed that my hypothesis was reasonable. Logistic regression showed little variation in the per-class correct answer rate. The support vector machine still has room for parameter tuning, so its accuracy may rise with tuning. The two-stage prediction with the changed problem design may also give good results with a different model or different parameters.

So for now I will use LightGBM, which easily achieves high accuracy, and take the time to tune its parameters. When I do, I will write another article. My writing is unpolished, but thank you for reading.
