This article is part of a series. Other articles in the series: 0. Design (how should the AI determine the key?) / 1. Data collection (crawling) / 2. Data shaping (scraping) / 4. Web application development using Django
It has been 3-4 months since I developed the app. At the time, my goal was simply to deploy it, and I did not spend much time choosing the core model.
So this time I would like to try various approaches and compare their accuracy, including seeing how far a rule-based method gets without any machine learning.
The task is to determine a song's key from its chord progression. Chord progression? Key? Please see the article below for the details.
I scraped chord appearance counts and keys from the chord-sheet site J-Total Music. Of the posted songs, I have data for the 20,848 songs whose key could be obtained.
However, some songs contain very few chords. Example: https://music.j-total.net/data/012si/071_shinohara_tomoe/001.html Since such songs are not suitable as training data, songs whose chords appear 20 times or fewer in total are dropped. As outlier removal, songs whose chords appear 250 times or more in total are also dropped. Everything below uses the remaining **20,481 songs**.
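As a rough sketch of this filtering step, assuming the scraped counts sit in a pandas DataFrame with one column per chord (the column names and values below are a made-up toy example, not the real dataset):

```python
import pandas as pd

# Hypothetical mini-dataset: one column per chord, values are appearance counts
df = pd.DataFrame({
    'key': ['C', 'Am', 'G'],
    'C':   [10, 5, 120],
    'Am':  [5,  4, 130],
    'G':   [30, 3,  40],
})
num_cols = ['C', 'Am', 'G']

# Drop songs whose chords appear 20 times or fewer in total (too little data)
# and songs with 250 or more total occurrences (outliers)
total = df[num_cols].sum(axis=1)
df = df[(total > 20) & (total < 250)].reset_index(drop=True)
```

Only the first song (45 total chord occurrences) survives this toy filter; the second (12) and third (290) are dropped.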
The data looks like this: `key` is the target variable, and the columns to its right are the features. To feed the data into a machine-learning model, `key` is label-encoded into integers 0 to 23. The other columns are used as-is.
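The label-encoding step might look like this with scikit-learn (the key names below are just a toy example; with all 24 keys the encoded values run from 0 to 23):

```python
from sklearn.preprocessing import LabelEncoder

keys = ['C', 'Am', 'G', 'Em', 'C']

# Map each key name to an integer 0..n_classes-1 (classes sorted alphabetically)
le = LabelEncoder()
encoded = le.fit_transform(keys)

# inverse_transform recovers the original key names
decoded = le.inverse_transform(encoded)
```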
There are 24 keys in all, and of course the data is not evenly distributed among them: there is a nearly tenfold difference between the most and least common keys. It is not quite so-called imbalanced data, but it is a little worrying. Incidentally, the ratio of major to minor keys is **14955 : 5526**, so 73% of the songs are in a major key. Looking at the graph above, you can see the minor keys clustered at the bottom.
Try different approaches to see how accurate they are.
Evaluation metrics for multi-class classification are described in detail on the following page: https://analysis-navi.com/?p=553
Three values are calculated: accuracy, macro recall, and macro F1 score. The per-class accuracy (recall) is also computed.
For test data, sklearn's `train_test_split` holds out 25% of the data (about 5,000 songs). To account for variation across splits, the average over 5 runs is reported.
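For reference, the three metrics can be computed with scikit-learn; the labels below are a small toy example, not results from this dataset:

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)                        # fraction correct
macro_rec = recall_score(y_true, y_pred, average='macro')   # mean per-class recall
macro_f1 = f1_score(y_true, y_pred, average='macro')        # mean per-class F1
```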
As additional test data, I also check the accuracy on 90 flumpool songs posted on U-fret. This is because I like flumpool and know the key of every one of their songs. Below, this data is referred to as the fp data.
Since it will be long, I will summarize the results first.
|index|Rule base 1|Rule base 2|Logistic regression|Support vector machine|LightGBM|LGBM × LGBM|
|---|---|---|---|---|---|---|
|Accuracy|0.528|0.613|0.843|0.854|0.857|0.839|
|Macro F1 score|0.567|0.581|0.82|0.827|0.833|0.812|
|fp data accuracy|0.178|0.611|0.889|0.889|0.911|0.867|
LightGBM is the most accurate!
At development time I used machine learning simply because my motivation was "I want to build something with machine learning!", but the problem may well be solvable with rules in the first place. I try the following two approaches. 1-1. Use the most frequent chord in the song as the key. 1-2. Sum the occurrences of each key's diatonic chords and output the key with the largest total.
Let's simply declare the most-used chord in the song to be its key. For example, if "Dm" appears most often in a song, that song's key is judged to be Dm. If several chords tie for the most appearances, one of them is chosen at random.
```python
import random

def mode_pred(data):
    # Find the most frequent chord(s) and remember their names
    tmp_max = -1
    for c in num_cols:
        if tmp_max < data[c]:
            ans = [c]
            tmp_max = data[c]
        elif tmp_max == data[c]:
            ans.append(c)
    # If several chords tie for first place, pick one at random
    if len(ans) == 1:
        return ans[0]
    return random.choice(ans)

df['mode_pred'] = df.apply(mode_pred, axis=1)
```
Average over 5 runs: Accuracy: 0.528 / Macro recall: 0.556 / Macro F1 score: 0.567
fp data accuracy: 0.178
|Key|Accuracy (recall)|
|---|---|
Looking at each class, minor keys are classified correctly more often than major keys, which matches my own domain knowledge. The fp data result, on the other hand, is hopeless, probably because most of those songs are in major keys.
As I wrote in the previous article, when I determine a key from chord notation myself, I look at the chords used and reason that "it must be that key, because that key's diatonic chords appear so often." Let's implement this method. Specifically, the flow is as follows.
Since the inner loop runs 24 times (once per key) for every row, the process takes considerable time: about 20 minutes in total to average over 5 runs.
```python
from tqdm.notebook import tqdm_notebook

def diatonic_pred(data):
    tmp_max = -1
    # Sum the occurrences of each key's diatonic chords
    for key, cols in diatonic_dict.items():
        sum_value = data[cols].sum()
        if tmp_max < sum_value:
            ans = [key]
            tmp_max = sum_value
        elif tmp_max == sum_value:
            ans.append(key)
    # Break ties at random
    if len(ans) == 1:
        return ans[0]
    return random.choice(ans)

tqdm_notebook.pandas()
df['diatonic_pred'] = df.progress_apply(diatonic_pred, axis=1)
```
Average over 5 runs: Accuracy: 0.613 / Macro recall: 0.626 / Macro F1 score: 0.581
fp data accuracy: 0.611
|Key|Accuracy (recall)|
|---|---|
The average accuracy is higher than with the most-frequent-chord method. The best per-key accuracy is about 70% for both methods, but the worst differs considerably: the most-frequent-chord method bottomed out at 38%, whereas this method stays at 50% or above. As before, minor-key songs are classified more accurately.
Let's look at the confusion matrix here. Interestingly, for each key there is one other key that absorbs about 20% of the misclassifications: its relative key, which uses almost the same chords. Since this method relies on chord usage counts, confusing a key with its relative key is a natural outcome.
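A confusion matrix like the one described can be produced with scikit-learn. The labels below are hypothetical, chosen only to illustrate how a relative-key pair (C and Am) ends up confused:

```python
from sklearn.metrics import confusion_matrix

# True and predicted keys for five hypothetical songs
y_true = ['C', 'C', 'Am', 'Am', 'G']
y_pred = ['C', 'Am', 'Am', 'C', 'G']

labels = ['C', 'Am', 'G']
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Row = true key, column = predicted key; the off-diagonal C/Am cells
# are the relative-key confusions
```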
The fp data accuracy improved greatly over the previous method, but it is still only about 60%. Looking at the misclassified songs, most were mistaken for their relative key. Clearly this method cannot separate relative keys.
Now that we know the rule-based accuracy, it is time to try machine learning. Here, too, I try several approaches. 2-1. Treat the problem as a plain 24-class classification. 2-2. Use domain knowledge to change the problem design.
First, treat the task straightforwardly as a 24-class classification problem. As methods I use **logistic regression** as a representative classical linear classifier, a **support vector machine** as a representative nonlinear classifier, and the much-discussed **LightGBM**, known for high accuracy, fast training, and minimal preprocessing. Since I want to compare the methods themselves, I refrain from flashy parameter tuning.
Split the data with `train_test_split` and feed it into the model as-is.
Hyperparameters are not tuned, except for setting `class_weight='balanced'`. This weights each class by `n_samples / (n_classes * samples in that class)`, so rarer keys count more heavily.
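The weights that `class_weight='balanced'` produces can be checked directly with scikit-learn's helper (the labels below are a toy example):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: three samples of class 0, one of class 1
y = np.array([0, 0, 0, 1])
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
# n_samples / (n_classes * class_count):
# class 0 -> 4 / (2 * 3), class 1 -> 4 / (2 * 1)
```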
I did briefly play with the regularization parameter, but since it made no real difference, I train with the defaults.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

for seed in [1, 4, 9, 16, 25]:
    X_train, X_test, y_train, y_test = train_test_split(
        df[num_cols], df['target_key'], random_state=seed)
    lr = LogisticRegression(class_weight='balanced')
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
```
As a first try, I trained with the RBF kernel and one-vs-rest classification, without standardizing the features. However, training took almost an hour and the average accuracy was only about 30%, a disappointing result.
Since support vector machines work better with standardized inputs, I measured accuracy with standardization + RBF kernel + one-vs-rest classification. Standardization made execution an order of magnitude faster, which impressed me. Even so, at about 2-3 minutes per fit, it is still slower than the other two methods.
```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# One-vs-rest classification with an RBF-kernel SVM
svc = SVC(kernel='rbf', class_weight='balanced', verbose=True)
ovr = OneVsRestClassifier(svc)

# Standardize the features
sc = StandardScaler()
for seed in [1, 4, 9, 16, 25]:
    X_train, X_test, y_train, y_test = train_test_split(
        sc.fit_transform(X), y, random_state=seed)
    ovr.fit(X_train, y_train)
    y_pred = ovr.predict(X_test)
```
LightGBM is run with its default parameters.
```python
import lightgbm as lgbm

for seed in [1, 4, 9, 16, 25]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    clf = lgbm.LGBMClassifier(class_weight='balanced')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
```
|index|Logistic regression|Support vector machine|LightGBM|
|---|---|---|---|
|Accuracy|0.843|0.854|0.857|
|Macro F1 score|0.820|0.827|0.833|
LightGBM is the most accurate, although the differences are small. In terms of execution time, however, the ranking was LightGBM ＞ logistic regression ＞＞＞ support vector machine: LightGBM finished in under a minute, while the SVM took about 10 minutes for the five runs. With a single fit that slow, parameter tuning is not something to attempt casually.
Next, let's look at the percentage of correct answers for each class.
|Key|Logistic regression|Support vector machine|LightGBM|
|---|---|---|---|
It can be seen that logistic regression has a higher percentage of correct answers for minor keys than the other two.
Next, let's plot the correct answer rate for each key with a box plot for each method.
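Such a box plot can be sketched with matplotlib. The per-key accuracies below are made up for illustration; the real data has 24 values (one per key) for each method:

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# Hypothetical per-key accuracies for each method
per_key_acc = {
    'Logistic regression': [0.80, 0.82, 0.85, 0.83],
    'SVM': [0.78, 0.88, 0.84, 0.86],
    'LightGBM': [0.75, 0.90, 0.86, 0.88],
}

fig, ax = plt.subplots()
bp = ax.boxplot(list(per_key_acc.values()))
ax.set_xticklabels(per_key_acc.keys())
ax.set_ylabel('Per-key accuracy')
fig.savefig('per_key_accuracy.png')
```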
The minimum per-key accuracy is 70% or higher for every method. However, logistic regression shows a noticeably smaller spread (range) than the others. LightGBM tops every aggregate metric, but its per-key accuracies are quite uneven.
As for the fp data, every method got roughly 80 of the 90 songs right. Most of the mistakes were relative-key confusions (the rest were transposed songs and the like).
Since the goal this time is simply to "determine the key", let's use domain knowledge to change the problem design.
There are 24 keys to distinguish, but they do not all have completely distinct characteristics. Each key has exactly one partner with very similar characteristics, specifically the same key signature and thus nearly the same notes and chords. This partner is called the relative key. For example, the relative key of C (C major) is Am (A minor); every major key is paired with one minor key in this way. Let's look at actual chord notation.
How about that? The chords used are quite similar, aren't they? Among the 24 keys there are 12 such relative-key pairs.
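The 12 relative-key pairs can be enumerated programmatically. The "@"-joined label format below is an assumption for illustration, matching the pair labels used by the two-stage model later in the article:

```python
# Each major key paired with its relative minor (same key signature)
relative_minor = {
    'C': 'Am',   'G': 'Em',   'D': 'Bm',  'A': 'F#m',  'E': 'C#m',  'B': 'G#m',
    'F#': 'D#m', 'C#': 'A#m', 'F': 'Dm',  'Bb': 'Gm',  'Eb': 'Cm',  'Ab': 'Fm',
}

# One label per pair, e.g. "C@Am", giving the 12 classes of the first stage
pair_labels = [f'{maj}@{mnr}' for maj, mnr in relative_minor.items()]
```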
Based on the above, the prediction is split into two stages: first, a 12-class model predicts which relative-key pair the song belongs to (for example, "C or Am"); then a binary model decides whether the song is major or minor. Combining the two outputs gives the final key. Both models use LightGBM.
```python
model_1 = lgbm.LGBMClassifier(class_weight='balanced')  # stage 1: relative-key pair
model_2 = lgbm.LGBMClassifier(class_weight='balanced')  # stage 2: major or minor

key_answer = df['key']
diatonic_answer = df['diatonic_type']
type_answer = df['key_type']
X = df[num_cols]
y1 = df['diatonic_type_int']

for seed in [1, 4, 9, 16, 25]:
    # 12-class classification (relative-key pairs)
    X_train, X_test, y1_train, y1_test = train_test_split(X, y1, random_state=seed)
    model_1.fit(X_train, y1_train)
    y1_pred = model_1.predict(X_test)
    # Back to string labels such as "C@Am"
    y1_pred_str = le_d.inverse_transform(y1_pred)

    # Binary classification (major or minor) on the same split
    train_index = y1_train.index
    test_index = y1_test.index
    y2_train = type_answer[train_index]
    y2_test = type_answer[test_index]
    model_2.fit(X_train, y2_train)
    y2_pred = model_2.predict(X_test)

    # Combine the two predictions: take the major or minor member of the pair
    y_pred = []
    for y1_, y2_ in zip(y1_pred_str, y2_pred):
        if y2_ == 1:  # major
            ans = y1_.split('@')[0]
        else:         # minor
            ans = y1_.split('@')[1]
        y_pred.append(ans)
    y_test = key_answer[test_index]
```
Average over 5 runs: Accuracy: 0.839 / Macro recall: 0.826 / Macro F1 score: 0.812
fp data accuracy: 0.867
|Key|Accuracy (recall)|
|---|---|
The 5-run average is slightly below LightGBM's 24-class result. The first stage (the 12-class relative-key-pair classification) reached a respectable 93% or so, but the second-stage major-or-minor decision pulled the overall accuracy down. On the fp data, relative-key confusions again account for most of the errors, and some of the misclassified songs were ones the 24-class model had gotten right.
I tried various things, but the conclusion is that a straightforward 24-class classification with LightGBM works best. LightGBM really is impressive, and not just for its accuracy: **training is fast with almost no preprocessing required**. That means you can afford many more trials, for example for parameter tuning, than with other models, which is a big advantage.
That is not to say the other methods are worthless. Rule base 1, for example, has high accuracy on minor keys, showing that the most frequent chord is useful for classifying them. Rule base 2 confirmed the validity of my own hypothesis. Logistic regression showed the smallest variation in per-class accuracy. The support vector machine still has room for parameter tuning, so its accuracy might yet improve. And the final two-stage design might also do well with different models or parameters.
So for now I will go with LightGBM, which delivers high accuracy with little effort, and take the time to tune its parameters. When I do, I will write another article. It was a clumsy piece of writing, but thank you for reading to the end.