This is a continuation of music genre classification using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, music genre classification is performed using the logarithmic mel spectrogram images prepared in Part 1. It is assumed that you have an NVIDIA GPU with CUDA installed.
In Speech Recognition : Genre Classification Part1 - GTZAN Genre Collections, we prepared the training and validation data.
In Part 2, we classify music genres using a convolutional neural network (CNN).
Since audio data is a one-dimensional waveform, a one-dimensional convolutional neural network comes to mind first, but here we treat the log-mel spectrogram as a grayscale image with time on the horizontal axis and frequency on the vertical axis and apply a two-dimensional convolutional neural network. [1]
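As a rough illustration of this input representation (not the actual Part 1 preprocessing), a log-mel spectrogram can be computed and treated as a single-channel image. The use of librosa, the file name, and the fixed 128-frame crop below are assumptions.

```python
import numpy as np
import librosa

# Illustration only: turn a waveform into a 1 x 128 x 128 "grayscale image"
# (channel x mel bins x time frames). librosa, the file name, and the fixed
# 128-frame crop are assumptions, not necessarily the Part 1 pipeline.
y, sr = librosa.load("blues.00000.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)                  # logarithmic (dB) scale
image = log_mel[np.newaxis, :, :128].astype(np.float32)
print(image.shape)                                  # (1, 128, 128)
```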
The structure of the convolutional neural network has been simplified as follows. [2]
| Layer | Filters | Size/Stride | Input | Output |
|---|---|---|---|---|
| Convolution2D | 64 | 3x3/1 | 1x128x128 | 64x128x128 |
| MaxPooling2D | | 3x3/2 | 64x128x128 | 64x64x64 |
| Convolution2D | 128 | 3x3/1 | 64x64x64 | 128x64x64 |
| MaxPooling2D | | 3x3/2 | 128x64x64 | 128x32x32 |
| Convolution2D | 256 | 3x3/1 | 128x32x32 | 256x32x32 |
| Dense | | 512 | 256x32x32 | 512 |
| Dense | | 512 | 512 | 512 |
| Dense | | 10 | 512 | 10 |
| Softmax | | | 10 | 10 |
For the initial values of the parameters, we used the He normal distribution [[3]](#reference) for the convolution layers and the Glorot uniform distribution [[4]](#reference) for the fully connected layers.
Cross-entropy error was used as the loss function.
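A minimal sketch of how this architecture could be written with CNTK's layers API is shown below. The function name create_model, the ReLU activations, and the zero padding are assumptions (they are not stated in the table), and this is not necessarily the code in mgcc_training.py; the Dropout(0.5) between the dense layers follows the description further down.

```python
import cntk as C

def create_model(num_classes=10):
    """Sketch of the network in the table above.

    ReLU activations and zero padding are assumptions. He normal init is used
    for the convolution layers and Glorot uniform init for the dense layers.
    """
    return C.layers.Sequential([
        C.layers.Convolution2D((3, 3), 64, activation=C.relu, pad=True,
                               strides=1, init=C.he_normal()),
        C.layers.MaxPooling((3, 3), strides=2, pad=True),
        C.layers.Convolution2D((3, 3), 128, activation=C.relu, pad=True,
                               strides=1, init=C.he_normal()),
        C.layers.MaxPooling((3, 3), strides=2, pad=True),
        C.layers.Convolution2D((3, 3), 256, activation=C.relu, pad=True,
                               strides=1, init=C.he_normal()),
        C.layers.Dense(512, activation=C.relu, init=C.glorot_uniform()),
        C.layers.Dropout(0.5),                       # dropout between dense layers
        C.layers.Dense(512, activation=C.relu, init=C.glorot_uniform()),
        C.layers.Dropout(0.5),
        C.layers.Dense(num_classes, activation=None, init=C.glorot_uniform()),
    ])

features = C.input_variable((1, 128, 128))   # channel x mel bins x time frames
model = create_model()(features)             # logits; softmax is folded into the loss
```

The final Softmax row of the table is not a separate layer here because `cross_entropy_with_softmax` applies the softmax inside the loss; at inference time a softmax can be applied to the logits explicitly.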
We adopted Stochastic Gradient Descent (SGD) with momentum as the optimization algorithm, with the momentum fixed at 0.9 and the L2 regularization weight set to 0.0005.
The learning rate follows the Cyclical Learning Rate (CLR) [5] schedule, with a maximum learning rate of 1e-3, a base learning rate of 1e-5, a step size of 10 times the number of epochs, and the triangular2 policy.
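For reference, the triangular2 policy of [5] can be written as a function of the iteration count. The training script may rely on an existing CLR implementation, so this stand-alone version is only a sketch; the defaults follow the values above, and step_size must be supplied by the caller.

```python
import numpy as np

def triangular2_lr(iteration, step_size, base_lr=1e-5, max_lr=1e-3):
    """Triangular2 cyclical learning rate [5].

    The learning rate oscillates between base_lr and max_lr over one cycle
    (2 * step_size iterations), and the amplitude is halved after every cycle.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) / (2.0 ** (cycle - 1))
```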
As a countermeasure against overfitting, Dropout [6] with a rate of 0.5 was applied between the fully connected layers.
The model was trained for 25 epochs using mini-batch training with a mini-batch size of 32.
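Putting the pieces together, the training setup might look roughly like the following. It continues from the sketches above (`features`, `model`, and `triangular2_lr`), and the data-feeding helper `iterate_minibatches()` is hypothetical, standing in for however mgcc_training.py actually reads the Part 1 data.

```python
import cntk as C

# Continues from the earlier sketches: `features` and `model` (logits) come from
# the model sketch, triangular2_lr() is the schedule sketched above, and
# iterate_minibatches() is a hypothetical generator yielding (x, one-hot y) pairs.
labels = C.input_variable(10)                         # one-hot genre labels
loss = C.cross_entropy_with_softmax(model, labels)    # cross-entropy error
error = C.classification_error(model, labels)         # misclassification rate

learner = C.momentum_sgd(model.parameters,
                         lr=C.learning_parameter_schedule(1e-5),
                         momentum=0.9,
                         l2_regularization_weight=0.0005)
trainer = C.Trainer(model, (loss, error), [learner])

minibatch_size, num_epochs = 32, 25
step_size = 10 * num_epochs   # "10 times the number of epochs" (interpretation assumed)
iteration = 0
for epoch in range(num_epochs):
    for x_batch, y_batch in iterate_minibatches(minibatch_size):
        lr = triangular2_lr(iteration, step_size=step_size)
        learner.reset_learning_rate(C.learning_parameter_schedule(lr))
        trainer.train_minibatch({features: x_batch, labels: y_batch})
        iteration += 1
```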
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- GPU: NVIDIA GeForce GTX 1060 6GB
- Windows 10 Pro 1909
- CUDA 10.0
- cuDNN 7.6
- Python 3.6.6
- cntk-gpu 2.7
- cntkx 0.1.53
- matplotlib 3.3.1
- numpy 1.19.2
- pandas 1.1.2
- scikit-learn 0.23.2
The training program is available on GitHub.
mgcc_training.py
Training loss and error

The figure below visualizes the loss function and error rate logged during training. The graph on the left shows the loss function and the graph on the right shows the error rate; the horizontal axis is the number of epochs and the vertical axis is the value of the loss function and the error rate, respectively.

Validation accuracy and confusion matrix

When the performance was evaluated using the test data split off when preparing the data in Part 1, the following result was obtained.
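A plot like the one described can be produced with matplotlib (included in the environment above). The lists `epoch_loss` and `epoch_error` are assumptions, standing for whatever per-epoch averages the training script logs.

```python
import matplotlib.pyplot as plt

# epoch_loss and epoch_error are assumed to hold per-epoch averages logged
# during training.
fig, (ax_loss, ax_err) = plt.subplots(1, 2, figsize=(10, 4))
epochs = range(1, len(epoch_loss) + 1)
ax_loss.plot(epochs, epoch_loss)
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("loss")
ax_err.plot(epochs, epoch_error)
ax_err.set_xlabel("epoch")
ax_err.set_ylabel("error rate")
plt.show()
```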
Validation Accuracy 69.00%
The figure below visualizes the confusion matrix for the validation data. The row direction is the ground truth and the column direction is the prediction.
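The confusion matrix itself can be computed with scikit-learn (listed in the environment above). Here `y_true` and `y_pred` are assumptions: the validation labels and the argmax of the model output, respectively.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# y_true: ground-truth genre indices of the validation set (assumed),
# y_pred: argmax of the model output for each validation sample (assumed).
cm = confusion_matrix(y_true, y_pred)
plt.imshow(cm, cmap="Blues")       # rows: ground truth, columns: prediction
plt.xlabel("prediction")
plt.ylabel("ground truth")
plt.colorbar()
plt.show()

print("Validation Accuracy {:.2%}".format(np.trace(cm) / cm.sum()))
```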
Speech Recognition : Genre Classification Part1 - GTZAN Genre Collections