This is a continuation of music genre classification using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, music genre classification is performed using the logarithmic mel spectrogram images prepared in Part 1. It is assumed that you have an NVIDIA GPU with CUDA installed.
In Speech Recognition : Genre Classification Part1 - GTZAN Genre Collections, we prepared the training and validation data.
In Part 2, we classify music genres using a convolutional neural network (CNN).
Since audio data is a one-dimensional waveform, a one-dimensional convolutional neural network comes to mind first, but here we treat the log-mel spectrogram as a grayscale image with time on the horizontal axis and frequency on the vertical axis and apply a two-dimensional convolutional neural network. [1]
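As a rough illustration of this input representation (not the actual Part 1 preprocessing), a log-mel spectrogram can be computed and treated as a single-channel image. The use of librosa, the file name, and the fixed 128-frame crop below are assumptions.

```python
import numpy as np
import librosa

# Illustration only: turn a waveform into a 1 x 128 x 128 "grayscale image"
# (channel x mel bins x time frames). librosa, the file name, and the fixed
# 128-frame crop are assumptions, not necessarily the Part 1 pipeline.
y, sr = librosa.load("blues.00000.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)                  # logarithmic (dB) scale
image = log_mel[np.newaxis, :, :128].astype(np.float32)
print(image.shape)                                  # (1, 128, 128)
```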
The structure of the convolutional neural network has been simplified as follows. [2]
| Layer | Filters | Size/Stride | Input | Output |
|---|---|---|---|---|
| Convolution2D | 64 | 3x3/1 | 1x128x128 | 64x128x128 |
| MaxPooling2D | | 3x3/2 | 64x128x128 | 64x64x64 |
| Convolution2D | 128 | 3x3/1 | 64x64x64 | 128x64x64 |
| MaxPooling2D | | 3x3/2 | 128x64x64 | 128x32x32 |
| Convolution2D | 256 | 3x3/1 | 128x32x32 | 256x32x32 |
| Dense | | 512 | 256x32x32 | 512 |
| Dense | | 512 | 512 | 512 |
| Dense | | 10 | 512 | 10 |
| Softmax | | | 10 | 10 |
For the initial values of the parameters, we used the He normal distribution [[3]](#reference) for the convolution layers and the Glorot uniform distribution [[4]](#reference) for the fully connected layers.
Cross-entropy error was used as the loss function.
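A minimal sketch of how this architecture could be written with CNTK's layers API is shown below. The function name create_model, the ReLU activations, and the zero padding are assumptions (they are not stated in the table), and this is not necessarily the code in mgcc_training.py; the Dropout(0.5) between the dense layers follows the description further down.

```python
import cntk as C

def create_model(num_classes=10):
    """Sketch of the network in the table above.

    ReLU activations and zero padding are assumptions. He normal init is used
    for the convolution layers and Glorot uniform init for the dense layers.
    """
    return C.layers.Sequential([
        C.layers.Convolution2D((3, 3), 64, activation=C.relu, pad=True,
                               strides=1, init=C.he_normal()),
        C.layers.MaxPooling((3, 3), strides=2, pad=True),
        C.layers.Convolution2D((3, 3), 128, activation=C.relu, pad=True,
                               strides=1, init=C.he_normal()),
        C.layers.MaxPooling((3, 3), strides=2, pad=True),
        C.layers.Convolution2D((3, 3), 256, activation=C.relu, pad=True,
                               strides=1, init=C.he_normal()),
        C.layers.Dense(512, activation=C.relu, init=C.glorot_uniform()),
        C.layers.Dropout(0.5),                       # dropout between dense layers
        C.layers.Dense(512, activation=C.relu, init=C.glorot_uniform()),
        C.layers.Dropout(0.5),
        C.layers.Dense(num_classes, activation=None, init=C.glorot_uniform()),
    ])

features = C.input_variable((1, 128, 128))   # channel x mel bins x time frames
model = create_model()(features)             # logits; softmax is folded into the loss
```

The final Softmax row of the table is not a separate layer here because `cross_entropy_with_softmax` applies the softmax inside the loss; at inference time a softmax can be applied to the logits explicitly.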
We adopted Stochastic Gradient Descent (SGD) with momentum as the optimization algorithm, with the momentum fixed at 0.9 and the L2 regularization weight set to 0.0005.
The learning rate follows the Cyclical Learning Rate (CLR) [5] schedule, with a maximum learning rate of 1e-3, a base learning rate of 1e-5, a step size of 10 times the number of epochs, and the triangular2 policy.
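For reference, the triangular2 policy of [5] can be written as a function of the iteration count. The training script may rely on an existing CLR implementation, so this stand-alone version is only a sketch; the defaults follow the values above, and step_size must be supplied by the caller.

```python
import numpy as np

def triangular2_lr(iteration, step_size, base_lr=1e-5, max_lr=1e-3):
    """Triangular2 cyclical learning rate [5].

    The learning rate oscillates between base_lr and max_lr over one cycle
    (2 * step_size iterations), and the amplitude is halved after every cycle.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) / (2.0 ** (cycle - 1))
```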
As a countermeasure against overfitting, Dropout [6] with a rate of 0.5 was applied between the fully connected layers.
The model was trained for 25 epochs using mini-batch training with a mini-batch size of 32.
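Putting the pieces together, the training setup might look roughly like the following. It continues from the sketches above (`features`, `model`, and `triangular2_lr`), and the data-feeding helper `iterate_minibatches()` is hypothetical, standing in for however mgcc_training.py actually reads the Part 1 data.

```python
import cntk as C

# Continues from the earlier sketches: `features` and `model` (logits) come from
# the model sketch, triangular2_lr() is the schedule sketched above, and
# iterate_minibatches() is a hypothetical generator yielding (x, one-hot y) pairs.
labels = C.input_variable(10)                         # one-hot genre labels
loss = C.cross_entropy_with_softmax(model, labels)    # cross-entropy error
error = C.classification_error(model, labels)         # misclassification rate

learner = C.momentum_sgd(model.parameters,
                         lr=C.learning_parameter_schedule(1e-5),
                         momentum=0.9,
                         l2_regularization_weight=0.0005)
trainer = C.Trainer(model, (loss, error), [learner])

minibatch_size, num_epochs = 32, 25
step_size = 10 * num_epochs   # "10 times the number of epochs" (interpretation assumed)
iteration = 0
for epoch in range(num_epochs):
    for x_batch, y_batch in iterate_minibatches(minibatch_size):
        lr = triangular2_lr(iteration, step_size=step_size)
        learner.reset_learning_rate(C.learning_parameter_schedule(lr))
        trainer.train_minibatch({features: x_batch, labels: y_batch})
        iteration += 1
```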
- CPU: Intel(R) Core(TM) i7-6700K 4.00GHz
- GPU: NVIDIA GeForce GTX 1060 6GB
- Windows 10 Pro 1909
- CUDA 10.0
- cuDNN 7.6
- Python 3.6.6
- cntk-gpu 2.7
- cntkx 0.1.53
- matplotlib 3.3.1
- numpy 1.19.2
- pandas 1.1.2
- scikit-learn 0.23.2
The training program is available on GitHub.
mgcc_training.py
Training loss and error

The figure below visualizes the loss function and error rate logged during training. The graph on the left shows the loss function and the graph on the right shows the error rate; the horizontal axis is the number of epochs and the vertical axis is the value of the loss function and the error rate, respectively.

Validation accuracy and confusion matrix

When the performance was evaluated using the test data split off when preparing the data in Part 1, the following result was obtained.
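A plot like the one described can be produced with matplotlib (included in the environment above). The lists `epoch_loss` and `epoch_error` are assumptions, standing for whatever per-epoch averages the training script logs.

```python
import matplotlib.pyplot as plt

# epoch_loss and epoch_error are assumed to hold per-epoch averages logged
# during training.
fig, (ax_loss, ax_err) = plt.subplots(1, 2, figsize=(10, 4))
epochs = range(1, len(epoch_loss) + 1)
ax_loss.plot(epochs, epoch_loss)
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("loss")
ax_err.plot(epochs, epoch_error)
ax_err.set_xlabel("epoch")
ax_err.set_ylabel("error rate")
plt.show()
```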
Validation Accuracy 69.00%
The figure below visualizes the confusion matrix for the validation data. The row direction is the ground truth and the column direction is the prediction.
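The confusion matrix itself can be computed with scikit-learn (listed in the environment above). Here `y_true` and `y_pred` are assumptions: the validation labels and the argmax of the model output, respectively.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# y_true: ground-truth genre indices of the validation set (assumed),
# y_pred: argmax of the model output for each validation sample (assumed).
cm = confusion_matrix(y_true, y_pred)
plt.imshow(cm, cmap="Blues")       # rows: ground truth, columns: prediction
plt.xlabel("prediction")
plt.ylabel("ground truth")
plt.colorbar()
plt.show()

print("Validation Accuracy {:.2%}".format(np.trace(cm) / cm.sum()))
```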
Speech Recognition : Genre Classification Part1 - GTZAN Genre Collections