This series summarizes music genre classification using the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we prepare the data for music genre classification.
The topics are introduced in the following order.
GTZAN Genre Collection
・ Blues ・ Classical ・ Country ・ Disco ・ Hiphop ・ Jazz ・ Metal ・ Pop ・ Reggae ・ Rock
The collection contains 10 music genres, each with 100 audio clips of 30 seconds.
Download and unzip genres.tar.gz from the link above.
The directory structure this time is as follows.
MGCC
 |―gtzan
 |   |―...
 mgcc_gtzan.py
Audio is represented as waveform data, but in speech recognition it is generally not handled as a raw waveform; instead, it is converted to frequency-domain data using the Fourier transform.
Here, we create log-mel spectrogram images from the audio waveform data by the following procedure.
I used scikit-learn's train_test_split function to split the data into training and validation sets, with the argument test_size set to 0.2.
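The split can be sketched as follows. The file paths here are hypothetical stand-ins; the actual script builds the list from the gtzan directory.

```python
from sklearn.model_selection import train_test_split

# Hypothetical file list for one genre; the real script walks the gtzan directory.
files = [f"gtzan/blues/blues.{i:05d}.wav" for i in range(100)]
labels = [0] * 100  # numeric label for the genre

# test_size=0.2 -> 80 training files, 20 validation files
train_files, val_files, train_labels, val_labels = train_test_split(
    files, labels, test_size=0.2, random_state=0)

print(len(train_files), len(val_files))  # → 80 20
```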
・ CPU Intel(R) Core(TM) i7-6700K 4.00GHz
・ Windows 10 Pro 1909
・ Python 3.6.6
・ Matplotlib 3.1.1
・ Numpy 1.19.2
・ Librosa 0.8.0
・ Scikit-learn 0.23.2
The implemented program is published on GitHub.
mgcc_genre.py
The essential parts of the program to be executed are explained below.
As shown in the lower left of the figure below, audio is obtained as waveform data with time on the horizontal axis and amplitude on the vertical axis. This waveform is composed of multiple frequency components, as shown in the upper right. Therefore, by applying the fast Fourier transform, the frequencies contained in the waveform can be examined, as shown in the lower right.
The short-time Fourier transform (STFT) divides the waveform data into sections and applies the fast Fourier transform to each one. This makes it possible to see how the frequency content changes over time, with each section treated as one frame.
At execution time, as shown in the figure below, the waveform is cut into overlapping sections, a window function is applied to each section, and then the Fourier transform is performed.
I used a Hamming window as the window function. The Hamming window is expressed by the following formula.
w_{\text{hamming}}(n) = 0.54 - 0.46 \cos \left( \frac{2 \pi n}{N-1} \right)
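As a quick check, the formula above reproduces NumPy's built-in Hamming window exactly:

```python
import numpy as np

N = 16
n = np.arange(N)

# Hamming window computed directly from the formula above
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# NumPy ships the same window as np.hamming
assert np.allclose(w, np.hamming(N))
```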
As shown in the figure below, the results obtained by the short-time Fourier transform can be viewed as an image with time on the horizontal axis and frequency on the vertical axis. Such an image is called a spectrogram, where each pixel represents the intensity of the amplitude spectrum, which we have converted to dB.
Human hearing has lower resolution at higher frequencies. The mel scale [2] is a scale that reflects this property. There are several conversion formulas for the mel scale; librosa defaults to the Slaney formula.
The spectrogram is converted to the mel scale by taking the dot product of the filter bank shown below with the power spectrum obtained from the short-time Fourier transform.
When you run the program, it creates and saves log-mel spectrogram images for the training and validation data.
The figure below shows a log-mel spectrogram image for each genre. The horizontal axis is time, the vertical axis is mel frequency, and each pixel represents the logarithmic intensity.
Now that the data is ready for training, Part 2 will perform the music genre classification.