This article summarizes phoneme prediction using the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we prepare the data for phoneme prediction.
The topics are introduced in the following order.
The ATR sample speech dataset [1] is an utterance dataset based on the sentences of the ATR database.
Download and unzip atr_503_v1.0.tar.gz from the link above. The audio data is in the .ad files under the speech directory, and the phoneme labels used this time are the .lab files under the old directory under label/monophone.
The directory structure this time is as follows.

CTCR
 |―atr_503
   |―label
   |―speech
   |―...
 ctcr_atr503.py
MGCC
The audio data is stored as big-endian signed 16-bit integers sampled at 16,000 Hz, so we normalize it to the range [-1, 1] by dividing by the maximum value $2^{16}/2 - 1 = 32,767$.
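Assuming the .ad files are headerless raw PCM (big-endian signed 16-bit, 16,000 Hz), loading and normalizing them can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def load_ad(path):
    # assumption: .ad files are headerless raw PCM,
    # big-endian signed 16-bit, sampled at 16,000 Hz
    data = np.fromfile(path, dtype=">i2").astype(np.float64)
    # normalize to [-1, 1] by the maximum value 2^16 / 2 - 1 = 32,767
    return data / 32767.0
```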
This time, Mel-Frequency Cepstrum Coefficients (MFCC) were computed from the audio data, using 13 dimensions. As preprocessing, high-frequency emphasis is applied to the audio. The 1st and 2nd derivatives of the MFCC are also included, for a total of 39-dimensional features.
The created features are saved as binary files in HTK (Hidden Markov Model Toolkit) format.
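Writing features in the HTK binary format can be sketched as below. The 12-byte header holds the number of frames, the sample period in 100 ns units (100,000 = 10 ms frame shift), the bytes per frame, and the parameter kind; the kind code MFCC_D_A = 6 | 0o400 | 0o1000 is an assumption for MFCC with first and second derivatives:

```python
import struct
import numpy as np

def write_htk(path, feats, samp_period=100000):
    # feats: (n_frames, n_dims) array of features
    # HTK header: nSamples, sampPeriod (100 ns units), sampSize, parmKind
    # parmKind 6 | 0o400 | 0o1000 = MFCC with deltas (_D) and
    # accelerations (_A); this code is an assumption, not verified here
    n_frames, n_dims = feats.shape
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n_frames, samp_period,
                            n_dims * 4, 6 | 0o400 | 0o1000))
        # feature values are stored as big-endian 32-bit floats
        f.write(feats.astype(">f4").tobytes())
```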
For this training, we will use HTKDeserializer and HTKMLFDeserializer, CNTK's built-in readers specialized for speech recognition.
The general processing flow of the program that prepares for phoneme prediction is as follows.
- CPU Intel(R) Core(TM) i7-6700K 4.00GHz
- Windows 10 Pro 1909
- Python 3.6.6
- Librosa 0.8.0
- Numpy 1.19.2
- Pandas 1.1.2
- Scikit-learn 0.23.2
- Scipy 1.5.2
The implemented program is published on GitHub.
ctcr_atr503.py
The essential parts of the program to be executed are explained below.
The power of speech attenuates at higher frequencies, so high-frequency emphasis is used to compensate for this. With frequency $f$ and sampling frequency $f_s$, the first-order finite impulse response (FIR) filter $H(z)$ used as a high-pass filter is expressed by the following equation.
H(z) = 1 - \alpha z^{-1} \\
z = \exp(j \omega), \omega = 2 \pi f / f_s
Generally, $\alpha = 0.97$ is used.
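In the time domain, this filter corresponds to the difference equation $y[n] = x[n] - \alpha x[n-1]$, which can be sketched as:

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    # high-frequency emphasis: y[n] = x[n] - alpha * x[n-1]
    # the first sample is kept as-is since it has no predecessor
    return np.append(x[0], x[1:] - alpha * x[:-1])
```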
The mel-frequency cepstrum is obtained by converting the power of the mel spectrogram (used in Speech Recognition: Genre Classification Part1 - GTZAN Genre Collections) to decibels and then applying the discrete cosine transform.
The cepstrum [2] (an anagram of "spectrum") can separate the fine and slowly varying fluctuations of the spectrum, and represents characteristics of the human vocal tract.
Also, in order to capture the temporal change of the features, the difference between adjacent frames is added as a feature. This is called the delta cepstrum [3]; this time, not only the 1st derivative but also the 2nd derivative is calculated and used as a feature.
CNTK's built-in readers HTKDeserializer and HTKMLFDeserializer require three files: a list file, a script file, and a master label file.
The list file must contain each phoneme label to be used exactly once, as shown below. Also, add _ as a blank character.
atr503_mapping.list
A
E
...
z
_
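Building such a list from the phoneme labels that appear in the data can be sketched as follows (a hypothetical helper, not part of the published program):

```python
def build_label_list(phonemes):
    # collect the unique phoneme labels in sorted order,
    # then append "_" as the blank character
    labels = sorted(set(phonemes))
    labels.append("_")
    return labels
```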
The contents of the script file are as follows: the path to the saved HTK-format file goes on the right side of the equals sign, and the frame range is written in brackets. Note that the frame range must start at 0, so the end index is the number of frames minus 1.
train_atr503.scp
train_atr503/00000.mfc=./train_atr503/00000.htk[0,141]
train_atr503/00001.mfc=./train_atr503/00001.htk[0,258]
...
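A script-file line of this form can be generated as below (a hypothetical helper); note that the end index is the number of frames minus 1:

```python
def scp_line(name, htk_path, n_frames):
    # the frame range is inclusive: [0, n_frames - 1]
    return f"{name}.mfc={htk_path}[0,{n_frames - 1}]"
```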
The left side of the equals sign in the script file must correspond to an entry in the master label file. The contents of the master label file are as follows; the frame and phoneme labels start from the second line. The frame interval must be at least 1, and by design five zeros are appended (HTK expresses time in units of 100 ns). Each utterance's label information is terminated with a period.
train_atr503.mlf
#!MLF!#
"train_atr503/00000.lab"
0 1600000 sil
1600000 1800000 h
...
13600000 14200000 sil
.
"train_atr503/00001.lab"
0 400000 sil
400000 1100000 s
...
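An MLF entry of this form can be generated as below (a hypothetical helper), assuming a 10 ms frame shift so that a frame index times 100,000 (the five appended zeros) gives the HTK time stamp in 100 ns units:

```python
def mlf_entry(name, labels):
    # labels: list of (start_frame, end_frame, phoneme) tuples
    # assumption: 10 ms frame shift, so frame * 100000 is the
    # HTK time stamp in 100 ns units (hence the five zeros)
    lines = [f'"{name}.lab"']
    for start, end, phoneme in labels:
        lines.append(f"{start * 100000} {end * 100000} {phoneme}")
    lines.append(".")  # period terminates the utterance's labels
    return "\n".join(lines)
```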
When you run the program, the features are generated and saved as binary files in HTK format. At the same time, the frame and phoneme labels are written.
Number of labels : 43
Number of samples 452
Number of samples 51
Now that the preparation for training is complete, Part 2 will cover training for phoneme prediction.
CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria
Speech Recognition : Genre Classification Part1 - GTZAN Genre Collections