This is a continuation of phoneme prediction using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, phoneme prediction is performed using the features and phoneme labels prepared in Part 1. It is assumed that you have an NVIDIA GPU with CUDA installed.
In Speech Recognition: Phoneme Prediction Part1 - ATR503 Speech dataset, I prepared the list of phoneme labels and the HTK-format feature files with frame-level phoneme labels from the ATR sample speech database [1].
In Part 2, we will create and train a phoneme prediction model using a recurrent neural network (RNN).
The overall picture of the phoneme prediction implemented this time is shown in the figure below. The recurrent neural network is built from LSTMs [2] in a bidirectional model [3] that joins the forward and backward outputs.
In each layer, Layer Normalization [4] is applied to the forward and backward outputs separately; the two are then concatenated and combined through a residual connection [5], and the final fully connected layer predicts the phoneme label.
For the initial value of each parameter, we used Glorot's uniform distribution [6].
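A rough sketch of one such bidirectional layer in CNTK is shown below. The feature dimension, hidden size, depth, and the initial Dense projection are placeholder assumptions of mine, not values taken from the actual training script; the Dropout rate of 0.1 before the residual connection is the one described in the training settings below.

```python
import cntk as C

input_dim = 39     # placeholder acoustic feature dimension
num_hidden = 128   # placeholder hidden size per direction
num_labels = 43    # placeholder: number of phoneme labels incl. blank
num_layers = 3     # placeholder depth

def bilstm_block(h, hidden_dim, dropout_rate=0.1):
    # run LSTMs forward and backward over the sequence
    fwd = C.layers.Recurrence(C.layers.LSTM(hidden_dim))(h)
    bwd = C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=True)(h)
    # Layer Normalization applied to each direction separately
    fwd = C.layers.LayerNormalization()(fwd)
    bwd = C.layers.LayerNormalization()(bwd)
    hh = C.splice(fwd, bwd)                  # concatenate both directions
    hh = C.layers.Dropout(dropout_rate)(hh)  # Dropout before the residual sum
    return hh + h                            # residual connection

features = C.sequence.input_variable(input_dim)
# project to 2*num_hidden so the residual sum matches the concatenated width
h = C.layers.Dense(2 * num_hidden, init=C.glorot_uniform())(features)
for _ in range(num_layers):
    h = bilstm_block(h, num_hidden)
# final fully connected layer predicting the phoneme label
model = C.layers.Dense(num_labels, init=C.glorot_uniform())(h)
```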
We used Connectionist Temporal Classification [7] as the loss function.
Adam [8] was used as the optimization algorithm, with the hyperparameter $\beta_1$ set to 0.9 and $\beta_2$ set to CNTK's default value.
The Cyclical Learning Rate (CLR) [9] was used for the learning rate, with a maximum learning rate of 1e-3, a base learning rate of 1e-5, and a step size of 10 times the number of epochs. The policy was set to triangular2.
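The triangular2 policy itself is easy to write down; below is a minimal sketch in plain Python, assuming a placeholder step_size. It only computes the learning rate for a given iteration; wiring it into the CNTK learner is left to the training script.

```python
import numpy as np

def triangular2(iteration, base_lr=1e-5, max_lr=1e-3, step_size=1000):
    # which cycle we are in (1-indexed); one cycle = 2 * step_size iterations
    cycle = np.floor(1 + iteration / (2 * step_size))
    # position within the cycle: 0 at the peak, 1 at the base
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    # triangular2 halves the amplitude after every cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) / (2 ** (cycle - 1))
```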
As a countermeasure against overfitting, I applied Dropout [10] with a rate of 0.1 before the residual connection.
Model training ran for 100 epochs of mini-batch learning.
・CPU Intel(R) Core(TM) i7-7700 3.60GHz
・GPU NVIDIA Quadro P4000 8GB

・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.53
・librosa 0.8.0
・numpy 1.19.2
・pandas 1.1.2
・PyAudio 0.2.11
・scipy 1.5.2
The training program is available on GitHub.
ctcr_training.py
Here I supplement the background needed for this implementation.
Connectionist Temporal Classification

In speech recognition, the number of frames in the speech data and the number of phonemes to be predicted usually differ, so there is no one-to-one correspondence between the RNN outputs and the target data.
Therefore, a blank character _ is introduced: the target data aoi follows the arrow path shown in the figure below and is expanded to match the number of frames, as in __a_o_i.
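As a minimal illustration of this blank expansion (the helper name is mine, not from the training script):

```python
def expand_with_blanks(labels, blank="_"):
    """['a', 'o', 'i'] -> ['_', 'a', '_', 'o', '_', 'i', '_']"""
    expanded = [blank]
    for label in labels:
        expanded += [label, blank]
    return expanded
```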
In Connectionist Temporal Classification (CTC), the parameters are learned by maximum likelihood estimation, computing the probability of the paths that yield the correct label sequence as in the figure above. With input sequence $x$ and target label sequence $l$, the loss function is defined by the following formula.
Loss = - \log p(l|x)
However, the number of possible paths entering this loss function is enormous, so it is computed efficiently with a forward-backward algorithm based on dynamic programming.
Here, as shown in the figure below, the sum of the probabilities over the set $\pi$ of paths that reach label position $s$ at time $t$ is called the forward probability $\alpha_t(s)$ and is expressed by the following formula.
\alpha_t(s) = \sum_{\pi' \in \pi} \prod^t_{i=1} y^i_{\pi'_i}
This forward probability $\alpha_t(s)$ can be calculated recursively and efficiently based on the idea of dynamic programming.
\left\{
\begin{array}{ll}
\alpha_1(1) = y^1_1 \\
\alpha_t(s) = (\alpha_{t-1} (s-1) + \alpha_{t-1} (s)) y^t_s
\end{array}
\right.
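A direct NumPy transcription of this simplified recursion is sketched below; y is a (frames x labels) array of network output probabilities and labels is the blank-expanded target as integer indices. Note that the full CTC recursion adds a skip term $\alpha_{t-1}(s-2)$ for non-blank labels, which is omitted here to match the formula above.

```python
import numpy as np

def forward_probabilities(y, labels):
    T, S = y.shape[0], len(labels)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, labels[0]]                 # alpha_1(1) = y^1_1
    for t in range(1, T):
        for s in range(S):
            prev = alpha[t - 1, s]                # stay at position s
            if s > 0:
                prev += alpha[t - 1, s - 1]       # advance from s-1
            alpha[t, s] = prev * y[t, labels[s]]  # multiply by y^t_s
    return alpha
```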
Similarly, the backward probability $\beta_t(s)$ is defined as follows.
\beta_t(s) = \sum_{\pi' \in \pi} \prod^T_{i=t+1} y^i_{\pi'_i}
This backward probability $\beta$ can be calculated recursively and efficiently in the same way as the forward probability $\alpha$.
\left\{
\begin{array}{ll}
\beta_T(S) = 1 \\
\beta_t(s) = (\beta_{t+1} (s+1) + \beta_{t+1} (s)) y^{t+1}_s
\end{array}
\right.
Then the total probability over all paths passing through position $s$ at time $t$ is $\alpha_t(s)\beta_t(s)$, and the loss function becomes:
Loss = - \log \sum^S_{s=1} \alpha_t(s)\beta_t(s)
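Continuing the NumPy sketch above, the backward recursion and the loss (which can be evaluated at any time $t$; here $t=0$ is used) would look like this:

```python
def backward_probabilities(y, labels):
    T, S = y.shape[0], len(labels)
    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = 1.0                  # beta_T(S) = 1
    for t in range(T - 2, -1, -1):
        for s in range(S):
            nxt = beta[t + 1, s]              # stay at position s
            if s < S - 1:
                nxt += beta[t + 1, s + 1]     # advance to s+1
            beta[t, s] = nxt * y[t + 1, labels[s]]  # multiply by y^{t+1}_s
    return beta

def ctc_loss(y, labels, t=0):
    alpha = forward_probabilities(y, labels)
    beta = backward_probabilities(y, labels)
    return -np.log(np.sum(alpha[t] * beta[t]))
```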
This time, we used the edit distance as the performance evaluation metric for the model. The edit distance, also known as the Levenshtein distance, is the minimum number of insertion, deletion, and substitution operations needed to transform one string into another.
As an example, the edit distance between the string akai and the string aoi is 2, obtained by the operations:
・delete the k of akai
・replace the second a of aai with o
The edit distance can be calculated using a table with a blank character added to the beginning of each string, as shown below.
First, fill in the edit distances for the first row and first column, as shown by the blue arrows in the figure below. These are the distances from the blank character _, i.e. the length of the string up to each position.
Next, fill in each remaining cell in order from the top left, as shown by the green arrow, with the smallest of the following (a code sketch of the full procedure follows below):
・the value above plus 1
・the value to the left plus 1
・the value in the upper left plus 1 (however, if the characters for that row and column are the same, 1 is not added)
The value shown in red at the bottom right is then the edit distance we are after.
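This table-filling procedure is the classic dynamic-programming form of the Levenshtein distance; a minimal Python version is:

```python
def edit_distance(a, b):
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):                  # first column: delete everything
        d[i][0] = i
    for j in range(cols):                  # first row: insert everything
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]    # no cost if the characters match
            d[i][j] = min(d[i - 1][j] + 1,                       # value above + 1
                          d[i][j - 1] + 1,                       # value to the left + 1
                          d[i - 1][j - 1] + (0 if same else 1))  # upper left
    return d[-1][-1]                       # bottom-right cell

print(edit_distance("akai", "aoi"))  # -> 2
```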
Training loss and error

The figure below visualizes the logs of the loss function and the recognition error during training. The graph on the left shows the loss function and the graph on the right shows the edit distance; the horizontal axes show the number of epochs, and the vertical axes show the loss value and the edit distance, respectively.
Validation error

Evaluating the performance on the validation data split off during data preparation in Part 1 gave the following result.
Validation Error 40.31%
The results of phoneme prediction on a recording of my own voice are shown below. The utterance was "konnichiwa" ("hello").
Say...
Record.
['sil', 'h', 'o', 'h', 'i', 'cl', 'ch', 'i', 'e', 'o', 'a', 'sil']
Since phonemes are output frame by frame during inference, the final phoneme prediction is obtained by collapsing runs of consecutive identical phonemes and removing the blank character _.
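A minimal sketch of this post-processing (the function name is mine):

```python
def collapse(frames, blank="_"):
    """Merge runs of identical frame-level phonemes, then drop blanks:
    ['_', 'a', 'a', '_', 'o', 'i', 'i'] -> ['a', 'o', 'i']"""
    out = []
    previous = None
    for p in frames:
        if p != previous and p != blank:
            out.append(p)
        previous = p
    return out
```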
The vowels seem to be mostly correct, but the consonants don't seem to work, except for 'ch'.
CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria
Speech Recognition : Phoneme Prediction Part1 - ATR503 Speech dataset