This is a continuation of phoneme prediction using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, phoneme prediction is performed using the features and phoneme labels prepared in Part 1. It is assumed that you have an NVIDIA GPU with CUDA installed.
In Speech Recognition: Phoneme Prediction Part1 - ATR503 Speech dataset, I prepared the list of phoneme labels and the HTK-format feature files with frame-level phoneme labels from the ATR sample speech database [1].
In Part 2, we will create and train a phoneme prediction model using a recurrent neural network (RNN).
The overall picture of the phoneme prediction implemented this time is shown in the figure below. The recurrent neural network is built from LSTMs [2] in a bidirectional model [3] that joins the forward and backward outputs.
In each layer, Layer Normalization [4] is applied to the forward and backward outputs separately; the two are then concatenated and combined through a residual connection [5], and the final fully connected layer predicts the phoneme label.
For the initial value of each parameter, we used Glorot's uniform distribution [6].
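A rough sketch of one such bidirectional layer in CNTK is shown below. The feature dimension, hidden size, depth, and the initial Dense projection are placeholder assumptions of mine, not values taken from the actual training script; the Dropout rate of 0.1 before the residual connection is the one described in the training settings below.

```python
import cntk as C

input_dim = 39     # placeholder acoustic feature dimension
num_hidden = 128   # placeholder hidden size per direction
num_labels = 43    # placeholder: number of phoneme labels incl. blank
num_layers = 3     # placeholder depth

def bilstm_block(h, hidden_dim, dropout_rate=0.1):
    # run LSTMs forward and backward over the sequence
    fwd = C.layers.Recurrence(C.layers.LSTM(hidden_dim))(h)
    bwd = C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=True)(h)
    # Layer Normalization applied to each direction separately
    fwd = C.layers.LayerNormalization()(fwd)
    bwd = C.layers.LayerNormalization()(bwd)
    hh = C.splice(fwd, bwd)                  # concatenate both directions
    hh = C.layers.Dropout(dropout_rate)(hh)  # Dropout before the residual sum
    return hh + h                            # residual connection

features = C.sequence.input_variable(input_dim)
# project to 2*num_hidden so the residual sum matches the concatenated width
h = C.layers.Dense(2 * num_hidden, init=C.glorot_uniform())(features)
for _ in range(num_layers):
    h = bilstm_block(h, num_hidden)
# final fully connected layer predicting the phoneme label
model = C.layers.Dense(num_labels, init=C.glorot_uniform())(h)
```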
We used Connectionist Temporal Classification [7] as the loss function.
Adam [8] was used as the optimization algorithm, with the hyperparameter $\beta_1$ set to 0.9 and $\beta_2$ set to CNTK's default value.
The Cyclical Learning Rate (CLR) [9] was used for the learning rate, with a maximum learning rate of 1e-3, a base learning rate of 1e-5, and a step size of 10 times the number of epochs. The policy was set to triangular2.
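The triangular2 policy itself is easy to write down; below is a minimal sketch in plain Python, assuming a placeholder step_size. It only computes the learning rate for a given iteration; wiring it into the CNTK learner is left to the training script.

```python
import numpy as np

def triangular2(iteration, base_lr=1e-5, max_lr=1e-3, step_size=1000):
    # which cycle we are in (1-indexed); one cycle = 2 * step_size iterations
    cycle = np.floor(1 + iteration / (2 * step_size))
    # position within the cycle: 0 at the peak, 1 at the base
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    # triangular2 halves the amplitude after every cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) / (2 ** (cycle - 1))
```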
As a countermeasure against overfitting, I applied Dropout [10] with a rate of 0.1 before the residual connection.
Model training ran for 100 epochs of mini-batch learning.
・CPU Intel(R) Core(TM) i7-7700 3.60GHz
・GPU NVIDIA Quadro P4000 8GB

・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.53
・librosa 0.8.0
・numpy 1.19.2
・pandas 1.1.2
・PyAudio 0.2.11
・scipy 1.5.2
The training program is available on GitHub.
ctcr_training.py
Here I supplement the background needed for this implementation.
Connectionist Temporal Classification

In speech recognition, the number of frames in the speech data and the number of phonemes to be predicted usually differ, so there is no one-to-one correspondence between the RNN outputs and the target data.
Therefore, a blank character _ is introduced: the target data aoi follows the arrow path shown in the figure below and is expanded to match the number of frames, as in __a_o_i.
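As a minimal illustration of this blank expansion (the helper name is mine, not from the training script):

```python
def expand_with_blanks(labels, blank="_"):
    """['a', 'o', 'i'] -> ['_', 'a', '_', 'o', '_', 'i', '_']"""
    expanded = [blank]
    for label in labels:
        expanded += [label, blank]
    return expanded
```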
In Connectionist Temporal Classification (CTC), the parameters are learned by maximum likelihood estimation, computing the probability of the paths that yield the correct label sequence as in the figure above. With input sequence $x$ and target label sequence $l$, the loss function is defined by the following formula.
Loss = - \log p(l|x)
However, the number of possible paths entering this loss function is enormous, so it is computed efficiently with a forward-backward algorithm based on dynamic programming.
Here, as shown in the figure below, the sum of the probabilities over the set $\pi$ of paths that reach label position $s$ at time $t$ is called the forward probability $\alpha_t(s)$ and is expressed by the following formula.
\alpha_t(s) = \sum_{\pi' \in \pi} \prod^t_{i=1} y^i_{\pi'_i}
This forward probability $\alpha_t(s)$ can be calculated recursively and efficiently based on the idea of dynamic programming.
\left\{
\begin{array}{ll}
\alpha_1(1) = y^1_1 \\
\alpha_t(s) = (\alpha_{t-1} (s-1) + \alpha_{t-1} (s)) y^t_s
\end{array}
\right.
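A direct NumPy transcription of this simplified recursion is sketched below; y is a (frames x labels) array of network output probabilities and labels is the blank-expanded target as integer indices. Note that the full CTC recursion adds a skip term $\alpha_{t-1}(s-2)$ for non-blank labels, which is omitted here to match the formula above.

```python
import numpy as np

def forward_probabilities(y, labels):
    T, S = y.shape[0], len(labels)
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, labels[0]]                 # alpha_1(1) = y^1_1
    for t in range(1, T):
        for s in range(S):
            prev = alpha[t - 1, s]                # stay at position s
            if s > 0:
                prev += alpha[t - 1, s - 1]       # advance from s-1
            alpha[t, s] = prev * y[t, labels[s]]  # multiply by y^t_s
    return alpha
```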
Similarly, the backward probability $\beta_t(s)$ is defined as follows.
\beta_t(s) = \sum_{\pi' \in \pi} \prod^T_{i=t+1} y^i_{\pi'_i}
This backward probability $\beta$ can be calculated recursively and efficiently in the same way as the forward probability $\alpha$.
\left\{
\begin{array}{ll}
\beta_T(S) = 1 \\
\beta_t(s) = (\beta_{t+1} (s+1) + \beta_{t+1} (s)) y^{t+1}_s
\end{array}
\right.
Then the total probability over all paths passing through position $s$ at time $t$ is $\alpha_t(s)\beta_t(s)$, and the loss function becomes:
Loss = - \log \sum^S_{s=1} \alpha_t(s)\beta_t(s)
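Continuing the NumPy sketch above, the backward recursion and the loss (which can be evaluated at any time $t$; here $t=0$ is used) would look like this:

```python
def backward_probabilities(y, labels):
    T, S = y.shape[0], len(labels)
    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = 1.0                  # beta_T(S) = 1
    for t in range(T - 2, -1, -1):
        for s in range(S):
            nxt = beta[t + 1, s]              # stay at position s
            if s < S - 1:
                nxt += beta[t + 1, s + 1]     # advance to s+1
            beta[t, s] = nxt * y[t + 1, labels[s]]  # multiply by y^{t+1}_s
    return beta

def ctc_loss(y, labels, t=0):
    alpha = forward_probabilities(y, labels)
    beta = backward_probabilities(y, labels)
    return -np.log(np.sum(alpha[t] * beta[t]))
```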
This time, we used the edit distance as the performance evaluation metric for the model. The edit distance, also known as the Levenshtein distance, is the minimum number of insertion, deletion, and substitution operations needed to transform one string into another.
As an example, the edit distance between the string akai and the string aoi is 2, obtained by the operations:
・delete the k of akai
・replace the second a of aai with o
The edit distance can be calculated using a table with a blank character added to the beginning of each string, as shown below.
First, fill in the edit distances for the first row and first column, as shown by the blue arrows in the figure below. These are the distances from the blank character _, i.e. the length of the string up to each position.
Next, fill in each remaining cell in order from the top left, as shown by the green arrow, with the smallest of the following (a code sketch of the full procedure follows below):
・the value above plus 1
・the value to the left plus 1
・the value in the upper left plus 1 (however, if the characters for that row and column are the same, 1 is not added)
The value shown in red at the bottom right is then the edit distance we are after.
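This table-filling procedure is the classic dynamic-programming form of the Levenshtein distance; a minimal Python version is:

```python
def edit_distance(a, b):
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):                  # first column: delete everything
        d[i][0] = i
    for j in range(cols):                  # first row: insert everything
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]    # no cost if the characters match
            d[i][j] = min(d[i - 1][j] + 1,                       # value above + 1
                          d[i][j - 1] + 1,                       # value to the left + 1
                          d[i - 1][j - 1] + (0 if same else 1))  # upper left
    return d[-1][-1]                       # bottom-right cell

print(edit_distance("akai", "aoi"))  # -> 2
```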
Training loss and error

The figure below visualizes the logs of the loss function and the recognition error during training. The graph on the left shows the loss function and the graph on the right shows the edit distance; the horizontal axes show the number of epochs, and the vertical axes show the loss value and the edit distance, respectively.
Validation error

Evaluating the performance on the validation data split off during data preparation in Part 1 gave the following result.
Validation Error 40.31%
The results of phoneme prediction on a recording of my own voice are shown below. The utterance was "konnichiwa" ("hello").
Say...
Record.
['sil', 'h', 'o', 'h', 'i', 'cl', 'ch', 'i', 'e', 'o', 'a', 'sil']
Since phonemes are output frame by frame during inference, the final phoneme prediction is obtained by collapsing runs of consecutive identical phonemes and removing the blank character _.
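A minimal sketch of this post-processing (the function name is mine):

```python
def collapse(frames, blank="_"):
    """Merge runs of identical frame-level phonemes, then drop blanks:
    ['_', 'a', 'a', '_', 'o', 'i', 'i'] -> ['a', 'o', 'i']"""
    out = []
    previous = None
    for p in frames:
        if p != previous and p != blank:
            out.append(p)
        previous = p
    return out
```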
The vowels seem to be mostly correct, but the consonants don't seem to work, except for 'ch'.
CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria
Speech Recognition : Phoneme Prediction Part1 - ATR503 Speech dataset