[PYTHON] [TensorFlow 2] Learn RNN with CTC Loss

Introduction

I tried using CTC (Connectionist Temporal Classification) loss in TensorFlow 2.x to train the parameters of an RNN (Recurrent Neural Network) that outputs a sequence. There were few samples around and I had a hard time getting it to work, so I wrote this note.

CTC Loss is summarized on the following pages.

- About the theory and implementation of Connectionist Temporal Classification - Is your order machine learning?
- Phoneme recognition using Connectionist Temporal Classification (CTC) - Qiita
- Voice recognition and deep learning - SlideShare

Verification environment

Base sample code

GitHub - igormq/ctc_tensorflow_example: CTC + Tensorflow Example for ASR

This is a sample implemented in TensorFlow 1.x without the Keras API. An LSTM learns the correspondence between a feature sequence and a label (character) sequence, like an end-to-end speech recognition system.

The code is written for TensorFlow 1.x, but it is not difficult to get it running on TensorFlow 2.x.

# Install the required packages
pip3 install python_speech_features --user
# Get the code
git clone https://github.com/igormq/ctc_tensorflow_example.git

If you change about three lines of ctc_tensorflow_example.py as shown below, it will run on TensorFlow 2.x.

patch


diff --git a/ctc_tensorflow_example.py b/ctc_tensorflow_example.py
index 579d431..2d96d54 100644
--- a/ctc_tensorflow_example.py
+++ b/ctc_tensorflow_example.py
@@ -5,7 +5,7 @@ from __future__ import print_function
 
 import time
 
-import tensorflow as tf
+import tensorflow.compat.v1 as tf
 import scipy.io.wavfile as wav
 import numpy as np
 
@@ -20,6 +20,8 @@ except ImportError:
 from utils import maybe_download as maybe_download
 from utils import sparse_tuple_from as sparse_tuple_from
 
+tf.disable_v2_behavior()
+
 # Constants
 SPACE_TOKEN = '<space>'
 SPACE_INDEX = 0
@@ -103,9 +105,9 @@ with graph.as_default():
     #   tf.nn.rnn_cell.GRUCell 
     cells = []
     for _ in range(num_layers):
-        cell = tf.contrib.rnn.LSTMCell(num_units)  # Or LSTMCell(num_units)
+        cell = tf.nn.rnn_cell.LSTMCell(num_units)  # Or LSTMCell(num_units)
         cells.append(cell)
-    stack = tf.contrib.rnn.MultiRNNCell(cells)
+    stack = tf.nn.rnn_cell.MultiRNNCell(cells)
 
     # The second output is the last state and we will no use that
     outputs, _ = tf.nn.dynamic_rnn(stack, inputs, seq_len, dtype=tf.float32)

Terminal


python3 ctc_tensorflow_example.py
Epoch 1/200, train_cost = 726.374, train_ler = 1.000, val_cost = 167.637, val_ler = 1.000, time = 0.549
(Omitted)
Epoch 200/200, train_cost = 0.648, train_ler = 0.000, val_cost = 0.642, val_ler = 0.000, time = 0.218
Original:
she had your dark suit in greasy wash water all year
Decoded:
she had your dark suit in greasy wash water all year

Convert to code for TensorFlow 2

Since we are on TensorFlow 2 anyway, writing the code in TensorFlow 2 style should improve processing efficiency (probably) and make later maintenance easier. So I set out to rewrite the sample code, but I couldn't find an example of how to write it...

I finally got it working by piecing together code from various places. The main reference sites are as follows.

  1. Effective TensorFlow 2 | TensorFlow Core
  2. TensorFlow 2.0 Alpha: Convert existing code to TensorFlow 2.0 – TensorFlow 2.x
  3. [TensorFlow 2.0 Main Changes - S-Analysis](http://data-analysis-stats.jp/2019/06/09/tensorflow-2-0-%E4%B8%BB%E3%81%AA%E5%A4%89%E6%9B%B4%E7%82%B9/)

Sample code (1)

ctc_tensorflow_example_tf2.py


#  Compatibility imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time

import tensorflow as tf
import scipy.io.wavfile as wav
import numpy as np

from six.moves import xrange as range

try:
    from python_speech_features import mfcc
except ImportError:
    print("Failed to import python_speech_features.\n Try pip install python_speech_features.")
    raise ImportError

from utils import maybe_download as maybe_download
from utils import sparse_tuple_from as sparse_tuple_from

# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1  # 0 is reserved to space

# Some configs
num_features = 13
num_units = 50 # Number of units in the LSTM cell
# Accounting the 0th indice +  space + blank label = 28 characters
num_classes = ord('z') - ord('a') + 1 + 1 + 1

# Hyper-parameters
num_epochs = 200
num_hidden = 50
num_layers = 1
batch_size = 1
initial_learning_rate = 1e-2
momentum = 0.9

num_examples = 1
num_batches_per_epoch = int(num_examples/batch_size)

# Loading the data

audio_filename = maybe_download('LDC93S1.wav', 93638)
target_filename = maybe_download('LDC93S1.txt', 62)

fs, audio = wav.read(audio_filename)

inputs = mfcc(audio, samplerate=fs)
# Transform in 3D array
train_inputs = np.asarray(inputs[np.newaxis, :], dtype=np.float32)
train_inputs = (train_inputs - np.mean(train_inputs))/np.std(train_inputs)

train_seq_len = [train_inputs.shape[1]]

# Reading targets
with open(target_filename, 'r') as f:

    #Only the last line is necessary
    line = f.readlines()[-1]

    # Get only the words between [a-z] and replace period for none
    original = ' '.join(line.strip().lower().split(' ')[2:]).replace('.', '')
    targets = original.replace(' ', '  ')
    targets = targets.split(' ')

# Adding blank label
targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])

# Transform char into index
targets = np.asarray([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
                      for x in targets])

train_targets = tf.sparse.SparseTensor(*sparse_tuple_from([targets], dtype=np.int32))

train_targets_len = [train_targets.shape[1]]

# We don't have a validation dataset :(
val_inputs, val_targets, val_seq_len, val_targets_len = train_inputs, train_targets, \
                                                        train_seq_len, train_targets_len


# THE MAIN CODE!

# Defining the cell
# Can be:
#   tf.keras.layers.SimpleRNNCell
#   tf.keras.layers.GRUCell
cells = []
for _ in range(num_layers):
    cell = tf.keras.layers.LSTMCell(num_units)
    cells.append(cell)
stack = tf.keras.layers.StackedRNNCells(cells)

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.RNN(stack, input_shape=(None, num_features), return_sequences=True))
# Truncated normal with mean 0 and stdev=0.1
# Zero initialization        
model.add(tf.keras.layers.Dense(num_classes,
                          kernel_initializer=tf.keras.initializers.TruncatedNormal(0.0, 0.1),
                          bias_initializer="zeros"))
optimizer = tf.keras.optimizers.SGD(initial_learning_rate, momentum)

@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
    if flag_training:
        with tf.GradientTape() as tape:
            logits = model(inputs, training=True)
            # Time major
            logits = tf.transpose(logits, (1, 0, 2))
            cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))

        gradients = tape.gradient(cost, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    else:
        logits = model(inputs)
        # Time major
        logits = tf.transpose(logits, (1, 0, 2))
        cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))

    # Option 2: tf.nn.ctc_beam_search_decoder
    # (it's slower but you'll get better results)
    decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)

    # Inaccuracy: label error rate
    ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
                                          targets))
    return cost, ler, decoded

for curr_epoch in range(num_epochs):
    train_cost = train_ler = 0
    start = time.time()

    for batch in range(num_batches_per_epoch):
        batch_cost, batch_ler, _ = step(train_inputs, train_targets, train_seq_len, train_targets_len, True)
        train_cost += batch_cost*batch_size
        train_ler += batch_ler*batch_size

    train_cost /= num_examples
    train_ler /= num_examples

    val_cost, val_ler, decoded = step(val_inputs, val_targets, val_seq_len, val_targets_len, False)
    log = "Epoch {}/{}, train_cost = {:.3f}, train_ler = {:.3f}, val_cost = {:.3f}, val_ler = {:.3f}, time = {:.3f}"
    print(log.format(curr_epoch+1, num_epochs, train_cost, train_ler,
                     val_cost, val_ler, time.time() - start))
# Decoding
d = tf.sparse.to_dense(decoded[0])[0].numpy()
str_decoded = ''.join([chr(x) for x in np.asarray(d) + FIRST_INDEX])
# Replacing blank label to none
str_decoded = str_decoded.replace(chr(ord('z') + 1), '')
# Replacing space label to space
str_decoded = str_decoded.replace(chr(ord('a') - 1), ' ')

print('Original:\n%s' % original)
print('Decoded:\n%s' % str_decoded)

Only the first epoch takes time (the @tf.function graph is traced there); after that, it seems to run about 30% faster than the 1.x version.

python3 ctc_tensorflow_example_tf2.py
Epoch 1/200, train_cost = 774.063, train_ler = 1.000, val_cost = 505.479, val_ler = 0.981, time = 1.547
Epoch 2/200, train_cost = 505.479, train_ler = 0.981, val_cost = 496.959, val_ler = 1.000, time = 0.158
(Omitted)
Epoch 200/200, train_cost = 0.541, train_ler = 0.000, val_cost = 0.537, val_ler = 0.000, time = 0.143
Original:
she had your dark suit in greasy wash water all year
Decoded:
she had your dark suit in greasy wash water all year

Explanation of changes

Removal of tf.Session and tf.placeholder

The original code was based on TensorFlow 1.x's tf.Session, so it does not work on TensorFlow 2.x (without the tf.compat.v1 API). tf.placeholder is also gone; instead, you simply write code that operates directly on the input Tensors.

Basically, as described in Effective TensorFlow 2, a combination of tf.Session and tf.placeholder is rewritten as follows.

# TensorFlow 1.X
outputs = session.run(f(placeholder), feed_dict={placeholder: input})
# TensorFlow 2.0
outputs = f(input)

At this point, add a @tf.function decorator so that f runs in graph mode [^1].

[^1]: It works without @tf.function, but it is slow because everything runs with eager execution. Eager execution is handy for debugging, so I think a good approach is to develop with @tf.function removed (commented out) and add it back once the code works.
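Concretely, the decorator is just tf.function applied to an ordinary Python function, so you can keep both modes around while debugging; a minimal sketch:

import tensorflow as tf

def step_eager(x):
    # Plain Python function: runs eagerly, easy to debug with print/breakpoints
    return x * 2

step_graph = tf.function(step_eager)  # the same function, traced into a graph

print(step_eager(tf.constant(3)))  # eager execution
print(step_graph(tf.constant(3)))  # graph execution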

So the original code

# TensorFlow 1.X
feed = {inputs: train_inputs,
        targets: train_targets,
        seq_len: train_seq_len}

batch_cost, _ = session.run([cost, optimizer], feed)
train_cost += batch_cost*batch_size
train_ler += session.run(ler, feed_dict=feed)*batch_size

is rewritten, in principle, as a function (decorated with @tf.function) that takes train_inputs, train_targets, and train_seq_len as arguments and returns cost and optimizer. However, optimizer only needs to be executed, so it does not have to be returned. Also, the same feed is passed to the session.run immediately afterwards to compute ler, and decoded is used in the decoding step after training finishes, so I return them all together. (decoded is only used after the last epoch, but it is computed internally for ler anyway, so returning it costs nothing extra (probably...).)

# TensorFlow 2.0
@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
(Omitted)
    return cost, ler, decoded

batch_cost, batch_ler, _ = step(train_inputs, train_targets, train_seq_len, train_targets_len, True)
train_cost += batch_cost*batch_size
train_ler += batch_ler*batch_size

To reuse most of the processing for validation, I named the function step and added an argument flag_training that switches between training and evaluation. The argument targets_len was also added, but that is because the signature of tf.nn.ctc_loss changed in TensorFlow 2.x; it is not directly related to the move to eager execution.
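For reference, here is a minimal sketch of the new-style tf.nn.ctc_loss call with toy tensors, mirroring the arguments used in step() above (the label IDs spell "she" in this article's encoding):

import tensorflow as tf

batch_size, max_time, num_classes = 1, 6, 28
# Time-major logits, as inside step()
logits = tf.random.normal((max_time, batch_size, num_classes))
# Sparse labels for "she" (s=19, h=8, e=5; no zeros, so nothing is dropped)
targets = tf.sparse.from_dense(tf.constant([[19, 8, 5]], dtype=tf.int32))

# TF1: tf.nn.ctc_loss(labels, inputs, sequence_length)
# TF2: label lengths and logit lengths are passed explicitly
loss = tf.nn.ctc_loss(targets, logits,
                      label_length=[3], logit_length=[max_time],
                      blank_index=-1)
print(loss.shape)  # (1,) -> one loss value per batch element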

tf.sparse_placeholder, which was used to feed variable-length target labels, is also gone. Previously, a tuple of (indices, values, shape) was prepared to feed tf.sparse_placeholder; now a tf.SparseTensor can be passed in directly from outside, so I construct the tf.SparseTensor myself. The dtype matches the type of the original tf.sparse_placeholder, but note that it is np.int32, not tf.int32 (a subtle gotcha).

# TensorFlow 1.X
train_targets = sparse_tuple_from([targets])

# TensorFlow 2.0
train_targets = tf.sparse.SparseTensor(*sparse_tuple_from([targets], dtype=np.int32))
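For reference, a minimal sketch of the (indices, values, shape) triple that a helper like sparse_tuple_from builds; the actual helper lives in the repo's utils.py, so this is only an illustration:

import numpy as np
import tensorflow as tf

# Two label sequences of different lengths
sequences = [np.array([1, 2, 3]), np.array([4, 5])]

# (indices, values, shape) built by hand
indices = np.array([(row, col) for row, seq in enumerate(sequences)
                    for col in range(len(seq))], dtype=np.int64)
values = np.concatenate(sequences).astype(np.int32)
shape = np.array([len(sequences), max(len(s) for s in sequences)], dtype=np.int64)

# The triple unpacks straight into tf.sparse.SparseTensor
st = tf.sparse.SparseTensor(indices, values, shape)
print(tf.sparse.to_dense(st).numpy())
# [[1 2 3]
#  [4 5 0]]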

Changes to the training part

In TensorFlow 2.x, the Optimizer has been replaced by Keras's. Accordingly, where Optimizer.minimize() was used before, the TensorFlow 2.x code uses tf.GradientTape instead. This processing lives in the step() function defined earlier.

##### TensorFlow 1.X #####
# Time major
logits = tf.transpose(logits, (1, 0, 2))

loss = tf.nn.ctc_loss(targets, logits, seq_len)
cost = tf.reduce_mean(loss)

optimizer = tf.train.MomentumOptimizer(initial_learning_rate,
                                           0.9).minimize(cost)

##### TensorFlow 2.0 #####
optimizer = tf.keras.optimizers.SGD(initial_learning_rate, 0.9)

@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
    if flag_training:
        with tf.GradientTape() as tape:
            logits = model(inputs, training=True)
            # Time major                                                                                                                              
            logits = tf.transpose(logits, (1, 0, 2))
            cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))

        gradients = tape.gradient(cost, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    else:
(Omitted below)

Here, the list of trainable weights has to be passed to tape.gradient, and collecting manually defined tf.Variable objects is tedious. So I changed the part that built the computation graph by hand to use tf.keras.Model; this makes it easy to obtain the list of trainable weights via model.trainable_variables, and it also simplifies building the computation graph itself.

##### TensorFlow 1.X #####
# The second output is the last state and we will no use that
outputs, _ = tf.nn.dynamic_rnn(stack, inputs, seq_len, dtype=tf.float32)

shape = tf.shape(inputs)
batch_s, max_timesteps = shape[0], shape[1]

# Reshaping to apply the same weights over the timesteps
outputs = tf.reshape(outputs, [-1, num_hidden])

# Truncated normal with mean 0 and stdev=0.1
# Tip: Try another initialization
# see https://www.tensorflow.org/versions/r0.9/api_docs/python/contrib.layers.html#initializers
W = tf.Variable(tf.truncated_normal([num_hidden,
                                     num_classes],
                                    stddev=0.1))
# Zero initialization
# Tip: Is tf.zeros_initializer the same?
b = tf.Variable(tf.constant(0., shape=[num_classes]))

# Doing the affine projection
logits = tf.matmul(outputs, W) + b

# Reshaping back to the original shape
logits = tf.reshape(logits, [batch_s, -1, num_classes])

##### TensorFlow 2.0 #####
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.RNN(stack, input_shape=(None, num_features), return_sequences=True))
# Truncated normal with mean 0 and stdev=0.1
# Zero initialization        
model.add(tf.keras.layers.Dense(num_classes,
                          kernel_initializer=tf.keras.initializers.TruncatedNormal(0.0, 0.1),
                          bias_initializer="zeros"))

When a tensor of rank 3 or higher (including the sample dimension) is fed into tf.keras.layers.Dense, the layer

- flattens all dimensions except the last,
- multiplies by the weight matrix from the right, and
- restores the original shape after the computation.

In the original code, the reshaping before and after the weight multiplication was written by hand, but this becomes very simple because all of it can be delegated to Keras.
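This behavior is easy to check with a toy tensor (a minimal sketch):

import tensorflow as tf

dense = tf.keras.layers.Dense(28)
x = tf.random.normal((2, 100, 13))  # (batch, time, features)
y = dense(x)
print(y.shape)  # (2, 100, 28): the same weights are applied at every timestep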

Other

I specified dtype to make the features float32. It works without specifying it, but a WARNING is emitted.

train_inputs = np.asarray(inputs[np.newaxis, :], dtype=np.float32)

Improvement: handling variable-length data

Sample code (1) above used only a single training example, but in practice we of course want to put multiple examples into a mini-batch for training. Both the input sequences and the target label sequences have different lengths, so they have to be handled with some care.

Sample code (2)

ctc_tensorflow_example_tf2_multi.py


#  Compatibility imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time

import tensorflow as tf
import scipy.io.wavfile as wav
import numpy as np

from six.moves import xrange as range

try:
    from python_speech_features import mfcc
except ImportError:
    print("Failed to import python_speech_features.\n Try pip install python_speech_features.")
    raise ImportError

from utils import maybe_download as maybe_download
from utils import sparse_tuple_from as sparse_tuple_from

# Constants
SPACE_TOKEN = '<space>'
SPACE_INDEX = 0
FIRST_INDEX = ord('a') - 1  # 0 is reserved to space
FEAT_MASK_VALUE = 1e+10

# Some configs
num_features = 13
num_units = 50 # Number of units in the LSTM cell
# Accounting the 0th indice +  space + blank label = 28 characters
num_classes = ord('z') - ord('a') + 1 + 1 + 1

# Hyper-parameters
num_epochs = 400
num_hidden = 50
num_layers = 1
batch_size = 2
initial_learning_rate = 1e-2
momentum = 0.9

# Loading the data

audio_filename = maybe_download('LDC93S1.wav', 93638)
target_filename = maybe_download('LDC93S1.txt', 62)

fs, audio = wav.read(audio_filename)

# create a dataset composed of data with variable lengths
inputs = mfcc(audio, samplerate=fs)
inputs = (inputs - np.mean(inputs))/np.std(inputs)
inputs_short = mfcc(audio[fs*8//10:fs*20//10], samplerate=fs)
inputs_short = (inputs_short - np.mean(inputs_short))/np.std(inputs_short)
# Transform in 3D array
train_inputs = tf.ragged.constant([inputs, inputs_short], dtype=np.float32)
train_seq_len = tf.cast(train_inputs.row_lengths(), tf.int32)
train_inputs = train_inputs.to_sparse()

num_examples = train_inputs.shape[0]

# Reading targets
with open(target_filename, 'r') as f:

    #Only the last line is necessary
    line = f.readlines()[-1]

    # Get only the words between [a-z] and replace period for none
    original = ' '.join(line.strip().lower().split(' ')[2:]).replace('.', '')
    targets = original.replace(' ', '  ')
    targets = targets.split(' ')

# Adding blank label
targets = np.hstack([SPACE_TOKEN if x == '' else list(x) for x in targets])

# Transform char into index
targets = np.asarray([SPACE_INDEX if x == SPACE_TOKEN else ord(x) - FIRST_INDEX
                      for x in targets])
# Creating sparse representation to feed the placeholder
train_targets = tf.ragged.constant([targets, targets[13:32]], dtype=np.int32) 
train_targets_len = tf.cast(train_targets.row_lengths(), tf.int32)
train_targets = train_targets.to_sparse() 

# We don't have a validation dataset :(
val_inputs, val_targets, val_seq_len, val_targets_len = train_inputs, train_targets, \
                                                        train_seq_len, train_targets_len


# THE MAIN CODE!

# Defining the cell
# Can be:
#   tf.keras.layers.SimpleRNNCell
#   tf.keras.layers.GRUCell
cells = []
for _ in range(num_layers):
    cell = tf.keras.layers.LSTMCell(num_units)
    cells.append(cell)
stack = tf.keras.layers.StackedRNNCells(cells)

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Masking(FEAT_MASK_VALUE, input_shape=(None, num_features)))
model.add(tf.keras.layers.RNN(stack, return_sequences=True))
# Truncated normal with mean 0 and stdev=0.1
# Zero initialization        
model.add(tf.keras.layers.Dense(num_classes,
                          kernel_initializer=tf.keras.initializers.TruncatedNormal(0.0, 0.1),
                          bias_initializer="zeros"))
optimizer = tf.keras.optimizers.SGD(initial_learning_rate, momentum)

@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
    inputs = tf.sparse.to_dense(inputs, default_value=FEAT_MASK_VALUE)
    if flag_training:
        with tf.GradientTape() as tape:
            logits = model(inputs, training=True)
            # Time major
            logits = tf.transpose(logits, (1, 0, 2))
            cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))

        gradients = tape.gradient(cost, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    else:
        logits = model(inputs)
        # Time major
        logits = tf.transpose(logits, (1, 0, 2))
        cost = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, targets_len, seq_len, blank_index=-1))

    # Option 2: tf.nn.ctc_beam_search_decoder
    # (it's slower but you'll get better results)
    decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)

    # Inaccuracy: label error rate
    ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
                                          targets))
    return cost, ler, decoded

ds = tf.data.Dataset.from_tensor_slices((train_inputs, train_targets, train_seq_len, train_targets_len)).batch(batch_size)
for curr_epoch in range(num_epochs):
    train_cost = train_ler = 0
    start = time.time()

    for batch_inputs, batch_targets, batch_seq_len, batch_targets_len in ds:
        batch_cost, batch_ler, _ = step(batch_inputs, batch_targets, batch_seq_len, batch_targets_len, True)
        train_cost += batch_cost*batch_size
        train_ler += batch_ler*batch_size

    train_cost /= num_examples
    train_ler /= num_examples

    val_cost, val_ler, decoded = step(val_inputs, val_targets, val_seq_len, val_targets_len, False)
    log = "Epoch {}/{}, train_cost = {:.3f}, train_ler = {:.3f}, val_cost = {:.3f}, val_ler = {:.3f}, time = {:.3f}"
    print(log.format(curr_epoch+1, num_epochs, train_cost, train_ler,
                     val_cost, val_ler, time.time() - start))
# Decoding
print('Original:')
print(original)
print(original[13:32])
print('Decoded:')
d = tf.sparse.to_dense(decoded[0], default_value=-1).numpy()
for i in range(2):
    str_decoded = ''.join([chr(x) for x in np.asarray(d[i][d[i] != -1]) + FIRST_INDEX])
    # Replacing blank label to none
    str_decoded = str_decoded.replace(chr(ord('z') + 1), '')
    # Replacing space label to space
    str_decoded = str_decoded.replace(chr(ord('a') - 1), ' ')
    print(str_decoded)

An example execution result is shown below.

Epoch 1/400, train_cost = 527.789, train_ler = 1.122, val_cost = 201.650, val_ler = 1.000, time = 1.702
Epoch 2/400, train_cost = 201.650, train_ler = 1.000, val_cost = 372.285, val_ler = 1.000, time = 0.238
(Omitted)
Epoch 400/400, train_cost = 1.331, train_ler = 0.000, val_cost = 1.320, val_ler = 0.000, time = 0.307
Original:
she had your dark suit in greasy wash water all year
dark suit in greasy
Decoded:
she had your dark suit in greasy wash water all year
dark suit in greasy

Commentary

Preparation of variable length data

# create a dataset composed of data with variable lengths
inputs = mfcc(audio, samplerate=fs)
inputs = (inputs - np.mean(inputs))/np.std(inputs)
inputs_short = mfcc(audio[fs*8//10:fs*20//10], samplerate=fs)
inputs_short = (inputs_short - np.mean(inputs_short))/np.std(inputs_short)
# Transform in 3D array
train_inputs = tf.ragged.constant([inputs, inputs_short], dtype=np.float32)
train_seq_len = tf.cast(train_inputs.row_lengths(), tf.int32)
train_inputs = train_inputs.to_sparse()

num_examples = train_inputs.shape[0]

I took a portion of the data used in the original code to grow the dataset to two examples. The data ultimately becomes a SparseTensor, but it is easier to build if you first create a RaggedTensor with tf.ragged.constant() and convert from there.
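As a toy illustration of this RaggedTensor-to-SparseTensor path (dummy arrays standing in for the MFCC features):

import numpy as np
import tensorflow as tf

# Two "feature sequences" of different lengths: 5 and 3 frames, 2 features each
long_seq = np.arange(10, dtype=np.float32).reshape(5, 2)
short_seq = np.arange(6, dtype=np.float32).reshape(3, 2)

rt = tf.ragged.constant([long_seq, short_seq], dtype=np.float32)
print(rt.row_lengths().numpy())  # [5 3] -> the per-example sequence lengths
st = rt.to_sparse()              # SparseTensor with dense_shape [2, 5, 2]
print(st.dense_shape.numpy())    # [2 5 2]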

Feeding variable-length data to the model

As mentioned in another article, I use a Masking layer to represent variable-length input: Try a basic RNN (LSTM) in Keras - Qiita

model.add(tf.keras.layers.Masking(FEAT_MASK_VALUE, input_shape=(None, num_features)))

Since the mini-batch is shaped to the maximum sequence length at input time, shorter examples have their missing part filled with FEAT_MASK_VALUE.

@tf.function
def step(inputs, targets, seq_len, targets_len, flag_training):
    inputs = tf.sparse.to_dense(inputs, default_value=FEAT_MASK_VALUE)
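You can verify that Masking really excludes the padded steps with a toy input (a sketch using the layer's standard compute_mask API):

import tensorflow as tf

FEAT_MASK_VALUE = 1e+10
masking = tf.keras.layers.Masking(mask_value=FEAT_MASK_VALUE)
# One example, three timesteps, one feature; the last step is padding
x = tf.constant([[[1.0], [2.0], [FEAT_MASK_VALUE]]])
print(masking.compute_mask(x).numpy())  # [[ True  True False]]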

That covers the input features; the same applies to the label side. targets[13:32] just extracts the labels corresponding to the clipped audio segment (a magic number...).

# Creating sparse representation to feed the placeholder
train_targets = tf.ragged.constant([targets, targets[13:32]], dtype=np.int32)
train_targets_len = tf.cast(train_targets.row_lengths(), tf.int32)
train_targets = train_targets.to_sparse()

For training, create a Dataset that bundles the required data and form mini-batches with batch(). The mini-batches can then be pulled out one by one in a for loop.

ds = tf.data.Dataset.from_tensor_slices((train_inputs, train_targets, train_seq_len, train_targets_len)).batch(batch_size)
for curr_epoch in range(num_epochs):
(Omitted)
    for batch_inputs, batch_targets, batch_seq_len, batch_targets_len in ds:
(Omitted)

In practice, I expect the training data would be written to TFRecord files in advance and the Dataset created from those. If you use tf.io.VarLenFeature to load each feature as a SparseTensor, the body of the current loop should work as-is (probably). [TensorFlow 2] It is recommended to read features from TFRecord in batch units - Qiita
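A rough sketch of that TFRecord idea; the feature keys and file name here are hypothetical:

import tensorflow as tf

feature_spec = {
    # Variable-length features come back as SparseTensor
    'inputs': tf.io.VarLenFeature(tf.float32),
    'targets': tf.io.VarLenFeature(tf.int64),
}

def parse(example_proto):
    return tf.io.parse_single_example(example_proto, feature_spec)

# ds = tf.data.TFRecordDataset('train.tfrecord').map(parse).batch(batch_size)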

Can't you do it all with Keras?

It's nice that everything now runs on TensorFlow 2.x, but since the model ended up as a Keras model anyway, let's consider whether the training part can also be run through the Keras API.

**This was the start of a road through hell... The conclusion up front: ~~it seems you should not try too hard.~~**

**(Added 2020/04/27) I found a way to make it work well with Keras. See the separate article for details:** [TensorFlow 2 / Keras] How to run learning with CTC Loss in Keras - Qiita

Sample code (3)

It is based on sample code (1), which trains on a single example.

ctc_tensorflow_example_tf2_keras.py


(Since it is the same as the TF2 version, the first half is omitted)

# Creating sparse representation to feed the placeholder
train_targets = tf.sparse.to_dense(tf.sparse.SparseTensor(*sparse_tuple_from([targets], dtype=np.int32)))

(Omitted)

def loss(y_true, y_pred):
    #print(y_true)  # Tensor("dense_target:0", shape=(None, None, None), dtype=float32) ???
    targets_len = train_targets_len[0]
    seq_len = train_seq_len[0]
    targets = tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32)
    # Time major
    logits = tf.transpose(y_pred, (1, 0, 2))
    return tf.reduce_mean(tf.nn.ctc_loss(targets, logits,
             tf.fill((tf.shape(targets)[0],), targets_len), tf.fill((tf.shape(logits)[1],), seq_len),
             blank_index=-1))

def metrics(y_true, y_pred):
    targets_len = train_targets_len[0]
    seq_len = train_seq_len[0]
    targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))
    # Time major
    logits = tf.transpose(y_pred, (1, 0, 2))

    # Option 2: tf.nn.ctc_beam_search_decoder
    # (it's slower but you'll get better results)
    decoded, _ = tf.nn.ctc_greedy_decoder(logits, train_seq_len)

    # Inaccuracy: label error rate
    ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
                                          targets))
    return ler

model.compile(loss=loss, optimizer=optimizer, metrics=[metrics])
for curr_epoch in range(num_epochs):
    train_cost = train_ler = 0
    start = time.time()
    train_cost, train_ler = model.train_on_batch(train_inputs, train_targets)
    val_cost, val_ler = model.test_on_batch(train_inputs, train_targets)
    log = "Epoch {}/{}, train_cost = {:.3f}, train_ler = {:.3f}, val_cost = {:.3f}, val_ler = {:.3f}, time = {:.3f}"
    print(log.format(curr_epoch+1, num_epochs, train_cost, train_ler,
                     val_cost, val_ler, time.time() - start))

decoded, _ = tf.nn.ctc_greedy_decoder(tf.transpose(model.predict(train_inputs), (1, 0, 2)), train_seq_len)
d = tf.sparse.to_dense(decoded[0])[0].numpy()
str_decoded = ''.join([chr(x) for x in np.asarray(d) + FIRST_INDEX])
# Replacing blank label to none
str_decoded = str_decoded.replace(chr(ord('z') + 1), '')
# Replacing space label to space
str_decoded = str_decoded.replace(chr(ord('a') - 1), ' ')

print('Original:\n%s' % original)
print('Decoded:\n%s' % str_decoded)

It looks as though it can be written this way. In reality, though, the behavior is quite suspicious...

Suspicious points

Handling of sparse labels

With a model built the normal Keras way, you cannot pass sparse labels to Model.fit() or Model.train_on_batch(). There was no way around it, so I converted them to ordinary dense Tensors.

Since the labels must be sparse when computing the label error rate,

targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))

I convert back to sparse here, but this drops the ID 0 symbol, which corresponds to the space (well, of course: a sparse matrix has no explicit zeros to begin with...). As a result, the error rate is computed against target sequences with the spaces removed, so it never reaches 0 (there are as many insertion errors as there are spaces). The quickest fix would be to change the ID scheme so that ID 0 is the blank symbol (≠ space). But the real thing to fix is converting sparse to dense and back again in the first place...
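The zero-dropping behavior is easy to reproduce (a minimal sketch; the IDs spell "she had" in this article's encoding):

import tensorflow as tf

dense = tf.constant([[19, 8, 5, 0, 8, 1, 4]], dtype=tf.int32)  # 0 = space
sparse = tf.sparse.from_dense(dense)
print(sparse.values.numpy())  # [19  8  5  8  1  4] -> the space (0) is gone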

Loss function behavior

When writing in Keras style, the loss function is specified in Model.compile(). You can also specify a callable of your own:

def loss(y_true, y_pred):

Since only these two arguments are available, this time I pull the length information from global variables. Up to that point, things are still fine.

def loss(y_true, y_pred):
    #print(y_true)  # Tensor("dense_target:0", shape=(None, None, None), dtype=float32) ???
(Omitted)
    targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))

y_true should be the data coming from the target labels (that is, train_targets and val_targets), shouldn't it? Those are supposed to be two-dimensional, (sample, time), yet for some reason y_true is a three-dimensional Tensor... Moreover, the original labels were created as int32, but for some reason y_true is float32...

So, without really knowing what arrives in y_true,

targets = tf.sparse.from_dense(tf.cast(tf.reshape(y_true, (-1, targets_len)), tf.int32))

I reshape it to two dimensions and cast its type. Thoroughly suspicious. And yet training seems to work correctly?

This may be Keras's specification (design philosophy?). The documentation for tf.keras.losses.Loss | TensorFlow Core v2.1.0 also says:

y_true: Ground truth values. shape = [batch_size, d0, .. dN]
y_pred: The predicted values. shape = [batch_size, d0, .. dN]

which reads as if y_true and y_pred are assumed to have the same shape. That is no problem for ordinary classification with cross-entropy loss and the like, but it falls apart as soon as the target labels and the predictions have different lengths, as with CTC loss.

...Come to think of it, sparse_categorical_crossentropy also takes y_true and y_pred of different shapes, right? How is that achieved?

- y_true: category index, shape (batch_size,)
- y_pred: score for each category, shape (batch_size, num_classes)

In other words, it should be possible to imitate that implementation. Looking at the implementation below, it also contains reshaping and type conversion, so the current approach may actually be in line with how Keras itself does it. (Still suspicious, though.) tensorflow/backend.py at v2.1.0 · tensorflow/tensorflow · GitHub
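The shape difference is easy to see with toy tensors (a minimal sketch):

import tensorflow as tf

y_true = tf.constant([1, 2])       # (batch_size,) category indices
y_pred = tf.random.normal((2, 3))  # (batch_size, num_classes) logits
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
print(loss.shape)  # (2,) -> one loss per example despite the differing shapes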

Execution speed is slow

Epoch 1/200, train_cost = 774.764, train_ler = 1.190, val_cost = 387.497, val_ler = 1.000, time = 2.212
Epoch 2/200, train_cost = 387.497, train_ler = 1.000, val_cost = 638.239, val_ler = 1.000, time = 0.459
(Omitted)
Epoch 200/200, train_cost = 3.549, train_ler = 0.238, val_cost = 3.481, val_ler = 0.238, time = 0.461
Original:
she had your dark suit in greasy wash water all year
Decoded:
she had your dark suit in greasy wash water all year

It takes roughly 3 times as long per epoch as before the rewrite to the Keras version (that is, the TensorFlow 2.x version)... [^2] Moreover, for the reasons described above, the train_ler and val_ler values are not output correctly.

[^2]: Since the amount of data is small, the training itself should not be the bottleneck; converting to Keras may simply add overhead that does not depend on the data size. Another possible cause is the labels being converted back and forth between dense and sparse.

**I tried my best to write the training part in Keras style, but I ended up with suspicious hacks and, for now, there is nothing to gain from it.** ~~It may be resolved by a future TensorFlow/Keras upgrade, but who knows.~~

Summary

- I explained how to train parameters with CTC loss in TensorFlow 2.x. It seems to work, for the time being.
- The result is a TensorFlow-style training loop with some Keras code mixed in; ~~writing it entirely in Keras is rather painful, so I don't recommend it.~~
- **(Added 2020/04/27) See the separate article for how to train in Keras style.**
