[PYTHON] Tips for handling variable length inputs in deep learning frameworks

Introduction

There are several recurring patterns for handling variable-length matrices in natural language processing. I feel like I reimplement them every time, so I am summarizing them here as a memo.

In this article, I will show the Chainer and TensorFlow implementations that I often use. (Note: I did not copy and paste production code; I reimplemented everything from scratch for this post, so it has not been tested.)

Things to keep in mind about Chainer

I think Chainer itself recommends managing variable-length data as a list of Variable rather than as a single Variable plus lengths, as is done below. Specifically, see L.NStepLSTM, [F.pad_sequence](https://docs.chainer.org/en/stable/reference/generated/chainer.functions.pad_sequence.html), and so on.
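
For reference, a minimal sketch of that list-of-Variable style might look like the following (this relies on my understanding of the L.NStepLSTM and F.pad_sequence APIs; the layer sizes and sequence lengths are arbitrary values for illustration):


import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

# Two sequences of different lengths, each as its own Variable
xs = [chainer.Variable(np.random.rand(4, 5).astype(np.float32)),
      chainer.Variable(np.random.rand(2, 5).astype(np.float32))]

# NStepLSTM consumes the list directly; no manual padding or masking is needed
lstm = L.NStepLSTM(n_layers=1, in_size=5, out_size=7, dropout=0.0)
hy, cy, ys = lstm(None, None, xs)  # ys is again a list of per-sequence Variables

# F.pad_sequence turns the list into a single padded (batch, max_len, features) Variable
padded = F.pad_sequence(xs, padding=0.0)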

Note

The code below is based on the assumption that the following imports have been made.

Chainer


import chainer
import chainer.functions as F
import numpy as np

Tensorflow


import tensorflow as tf
import numpy as np


sess = tf.InteractiveSession()

Main text

Padding

To take advantage of parallel computation on GPUs and CPUs, many deep learning frameworks do not directly support computation on variable-length matrices. Instead, padding is used: the positions beyond each sequence length are filled with an appropriate value so that every sequence matches the maximum length.

This step is often done by hand at data-preparation time rather than inside the deep learning framework.

X = [np.array([1, 2]),
     np.array([11, 12, 13, 14]),
     np.array([21])]

# int32, assuming the values are word IDs
x = np.zeros([3, 4], dtype=np.int32)

for i, xi in enumerate(X):
    x[i, :len(xi)] = xi[:]

print x
# [[ 1  2  0  0]
#  [11 12 13 14]
#  [21  0  0  0]]

When using Chainer's L.EmbedID, it is better to pad with -1 instead of 0 and use L.EmbedID(..., ignore_label=-1).
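
A small sketch of that pattern, reusing X from above (the vocabulary size 100 and embedding dimension 8 are made-up values, and this assumes ignore_label behaves as I remember, i.e. the -1 positions come out as zero vectors and receive no gradient):


# Pad with -1 instead of 0
x = np.full([3, 4], -1, dtype=np.int32)
for i, xi in enumerate(X):
    x[i, :len(xi)] = xi[:]

# ignore_label=-1 tells EmbedID to treat -1 as padding
embed = chainer.links.EmbedID(100, 8, ignore_label=-1)
e = embed(x)  # shape: (3, 4, 8)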

Masking

When doing sum pooling and similar operations, mask the padded positions beyond the sequence length with 0 (however, see "Don't put too much faith in masking" in the appendix). This can be done with a where operation.

mask.png (Where the mask is True the left-hand value is taken, and where it is False the right-hand value is taken, so this acts as masking.)

If you write this process step by step:

Chainer


x = chainer.Variable(np.arange(1, 7).reshape(2, 3))
print x
# variable([[1 2 3]
#           [4 5 6]])

length = np.array([3, 2], dtype=np.int32)
print length
# [3 2]

xp = chainer.cuda.get_array_module(x.data)
mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
print mask
# [[0 1 2]
#  [0 1 2]]

mask = mask < length.reshape(-1, 1)
print mask
# [[ True  True  True]
#  [ True  True False]]

padding = xp.zeros(x.shape, dtype=x.dtype)
print padding
# [[0 0 0]
#  [0 0 0]]

z = F.where(mask, x, padding)
print z
# variable([[1 2 3]
#           [4 5 0]])
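
For example, the sum pooling mentioned earlier can then be computed directly on the masked result. A small sketch continuing from the code above (z holds integers here, so I cast to float32 first; as far as I remember, F.sum expects a float input):


# Sum over the time axis; the padded positions contribute 0
pooled = F.sum(F.cast(z, np.float32), axis=1)
print pooled.data
# [ 6.  9.]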

In TensorFlow, tf.sequence_mask is convenient for this.

Tensorflow


x = tf.constant(np.arange(1, 7).reshape(2, 3).astype(np.float32))
length = tf.constant(np.array([3, 2], dtype=np.int32))

mask = tf.sequence_mask(length, tf.shape(x)[-1])
padding = tf.fill(tf.shape(x), 0.0)
z = tf.where(mask, x, padding)
print z.eval()
# [[ 1.  2.  3.]
#  [ 4.  5.  0.]]
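
The same sum pooling in TensorFlow, continuing from the code above:


# Sum over the time axis; the padded positions contribute 0
pooled = tf.reduce_sum(z, axis=1)
print pooled.eval()
# [ 6.  9.]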

A sequence_mask helper for the Chainer side (implemented with NumPy/CuPy rather than Chainer functions):

Chainer


def sequence_mask(length, max_num=None):
    # length: NumPy/CuPy array of sequence lengths
    xp = chainer.cuda.get_array_module(length)
    if max_num is None:
        max_num = xp.max(length)
    # positions 0 .. max_num-1, with (length.ndim + 1) dimensions for broadcasting
    perms = xp.arange(max_num).reshape([1] * length.ndim + [-1])
    # add a trailing axis to length so it broadcasts against perms
    length = length.reshape(list(length.shape) + [1])
    return perms < length
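
A quick check of this helper with the NumPy lengths used earlier:


length = np.array([3, 2], dtype=np.int32)
print sequence_mask(length, 4)
# [[ True  True  True False]
#  [ True  True False False]]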

Reshape

Since deep learning mostly deals with rank-2 matrices of shape mini-batch size x features, many frameworks provide a rich set of functions that take such matrices as input. To take advantage of these functions, a mini-batch size x sequence length x features tensor is reshaped into a rank-2 matrix of shape (mini-batch size * sequence length) x features before processing.

reshape_1.png
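
As a baseline, the plain flatten-and-restore version looks roughly like this (a sketch in Chainer; the matmul with an all-ones w is just a stand-in for an arbitrary per-position transformation):


x = chainer.Variable(np.arange(18).astype(np.float32).reshape(3, 3, 2))
w = chainer.Variable(np.ones([2, 3], dtype=np.float32))

# flatten (batch, length, features) -> (batch * length, features)
h = F.reshape(x, (-1, x.shape[-1]))
# apply the per-position transformation to every row, padded or not
h = F.matmul(h, w)
# restore (batch * length, out_features) -> (batch, length, out_features)
y = F.reshape(h, x.shape[:-1] + (h.shape[-1],))
print y.shape
# (3, 3, 3)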

However, when the valid entries are relatively sparse (i.e. there is a lot of padding), this wastes computation on the padded rows. You can cut down the work with some careful indexing. (I have not benchmarked this, but when the matrix is not sparse the extra memory reallocation may cost more, so be careful.)

reshape_2.png

In Chainer, such processing can be implemented as follows.

reshape_4.png

Chainer


# WARNING: I have not checked it in case of rank != 3

x = chainer.Variable(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = np.array([2, 3, 1], dtype=np.int32)
w = chainer.Variable(np.ones([2, 3], dtype=np.float32))

# sequence_mask is mentioned above
mask = sequence_mask(length, x.shape[length.ndim])
print mask
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]

x_reshaped = F.get_item(x, mask)
print x_reshaped
# [[  0.   1.]
#  [  2.   3.]
#  [  6.   7.]
#  [  8.   9.]
#  [ 10.  11.]
#  [ 12.  13.]]

y_reshaped = F.matmul(x_reshaped, w)
print y_reshaped
# [[  1.   1.   1.]
#  [  5.   5.   5.]
#  [ 13.  13.  13.]
#  [ 17.  17.  17.]
#  [ 21.  21.  21.]
#  [ 25.  25.  25.]]

pad_shape = [[0, 0] for _ in xrange(y_reshaped.ndim)]
pad_shape[length.ndim - 1][1] = 1
y_reshaped = F.pad(y_reshaped, pad_shape, 'constant', constant_values=0.)
print y_reshaped
# variable([[  1.,   1.,   1.],
#           [  5.,   5.,   5.],
#           [ 13.,  13.,  13.],
#           [ 17.,  17.,  17.],
#           [ 21.,  21.,  21.],
#           [ 25.,  25.,  25.],
#           [  0.,   0.,   0.]])


idx_size = np.prod(mask.shape)
inv_idx = np.ones([idx_size], dtype=np.int32) * -1
inv_idx[np.nonzero(mask.flat)[0]] = np.arange(x_reshaped.shape[0]).astype(np.int32)
print inv_idx
# [ 0  1 -1  2  3  4  5 -1 -1]

y = F.reshape(F.get_item(y_reshaped, inv_idx), list(x.shape[:length.ndim + 1]) + [-1])
print y
# [[[  1.   1.   1.]
#   [  5.   5.   5.]
#   [  0.   0.   0.]]
# 
#  [[ 13.  13.  13.]
#   [ 17.  17.  17.]
#   [ 21.  21.  21.]]
# 
#  [[ 25.  25.  25.]
#   [  0.   0.   0.]
#   [  0.   0.   0.]]]

In TensorFlow, the same processing can be implemented as follows.

reshape_3.png

Tensorflow


# WARNING: I have not checked it in case of rank != 3
x = tf.constant(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = tf.constant(np.array([2, 3, 1], dtype=np.int32))
w = tf.constant(np.ones([2, 3], dtype=np.float32))

mask = tf.sequence_mask(length, tf.shape(x)[tf.rank(length)])
print mask.eval()
# [[ True  True False]
#  [ True  True  True]
#  [ True False False]]

x_reshaped = tf.boolean_mask(x, mask)
print x_reshaped.eval()
# [[  0.   1.]
#  [  2.   3.]
#  [  6.   7.]
#  [  8.   9.]
#  [ 10.  11.]
#  [ 12.  13.]]

y_reshaped = tf.matmul(x_reshaped, w)
print y_reshaped.eval()
# [[  1.   1.   1.]
#  [  5.   5.   5.]
#  [ 13.  13.  13.]
#  [ 17.  17.  17.]
#  [ 21.  21.  21.]
#  [ 25.  25.  25.]]

idx = tf.to_int32(tf.where(mask))
print idx.eval()
# [[0 0]
#  [0 1]
#  [1 0]
#  [1 1]
#  [1 2]
#  [2 0]]

shape = tf.concat([tf.shape(x)[:-1], tf.shape(y_reshaped)[-1:]], 0)
print shape.eval()
# [3 3 3]

y = tf.scatter_nd(idx, y_reshaped, shape)
print y.eval()
# [[[  1.   1.   1.]
#   [  5.   5.   5.]
#   [  0.   0.   0.]]
# 
#  [[ 13.  13.  13.]
#   [ 17.  17.  17.]
#   [ 21.  21.  21.]]
# 
#  [[ 25.  25.  25.]
#   [  0.   0.   0.]
#   [  0.   0.   0.]]]

Implementation of Softmax

Consider applying a softmax along the last dimension of a given matrix. This situation comes up in ListNet's permutation probability distribution and in attention computations.

Softmax formula: $ y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $

x = np.random.random([2, 3]).astype(np.float32)
# array([[ 0.44715771,  0.85983515,  0.08915455],
#        [ 0.02465274,  0.63411605,  0.01340247]], dtype=float32)

length = np.array([3, 2], dtype=np.int32)

I want to calculate Softmax using only the blue area as shown in the figure below.

masked_softmax.png

Note that simply zeroing out values before or after the softmax does not work.

Chainer


# Bad example 1: zero out the input before the softmax
x_ = np.copy(x)
x_[1, 2] = 0.
print F.softmax(x_)
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.26211682,  0.48214924,  0.25573397]])

# Bad example 2: zero out the output after the softmax
y = F.softmax(x)
y.data[1, 2] = 0.
print y
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.26121548,  0.48049128,  0.0       ]])
# The second row obviously no longer sums to 1.0, so this is wrong

The reason is simple: in example 1, $ \exp(0) \neq 0 $, so the zeroed-out position still contributes. In example 2, x[1, 2] still affects the denominator.

In the softmax computation, masking is done by exploiting $ \exp(-\infty) = 0 $.

Chainer


def masked_softmax(x, length):
    """
    Softmax operation along the last dimension of x.

    Args:
         x (chainer.Variable): Values to be passed to softmax
         length (numpy.ndarray or cupy.ndarray):
             Number of valid items in the last dimension of x
    """
    assert x.ndim - 1 == length.ndim
    xp = chainer.cuda.get_array_module(x.data)
    x_shape = x.shape
    x = F.reshape(x, (-1, x_shape[-1]))
    # mask: (B, T)
    mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
    mask = mask < length.reshape(-1, 1)
    padding = xp.ones(x.shape, dtype=x.dtype) * -np.inf
    z = F.where(mask, x, padding)
    return F.reshape(F.softmax(z), x_shape)


print masked_softmax(chainer.Variable(x), length)
# variable([[ 0.31153342,  0.47068265,  0.21778394],
#           [ 0.35218161,  0.64781839,  0.        ]])

Tensorflow


def masked_softmax(x, length):
    """
    Softmax operation along the last dimension of x.

    Args:
         x (tf.Tensor): Values to be passed to softmax
         length (tf.Tensor): Number of valid items in the last dimension of x
    """
    mask = tf.sequence_mask(length, tf.shape(x)[-1])
    padding = tf.fill(tf.shape(x), -np.inf)
    z = tf.where(mask, x, padding)
    return tf.nn.softmax(z, dim=-1)


print masked_softmax(
    tf.constant(x),
    tf.constant(length)).eval()
# [[ 0.31153342,  0.47068265,  0.21778394],
#  [ 0.35218161,  0.64781839,  0.        ]]

Appendix:

Don't put too much faith in masking

In deep learning frameworks, when a division by zero or similar occurs, the gradient can become inf (or nan) even if you use where. So the reasoning "an unstable computation is fine as long as I mask it afterwards" does not hold.

Suppose there is a network described by the following formulas:

$ e = f_0(x), \quad w = f_1(e) $

This is expressed by the chain rule as follows. $ \frac{\partial w}{\partial x} = \frac{\partial w}{\partial e}\frac{\partial e}{\partial x} $

Roughly speaking, automatic differentiation implements this as follows.

x.grad = e.grad * g(f_0, e, x)

Here, g(f_0, e, x) is the local partial derivative computed from $ f_0 $ and its inputs and outputs. In other words, no matter what gradient e.grad flows down from the upper layers, if the local partial derivative of $ f_0 $ is inf or nan, then x.grad also becomes inf or nan. Trying this in Chainer and TensorFlow:

Tensorflow


sess = tf.InteractiveSession()

x = tf.constant(0.0)

t = x
e = 1. / x
w = tf.where(True, t, e)

print w.eval()  # 0.0
print tf.gradients(w, x)[0].eval()  # nan

Chainer


x = chainer.Variable(np.array([0.0], dtype=np.float32))
t = x
e = 1. / x
w = chainer.functions.where(np.array([True]), t, e)

w.grad = np.array([1.0], np.float32)
w.backward(retain_grad=True)

print w  # 0.
print x.grad  # nan
