There are several recurring patterns when handling variable-length matrices in natural language processing. Since I feel like I reimplement them every time, I am summarizing them here as a memorandum.
In this article, I collect the Chainer and TensorFlow implementations I use most often. (Note: I did not copy and paste production code; I reimplemented everything from scratch for this post, so it has not been tested.)
I think Chainer actually recommends managing variable-length data as a list of Variable, rather than with a Variable plus a length array as done below; see, for example, L.NStepLSTM and [F.pad_sequence](https://docs.chainer.org/en/stable/reference/generated/chainer.functions.pad_sequence.html).
The code below is based on the assumption that the following imports have been made.
Chainer
import chainer
import chainer.functions as F
import numpy as np
Tensorflow
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
Padding
Many deep learning frameworks do not directly support computation on variable-length matrices, because they want to exploit GPU/CPU parallelism. Instead, the part beyond each sequence length is padded with an appropriate value so that everything fits into a matrix of the maximum length.
This step is usually done by hand at data-creation time rather than by the deep learning framework itself.
X = [np.array([1, 2]),
     np.array([11, 12, 13, 14]),
     np.array([21])]
# use int32, assuming the values are word IDs
x = np.zeros([3, 4], dtype=np.int32)
for i, xi in enumerate(X):
    x[i, :len(xi)] = xi[:]
print x
# [[ 1 2 0 0]
# [11 12 13 14]
# [21 0 0 0]]
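For reference, the padding above can also be done with the F.pad_sequence linked earlier. A minimal sketch (using the list X and the imports above; the expected output mirrors the matrix x we just built):
Chainer
padded = F.pad_sequence([chainer.Variable(xi) for xi in X], padding=0)
print padded.data
# should match x above:
# [[ 1  2  0  0]
#  [11 12 13 14]
#  [21  0  0  0]]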
When using Chainer's L.EmbedID, it is better to pad with -1 instead of 0 and use L.EmbedID(..., ignore_label=-1).
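For illustration, here is a minimal sketch of the -1 padding pattern; the vocabulary size and embedding dimension are arbitrary values I picked for the example:
Chainer
import chainer.links as L

embed = L.EmbedID(100, 5, ignore_label=-1)  # 100-word vocabulary, 5-dim embeddings (arbitrary sizes)
x = np.full([3, 4], -1, dtype=np.int32)     # pad with -1 instead of 0
for i, xi in enumerate(X):
    x[i, :len(xi)] = xi[:]
e = embed(x)  # padded positions should come out as zero vectors and contribute no gradient
print e.shape  # (3, 4, 5)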
Masking
When doing sum pooling and the like, mask the part outside the sequence length created by the padding with 0 (but do not put too much trust in masking; see the Appendix). This can be done with a where operation: where the condition is True the first argument is taken, and where it is False the second argument is taken, which is exactly what a mask needs.
If you write this process step by step:
Chainer
x = chainer.Variable(np.arange(1, 7).reshape(2, 3))
print x
# variable([[1 2 3]
# [4 5 6]])
length = np.array([3, 2], dtype=np.int32)
print length
# [3 2]
xp = chainer.cuda.get_array_module(x.data)
mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
print mask
# [[0 1 2]
# [0 1 2]]
mask = mask < length.reshape(-1, 1)
print mask
# [[ True True True]
# [ True True False]]
padding = xp.zeros(x.shape, dtype=x.dtype)
print padding
# [[0 0 0]
# [0 0 0]]
z = F.where(mask, x, padding)
print z
# variable([[1 2 3]
# [4 5 0]])
In TensorFlow, tf.sequence_mask is convenient.
Tensorflow
x = tf.constant(np.arange(1, 7).reshape(2, 3).astype(np.float32))
length = tf.constant(np.array([3, 2], dtype=np.int32))
mask = tf.sequence_mask(length, tf.shape(x)[-1])
padding = tf.fill(tf.shape(x), 0.0)
z = tf.where(mask, x, padding)
print z.eval()
# [[ 1. 2. 3.]
# [ 4. 5. 0.]]
A Chainer version (or rather, a NumPy version) of sequence_mask:
Chainer
def sequence_mask(length, max_num=None):
    xp = chainer.cuda.get_array_module(length)
    if max_num is None:
        max_num = xp.max(length)
    # compare indices 0..max_num-1 against length on an extra trailing dimension
    perms = xp.arange(max_num).reshape([1] * length.ndim + [-1])
    length = length.reshape([1] * (length.ndim - 1) + [-1] + [1])
    return perms < length
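A quick sanity check of this helper (my own example, reusing the length array from the step-by-step Masking code above):
Chainer
length = np.array([3, 2], dtype=np.int32)
print sequence_mask(length)
# expected, matching the mask built step by step above:
# [[ True  True  True]
#  [ True  True False]]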
Reshape
Since deep learning mostly deals with rank-2 matrices of shape mini-batch size x features, many frameworks provide plenty of functions that take such matrices as input. To benefit from these functions, a mini-batch x sequence length x features tensor is usually reshaped into a rank-2 (mini-batch size * sequence length) x features matrix before processing.
However, when the padded tensor is sparse (that is, much of it is padding), this wastes computation on the padded rows. You can cut down that work by putting some effort into indexing and gathering only the valid rows. (I have not benchmarked it; if the tensor is not actually sparse, the extra memory reallocation may cost more than it saves, so be careful.)
In Chainer, this can be done as follows.
Chainer
# WARNING: I have not checked it in case of rank != 3
x = chainer.Variable(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = np.array([2, 3, 1], dtype=np.int32)
w = chainer.Variable(np.ones([2, 3], dtype=np.float32))
# sequence_mask is mentioned above
mask = sequence_mask(length, x.shape[length.ndim])
print mask
# [[ True True False]
# [ True True True]
# [ True False False]]
x_reshaped = F.get_item(x, mask)
print x_reshaped
# [[ 0. 1.]
# [ 2. 3.]
# [ 6. 7.]
# [ 8. 9.]
# [ 10. 11.]
# [ 12. 13.]]
y_reshaped = F.matmul(x_reshaped, w)
print y_reshaped
# [[ 1. 1. 1.]
# [ 5. 5. 5.]
# [ 13. 13. 13.]
# [ 17. 17. 17.]
# [ 21. 21. 21.]
# [ 25. 25. 25.]]
pad_shape = [[0, 0] for _ in xrange(y_reshaped.ndim)]
pad_shape[length.ndim - 1][1] = 1
y_reshaped = F.pad(y_reshaped, pad_shape, 'constant', constant_values=0.)
print y_reshaped
# variable([[ 1., 1., 1.],
# [ 5., 5., 5.],
# [ 13., 13., 13.],
# [ 17., 17., 17.],
# [ 21., 21., 21.],
# [ 25., 25., 25.],
# [ 0., 0., 0.]])
idx_size = np.prod(mask.shape)
inv_idx = np.ones([idx_size], dtype=np.int32) * -1
inv_idx[np.nonzero(mask.flat)[0]] = np.arange(x_reshaped.shape[0]).astype(np.int32)
print inv_idx
# [ 0 1 -1 2 3 4 5 -1 -1]
y = F.reshape(F.get_item(y_reshaped, inv_idx), list(x.shape[:length.ndim + 1]) + [-1])
print y
# [[[ 1. 1. 1.]
# [ 5. 5. 5.]
# [ 0. 0. 0.]]
#
# [[ 13. 13. 13.]
# [ 17. 17. 17.]
# [ 21. 21. 21.]]
#
# [[ 25. 25. 25.]
# [ 0. 0. 0.]
# [ 0. 0. 0.]]]
In TensorFlow, the same thing can be done as follows.
Tensorflow
# WARNING: I have not checked it in case of rank != 3
x = tf.constant(np.arange(18).astype(np.float32).reshape(3, 3, 2))
length = tf.constant(np.array([2, 3, 1], dtype=np.int32))
w = tf.constant(np.ones([2, 3], dtype=np.float32))
mask = tf.sequence_mask(length, tf.shape(x)[tf.rank(length)])
print mask.eval()
# [[ True True False]
# [ True True True]
# [ True False False]]
x_reshaped = tf.boolean_mask(x, mask)
print x_reshaped.eval()
# [[ 0. 1.]
# [ 2. 3.]
# [ 6. 7.]
# [ 8. 9.]
# [ 10. 11.]
# [ 12. 13.]]
y_reshaped = tf.matmul(x_reshaped, w)
print y_reshaped.eval()
# [[ 1. 1. 1.]
# [ 5. 5. 5.]
# [ 13. 13. 13.]
# [ 17. 17. 17.]
# [ 21. 21. 21.]
# [ 25. 25. 25.]]
idx = tf.to_int32(tf.where(mask))
print idx.eval()
# [[0 0]
# [0 1]
# [1 0]
# [1 1]
# [1 2]
# [2 0]]
shape = tf.concat([tf.shape(x)[:-1], tf.shape(y_reshaped)[-1:]], 0)
print shape.eval()
# [3 3 3]
y = tf.scatter_nd(idx, y_reshaped, shape)
print y.eval()
# [[[ 1. 1. 1.]
# [ 5. 5. 5.]
# [ 0. 0. 0.]]
#
# [[ 13. 13. 13.]
# [ 17. 17. 17.]
# [ 21. 21. 21.]]
#
# [[ 25. 25. 25.]
# [ 0. 0. 0.]
# [ 0. 0. 0.]]]
Softmax
Consider taking a softmax over the last dimension of a given matrix, restricted to each row's sequence length. Such situations come up in ListNet's permutation probability distribution and in attention computations.
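Concretely, writing $L_i$ for the valid length of row $i$ (notation mine), the masked softmax we want is

$$
\mathrm{softmax}(x_i)_j = \begin{cases} \dfrac{\exp(x_{ij})}{\sum_{k=1}^{L_i} \exp(x_{ik})} & (j \le L_i) \\ 0 & (j > L_i) \end{cases}
$$

Take the following example: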
x = np.random.random([2, 3]).astype(np.float32)
# array([[ 0.44715771, 0.85983515, 0.08915455],
# [ 0.02465274, 0.63411605, 0.01340247]], dtype=float32)
length = np.array([3, 2], dtype=np.int32)
I want to compute the softmax using only the unpadded region, that is, the first length[i] elements of each row.
Incidentally, simply applying the mask before or after the softmax does not work:
Chainer
# Bad example 1: mask the input before the softmax
x_ = np.copy(x)
x_[1, 2] = 0.
print F.softmax(x_)
# variable([[ 0.31153342, 0.47068265, 0.21778394],
# [ 0.26211682, 0.48214924, 0.25573397]])
# Bad example 2: mask the output after the softmax
y = F.softmax(x)
y.data[1, 2] = 0.
print y
# variable([[ 0.31153342, 0.47068265, 0.21778394],
# [ 0.26121548, 0.48049128, 0.0 ]])
# Clearly wrong: the second row does not sum to 1.0
The reason is simple: in example 1, $\exp(0) = 1 \neq 0$, so the padded position still receives probability mass (0.2557 in the output above); in example 2, x[1, 2] still contributes to the denominator.
For softmax, masking is instead done by exploiting $\exp(-\infty) = 0$.
Chainer
def masked_softmax(x, length):
    """
    Softmax operation on the outer-most dimension of x.
    Args:
        x (chainer.Variable): Values to be passed to softmax
        length (numpy.ndarray or cupy.ndarray):
            Number of items in the outer-most dimension of x
    """
    assert x.ndim - 1 == length.ndim
    xp = chainer.cuda.get_array_module(x.data)
    x_shape = x.shape
    x = F.reshape(x, (-1, x_shape[-1]))
    # mask: (B, T)
    mask = xp.tile(xp.arange(x.shape[-1]).reshape(1, -1), (x.shape[0], 1))
    mask = mask < length.reshape(-1, 1)
    padding = xp.ones(x.shape, dtype=x.dtype) * -np.inf
    z = F.where(mask, x, padding)
    return F.reshape(F.softmax(z), x_shape)
print masked_softmax(chainer.Variable(x), length)
# variable([[ 0.31153342, 0.47068265, 0.21778394],
# [ 0.35218161, 0.64781839, 0. ]])
Tensorflow
def masked_softmax(x, length):
    """
    Softmax operation on the outer-most dimension of x.
    Args:
        x (tf.Tensor): Values to be passed to softmax
        length (tf.Tensor): Number of items in the outer-most dimension of x
    """
    mask = tf.sequence_mask(length, tf.shape(x)[-1])
    padding = tf.fill(tf.shape(x), -np.inf)
    z = tf.where(mask, x, padding)
    return tf.nn.softmax(z, dim=-1)
print masked_softmax(
tf.constant(x),
tf.constant(length)).eval()
# [[ 0.31153342, 0.47068265, 0.21778394],
# [ 0.35218161, 0.64781839, 0. ]]
Appendix: do not put too much trust in masking
In deep learning frameworks, when something like a division by zero occurs, the gradient can become inf or nan even if you use where. So the reasoning "I can do an unstable calculation as long as I mask it" does not hold.
Suppose we have a network given by

$$
e = f_0(x) \\
w = f_1(e)
$$

By the chain rule,

$$
\frac{\partial w}{\partial x} = \frac{\partial w}{\partial e} \frac{\partial e}{\partial x}
$$

and automatic differentiation realizes this (roughly) as follows:
x.grad = e.grad * g(f_0, e, x)
Here, g(f_0, e, x) is the partial derivative determined by $f_0$ and its input/output. In other words, no matter what gradient e.grad comes down from the layer above, if the partial derivative of $f_0$ is inf or nan, then x.grad also becomes inf or nan. Trying this in Chainer and TensorFlow:
Tensorflow
sess = tf.InteractiveSession()
x = tf.constant(0.0)
t = x
e = 1. / x
w = tf.where(True, t, e)
print w.eval() # 0.0
print tf.gradients(w, x)[0].eval() # nan
Chainer
x = chainer.Variable(np.array([0.0], dtype=np.float32))
t = x
e = 1. / x
w = chainer.functions.where(np.array([True]), t, e)
w.grad = np.array([1.0], np.float32)
w.backward(retain_grad=True)
print w # 0.
print x.grad # nan
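One workaround (my own sketch, not part of the original memo) is to make the unselected branch itself safe before the unstable operation, so that neither branch produces inf or nan during backprop. In TensorFlow, reusing the session above:
Tensorflow
x = tf.constant(0.0)
t = x
safe_x = tf.where(tf.equal(x, 0.0), tf.ones_like(x), x)  # swap the 0 for a harmless dummy value
e = 1. / safe_x                                          # no division by zero on either branch
w = tf.where(tf.equal(x, 0.0), t, e)
print w.eval()                        # 0.0
print tf.gradients(w, x)[0].eval()    # 1.0 -- finite this time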