Introduction

NStep LSTM was added in Chainer 1.16.0. As the name implies, NStep LSTM makes it easy to build a multi-layer LSTM. Internally it uses the RNN implementation optimized in cuDNN, so it runs faster than the conventional LSTM link. Furthermore, with NStep LSTM, **it is no longer necessary to equalize the lengths of the sequences in a mini-batch**; you can pass each sample as an element of a list as it is. You no longer need to pad with -1 and use ignore_label=-1 together with where, or to sort the samples by length and transpose them before feeding them in.

So this time I trained a sequence labeling model using NStep LSTM.
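To make the contrast concrete, here is a minimal pure-Python sketch of the two preprocessing styles that NStep LSTM makes unnecessary: padding with -1, and sorting by length followed by transposing. The helper names `pad_batch` and `time_major_sorted` are hypothetical (they are not part of Chainer), and the token IDs are made up.

```python
def pad_batch(seqs, pad=-1):
    """Pad every sequence to the longest length (the older padding approach)."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad] * (max_len - len(s)) for s in seqs]


def time_major_sorted(seqs):
    """Sort by length (longest first) and transpose into per-time-step batches."""
    ordered = sorted(seqs, key=len, reverse=True)
    steps = []
    for t in range(len(ordered[0])):
        # Only the sequences that are still "alive" at step t contribute.
        steps.append([s[t] for s in ordered if len(s) > t])
    return steps


# Three samples of different lengths (made-up token IDs).
batch = [[3, 1, 4], [1, 5], [9, 2, 6, 5]]

padded = pad_batch(batch)         # rectangular batch, short samples filled with -1
steps = time_major_sorted(batch)  # one shrinking batch per time step

# With NStep LSTM, `batch` itself (one array per sample) is the input.
```

Both helpers exist only to force variable-length data into a shape that a step-by-step LSTM can consume; passing the list directly removes that bookkeeping.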

Differences in interface with conventional LSTM

Since the inputs and outputs of NStep LSTM differ from those of the conventional LSTM, you cannot simply swap NStep LSTM into a model you have already implemented.

The inputs and outputs of \_\_init\_\_() and \_\_call\_\_() of NStepLSTM are as follows.

NStepLSTM.__init__(n_layers, in_size, out_size, dropout, use_cudnn=True)
"""
n_layers (int): Number of layers.
in_size (int): Dimensionality of input vectors.
out_size (int): Dimensionality of hidden states and output vectors.
dropout (float): Dropout ratio.
use_cudnn (bool): Use cuDNN.
"""

...

NStepLSTM.__call__(hx, cx, xs, train=True)
"""
hx (~chainer.Variable): Initial hidden states.
cx (~chainer.Variable): Initial cell states.
xs (list of ~chainer.Variable): List of input sequences.
        Each element ``xs[i]`` is a :class:`chainer.Variable` holding a sequence.
"""
    ...

    return hy, cy, ys

On the other hand, the conventional LSTM was as follows.

LSTM.__init__(in_size, out_size, **kwargs)
"""
in_size (int) – Dimension of input vectors. If None, parameter initialization will be deferred until the first forward data pass at which time the size will be determined.
out_size (int) – Dimensionality of output vectors.
lateral_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
        It is used for initialization of the lateral connections.
        May be None to use default initialization.
upward_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
        It is used for initialization of the upward connections.
        May be None to use default initialization.
bias_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
        It is used for initialization of the biases of cell input, input gate and output gate, and gates of the upward connection.
        May be a scalar; in that case, the bias is initialized by this value.
        May be None to use default initialization.
forget_bias_init – A callable that takes numpy.ndarray or cupy.ndarray and edits its value.
        It is used for initialization of the biases of the forget gate of the upward connection.
        May be a scalar; in that case, the bias is initialized by this value.
        May be None to use default initialization.
"""

...

LSTM.__call__(x)
"""
x (~chainer.Variable): A new batch from the input sequence.
"""
    ...

    return y

Therefore, NStep LSTM is handled differently from LSTM in the following points.

- The number of layers and the dropout ratio are specified in \_\_init\_\_().
- \_\_call\_\_() must be given the **initial hidden states** and the **initial cell states**.
- The input to \_\_call\_\_() is not a chainer.Variable but a **list** of chainer.Variable.
- \_\_call\_\_() returns the hidden states and cell states after the forward computation over the sequences, together with a **list** of outputs (chainer.Variable).

The big differences are that \_\_call\_\_() must be given the initial hidden states and cell states, and that its input and output are lists.

Make NStep LSTM easier to handle

Implement a subclass so that initializing and calling NStep LSTM is as close to the conventional LSTM as possible.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from chainer import Variable
import chainer.links as L
import numpy as np


class LSTM(L.NStepLSTM):

    def __init__(self, in_size, out_size, dropout=0.5, use_cudnn=True):
        n_layers = 1
        super(LSTM, self).__init__(n_layers, in_size, out_size, dropout, use_cudnn)
        self.state_size = out_size
        self.reset_state()

    def to_cpu(self):
        super(LSTM, self).to_cpu()
        if self.cx is not None:
            self.cx.to_cpu()
        if self.hx is not None:
            self.hx.to_cpu()

    def to_gpu(self, device=None):
        super(LSTM, self).to_gpu(device)
        if self.cx is not None:
            self.cx.to_gpu(device)
        if self.hx is not None:
            self.hx.to_gpu(device)

    def set_state(self, cx, hx):
        assert isinstance(cx, Variable)
        assert isinstance(hx, Variable)
        cx_ = cx
        hx_ = hx
        if self.xp == np:
            cx_.to_cpu()
            hx_.to_cpu()
        else:
            cx_.to_gpu()
            hx_.to_gpu()
        self.cx = cx_
        self.hx = hx_

    def reset_state(self):
        self.cx = self.hx = None

    def __call__(self, xs, train=True):
        batch = len(xs)
        if self.hx is None:
            xp = self.xp
            self.hx = Variable(
                xp.zeros((self.n_layers, batch, self.state_size), dtype=xs[0].dtype),
                volatile='auto')
        if self.cx is None:
            xp = self.xp
            self.cx = Variable(
                xp.zeros((self.n_layers, batch, self.state_size), dtype=xs[0].dtype),
                volatile='auto')

        hy, cy, ys = super(LSTM, self).__call__(self.hx, self.cx, xs, train)
        self.hx, self.cx = hy, cy
        return ys

In the class above, \_\_init\_\_() takes only in_size and out_size as before (dropout defaults to 0.5, and n_layers is fixed to 1, so the LSTM is not stacked). \_\_call\_\_() initializes cx and hx automatically, and both its input and its output are just a list of chainer.Variable.
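The wrapper pattern itself (keep the states on the object, create them lazily on the first call, and carry them over to the next call) can be shown without Chainer. The following is only a toy sketch: `ToyStatelessRNN` and `ToyStatefulRNN` are hypothetical stand-ins that mimic the calling conventions described above with plain Python floats, and the arithmetic in the recurrence is made up and has nothing to do with a real LSTM.

```python
class ToyStatelessRNN:
    """Stand-in for the NStepLSTM calling convention: (hx, cx, xs) -> (hy, cy, ys)."""

    def __call__(self, hx, cx, xs):
        hy, cy, ys = [], [], []
        for h, c, seq in zip(hx, cx, xs):
            out = []
            for x in seq:
                c = c + x    # toy "cell state" update (not a real LSTM)
                h = 0.5 * c  # toy "hidden state" update
                out.append(h)
            hy.append(h)
            cy.append(c)
            ys.append(out)   # one output list per sample, lengths may differ
        return hy, cy, ys


class ToyStatefulRNN(ToyStatelessRNN):
    """Keeps hx and cx internally so that callers pass only the list of inputs."""

    def __init__(self):
        self.reset_state()

    def reset_state(self):
        self.hx = self.cx = None

    def __call__(self, xs):
        batch = len(xs)
        if self.hx is None:  # lazily create zero states on the first call
            self.hx = [0.0] * batch
        if self.cx is None:
            self.cx = [0.0] * batch
        hy, cy, ys = super().__call__(self.hx, self.cx, xs)
        self.hx, self.cx = hy, cy  # carried over until reset_state() is called
        return ys


rnn = ToyStatefulRNN()
first = rnn([[1, 2, 3], [4]])  # states start from zero
second = rnn([[1], [1]])       # continues from the states left by the first call
```

One consequence of this design is that the batch size must stay the same between calls unless reset_state() is called in between, because the stored states have one entry per sample.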

Implementing Bi-directional LSTM with NStep LSTM

Implement a bi-directional LSTM using NStep LSTM. Build the input list for the backward LSTM by reversing each sample in the list of chainer.Variable that is passed to the forward LSTM. After computing the outputs of the forward and backward LSTMs, **reverse each backward output again so that its time steps line up with the forward output**, then concatenate the two into one vector per time step. In the class below, a linear layer is added so that the output size equals the number of labels for sequence labeling.

from chainer import Chain
import chainer.functions as F


class BLSTMBase(Chain):

    def __init__(self, embeddings, n_labels, dropout=0.5, train=True):
        vocab_size, embed_size = embeddings.shape
        feature_size = embed_size
        super(BLSTMBase, self).__init__(
            embed=L.EmbedID(
                in_size=vocab_size,
                out_size=embed_size,
                initialW=embeddings,
            ),
            f_lstm=LSTM(feature_size, feature_size, dropout),
            b_lstm=LSTM(feature_size, feature_size, dropout),
            linear=L.Linear(feature_size * 2, n_labels),
        )
        self._dropout = dropout
        self._n_labels = n_labels
        self.train = train

    def reset_state(self):
        self.f_lstm.reset_state()
        self.b_lstm.reset_state()

    def __call__(self, xs):
        self.reset_state()
        xs_f = []
        xs_b = []
        for x in xs:
            _x = self.embed(self.xp.array(x))
            xs_f.append(_x)
            xs_b.append(_x[::-1])
        hs_f = self.f_lstm(xs_f, self.train)
        hs_b = self.b_lstm(xs_b, self.train)
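        # Reverse the backward outputs so that both directions are aligned per
        # time step, then concatenate, apply dropout, and project to label scores.
        ys = [self.linear(F.dropout(F.concat([h_f, h_b[::-1]]),
                                    ratio=self._dropout, train=self.train))
              for h_f, h_b in zip(hs_f, hs_b)]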
        return ys
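The reverse-then-realign step in \_\_call\_\_() above is easy to get wrong, so here it is in isolation. This is a toy sketch in plain Python: `running_sum` is a hypothetical stand-in for a recurrent layer (its output at step t depends only on steps 0..t), not an LSTM, and `bidirectional` is likewise a made-up helper name.

```python
def running_sum(seq):
    """Toy recurrent layer: the output at step t summarizes inputs 0..t."""
    total, out = 0, []
    for x in seq:
        total += x
        out.append(total)
    return out


def bidirectional(xs):
    """Run the toy layer in both directions and pair the outputs per time step."""
    ys = []
    for seq in xs:
        fwd = running_sum(seq)
        # Feed the reversed input, then reverse the output again so that
        # position t of `bwd` summarizes inputs t..end of the original order.
        bwd = running_sum(seq[::-1])[::-1]
        ys.append(list(zip(fwd, bwd)))
    return ys


pairs = bidirectional([[1, 2, 3], [10, 20]])
# pairs[0] == [(1, 6), (3, 5), (6, 3)]
# pairs[1] == [(10, 30), (30, 20)]
```

Without the second reversal, the pair at position t would combine the forward summary of steps 0..t with the backward summary of the wrong position, and each sample would be misaligned in a different way because the lengths differ.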

Learn sequence labeling using Bi-directional LSTM

Let's apply the model implemented above to an actual sequence labeling task. I chose Chinese word segmentation, a sequence labeling problem for which bi-directional LSTMs are often used. Unlike English, Chinese does not separate words with spaces, so word boundaries must be identified before the text can be processed.

Example)

冬天 (winter), 能 (can) 穿 (wear) 多少 (amount) 穿 (wear) 多少 (amount); 夏天 (summer), 能 (can) 穿 (wear) 多 (more) 少 (little) 穿 (wear) 多 (more) 少 (little).

[Chen+, 2015]

The two clauses above differ in meaning depending on whether 多少 is treated as one word ("amount") or split into 多 ("more") and 少 ("little"). Since the character strings are almost identical, the segmentation has to be decided from the context of the surrounding words.

To learn Chinese word segmentation as a sequence labeling problem, each character is given one of four labels: B (Begin, the first character of a word of two or more characters), M (Middle, an inner character of a word of two or more characters), E (End, the last character of a word of two or more characters), and S (Single, a one-character word). Using text annotated with these labels, the model learns the label of each character from the context of the character sequence.
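The labeling scheme can be written down in a few lines. The following is a minimal sketch that converts an already segmented sentence into B/M/E/S labels; `words_to_tags` is a hypothetical helper name and is not taken from the repository.

```python
def words_to_tags(words):
    """Convert a list of words into one B/M/E/S label per character."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")
        else:
            # First character B, last character E, everything between M.
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# The same four characters with two different segmentations.
as_one_word = words_to_tags(["能", "穿", "多少"])       # ['S', 'S', 'B', 'E']
as_two_words = words_to_tags(["能", "穿", "多", "少"])  # ['S', 'S', 'S', 'S']
```

The two calls show why this is a context-dependent labeling problem: the input characters are identical, and only the labels of the last two characters differ.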

Experiment

Target dataset

PKU (Peking University corpus, standard dataset for benchmarking Chinese Word Segmentation)

Environment

Python 3.5.2
Chainer 1.18.0
Ubuntu 14.04.5 LTS + GPU

Model & Experiment Settings

[Figure: model architecture from [Yao+, 2016]]
* Nearly identical to the model in the figure above [^1]

Bi-directional LSTM
epoch: 10
dropout ratio: 0.5
AdaGrad - learning rate: 0.2
weight decay: 10^-4
Mini-batch size: 20
Word embeddings pretraining: Chinese Wikipedia corpus, 100 dim

Experimental result

Learning process and results of this model

The learning process is described below as it is.

hiroki-t:/private/work/blstm-cws$ python app/train.py --save -e 10 --gpu 0
2016-12-03 09:34:06.27 JST      13a653  [info]  LOG Start with ACCESSID=[13a653] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 09:34:06.026907 JST]
2016-12-03 09:34:06.27 JST      13a653  [info]  *** [START] ***
2016-12-03 09:34:06.27 JST      13a653  [info]  initialize preprocessor with /private/work/blstm-cws/app/../data/zhwiki-embeddings-100.txt
2016-12-03 09:34:06.526 JST     13a653  [info]  load train dataset from /private/work/blstm-cws/app/../data/icwb2-data/training/pku_training.utf8
2016-12-03 09:34:14.134 JST     13a653  [info]  load test dataset from /private/work/blstm-cws/app/../data/icwb2-data/gold/pku_test_gold.utf8
2016-12-03 09:34:14.589 JST     13a653  [trace]
2016-12-03 09:34:14.589 JST     13a653  [trace] initialize ...
2016-12-03 09:34:14.589 JST     13a653  [trace] --------------------------------
2016-12-03 09:34:14.589 JST     13a653  [info]  # Minibatch-size: 20
2016-12-03 09:34:14.589 JST     13a653  [info]  # epoch: 10
2016-12-03 09:34:14.589 JST     13a653  [info]  # gpu: 0
2016-12-03 09:34:14.589 JST     13a653  [info]  # hyper-parameters: {'adagrad_lr': 0.2, 'dropout_ratio': 0.2, 'weight_decay': 0.0001}
2016-12-03 09:34:14.590 JST     13a653  [trace] --------------------------------
2016-12-03 09:34:14.590 JST     13a653  [trace]
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:50 Time: 0:07:50
2016-12-03 09:42:05.642 JST     13a653  [info]  [training] epoch 1 - #samples: 19054, loss: 9.640346, accuracy: 0.834476
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:29 Time: 0:00:29
2016-12-03 09:42:34.865 JST     13a653  [info]  [evaluation] epoch 1 - #samples: 1944, loss: 6.919845, accuracy: 0.890557
2016-12-03 09:42:34.866 JST     13a653  [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:40 Time: 0:07:40
2016-12-03 09:50:15.258 JST     13a653  [info]  [training] epoch 2 - #samples: 19054, loss: 5.526157, accuracy: 0.903373
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:24 Time: 0:00:24
2016-12-03 09:50:39.400 JST     13a653  [info]  [evaluation] epoch 2 - #samples: 1944, loss: 6.233129, accuracy: 0.900318
2016-12-03 09:50:39.401 JST     13a653  [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:08:41 Time: 0:08:41
2016-12-03 09:59:21.301 JST     13a653  [info]  [training] epoch 3 - #samples: 19054, loss: 4.217260, accuracy: 0.921377
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:24 Time: 0:00:24
2016-12-03 09:59:45.587 JST     13a653  [info]  [evaluation] epoch 3 - #samples: 1944, loss: 5.650668, accuracy: 0.913843
2016-12-03 09:59:45.587 JST     13a653  [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:25 Time: 0:07:25
2016-12-03 10:07:11.451 JST     13a653  [info]  [training] epoch 4 - #samples: 19054, loss: 3.488712, accuracy: 0.931668
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:26 Time: 0:00:26
2016-12-03 10:07:37.889 JST     13a653  [info]  [evaluation] epoch 4 - #samples: 1944, loss: 5.342249, accuracy: 0.917103
2016-12-03 10:07:37.890 JST     13a653  [trace] -
100% (19054 of 19054) |#######################################| Elapsed Time: 0:07:26 Time: 0:07:26
2016-12-03 10:15:03.919 JST     13a653  [info]  [training] epoch 5 - #samples: 19054, loss: 2.995683, accuracy: 0.938305
100% (1944 of 1944) |#########################################| Elapsed Time: 0:00:15 Time: 0:00:15
2016-12-03 10:15:19.749 JST     13a653  [info]  [evaluation] epoch 5 - #samples: 1944, loss: 5.320374, accuracy: 0.921863
2016-12-03 10:15:19.750 JST     13a653  [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:07:29 Time: 0:07:29
2016-12-03 10:22:49.393 JST     13a653  [info]  [training] epoch 6 - #samples: 19054, loss: 2.680496, accuracy: 0.943861
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:27 Time: 0:00:27
2016-12-03 10:23:16.985 JST     13a653  [info]  [evaluation] epoch 6 - #samples: 1944, loss: 5.326864, accuracy: 0.924161
2016-12-03 10:23:16.986 JST     13a653  [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:07:28 Time: 0:07:28
2016-12-03 10:30:45.772 JST     13a653  [info]  [training] epoch 7 - #samples: 19054, loss: 2.425466, accuracy: 0.947673
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:22 Time: 0:00:22
2016-12-03 10:31:08.448 JST     13a653  [info]  [evaluation] epoch 7 - #samples: 1944, loss: 5.270019, accuracy: 0.925341
2016-12-03 10:31:08.449 JST     13a653  [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:08:39 Time: 0:08:39
2016-12-03 10:39:47.461 JST     13a653  [info]  [training] epoch 8 - #samples: 19054, loss: 2.233068, accuracy: 0.950928
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:26 Time: 0:00:26
2016-12-03 10:40:14.2 JST       13a653  [info]  [evaluation] epoch 8 - #samples: 1944, loss: 5.792994, accuracy: 0.924707
2016-12-03 10:40:14.2 JST       13a653  [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:07:10 Time: 0:07:10
2016-12-03 10:47:24.806 JST     13a653  [info]  [training] epoch 9 - #samples: 19054, loss: 2.066807, accuracy: 0.953524
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:26 Time: 0:00:26
2016-12-03 10:47:51.745 JST     13a653  [info]  [evaluation] epoch 9 - #samples: 1944, loss: 5.864374, accuracy: 0.925294
2016-12-03 10:47:51.746 JST     13a653  [trace] -
100% (19054 of 19054) |########################################| Elapsed Time: 0:08:43 Time: 0:08:43
2016-12-03 10:56:34.758 JST     13a653  [info]  [training] epoch 10 - #samples: 19054, loss: 1.946193, accuracy: 0.955782
100% (1944 of 1944) |##########################################| Elapsed Time: 0:00:22 Time: 0:00:22
2016-12-03 10:56:57.641 JST     13a653  [info]  [evaluation] epoch 10 - #samples: 1944, loss: 5.284819, accuracy: 0.930201
2016-12-03 10:56:57.642 JST     13a653  [trace] -
2016-12-03 10:56:57.642 JST     13a653  [info]  saving the model to /private/work/blstm-cws/app/../output/cws.model ...
2016-12-03 10:56:58.520 JST     13a653  [info]  *** [DONE] ***
2016-12-03 10:56:58.521 JST     13a653  [info]  LOG End with ACCESSID=[13a653] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 09:34:06.026907 JST] PROCESSTIME=[4972.494370000]

Note that the reported value is accuracy, not precision, recall, or F-score; it reaches 93.0% at the 10th epoch. Training for 10 epochs took a little over 80 minutes.
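The accuracy in the log is counted per character. For comparison, word-level precision, recall, and F-score could be computed from B/M/E/S sequences roughly as follows. This is only a sketch under my own assumptions: `tags_to_spans` and `evaluate` are hypothetical helpers (not the evaluation code of the repository), and ill-formed label sequences are decoded leniently rather than rejected.

```python
def tags_to_spans(tags):
    """Convert a B/M/E/S sequence into a set of (start, end) word spans."""
    spans, start = set(), 0
    for i, tag in enumerate(tags):
        if tag in ("B", "S"):
            start = i                   # a new word starts here
        if tag in ("E", "S"):
            spans.add((start, i + 1))   # a word ends here (end is exclusive)
    return spans


def evaluate(gold, pred):
    """Return (character accuracy, word precision, word recall, word F1)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    gold_spans, pred_spans = tags_to_spans(gold), tags_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return accuracy, precision, recall, f1


# Gold: 能 / 穿 / 多少, prediction: 能 / 穿 / 多 / 少
acc, p, r, f1 = evaluate(list("SSBE"), list("SSSS"))
# acc == 0.5, p == 0.5, r == 2/3
```

In this small example a single missed word boundary costs two of the four characters in accuracy and one of the three gold words in recall, so the two kinds of numbers are not directly comparable.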

Comparison with the results of previous studies

[Figure: results reported in [Yao+, 2016]] [^2]

All the models are trained on NVIDIA GTX Geforce 970, it took about 16 to 17 hours to train a model on GPU while more than 4 days to train on CPU, in contrast.

[Yao+, 2016]

There are some differences from the previous work, such as how the embeddings are initialized, but both the accuracy and the processing time of this 1-layer BLSTM look reasonable.

Decoding

hiroki-t:/private/work/blstm-cws$ python app/parse.py
2016-12-03 11:01:13.343 JST     549e15  [info]  LOG Start with ACCESSID=[549e15] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 11:01:13.343412 JST]
2016-12-03 11:01:13.343 JST     549e15  [info]  *** [START] ***
2016-12-03 11:01:13.344 JST     549e15  [info]  initialize preprocessor with /private/work/blstm-cws/app/../data/zhwiki-embeddings-100.txt
2016-12-03 11:01:13.834 JST     549e15  [trace]
2016-12-03 11:01:13.834 JST     549e15  [trace] initialize ...
2016-12-03 11:01:13.834 JST     549e15  [trace]
2016-12-03 11:01:13.914 JST     549e15  [info]  loading a model from /private/work/blstm-cws/app/../output/cws.model ...
Input a Chinese sentence! (use 'q' to exit)
The third step of modernization and construction for the completion of the Chinese people's entry.
B E B E B E S S B M E B E B E S B E B E B E S S B E S
The entry of the Chinese people into the modernized construction, the third step strategy, and the progressive new conquest.
-
q
2016-12-03 11:02:08.961 JST     549e15  [info]  *** [DONE] ***
2016-12-03 11:02:08.962 JST     549e15  [info]  LOG End with ACCESSID=[549e15] UNIQUEID=[UNIQID] ACCESSTIME=[2016-12-03 11:01:13.343412 JST] PROCESSTIME=[55.618552000]

Note: [gold] The entry of the Chinese people into the modernized construction, the third step strategy, and the progressive new conquest.

Decoding with the trained model returns the correct label sequence and the segmented words for an unsegmented input string.
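The last step of decoding, turning a predicted label sequence back into words, is a simple scan over the characters. A minimal sketch follows; `tags_to_words` is a hypothetical helper name rather than the function used in parse.py, and a trailing word that never receives an E or S label is kept as it is.

```python
def tags_to_words(chars, tags):
    """Group characters into words, closing a word at every E or S label."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # leftover characters without a closing E/S label
        words.append(current)
    return words


segmented = tags_to_words("能穿多少", ["S", "S", "B", "E"])
# segmented == ['能', '穿', '多少']
```

Joining the result with spaces gives the segmented sentence in the same form as the output line in the log above.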

Conclusion

I trained a sequence labeling model with a bi-directional LSTM using Chainer's NStep LSTM. With support for variable-length mini-batches and cuDNN, preparing input data has become easier and computation has become faster than before. The model implemented here is not specific to Chinese word segmentation and can be used for sequence learning in general, so it may be interesting to apply it to other tasks such as part-of-speech tagging.

The source code is available on GitHub. https://github.com/chantera/blstm-cws

In addition to the BLSTM introduced above, the repository contains a BLSTM + CRF implementation and code that I actually use with Chainer in my NLP research, so I hope you find it helpful.

References

- Do a mini-batch of variable-length data in Chainer with where - studylog / Northern clouds http://studylog.hateblo.jp/entry/2016/02/04/020547
- The beginning of Chainer's cuDNN-RNN (NStepLSTM) - studylog / Northern clouds http://studylog.hateblo.jp/entry/2016/10/03/095406
- Chainer's NStep LSTM predicts comments on Nico Nico Douga - Monthly Hacker's Blog http://www.monthly-hack.com/entry/2016/10/24/200000
- chainer-NStepLSTM/ptb_nslstm.py at master · monthly-hack/chainer-NStepLSTM https://github.com/monthly-hack/chainer-NStepLSTM/blob/master/ptb_nslstm.py
- About variable-length sequence input of LSTM - Google Groups https://groups.google.com/forum/#!topic/chainer-jp/og0dBbgSLVw
- Optimizing Recurrent Neural Networks in cuDNN 5 | Parallel Forall https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/
- Chen, X., Qiu, X., Zhu, C., Liu, P. and Huang, X., 2015. Long short-term memory neural networks for Chinese word segmentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1385-1394). http://aclweb.org/anthology/D15-1141.pdf
- Huang, Z., Xu, W., Yu, K., 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991. https://arxiv.org/abs/1508.01991
- Yao, Y., Huang, Z., 2016. Bi-directional LSTM Recurrent Neural Network for Chinese Word Segmentation. arXiv preprint arXiv:1602.04874. https://arxiv.org/abs/1602.04874

written by chantera at NAIST cllab

[^1]: [Yao+, 2016] projects the BLSTM output vector v ∈ R^{2d} back to d dimensions with a matrix W ∈ R^{d×2d}.
[^2]: In [Yao+, 2016], the word embeddings have 200 dimensions, and the dictionary is built from the characters of the training set without pretraining.

[PYTHON] Implementation of Chainer series learning using variable length mini-batch