[PYTHON] Tech Circle ML # 8 Chainer with Recurrent Neural Language Model

Recurrent neural network language model

Advance preparation http://qiita.com/GushiSnow/items/9ab8761082e29002f735

Github with hands-on code https://github.com/SnowMasaya/Chainer-with-Neural-Networks-Language-model-Hands-on.git

Checking the operation of the application

Enable virtual environment

Mac/Linux

source my_env/bin/activate
pyenv install 3.4.1
pyenv rehash
pyenv local 3.4.1 

To check the operation, run the following in the cloned repository directory:

ipython notebook

The Chainer notebook will now open.

Experience the process of creating a recurrent neural language model

Open the iPython notebook. The steps for creating a recurrent neural language model are laid out in order. Since the code in the text can actually be executed in the iPython notebook, we will explain and execute it in order from the top (for detailed usage of the notebook, see http://qiita.com/icoxfog417/items/175f69d06f4e590face9 ).


Follow the ipython notebook from here up to the coding part.

Recurrent neural language model setup (coding part)

The model is defined in a separate class, so you can freely change it in this part. The purpose of this part is to understand the characteristics unique to the recurrent neural language model.

- F.EmbedID converts the dictionary (vocabulary) data into vectors whose size is the number of input units (a conversion into a latent vector space).
- The output size is quadrupled because the LSTM input consists of four parts: the input, the input gate, the output gate, and the forget gate.
- h1_in = self.l1_x(F.dropout(h0, ratio=dropout_ratio, train=train)) + self.l1_h(state['h1']) combines the current input, with some units dropped out at the given ratio, with the retained past information. See below for details on dropout.

http://olanleed.hatenablog.com/entry/2013/12/03/010945

- c1, h1 = F.lstm(state['c1'], h1_in) is where LSTM, a clever mechanism, lets the recurrent neural network learn well without memory failure or vanishing gradients. If you want to know more, please see below.

http://www.slideshare.net/nishio/long-shortterm-memory

- return state, F.softmax_cross_entropy(y, t) is where the loss is computed by comparing the predicted character with the actual character. The softmax function is used because it determines the output while taking all inputs of the layer immediately before the output layer into account, so it is the usual choice for computing the output layer.
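To make the quadrupling concrete, here is a minimal shape check (a sketch assuming Chainer 1.x and NumPy, not code from the notebook): F.lstm takes the previous cell state and a pre-activation input whose 4*n_units columns are split into the four gate blocks.

import numpy as np
from chainer import Variable
import chainer.functions as F

n_units, batchsize = 8, 2
c = Variable(np.zeros((batchsize, n_units), dtype=np.float32))
lstm_in = Variable(np.random.randn(batchsize, 4 * n_units).astype(np.float32))
c_new, h_new = F.lstm(c, lstm_in)          # splits lstm_in into input, input gate, forget gate, output gate
print(c_new.data.shape, h_new.data.shape)  # both (2, 8): back to n_units columns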

#-------------Explain2 in the Qiita-------------
import numpy as np
from chainer import Variable, FunctionSet
import chainer.functions as F


class CharRNN(FunctionSet):

    """
This is the part that defines the neural network.
The dictionary vector space entered in order from the top is converted to the number of hidden layer units, and then the hidden layer is entered.
Setting power and hidden layer.
The same processing is performed on the two layers, and the output layer corrects the number of vocabularies and outputs it.
The first parameter to set is-0.08 to 0.It is set randomly between 08.
    """
    
    def __init__(self, n_vocab, n_units):
        super(CharRNN, self).__init__(
            embed = F.EmbedID(n_vocab, n_units),
            l1_x = F.Linear(n_units, 4*n_units),
            l1_h = F.Linear(n_units, 4*n_units),
            l2_x = F.Linear(n_units, 4*n_units),
            l2_h = F.Linear(n_units, 4*n_units),
            l3   = F.Linear(n_units, n_vocab),
        )
        for param in self.parameters:
            param[:] = np.random.uniform(-0.08, 0.08, param.shape)

    """
A description of forward propagation.
The forward propagation input is defined in Variable and the input and answer are passed.
Use the embed that defined the input layer earlier.
For the input of the hidden layer, l1 defined earlier_Using x, pass dropout and hidden layer state as arguments
is.
Hidden layer in lstm First layer state and h1_Pass in.
Write the second layer in the same way, and define the output layer without passing the state.
Each state is retained for use in subsequent inputs.
It compares the output label with the label of the answer and returns the loss and the status.
    """

    def forward_one_step(self, x_data, y_data, state, train=True, dropout_ratio=0.5):
        x = Variable(x_data, volatile=not train)
        t = Variable(y_data, volatile=not train)

        h0      = self.embed(x)
        h1_in   = self.l1_x(F.dropout(h0, ratio=dropout_ratio, train=train)) + self.l1_h(state['h1'])
        c1, h1  = F.lstm(state['c1'], h1_in)
        h2_in   = self.l2_x(F.dropout(h1, ratio=dropout_ratio, train=train)) + self.l2_h(state['h2'])
        c2, h2  = F.lstm(state['c2'], h2_in)
        y       = self.l3(F.dropout(h2, ratio=dropout_ratio, train=train))
        state   = {'c1': c1, 'h1': h1, 'c2': c2, 'h2': h2}

        return state, F.softmax_cross_entropy(y, t)

    """
The description of dropout is removed and it is described as a method for prediction.
There is an argument called train in dropout, and it will not work if the argument of train is set to false
So, at the time of prediction, you can change the learning and prediction by changing the arguments passed, but this time it is explicitly known
I wrote it separately as follows.
    """

    def predict(self, x_data, state):
        x = Variable(x_data, volatile=True)

        h0      = self.embed(x)
        h1_in   = self.l1_x(h0) + self.l1_h(state['h1'])
        c1, h1  = F.lstm(state['c1'], h1_in)
        h2_in   = self.l2_x(h1) + self.l2_h(state['h2'])
        c2, h2  = F.lstm(state['c2'], h2_in)
        y       = self.l3(h2)
        state   = {'c1': c1, 'h1': h1, 'c2': c2, 'h2': h2}

        return state, F.softmax(y)
    
"""
It is the initialization of the state.
"""

def make_initial_state(n_units, batchsize=100, train=True):
    return {name: Variable(np.zeros((batchsize, n_units), dtype=np.float32),
            volatile=not train)
            for name in ('c1', 'h1', 'c2', 'h2')}
#-------------Explain2 in the Qiita-------------
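As a quick sanity check that the pieces above fit together, here is a minimal usage sketch (the vocabulary size and character IDs are made up for illustration; it assumes the class and function above have already been executed):

import numpy as np

n_vocab, n_units = 100, 128
model = CharRNN(n_vocab, n_units)
state = make_initial_state(n_units, batchsize=1)

x = np.array([5], dtype=np.int32)   # current character ID (hypothetical)
t = np.array([7], dtype=np.int32)   # next character ID, i.e. the answer (hypothetical)
state, loss = model.forward_one_step(x, t, state)
print(loss.data)                    # softmax cross entropy for this one step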

Language prediction


Handson # 2 Explanation

Regarding the character string data used for prediction: normally the training data and the test data are kept separate, but this time, so that the effect can be seen within the hands-on, the training data and the test data are the same. Prediction consists of the two steps below.

- Change the model.
- Predict the character string.

To change the model used for prediction, change the code below in the iPython notebook. The created models are in the cv folder; there are not many of them, but please check.

# load model
#-------------Explain5 in the Qiita-------------
model = pickle.load(open("cv/charrnn_epoch_x.chainermodel", 'rb'))
#-------------Explain5 in the Qiita-------------

- state, prob = model.predict(np.array([index], dtype=np.int32), state) obtains the predicted probabilities and the state. The state is also kept so that it can be used for the next prediction.
- index = np.argmax(cuda.to_cpu(prob.data)) returns the index of the character with the highest probability, since cuda.to_cpu(prob.data) gives the probability weight of each word.
- index = np.random.choice(prob.data.argsort()[0,-sampling_range:][::-1], 1)[0] is included because a recurrent model tends to assign high probability to similar characters, so it randomly picks one of the top candidates (the top 5 here). Normally you would simply take the maximum, but this makes it easier to see a variety of outputs.

#-------------Explain7 in the Qiita-------------
    state, prob = model.predict(np.array([index], dtype=np.int32), state)
    #index = np.argmax(prob.data)
    index = np.random.choice(prob.data.argsort()[0,-sampling_range:][::-1], 1)[0]
#-------------Explain7 in the Qiita-------------
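Putting this together, a minimal generation-loop sketch could look like the following (it assumes the trained model, n_units, and the vocab / ivocab dictionaries from the notebook, and that everything stays on the CPU; the seed character and loop length are illustrative):

import numpy as np

state = make_initial_state(n_units, batchsize=1, train=False)
index = vocab['t']                   # ID of a seed character (hypothetical)
sampling_range = 5
generated = []
for _ in range(50):
    state, prob = model.predict(np.array([index], dtype=np.int32), state)
    # pick one of the top candidates at random instead of always taking the argmax
    index = np.random.choice(prob.data.argsort()[0, -sampling_range:][::-1], 1)[0]
    generated.append(ivocab[index])
print(''.join(generated))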

Make your model smarter and predict

In this hands-on, training runs only for a limited time, so we can only build a model with poor accuracy. So let's adjust the parameters and recreate the model. The parameters to adjust are:

#-------------Explain7 in the Qiita-------------
n_epochs    = 30
n_units     = 625
batchsize   = 100
bprop_len   = 10
grad_clip   = 0.5
#-------------Explain7 in the Qiita-------------

Role of each parameter: n_epochs is the number of training epochs. A complex model will not converge unless the number of epochs is increased, so for a complex model this needs to be set large.

n_units is the number of hidden units. The larger this number, the more complex the model becomes, and increasing it means training will not converge unless the number of epochs is also increased. For a language model in particular, it should be chosen relative to the vocabulary size: if the number of units exceeds the vocabulary size, the mapping into the latent space no longer compresses anything and the processing becomes meaningless.

batchsize is the number of samples learned at one time. It depends on the size of the data and is often tuned empirically. Basically, increasing it improves learning accuracy but reduces learning speed, while decreasing it lowers accuracy but speeds up learning.

bprop_len is a parameter peculiar to recurrent neural networks and indicates how many past characters are retained, i.e. the truncation length of backpropagation through time. It depends on the problem being solved: set a large value if you want to predict long sentences, and a smaller one for relatively short sentences.

optimizer.clip_grads(grad_clip) puts an upper limit on the magnitude of the gradient (the weight update width) to prevent the weights from exploding. A larger value permits larger updates, while a smaller value suppresses them.
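To show where bprop_len and grad_clip actually enter the loop, here is a minimal sketch of truncated backpropagation through time (a sketch assuming Chainer 1.x, the CharRNN model and make_initial_state defined above, and a train_data array of character IDs from the notebook; the optimizer choice is illustrative):

import numpy as np
from chainer import Variable, optimizers

optimizer = optimizers.RMSprop()         # illustrative; the notebook may use a different optimizer
optimizer.setup(model)                   # very old Chainer versions: optimizer.setup(model.collect_parameters())

state = make_initial_state(n_units, batchsize=1)
accum_loss = Variable(np.zeros((), dtype=np.float32))

for i in range(len(train_data) - 1):
    x = np.array([train_data[i]], dtype=np.int32)
    t = np.array([train_data[i + 1]], dtype=np.int32)
    state, loss_i = model.forward_one_step(x, t, state)
    accum_loss += loss_i

    if (i + 1) % bprop_len == 0:         # truncate BPTT every bprop_len characters
        optimizer.zero_grads()
        accum_loss.backward()
        accum_loss.unchain_backward()    # cut the history so memory stays bounded
        optimizer.clip_grads(grad_clip)  # cap the gradient norm at grad_clip
        optimizer.update()
        accum_loss = Variable(np.zeros((), dtype=np.float32))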

If you want to know more about hyperparameter optimization, please see below.

http://colinraffel.com/wiki/neural_network_hyperparameters

Handson Advance

Language processing takes a lot of time, so GPU settings are recommended. However, that does not mean the GPU should be used unconditionally; it is effective in the setup described below. If you want to know the details of the mechanism, please see http://www.kumikomi.net/archives/2008/06/12gpu1.php?page=1

For those who want to try at high speed

http://sla.hatenablog.com/entry/chainer_on_ec2

GitHub (for the GPU version of this hands-on). Use a GPU instance (based on Amazon Linux) with the CUDA environment published by NVIDIA.

https://github.com/SnowMasaya/Chainer-with-Neural-Networks-Language-model-Hands-on-Advance.git

GPU driver settings

GPU setting on AWS was done by referring to the following site.

http://tleyden.github.io/blog/2014/10/25/cuda-6-dot-5-on-aws-gpu-instance-running-ubuntu-14-dot-04/

apt-get update && apt-get install build-essential

Get the Cuda installer

wget http://developer.download.nvidia.com/compute/cuda/6_5/rel/installers/cuda_6.5.14_linux_64.run

Extract the CUDA installers

chmod +x cuda_6.5.14_linux_64.run
mkdir nvidia_installers
./cuda_6.5.14_linux_64.run -extract=`pwd`/nvidia_installers

Install linux-image-extra

sudo apt-get install linux-image-extra-virtual

Reboot

reboot

Create file

vi /etc/modprobe.d/blacklist-nouveau.conf

Disable nouveau and lbm-nouveau

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Prevent the nouveau kernel module from starting

 echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf

Update the initramfs (the file system expanded into memory when the kernel starts) so that the setting takes effect, then reboot

update-initramfs -u
reboot

Get kernel source

apt-get install linux-source
apt-get install linux-headers-3.13.0-37-generic

Install NVIDIA driver

cd nvidia_installers
./NVIDIA-Linux-x86_64-340.29.run

Check if the driver is installed with the following command.

nvidia-smi
Wed Aug  5 07:48:36 2015
+------------------------------------------------------+
| NVIDIA-SMI 340.29     Driver Version: 340.29         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   54C    P0    80W / 125W |    391MiB /  4095MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      8013  python                                               378MiB |
+-----------------------------------------------------------------------------+

The number assigned to the above GPU is the GPU ID. This will be used later when running Chainer.
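As a rough illustration of where that GPU ID is used later (a sketch based on the Chainer 1.x API of the time; the exact calls vary slightly between releases, so treat this as an assumption rather than the notebook's exact code):

import numpy as np
from chainer import cuda

gpu_id = 0                  # the GPU ID shown by nvidia-smi
cuda.init(gpu_id)           # initialize CUDA for that device (newer releases use cuda.get_device(gpu_id).use())
model.to_gpu()              # move the CharRNN parameters onto the GPU
x_gpu = cuda.to_gpu(np.array([1], dtype=np.int32))   # input arrays must be moved to the GPU as well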

If you get the error "error while loading shared libraries: libcurand.so.5.5: cannot open shared object file: No such file or directory", see:

http://linuxtoolkit.blogspot.jp/2013/09/error-while-loading-shared-libraries.html

Also add CUDA to the PATH

export PATH=$PATH:/usr/local/cuda-6.5/bin/

Python settings

Python 3 is set up as follows.

Execute the following commands to install the prerequisites in advance:


apt-get update
apt-get install gcc g++ kmod perl python-dev
sudo reboot

pip installation procedure https://pip.pypa.io/en/stable/installing.html

Pyenv installation procedure https://github.com/yyuu/pyenv


pip install virtualenv

pyenv install 3.4.0

virtualenv my_env -p ~/.pyenv/versions/3.4.0/bin/python3.4

requirement.txt contains the following:

numpy
scikit-learn
Mako
six
chainer
scikit-cuda

Install the required libraries

pip install -r requirement.txt

Download "install-headers" from below.

https://android.googlesource.com/toolchain/python/+/47a24ea6662f20c8e165d541ab6facdf009bfee4/Python-2.7.5/Lib/distutils/command/install_headers.py

Install PyCuda

wget https://pypi.python.org/packages/source/p/pycuda/pycuda-2015.1.2.tar.gz
tar zxvf pycuda-2015.1.2.tar.gz
cd pycuda-2015.1.2
./configure.py
make
make install

Handson Advance2

Run ipython notebook on the server and check the operation (on AWS)

https://thomassileo.name/blog/2012/11/19/setup-a-remote-ipython-notebook-server-with-numpyscipymaltplotlibpandas-in-a-virtualenv-on-ubuntu-server/

Create configuration file

ipython profile create myserver

Modify configuration file

vim /home/ec2-user/.ipython/profile_myserver/ipython_config.py

Add the following lines

c = get_config()

c.IPKernelApp.pylab = 'inline'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:yourhashedpassword'
c.NotebookApp.port = 9999

Add CUDA to the PATH

export PATH=$PATH:/usr/local/cuda-6.5/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-6.5/lib64/

Open the port you want to open from the AWS security group

1: Select the security group
2: Edit and add a rule
3: Type: Custom TCP rule
4: Protocol: TCP
5: Port: 9999
6: Source: anywhere

Run. Since the profile is not picked up by the normal procedure, pass the configuration file directly:

sudo ipython notebook --config=/home/ec2-user/.ipython/profile_myserver/ipython_config.py --no-browser

For those who still want to study further

"Statistical Language Models based on Neural Networks" on the following site is very organized and easy to understand. Although it is in English.

http://rnnlm.org/

List of reference sites

Description of language model coverage, perplexity

http://marujirou.hatenablog.com/entry/2014/08/22/235215

Run the deep learning framework Chainer on an EC2 g2.2xlarge GPU instance

http://ukonlly.hatenablog.jp/entry/2015/07/04/210149

Drop Out

http://olanleed.hatenablog.com/entry/2013/12/03/010945

Learning to Forget: Continual Prediction with LSTM

http://www.slideshare.net/FujimotoKeisuke/learning-to-forget-continual-prediction-with-lstm

Zaremba, Wojciech, Ilya Sutskever, and Oriol Vinyals. "Recurrent neural network regularization." arXiv preprint arXiv:1409.2329 (2014).

Google Mikolov

http://www.rnnlm.org/

Recurrent Neural Network Language Model (RNNLM): a language model (LM) built with a neural network (NN), i.e. a kind of neural network language model (NNLM), that uses a recurrent neural network (RNN)

http://kiyukuta.github.io/2013/12/09/mlac2013_day9_recurrent_neural_network_language_model.html

Long Short-term Memory

http://www.slideshare.net/nishio/long-shortterm-memory

While explaining Chainer's ptb sample, train on your own sentences and automatically generate sentences in your own style

http://d.hatena.ne.jp/shi3z/20150714/1436832305

RNNLM

http://www.slideshare.net/uchumik/rnnln

Sparse estimation overview: model, theory, application

http://www.is.titech.ac.jp/~s-taiji/tmp/sparse_tutorial_2014.pdf

Optimization method in regularization learning method

http://imi.kyushu-u.ac.jp/~waki/ws2013/slide/suzuki.pdf

Recurrent neural language model creation reference https://github.com/yusuketomoto/chainer-char-rnn

Neural network natural language processing http://www.orsj.or.jp/archive2/or60-4/or60_4_205.pdf

Language model creation http://www.slideshare.net/uchumik/rnnln

Natural Language Processing Programming Study Group n-gram Language Model http://www.phontron.com/slides/nlp-programming-ja-02-bigramlm.pdf

Introduction to Statistical Semantic ~ From distribution hypothesis to word2vec ~ http://www.slideshare.net/unnonouno/20140206-statistical-semantics

linux source code https://github.com/torvalds/linux

Why GPU Computing Is Attracting Attention - Keio University http://www.yasuoka.mech.keio.ac.jp/gpu/gpu_0.php

Actual GPU computing using CUDA technology (Part 1) -Application of parallel processing technology refined in the graphics field to general-purpose numerical calculation http://www.kumikomi.net/archives/2008/06/12gpu1.php?page=1

GPGPU https://ja.wikipedia.org/wiki/GPGPU#.E7.89.B9.E5.BE.B4.E3.81.A8.E8.AA.B2.E9.A1.8C

Natural language processing theory I http://www.jaist.ac.jp/~kshirai/lec/i223/02.pdf

STATISTICAL LANGUAGE MODELS BASED ON NEURAL NETWORKS http://www.rnnlm.org/

Neural Network Hyperparameters http://colinraffel.com/wiki/neural_network_hyperparameters

Random Search for Hyper-Parameter Optimization http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
