Python: Deep Learning in Natural Language Processing: Basics

Deep learning in natural language processing

Deep learning is used in natural language processing for tasks such as the following:

Machine translation, which translates sentences into other languages
Automatic summarization, which extracts only the important information
Machine reading comprehension, which answers questions based on documents
Systems that answer questions about images
and so on

All of these are fields that researchers and well-known IT companies around the world are actively working on, and the latest methods for these tasks almost always use deep learning. For anyone planning to do research or business related to natural language processing, learning about deep learning and neural networks is therefore nearly indispensable.

So why is deep learning used so much in natural language processing? To handle words on a computer, it is absolutely necessary to convert them into numbers. Classic ways of doing this include:

One-hot vectors
TF-IDF

and so on.

In fact, these are easy to implement, so they are a great way to get started when you want to do something with natural language processing.

However, these vectors have the following problems:

① Sparseness
The vectors have tens of thousands to hundreds of thousands of dimensions, which often exhausts memory.

② They cannot capture meaning
Words are distinguished only by how often they occur, so the meaning of each individual word is lost.

With a neural network model, on the other hand, word vectors can be learned by backpropagation, so an Embedding (assigning each word a vector of only a few hundred dimensions) is sufficient.

Furthermore, because the word vectors are learned while taking context into account, words with similar meanings end up with similar vectors. Compared to TF-IDF and the like, this has the advantage of being able to handle the meaning of words.

Embedding

Embedding literally means "to embed".

In deep learning for natural language processing, each word symbol is embedded into a d-dimensional vector (d is typically around 100 to 300); this operation is called Embedding.

Embedding is the first step when building a neural network model that handles words.

In keras, there is an Embedding Layer, which can be used as follows.

from keras.models import Sequential
from keras.layers.embeddings import Embedding

vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length = 20 #Sentence length

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))

The values that must be specified here are input_dim, output_dim, and input_length.

input_dim:Vocabulary size (number of distinct word types)
output_dim:Word vector dimensions
input_length:Length of each sentence

The figure shows an example of what is called the Embedding Matrix, which stacks the word vectors of all the words.

As a premise, each word is assigned a unique ID, and that ID refers to a row number of the Embedding Matrix (ID = 0 is row 0, ID = i is row i).

The vector of each word thus corresponds to a row of the Embedding Matrix, and the width d of the Embedding Matrix corresponds to the dimension of the word vectors.

[Figure: example of an Embedding Matrix, with one row per word ID and d columns]

Keras automatically initializes each cell with a random value. Also, as shown in the figure, row 0 is usually assigned to unk (Unknown).

The reason for using unk is to limit the vocabulary: all other rare words are mapped to Unknown, which saves memory. A common way to limit the vocabulary is to keep only the words that appear most frequently in the corpus (the documents) being handled.
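For example, a frequency-limited vocabulary could be built roughly like this (the corpus, max_vocab, and word_to_id names here are just illustrative):

from collections import Counter

#Hypothetical tokenized corpus
corpus = [["i", "like", "cats"], ["i", "like", "dogs"], ["dogs", "bark"]]

max_vocab = 4 #Keep only the most frequent words (plus unk)
counts = Counter(word for sentence in corpus for word in sentence)

#ID 0 is reserved for unk; frequent words get IDs 1, 2, ...
word_to_id = {"unk": 0}
for word, _ in counts.most_common(max_vocab):
    word_to_id[word] = len(word_to_id)

#Words outside the limited vocabulary are mapped to unk (ID 0)
ids = [[word_to_id.get(word, 0) for word in sentence] for sentence in corpus]
print(ids)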

The input to Embedding is a matrix of the IDs assigned to the words, with shape (batch size, sentence length).

Here, the batch size is the number of data items (sentences) processed in parallel at one time.

The sentence length must be the same for all data, but in practice sentence lengths vary. Therefore, a fixed length D is chosen:

Sentences shorter than D have 0s added until their length is D
Sentences longer than D have words removed from the end until their length is D

This is commonly called padding.
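keras provides a helper that can perform this padding. Here is a minimal sketch using keras.preprocessing.sequence.pad_sequences; the padding='post' and truncating='post' options are one possible choice matching the description above (append 0s, cut words from the end).

from keras.preprocessing.sequence import pad_sequences

D = 5 #Hypothetical fixed sentence length

#Hypothetical sentences already converted to word IDs
sequences = [[3, 7, 2], [4, 1, 9, 8, 6, 5, 2]]

#Shorter sentences are padded with 0, longer ones are cut off at the end
padded = pad_sequences(sequences, maxlen=D, padding='post', truncating='post', value=0)
print(padded)
# [[3 7 2 0 0]
#  [4 1 9 8 6]]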

Here is a usage example of the Embedding layer.

import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding


batch_size = 32 #Batch size
vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length = 20 #Sentence length

#Normally the words would have to be converted to IDs first; here we simply prepare dummy input data
input_data = np.arange(batch_size * seq_length).reshape(batch_size, seq_length)

#Add Embedding to model.
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))

# Check how the shape of input_data changes
output = model.predict(input_data)
print(output.shape)

RNN

In this section you will learn about RNNs.

RNN is an abbreviation for Recurrent Neural Network; in Japanese it is translated as "recursive neural network".

It excels at handling variable-length sequences, that is, inputs of arbitrary length, and is an important mechanism frequently used in natural language processing.

Since it can handle time-series data as well as language, it has a wide range of applications, such as speech recognition.

Recurrent means "repeating", so an RNN is a neural network that applies the same computation repeatedly over time.

The simplest RNN is expressed by the following formula.

h_t = f(U x_t + W_h h_{t-1} + b_h)
y_t = g(W_y h_t + b_y)

Here, t usually denotes time: x_t is the input at time t, h_t is the hidden state vector at time t, and y_t is the output at time t. All three are vectors.

A zero vector is often used for h_0.

f and g are activation functions, such as the sigmoid function.

W_h, W_y, U and b_h, b_y are learnable weight matrices and biases, respectively.
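To make the recurrence concrete, here is a minimal NumPy sketch of it, assuming tanh for f and the sigmoid function for g; the dimensions are arbitrary example values.

import numpy as np

input_dim, hidden_dim, output_dim, T = 4, 3, 2, 5

rng = np.random.RandomState(0)
U = rng.randn(hidden_dim, input_dim) #Input-to-hidden weights
W_h = rng.randn(hidden_dim, hidden_dim) #Hidden-to-hidden weights
W_y = rng.randn(output_dim, hidden_dim) #Hidden-to-output weights
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.randn(T, input_dim) #Input sequence x_1, ..., x_T
h = np.zeros(hidden_dim) #h_0 is the zero vector
for t in range(T):
    h = np.tanh(U @ x[t] + W_h @ h + b_h) #Hidden state vector h_t
    y = sigmoid(W_y @ h + b_y) #Output y_t
    print(t, y)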

A simple illustration is shown in the figure.

There are many RNN variants besides the one introduced above, but the basic structure is as shown in the figure.

[Figure: schematic of the simple RNN described above]

Also, you don't need to remember these exact definitions when using keras. Just keep in mind that you feed in the input sequence step by step and obtain the hidden state vector and output at each time step.

You can also stack multiple RNN layers, just like any other neural network.
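As a minimal sketch, keras also provides a SimpleRNN layer that implements this basic recurrence; the hyperparameter values below are just an example.

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN

vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length = 20 #Sentence length
rnn_units = 64 #Number of dimensions of the hidden state vector

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
model.add(SimpleRNN(rnn_units)) #By default, only the hidden state vector at the last time step is returned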

LSTM

What is LSTM

LSTM is a kind of RNN, with a mechanism that makes up for a drawback peculiar to plain RNNs.

First, the drawback peculiar to RNNs: because an RNN is a network that is deep in the time direction, it is difficult to train its parameters to take values entered far in the past into account. In other words, it is not good at long-term memory.

Intuitively, an RNN "forgets" elements that were entered earlier.

LSTM is a well-known mechanism to prevent this.

LSTM is an abbreviation for Long Short-Term Memory. As the name implies, it enables both long-term and short-term memory.

It is widely used by researchers around the world.

An LSTM is a type of RNN that introduces the concept of a "gate"; these gates are what make long-term and short-term memory possible.

The outline of LSTM is as shown in the figure.

[Figure: overview of an LSTM cell]

The inside of the LSTM is described by the following equations (no need to memorize them):

i_t = sigmoid(W_i x_t + U_i h_{t-1} + b_i)
f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)
o_t = sigmoid(W_o x_t + U_o h_{t-1} + b_o)
h̄_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ h̄_t
h_t = o_t ⊙ tanh(c_t)

(⊙ denotes the element-wise product.)

Here, i is called the input gate, f the forget gate, and o the output gate.

Note that the sigmoid function is used as the activation function of these gates. The sigmoid function is

sigmoid(x) = 1 / (1 + e^(-x))

so the output of each gate is a number between 0 and 1.

The input gate i is multiplied element-wise with h̄_t, so you can see that it adjusts how much information from the input at time t is transmitted. As the name suggests, "gate" comes from the image of a gate that opens and closes.

The forget gate f is multiplied element-wise with c_{t-1}, adjusting how much of the information up to time t-1 is transmitted to time t (in other words, how much of the past is forgotten).

The output gate o adjusts how much of the information in c_t, which accumulates the information from times 1 through t, is used in the output hidden state vector h_t.

Together, these make up the LSTM, devised to realize both short-term and long-term memory.

Implementation of LSTM

Let's implement an LSTM with keras right away. keras has a module that makes LSTMs easy to use, so you can use them without worrying about the equations above.

Specifically, it is used as follows.

from keras.layers.recurrent import LSTM

units = 200

lstm = LSTM(units, return_sequences=True)

Here units is the number of dimensions of the LSTM's hidden state vector; usually a value of around 100 to 300 is a good choice.

In general, the more parameters a model has, the more complex the phenomena it can model, but it also becomes correspondingly harder to train (memory consumption increases and training takes longer).

It is a good idea to adjust this value according to your environment.

This time we also specify the argument return_sequences, which determines the format of the LSTM's output.

If return_sequences is True
The LSTM outputs the whole output sequence (the hidden state vectors h_1 through h_T) corresponding to the input sequence.
If return_sequences is False
The LSTM outputs only the hidden state vector at the last time step T.

We'll use all the output sequences in a later chapter, so we'll leave return_sequences set to True here.
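As a quick sketch of the difference, the output shape changes as follows (same example settings as above).

import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

vocab_size, embedding_dim, seq_length, lstm_units = 1000, 100, 20, 200
input_data = np.arange(32 * seq_length).reshape(32, seq_length)

#return_sequences=True: the hidden state vector at every time step
model_true = Sequential()
model_true.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
model_true.add(LSTM(lstm_units, return_sequences=True))
print(model_true.predict(input_data).shape) # (32, 20, 200)

#return_sequences=False: only the hidden state vector at the last time step
model_false = Sequential()
model_false.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
model_false.add(LSTM(lstm_units, return_sequences=False))
print(model_false.predict(input_data).shape) # (32, 200)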

When connected with Embedding, the model is built as follows.

from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM

vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length = 20 #Sentence length
lstm_units = 200 #Number of dimensions of hidden state vector of LSTM

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
model.add(LSTM(lstm_units, return_sequences=True))

Here is a usage example.

import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM


batch_size = 32 #Batch size
vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length = 20 #Sentence length
lstm_units = 200 #Number of dimensions of hidden state vector of LSTM

#Again, we simply prepare dummy input data
input_data = np.arange(batch_size * seq_length).reshape(batch_size, seq_length)

#Add LSTM to model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
model.add(LSTM(lstm_units, return_sequences=True))

# Check how the shape of input_data changes
output = model.predict(input_data)
print(output.shape)
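# Expected output: (32, 20, 200) = (batch size, sentence length, lstm_units)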

BiLSTM

So far, the input sequence x = {x1, ..., xT} has been fed into the RNN (including the LSTM) in order from x1 to xT.

It is also possible to feed it in the opposite direction, from xT back to x1.

A bidirectional recurrent neural network, which combines the two input directions, is often used.

Its advantage is that at each time step it has both the information propagated from the beginning and the information propagated from the end, so the information of the entire input sequence can be taken into account.

The version that combines two LSTMs in this way is the bidirectional LSTM, commonly known as BiLSTM.

The outline is as shown in the figure.

[Figure: overview of a BiLSTM, with forward and backward LSTMs over the same input]

There are several ways to combine the two directions, but let's look at the implementation before explaining them.

from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional

units = 200 #Number of dimensions of the hidden state vector

bilstm = Bidirectional(LSTM(units, return_sequences=True), merge_mode='sum')

As you can see, a BiLSTM can be implemented simply by passing an LSTM to Bidirectional.

The other argument, merge_mode, specifies how the outputs of the two directions are combined. Basically, choose from {'sum', 'mul', 'concat', 'ave'}.

sum: element-wise sum
mul: element-wise product
concat: concatenation
ave: average
None: returns the two outputs as a list without merging

[Figure: how the forward and backward outputs are combined for each merge_mode]

Here is a usage example.

import numpy as np
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional


batch_size = 32 #Batch size
vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length = 20 #Sentence length
lstm_units = 200 #Number of dimensions of hidden state vector of LSTM

#Again, we simply prepare dummy input data
input_data = np.arange(batch_size * seq_length).reshape(batch_size, seq_length)

#Added BiLSTM to model.
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length))
model.add(Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat'))

# Check how the shape of input_data changes
output = model.predict(input_data)
print(output.shape)
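# Expected output: (32, 20, 400); merge_mode='concat' doubles the last dimension to 2 * lstm_units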

Softmax function

It's a bit off topic, but let's get used to working with the Softmax function.

The Softmax function is a type of activation function; it is used in the layer closest to the output of a neural network when doing classification.

The Softmax function converts an input sequence x = {x1, ..., xn} (each element a real number) into an output sequence y = {y1, ..., yn} whose i-th element is

yi = exp(xi) / (exp(x1) + ... + exp(xn))

As you can see from this definition, the output sequence y satisfies

0 < yi < 1 and y1 + ... + yn = 1

for any real-valued input sequence.

When actually implementing with keras

from keras.layers.core import Activation

#x size: [Batch size, number of classes]

y = Activation('softmax')(x)

# sum(y[0]) = 1, sum(y[1]) = 1, ...

The softmax function is applied separately to each sample in the batch.

This is the default behavior of Activation('softmax'): softmax acts along the class axis, which is the last axis of the input x.

That is, even if x is three-dimensional with shape [batch size, d, number of classes], Activation('softmax') can be applied as-is.

As an aside, this way of writing a model without using keras.models.Sequential is called the Functional API.

Here is a usage example.

import numpy as np
from keras.layers import Input
from keras.layers.core import Activation
from keras.models import Model


x = Input(shape=(20, 5))
#Act softmax on x
y = Activation('softmax')(x)

model = Model(inputs=x, outputs=y)

sample_input = np.ones((12, 20, 5))
sample_output = model.predict(sample_input)

print(np.sum(sample_output, axis=2))
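# Expected output: a (12, 20) array of ones, since softmax outputs along the last axis sum to 1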

Attention

What is Attention

Now suppose you have two sentences s = {s1, ..., sN}, t = {t1, ..., tL}.

Here, let s be a question sentence and t be a candidate answer sentence.

At this time, how can the machine automatically determine whether t is valid as an answer to s?

No matter how closely you look at t alone, you cannot tell whether it is a valid answer; you need to refer to the information in s to decide whether t is valid.

That's where the Attention Mechanism comes in handy.

In previous chapters we learned that sentences can be converted to hidden state vectors by RNNs.

Specifically, prepare two separate RNNs: one converts s into a sequence of hidden state vectors h^s_1, ..., h^s_N, and the other converts t into a sequence of hidden state vectors h^t_1, ..., h^t_L.

Therefore, in order to use the information of t while taking the information of s into account, at each time step i of t we compute a feature that considers the hidden state vectors of s at every time step, for example as follows:

a_ij = exp(h^t_i · h^s_j) / Σ_j' exp(h^t_i · h^s_j')
c_i = Σ_j a_ij h^s_j

In this way, by computing where to "attend" in the source sequence s at each time step of the target sequence t, you can flexibly extract information from the target sequence while taking the source sequence into account.

The outline is shown in the figure.

The figure shows the case of a unidirectional RNN, but Attention can also be applied to bidirectional RNNs.
Attention can also be applied even when the source and target RNNs have different numbers of time steps (different numbers of hidden state vectors).

[Figure: attention from each time step of the target sequence t over the hidden state vectors of the source sequence s]

This mechanism, called Attention, is an important technique commonly used in deep learning for natural language processing.

It appears frequently in papers on machine translation, automatic summarization, and dialogue.

Historically, its usefulness gained widespread recognition after it was first used in machine translation.

The mechanism introduced here, which takes a weighted average of the hidden state vectors of s, is called Soft Attention.

There is also Hard Attention, which stochastically selects a single hidden state vector.

Variants derived from it are also sometimes used in the field of image recognition.

Among attention-related work, the paper "Attention Is All You Need", announced by Google, is particularly famous.


Implementation of Attention

To implement Attention in keras, you need to use merge layers.

Since keras 2.0 it is no longer possible to merge the Sequential models used up to the previous chapters, so here we use keras's Functional API. Here is a simple overview of how to use it.

With the Sequential model you could only stack new layers one after another; the Functional API lets you build more complex models freely.

from keras.layers import Input, Dense
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.layers.merge import dot, concatenate
from keras.layers.core import Activation
from keras.models import Model

batch_size = 32 #Batch size
vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length1 = 20 #Sentence 1 length
seq_length2 = 30 #Sentence 2 length
lstm_units = 200 #Number of dimensions of hidden state vector of LSTM
hidden_dim = 200 #Number of dimensions of vector in final output

input1 = Input(shape=(seq_length1,))
embed1 = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length1)(input1)
bilstm1 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed1)

input2 = Input(shape=(seq_length2,))
embed2 = Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=seq_length2)(input2)
bilstm2 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed2)

#Compute the dot products (attention scores) between the hidden state vectors of sentence 2 and sentence 1
product = dot([bilstm2, bilstm1], axes=2) #product size:[Batch size, length of sentence 2, length of sentence 1]
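#The next three steps compute the attention itself:
#  a: attention weights of each position in sentence 2 over the positions in sentence 1
#  c: context vectors, i.e. weighted sums of the bilstm1 hidden state vectors
#  h: final features combining the context vectors with the sentence-2 hidden states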

a = Activation('softmax')(product)
c = dot([a, bilstm1], axes=[2, 1])
c_bilstm2 = concatenate([c, bilstm2], axis=2)
h = Dense(hidden_dim, activation='tanh')(c_bilstm2)

model = Model(inputs=[input1, input2], outputs=h)

Because each layer is applied like a function in this way, it is called the Functional API.

Also, be careful not to include the batch size in the shape specified for the newly introduced Input layer.

When defining a Model, you need to specify the inputs and outputs arguments; if there are multiple inputs or outputs, you can pass them as lists.

The newly introduced dot([u, v], axes=2) computes the batch matrix product of u and v.

The dimensions of the specified axes must be equal for u and v.

You can also specify a different axis for each input, as in dot([u, v], axes=[1, 2]), which uses axis 1 of u and axis 2 of v.
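Here is a tiny sketch of how the dot layer behaves; the shapes are arbitrary example values.

import numpy as np
from keras.layers import Input
from keras.layers.merge import dot
from keras.models import Model

u = Input(shape=(20, 400)) #[batch size, 20, 400]
v = Input(shape=(30, 400)) #[batch size, 30, 400]
w = dot([v, u], axes=2) #Contract the 400-dimensional axes -> [batch size, 30, 20]

model = Model(inputs=[u, v], outputs=w)
out = model.predict([np.ones((2, 20, 400)), np.ones((2, 30, 400))])
print(out.shape) # (2, 30, 20)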

Now, let's implement Attention using the Functional API based on the following formula.

a_ij = exp(h^2_i · h^1_j) / Σ_j' exp(h^2_i · h^1_j')
c_i = Σ_j a_ij h^1_j
h_i = tanh(W [c_i; h^2_i] + b)

(Here h^1_j and h^2_i are the BiLSTM hidden state vectors of sentence 1 and sentence 2, [·;·] denotes concatenation, and W and b are the weights of the final Dense layer.)

Here is a usage example.

import numpy as np
from keras.layers import Input, Dense
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.layers.merge import dot, concatenate
from keras.layers.core import Activation
from keras.models import Model

batch_size = 32 #Batch size
vocab_size = 1000 #Number of vocabularies to handle
embedding_dim = 100 #Word vector dimensions
seq_length1 = 20 #Sentence 1 length
seq_length2 = 30 #Sentence 2 length
lstm_units = 200 #Number of dimensions of hidden state vector of LSTM
hidden_dim = 200 #Number of dimensions of vector in final output

#Define the Embedding layer first so that it can be shared by the two LSTMs
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)

input1 = Input(shape=(seq_length1,))
embed1 = embedding(input1)
bilstm1 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed1)

input2 = Input(shape=(seq_length2,))
embed2 = embedding(input2)
bilstm2 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed2)

#Compute the dot products (attention scores) between the hidden state vectors of sentence 2 and sentence 1
product = dot([bilstm2, bilstm1], axes=2) #size:[Batch size, length of sentence 2, length of sentence 1]

#Implement Attention mechanism here
a = Activation('softmax')(product)
c = dot([a, bilstm1], axes=[2, 1])
c_bilstm2 = concatenate([c, bilstm2], axis=2)
h = Dense(hidden_dim, activation='tanh')(c_bilstm2)


model = Model(inputs=[input1, input2], outputs=h)

sample_input1 = np.arange(batch_size * seq_length1).reshape(batch_size, seq_length1)
sample_input2 = np.arange(batch_size * seq_length2).reshape(batch_size, seq_length2)

sample_output = model.predict([sample_input1, sample_input2])
print(sample_output.shape)
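# Expected output: (32, 30, 200) = (batch size, length of sentence 2, hidden_dim)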

Dropout

Dropout is a method that randomly sets some variables to 0 during training in order to prevent overfitting and improve generalization performance.

① What is overfitting?

When supervised learning is performed with a model such as a neural network, the model often fits the training data too closely, causing "overfitting", in which performance on the evaluation data is significantly lower than on the training data.

② What is generalization performance?

Performing well in general, on evaluation data as well as training data, without overfitting to the training data, is called "generalization performance".

In practice, you add a Dropout layer and specify the fraction of variables to set to 0 as a value between 0 and 1.

#When using the Sequential model
from keras.models import Sequential
from keras.layers import Dropout


model = Sequential()

...

model.add(Dropout(0.3))

#When using Functional API
from keras.layers import Dropout

y = Dropout(0.3)(x)

Here is a usage example.

import numpy as np
from keras.layers import Input, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.models import Model

batch_size = 32  #Batch size
vocab_size = 1000  #Number of vocabularies to handle
embedding_dim = 100  #Word vector dimensions
seq_length = 20  #Sentence length
lstm_units = 200  #Number of dimensions of hidden state vector of LSTM

input = Input(shape=(seq_length,))

embed = Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                  input_length=seq_length)(input)

bilstm = Bidirectional(
    LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed)

output = Dropout(0.3)(bilstm)
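#Note: Dropout is only active during training; model.predict below runs in inference mode, so no values are actually dropped here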

model = Model(inputs=input, outputs=output)

sample_input = np.arange(
    batch_size * seq_length).reshape(batch_size, seq_length)

sample_output = model.predict(sample_input)

print(sample_output.shape)
