Given a question and several candidate answer sentences, this is a system that automatically selects the correct answer sentence from among them.
The dataset used is Allen AI's Textbook Question Answering.
train.json and val.json are prepared under "./nlp_data/".
Use train.json as the training data and val.json as the evaluation data.
In general, natural language processing tasks require preprocessing of the data.
As a preprocessing step we use tokenization, that is, splitting a sentence into words.
English, too, requires splitting sentences into words, normalizing the characters, and converting the words to IDs.
Also, when using deep learning for natural language processing, all input sentences must have the same length, because otherwise matrix operations cannot be performed.
Unifying the length of the input sentences is called padding: short sentences are padded with 0s, and sentences that are too long are truncated.
As for English normalization, here we only perform the most basic step of unifying the text to lowercase (or uppercase) letters.
Suppose an English text is given as a string:
s = "I am Darwin."
s = s.lower()
print(s)
# => "i am darwin."
Next is tokenization. One of the tools used for English tokenization is nltk.
Besides tokenization, nltk can also handle lemmatization, stemming, and so on; for simplicity, we will only use tokenization here.
from nltk.tokenize import word_tokenize
t = "he isn't darwin."
t = word_tokenize(t)
print(t)
# => ['he', 'is', "n't", 'darwin', '.']
In this way, isn't is split into is and n't, and the period is also separated as its own token.
Here is a usage example:
import json
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
with open("./nlp_data/train.json") as f:
    train = json.load(f)
# train is a list; each element is a dictionary holding the question, the answer choices, and the correct answer.
# train[0] = {'answerChoices': {'a': 'solid Earth.',
# 'b': 'Earths oceans.',
# 'c': 'Earths atmosphere.',
# 'd': 'all of the above'},
# 'correctAnswer': 'd',
# 'question': 'Earth science is the study of'}
target = train[0]["question"]
# Convert to lowercase
target = target.lower()
# Tokenize
target = word_tokenize(target)
print(target)
Since words themselves cannot be given to a neural network as input, they must be converted to IDs.
What is an ID here? It corresponds to a row of the embedding matrix.
Also, if you assign an ID to every word that appears in the data, the total vocabulary often becomes enormous.
Therefore, we assign IDs only to words whose frequency is above a certain threshold, and convert the data into sequences of IDs.
Also, for a dictionary, dict.get(key) lets you look up the value associated with a key.
dict_ = {'key1': 'earth','key2': 'science', 'key3':'is','key4': 'the', 'key5':'study', 'key6':'of'}
print(dict_['key1'])
print(dict_.get('key1'))
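One difference from dict_['key'] worth noting, since the code below relies on it: get() accepts a second argument that is returned when the key is missing, instead of raising a KeyError. A minimal sketch:
dict_ = {'key1': 'earth'}
print(dict_.get('key7'))     # => None: a missing key returns None instead of raising an error
print(dict_.get('key7', 0))  # => 0: the default value we passed
# vocab.get(w, 0) + 1 in the code below uses this to count a word that has not been seen yet.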
Here is a usage example:
import json
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

with open("./nlp_data/train.json", "r") as f:
    train = json.load(f)

def preprocess(s):
    s = s.lower()
    s = word_tokenize(s)
    return s

sentences = []
for t in train:
    q = t['question']
    q = preprocess(q)
    sentences.append(q)
    for i, a in t['answerChoices'].items():
        a = preprocess(a)
        sentences.append(a)

vocab = {}
for s in sentences:
    for w in s:
        # Count the frequency of each word with vocab.get()
        vocab[w] = vocab.get(w, 0) + 1

word2id = {}
word2id['<unk>'] = 0
for w, v in vocab.items():
    if w not in word2id and v >= 2:
        # Assign the next ID to the word using len(word2id)
        word2id[w] = len(word2id)

target = preprocess(train[0]["question"])
target = [word2id.get(w, 0) for w in target]
print(target)
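Later sections assume that this word2id dictionary has been saved to ./nlp_data/word2id.json (the file name used throughout this article). If you are working locally, one way to write it out, as a sketch, is:
import json
with open("./nlp_data/word2id.json", "w") as f:
    json.dump(word2id, f)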
Padding
In deep learning, matrix operations cannot be performed on data of different lengths, such as sentences with different numbers of words.
You therefore need to pad (and truncate) the input data: the dummy ID 0 is forcibly appended to the end of short sentences, and as many words as necessary are deleted from the end of sentences that are too long.
Keras has a convenient function for this, so we will use it here.
import numpy as np
from keras.preprocessing.sequence import pad_sequences
s = [[1,2], [3,4,5], [6,7,8], [9,10,11,12,13,14]]
s = pad_sequences(s, maxlen=5, dtype=np.int32, padding='post', truncating='post', value=0)
print(s)
# => array([[ 1, 2, 0, 0, 0],
# [ 3, 4, 5, 0, 0],
# [ 6, 7, 8, 0, 0],
# [ 9, 10, 11, 12, 13]], dtype=int32)
After padding and truncating in this way, the result is returned as a NumPy array.
The arguments are explained below (a small sketch contrasting 'pre' and 'post' follows the list).
maxlen: the length to which all sequences are unified
dtype: the data type of the output
padding: specify 'pre' or 'post' to decide whether to pad at the beginning or at the end
truncating: specify 'pre' or 'post' to decide whether to truncate from the beginning or from the end
value: the value used for padding
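As mentioned above, here is a small sketch (using the same pad_sequences as before) contrasting 'pre' and 'post' for both padding and truncating:
from keras.preprocessing.sequence import pad_sequences
s = [[1, 2], [3, 4, 5, 6, 7, 8]]
print(pad_sequences(s, maxlen=5, padding='pre', truncating='pre'))
# => [[0 0 0 1 2]     zeros are added at the front of the short sequence
#     [4 5 6 7 8]]    excess values are removed from the front of the long one
print(pad_sequences(s, maxlen=5, padding='post', truncating='post'))
# => [[1 2 0 0 0]     zeros are added at the end
#     [3 4 5 6 7]]    excess values are removed from the end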
Here is a usage example:
import numpy as np
from keras.preprocessing.sequence import pad_sequences
# Use these values for the arguments.
maxlen = 10
dtype = np.int32
padding = 'post'
truncating = 'post'
value = 0
#data
s = [[1,2,3,4,5,6], [7,8,9,10,11,12,13,14,15,16,17,18], [19,20,21,22,23]]
# Pad and truncate the data.
s = pad_sequences(s, maxlen=maxlen, dtype=dtype, padding=padding, truncating=truncating, value=value)
print(s)
Attention-based QA-LSTM
From here, we will finally implement the answer sentence selection system.
As the learning model, we will use a version of Attention-based QA-LSTM that has been modified to be easier to understand.
The overall picture of the model is shown in the figure.
① First, the Question and the Answer are each fed into their own BiLSTM.
② Next, Attention is applied from the Question to the Answer, which gives an Answer representation that takes the Question into account.
③ After that, the hidden state vectors of the Question at each time step are averaged (mean pooling) to obtain the vector q.
④ Likewise, after the Attention from the Question has been applied, the hidden state vectors of the Answer at each time step are averaged to obtain the vector a.
⑤ Finally, the two vectors are combined as x = [q; a; |q - a|; q * a], that is, the concatenation of q, a, their element-wise absolute difference, and their element-wise product; x is then passed through a feed-forward neural network and a softmax function to produce an output with two units.
This way of combining the vectors follows the well-known InferSent method proposed by Facebook Research.
The output layer of this model has two units, and we train it to predict [1, 0] for correct answer sentences and [0, 1] for incorrect ones. (A small shape-check sketch of the combination in ⑤ follows below.)
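As a quick shape check of the combination in ⑤, here is a small NumPy sketch with toy dimensions (3 instead of the model's 2 * lstm_units); it is only an illustration, not part of the model code:
import numpy as np
q = np.array([1.0, 2.0, 3.0])   # mean-pooled Question vector (toy values)
a = np.array([0.5, 1.0, 4.0])   # mean-pooled Answer vector (toy values)
x = np.concatenate([q, a, np.abs(q - a), q * a])  # [q; a; |q - a|; q * a]
print(x.shape)  # => (12,), i.e. four times the original dimension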
What is a Bidirectional LSTM (BiLSTM)? By reading the sequence from the end as well as from the beginning, it can capture contextual information in both the left and right directions, as is useful, for example, in named entity recognition.
Here is a usage example:
from keras.layers import Input, Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.models import Model
vocab_size = 1000  # Vocabulary size
embedding_dim = 100  # Dimensionality of the word vectors
seq_length1 = 20  # Question length
seq_length2 = 10  # Answer length
lstm_units = 200  # Dimensionality of the LSTM hidden state vector
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
input1 = Input(shape=(seq_length1,))
embed1 = embedding(input1)
bilstm1 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed1)
h1 = Dropout(0.2)(bilstm1)
model1 = Model(inputs=input1, outputs=h1)
input2 = Input(shape=(seq_length2,))
embed2 = embedding(input2)
bilstm2 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed2)
h2 = Dropout(0.2)(bilstm2)
model2 = Model(inputs=input2, outputs=h2)
model1.summary()
model2.summary()
Let's implement the contents of the figure below step by step.
Note that the Attention goes from the Question to the Answer.
Here is a usage example, which builds on the code from the previous section:
from keras.layers import Input, Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.layers.merge import dot, concatenate
from keras.layers.core import Activation
from keras.models import Model
batch_size = 32  # Batch size
vocab_size = 1000  # Vocabulary size
embedding_dim = 100  # Dimensionality of the word vectors
seq_length1 = 20  # Question length
seq_length2 = 10  # Answer length
lstm_units = 200  # Dimensionality of the LSTM hidden state vector
hidden_dim = 200  # Dimensionality of the final output vector
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
input1 = Input(shape=(seq_length1,))
embed1 = embedding(input1)
bilstm1 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed1)
h1 = Dropout(0.2)(bilstm1)
input2 = Input(shape=(seq_length2,))
embed2 = embedding(input2)
bilstm2 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed2)
h2 = Dropout(0.2)(bilstm2)
# Compute the attention scores: dot products between each Answer hidden state and each Question hidden state
product = dot([h2, h1], axes=2)  # shape: [batch size, answer length, question length]
a = Activation('softmax')(product)
c = dot([a, h1], axes=[2, 1])
c_h2 = concatenate([c, h2], axis=2)
h = Dense(hidden_dim, activation='tanh')(c_h2)
model = Model(inputs=[input1, input2], outputs=h)
model.summary()
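To see concretely what the dot → softmax → dot sequence in the code above computes, here is a small NumPy sketch of the same Attention for a single example (toy sizes, independent of Keras):
import numpy as np
h1 = np.random.rand(4, 2)   # Question hidden states: [question length, hidden size]
h2 = np.random.rand(3, 2)   # Answer hidden states: [answer length, hidden size]
scores = h2 @ h1.T          # [3, 4]: score of each Answer word against each Question word
a = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over Question words
c = a @ h1                  # [3, 2]: a Question context vector for each Answer word
print(a.sum(axis=-1))       # => [1. 1. 1.]: each Answer word's attention weights sum to 1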
Next we implement everything from the mean pooling through the output layer.
Note that the softmax function is used at the very end.
For mean pooling:
from keras.layers.pooling import AveragePooling1D
y = AveragePooling1D(pool_size=2, strides=1)(x)
The shape of x is [batch_size, steps, features], and the shape of y will be [batch_size, downsampled_steps, features].
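In this model, pool_size is set to the full sequence length, so the layer simply averages the hidden states over all time steps. A quick NumPy sketch of the equivalent computation (toy values):
import numpy as np
x = np.array([[[1., 2.], [3., 4.], [5., 6.]]])  # [batch_size=1, steps=3, features=2]
print(x.mean(axis=1))  # => [[3. 4.]], the same as AveragePooling1D(pool_size=3) followed by squeezing the steps axis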
Here is a usage example:
from keras.layers import Input, Dense, Dropout, Lambda, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.layers.merge import dot, concatenate, subtract, multiply
from keras.layers.core import Activation
from keras.layers.pooling import AveragePooling1D
from keras import backend as K
from keras.models import Model
batch_size = 32  # Batch size
vocab_size = 1000  # Vocabulary size
embedding_dim = 100  # Dimensionality of the word vectors
seq_length1 = 20  # Question length
seq_length2 = 10  # Answer length
lstm_units = 200  # Dimensionality of the LSTM hidden state vector
hidden_dim = lstm_units * 2  # Dimensionality of the final output vector
def abs_sub(x):
    return K.abs(x[0] - x[1])
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
input1 = Input(shape=(seq_length1,))
embed1 = embedding(input1)
bilstm1 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed1)
h1 = Dropout(0.2)(bilstm1)
input2 = Input(shape=(seq_length2,))
embed2 = embedding(input2)
bilstm2 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed2)
h2 = Dropout(0.2)(bilstm2)
# Compute the attention scores: dot products between each Answer hidden state and each Question hidden state
product = dot([h2, h1], axes=2)  # shape: [batch size, answer length, question length]
a = Activation('softmax')(product)
c = dot([a, h1], axes=[2, 1])
c_h2 = concatenate([c, h2], axis=2)
h = Dense(hidden_dim, activation='tanh')(c_h2)
# The mean pooling and the output layer are implemented here.
mean_pooled_1 = AveragePooling1D(pool_size=seq_length1, strides=1, padding='valid')(h1)
mean_pooled_2 = AveragePooling1D(pool_size=seq_length2, strides=1, padding='valid')(h)
mean_pooled_1 = Reshape((lstm_units * 2,))(mean_pooled_1)
mean_pooled_2 = Reshape((lstm_units * 2,))(mean_pooled_2)
sub = Lambda(abs_sub)([mean_pooled_1, mean_pooled_2])
mult = multiply([mean_pooled_1, mean_pooled_2])
con = concatenate([mean_pooled_1, mean_pooled_2, sub, mult], axis=-1)
#con = Reshape((lstm_units * 2 * 4,))(con)
output = Dense(2, activation='softmax')(con)
model = Model(inputs=[input1, input2], outputs=output)
model.summary()
model.compile(optimizer="adam", loss="categorical_crossentropy")
After building the model, we train it.
We assume that data with all preprocessing done except padding, i.e. already converted to ID sequences, has been prepared under ./nlp_data/.
The dictionary for converting words to IDs is assumed to be stored in ./nlp_data/word2id.json.
The training data file is ./nlp_data/preprocessed_train.json and the evaluation data is ./nlp_data/preprocessed_val.json.
The data in preprocessed_train.json looks like this, for example:
{'answerChoices': {'a': [1082, 1181, 586, 2952, 0],
'b': [1471, 2492, 773, 0, 1297],
'c': [811, 2575, 0, 1181, 2841, 0],
'd': [2031, 1984, 1099, 0, 3345, 975, 87, 697, 1366]},
'correctAnswer': 'a',
'question': [544, 0]}
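These preprocessed files are provided, but if you want to generate them yourself, a sketch along the lines of the earlier preprocessing code could look like the following (it assumes the word2id dictionary built above is in scope; the file names are the ones used in this article):
import json
from nltk.tokenize import word_tokenize

def preprocess(s):
    return word_tokenize(s.lower())

def to_ids(words, word2id):
    return [word2id.get(w, 0) for w in words]  # unknown words map to the <unk> ID 0

with open("./nlp_data/train.json") as f:
    train = json.load(f)

preprocessed = []
for t in train:
    preprocessed.append({
        "question": to_ids(preprocess(t["question"]), word2id),
        "answerChoices": {k: to_ids(preprocess(v), word2id) for k, v in t["answerChoices"].items()},
        "correctAnswer": t["correctAnswer"],
    })

with open("./nlp_data/preprocessed_train.json", "w") as f:
    json.dump(preprocessed, f)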
Here is a usage example:
import json
import numpy as np
from keras.layers import Input, Dense, Dropout, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.layers.merge import dot, concatenate
from keras.layers.core import Activation
from keras.layers.pooling import AveragePooling1D
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
with open("./nlp_data/word2id.json", "r") as f:
    word2id = json.load(f)
batch_size = 500  # Batch size
vocab_size = len(word2id)  # Vocabulary size
embedding_dim = 100  # Dimensionality of the word vectors
seq_length1 = 20  # Question length
seq_length2 = 10  # Answer length
lstm_units = 200  # Dimensionality of the LSTM hidden state vector
hidden_dim = 200  # Dimensionality of the final output vector
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
input1 = Input(shape=(seq_length1,))
embed1 = embedding(input1)
bilstm1 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed1)
h1 = Dropout(0.2)(bilstm1)
input2 = Input(shape=(seq_length2,))
embed2 = embedding(input2)
bilstm2 = Bidirectional(LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed2)
h2 = Dropout(0.2)(bilstm2)
# Compute the attention scores: dot products between each Answer hidden state and each Question hidden state
product = dot([h2, h1], axes=2)  # shape: [batch size, answer length, question length]
a = Activation('softmax')(product)
c = dot([a, h1], axes=[2, 1])
c_h2 = concatenate([c, h2], axis=2)
h = Dense(hidden_dim, activation='tanh')(c_h2)
mean_pooled_1 = AveragePooling1D(pool_size=seq_length1, strides=1, padding='valid')(h1)
mean_pooled_2 = AveragePooling1D(pool_size=seq_length2, strides=1, padding='valid')(h)
con = concatenate([mean_pooled_1, mean_pooled_2], axis=-1)
con = Reshape((lstm_units * 2 + hidden_dim,))(con)
output = Dense(2, activation='softmax')(con)
model = Model(inputs=[input1, input2], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy")
with open("./nlp_data/preprocessed_train.json", "r") as f:
    train = json.load(f)

questions = []
answers = []
outputs = []
for t in train:
    for i, ans in t["answerChoices"].items():
        if i == t["correctAnswer"]:
            outputs.append([1, 0])
        else:
            outputs.append([0, 1])
        # Append the question and this answer choice as one (question, answer) pair
        questions.append(t["question"])
        answers.append(ans)

questions = pad_sequences(questions, maxlen=seq_length1, dtype=np.int32, padding='post', truncating='post', value=0)
answers = pad_sequences(answers, maxlen=seq_length2, dtype=np.int32, padding='post', truncating='post', value=0)
outputs = np.array(outputs)

# Train the model (here only on the first 1000 pairs)
model.fit([questions[:10*100], answers[:10*100]], outputs[:10*100], batch_size=batch_size)

# If you are working locally, run the following code to save the model.
# model.save_weights("./nlp_data/model.hdf5")
# model_json = model.to_json()
# with open("./nlp_data/model.json", "w") as f:
#     json.dump(model_json, f)
Finally, we test the model on the evaluation data.
Since this is a binary classification task, we compute accuracy, precision, and recall:
Accuracy = (TP + TN) / (TP + FP + FN + TN), Precision = TP / (TP + FP), Recall = TP / (TP + FN),
where "positive" means the (question, answer) pair is labeled as a correct answer.
The model here was trained for 5 epochs, and a trained model is provided at "./nlp_data/trained_model.hdf5".
Here is a usage example:
import json
import numpy as np
from keras.models import model_from_json
from keras.preprocessing.sequence import pad_sequences
with open("./nlp_data/preprocessed_val.json", "r") as f:
    val = json.load(f)
seq_length1 = 20 #Question length
seq_length2 = 10 #Answer length
questions = []
answers = []
outputs = []
for t in val:
    for i, ans in t["answerChoices"].items():
        if i == t["correctAnswer"]:
            outputs.append([1, 0])
        else:
            outputs.append([0, 1])
        questions.append(t["question"])
        answers.append(ans)
questions = pad_sequences(questions, maxlen=seq_length1, dtype=np.int32, padding='post', truncating='post', value=0)
answers = pad_sequences(answers, maxlen=seq_length2, dtype=np.int32, padding='post', truncating='post', value=0)
with open("./nlp_data/model.json", "r") as f:
    model_json = json.load(f)
model = model_from_json(model_json)
model.load_weights("./nlp_data/trained_model.hdf5")
pred = model.predict([questions, answers])
pred_idx = np.argmax(pred, axis=-1)
true_idx = np.argmax(outputs, axis=-1)
TP = 0
FP = 0
FN = 0
TN = 0
for p, t in zip(pred_idx, true_idx):
    if p == 0 and t == 0:
        TP += 1
    elif p == 0 and t == 1:
        FP += 1
    elif p == 1 and t == 0:
        FN += 1
    else:
        TN += 1
print("Correct answer rate:", (TP+TN)/(TP+FP+FN+TN))
print("Compliance rate:", TP/(TP+FP))
print("Recall:", TP/(TP+FN))
In Attention, when Attention is applied from a sentence s to a sentence t,
a_ij can be interpreted as representing how much attention the j-th word of s pays to the i-th word of t.
The matrix A whose (i, j) component is a_ij is called the Attention Matrix; by looking at the Attention Matrix you can visualize the relationship between the words of s and t.
Question words (horizontal axis) and answer words (vertical axis) that are strongly related are displayed in white.
Here is a usage example:
import matplotlib.pyplot as plt
import json
import numpy as np
from keras.layers import Input, Dense, Dropout, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import Bidirectional
from keras.layers.merge import dot, concatenate
from keras.layers.core import Activation
from keras.layers.pooling import AveragePooling1D
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.models import model_from_json
import mpl_toolkits.axes_grid1
batch_size = 32  # Batch size
embedding_dim = 100  # Dimensionality of the word vectors
seq_length1 = 20  # Question length
seq_length2 = 10  # Answer length
lstm_units = 200  # Dimensionality of the LSTM hidden state vector
hidden_dim = 200  # Dimensionality of the final output vector
with open("./nlp_data/preprocessed_val.json", "r") as f:
    val = json.load(f)
questions = []
answers = []
outputs = []
for t in val:
    for i, ans in t["answerChoices"].items():
        if i == t["correctAnswer"]:
            outputs.append([1, 0])
        else:
            outputs.append([0, 1])
        questions.append(t["question"])
        answers.append(ans)
questions = pad_sequences(questions, maxlen=seq_length1,
dtype=np.int32, padding='post', truncating='post', value=0)
answers = pad_sequences(answers, maxlen=seq_length2,
dtype=np.int32, padding='post', truncating='post', value=0)
with open("./nlp_data/word2id.json", "r") as f:
    word2id = json.load(f)

vocab_size = len(word2id)  # Vocabulary size
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)
input1 = Input(shape=(seq_length1,))
embed1 = embedding(input1)
bilstm1 = Bidirectional(
LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed1)
h1 = Dropout(0.2)(bilstm1)
input2 = Input(shape=(seq_length2,))
embed2 = embedding(input2)
bilstm2 = Bidirectional(
LSTM(lstm_units, return_sequences=True), merge_mode='concat')(embed2)
h2 = Dropout(0.2)(bilstm2)
# Compute the attention scores: dot products between each Answer hidden state and each Question hidden state
product = dot([h2, h1], axes=2)  # shape: [batch size, answer length, question length]
a = Activation('softmax')(product)
c = dot([a, h1], axes=[2, 1])
c_h2 = concatenate([c, h2], axis=2)
h = Dense(hidden_dim, activation='tanh')(c_h2)
mean_pooled_1 = AveragePooling1D(
pool_size=seq_length1, strides=1, padding='valid')(h1)
mean_pooled_2 = AveragePooling1D(
pool_size=seq_length2, strides=1, padding='valid')(h)
con = concatenate([mean_pooled_1, mean_pooled_2], axis=-1)
con = Reshape((lstm_units * 2 + hidden_dim,))(con)
output = Dense(2, activation='softmax')(con)
# Build the model so that it outputs both the attention matrix a and the prediction
prob_model = Model(inputs=[input1, input2], outputs=[a, output])
prob_model.load_weights("./nlp_data/trained_model.hdf5")
question = np.array([[2945, 1752, 2993, 1099, 122, 2717, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
answer = np.array([[2841, 830, 2433, 0, 0, 0, 0, 0, 0, 0]])
att, pred = prob_model.predict([question, answer])
id2word = {v: k for k, v in word2id.items()}
q_words = [id2word[w] for w in question[0]]
a_words = [id2word[w] for w in answer[0]]
f = plt.figure(figsize=(8, 8.5))
ax = f.add_subplot(1, 1, 1)
# add image
i = ax.imshow(att[0], interpolation='nearest', cmap='gray')
# add labels
ax.set_yticks(range(att.shape[1]))
ax.set_yticklabels(a_words)
ax.set_xticks(range(att.shape[2]))
ax.set_xticklabels(q_words, rotation=45)
ax.set_xlabel('Question')
ax.set_ylabel('Answer')
# add colorbar
divider = mpl_toolkits.axes_grid1.make_axes_locatable(ax)
cax = divider.append_axes('right', '5%', pad='3%')
plt.colorbar(i, cax=cax)
plt.show()