[PYTHON] Use ELMo and BERT to determine word similarity for polysemous words

Overview

This article tests whether ELMo, an encoder model that produces contextualized word embeddings (distributed word representations), can distinguish the different senses of polysemous words, by computing word similarities. For example, "mean" is polysemous:

--What do you **mean**? (meaning)
--a **mean** person (a stingy person)
--the **mean** value (the average value)

The same verification is performed on BERT, the de facto standard model for natural language processing in recent years, and the results are compared.

(Added on 2020/4/8) We additionally verified how the results change depending on the layer from which the embedding vectors are extracted.

Environment / models used

All calculations were done on Google Colaboratory.

For both ELMo and BERT, pretrained English models are used as-is, without fine-tuning. ELMo uses the model from TensorFlow Hub, and BERT uses the one from the official repository.

Problem setting

Models such as Word2vec and GloVe produce a single embedding vector per word, so they cannot distinguish which sense of a polysemous word is being used. In contrast, models such as ELMo and BERT assign different embedding vectors to the same word depending on its context, so they can be expected to distinguish the senses of polysemous words.

This time we take "right", which can mean "right (direction)", "correct", or "a right (entitlement)", and use the following example sentences.

--Meaning "right" My right arm is broken. Cover your right eye. Please turn right at the next corner. I got into the right lane.

--Meaning "correct" Your opinion is more or less right. I got the answer right. Please try to make things right again. It was quite right of you to refuse the offer.

--Meaning "right" I don't have a right to access that computer. Everyone has a right to enjoy his liberty. She has the right to criticize the government. Every person has a right to defend themselves.

We input these example sentences into the trained models, extract the embedding vector corresponding to each occurrence of "right", and compute the cosine similarity

\mathrm{cossim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}|\,|\mathbf{v}|}

to check whether occurrences of "right" used in the same sense have high similarity to each other.
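As a quick numeric illustration of the formula (a toy example, not part of the verification itself): parallel vectors give similarity 1, and orthogonal vectors give 0.

import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # parallel to u
w = np.array([3.0, -1.5, 0.0])  # orthogonal to u

cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
cos_uw = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
print(round(cos_uv, 3), round(cos_uw, 3))
# 1.0 0.0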

Implementation

Import the required libraries. Both the ELMo and BERT code use the TensorFlow 1.x series, but since March 27, 2020 Google Colaboratory defaults to the 2.x series, so the magic command %tensorflow_version 1.x is used to select 1.x.

import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%tensorflow_version 1.x
import tensorflow as tf
import tensorflow_hub as hub

Prepare the texts used for verification. BERT reads its input from a file, so the sentences are also written to a text file.

right_texts = ["My right arm is broken",
               "Cover your right eye",
               "Please turn right at the next corner",
               "I got into the right lane",
               "Your opinion is more or less right",
               "I got the answer right",
               "Please try to make things right again",
               "It was quite right of you to refuse the offer",
               "I don't have a right to access that computer",
               "Everyone has a right to enjoy his liberty",
               "She has the right to criticize the government",
               "Every person has a right to defend themselves",]

with open('right_texts.txt', mode='w') as f:
    f.write('\n'.join(right_texts))

Prepare functions to compute the matrix of pairwise cosine similarities.

def cos_sim(v1, v2):
  return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
  
def calc_sim_mat(arr):
  num = len(arr) # number of vectors contained in arr
  sim_mat = np.zeros((num, num))
  norm = np.apply_along_axis(lambda x: np.linalg.norm(x), 1, arr) # norm of each vector
  normed_arr = arr / np.reshape(norm, (-1,1))
  for i, vec in enumerate(normed_arr):
    sim = np.dot(normed_arr, np.reshape(vec, (-1,1)))
    sim = np.reshape(sim, -1) # flatten
    sim_mat[i] = sim
  return sim_mat
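As a small check of calc_sim_mat (a toy example added for illustration), the result should be symmetric with ones on the diagonal:

# Toy check: three 2-D vectors; the matrix is symmetric with a unit diagonal
toy = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
print(np.round(calc_sim_mat(toy), 2))
# [[1.   0.   0.71]
#  [0.   1.   0.71]
#  [0.71 0.71 1.  ]]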

ELMo

For ELMo we use the trained model (v3) from TensorFlow Hub. Usage is described on the model's page, and I also referred to the following article:

Try Word Embedding, ELMo with context in mind using TensorFlow Hub

(In fact, reading the above article is what prompted me to write this one.)

The ELMo module has a signature="default" mode that takes space-separated sentences and a signature="tokens" mode that takes lists of word tokens; this time we use the latter. We therefore prepare a tokenizer function that tokenizes and pads the texts.

elmo_url = "https://tfhub.dev/google/elmo/3"

def tokenizer(texts):
  PAD = ""
  tokens = [s.lower().split() for s in texts]
  lengths = [len(t) for t in tokens]
  max_len = max(lengths)
  tokens = [t + [PAD] * (max_len - len(t)) for t in tokens]
  return tokens, lengths
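For illustration (a quick check added here, not in the original flow), tokenizer lower-cases, splits on whitespace, and pads every sentence to the length of the longest one:

demo_tokens, demo_lengths = tokenizer(["Cover your right eye",
                                       "My right arm is broken"])
print(demo_tokens)
# [['cover', 'your', 'right', 'eye', ''], ['my', 'right', 'arm', 'is', 'broken']]
print(demo_lengths)
# [4, 5]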

def elmo_embed(texts):
    tokens, lengths = tokenizer(texts)
    elmo = hub.Module(elmo_url, trainable=False)
    embeddings = elmo(
        inputs={
        "tokens": tokens,
        "sequence_len": lengths
        },
        signature="tokens",
        as_dict=True)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        embeddings_dict = sess.run(embeddings)

    return tokens, embeddings_dict

Run the computation and obtain the embedding vectors.

tokens, elmo_embeddings_dict = elmo_embed(right_texts)
print(elmo_embeddings_dict.keys())
# dict_keys(['lstm_outputs1', 'lstm_outputs2', 'word_emb', 'sequence_len', 'elmo', 'default'])

As explained on the TensorFlow Hub page (https://tfhub.dev/google/elmo/3), the output of the ELMo module is a dictionary containing several embedding tensors. Each key is described as follows.

  • word_emb: the character-based word representations with shape [batch_size, max_length, 512].
  • lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024].
  • lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024].
  • elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024]
  • default: a fixed mean-pooling of all contextualized word representations with shape [batch_size, 1024].

word_emb is the output of the first, context-independent embedding layer. Only this vector has dimension 512; when it is combined with the other vectors, two copies of the word_emb vector appear to be concatenated to bring the dimension to 1024. As stated in the original paper, the ELMo output is a linear combination of the three embedding vectors word_emb, lstm_outputs1, and lstm_outputs2 with trainable coefficients, and this combination is stored in elmo. Since we are not training a downstream task, we specified trainable=False when loading the module, and the TensorFlow Hub page does not say what the ELMo coefficients become in that case. Examining the vectors obtained here, the coefficients appear to simply be 1/3 each. The ELMo vectors did not change even with trainable=True, so the initial values of the trainable weights also seem to be 1/3.

default is the average of the ELMo vectors of all words in a sentence, and can be interpreted as a distributed representation of the whole sentence. sequence_len is not mentioned in the description above, but it is a list containing the number of tokens (excluding padding) in each sentence.
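For reference, these shapes can be checked directly on the dictionary obtained above (a quick inspection; max_length is the padded length of the longest example sentence):

for key in ['word_emb', 'lstm_outputs1', 'lstm_outputs2', 'elmo', 'default']:
  print(key, elmo_embeddings_dict[key].shape)
# word_emb and the LSTM/ELMo outputs have shape (12, max_length, dim),
# while default has shape (12, 1024), one vector per sentence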

According to the original paper, the output of the first LSTM layer tends to capture syntactic information, while the second layer tends to capture semantic information. Given that this task is about distinguishing word senses, we first use lstm_outputs2. The following functions extract only the embedding vectors corresponding to "right".

def my_index(l, x, default=False):
  return l.index(x) if x in l else default

def find_position(tokens, word):
  pos = [my_index(t, word) for t in tokens]
  assert False not in pos
  return pos

def extract_elmo_vectors(embeddings_dict, tokens, word, layer):
  embeddings = embeddings_dict[layer]
  num_sentences = embeddings.shape[0]
  vec_dim = embeddings.shape[2]
  vectors = np.zeros((num_sentences, vec_dim))
  pos = find_position(tokens, word)
  for i in range(num_sentences):
    vectors[i] = embeddings[i][pos[i]][:]
  return vectors

elmo_vectors = extract_elmo_vectors(elmo_embeddings_dict, tokens, 'right', 'lstm_outputs2')
print(elmo_vectors.shape)
# (12, 1024)
elmo_sim_mat = calc_sim_mat(elmo_vectors)

We now have the embedding vectors elmo_vectors for "right" in each sentence and the similarity matrix elmo_sim_mat. Before looking at the results, let's do the same calculation with BERT.

BERT

BERT is a model designed to be fine-tuned with supervised learning for downstream tasks, but as bert-as-service shows, it can also be used as an encoder to obtain distributed representations of sentences. Here we use BERT to compute distributed representations of words.

First, clone BERT's Official Repository (https://github.com/google-research/bert).

!git clone https://github.com/google-research/bert.git

We use the BERT-Base, Uncased model. Download and unpack the trained parameters.

!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip && \
unzip uncased_L-12_H-768_A-12.zip && \
rm uncased_L-12_H-768_A-12.zip

The official repository provides extract_features.py for extracting embedding vectors, so we simply run it as follows. --input_file specifies the input file prepared above, and --output_file specifies a jsonl file (any name) in which to save the output. The next three arguments point to the trained model downloaded above. --layers specifies the output layers to use as embedding vectors; all layers are specified here for later comparison.

!python ./bert/extract_features.py \
  --input_file=right_texts.txt \
  --output_file=right_output.jsonl \
  --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=uncased_L-12_H-768_A-12/bert_model.ckpt \
  --do_lower_case=True \
  --layers 0,1,2,3,4,5,6,7,8,9,10,11
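Before parsing the output, it can help to peek at one record of the jsonl file; each line is one JSON object per input sentence, and the field names below are the ones used by the extraction function that follows.

# Inspect the first record: 'features' holds one entry per token,
# and each entry carries the per-layer 'values' vectors
with open('right_output.jsonl', 'r') as f:
    first = json.loads(f.readline())
print(first['features'][0]['token'])                     # first token of the first sentence
print(len(first['features'][0]['layers']))               # 12 layers, as requested above
print(len(first['features'][0]['layers'][0]['values']))  # 768 dimensions for BERT-Base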

Prepare a function that extracts the embedding vector of the target word token from the output jsonl file. (I referred to this page.)

def extract_bert_vectors(input_path, target_token, target_layer=-2):
  with open(input_path, 'r') as f:
      output_jsons = f.readlines()
  
  vectors = []
  for output_json in output_jsons:
      output = json.loads(output_json)
      for feature in output['features']:
          if feature['token'] != target_token: continue
          for layer in feature['layers']:
              if layer['index'] != target_layer: continue
              vectors.append(layer['values'])
  return np.array(vectors)

Extract the vectors corresponding to "right" and compute the similarity matrix. The vectors are taken from the penultimate layer (layer index 10 of the 12 layers).

bert_vectors = extract_bert_vectors('./right_output.jsonl', target_layer=10, target_token='right')
print(bert_vectors.shape)
# (12, 768)
bert_sim_mat = calc_sim_mat(bert_vectors)

Results

Now, let's plot the calculation result. Define a function for plotting with seaborn heatmap.

def show_sim_mat(sim_mat, texts, title=None, export_fig=False):
  sns.set(font_scale=1)
  g = sns.heatmap(
      sim_mat,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  g.set_xticklabels(texts, rotation='vertical')
  g.set_yticklabels(texts, rotation=False)
  if title:
    plt.title(title, fontsize=24)
  if export_fig:
    plt.savefig(export_fig, bbox_inches='tight')
  plt.show()

Run it for the ELMo and BERT results.

show_sim_mat(elmo_sim_mat, right_texts, title='ELMo')
show_sim_mat(bert_sim_mat, right_texts, title='BERT')

The results are shown below. What is plotted is the similarity between the vectors for "right", with the full sentence as each label. Since the sentences are arranged in groups of four that use "right" in the same sense, ideally each 4x4 diagonal block would be dark and the off-diagonal parts light. How does it look?

(Figures: elmo_sim_mat.png and bert_sim_mat.png, the similarity heatmaps for ELMo and BERT.)

First, in both figures, the similarities within the last block, where "right" means "a right (entitlement)", are clearly higher. All of these occurrences of "right" appear together with "have/has" and "to", giving the sentences a similar structure, so it is understandable that this sense is easy to separate from the others. For "right (direction)" and "correct" the picture is less clear-cut, but there are places, such as the first two sentences, where the similarity between sentences with the same sense is indeed high.

Comparing ELMo and BERT by eye, BERT looks better. However, since the overall level of cosine similarity values tends to differ between models, it is more meaningful to look at the ranking of similarities than at the raw values. We therefore introduce a quantitative score based on the similarity ranking to compare the two models.

The following function defines the similarity score. block_size is the number of sentences in which the word is used with the same meaning, 4 in this example. For each sentence, the others are sorted in descending order of similarity, and points are awarded for every sentence that actually uses the word in the same sense and appears within the top block_size ranks. Rank 1 is excluded because it is always the sentence itself; here, points are therefore awarded for same-sense sentences appearing in ranks 2 to 4. The score is normalized so that the maximum is 1. The per-sentence scores are stored in points_arr, and av_point is their average.

def eval_sim_points(sim_mat, block_size):
  num_data = len(sim_mat)
  points_list = []
  for i in range(num_data):
    block_id = int(i / block_size)
    # 1 for sentences in the same meaning block as sentence i, 0 otherwise
    points = np.array([1 if block_id * block_size <= j < (block_id + 1) * block_size else 0 for j in range(num_data)])
    sorted_args = np.argsort(sim_mat[i])[::-1]  # sentence indices in descending order of similarity
    sorted_points = points[sorted_args]
    point = np.mean(sorted_points[1:block_size])  # skip rank 1, which is always the sentence itself
    points_list.append(point)
  points_arr = np.array(points_list)
  av_point = np.mean(points_arr)
  return av_point, points_arr
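As a sanity check of the scoring (hypothetical toy matrices, not data from this experiment), a similarity matrix that perfectly groups two blocks of two items should score 1.0, and one that always ranks the wrong block higher should score 0.0.

toy_good = np.array([[1.0, 0.9, 0.1, 0.2],
                     [0.9, 1.0, 0.2, 0.1],
                     [0.1, 0.2, 1.0, 0.8],
                     [0.2, 0.1, 0.8, 1.0]])
toy_bad = np.array([[1.0, 0.1, 0.9, 0.8],
                    [0.1, 1.0, 0.8, 0.9],
                    [0.9, 0.8, 1.0, 0.1],
                    [0.8, 0.9, 0.1, 1.0]])
print(eval_sim_points(toy_good, block_size=2)[0])  # 1.0
print(eval_sim_points(toy_bad, block_size=2)[0])   # 0.0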

The result of the execution is as follows.

# ELMo
elmo_point, elmo_points_arr = eval_sim_points(elmo_sim_mat, 4)
print(np.round(elmo_point, 2))
# 0.61
print(np.round(elmo_points_arr, 2))
# [0.33 0.33 0.   0.67 0.67 0.67 0.67 0.   1.   1.   1.   1.  ]
# BERT
bert_point, bert_points_arr = eval_sim_points(bert_sim_mat, 4)
print(np.round(bert_point, 2))
# 0.78
print(np.round(bert_points_arr, 2))
# [1.   1.   0.67 1.   0.67 0.33 0.33 0.33 1.   1.   1.   1.  ]

The small amount of data leaves questions about reliability, but quantifying the results makes the comparison clear. The average scores were ELMo: 0.61 and BERT: 0.78, so BERT comes out ahead. Looking at the per-sentence scores, both models get full marks on all four sentences where "right" means "a right (entitlement)", and BERT also scores highly on the four "right (direction)" sentences. Both models struggle with the four "correct" sentences, although on average ELMo does somewhat better on them.

Comparison of layers used to extract the vectors

In the results so far, ELMo used the embedding vectors extracted from the second LSTM layer, and BERT used those from the second-to-last layer. Finally, let's see how the similarity scores change depending on the layer from which the vectors are extracted.

ELMo

ELMo outputs four kinds of word vectors: the context-independent embedding layer word_emb, the first LSTM layer lstm_outputs1, the second LSTM layer lstm_outputs2, and the ELMo vector elmo, the weighted sum of these three. For word_emb, the vector for "right" is identical in every sentence, so it cannot distinguish polysemous words. We extract the remaining three context-dependent vectors:

elmo_vectors_e = extract_elmo_vectors(elmo_embeddings_dict, tokens, 'right', 'elmo')
elmo_vectors_1 = extract_elmo_vectors(elmo_embeddings_dict, tokens, 'right', 'lstm_outputs1')
elmo_vectors_2 = extract_elmo_vectors(elmo_embeddings_dict, tokens, 'right', 'lstm_outputs2')
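The scores in the table below can then be obtained by applying the functions defined earlier to each of these vectors, for example:

for name, vecs in [('lstm_outputs1', elmo_vectors_1),
                   ('lstm_outputs2', elmo_vectors_2),
                   ('elmo', elmo_vectors_e)]:
  point, _ = eval_sim_points(calc_sim_mat(vecs), 4)
  print(name, np.round(point, 2))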

The average similarity scores for each of these vectors are as follows.

Layer             Similarity points
LSTM 1st layer    0.67
LSTM 2nd layer    0.61
ELMo vector       0.64

According to the original paper, the second LSTM layer tends to capture semantic information, so I expected it to be more accurate, but the first layer actually scored higher. With a small dataset of only 12 sentences, nothing can be said with certainty, but perhaps sentence structure matters for distinguishing the senses of polysemous words in this task. The ELMo vector scores worse than the first LSTM layer, which seems reasonable given that it also includes the context-independent vector.

BERT

BERT-Base consists of 12 Transformer layers, so let's compare all 12. Since the output file right_output.jsonl stores the output of every layer, the vectors can be retrieved as follows. We also compute the average vector over all layers and over the last 6 layers.

bert_vectors_list = []
for i in range(12):
  bert_vectors_list.append(extract_bert_vectors('./right_output.jsonl', target_layer=i, target_token='right'))

# average of all the layers
bert_vector_av_all = np.mean(bert_vectors_list, axis=0)
# average of the last 6 layers
bert_vector_av_last6 = np.mean(bert_vectors_list[6:], axis=0)
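The per-layer scores in the table below can be computed in the same way, reusing the functions defined earlier:

for i, vecs in enumerate(bert_vectors_list):
  point, _ = eval_sim_points(calc_sim_mat(vecs), 4)
  print('layer', i + 1, np.round(point, 2))

for name, vecs in [('average of all layers', bert_vector_av_all),
                   ('average of last 6 layers', bert_vector_av_last6)]:
  point, _ = eval_sim_points(calc_sim_mat(vecs), 4)
  print(name, np.round(point, 2))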

The average similarity scores for these vectors are as follows.

Layer                      Similarity points
1st layer                  0.58
2nd layer                  0.67
3rd layer                  0.78
4th layer                  0.83
5th layer                  0.83
6th layer                  0.83
7th layer                  0.81
8th layer                  0.78
9th layer                  0.83
10th layer                 0.81
11th layer                 0.78
12th layer                 0.75
Average of all layers      0.81
Average of last 6 layers   0.83

The lower accuracy of the shallow layers near the input and of the final (12th) layer, which is strongly influenced by the pre-training tasks, is a plausible result. The highest scores come from several layers near the middle of the network and from the average over the last 6 layers. Even comparing the best layers of each model, BERT clearly outperforms ELMo.

Conclusion

ELMo and BERT are said to produce word representations that take context into account, but I had not seen an experiment like this one, so I summarized it in this article. For the examples tried here, both ELMo and BERT capture the context well enough to distinguish the senses of a polysemous word to some extent. Comparing the two, BERT performed better.

This time I dealt with distributed representations of words, but for real-world applications, distributed representations of sentences probably have a wider range of uses, so next I would like to experiment with sentence representations.
