[PYTHON] [Japanese version] Judgment of word similarity for polysemous words using ELMo and BERT

Overview

Natural language processing models such as ELMo and BERT provide context-dependent word embeddings, so we can expect them to distinguish which sense of a polysemous word is being used in a given sentence. In a previously posted article,

- Use ELMo and BERT to determine word similarity for polysemous words

I used the English versions of ELMo and BERT to check whether three senses of the word "right" could be distinguished: "right-hand side," "correct," and "a right (entitlement)." This time, I run a similar experiment on Japanese polysemous words using pre-trained Japanese versions of ELMo and BERT.

For both ELMo and BERT, I use the pre-trained Japanese models published by Stockmark.

- Introduction of an ELMo model (using MeCab) trained on a large-scale Japanese business news corpus
- Introduction of a pre-trained BERT model (using MeCab) trained on a large-scale Japanese business news corpus

Problem setting

In the English-version experiment I distinguished three senses of "right," but I could not find a good Japanese example with three or more clearly distinct senses, so here I consider several polysemous words with two senses each. A total of 32 example sentences are used: 4 sentences for each sense of the four polysemous words 結構 (kekkō, rendered below as "quite"), 失敬 (shikkei, "disrespect"), 無心 (mushin, "innocent"), and 首 (kubi, "neck"). Most of the example sentences are taken from online Japanese dictionaries, with some modifications.

- **Quite (結構)**: Meaning A (not perfect, but satisfactory enough)

  - I made cookies for the first time; it wasn't difficult, and I think they came out quite delicious.
  - He doesn't say anything, but he is quite self-conscious about being short.
  - The restaurant over there is cheap but quite delicious.
  - Quite a lot of people don't wash their hands after using the toilet.

- **Quite (結構)**: Meaning B (that is fine; no more is needed)

  - I have had plenty; I am fine without any more.
  - You needn't go to such lengths for me; it's fine.
  - At this point I decided that the land would not be needed, and understood that it would be fine if I could ask for money instead.
  - Still, unlike other people, the teacher can work while playing, so that's fine.

- **Disrespect (失敬)**: Meaning A (being rude; impolite)

  - Please excuse my rudeness in showing you something so unsightly last night.
  - All in all, do you realize what a disrespectful thing you are saying?
  - He kept making disrespectful remarks, did not get on well at the workplace, and quit right away.
  - What are you saying? Don't be so disrespectful. This is no place for you. Go back!

- **Disrespect (失敬)**: Meaning B (making off with something; pinching)

  - I helped myself to a little something from my brother's bookshelf.
  - I'm pinching this ballpoint pen, if you don't mind.
  - A student made off with the policeman's hat and broke into a run in an instant.
  - Well, as he left, he made off with the letter that was not his from the table.

- **Innocent (無心)**: Meaning A (free of stray thoughts or desires)

  - While gazing absently at the brilliance of the moon, something like a memory inherited from ancient times was stirred in Tengo.
  - And I wanted to believe that if I kept up this effort single-mindedly, it would eventually lead to a breakthrough.
  - A deer was playing innocently among the plumes of susuki flowers and the glittering reddish-brown standing trees.
  - All day long the sound of the waves beat ceaselessly against the rocks on the shore, breaking and sending up spray.

- **Innocent (無心)**: Meaning B (begging someone for money or goods)

  - As soon as I visited here and consulted about what to do with myself, my sister's husband gave me 300 yen.
  - At that time, a woman holding a nursing baby stood at the gate and begged for money.
  - The arrested man's mother told reporters that the day before, the suspect had demanded money from her and assaulted her.
  - The upshot was that Kotaro was handed a begging letter and sent on the one-ri errand to his aged father.

- **Neck (首)**: Meaning A (the part of the body that connects the head and torso of a vertebrate)

  - I always tie my hair back because it gets in the way around my neck.
  - He bent his neck back and laughed in a strained voice.
  - There was white trim on the collar of the clothes, probably to set off the beauty of her long neck.
  - The employee frowned slightly, pondered, and then gave a polite shake of the head.

- **Neck (首)**: Meaning B (dismissal from a job)

  - Workers who are on leave because of an occupational accident cannot be fired.
  - The neck of every single member of this team is riding on this project.
  - If you don't want to be fired, get results!
  - I was fired from the company, but I couldn't tell my family, so every day I pretended to go to work and passed the time in the park.

For meaning B, 首 is usually written in katakana as クビ, but I use the kanji so that the surface form is identical to meaning A.

Each of these example sentences is fed into ELMo and BERT, the embedding vector of the target word is extracted, and the cosine similarity between the vectors is computed:

cossim(\mathbf{u} ,\mathbf{v} ) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}| \, |\mathbf{v}|}

The goal is to verify that occurrences of the word used in the same sense have high similarity.
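As a minimal sketch (with hypothetical vectors u and v), the cosine similarity of two embedding vectors can be computed with NumPy as follows; the pairwise version actually used later is calc_sim_mat.

import numpy as np

def cosine_similarity(u, v):
  # cos(u, v) = u . v / (|u| |v|)
  return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))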

Implementation

The calculations were done on Google Colaboratory, using the TensorFlow 1.x series.

Preparation

Import the required libraries and mount Google Drive.

import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Library for using Japanese with matplotlib
!pip install japanize_matplotlib
import japanize_matplotlib

#Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

The Stockmark pre-trained models use MeCab with the NEologd dictionary as the tokenizer. The installation procedure is the same as described in this article, but I repost it here.

#MeCab installation
!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y

#NEologd dictionary installation
%cd /content
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
%cd mecab-ipadic-neologd
!echo yes | ./bin/install-mecab-ipadic-neologd -n

#Install the library to call MeCab in python
!pip install mecab-python3
!pip install unidic-lite  # without this, mecab-python3 raises an error at runtime
import MeCab

Work in the directory /content/drive/'My Drive'/synonym_classification.

%cd /content/drive/'My Drive'/synonym_classification

Next, prepare the lists of example sentences. (The experiment itself uses the original Japanese sentences; English translations are shown below for readability.)

target_words = ["結構", "失敬", "無心", "首"]  # "quite", "disrespect", "innocent", "neck"

texts1 = ["I made cookies for the first time, but it wasn't difficult and I think it was quite delicious.", 
          "He doesn't say anything, but he cares a lot about his shortness.", 
          "The restaurant over there is cheap but quite delicious.", 
          "There are quite a lot of people who don't wash their hands after the toilet.", 
          "I have enough. Any more is fine.", 
          "You don't have to do that.", 
          "At this time, I decided that the land would not be used, and I knew that it would be fine if I could ask for money.", 
          "Still, unlike other people, the teacher is fine because he can work while playing."]

texts2 = ["Last night I was disappointed to show you something unsightly.", 
          "All in all, do you feel that you are saying such a disrespectful thing?", 
          "He uttered all the disrespectful remarks, but he didn't go well at the workplace and immediately quit.", 
          "What do you say Don't be disrespectful. This is not where you come. go back!", 
          "I've been a little disappointed from my brother's bookshelf.", 
          "I'm sorry for this ballpoint pen.", 
          "A student disrespected the policeman's hat and started running at a glance.", 
          "Well, when he left, he disrespected the non-owned letter from the table."]

texts3 = ["While gazing at the brilliance of the moon, something like a memory that has been passed down from ancient times was recalled in Tengo.", 
          "The sound of the waves instinctively hit the rocks on the shore all day, crushing and splashing. ..", 
          "And I wanted to think that if I continued this effort innocently, it would eventually lead to a breakthrough.", 
          "Deer are playing innocently in the fire of Susukino flowers and the sparkling reddish-brown trees.", 
          "As soon as I visited this place and consulted about how to swing myself, my sister's husband gave me 300 yen.", 
          "At that time, a woman with a suckling baby stood at the gate and begged for innocence.", 
          "The arrested man's mother revealed to the interview that the suspect had been assaulted the day before with no money.", 
          "As a result, Kotaro was sent a letter of innocence and was forced to use the road of Ichiri to his old father.", ]
          
texts4 = ["I always tie my hair because it gets in the way when I squeeze my neck.", 
          "He bends his head backwards and laughs with a tense voice.", 
          "There was a white decoration on the collar of the clothes, probably to enhance the beauty of the long neck.", 
          "The employee frowned a little, pondered, and then politely shook his head.", 
          "Workers who are on leave due to an occupational accident cannot be fired.", 
          "The project depends on each and every one of the team.", 
          "If you don't want to be fired, get results!", 
          "He was fired at the company, but he couldn't tell his family and spent every day in the park pretending to go to the company."]

Since the input to BERT must be given as files, write the sentences to text files.

with open('texts1.txt', mode='w') as f:
  f.write('\n'.join(texts1))
with open('texts2.txt', mode='w') as f:
  f.write('\n'.join(texts2))
with open('texts3.txt', mode='w') as f:
  f.write('\n'.join(texts3))
with open('texts4.txt', mode='w') as f:
  f.write('\n'.join(texts4))

Since the input to ELMo needs to be tokenized in advance, prepare a tokenizer function and tokenize each sentence.

def mecab_tokenizer(texts, 
                    dict_path="/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"):
  mecab = MeCab.Tagger("-Owakati -d " + dict_path)
  token_list = []
  for s in texts:
      parsed = mecab.parse(s).replace('\n', '')
      token_list += [parsed.split()]
  return token_list
token_list1 = mecab_tokenizer(texts1)
token_list2 = mecab_tokenizer(texts2)
token_list3 = mecab_tokenizer(texts3)
token_list4 = mecab_tokenizer(texts4)

The contents of token_list1 are as follows.

['初めて', 'クッキー', 'を', '作っ', 'た', 'けど', '、', '難しく', 'なかっ', 'た', 'し', '、', '結構', 'おいしく', '作れ', 'た', 'と', '思う', '。']
['彼', 'は', '口', 'に', 'は', '出さ', 'ない', 'が', '、', '背', 'が', '低い', 'こと', 'を', '結構', '気', 'に', 'し', 'て', 'い', 'ます', '。']
['あそこ', 'の', 'レストラン', 'は', '、', '安い', 'のに', '結構', 'おいしい', 'ん', 'です', 'よ', '。']
['トイレ', 'の', '後', 'で', '、', '手', 'を', '洗わ', 'ない', '人', 'は', '結構', 'たくさん', 'い', 'ます', '。']
['もう', '十分', 'いただき', 'まし', 'た', '。', 'これ', '以上', 'は', '結構', 'です', '。']
['そんな', 'こと', 'まで', 'し', 'て', 'いただか', 'なく', 'て', 'も', '結構', 'です', '。']
['この', 'とき', 'に', 'は', '土地', 'は', 'いら', 'ない', 'こと', 'に', 'し', 'て', '、', 'お金', 'で', 'お願い', 'が', 'でき', 'ますれ', 'ば', '結構', 'だ', 'と', '承知', 'し', 'て', 'い', 'た', 'の', 'で', 'ござい', 'ます', 'が', '。']
['それでも', '、', '先生', 'は', '他', 'の', '人', 'と', '違っ', 'て', '、', '遊び', 'ながら', '仕事', 'が', 'できる', 'ので', '結構', 'で', 'ござい', 'ます', '。']

For the polysemous-word task, the target word has to be segmented as a single token, and indeed 結構 ("quite") appears as a single token in every sentence. The other target words are segmented as single tokens as well. In fact, finding polysemous words that satisfy this condition was the hardest part of this experiment.
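As a quick sanity check of my own (not part of the original procedure), one can confirm that each target word occurs exactly once as a single token in every sentence of its group:

# Each target word should appear exactly once as a single token per sentence
for tokens_group, word in zip([token_list1, token_list2, token_list3, token_list4],
                              target_words):
  for tokens in tokens_group:
    assert tokens.count(word) == 1, (word, tokens)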

I actually wanted to compare other pre-trained Japanese BERT models in addition to the Stockmark model, but the single-token condition then breaks down: even with dictionaries other than NEologd, 結構 remains a single token, yet when tokenizing with WordPiece or SentencePiece it is split into the subwords 結 and 構, so the polysemous-word task cannot be carried out. For this reason, only the Stockmark pre-trained ELMo and BERT are compared this time.

ELMo

The ELMo model uses this implementation (https://github.com/HIT-SCIR/ELMoForManyLangs) together with Stockmark's pre-trained parameters. It is used in the same way as in this earlier article of mine:

- Detect anomaly of sentences using ELMo, BERT, USE

!pip install overrides
!git clone https://github.com/HIT-SCIR/ELMoForManyLangs.git
%cd ./ELMoForManyLangs
!python setup.py install
%cd ..

Download the Stockmark pre-trained model from here and place it in the folder ./ELMo_ja_word_level. Two variants are available, a word-level embedding model and a character/word-level embedding model; this time I use the word-level model.

Now that the model is in place, create an `Embedder` instance and define functions that compute and extract the embedding vector of the target word.

from ELMoForManyLangs.elmoformanylangs import Embedder
from overrides import overrides

elmo_model_path = "./ELMo_ja_word_level"
elmo_embedder = Embedder(elmo_model_path, batch_size=64)

def my_index(l, x, default=-1):
  return l.index(x) if x in l else default

def find_position(token_list, word):
  pos_list = [my_index(t, word) for t in token_list]
  assert -1 not in pos_list
  return pos_list

def get_elmo_word_embeddings(token_list, target_word):
  embs_list = elmo_embedder.sents2elmo(token_list, output_layer=-2)
  pos_list = find_position(token_list, target_word)
  word_emb_list = []
  for i, embs in enumerate(embs_list):
    word_emb_list.append(embs[:, pos_list[i], :])
  return word_emb_list

The argument `output_layer` of the `Embedder` method `sents2elmo` specifies which layer(s) the embedding vectors are extracted from:

- 0: the context-independent word embedding layer
- 1: the first LSTM layer
- 2: the second LSTM layer
- -1: the average of the three layers (default)
- -2: all three layers

This time `output_layer=-2` is specified so that the different output layers can be compared.
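For reference, a small sketch (assuming the elmo_embedder defined above) of how the shape of the returned arrays depends on output_layer:

sample = token_list1[:1]  # a single tokenized sentence
emb_avg = elmo_embedder.sents2elmo(sample, output_layer=-1)[0]  # (n_tokens, 1024)
emb_all = elmo_embedder.sents2elmo(sample, output_layer=-2)[0]  # (3, n_tokens, 1024)
print(emb_avg.shape, emb_all.shape)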

Feed in the tokenized texts and compute the embedding vectors of the target words.

elmo_embeddings = get_elmo_word_embeddings(token_list1, target_words[0])
elmo_embeddings += get_elmo_word_embeddings(token_list2, target_words[1])
elmo_embeddings += get_elmo_word_embeddings(token_list3, target_words[2])
elmo_embeddings += get_elmo_word_embeddings(token_list4, target_words[3])
elmo_embeddings = np.array(elmo_embeddings)

print(elmo_embeddings.shape)
# (32, 3, 1024)

Since the context-independent word embeddings of the first layer cannot distinguish polysemous words, we use the first and second LSTM layers and the average of the three layers (the so-called ELMo vector).

elmo_lstm1_embeddings = elmo_embeddings[:, 1, :]
elmo_lstm2_embeddings = elmo_embeddings[:, 2, :]
elmo_mean_embeddings = np.mean(elmo_embeddings, axis=1)
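Each of these should now contain one 1024-dimensional vector per example sentence; a quick shape check I added:

print(elmo_lstm2_embeddings.shape)
# (32, 1024)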

Before computing the cosine similarities of the embedding vectors, prepare the BERT embedding vectors as well.

BERT

The Stockmark pre-trained BERT model is used basically as described in this earlier article of mine:

- Detect anomaly of sentences using ELMo, BERT, USE

First, download the Stockmark pre-trained model (TensorFlow version) from this link and place it in the directory ./BERT_base_stockmark. Next, switch to the TensorFlow 1.x series and clone the official BERT repository.

%tensorflow_version 1.x
import tensorflow as tf

!git clone https://github.com/google-research/bert.git

In the cloned repository, tokenization.py needs to be modified so that MeCab is used as the tokenizer. The modification is lengthy, so please refer to this article instead of reposting it here. The script `extract_features.py`, which retrieves the embedding vectors, does not need to be changed this time. Running the following outputs the embedding vectors of all tokens to `bert_embeddings*.jsonl`; all 12 layers are specified as the layers from which to extract the embeddings.
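As a rough illustration only (the actual patch is longer and is described in the referenced article), the core of the change is to segment the text with MeCab + NEologd before WordPiece is applied, along these lines:

import MeCab

# Hypothetical helper sketching the idea of the tokenization.py change:
# word-segment the input with MeCab + NEologd before WordPiece is applied.
NEOLOGD_DIC = "/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
mecab_wakati = MeCab.Tagger("-Owakati -d " + NEOLOGD_DIC)

def mecab_word_tokenize(text):
  return mecab_wakati.parse(text).strip().split()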

#BERT execution
for i in range(1, 5):
  input_file = 'texts' + str(i) + '.txt'
  output_file = 'bert_embeddings' + str(i) + '.jsonl'

  !python ./bert/extract_features_mecab_neologd.py \
    --input_file=$input_file \
    --output_file=$output_file \
    --vocab_file=./BERT_base_stockmark/vocab.txt \
    --bert_config_file=./BERT_base_stockmark/bert_config.json \
    --init_checkpoint=./BERT_base_stockmark/output_model.ckpt \
    --layers 0,1,2,3,4,5,6,7,8,9,10,11

Extract only the embedding vectors of the target words from the output jsonl files.

def extract_bert_embeddings(input_path, target_token, target_layer=10): 
  with open(input_path, 'r') as f:
      output_jsons = f.readlines()

  embs = []
  for output_json in output_jsons:
      output = json.loads(output_json)
      for feature in output['features']:
          if feature['token'] != target_token: continue
          for layer in feature['layers']:
              if layer['index'] != target_layer: continue
              embs.append(layer['values'])
  return np.array(embs)
bert_embeddings = []
for i in range(12):
  emb1 = extract_bert_embeddings('./bert_embeddings1.jsonl', 
                                 target_layer=i, target_token=target_words[0])
  emb2 = extract_bert_embeddings('./bert_embeddings2.jsonl', 
                                 target_layer=i, target_token=target_words[1])
  emb3 = extract_bert_embeddings('./bert_embeddings3.jsonl', 
                                 target_layer=i, target_token=target_words[2])
  emb4 = extract_bert_embeddings('./bert_embeddings4.jsonl', 
                                 target_layer=i, target_token=target_words[3])
  embeddings = np.vstack([emb1, emb2, emb3, emb4])
  bert_embeddings.append(embeddings)
bert_embeddings = np.array(bert_embeddings)
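If every target word is found exactly once per sentence, the collected array should contain one 768-dimensional vector per layer and per sentence; a quick shape check I added (768 being the hidden size of BERT-base):

print(bert_embeddings.shape)
# (12, 32, 768)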

Results

Now that we have the embedding vectors of both models, let's judge word similarity. First, prepare a function that computes the matrix of pairwise cosine similarities.

def calc_sim_mat(arr):
  num = len(arr) # number of vectors contained in arr
  sim_mat = np.zeros((num, num))
  norm = np.apply_along_axis(lambda x: np.linalg.norm(x), 1, arr) # norm of each vector
  normed_arr = arr / np.reshape(norm, (-1,1))
  for i, vec in enumerate(normed_arr):
    sim = np.dot(normed_arr, np.reshape(vec, (-1,1)))
    sim = np.reshape(sim, -1) #flatten
    sim_mat[i] = sim
  return sim_mat

First, let's look at the results using the embedding vectors of the second LSTM layer for ELMo and of the second-to-last (11th) layer for BERT.

sim_mat_elmo = calc_sim_mat(elmo_lstm2_embeddings)
sim_mat_bert = calc_sim_mat(bert_embeddings[10])

Prepare and execute a function that visualizes the calculation result.

def show_sim_mat(sim_mat, labels, title=None, export_fig=False):
  sns.set(font_scale=1.1, font="IPAexGothic") 
  g = sns.heatmap(
      sim_mat,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
  ticks_pos = range(2, 32, 4)
  plt.xticks(ticks=ticks_pos, labels=labels, rotation='vertical')
  plt.yticks(ticks=ticks_pos, labels=labels)
  for i in range(8, 25, 8):
    plt.plot([0, 32], [i, i], ls='-', lw=1, color='b')
    plt.plot([i, i], [0, 32], ls='-', lw=1, color='b')
  for i in range(4, 29, 8):
    plt.plot([0, 32], [i, i], ls='--', lw=1, color='b')
    plt.plot([i, i], [0, 32], ls='--', lw=1, color='b')
  if title:
    plt.title(title, fontsize=24)
  if export_fig:
    plt.savefig(export_fig, bbox_inches='tight')
  plt.show()
labels = ['Quite A', 'Quite B', 'Disrespect A', 'Disrespect B', 'Innocent A', 'Innocent B', 'Neck A', 'Neck B']
show_sim_mat(sim_mat_elmo, labels, 'ELMo', 'ELMo.png')
show_sim_mat(sim_mat_bert, labels, 'BERT', 'BERT.png')

The results are shown below. The similarity between the target words in the 32 example sentences is displayed as a 32 x 32 heat map. Dividing lines are drawn every four sentences, i.e. wherever the sense of the polysemous word changes. Ideally, the 4 x 4 diagonal blocks should come out dark and the off-diagonal regions light.

<img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/c55744f2-649d-7628-3724-56689760acc3.png" height=280> <img src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/613488/a67af78a-bb98-52cd-3844-9a6de901ead0.png" height=280>

For ELMo, as in the calculation in this article, the angles between embedding vectors tend to be small, so the similarity values are large overall and the whole map is dark. You can at least see that the 8 x 8 diagonal blocks corresponding to the same word are darker than the rest, but the different senses of each polysemous word do not appear to be clearly separated. For BERT, on the other hand, the 4 x 4 diagonal blocks are clearly darker than the off-diagonal regions. In particular, Quite A/B and Neck A seem to be distinguished quite well.

That said, a heat map alone does not allow an accurate assessment, so let's evaluate the results with a quantitative metric. Since there are four example sentences for each sense of each polysemous word, every sentence has three companions in which the word is used in the same sense. For each sentence we therefore take the three sentences with the highest similarity as the model's prediction of "same sense" and compute the precision of that prediction; for example, if two of the three most similar sentences actually share the sense, the precision for that sentence is 2/3 ≈ 0.67. The block_size in the function below is the size of one diagonal block in the heat maps above, which is 4 here. The function returns the list of precision values for each of the 32 example sentences together with their average.

def eval_precision(sim_mat, block_size):
  num_data = len(sim_mat)
  precision_list = []
  for i in range(num_data):
    # Ground truth: sentences in the same block use the word in the same sense
    block_id = int(i / block_size)
    pred = np.array([1 if (block_id * block_size <= j and j < (block_id+1) * block_size) 
                    else 0 for j in range(num_data)])
    # Sort all sentences by similarity to sentence i, in descending order
    sorted_args = np.argsort(sim_mat[i])[::-1]
    sorted_pred = pred[sorted_args]
    # Index 0 is sentence i itself, so take the next block_size-1 (= 3) sentences
    precision = np.mean(sorted_pred[1:block_size])
    precision_list.append(precision)
  precision_arr = np.array(precision_list)
  av_precision = np.mean(precision_arr)
  return av_precision, precision_arr
#ELMo LSTM 2nd layer
av_precision, precision_arr = eval_precision(sim_mat_elmo, block_size=4)
print(np.round(av_precision, 2))
for i in range(8):
  print(np.round(precision_arr[4*i:4*(i+1)], 2))
#BERT 11th layer
av_precision, precision_arr = eval_precision(sim_mat_bert, block_size=4)
print(np.round(av_precision, 2))
for i in range(8):
  print(np.round(precision_arr[4*i:4*(i+1)], 2))

The result is as follows.

**[ELMo LSTM 2nd layer]**  Average: 0.54

|  | Example sentence 1 | Example sentence 2 | Example sentence 3 | Example sentence 4 |
|---|---|---|---|---|
| Quite A | 1.0 | 0.33 | 0.67 | 0.67 |
| Quite B | 0.67 | 1.0 | 0.33 | 0.67 |
| Disrespect A | 0.33 | 0.33 | 0.33 | 0.33 |
| Disrespect B | 0.33 | 0.67 | 0.67 | 1.0 |
| Innocent A | 0.67 | 0.33 | 0.67 | 0.33 |
| Innocent B | 0.67 | 0.33 | 0.33 | 0.67 |
| Neck A | 0.67 | 0.67 | 1.0 | 0.67 |
| Neck B | 0.67 | 0 | 0 | 0.67 |

**[BERT 11th layer]**  Average: 0.78

|  | Example sentence 1 | Example sentence 2 | Example sentence 3 | Example sentence 4 |
|---|---|---|---|---|
| Quite A | 1.0 | 1.0 | 1.0 | 1.0 |
| Quite B | 1.0 | 1.0 | 1.0 | 1.0 |
| Disrespect A | 0 | 0.67 | 0.67 | 0.67 |
| Disrespect B | 1.0 | 0.67 | 0.67 | 0.67 |
| Innocent A | 1.0 | 1.0 | 1.0 | 1.0 |
| Innocent B | 0.67 | 0.67 | 0.33 | 0.33 |
| Neck A | 1.0 | 1.0 | 1.0 | 1.0 |
| Neck B | 0.67 | 0 | 0.67 | 0.67 |

Quantitatively as well, BERT turns out to be more accurate. In the English-version experiment the average precision was 0.61 for ELMo and 0.78 for BERT, so both models reach roughly the same level of accuracy as their English counterparts. Both models struggle with meaning B of 首 ("neck"); perhaps this is influenced by writing it in kanji as 首 rather than in the more usual katakana クビ.

Finally, let's compare accuracy across the layers from which the embedding vectors are extracted. For BERT, the average over all layers and the average over the last 6 layers are also evaluated.

# ELMo
#LSTM 1st layer
sim_mat_elmo = calc_sim_mat(elmo_lstm1_embeddings)
av_precision, _ = eval_precision(sim_mat_elmo, block_size=4)
print('LSTM1', np.round(av_precision, 2))

#LSTM 2nd layer
sim_mat_elmo = calc_sim_mat(elmo_lstm2_embeddings)
av_precision, _ = eval_precision(sim_mat_elmo, block_size=4)
print('LSTM2', np.round(av_precision, 2))

#3-layer average
sim_mat_elmo = calc_sim_mat(elmo_mean_embeddings)
av_precision, _ = eval_precision(sim_mat_elmo, block_size=4)
print('mean', np.round(av_precision, 2))
# BERT
#Each layer
for i in range(12):
  sim_mat_bert = calc_sim_mat(bert_embeddings[i])
  av_precision, _ = eval_precision(sim_mat_bert, block_size=4)
  print(i+1, np.round(av_precision, 2))

#All layers average
sim_mat_bert = calc_sim_mat(np.mean(bert_embeddings, axis=0))
av_precision, _ = eval_precision(sim_mat_bert, block_size=4)
print('average-all', np.round(av_precision, 2))

#Average of the last 6 layers
sim_mat_bert = calc_sim_mat(np.mean(bert_embeddings[-6:], axis=0))
av_precision, _ = eval_precision(sim_mat_bert, block_size=4)
print('average-last6', np.round(av_precision, 2))

The results are as follows; each value is the average precision over all example sentences.

[ELMo]

| layer | Average precision |
|---|---|
| LSTM 1st layer | 0.52 |
| LSTM 2nd layer | 0.54 |
| ELMo (3-layer average) | 0.54 |

[BERT]

| layer | Average precision |
|---|---|
| 1st layer | 0.42 |
| 2nd layer | 0.48 |
| 3rd layer | 0.67 |
| 4th layer | 0.73 |
| 5th layer | 0.73 |
| 6th layer | 0.73 |
| 7th layer | 0.72 |
| 8th layer | 0.73 |
| 9th layer | 0.78 |
| 10th layer | 0.78 |
| 11th layer | 0.78 |
| 12th layer | 0.77 |
| All-layer average | 0.77 |
| Last-6-layer average | 0.77 |

In the English-version experiment, ELMo achieved its highest accuracy at the first LSTM layer and BERT in the middle layers, whereas this time ELMo peaks at the second LSTM layer and the three-layer average, and BERT in the later layers other than the final one. These tendencies probably depend on the pre-training data and tasks.

In conclusion

The overall story is the same as in my earlier article, but I felt the difficulties specific to Japanese, such as the need for word segmentation. Comparing ELMo and BERT, BERT distinguishes polysemous words more accurately, just as in the English version. It is a pity that we could not compare against other pre-trained Japanese models because of the differences in tokenizers.
