This article introduces how to create Japanese sentence vectors from a pretrained BERT model.
import torch
from transformers import BertJapaneseTokenizer, BertModel

# Japanese tokenizer
# (on newer versions of transformers, the model name is 'cl-tohoku/bert-base-japanese')
tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese')

# Pretrained BERT model
model = BertModel.from_pretrained('bert-base-japanese')
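As a quick sanity check (a hypothetical example, not part of the original article), you can tokenize a short Japanese string with the loaded tokenizer. The Japanese tokenizer first splits the text with MeCab and then applies WordPiece, so a MeCab binding must be installed.
# Sanity check (hypothetical example): MeCab word split followed by WordPiece.
print(tokenizer.tokenize("こんにちは、世界"))
# e.g. ['こんにちは', '、', '世界'] (the exact split depends on the vocabulary)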
This time, we prepare a list containing three Japanese sentences (all tongue twisters).
input_batch = [
    "すもももももももものうち",   # "Plums and peaches are both kinds of peaches"
    "隣の客はよく柿食う客だ",     # "The guest next door is a guest who often eats persimmons"
    "東京特許許可局局長",         # "Director of the Tokyo Patent Licensing Bureau"
]
Using `batch_encode_plus`, a list of texts is preprocessed into a mini-batch ready for model input. `pad_to_max_length=True` is the padding option: it pads every sequence in the batch to the same length.
encoded_data = tokenizer.batch_encode_plus(
input_batch, pad_to_max_length=True, add_special_tokens=True)
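Note that `pad_to_max_length` has been deprecated in newer versions of transformers. A rough equivalent with the newer API (an assumption about your installed version, not code from the original article) is:
# Sketch for newer transformers versions: padding=True pads to the longest
# sequence in the batch, and return_tensors="pt" returns PyTorch tensors directly.
encoded_data = tokenizer.batch_encode_plus(
    input_batch, padding=True, add_special_tokens=True, return_tensors="pt")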
Result: note that a dictionary is returned. `input_ids` contains the word IDs.
{'input_ids': [[2, 340, 28480, 28480, 28, 18534, 28, 18534, 5, 859, 3, 0],
[2, 2107, 5, 1466, 9, 1755, 14983, 761, 28489, 1466, 75, 3],
[2, 391, 6192, 3591, 600, 3591, 5232, 3, 0, 0, 0, 0]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}
By the way, checking how the first sentence was tokenized gives the following.
input_ids = torch.tensor(encoded_data["input_ids"])
tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
Result: the special tokens ([CLS], [SEP], [PAD]) have been added properly.
['[CLS]', 'す', '##もも', '##もも', 'も', 'もも', 'も', 'もも', 'の', 'うち', '[SEP]', '[PAD]']
Input the tensorized `input_ids` into BERT.
According to the official documentation (https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel), the model returns a tuple. The first element is the hidden states of the final layer, so we extract it with `outputs[0]`.
outputs = model(input_ids)
last_hidden_states = outputs[0]
print(last_hidden_states.size())
# torch.Size([3, 12, 768])
Looking at the size of the output tensor, it is (batch size, sequence length, hidden dimension).
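For inference it is also common to disable gradient tracking and to pass the attention mask so that padded positions are ignored. A minimal sketch (not in the original article) looks like this; on newer versions of transformers the model returns a model-output object instead of a tuple, and the same tensor is available as `outputs.last_hidden_state`.
# Sketch: inference without gradient tracking, passing the attention mask as well.
attention_mask = torch.tensor(encoded_data["attention_mask"])
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
last_hidden_states = outputs[0]  # newer versions: outputs.last_hidden_state
print(last_hidden_states.size())  # torch.Size([3, 12, 768])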
**We want to create a sentence vector from the [CLS] token added to the beginning of each input sentence**, so we extract it as follows.
sentencevec = last_hidden_states[:,0,:]
print(sentencevec.size())
# torch.Size([3, 768])
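As a usage example (a sketch, not part of the original article), the resulting sentence vectors can be compared with cosine similarity:
# Sketch: cosine similarity between the three [CLS] sentence vectors.
import torch.nn.functional as F

normalized = F.normalize(sentencevec, p=2, dim=1)  # L2-normalize each vector
similarity = normalized @ normalized.T             # (3, 3) similarity matrix
print(similarity)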
That's all there is to it.