This article introduces how to create Japanese sentence vectors from a pretrained BERT model.
import torch
from transformers import BertJapaneseTokenizer, BertModel

# Japanese tokenizer
# (on newer versions of transformers, the model name is 'cl-tohoku/bert-base-japanese')
tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese')

# Pretrained BERT model
model = BertModel.from_pretrained('bert-base-japanese')
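As a quick sanity check (a hypothetical example, not part of the original article), you can tokenize a short Japanese string with the loaded tokenizer. The Japanese tokenizer first splits the text with MeCab and then applies WordPiece, so a MeCab binding must be installed.
# Sanity check (hypothetical example): MeCab word split followed by WordPiece.
print(tokenizer.tokenize("こんにちは、世界"))
# e.g. ['こんにちは', '、', '世界'] (the exact split depends on the vocabulary)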
This time, we prepare a list containing three Japanese sentences (all tongue twisters).
input_batch = [
    "すもももももももものうち",   # "Plums and peaches are both kinds of peaches"
    "隣の客はよく柿食う客だ",     # "The guest next door is a guest who often eats persimmons"
    "東京特許許可局局長",         # "Director of the Tokyo Patent Licensing Bureau"
]
Using `batch_encode_plus`, a list of texts is preprocessed into a mini-batch ready for model input. `pad_to_max_length=True` is the padding option: it pads every sequence in the batch to the same length.
encoded_data = tokenizer.batch_encode_plus(
input_batch, pad_to_max_length=True, add_special_tokens=True)
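Note that `pad_to_max_length` has been deprecated in newer versions of transformers. A rough equivalent with the newer API (an assumption about your installed version, not code from the original article) is:
# Sketch for newer transformers versions: padding=True pads to the longest
# sequence in the batch, and return_tensors="pt" returns PyTorch tensors directly.
encoded_data = tokenizer.batch_encode_plus(
    input_batch, padding=True, add_special_tokens=True, return_tensors="pt")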
Result: note that a dictionary is returned. `input_ids` contains the word IDs.
{'input_ids': [[2, 340, 28480, 28480, 28, 18534, 28, 18534, 5, 859, 3, 0],
[2, 2107, 5, 1466, 9, 1755, 14983, 761, 28489, 1466, 75, 3],
[2, 391, 6192, 3591, 600, 3591, 5232, 3, 0, 0, 0, 0]],
'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}
By the way, checking how the first sentence was tokenized gives the following.
input_ids = torch.tensor(encoded_data["input_ids"])
tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
Result: the special tokens ([CLS], [SEP], [PAD]) have been added properly.
['[CLS]', 'す', '##もも', '##もも', 'も', 'もも', 'も', 'もも', 'の', 'うち', '[SEP]', '[PAD]']
Input the tensorized `input_ids` into BERT.
According to the official documentation (https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel), the model returns a tuple. The first element is the hidden states of the final layer, so we extract it with `outputs[0]`.
outputs = model(input_ids)
last_hidden_states = outputs[0]
print(last_hidden_states.size())
# torch.Size([3, 12, 768])
Looking at the size of the output tensor, it is (batch size, sequence length, hidden dimension).
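For inference it is also common to disable gradient tracking and to pass the attention mask so that padded positions are ignored. A minimal sketch (not in the original article) looks like this; on newer versions of transformers the model returns a model-output object instead of a tuple, and the same tensor is available as `outputs.last_hidden_state`.
# Sketch: inference without gradient tracking, passing the attention mask as well.
attention_mask = torch.tensor(encoded_data["attention_mask"])
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
last_hidden_states = outputs[0]  # newer versions: outputs.last_hidden_state
print(last_hidden_states.size())  # torch.Size([3, 12, 768])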
**We want to create a sentence vector from the [CLS] token added to the beginning of each input sentence**, so we extract it as follows.
sentencevec = last_hidden_states[:,0,:]
print(sentencevec.size())
# torch.Size([3, 768])
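As a usage example (a sketch, not part of the original article), the resulting sentence vectors can be compared with cosine similarity:
# Sketch: cosine similarity between the three [CLS] sentence vectors.
import torch.nn.functional as F

normalized = F.normalize(sentencevec, p=2, dim=1)  # L2-normalize each vector
similarity = normalized @ normalized.T             # (3, 3) similarity matrix
print(similarity)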
That's all there is to it.