Recently I started Kaggle. The competition I entered is about extracting character strings from tweets, and while trying out various ways of converting words into features, I found the Tokenizer very convenient, so I would like to introduce it here.
In order to learn words with machine learning, the words first have to be turned into numbers (vectorized). The converter that does this is called a Tokenizer. For example, it quantifies a word like this: This -> Tokenizer -> 713.
transformers
The library used this time is "transformers", developed by Hugging Face, Inc. It is a well-known library for natural language processing (at least it was often used on Kaggle), and it provides not only tokenizers but also implementations of state-of-the-art models.
This time, I will use "RoBERTa" as an example. The source can be found here: https://github.com/ishikawa-takumi/transformers-sample/blob/master/tokenizer.ipynb. This source is in ipynb format, so you can run it in Google Colab or Visual Studio Code. Alternatively, copy each code snippet into a console or a .py file and try it out.
Get the tokenizer with transformers.AutoTokenizer.from_pretrained(model name). By changing the model name, you can get the tokenizer of that model. Since we will use RoBERTa this time, pass "roberta-base". Other models can be found here: https://huggingface.co/models. It takes a little while to be created, so please be patient.
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
Prepare the following two sentences this time.
text = "This is a pen."
text2 = "I am a man"
Vectorize each of the created sentences. To do this, use encode.
ids = tokenizer.encode(text)
# output
# [713, 16, 10, 7670, 4]
ids2 = tokenizer.encode(text2)
# output
# [100, 524, 10, 313]
Now the words have been converted to the numbers we wanted. Looking at the first sentence, the conversion is: This -> 713, is -> 16, a -> 10, pen -> 7670, . -> 4. Converted!
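If you want to check the word-to-ID mapping yourself, the tokenizer also provides tokenize, convert_tokens_to_ids, and convert_ids_to_tokens. The following is just a quick sanity check (the 'Ġ' prefix is how RoBERTa's byte-level BPE marks a leading space, so the token strings can look slightly different from the plain words):
tokens = tokenizer.tokenize(text)
# output
# ['This', 'Ġis', 'Ġa', 'Ġpen', '.']
tokenizer.convert_tokens_to_ids(tokens)
# output
# [713, 16, 10, 7670, 4]
tokenizer.convert_ids_to_tokens(ids)
# converts the IDs from encode back into token strings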
Special Token
Furthermore, methods such as BERT and RoBERTa are trained using special characters. These include characters that represent the beginning of a sentence and the break between sentences. Please check here for details. For example, let's see what kind of characters RoBERTa uses. They are stored in a variable called special_tokens_map.
tokenizer.special_tokens_map
# output
# {'bos_token': '<s>',
# 'eos_token': '</s>',
# 'unk_token': '<unk>',
# 'sep_token': '</s>',
# 'pad_token': '<pad>',
# 'cls_token': '<s>',
# 'mask_token': '<mask>'}
"Token Meaning: Characters Assigned to Tokens"
The output is like this.
For example, the first bos_token is assigned the character \ .
Also, as the meaning of each,
bos_token: Begin of sequence token
eos_token: End of Sequence token
unk_token: Characters that cannot be converted to ID (Unknown token)
sep_token: Sentence-to-sentence separator (The separator token)
pad_token: Padding (The token used for padding)
cls_token: For classification (cls_token)
mask_token: The token used for masking values
is. Please check the URL mentioned earlier for a detailed explanation.
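Each of these is also exposed as an attribute on the tokenizer, so you can look the characters up directly; each attribute also has an _id counterpart that returns the corresponding vocabulary ID:
tokenizer.bos_token
# output
# '<s>'
tokenizer.sep_token
# output
# '</s>'
tokenizer.bos_token_id
# the ID assigned to '<s>' in the vocabulary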
all_special_tokens is the list of characters assigned to the Special Tokens, and all_special_ids holds the IDs corresponding to those Special Tokens, arranged in the same order as all_special_tokens. Seven Special Tokens were listed earlier, but all_special_tokens is a list of only five characters because some of the assigned characters are duplicated.
tokenizer.all_special_tokens
# output
# ['<pad>', '<s>', '<mask>', '<unk>', '</s>']
tokenizer.all_special_ids
# output
# [1, 0, 50264, 3, 2]
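Since the two lists are in the same order, zipping them together shows at a glance which ID belongs to which character:
dict(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))
# output
# {'<pad>': 1, '<s>': 0, '<mask>': 50264, '<unk>': 3, '</s>': 2}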
In RoBERTa and BERT, it is necessary to add the Special Tokens when training and inferring with the Model. (I haven't done padding this time.) You could attach the Special Tokens you checked earlier by hand, but there is an API that automatically adds the Special Tokens to the input sentences, so let's use it.
ids_with_special_token = tokenizer.build_inputs_with_special_tokens(ids, ids2)
# output(ids_with_special_token)
# [0, 713, 16, 10, 7670, 4, 2, 2, 100, 524, 10, 313, 2]
mask = tokenizer.get_special_tokens_mask(ids, ids2)
# output(mask)
# [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
To add the Special Tokens, use tokenizer.build_inputs_with_special_tokens(IDs of text, IDs of text2). You can pass two sentences (one is also OK), and the Special Tokens are properly inserted at the beginning, at the break between the two sentences, and at the end.
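If you want to double-check what was inserted, decoding the IDs back into a string makes the Special Tokens visible (just a sanity check; the exact spacing of the decoded string may differ slightly between library versions):
tokenizer.decode(ids_with_special_token)
# output
# '<s>This is a pen.</s></s>I am a man</s>'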
Next time, I will actually process this with a Model.