[PYTHON] Tokenize using the Hugging Face library

Background

Recently I started Kaggle. I entered a contest to extract character strings from tweets, and while wondering whether I could place at all, I tried converting words into features using various methods. The Tokenizer turned out to be very convenient for this, so I would like to introduce it here.

What is a Tokenizer?

In order for a machine learning model to learn from words, the words must be converted into numbers (vectorized). The converter that does this is called a tokenizer. For example: "This" -> tokenizer -> 713.

transformers

The library used this time is called "transformers", developed by Hugging Face, Inc. It is a well-known library for natural language processing (at least, it is often used on Kaggle), and it implements not only tokenizers but also models based on the latest methods.

Let's try it right away

This time, I will use RoBERTa as an example. The source code can be found here (https://github.com/ishikawa-takumi/transformers-sample/blob/master/tokenizer.ipynb). It is in ipynb format, so you can run it in Google Colab or Visual Studio Code, or copy each snippet into a console or a .py file to try it out.

Get the tokenizer

Get the tokenizer with transformers.AutoTokenizer.from_pretrained(model_name). By changing the model name, you can get the tokenizer for that model. Since we will use RoBERTa this time, pass "roberta-base". Other models can be found here (https://huggingface.co/models). Creating the tokenizer takes some time, so please be patient.

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
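As a quick check (not in the original notebook), you can print which concrete tokenizer class AutoTokenizer resolved to; depending on your installed transformers version this is typically RobertaTokenizer or RobertaTokenizerFast.

print(type(tokenizer).__name__)
# output (depends on the installed version)
# RobertaTokenizerFast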

Create some sample sentences

This time, prepare the following sentences.

text = "This is a pen."
text2 = "I am a man"

Vectorization

Vectorize each of the created sentences. To do this, use encode.

ids = tokenizer.encode(text)
# output
# [713, 16, 10, 7670, 4]
ids2 = tokenizer.encode(text2)
# output
# [100, 524, 10, 313]

Now each word has been converted into the ID we wanted. Looking at the first sentence, the conversion is: This -> 713, is -> 16, a -> 10, pen -> 7670, . -> 4.
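If you want to double-check which token string each ID corresponds to, convert_ids_to_tokens does the reverse mapping (a small check, not in the original notebook). Note that RoBERTa's byte-level BPE marks a leading space with the character 'Ġ'.

tokenizer.convert_ids_to_tokens(ids)
# output
# ['This', 'Ġis', 'Ġa', 'Ġpen', '.']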

Special Token

What is a Special Token?

Furthermore, methods such as BERT and RoBERTa are trained using special characters, such as a character marking the beginning of a sentence and a character marking the break between sentences. Please check here for details. As an example, let's see what kind of characters RoBERTa uses. They are stored in a variable called special_tokens_map.

tokenizer.special_tokens_map
# output
# {'bos_token': '<s>',
#  'eos_token': '</s>',
#  'unk_token': '<unk>',
#  'sep_token': '</s>',
#  'pad_token': '<pad>',
#  'cls_token': '<s>',
#  'mask_token': '<mask>'}

"Token Meaning: Characters Assigned to Tokens" The output is like this. For example, the first bos_token is assigned the character \ . Also, as the meaning of each, bos_token: Begin of sequence token eos_token: End of Sequence token unk_token: Characters that cannot be converted to ID (Unknown token) sep_token: Sentence-to-sentence separator (The separator token) pad_token: Padding (The token used for padding) cls_token: For classification (cls_token) mask_token: The token used for masking values

is. Please check the URL mentioned earlier for a detailed explanation.
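Each entry in special_tokens_map can also be read directly as an attribute on the tokenizer, for example:

tokenizer.bos_token
# output
# '<s>'
tokenizer.sep_token
# output
# '</s>'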

Special Token ID check

all_special_tokens is the list of characters assigned to the special tokens, and all_special_ids holds the corresponding IDs, arranged in the same order as all_special_tokens. Seven special tokens were listed earlier, but all_special_tokens is a list of only five characters because some of the assigned characters are shared (for example, <s> serves as both bos_token and cls_token).

tokenizer.all_special_tokens
# output
# ['<pad>', '<s>', '<mask>', '<unk>', '</s>']
tokenizer.all_special_ids
# output
# [1, 0, 50264, 3, 2]
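To see which ID belongs to which character side by side, you can simply zip the two lists (a small convenience snippet, not in the original notebook):

dict(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))
# output
# {'<pad>': 1, '<s>': 0, '<mask>': 50264, '<unk>': 3, '</s>': 2}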

Adding Special Tokens

In RoBERTa and BERT, special tokens must be added when training and running inference with the model. (Padding is not done in this example.) You could attach the special tokens checked earlier by hand, but there is an API that automatically adds them to the input sentences, so let's use it.

ids_with_special_token = tokenizer.build_inputs_with_special_tokens(ids, ids2)
# output(ids_with_special_token)
# [0, 713, 16, 10, 7670, 4, 2, 2, 100, 524, 10, 313, 2]
mask = tokenizer.get_special_tokens_mask(ids, ids2)
# output(mask)
# [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]

To add the special tokens, use tokenizer.build_inputs_with_special_tokens(IDs of text, IDs of text2). You can pass two sentences (one is also fine), and the special tokens are inserted correctly at the beginning, at the break between the two sentences, and at the end. In addition, get_special_tokens_mask returns a mask in which the positions of special tokens are 1 and the positions of ordinary tokens are 0, as shown in the output above.
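To confirm that the special tokens ended up in the right places, you can also decode the combined IDs back into a string (a quick check, not in the original notebook; the exact spacing may differ slightly between versions):

tokenizer.decode(ids_with_special_token)
# output
# '<s>This is a pen.</s></s>I am a man</s>'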

Next time, I will actually process these inputs with a model.
