[PYTHON] Convert character strings to features with RoBERTa

Premise

・ Language: Python3
・ Library: transformers

Implementation

The code is here

1. Library import

import transformers
import torch

2. Determine the maximum number of tokens

In most cases, the strings you feed to the model will not all be the same length. However, to perform tensor calculations on them inside the model, every input must have the same length. So decide on a maximum length, and pad any input that is shorter than that with padding tokens (see the padding section below).

MAX_LENGTH = 192

3. Replace tokens with IDs & add special tokens

The explanation is here.

tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
text = "This is a pen."
text2 = "I am a man"

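# Encode without special tokens; they are added once in the next step
ids = tokenizer.encode(text, add_special_tokens=False)
ids2 = tokenizer.encode(text2, add_special_tokens=False)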

token_ids = tokenizer.build_inputs_with_special_tokens(ids, ids2)
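For a sentence pair, RoBERTa lays out the two ID sequences as `<s> A </s></s> B </s>`. The following is a minimal sketch of that layout in plain Python, not the library's implementation. The helper name `build_pair` is hypothetical, the default IDs 0 (`<s>`) and 2 (`</s>`) are the values in the roberta-base vocabulary, and the sample token IDs are made up for illustration.

```python
def build_pair(ids_a, ids_b, bos_id=0, eos_id=2):
    """Lay out two ID sequences as <s> A </s></s> B </s> (hypothetical helper)."""
    return [bos_id] + ids_a + [eos_id, eos_id] + ids_b + [eos_id]


# Made-up token IDs standing in for two already-tokenized sentences
pair_ids = build_pair([100, 101, 102], [200, 201])
print(pair_ids)  # [0, 100, 101, 102, 2, 2, 200, 201, 2]
```

This also shows why the inputs should be encoded without special tokens first: otherwise `<s>` and `</s>` would appear twice around each sentence.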

4. Create the padding & attention mask

Padding is as explained in section 2. The attention mask is a list of flags that tells the model which positions hold real tokens and which hold padding: "1" for real tokens and "0" for padding tokens.

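# Attention mask: 1 for every real token
mask = [1] * len(token_ids)

# Padding: fill up to MAX_LENGTH with the pad token ID (1 for roberta-base)
padding_length = MAX_LENGTH - len(token_ids)
if padding_length > 0:
    token_ids = token_ids + ([tokenizer.pad_token_id] * padding_length)
    mask = mask + ([0] * padding_length)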
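The code above assumes the input is no longer than MAX_LENGTH. A minimal sketch that also handles longer inputs by cutting them off could look like the following. The helper name `pad_and_mask` is hypothetical, the default `pad_id=1` is the padding ID of roberta-base, and slicing off the tail is an assumption made for brevity: it would also drop the closing `</s>`, so a real pipeline may prefer to truncate before adding special tokens.

```python
def pad_and_mask(token_ids, max_length, pad_id=1):
    """Return (ids, mask), both exactly max_length long (hypothetical helper)."""
    ids = token_ids[:max_length]             # naive truncation of overly long input
    mask = [1] * len(ids)                    # 1 = real token
    padding_length = max_length - len(ids)   # 0 when the input was truncated
    ids = ids + [pad_id] * padding_length
    mask = mask + [0] * padding_length       # 0 = padding
    return ids, mask


# Made-up token IDs, padded to a short length so the result is easy to read
short_ids, short_mask = pad_and_mask([0, 100, 101, 2], max_length=6)
print(short_ids)   # [0, 100, 101, 2, 1, 1]
print(short_mask)  # [1, 1, 1, 1, 0, 0]
```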

5. Load the model

To load a different model, replace "roberta-base" with that model's name. Other models are here.

model = transformers.AutoModel.from_pretrained("roberta-base")

6. Convert strings to features in the model

At this point the input strings have been converted to a list of IDs. Because the model expects tensors, convert the lists to torch.tensor. Given these inputs, the model returns (1) the output of the final encoder layer (RobertaLayer) and (2) that output after processing by the pooler (RobertaPooler). Their shapes, shown in the output of the code below, are (batch size, MAX_LENGTH, hidden size) and (batch size, hidden size); the hidden size of roberta-base is 768.

# Convert the IDs and mask from list to torch.Tensor so they can be passed to the model
token_ids_tensor = torch.tensor([token_ids], dtype=torch.long)
mask_tensor = torch.tensor([mask], dtype=torch.long)

# Convert to features
out = model(input_ids=token_ids_tensor, attention_mask=mask_tensor)

print(out[0].shape)
#output
#torch.Size([1, 192, 768])
print(out[1].shape)
#output
#torch.Size([1, 768])
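
The first output holds one vector per position, including the padded positions. If a single vector per input is wanted from it, one common option (not part of the original code) is to average only the positions whose attention mask is 1. The sketch below shows the idea on plain Python lists with a made-up 2-dimensional hidden state, so it runs without loading the model; `masked_mean` is a hypothetical helper name.

```python
def masked_mean(hidden_states, mask):
    """Average the vectors at positions where mask == 1 (hypothetical helper).

    hidden_states: list of seq_len vectors, each a list of floats
    mask: list of seq_len ints, 1 for real tokens and 0 for padding
    """
    valid = [vec for vec, m in zip(hidden_states, mask) if m == 1]
    if not valid:
        raise ValueError("mask selects no positions")
    dim = len(valid[0])
    return [sum(vec[d] for vec in valid) / len(valid) for d in range(dim)]


# Toy example: 3 positions, the last one is padding and is ignored
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
print(masked_mean(toy_hidden, toy_mask))  # [2.0, 3.0]
```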
