[PYTHON] [TPU] [Transformers] Make BERT at explosive speed

In a competition I took part in, I was able to build a BERT model very easily using TPU + Transformers, so I'd like to share how.

The priority here is speed: we will build a model with BERT, the current trend in NLP, as quickly as possible.

BERT

BERT is a general-purpose language model developed by Google. This time we use a time-saving recipe based on DistilBERT, a distilled model that is lighter and faster than BERT. For the pretrained weights, we use the Japanese model published by BANDAI NAMCO Research.

TPU

A TPU (Tensor Processing Unit) is a processor developed by Google that specializes in machine learning workloads. It can run matrix operations faster than general-purpose processors, for example by using 8- or 16-bit arithmetic units instead of 32-bit ones and by passing values between arithmetic units without reading and writing memory. TPUs are available on GCP, and recently there are also Edge TPUs that can be attached to a Raspberry Pi.

This time, we will use a TPU on Google Colab to drastically cut the training time.

Transformers

Transformers is a deep learning library provided by Hugging Face that specializes in Transformer models. You can easily load the tokenizers and pretrained models needed to build a Transformer model from those published on the Hugging Face hub. (Of course, you can also point it at a locally saved model.) Early on it only supported PyTorch, but it now supports TensorFlow as well. This time we load and build the model for TensorFlow (Keras), considering the simplicity of the interface and how easy it is to use a TPU. (Using a TPU from PyTorch means dealing with XLA and writing multiprocessing code, which is a hassle, so I use TensorFlow, which is comparatively painless here.)
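For example, a tokenizer or model can be loaded either by its hub name or from a local directory. A minimal sketch (the local path is a hypothetical placeholder, not from this article):

from transformers import AutoTokenizer, TFAutoModel

# Load by name from the Hugging Face hub...
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")

# ...or from a local directory previously written with save_pretrained() (hypothetical path)
model = TFAutoModel.from_pretrained("./models/my-japanese-bert")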

Build

It's a standard choice, but we will build a multi-class classifier on everyone's favorite Livedoor news corpus.

Preparation

Change the runtime to TPU from the notebook's Runtime -> Change runtime type menu. (The default is None.)

(Screenshot tpu.png: runtime type set to TPU)
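If you want to confirm that the switch took effect, Colab TPU runtimes of this era expose the TPU address through the COLAB_TPU_ADDR environment variable. A quick check (my own addition, not part of the original setup):

import os

# Set only when the notebook is actually running on a TPU runtime
print(os.environ.get('COLAB_TPU_ADDR', 'No TPU detected'))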

Since transformers is not included in the Colab environment, install it with pip. The tokenizer used this time depends on MeCab, so install that as well.

# Install Transformers and MeCab (used by the Japanese tokenizer)
!pip install transformers
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

# Mount Google Drive and move to the working directory
from google.colab import drive
drive.mount('/gdrive')
%cd "/gdrive/My Drive/workspace/python/bakusoku"

Data preprocessing

This time, I will use AutoTokenizer. If you specify a pretrained model, the tokenizer suited to that model is loaded automatically. (In this case, we get an instance of BertJapaneseTokenizer.) You can tokenize a sentence with the tokenize method. This tokenizer splits text into WordPiece units. (There are other segmentation methods, such as SentencePiece.)

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")

print(tokenizer.tokenize('爆速でBERTを作る'))  # "Make BERT at explosive speed"
# ['爆', '##速', 'で', 'BE', '##R', '##T', 'を', '作る']
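The rest of the article assumes that train_df, valid_df, and test_df are pandas DataFrames with text and label columns, and that target_names holds the nine category names. They are built from the extracted Livedoor corpus; a minimal sketch of one way to do this (the directory layout and split ratios are my assumptions, not code from the original):

import glob
import os
import pandas as pd
from sklearn.model_selection import train_test_split

corpus_dir = './text'  # directory the Livedoor corpus archive was extracted to (assumed)
target_names = sorted(
    d for d in os.listdir(corpus_dir) if os.path.isdir(os.path.join(corpus_dir, d))
)

rows = []
for label, category in enumerate(target_names):
    # Article files are named like '<category>-*.txt'; this pattern skips each LICENSE.txt
    for path in glob.glob(os.path.join(corpus_dir, category, category + '-*.txt')):
        with open(path, encoding='utf-8') as f:
            lines = f.read().splitlines()
        # Lines 0-1 are the URL and date; keep the title and body
        rows.append({'text': '\n'.join(lines[2:]), 'label': label})

df = pd.DataFrame(rows)
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=0)
train_df, valid_df = train_test_split(train_df, test_size=0.1, stratify=train_df['label'], random_state=0)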

Sentences need to be converted into sequences of word IDs for training, so we convert them with the following function.

import numpy as np

def encode_texts(texts, tokenizer, maxlen=512):
    # Tokenize a list of texts and pad them to maxlen word IDs
    # (newer transformers versions call these return_attention_mask / padding='max_length')
    enc_di = tokenizer.batch_encode_plus(
        texts, 
        return_attention_masks=False, 
        return_token_type_ids=False,
        pad_to_max_length=True,
        max_length=maxlen
    )
    return np.array(enc_di['input_ids'])
x_train = encode_texts(train_df['text'].values, tokenizer)
x_valid = encode_texts(valid_df['text'].values, tokenizer)
x_test = encode_texts(test_df['text'].values, tokenizer)
print(x_train)
# [[    2   281   306 ...  2478     9     3]
#  [    2  1519     7 ...    15    16     3]
#  [    2 11634  3217 ...  2478     7     3]
#  ...
#  [    2  6093 16562 ...     0     0     0]
#  [    2   885  2149 ...     0     0     0]
#  [    2  5563  2037 ...     0     0     0]]

The ground-truth labels are also one-hot encoded.

from tensorflow.keras.utils import to_categorical
y_train = to_categorical(train_df['label'].values)
y_valid = to_categorical(valid_df['label'].values)
print(y_train)
# [[1. 0. 0. ... 0. 0. 0.]
#  [1. 0. 0. ... 0. 0. 0.]
#  [1. 0. 0. ... 0. 0. 0.]
#  ...
#  [0. 0. 0. ... 0. 0. 1.]
#  [0. 0. 0. ... 0. 0. 1.]
#  [0. 0. 0. ... 0. 0. 1.]]

Preparing to use TPU

Unlike the GPU runtime, which you can use just by switching, the TPU runtime requires the following setup code. It is essentially boilerplate, but note that scaling the batch size to the number of TPU cores makes training faster.

import tensorflow as tf

try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

num_replicas =  strategy.num_replicas_in_sync
print("REPLICAS: ", num_replicas)
# REPLICAS:  8
BATCH_SIZE = 16 * num_replicas # 128

Model building

This time, we build a multi-class classifier with DistilBERT as the encoder. From the encoder output, the vector corresponding to the first token (the special [CLS] token that marks the beginning of the sentence) is fed into the head (a softmax output layer). It should also be possible to instead apply global pooling over the entire encoder output before the head.

from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model

def build_model(transformer, num_cls=1, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]  # last hidden states: (batch, max_len, hidden)
    cls_token = sequence_output[:, 0, :]              # vector for the leading [CLS] token
    out = Dense(num_cls, activation='softmax')(cls_token)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=2e-4), loss='categorical_crossentropy', metrics=['accuracy']) # lr = 5e-5 * 4
    
    return model
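As noted above, instead of slicing out the [CLS] vector you could pool over the whole encoder output. A minimal sketch of that variant, reusing the imports from the previous block (this is my own alternative, not the model trained below; note that without an attention mask the padding positions are included in the average):

from tensorflow.keras.layers import GlobalAveragePooling1D

def build_pooled_model(transformer, num_cls=1, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    pooled = GlobalAveragePooling1D()(sequence_output)  # average over all token positions
    out = Dense(num_cls, activation='softmax')(pooled)

    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=2e-4), loss='categorical_crossentropy', metrics=['accuracy'])

    return model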

Load the pretrained model published on the Hugging Face hub and build the model with the build_model function above. As with the tokenizer, TFAutoModel loads the pretrained model, here inside the TPU strategy scope. (In this case, we get an instance of TFDistilBertModel; from_pt=True converts the published PyTorch weights to TensorFlow on load.)

from transformers import TFAutoModel
with strategy.scope():
    transformer_layer = (TFAutoModel.from_pretrained('bandainamco-mirai/distilbert-base-japanese', from_pt=True))
    model = build_model(transformer_layer, num_cls=9, max_len=512)
model.summary()
# Model: "model_1"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input_word_ids (InputLayer)  [(None, 512)]             0         
# _________________________________________________________________
# tf_distil_bert_model_1 (TFDi ((None, 512, 768), ((None 67497984  
# _________________________________________________________________
# tf_op_layer_strided_slice_1  [(None, 768)]             0         
# _________________________________________________________________
# dense_1 (Dense)              (None, 9)                 6921      
# =================================================================
# Total params: 67,504,905
# Trainable params: 67,504,905
# Non-trainable params: 0
# _________________________________________________________________

Fine-tuning

Let's train the model we just built. In the example below, training on roughly 5,500 samples for 4 epochs finishes in about 70 seconds.

AUTO = tf.data.experimental.AUTOTUNE

# Training data: repeat and shuffle, then batch and prefetch for the TPU
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

# Validation data: no shuffling; cache the batches between epochs
valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

# Test data: inputs only, no labels
test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)

n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=4
)
# Epoch 1/4
# 43/43 [==============================] - 31s 715ms/step - accuracy: 0.2473 - loss: 2.0548 - val_accuracy: 0.3355 - val_loss: 1.9584
# Epoch 2/4
# 43/43 [==============================] - 13s 308ms/step - accuracy: 0.6726 - loss: 1.4064 - val_accuracy: 0.6612 - val_loss: 1.1878
# Epoch 3/4
# 43/43 [==============================] - 13s 309ms/step - accuracy: 0.8803 - loss: 0.7522 - val_accuracy: 0.7877 - val_loss: 0.8257
# Epoch 4/4
# 43/43 [==============================] - 13s 309ms/step - accuracy: 0.9304 - loss: 0.4401 - val_accuracy: 0.8181 - val_loss: 0.6747
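The History object returned by fit keeps the per-epoch metrics, which is handy for checking or plotting the learning curves (a small usage note, not from the original):

# Per-epoch metrics recorded during fine-tuning
print(train_history.history['accuracy'])
print(train_history.history['val_accuracy'])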

Evaluation

Accuracy is 81% ... a so-so result. I think accuracy would improve with a bit more work on data cleansing and on the model structure after the encoder.

from sklearn.metrics import classification_report

# target_names: the nine Livedoor category names, in label-index order
test_df['predict'] = model.predict(test_dataset, verbose=1).argmax(axis=1)
print(classification_report(test_df['label'], test_df['predict'], target_names=target_names))
# 12/12 [==============================] - 11s 890ms/step
#                 precision    recall  f1-score   support
# 
# dokujo-tsushin       0.73      0.95      0.83       174
#   it-life-hack       0.66      0.91      0.76       174
#  kaden-channel       0.79      0.47      0.59       173
# livedoor-homme       0.91      0.31      0.47       102
#    movie-enter       0.81      0.96      0.88       174
#         peachy       0.81      0.71      0.76       169
#           smax       0.91      0.97      0.94       174
#   sports-watch       0.88      1.00      0.94       180
#     topic-news       0.91      0.75      0.83       154
# 
#       accuracy                           0.81      1474
#      macro avg       0.83      0.78      0.78      1474
#   weighted avg       0.82      0.81      0.79      1474
