In this article, I will explain how to use the Japanese version of BERT with Google Colaboratory.
BERT itself is explained in detail in my book from last year, "Learn while making! Deep learning by PyTorch".
If you want to know how BERT works, please see that book.
Since the book only covers the English version, this post explains how to use the Japanese version of BERT. (I plan to write two more articles after this one.)
The implementation code for this post is available in the following GitHub repository.
GitHub: How to use the Japanese version of BERT in Google Colaboratory: Implementation code. The notebook is 1_Japanese_BERT_on_Google_Colaboratory.ipynb.
**Series list**
[1] *This article: [Implementation explanation] How to use the Japanese version of BERT with Google Colaboratory (PyTorch)
[2] [Implementation explanation] Livedoor news classification with the Japanese version of BERT: Google Colaboratory (PyTorch)
[3] [Implementation explanation] Brain science and unsupervised learning. Classify MNIST by information-maximization clustering
[4] [Implementation explanation] Classify livedoor news with Japanese BERT x unsupervised learning (information-maximization clustering)
Install MeCab, a tool for word segmentation (morphological analysis). It cannot be installed with pip, so install it with apt.
!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
Install mecab-python3 with pip so that you can use MeCab from Python.
!pip install mecab-python3
Install the NEologd dictionary so that MeCab can handle recently coined words. (Note that it is not used by the pretrained BERT model itself.)
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a
Get the path to the new word dictionary NEologd.
import subprocess

cmd = 'echo `mecab-config --dicdir`"/mecab-ipadic-neologd"'
path_neologd = (subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                 shell=True).communicate()[0]).decode('utf-8').strip()  # strip the trailing newline from echo
This completes the MeCab setup.
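As a quick sanity check (a minimal sketch, not part of the original notebook), you can print the extracted path and confirm the dictionary directory actually exists:

import os

# Confirm that the NEologd dictionary directory was extracted correctly
print(path_neologd)
print(os.path.isdir(path_neologd))  # should print True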
**ipadic** in the new word dictionary mecab-ipadic-neologd means "IPA dictionary". The IPA dictionary is organized according to the "IPA part-of-speech system".
You may also have seen the term **UniDic** in addition to IPAdic. UniDic is the system organized according to the "NINJAL (Kokugoken) Short Unit Automatic Analysis Dictionary".
For example, the morphological analyzer Sudachi uses UniDic as its default dictionary.
UniDic and IPAdic have different part-of-speech systems. For example, UniDic has no "adjectival verb" (keiyō-dōshi) category; instead it uses "adjectival noun" (keijōshi).
It has been three years since I joined the company as a new graduate and started learning IT, and two and a half years since I started learning machine learning and deep learning.
Mr. F, the senior colleague who has been my mentor since I joined, is from Indonesia, but his Japanese is better than mine.
It was also Mr. F who told me that Sudachi's default dictionary uses keijōshi (adjectival noun) rather than keiyō-dōshi (adjectival verb). I was surprised.
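If you want to see the UniDic-style part-of-speech labels yourself, one possible way is SudachiPy (a rough sketch, assuming the sudachipy and sudachidict_core packages have been installed with pip; the exact labels depend on the dictionary version):

from sudachipy import dictionary, tokenizer

# Tokenize with Sudachi (whose default dictionary follows the UniDic POS system)
# and print each token together with its part-of-speech information
tokenizer_obj = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C
for m in tokenizer_obj.tokenize("私は機械学習が好きです。", mode):
    print(m.surface(), m.part_of_speech())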
Now, let's actually segment a sentence (morphological analysis) and check that MeCab works.
The sentence is "私は機械学習が好きです。" ("I like machine learning.").
The first is when the new word dictionary NEologd is not used.
import MeCab
m=MeCab.Tagger("-Ochasen")
text = "I like machine learning."
text_segmented = m.parse(text)
print(text_segmented)
(output)
私	ワタシ	私	名詞-代名詞-一般
は	ハ	は	助詞-係助詞
機械	キカイ	機械	名詞-一般
学習	ガクシュウ	学習	名詞-サ変接続
が	ガ	が	助詞-格助詞-一般
好き	スキ	好き	名詞-形容動詞語幹
です	デス	です	助動詞	特殊・デス	基本形
。	。	。	記号-句点
EOS
**-Ochasen** in MeCab.Tagger("-Ochasen") is an output option. If you set **-Owakati** instead, only the segmented words are output, and if you set **-Oyomi**, only the reading is output.
m=MeCab.Tagger("-Owakati")
text_segmented = m.parse(text)
print(text_segmented)
The output is `私 は 機械 学習 が 好き です 。`.
m=MeCab.Tagger("-Oyomi")
text_segmented = m.parse(text)
print(text_segmented)
The output is `ワタシハキカイガクシュウガスキデス。`.
Next is the case of using the new word dictionary NEologd.
m=MeCab.Tagger("-Ochasen -d "+str(path_neologd)) #Added path to NEologd
text = "I like machine learning."
text_segmented = m.parse(text)
print(text_segmented)
(output)
私	ワタシ	私	名詞-代名詞-一般
は	ハ	は	助詞-係助詞
機械学習	キカイガクシュウ	機械学習	名詞-固有名詞-一般
が	ガ	が	助詞-格助詞-一般
好き	スキ	好き	名詞-形容動詞語幹
です	デス	です	助動詞	特殊・デス	基本形
。	。	。	記号-句点
EOS
To use NEologd, I added the -d option to MeCab.Tagger and specified the path to NEologd.
Without the new word dictionary, "機械学習" (machine learning) was split into "機械" (machine) and "学習" (learning). With the new word dictionary, it becomes the single word "機械学習" (a proper noun).
This is because the technical term "機械学習" (machine learning) is registered in the new word dictionary.
With the new word dictionary as well, setting **-Owakati** outputs only the segmented words.
m=MeCab.Tagger("-Owakati -d "+str(path_neologd)) #Added path to NEologd
text_segmented = m.parse(text)
print(text_segmented)
The output is `私 は 機械学習 が 好き です 。`.
Now, prepare the pretrained model and the tokenizer (morphological analysis) for the Japanese version of BERT.
The BERT model from my book ["Learn while making! Deep learning by PyTorch"](https://www.amazon.co.jp/dp/4839970254/) could also be used, but here I use the HuggingFace model, which has recently become the standard.
Incidentally, "hugging" means to embrace.
For the BERT model I use HuggingFace's implementation, and for the Japanese pretrained parameters and the tokenizer (morphological analysis) used during pretraining, I use those published by Masatoshi Suzuki of Tohoku University (the Inui-Suzuki Laboratory).
This Japanese pretrained model from Tohoku University has been incorporated into HuggingFace's OSS **transformers**, so it can be used directly from transformers.
First, install version 2.9 of transformers with pip.
!pip install transformers==2.9.0
Caution: transformers was upgraded from 2.8 to 2.9 on May 8, 2020. With version 2.8 the file paths to the Japanese data cause an error, so be careful to install 2.9.
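To be safe, you can check the installed version after the pip install (a minimal sketch):

import transformers

# Make sure the installed version is 2.9.0 (2.8 fails on the Japanese file paths)
print(transformers.__version__)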
Now, import PyTorch, the BERT model, and the tokenizer for Japanese BERT (the class that performs word segmentation).
import torch
from transformers.modeling_bert import BertModel
from transformers.tokenization_bert_japanese import BertJapaneseTokenizer
Prepare a tokenizer for Japanese. Specify 'bert-base-japanese-whole-word-masking' as the argument.
# Tokenizer that performs word segmentation
tokenizer = BertJapaneseTokenizer.from_pretrained('bert-base-japanese-whole-word-masking')
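As a quick check of the tokenizer (an illustrative sketch, reusing the sample sentence from earlier), you can split a sentence into subwords and convert them to vocabulary ids:

# Split a sample sentence into subword tokens and convert them to ids
sample_text = "私は機械学習が好きです。"
print(tokenizer.tokenize(sample_text))
print(tokenizer.encode(sample_text))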
Prepare the Japanese pretrained model.
# BERT model with Japanese pretrained parameters
model = BertModel.from_pretrained('bert-base-japanese-whole-word-masking')
print(model)
Below, let's briefly check the printed model.
BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(32000, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
・ ・ ・
Here, check the settings (config) of the Japanese version model.
from transformers import BertConfig
# Check the config of the Tohoku University Japanese model
config_japanese = BertConfig.from_pretrained('bert-base-japanese-whole-word-masking')
print(config_japanese)
The output is below.
BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"type_vocab_size": 2,
"vocab_size": 32000
}
Looking at the settings, you can see that the word vector is 768 dimensions, the maximum number of words (subwords) is 512, the number of BERT layers is 12, and the vocabulary size is 32,000.
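These values can also be read programmatically from the config object (a small sketch):

# Read the main hyperparameters directly from the config object
print(config_japanese.hidden_size)              # 768: word vector dimension
print(config_japanese.max_position_embeddings)  # 512: maximum number of subwords
print(config_japanese.num_hidden_layers)        # 12: number of BERT layers
print(config_japanese.vocab_size)               # 32000: vocabulary size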
With that, we have prepared the Japanese pretrained model and the Japanese tokenizer used to preprocess text before feeding it into the model.
Finally, let's handle the sentences with the Japanese version of BERT.
"I got fired from the company." "Teleworking just hurts my neck." "The company was fired."
Prepare three sentences.
Then, compare the vectors of the three words ** "Kubi" **, ** "Kubi" **, and ** "Dismissal" ** in each sentence.
BERT is characterized by the fact that the word vector changes according to the context, so in the first and second sentences The same word ** "kubi" ** has a different 768-dimensional vector representation.
** And I'm happy if the "Kubi" in the first sentence is closer to the "Dismissal" in the third sentence than the "Kubi" in the second sentence. ** **
Let's measure the similarity of word vectors by cosine similarity.
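For reference, cosine similarity is the dot product of two vectors divided by the product of their norms; the closer the value is to 1, the more similar the directions of the vectors. A minimal sketch of the definition (the actual comparison below uses torch.nn.CosineSimilarity, which computes the same thing):

import torch

# Cosine similarity: dot(a, b) / (||a|| * ||b||)
def cosine_similarity(a, b):
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))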
Let's implement it. First, prepare the sentences.
text1 = "会社を首になった。"  # I got fired from the company.
text2 = "テレワークばかりで首が痛い。"  # My neck hurts from all the teleworking.
text3 = "会社を解雇された。"  # I was dismissed from the company.
Process text1 with the Japanese BERT tokenizer.
# Segment into subwords and convert to ids
input_ids1 = tokenizer.encode(text1, return_tensors='pt')  # 'pt' means return PyTorch tensors
print(tokenizer.convert_ids_to_tokens(input_ids1[0].tolist()))  # tokens
print(input_ids1)  # ids
The output is
['[CLS]', '会社', 'を', '首', 'に', 'なっ', 'た', '。', '[SEP]']
tensor([[ 2, 811, 11, 13700, 7, 58, 10, 8, 3]])
It will be.
The word "kubi" was the third and I found the id to be 13700.
The output for each is as follows.
['[CLS]', 'テレ', '##ワーク', 'ばかり', 'で', '首', 'が', '痛', '##い', '。', '[SEP]']
tensor([[ 2, 5521, 3118, 4027, 12, 13700, 14, 4897, 28457, 8, 3]])
['[CLS]', '会社', 'を', '解雇', 'さ', 'れ', 'た', '。', '[SEP]']
tensor([[ 2, 811, 11, 7279, 26, 20, 10, 8, 3]])
This shows that "首" in the second sentence is the fifth token and "解雇" in the third sentence is the third token.
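Instead of reading the positions off the printed lists, you could also look them up programmatically (a hypothetical helper, not in the original article):

# Find the token positions programmatically from the token lists
tokens2 = tokenizer.convert_ids_to_tokens(input_ids2[0].tolist())
tokens3 = tokenizer.convert_ids_to_tokens(input_ids3[0].tolist())
print(tokens2.index('首'))    # expected: 5
print(tokens3.index('解雇'))  # expected: 3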
Now, input these id sequences into the Japanese BERT model and compute the output vectors.
# Input to the Japanese BERT model
result1 = model(input_ids1)
print(result1[0].shape)
print(result1[1].shape)
# result is (sequence_output, pooled_output, (hidden_states), (attentions)).
# However, hidden_states and attentions are optional and are not returned by default.
The output is
torch.Size([1, 9, 768])
torch.Size([1, 768])
It will be.
9 represents the number of words (subwords) in the first sentence.
768 is the embedding dimension of each word.
Therefore, since "首" in the first sentence is the third token, its word vector is
result1[0][0][3][:]
It will be.
As noted in the code comments, the outputs of the BERT model's forward computation are
sequence_output, pooled_output, (hidden_states), (attentions)
(however, hidden_states and attentions are optional and are not returned by default).
Reference: the explanation of BertModel's forward.
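To make the indexing above a little more readable, the two returned tensors can be unpacked into named variables (a small sketch; the variable names here are my own, and result1[0] / result1[1] are what the article indexes directly):

# Unpack the model outputs into named variables for readability
sequence_output1 = result1[0]  # per-token vectors: shape [1, 9, 768]
pooled_output1 = result1[1]    # sentence-level vector based on [CLS]: shape [1, 768]
word_vec_kubi1 = sequence_output1[0][3]  # same vector as result1[0][0][3][:]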
Similarly, obtain the word vectors for "首" (the fifth token) in the second sentence and "解雇" (the third token) in the third sentence.
# Input to the Japanese BERT model
result2 = model(input_ids2)
result3 = model(input_ids3)
word_vec1 = result1[0][0][3][:]  # "首" (fired) in the first sentence (third token)
word_vec2 = result2[0][0][5][:]  # "首" (neck) in the second sentence (fifth token)
word_vec3 = result3[0][0][3][:]  # "解雇" (dismissal) in the third sentence (third token)
Finally, let's find the similarity.
#Find cosine similarity
cos = torch.nn.CosineSimilarity(dim=0)
cos_sim_12 = cos(word_vec1, word_vec2)
cos_sim_13 = cos(word_vec1, word_vec3)
print(cos_sim_12)
print(cos_sim_13)
The output is
tensor(0.6647, grad_fn=<DivBackward0>)
tensor(0.7841, grad_fn=<DivBackward0>)
Therefore, with the word representations processed by BERT, the similarity between "首" in the first sentence and "首" in the second sentence is 0.66, while the similarity between "首" in the first sentence and "解雇" in the third sentence is 0.78.
As hoped, "首" in the first sentence is closer to "解雇" in the third sentence (higher similarity).
By using BERT, we confirmed that even the same word "首" is given a word vector whose meaning changes according to the context.
That is all for [Implementation explanation] How to use the Japanese version of BERT in Google Colaboratory (PyTorch).
[Disclaimer] The content of this article itself is the opinion / transmission of the author, not the official opinion of the company to which the author belongs.
References
● Using mecab-ipadic-NEologd with Google Colaboratory