[PYTHON] Natural Language: BERT Part1 --Japanese Wikipedia Corpus

Target

This series covers BERT using the Microsoft Cognitive Toolkit (CNTK).

Part 1 covers the preparation for BERT.

I will introduce them in the following order.

  1. Download Japanese Wikipedia and extract the text data
  2. Preprocess the text data and create a SentencePiece model
  3. Create a pre-training corpus

Introduction

Download Japanese Wikipedia

This time, we will use Japanese Wikipedia as the Japanese corpus.

Japanese Wikipedia

Download jawiki-latest-pages-articles-multistream.xml.bz2 from the link above, then use wikiextractor to strip the wiki markup.

$ python ./wikiextractor-master/WikiExtractor.py ./jawiki/jawiki-latest-pages-articles-multistream.xml.bz2 -o ./jawiki -b 500M

The directory structure for this part is as follows.

BERT
 |―jawiki
 |  jawiki-latest-pages-articles-multistream.xml.bz2
 |―wikiextractor-master
 |  WikiExtractor.py
 |  ...
 bert_corpus.py
Doc2Vec
NMTT
STSA
Word2Vec

Text data preprocessing and SentencePiece model creation

In addition to the preprocessing implemented in the previous parts, this time the text is normalized by unifying the notation of brackets and punctuation marks and by deleting the spaces between kana and kanji.
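As a rough illustration, here is a minimal normalization sketch in Python; the bracket and punctuation mappings and the character ranges used for kana and kanji are assumptions, and the actual rules in bert_corpus.py may differ.

import re
import unicodedata
def normalize_text(text):
    # Unify punctuation variants to 、 and 。 (assumed mapping)
    text = text.replace("，", "、").replace("．", "。")
    # Unify bracket variants to 「」 (assumed mapping)
    text = re.sub(r"[『【]", "「", text)
    text = re.sub(r"[』】]", "」", text)
    # Unicode normalization (full-width/half-width unification)
    text = unicodedata.normalize("NFKC", text)
    # Delete spaces sandwiched between kana or kanji characters
    jp_char = r"[ぁ-んァ-ヶ一-龥々ー]"
    text = re.sub(r"(?<={0})[ 　]+(?={0})".format(jp_char), "", text)
    return text
print(normalize_text("自然 言語 処理"))  # -> 自然言語処理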

For word segmentation, a subword model is trained with SentencePiece [1]. In addition, [CLS], [SEP], and [MASK] are defined as special tokens.
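A minimal sketch of the SentencePiece training step is shown below; the corpus path, vocabulary size, and model type are assumptions, and the options actually used in bert_corpus.py may differ.

import sentencepiece as spm
# Train the subword model; [CLS], [SEP], [MASK] are registered as user-defined
# symbols so that they are never split into subwords (paths and sizes are assumptions).
spm.SentencePieceTrainer.Train(
    "--input=./jawiki/jawiki.txt "
    "--model_prefix=jawiki "
    "--vocab_size=32000 "
    "--model_type=unigram "
    "--user_defined_symbols=[CLS],[SEP],[MASK]"
)
sp = spm.SentencePieceProcessor()
sp.Load("jawiki.model")
print(sp.EncodeAsPieces("自然言語処理"))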

Creating a pre-training corpus

In BERT [2] pre-training, the language model is trained in an unsupervised manner by masking tokens in the sentences of the corpus, so training data is created for that purpose.

For the Masked Language Model, 15% of the tokens in each sequence are selected for prediction; each selected token is replaced with the special token [MASK] with 80% probability, replaced with a random token with 10% probability, and left unchanged with the remaining 10% probability.
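This rule can be sketched as follows; the treatment of special tokens and the random-token sampling are simplified assumptions, and the actual implementation in bert_corpus.py may differ.

import random
def mask_tokens(tokens, vocab, mask_rate=0.15):
    # Returns the masked sequence and the prediction targets (None = not predicted).
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):  # never mask the special tokens
            continue
        if random.random() < mask_rate:  # select 15% of the tokens
            labels[i] = token            # the original token is the target
            p = random.random()
            if p < 0.8:                  # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif p < 0.9:                # 10%: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the token unchanged
    return masked, labels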

Also, this time Sentence-Order Prediction [3] is used instead of Next Sentence Prediction.
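Unlike Next Sentence Prediction, which draws its negative example from a different document, Sentence-Order Prediction simply swaps two consecutive segments of the same document. A minimal sketch, assuming a 50% swap probability and a label of 1 for the original order:

import random
def make_sop_example(segment_a, segment_b):
    # segment_a and segment_b are consecutive segments from the same article
    if random.random() < 0.5:
        return segment_a, segment_b, 1  # kept in the original order
    return segment_b, segment_a, 0      # swapped order (negative example)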

Implementation

Execution environment

Hardware

・CPU: Intel(R) Core(TM) i7-7700 3.60GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・nltk 3.4.5
・numpy 1.17.3
・sentencepiece 0.1.91

Program to run

The implemented program is published on GitHub.

bert_corpus.py


Result

When the program is executed, it writes one preprocessed sentence per line and creates a Japanese corpus in which articles are separated by blank lines.

The SentencePiece model is then trained, producing jawiki.model and jawiki.vocab.

Finally, a text file is created in the format read by CNTK's CTFDeserializer for pre-training.
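For reference, the CNTK Text Format expected by CTFDeserializer puts one time step per line, prefixed by a sequence ID, with each sparse stream written as "|name index:value". Below is a minimal writer sketch; the stream names and output file name are assumptions and must match the deserializer configuration used in Part 2.

def write_ctf(sequences, path):
    # sequences: list of (token_ids, label_ids) pairs, one pair per sequence
    with open(path, "w", encoding="utf-8") as f:
        for seq_id, (token_ids, label_ids) in enumerate(sequences):
            for t, l in zip(token_ids, label_ids):
                # one time step per line: sparse one-hot written as "index:1"
                f.write("{} |token {}:1\t|label {}:1\n".format(seq_id, t, l))
write_ctf([([5, 12, 7], [5, 0, 7])], "./jawiki_corpus.txt")  # one 3-step sequence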

Now that the training data is ready, Part 2 will use CNTK for unsupervised pre-training on Japanese.

References

Japanese Wikipedia
wikiextractor

  1. Taku Kudo and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing", arXiv preprint arXiv:1808.06226, (2018).
  2. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805, (2018).
  3. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations", arXiv preprint arXiv:1909.11942, (2019).
