[PYTHON] Natural Language: BERT Part1 --Japanese Wikipedia Corpus

Target

This series covers BERT using the Microsoft Cognitive Toolkit (CNTK).

Part 1 covers the preparation for BERT.

I will introduce them in the following order.

  1. Download Japanese Wikipedia and extract the text data
  2. Preprocess the text data and create a SentencePiece model
  3. Create a pre-training corpus

Introduction

Download Japanese Wikipedia

This time, we will use Japanese Wikipedia as the Japanese corpus.

Japanese Wikipedia

Download jawiki-latest-pages-articles-multistream.xml.bz2 from the link above, then use wikiextractor to strip the wiki markup.

$ python ./wikiextractor-master/WikiExtractor.py ./jawiki/jawiki-latest-pages-articles-multistream.xml.bz2 -o ./jawiki -b 500M

The directory structure for this part is as follows.

BERT
 |―jawiki
 |  jawiki-latest-pages-articles-multistream.xml.bz2
 |―wikiextractor-master
 |  WikiExtractor.py
 |  ...
 bert_corpus.py
Doc2Vec
NMTT
STSA
Word2Vec

Text data preprocessing and SentencePiece model creation

In addition to the preprocessing implemented in the previous parts, this time the text is normalized by unifying the notation of brackets and punctuation marks and by deleting the spaces between kana and kanji.
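As a rough illustration, here is a minimal normalization sketch in Python; the bracket and punctuation mappings and the character ranges used for kana and kanji are assumptions, and the actual rules in bert_corpus.py may differ.

import re
import unicodedata
def normalize_text(text):
    # Unify punctuation variants to 、 and 。 (assumed mapping)
    text = text.replace("，", "、").replace("．", "。")
    # Unify bracket variants to 「」 (assumed mapping)
    text = re.sub(r"[『【]", "「", text)
    text = re.sub(r"[』】]", "」", text)
    # Unicode normalization (full-width/half-width unification)
    text = unicodedata.normalize("NFKC", text)
    # Delete spaces sandwiched between kana or kanji characters
    jp_char = r"[ぁ-んァ-ヶ一-龥々ー]"
    text = re.sub(r"(?<={0})[ 　]+(?={0})".format(jp_char), "", text)
    return text
print(normalize_text("自然 言語 処理"))  # -> 自然言語処理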

For word segmentation, a subword model is trained with SentencePiece [1]. In addition, [CLS], [SEP], and [MASK] are defined as special tokens.
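A minimal sketch of the SentencePiece training step is shown below; the corpus path, vocabulary size, and model type are assumptions, and the options actually used in bert_corpus.py may differ.

import sentencepiece as spm
# Train the subword model; [CLS], [SEP], [MASK] are registered as user-defined
# symbols so that they are never split into subwords (paths and sizes are assumptions).
spm.SentencePieceTrainer.Train(
    "--input=./jawiki/jawiki.txt "
    "--model_prefix=jawiki "
    "--vocab_size=32000 "
    "--model_type=unigram "
    "--user_defined_symbols=[CLS],[SEP],[MASK]"
)
sp = spm.SentencePieceProcessor()
sp.Load("jawiki.model")
print(sp.EncodeAsPieces("自然言語処理"))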

Creating a pre-training corpus

In BERT [2] pre-training, the language model is trained in an unsupervised manner by masking tokens in the sentences of the corpus, so training data is created for that purpose.

For the Masked Language Model, 15% of the tokens in each sequence are selected for prediction; each selected token is replaced with the special token [MASK] with 80% probability, replaced with a random token with 10% probability, and left unchanged with the remaining 10% probability.
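This rule can be sketched as follows; the treatment of special tokens and the random-token sampling are simplified assumptions, and the actual implementation in bert_corpus.py may differ.

import random
def mask_tokens(tokens, vocab, mask_rate=0.15):
    # Returns the masked sequence and the prediction targets (None = not predicted).
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):  # never mask the special tokens
            continue
        if random.random() < mask_rate:  # select 15% of the tokens
            labels[i] = token            # the original token is the target
            p = random.random()
            if p < 0.8:                  # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif p < 0.9:                # 10%: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the token unchanged
    return masked, labels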

Also, this time Sentence-Order Prediction [3] is used instead of Next Sentence Prediction.
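Unlike Next Sentence Prediction, which draws its negative example from a different document, Sentence-Order Prediction simply swaps two consecutive segments of the same document. A minimal sketch, assuming a 50% swap probability and a label of 1 for the original order:

import random
def make_sop_example(segment_a, segment_b):
    # segment_a and segment_b are consecutive segments from the same article
    if random.random() < 0.5:
        return segment_a, segment_b, 1  # kept in the original order
    return segment_b, segment_a, 0      # swapped order (negative example)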

Implementation

Execution environment

Hardware

・CPU: Intel(R) Core(TM) i7-7700 3.60GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・nltk 3.4.5
・numpy 1.17.3
・sentencepiece 0.1.91

Program to run

The implemented program is published on GitHub.

bert_corpus.py


Result

When the program is executed, it writes one preprocessed sentence per line and creates a Japanese corpus in which articles are separated by blank lines.

The SentencePiece model is then trained, producing jawiki.model and jawiki.vocab.

Finally, a text file is created in the format read by CNTK's CTFDeserializer for pre-training.
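For reference, the CNTK Text Format expected by CTFDeserializer puts one time step per line, prefixed by a sequence ID, with each sparse stream written as "|name index:value". Below is a minimal writer sketch; the stream names and output file name are assumptions and must match the deserializer configuration used in Part 2.

def write_ctf(sequences, path):
    # sequences: list of (token_ids, label_ids) pairs, one pair per sequence
    with open(path, "w", encoding="utf-8") as f:
        for seq_id, (token_ids, label_ids) in enumerate(sequences):
            for t, l in zip(token_ids, label_ids):
                # one time step per line: sparse one-hot written as "index:1"
                f.write("{} |token {}:1\t|label {}:1\n".format(seq_id, t, l))
write_ctf([([5, 12, 7], [5, 0, 7])], "./jawiki_corpus.txt")  # one 3-step sequence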

Now that the training data is ready, Part 2 will use CNTK for unsupervised pre-training on Japanese.

References

Japanese Wikipedia
wikiextractor

  1. Taku Kudo and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing", arXiv preprint arXiv:1808.06226, (2018).
  2. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805, (2018).
  3. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations", arXiv preprint arXiv:1909.11942, (2019).
