[PYTHON] Natural Language: Machine Translation Part1 --Japanese-English Subtitle Corpus

Target

This series summarizes machine translation with the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we will prepare for machine translation using the Microsoft Cognitive Toolkit.

The steps are covered in the following order.

  1. Download JESC dataset
  2. Creating a Sentence Piece model
  3. Creating a file to be read by the built-in reader provided by CNTK

Introduction

Download JESC dataset

The Japanese-English Subtitle Corpus (JESC) is a large Japanese-English parallel corpus that includes colloquial expressions. [1]

Japanese-English Subtitle Corpus

Go to the page above, then download and unzip the Official splits under Download. The directory structure for this article is as follows.

Doc2Vec
NMTT
 |―JESC
 |  |―dev
 |  |―test
 |  |―train
 |―nmtt_corpus.py
STSA
Word2Vec

Creating a Sentence Piece model

This time, we preprocess the JESC dataset by, among other things, reducing redundancy and removing pairs whose Japanese side contains no Japanese text.
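The exact preprocessing lives in nmtt_corpus.py on GitHub; the following is only a minimal sketch of the idea. It assumes the official splits are tab-separated "english<TAB>japanese" lines, and the function name, the NFKC normalization, and the filtering rules here are simplified placeholders.

import re
import unicodedata

# Any hiragana, katakana, or kanji character.
JAPANESE = re.compile("[ぁ-んァ-ヶ一-龥]")

def jesc_preprocessing_sketch(jesc_file, english_file, japanese_file):
    seen = set()
    with open(jesc_file, encoding="utf-8") as f, \
            open(english_file, "w", encoding="utf-8") as en_f, \
            open(japanese_file, "w", encoding="utf-8") as ja_f:
        for line in f:
            english, _, japanese = line.rstrip("\n").partition("\t")
            japanese = unicodedata.normalize("NFKC", japanese)
            if not english or not JAPANESE.search(japanese):
                continue  # drop empty pairs and pairs without Japanese text
            if (english, japanese) in seen:
                continue  # reduce redundancy by skipping duplicate pairs
            seen.add((english, japanese))
            en_f.write(english + "\n")
            ja_f.write(japanese + "\n")

jesc_preprocessing_sketch("./JESC/train", "train.english.txt", "train.japanese.txt")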

For word segmentation, we create a subword model with sentencepiece, as in Natural Language : Chat Bot Part1 - Twitter API Corpus.

Creating a file to be read by the built-in reader provided by CNTK

After converting each sentence to word IDs with the Sentence Piece models trained on the training data, we create a text file for the CTFDeserializer used during training.
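For reference, the text format read by CTFDeserializer places one token per line as sparse one-hot entries (id:1), with a shared sequence id tying the streams of a pair together. A hypothetical two-pair excerpt (the stream names source and target, and all ids, are placeholders, not necessarily what nmtt_corpus.py uses) would look like:

0 |source 5:1 |target 12:1
0 |source 21:1 |target 7:1
0 |source 4:1
1 |source 9:1 |target 30:1
1 |source 2:1 |target 6:1

When one side of a pair is longer, the extra lines simply omit the other stream.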

Implementation

Execution environment

Hardware

・CPU Intel(R) Core(TM) i7-7700 3.60GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・sentencepiece 0.1.86

Program to run

The implemented program is published on GitHub.

nmtt_corpus.py


Result

The function jesc_preprocessing produces train.english.txt and train.japanese.txt, which are used to train the Sentence Piece models.

Next, train the Sentence Piece model. Training starts with the arguments shown below. A separate model is created for Japanese and for English, each with a vocabulary size of 32,000.

$ spm_train --input=/mnt/c/.../JESC/train.english.txt --model_prefix=english --vocab_size=32000
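The Japanese model is trained in the same way, swapping the input file and the model prefix:

$ spm_train --input=/mnt/c/.../JESC/train.japanese.txt --model_prefix=japanese --vocab_size=32000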

When training finishes, english.model and english.vocab, as well as japanese.model and japanese.vocab, will be created.
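For reference, the same training can also be run from Python through sentencepiece's API. A minimal sketch equivalent to the command above, plus a quick sanity check of the trained model:

import sentencepiece as spm

# Equivalent of the spm_train command line above.
spm.SentencePieceTrainer.Train(
    "--input=train.english.txt --model_prefix=english --vocab_size=32000")

# Quick sanity check: load the trained model and encode a sentence.
sp = spm.SentencePieceProcessor()
sp.Load("english.model")
print(sp.EncodeAsPieces("this is a test"))
print(sp.EncodeAsIds("this is a test"))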

Finally, execute the function jesc_sentencepiece to create a text file to be read by CTFDeserializer.

Now 10000 samples...
Now 20000 samples...
...
Now 2740000 samples...

Number of samples 2748930

Maximum Sequence Length 97
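For illustration, here is a minimal sketch of what jesc_sentencepiece does; the version in nmtt_corpus.py is the authoritative implementation. The stream names source and target, the output file name train_jesc.ctf, the Japanese-to-English direction, and the omission of sentence start/end tokens are all simplifying assumptions here.

import sentencepiece as spm

en_sp = spm.SentencePieceProcessor()
en_sp.Load("english.model")
ja_sp = spm.SentencePieceProcessor()
ja_sp.Load("japanese.model")

with open("train.japanese.txt", encoding="utf-8") as ja_f, \
        open("train.english.txt", encoding="utf-8") as en_f, \
        open("train_jesc.ctf", "w", encoding="utf-8") as ctf:
    for i, (ja, en) in enumerate(zip(ja_f, en_f)):
        ja_ids = ja_sp.EncodeAsIds(ja.strip())
        en_ids = en_sp.EncodeAsIds(en.strip())
        # One CTF line per token position; the shared sequence id i
        # keeps the source (Japanese) and target (English) streams aligned.
        for j in range(max(len(ja_ids), len(en_ids))):
            line = str(i)
            if j < len(ja_ids):
                line += " |source {}:1".format(ja_ids[j])
            if j < len(en_ids):
                line += " |target {}:1".format(en_ids[j])
            ctf.write(line + "\n")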

Now that the data is ready, Part 2 will use CNTK to train a machine translation model.

Reference

JESC
sentencepiece

Natural Language : Chat Bot Part1 - Twitter API Corpus

  1. Reid Pryzant, Youngjoo Chung, Dan Jurafsky, and Denny Britz. "JESC: Japanese-English Subtitle Corpus", arXiv preprint arXiv:1710.10639 (2017).
