Target

This series summarizes machine translation with the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we prepare the data for machine translation with CNTK.

The steps are introduced in the following order.

Download JESC dataset
Creating a Sentence Piece model
Creating a file to be read by the built-in reader provided by CNTK

Introduction

Download JESC dataset

The Japanese-English Subtitle Corpus (JESC) is a large Japanese-English parallel corpus that includes colloquial language. [1]

Go to the JESC page, download the Official splits under Download, and unzip them. The directory structure used this time is as follows.

Doc2Vec
NMTT
 |―JESC
    dev
    test
    train
  nmtt_corpus.py
STSA
Word2Vec

Creating a Sentence Piece model

This time, we preprocessed the JESC dataset, for example by reducing redundancy and removing non-Japanese text.
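The kind of filtering described above can be sketched as follows. This is a minimal illustration, not the code in nmtt_corpus.py: the helper name `filter_pairs`, the character ranges used to decide what counts as Japanese, and the reading of "reducing redundancy" as dropping exact duplicate pairs are all assumptions.

```python
import re

# Hiragana, katakana, and common CJK ideographs.
# Assumption: a line counts as Japanese if it contains at least one of these.
JAPANESE_CHARS = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")


def filter_pairs(pairs):
    """Drop empty pairs, pairs without Japanese characters, and exact duplicates."""
    seen = set()
    kept = []
    for english, japanese in pairs:
        english, japanese = english.strip(), japanese.strip()
        if not english or not japanese:
            continue  # one side is empty
        if not JAPANESE_CHARS.search(japanese):
            continue  # the Japanese side contains no Japanese characters
        if (english, japanese) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((english, japanese))
        kept.append((english, japanese))
    return kept
```

The function works on (English, Japanese) tuples so that it makes no assumption about the layout of the downloaded files.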

For word segmentation, as in Natural Language: Chat Bot Part1 - Twitter API Corpus, we create a subword model using sentencepiece.

Creating a file to be read by the built-in reader provided by CNTK

After converting the text to word IDs with the Sentence Piece models trained on the training data, we create a text file for the CTFDeserializer used during training.
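As a rough sketch of what such a file looks like, the helper below turns one sentence pair, already converted to word IDs, into lines of the CNTK text format, using the sparse `index:value` notation with one token per line. The function name `to_ctf_lines` and the stream names `source` and `target` are hypothetical. In practice the ID lists would come from the trained Sentence Piece models (for example via `EncodeAsIds`).

```python
def to_ctf_lines(seq_id, source_ids, target_ids,
                 source_name="source", target_name="target"):
    """Format one sentence pair as CNTK text format lines.

    Every line starts with the sequence ID. Each stream contributes one
    one-hot token per line, written sparsely as "index:1". When one side
    is shorter, its stream is simply omitted on the remaining lines.
    """
    lines = []
    for i in range(max(len(source_ids), len(target_ids))):
        fields = [str(seq_id)]
        if i < len(source_ids):
            fields.append("|%s %d:1" % (source_name, source_ids[i]))
        if i < len(target_ids):
            fields.append("|%s %d:1" % (target_name, target_ids[i]))
        lines.append(" ".join(fields))
    return lines


if __name__ == "__main__":
    # Sequence 0: three source tokens and two target tokens.
    for line in to_ctf_lines(0, [5, 9, 2], [7, 3]):
        print(line)
```

Writing the returned lines for every pair, with an increasing sequence ID, gives a file that a CTFDeserializer configured with the same stream names can read.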

Implementation

Execution environment

Hardware

・ CPU Intel(R) Core(TM) i7-7700 3.60GHz

Software

・ Windows 10 Pro 1909
・ Python 3.6.6
・ Sentencepiece 0.1.86

Program to run

The implemented program is published on GitHub.

`nmtt_corpus.py`

Result

The function jesc_preprocessing produces train.english.txt and train.japanese.txt, which are used to create the Sentence Piece models.

Then train the Sentence Piece models. Training starts when spm_train is run with arguments like the ones shown below. Separate models are created for Japanese and English, each with a vocabulary size of 32,000.

$ spm_train --input=/mnt/c/.../JESC/train.english.txt --model_prefix=english --vocab_size=32000

When training finishes, english.model and english.vocab, and likewise japanese.model and japanese.vocab, are created.
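The two training runs can also be driven from Python through the sentencepiece package. The sketch below is an assumption about how this might be scripted, not part of nmtt_corpus.py: `spm_train_args` and `train_models` are hypothetical helpers, and `corpus_dir` stands for the directory that holds the files produced by jesc_preprocessing.

```python
def spm_train_args(corpus_dir, language, vocab_size=32000):
    """Build the spm_train argument string for one language."""
    return "--input=%s/train.%s.txt --model_prefix=%s --vocab_size=%d" % (
        corpus_dir, language, language, vocab_size)


def train_models(corpus_dir, languages=("english", "japanese")):
    """Train one Sentence Piece model per language."""
    # Imported here so that spm_train_args can be used without the package.
    import sentencepiece as spm

    for language in languages:
        # Accepts the same flags as the spm_train command-line tool.
        spm.SentencePieceTrainer.Train(spm_train_args(corpus_dir, language))
```

Calling `train_models` with the corpus directory writes the model and vocab files for both languages into the current working directory, because `model_prefix` is a bare name.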

Finally, execute the function jesc_sentencepiece to create a text file to be read by CTFDeserializer.

Now 10000 samples...
Now 20000 samples...
...
Now 2740000 samples...

Number of samples 2748930

Maximum Sequence Length 97
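Statistics like these can be collected while the pairs are converted. The sketch below only illustrates the idea. The name `corpus_stats` is hypothetical, and taking the maximum over both the source and the target side is an assumption about how the sequence length is measured.

```python
def corpus_stats(id_pairs, report_every=10000):
    """Count sentence pairs and track the longest sequence on either side."""
    num_samples = 0
    max_length = 0
    for source_ids, target_ids in id_pairs:
        num_samples += 1
        max_length = max(max_length, len(source_ids), len(target_ids))
        if num_samples % report_every == 0:
            print("Now %d samples..." % num_samples)
    return num_samples, max_length
```

The maximum length is useful later when deciding how long the sequences fed to the model may become.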

Now that the data is ready, Part 2 will use CNTK to train a machine translation model.

Reference

JESC

sentencepiece

Natural Language : Chat Bot Part1 - Twitter API Corpus

[1] Reid Pryzant, Youngjoo Chung, Dan Jurafsky, and Denny Britz. "JESC: Japanese-English Subtitle Corpus", arXiv preprint arXiv:1710.10639 (2017).
