This series summarizes machine translation using the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we prepare the data for machine translation with the Microsoft Cognitive Toolkit.
I will introduce the steps in the following order.
The Japanese-English Subtitle Corpus (JESC) is a large Japanese-English bilingual corpus that includes colloquial expressions.
From the page above, download the Official splits under Download and unzip them. The directory structure used in this article is as follows.
For this article, we preprocessed the JESC dataset, for example by reducing redundancy and removing non-Japanese text.
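As a rough illustration of this kind of preprocessing (the actual jesc_preprocessing implementation is published on GitHub; the helper below is a hypothetical sketch, not the author's code), duplicate sentence pairs can be dropped and pairs whose Japanese side contains no Japanese characters can be filtered out:

```python
import re

# Hiragana, katakana, and common kanji ranges.
JP_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def clean_pairs(pairs):
    """Sketch of JESC-style cleanup: deduplicate and drop non-Japanese pairs."""
    seen = set()
    cleaned = []
    for en, ja in pairs:
        if (en, ja) in seen:            # reduce redundancy: skip exact duplicates
            continue
        if not JP_CHARS.search(ja):     # remove pairs with no Japanese text
            continue
        seen.add((en, ja))
        cleaned.append((en, ja))
    return cleaned

pairs = [
    ("hello", "こんにちは"),
    ("hello", "こんにちは"),   # exact duplicate
    ("ok", "OK!!"),            # no Japanese characters on the Japanese side
]
print(clean_pairs(pairs))  # [('hello', 'こんにちは')]
```

The real preprocessing is more involved, but the filtering pattern is the same: scan each pair once and keep only those that pass every check.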
For word segmentation, we use SentencePiece, as in Natural Language: Chat Bot Part 1 - Twitter API Corpus, to create a subword model.
After converting the text to word IDs with the SentencePiece model trained on the training data, we create a text file for the CTFDeserializer used during training.
・ CPU Intel(R) Core(TM) i7-7700 3.60GHz
・ Windows 10 Pro 1909
・ Python 3.6.6
・ sentencepiece 0.1.86
The implemented program is published on GitHub.
The function jesc_preprocessing produces train.english.txt and train.japanese.txt, which are used to train the SentencePiece model.
Next, train the SentencePiece model. Training starts with the arguments shown below; create a separate model for Japanese and for English. I set the vocabulary size to 32,000.
$ spm_train --input=/mnt/c/.../JESC/train.english.txt --model_prefix=english --vocab_size=32000
When training finishes, english.model and english.vocab (and likewise japanese.model and japanese.vocab) are created.
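The *.vocab files that SentencePiece writes are plain text: one subword piece per line, tab-separated from its score, with the word ID given by the line number. A minimal sketch of reading such a file into a piece-to-ID mapping (the sample entries below are illustrative, not taken from the actual english.vocab):

```python
def load_vocab(lines):
    """Map each SentencePiece piece to its word ID (its line number)."""
    return {line.split("\t")[0]: i for i, line in enumerate(lines)}

# Illustrative sample in the *.vocab format: "piece<TAB>score" per line.
sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "▁the\t-3.1"]
vocab = load_vocab(sample)
print(vocab["▁the"])  # 3
```

In practice you would read the file with `open("english.vocab", encoding="utf-8")` and pass its lines to the same function.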
Finally, execute the function jesc_sentencepiece to create a text file to be read by CTFDeserializer.
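The text format read by CTFDeserializer puts one token per line, with lines belonging to the same sentence pair sharing a sequence number, and each word ID written as a sparse one-hot entry `id:1`. The sketch below is a hypothetical simplification of what jesc_sentencepiece writes; the stream names `source` and `target` are assumptions for illustration:

```python
def ctf_lines(pairs_of_ids):
    """Render (source_ids, target_ids) pairs as CTF-style text lines."""
    lines = []
    for seq_id, (src_ids, tgt_ids) in enumerate(pairs_of_ids):
        # Walk the longer of the two sequences; shorter one simply ends early.
        for i in range(max(len(src_ids), len(tgt_ids))):
            parts = [str(seq_id)]
            if i < len(src_ids):
                parts.append(f"|source {src_ids[i]}:1")
            if i < len(tgt_ids):
                parts.append(f"|target {tgt_ids[i]}:1")
            lines.append(" ".join(parts))
    return lines

print(ctf_lines([([3, 4], [5])]))
# ['0 |source 3:1 |target 5:1', '0 |source 4:1']
```

Writing these lines to a file with `"\n".join(...)` yields a text file that CTFDeserializer can read as two sparse sequence streams of different lengths.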
Now 10000 samples...
Now 20000 samples...
Now 2740000 samples...
Number of samples 2748930
Maximum Sequence Length 97
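The two numbers logged above, the sample count and the maximum sequence length, can be recomputed from per-token CTF lines by counting how many lines share each sequence number. A small hedged sketch (assuming the one-token-per-line layout described earlier):

```python
from collections import Counter

def corpus_stats(lines):
    """Return (number of sequences, longest sequence length) from CTF lines."""
    lengths = Counter(line.split()[0] for line in lines)  # tokens per sequence ID
    return len(lengths), max(lengths.values())

sample = [
    "0 |source 3:1 |target 5:1",
    "0 |source 4:1",
    "1 |source 7:1 |target 8:1",
]
print(corpus_stats(sample))  # (2, 2)
```

On the full JESC training file this corresponds to the 2,748,930 samples and maximum sequence length of 97 reported above.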
Now that the data is ready, Part 2 will use CNTK to train the machine translation model.