[PYTHON] A quick introduction to neural machine translation libraries

Introduction

With English-Japanese translation being added to WMT, the world's largest workshop on machine translation research, from 2020, and DeepL adding support for Japanese, attention to Japanese machine translation is growing. In this situation, some people may want to try out machine translation models or train one themselves. I am an amateur in this field too, but I may want to use these tools in the future, so I looked into which machine translation libraries are currently used in research. I have only used Fairseq and the old OpenNMT myself.

Which libraries are used

According to the Findings of WMT2019, a machine translation competition, Marian accounts for over 30%.

Marian NMT

Marian is a framework developed by the Microsoft Translator team. Since it is written in C++, it is very fast. In terms of accuracy, it has a proven track record: the Microsoft team's systems built with Marian ranked highly in various language pairs at WMT2019.

Judging from the examples, the workflow seems to be to create a vocabulary file from the tokenized corpus with Marian's command and then train.
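
To make the "vocabulary file from a tokenized corpus" step concrete, here is a minimal Python sketch of the idea. It is not Marian's own code (Marian ships its own command for this step); the function name `build_vocab` and the special tokens are assumptions chosen for illustration.

```python
from collections import Counter


def build_vocab(tokenized_lines, specials=("</s>", "<unk>")):
    """Map each token to an integer id, most frequent tokens first."""
    # Tokens are assumed to be separated by whitespace already.
    counts = Counter(tok for line in tokenized_lines for tok in line.split())
    # Sort by descending frequency, then alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Special tokens (an assumption of this sketch) take the lowest ids.
    entries = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(entries)}


corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocab(corpus)
print(vocab)
```

The training step would then look tokens up in this mapping, with anything missing falling back to the unknown-word entry.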

To install the GPU version, prepare CMake 3.5.1, GCC/G++ 5.4, Boost 1.65.1, and CUDA 9.0 or newer, then build it with make (a long time ago, when I knew nothing, I tried to build it, the build failed, and I gave up).

Unfortunately, I haven't found any Japanese articles about installing and using Marian so far, so read the English documentation and tutorials.

Fairseq

Fairseq is a toolkit developed by Facebook AI. It is written in PyTorch, its main feature is that it is designed to be easy to extend, and I get the impression that it is updated steadily. In terms of accuracy, the Facebook team achieved excellent results at WMT 2019. It runs faster in FP16 mode (at first it was faster than Marian, but I think I read somewhere that an update made Marian faster).

To use it, run the dedicated preprocess script to binarize the corpus and vocabulary before training. At inference time, test sentences can be fed in as they are, without being binarized. One thing I personally appreciate is that Fairseq commands will not overwrite the binarized training data or the checkpoint files of a trained model (they stop with an AssertionError). There is a lot more I could write about Fairseq, but that would fill an article by itself, and there are other Japanese articles, so I will leave it at this level here. There are also official examples, and it is very easy to use.
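
As a rough sketch of the binarize-then-infer flow described above, the following Python snippet only assembles the command lines and does not run them. The helper names, the language pair, and all paths (`corpus`, `data-bin`, the checkpoint file) are assumptions for illustration; the flags are the ones I understand `fairseq-preprocess` and `fairseq-interactive` to accept, so check them against the official documentation for your version.

```python
import shlex


def preprocess_command(src, tgt, data_prefix, dest_dir):
    """Command line that binarizes the corpus and builds the vocabularies."""
    return [
        "fairseq-preprocess",
        "--source-lang", src,
        "--target-lang", tgt,
        # Each prefix is expected to have one file per language, e.g. train.en / train.ja.
        "--trainpref", f"{data_prefix}/train",
        "--validpref", f"{data_prefix}/valid",
        "--testpref", f"{data_prefix}/test",
        "--destdir", dest_dir,
    ]


def interactive_command(dest_dir, checkpoint):
    """Command line that translates sentences read from standard input."""
    return ["fairseq-interactive", dest_dir, "--path", checkpoint]


cmd = preprocess_command("en", "ja", "corpus", "data-bin")
print(shlex.join(cmd))
print(shlex.join(interactive_command("data-bin", "checkpoints/checkpoint_best.pt")))
```

Building the commands as lists keeps quoting safe if they are later passed to `subprocess.run`.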

Installation is mostly fine as long as PyTorch works. For PyTorch, follow the official installation instructions for your OS, package manager, and CUDA version.

OpenNMT

OpenNMT is a tool developed by the Harvard NLP group and SYSTRAN. It is the oldest of the libraries introduced here. There used to be a Lua version, but it seems to have ended when maintenance of Torch ended. Currently, development of the PyTorch version (OpenNMT-py) and the TensorFlow version (OpenNMT-tf) is ongoing. The features available in the two are quite different. In addition to machine translation and language modeling, image-to-text, speech-to-text, summarization, sequence classification, and sequence tagging are also possible.

The usual workflow is to run its dedicated preprocessing and then train.

When I used OpenNMT-py, I struggled with the torchtext version during installation. I was also dissatisfied that it did not automatically save the model that scored best on validation, so I had to pick a model at inference time by reading the training log, and that I did not know which options and hyperparameters would give good accuracy. I don't know how things are now. There are quite a few Japanese articles, so you may want to refer to them.
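
Picking a model by hand from the training log amounts to finding the checkpoint with the best validation score. A minimal sketch of that selection follows. The numbers are made up for illustration, and copying the scores out of the log is left to the reader, since the log format depends on the version.

```python
def best_checkpoint(validation_scores, lower_is_better=True):
    """Return the (step, score) pair with the best validation score."""
    pick = min if lower_is_better else max
    return pick(validation_scores.items(), key=lambda item: item[1])


# Hypothetical validation perplexities, keyed by training step.
perplexity_by_step = {10000: 12.4, 20000: 9.8, 30000: 9.1, 40000: 9.6}

step, score = best_checkpoint(perplexity_by_step)
print(f"use the checkpoint saved at step {step} (validation perplexity {score})")
```

For a metric where higher is better, such as accuracy or BLEU, pass `lower_is_better=False`.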

Tensor2Tensor

T2T is a library of deep learning models and datasets developed by the Google Brain team. It is written in TensorFlow. The other libraries featured in this article have machine translation as their main function, but T2T can apply deep learning models to various tasks such as image classification and image generation.

The rough usage is to run a data generation command and then train and infer, and there is an option called --problem. This option does not just specify the task; it specifies the dataset to use. As a result, it is very easy to experiment with existing benchmark datasets, but to use a dataset you prepared yourself, you need to define a class that inherits from a class called Problem. It's nice that this design explicitly ties the model to the data (and the preprocessing method), but I think other libraries are superior in terms of ease of use.
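
The idea of selecting a dataset by name and subclassing Problem for your own data can be illustrated with a toy registry in plain Python. This is not T2T's API: every name below (`PROBLEMS`, `register_problem`, `load_problem`, `MyEnJaProblem`) is hypothetical, and the sketch only shows how a class can bundle data and preprocessing behind a name.

```python
PROBLEMS = {}


def register_problem(cls):
    """Make a Problem subclass selectable by its name."""
    PROBLEMS[cls.name] = cls
    return cls


class Problem:
    """Base class: a dataset together with its preprocessing."""
    name = "base"

    def generate_samples(self):
        raise NotImplementedError

    def preprocess(self, text):
        # Toy preprocessing: lowercase and split on whitespace.
        return text.lower().split()


@register_problem
class MyEnJaProblem(Problem):
    name = "my_en_ja"

    def generate_samples(self):
        # Hypothetical in-memory data; a real problem would read corpus files.
        yield {"inputs": "Hello world", "targets": "こんにちは 世界"}


def load_problem(name):
    """Look up a problem by name, as a --problem style option would."""
    return PROBLEMS[name]()


problem = load_problem("my_en_ja")
samples = [
    {key: problem.preprocess(value) for key, value in sample.items()}
    for sample in problem.generate_samples()
]
print(samples)
```

Because the data and its preprocessing live in one class, selecting the name is enough to reproduce an experiment, which is the benefit described above. The cost is that a new dataset needs a new class.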

There are Japanese articles and official Jupyter Notebooks, so I get the impression that examples are plentiful. TensorFlow Serving can also be used, so this may be the one to choose when using deep learning models in production.

Sockeye

Sockeye is a Seq2Seq framework built on Apache MXNet (Incubating). Its usage seems to be similar to OpenNMT's.

I couldn't work out what distinguishes this library just by looking at it... Sorry that this reads like a casual blog post...

Conclusion

I think it's best to try them in order from the top.
