Deep learning requires a large amount of training data, but in practice it is rarely possible to collect that much. Against this background, techniques for learning from small amounts of data have become widespread in recent years. There are three main approaches to learning with less data:

- Use high-quality data
- Data augmentation
- Transfer learning

In this article, we focus on data augmentation through translation into other languages, as used in natural language processing: what kind of technique data augmentation is in the first place, what to pay attention to when performing it, and how to actually implement it.

- Introduction
- Data augmentation in natural language processing
- Analysis environment and preparation
- Translation processing

Data augmentation is a technique that **transforms the original training data to increase the amount of data**, and it is widely used not only in natural language processing but also in image processing. As an aside, the English term "Data Augmentation" literally means "expanding the data".
The implementation in this article uses a Kaggle Kernel. The specifications and settings of the Kaggle environment used this time are listed below.
Don't forget to turn on Internet access when using a Kaggle Kernel. If you are working in a local environment instead, install each module by typing the following commands at the command prompt.
pip install -U joblib textblob
python -m textblob.download_corpora
Import the modules as follows.
augmentation.py
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# augmentation
from joblib import Parallel, delayed
from textblob import TextBlob
from textblob.translate import NotTranslated
# sleep
from time import sleep
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
The data used this time is the dataset from the Kaggle competition Jigsaw Unintended Bias in Toxicity Classification. No preprocessing is performed before translation, and only the first 100 records are extracted for this trial.
augmentation.py
# importing the dataset
train = pd.read_csv("../input/train.csv")
train = train.head(100)
Once the modules and data are ready, let's carry out the translation process. As a test, execute the following code to translate the example sentence assigned to *x* from English to Japanese.
augmentation.py
# a random example
x = "Great question! It's one we're asked a lot. We've designed the system assuming that people *will* try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!"
# translate
analysis = TextBlob(x)
print(analysis)
print(analysis.translate(to='ja'))
A result like the following should be returned (the second block is the Japanese translation, rendered back into English for this article):
Great question! It's one we're asked a lot. We've designed the system assuming that people *will* try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!
Great question. That is what we are looking for a lot. We designed the system on the assumption that people would try to abuse it. So, in addition to peer review, there are algorithms on the backend that do a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know it's a really solid start from months of beta testing, and we strive to improve it to continue!
The translation seems to have worked well. Next, to make the translation process more reusable, we wrap it in a function. We will translate through three languages: Spanish, German, and French. First, define the languages used for translation, the number of workers for parallel processing, and the frequency of progress output as parameters.
augmentation.py
languages = ["es", "de", "fr"]
parallel = Parallel(n_jobs=-1, backend="threading", verbose=5)
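joblib's `Parallel(...)(delayed(f)(x) for x in xs)` is essentially a parallel map: it applies a function to each item and returns the results in input order. As a minimal sketch of the same pattern using only the standard library (the `square` function is a made-up stand-in for the per-comment work, not part of the article's code):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # stand-in for the per-comment translate_text call
    return n * n

# threads suit I/O-bound work such as HTTP translation requests;
# pool.map preserves input order, just like joblib's Parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(5)))

print(results)  # [0, 1, 4, 9, 16]
```

The `backend="threading"` choice in the article's code follows the same reasoning: the work is dominated by waiting on network responses, so threads are sufficient and avoid process-spawning overhead.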
Next, define the translation function. The code below performs the round trip: English → target language → English.
augmentation.py
def translate_text(comment, language):
    if hasattr(comment, "decode"):
        comment = comment.decode("utf-8")
    text = TextBlob(comment)
    try:
        text = text.translate(to=language)
        sleep(0.4)
        text = text.translate(to="en")
        sleep(0.4)
    except NotTranslated:
        pass
    return str(text)
The key point in the above processing is the sleep function from the time module, which pauses briefly between translation requests. Without this pause, the following error occurs:
HTTPError: HTTP Error 429: Too Many Requests
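A fixed sleep works, but a more robust alternative is to retry with exponential backoff when the rate limit is hit. A minimal sketch (the `with_backoff` helper and its parameters are made up for illustration; in the article's code it would wrap the `text.translate` calls):

```python
import time

def with_backoff(do_request, max_retries=5, base_delay=0.4):
    """Call do_request(); on failure, sleep and retry with a doubling delay."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return do_request()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(delay)
            delay *= 2  # exponential backoff: 0.4s, 0.8s, 1.6s, ...

# usage sketch: a fake request that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP Error 429: Too Many Requests")
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # prints: ok
```

Backing off exponentially gives the server progressively more breathing room, instead of hammering it again after a fixed interval.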
We now perform the translation using the parameters and function defined above. It can be run with the following code.
augmentation.py
comments_list = train["comment_text"].fillna("unknown").values
for language in languages:
    print('Translate comments using "{0}" language'.format(language))
    translated_data = parallel(delayed(translate_text)(comment, language) for comment in comments_list)
    train["comment_text"] = translated_data
    result_path = os.path.join("train_" + language + ".csv")
    train.to_csv(result_path, index=False)
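To actually train on the augmented data, the per-language output files would typically be stacked back onto the original frame, dropping rows whose text came through the round trip unchanged (they add no new information). A sketch with toy in-memory frames (the data here is made up; the column names follow the article):

```python
import pandas as pd

# toy stand-ins for the original frame and one back-translated output
original = pd.DataFrame({"id": [1, 2],
                         "comment_text": ["great question", "nice answer"]})
augmented_es = pd.DataFrame({"id": [1, 2],
                             "comment_text": ["great question", "good answer"]})

# stack original and augmented rows, then drop exact-duplicate texts
combined = pd.concat([original, augmented_es], ignore_index=True)
combined = combined.drop_duplicates(subset="comment_text")
print(len(combined))  # 3: one comment survived the round trip unchanged
```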
After executing the above process, the log should be output as follows.
Translate comments using "es" language
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 2.5s
[Parallel(n_jobs=-1)]: Done 64 tasks | elapsed: 13.4s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 20.8s finished
Translate comments using "de" language
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 2.5s
[Parallel(n_jobs=-1)]: Done 64 tasks | elapsed: 13.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 20.7s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
Translate comments using "fr" language
[Parallel(n_jobs=-1)]: Done 10 tasks | elapsed: 2.5s
[Parallel(n_jobs=-1)]: Done 64 tasks | elapsed: 13.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 21.4s finished
Once every run reports "finished", the translation results are output as CSV files.
Let's take a look at the output translation result using the following original text as an example.
** Original **
Great question! It's one we're asked a lot. We've designed the system assuming that people will try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!
Comparing with the original text, you can see that the wording changes slightly in each of the following back-translation results.
**English → Spanish → English**
Great question It is one that they ask us a lot. We have designed the system assuming that people will try to abuse it. So, in addition to peer reviews, there are algorithms in the backend that perform many meta-analyzes. I'm sure the system is not 100% perfect yet, but we know for months of beta testing that it's a really solid start, and we'll keep working to improve it!
**English → German → English**
Good question! We are often asked about it. We designed the system on the assumption that people will try to abuse it. In addition to the peer reviews, there are backend algorithms that do a lot of meta-analysis. I'm sure the system is not 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll continue to work on improving it!
**English → French → English**
Good question! We are asked a lot. We designed the system on the assumption that people will * try * to abuse it. Thus, in addition to peer reviews, there are algorithms on the backend that do a lot of meta-analysis. I'm sure the system is not 100% perfect yet, but months of beta testing have taught us that it was a good start, and we will continue to improve it!
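One quick sanity check on augmented output is how many comments actually changed during the round trip, since identical rows add no diversity. A minimal sketch with toy lists (the variable names and data are made up for illustration):

```python
# toy stand-ins for the original and back-translated comment lists
originals = ["Great question!", "We are asked a lot.", "Solid start."]
backtranslated = ["Good question!", "We are asked a lot.", "A solid start."]

# count comments that actually changed during the round trip
changed = sum(o != b for o, b in zip(originals, backtranslated))
print(changed, "of", len(originals), "comments changed")  # 2 of 3
```

A low changed ratio suggests the chosen pivot language is too close to the source to produce useful variation.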
Finally, here are the sites that were helpful for this article, along with sites covering related topics.
TextBlob: Simplified Text Processing: see here for details on "TextBlob", the library used for the translation processing.
[Explanation of all arguments of joblib's Parallel](https://own-search-and-study.xyz/2018/01/17/Explanation of all arguments of parallel of joblib /): explains the arguments of the parallel-processing library "Joblib", including those not used in this article.
Data augmentation and transfer learning (Vol.7): covers data augmentation techniques in image processing.
Next time, after building a model with the augmented data, I would like to examine how the similarity between the original language and the languages used for translation affects model accuracy.