[Python] Easy "inflating" (data augmentation) of data that can be used in natural language processing

Introduction

Deep learning requires large amounts of training data, but in reality it is rarely possible to collect that much data. Against this background, techniques for learning from small amounts of data have become widespread in recent years. There are three main approaches to learning with less data:

- Using high-quality data
- "Inflating" (data augmentation)
- Transfer learning

In this article, we focus on "inflating" data by translating it into other languages, a technique used in natural language processing. We will sort out what kind of technique "inflating" is in the first place and what to pay attention to when performing it, and then actually implement it.

Table of contents

- Introduction
- Inflating data in natural language processing
- Analytical environment and advance preparation
- Translation processing
- The results of inflating
- Reference information
- Finally

Inflating data in natural language processing

"Inflating" is a technique that ** converts the original learning data to increase the amount of data **, and is often used not only in natural language processing but also in image processing. As an aside, the original word for "inflated" is "Data Augmentation", which literally means "data expansion".

Analytical environment and advance preparation

The implementation in this article uses a Kaggle Kernel.

Don't forget to turn the Internet option on when using a Kaggle Kernel. If you are working in a local environment instead, install the required modules by typing the following commands at the command prompt.

pip install -U joblib textblob
python -m textblob.download_corpora

Import the modules as follows.

augmentation.py


# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# augmentation
from joblib import Parallel, delayed
from textblob import TextBlob
from textblob.translate import NotTranslated

# sleep
from time import sleep 

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

The data used this time is the dataset from the Kaggle competition Jigsaw Unintended Bias in Toxicity Classification. For this translation process, no preprocessing is performed, and only the first 100 records are extracted as a trial.

augmentation.py


# importing the dataset
train = pd.read_csv("../input/train.csv")
train = train.head(100)

Translation processing

Once the modules and data are ready, let's carry out the translation process. As a test, execute the following code to translate the example sentence assigned to the variable *x* from English to Japanese.

augmentation.py


# a random example
x = "Great question! It's one we're asked a lot. We've designed the system assuming that people *will* try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!"

# translate
analysis = TextBlob(x)
print(analysis)
print(analysis.translate(to='ja'))

A result like the following should then be returned (the original text followed by its Japanese translation, shown here rendered back into English):

Great question! It's one we're asked a lot. We've designed the system assuming that people *will* try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!
Great question. That is what we are looking for a lot. We designed the system on the assumption that people would try to abuse it. So, in addition to peer review, there are algorithms on the backend that do a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know it's a really solid start from months of beta testing, and we strive to improve it to continue!

The sentence was translated well. Next, to make the translation process more reusable, we will wrap it in a function. This time, the translation will be performed via three languages: Spanish, German, and French. First, define the languages used for translation, the number of cores used for parallel processing, and the progress output frequency as parameters.

augmentation.py


languages = ["es", "de", "fr"]
parallel = Parallel(n_jobs=-1, backend="threading", verbose=5)
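
If you are not familiar with Joblib, Parallel combined with delayed simply applies a function to each element of an iterable across multiple workers. Below is a minimal illustrative sketch using a toy function called square (a hypothetical stand-in, not part of the translation pipeline):

from joblib import Parallel, delayed

def square(n):
    # toy function standing in for translate_text
    return n * n

# run square on each element using all available cores (n_jobs=-1);
# verbose=5 controls how often progress messages are printed
results = Parallel(n_jobs=-1, backend="threading", verbose=5)(
    delayed(square)(n) for n in range(10)
)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]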

Next, define the translation function. The code below is the actual round-trip translation function.

augmentation.py


def translate_text(comment, language):
    # decode byte strings to UTF-8 if necessary
    if hasattr(comment, "decode"):
        comment = comment.decode("utf-8")
    text = TextBlob(comment)
    try:
        # translate to the intermediate language, then back to English
        text = text.translate(to=language)
        sleep(0.4)
        text = text.translate(to="en")
        sleep(0.4)
    except NotTranslated:
        # if the text cannot be translated, keep the original
        pass
    return str(text)
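
As a quick check, the function can be applied to the example sentence from earlier (assuming x is still defined from the previous snippet):

print(translate_text(x, "ja"))  # English -> Japanese -> back to English

The returned string is the back-translated English text, which is what we will keep as augmented data.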

A key point of the above processing is that the sleep function of the time module is used to pause briefly between translation calls. The pauses are inserted because, without them, the following error occurs:

HTTPError: HTTP Error 429: Too Many Requests
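
If this error still appears despite the pauses, one possible workaround (a sketch of my own, not part of the original code) is to retry with an exponentially increasing wait. The helper below assumes the same TextBlob calls used in translate_text; the name translate_with_retry is hypothetical:

import urllib.error
from time import sleep
from textblob import TextBlob
from textblob.translate import NotTranslated

def translate_with_retry(comment, language, max_retries=3):
    # retry the round-trip translation with exponential backoff
    # whenever the translation endpoint returns HTTP 429
    for attempt in range(max_retries):
        try:
            text = TextBlob(comment).translate(to=language)
            sleep(0.4)
            return str(text.translate(to="en"))
        except NotTranslated:
            return comment  # the text could not be translated; keep it as is
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    return comment  # give up after max_retries and keep the original text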

Now let's actually run the translation using the parameters and function defined above. The translation process can be performed with the following code.

augmentation.py


# fill in missing comments before translating
comments_list = train["comment_text"].fillna("unknown").values

for language in languages:
    print('Translate comments using "{0}" language'.format(language))
    # round-trip translate every comment in parallel
    translated_data = parallel(delayed(translate_text)(comment, language) for comment in comments_list)
    # replace the comment column and save one CSV per intermediate language
    train['comment_text'] = translated_data
    result_path = os.path.join("train_" + language + ".csv")
    train.to_csv(result_path, index=False)

After executing the above process, a log like the following should be output.

Translate comments using "es" language

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.4s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   20.8s finished

Translate comments using "de" language

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   20.7s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.

Translate comments using "fr" language
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   21.4s finished

Once all three runs have finished, the translation results should be output as CSV files.
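
As a rough sketch of how the output files could be used next (an assumed post-processing step, not shown in this article), the three CSVs can be concatenated with the original sample to form the augmented training set:

# reload the original sample and append the three back-translated versions
# (an assumed post-processing step, not part of the original code)
original = pd.read_csv("../input/train.csv").head(100)
augmented_frames = [pd.read_csv("train_" + lang + ".csv") for lang in languages]
augmented = pd.concat([original] + augmented_frames, ignore_index=True)
print(augmented.shape)  # 4 times the original number of rows (100 -> 400)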

The results of inflating

Let's take a look at the output translation results, using the following original text as an example.

**Original**

Great question! It's one we're asked a lot. We've designed the system assuming that people will try to abuse it. So, in addition to the peer reviews, there are algorithms on the backend doing a lot of meta-analysis. I'm sure the system isn't 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll keep working to improve it!

Comparing them with the original text, you can see that the wording changes slightly in each of the following translation results.

**English -> Spanish -> English**

Great question It is one that they ask us a lot. We have designed the system assuming that people will try to abuse it. So, in addition to peer reviews, there are algorithms in the backend that perform many meta-analyzes. I'm sure the system is not 100% perfect yet, but we know for months of beta testing that it's a really solid start, and we'll keep working to improve it!

**English -> German -> English**

Good question! We are often asked about it. We designed the system on the assumption that people will try to abuse it. In addition to the peer reviews, there are backend algorithms that do a lot of meta-analysis. I'm sure the system is not 100% perfect yet, but we know from months of beta testing that it's a really solid start, and we'll continue to work on improving it!

**English -> French -> English**

Good question! We are asked a lot. We designed the system on the assumption that people will * try * to abuse it. Thus, in addition to peer reviews, there are algorithms on the backend that do a lot of meta-analysis. I'm sure the system is not 100% perfect yet, but months of beta testing have taught us that it was a good start, and we will continue to improve it!

Reference information

Here are the sites that helped with this article, as well as sites covering related topics.

TextBlob: Simplified Text Processing: See here for more about the "TextBlob" library used for the translation process.

[Explanation of all arguments of Parallel of Joblib](https://own-search-and-study.xyz/2018/01/17/Explanation of all arguments of parallel of joblib /): Explains the arguments of the parallel-processing library "Joblib", including those not used in this article.

Inflating and transfer learning (Vol.7): Covers the "inflating" technique in image processing.

Finally

Next time, after actually building a model with the inflated data, I would like to discuss how the similarity between the original language and the intermediate language used for translation affects the accuracy of the model.
