[PYTHON] Performance verification of data preprocessing in natural language processing

Introduction

When building machine learning systems, you need design know-how that accounts for the time and resources that data preprocessing requires. In this post, we give an overview of data preprocessing for natural language and present performance verification results for the data preprocessing in chABSA-dataset, an implementation example of sentiment polarity analysis.

Post list

  1. Overview of natural language processing and its preprocessing
  2. Performance verification of data preprocessing in natural language processing ... This post

The table of contents of this post is as follows.

- 3. Example of resources and processing time required for preprocessing of natural language processing
  - 3.1 Verification environment
  - 3.2 Experiment contents
    - 3.2.1 Experimental flow
    - 3.2.2 Word-separation library comparison
      - (1) Dependent library
      - (2) How to call a process (function) in the code
      - (3) I/O data format in the code
  - 3.3 Experimental results
  - 3.4 Consideration of experimental results
- Summary

3. Example of resources and processing time required for preprocessing of natural language processing

Here, we execute the data preprocessing in chABSA-dataset (Table 2 of the first post) and measure the amount of resources and the processing time required. In particular, we change the library used for the word-separation process from Janome to MeCab and compare resource usage and processing time between the two patterns.

3.1 Verification environment

Table 4 shows the specifications of the virtual machine (VM) used in this experiment.

Table 4 VM specifications used for verification

| Item | Value |
| --- | --- |
| OS | Red Hat Enterprise Linux Server 7.7 (Maipo) |
| CPU | Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.5GHz |
| Number of CPU cores | 8 |
| Hyper-threading | Off |
| Memory capacity | 16 GB |
| HDD capacity | 160 GB |
| HDD sequential read | 166 MB/s |
| HDD sequential write | 487 MB/s |
| HDD random read (4K, QD32) | 68 MB/s |
| HDD random write (4K, QD32) | 71 MB/s |

While GPU resources are often used in image processing, they are not required for the text data preprocessing here, so no GPU is used in this experiment.

Table 5 shows the software versions used in the verification.

Table 5 Software version used for verification

| Software | Version | Purpose |
| --- | --- | --- |
| Python | 3.7.4 | Execution environment |
| beautifulsoup4 | 4.9.0 | XML parsing |
| lxml | 4.5.0 | XML parsing |
| Janome | 0.3.10 | Word separation |
| MeCab | 0.996.3 | Word separation |

3.2 Experiment contents

3.2.1 Experimental flow

We execute the data preprocessing in chABSA-dataset shown in Table 2 of the [last post](https://qiita.com/mkakida/items/eba36f519b08dbda1d82#22-%E5%8F%96%E3%82%8A%E6%89%B1%E3%81%86%E5%89%8D%E5%87%A6%E7%90%86%E3%81%AE%E6%A6%82%E8%A6%81) and measure the amount of computer resources used and the processing time of each process. The vectorization process is excluded from measurement because its content depends heavily on the model implementation, and its measurement results are unlikely to be useful for other projects.

Table 6 defines again the processes measured in this experiment. Each process is executed independently and sequentially. Each process handles the data of 2,260 separate companies, but, due to the implementation of chABSA-dataset, the companies are processed one by one rather than in parallel.
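The article does not specify the measurement tooling, so the following is a rough illustration only: elapsed time and resource usage per process could be collected with the standard library plus psutil. In this minimal sketch, `process_fn` and `inputs` are hypothetical stand-ins for any of the processes in Table 6 and their per-company input data.

```python
# Minimal measurement sketch (assumption: psutil is available; the article
# does not say how time, CPU, and memory were actually measured).
import time
import psutil

def measure(process_fn, inputs):
    """Run process_fn over all inputs sequentially; report elapsed time,
    average CPU utilization over the run, and resident memory afterwards."""
    proc = psutil.Process()
    proc.cpu_percent(interval=None)   # reset the CPU-usage counter
    start = time.perf_counter()
    for item in inputs:               # sequential, as in chABSA-dataset
        process_fn(item)
    elapsed = time.perf_counter() - start
    cpu_percent = proc.cpu_percent(interval=None)  # average since the reset
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    return elapsed, cpu_percent, rss_mb
```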

Table 6 Processes to be tested

| # | Process name | Description |
| --- | --- | --- |
| 1 | Performance information extraction | XBRL analysis: parses the data in XBRL (HTML) format and extracts the performance data. |
| 2 | Sentence extraction | Removes tags from the HTML-format data and extracts only the Japanese sentences (see the sketch below). |
| 3 | Analysis target data extraction | Combines the word-separation process and the normalization process. In the chABSA-dataset code the two are called together inside a loop executed for each sentence to be analyzed, making it difficult to measure their resource usage separately, so they are measured as one combined process. |
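As a rough illustration of the sentence extraction in process #2, tag removal with beautifulsoup4 and lxml (the libraries listed in Table 5) might look like the minimal sketch below. The function name and the sentence-splitting rule are assumptions for illustration, not the chABSA-dataset implementation itself.

```python
# Hedged sketch of process #2-style tag removal (assumption: splitting on
# the Japanese full stop approximates sentence extraction; the actual
# chABSA-dataset logic may differ).
from bs4 import BeautifulSoup

def extract_sentences(html_text):
    """Strip tags from HTML performance data and return Japanese sentences."""
    text = BeautifulSoup(html_text, "lxml").get_text()
    return [s.strip() + "。" for s in text.split("。") if s.strip()]
```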

3.2.2 Word-separation library comparison

In chABSA-dataset, the word-separation part of the analysis target data extraction process is implemented with Janome. In this experiment, we rewrite that word-separation process to use MeCab and compare the processing time the two implementations require.

The following three points change when rewriting the word-separation process.

(1) Dependent library

Janome is provided as a Python package that bundles its dictionary, so it can be used simply by installing the package with pip. MeCab, on the other hand, requires installing the MeCab middleware for the OS (e.g. as an rpm package) in addition to installing its Python package with pip.

(2) How to call a process (function) in the code

Although the import target and the name of the word-separation function differ, the way the function is called is almost the same in both libraries.

(3) I/O data format in the code

In both Janome and MeCab, the input is string data holding a sentence, but the output data types differ. Janome's word-separation function returns an array with each word as an element, while MeCab's word-separation function returns a single string in which the words are separated by spaces.

For example, code that takes the sentence string data `sentence` as input and outputs the word-separation result to `tokens` is rewritten from Code 1 to Code 2.

Code 1 Example of word-separation code using Janome

```python
## Janome usage example
from janome.tokenizer import Tokenizer
# ... (omitted) ...

# Create a tokenizer instance in wakati (word-separation) mode
tokenizer = Tokenizer(wakati=True)

# Take the sentence string data `sentence` as input and output the
# word-separation result to `tokens`
tokens = tokenizer.tokenize(sentence)

# The resulting `tokens` is an array with each word as an element
# ... (omitted) ...
```
Code 2 Example of word-separation code using MeCab

```python
## MeCab usage example
import MeCab
# ... (omitted) ...

# Create a MeCab instance that outputs the wakati (word-separated) form
mecab = MeCab.Tagger("-Owakati")

# Take the sentence string data `sentence` as input and output the
# word-separation result to `tokens`
tokens = mecab.parse(sentence)

# The resulting `tokens` is a single string in which the words are separated
# by single-byte spaces; tokens.split(" ") gives the same output as Janome
# ... (omitted) ...
```
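To use the MeCab result where Janome's array output is expected, it is enough to split on the single-byte space as noted in the comment above. A small wrapper sketch (the trailing-newline handling is an assumption: `parse()` with `-Owakati` ends its output with a newline):

```python
# Wrapper returning a Janome-style word list from MeCab (assumption:
# parse() with -Owakati returns "word word ...\n", so strip the trailing
# newline before splitting on the single-byte space).
import MeCab

mecab = MeCab.Tagger("-Owakati")

def tokenize(sentence):
    """Return the word-separation result as a list, like Janome's output."""
    return mecab.parse(sentence).strip().split(" ")
```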

3.3 Experimental results

Figures 1 to 3 show the results of executing the data preprocessing of chABSA-dataset shown in Table 2 of the first post. Figure 1 shows the execution time of each process, giving the total execution time and its breakdown over processes #1 to #3. Figure 2 shows the average CPU utilization while each process runs, and Figure 3 the average memory usage while each process runs. In each figure, process #3 is shown separately for the Janome implementation and the MeCab implementation.

Figure 1 Processing execution time

Figure 1 shows that, regardless of whether process #3 is implemented with Janome or MeCab, process #1 takes about 26 minutes, while processes #2 and #3 together take at most about 2 minutes: the execution times of the processes differ by an order of magnitude. In the time-consuming process #1, an XBRL-format data file of a few MB is read and analyzed for each company, and the HTML and metadata of the performance data are extracted. Since the data extracted by process #1 amounts to at most several tens of KB per company, this difference in data volume is considered to explain the difference in processing-time order between process #1 and the others.

Furthermore, Figure 1 shows that in process #3, which performs word separation and normalization, the MeCab implementation is about 8 times faster than the Janome implementation. This is because the Janome library itself is written in Python, while MeCab is implemented in C++.

Figure 2 Average CPU utilization

Figure 2 shows that the CPU utilization of processes #1, #2, and #3-Janome is around 90% across the board, while that of #3-MeCab is around 70%. This difference is affected by the proportion of each process's time that is spent reading data.

For example, compare #3-Janome and #3-MeCab: as shown in Figure 1, #3-MeCab takes about 5 seconds in total, while #3-Janome takes about 40 seconds. Besides reading the same input data from files, each implementation internally reads the Japanese dictionary files used for word separation. The CPU waits while these files are read, so the shorter the total processing time, the larger the share of this read wait and the lower the average CPU utilization. Similarly, the CPU utilization of process #2 is lower than that of process #1 because data-read time makes up a larger share of its processing time.
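The fixed cost of the dictionary load can be made visible by timing instance creation separately from parsing. A hedged micro-benchmark sketch (not from the article; the example sentence is arbitrary):

```python
# Micro-benchmark sketch (assumption: separating dictionary-load time from
# parse time illustrates the fixed I/O overhead discussed above).
import time
import MeCab

start = time.perf_counter()
mecab = MeCab.Tagger("-Owakati")    # the dictionary is loaded here
load_time = time.perf_counter() - start

start = time.perf_counter()
mecab.parse("今日はいい天気です。")  # pure parsing work
parse_time = time.perf_counter() - start

print(f"dictionary load: {load_time:.3f}s, parse: {parse_time:.6f}s")
```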

Figure 3 Average memory usage

First, Figure 3 shows that memory usage varies widely between the processes. This is thought to be due to the type and amount of data handled, as well as the libraries loaded when each process runs.

Process #1, which extracts performance information from raw data in XBRL format (a kind of XML), and process #2, which removes tags from the performance data (XML + HTML), use similar libraries, but the data sizes they handle differ. The raw data representing each company's securities report, the input to process #1, is several MB in size, while the input to process #2 is only the performance-information part and is several to several tens of KB. Amplified by the data representation inside the program, the way intermediate data is held, and differences in the detailed processing, this difference in input size is believed to produce the difference of several tens of MB in memory usage between processes #1 and #2.

As for #3-Janome and #3-MeCab, the implementations of the two libraries differ greatly: Janome is written purely in Python, while MeCab is written in C++. This difference in implementation language is thought to be reflected in the difference in memory usage.

3.4 Consideration of experimental results

In this post, we experimented on the data preprocessing of chABSA-dataset as one example of data preprocessing in natural language processing. The results cannot be taken as a general rule for all natural language processing, but as one measured case they support the following way of thinking.

Premises:

- The raw data totals on the order of GB, but is split into files on the order of MB.
- No parallel processing is performed.

Observations:

- The bottleneck resource is the CPU, and the bottleneck process is the analysis of the raw data.
- If memory usage during preprocessing is likely to fit in the MB order and the constraints on processing time are loose, an expensive server with a large amount of memory is not needed.
- If time constraints are tight, a speedup can be expected by parallelizing within the range where server memory does not become a new bottleneck (see the sketch after this list).
- If the word-separation process is implemented with Janome, it can easily be rewritten in MeCab, which can be expected to speed it up and reduce memory usage, because Janome is a Python implementation while MeCab is a C++ implementation.
- However, as in the experiment in this post, note that when word separation accounts for only a small fraction of the total preprocessing time, the overall effect of speeding it up is small. In our experiment, the extracted data to be word-separated (the performance chapters) was only 3.5 MB against 7.9 GB of raw data (securities reports), so raw-data analysis accounted for most of the processing time. Conversely, when the gap between the raw data and the data to be word-separated is small, word separation takes a larger share of the execution time and speeding it up has a larger effect.
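As an illustration of the parallelization point above, per-company preprocessing could be fanned out with the standard library's multiprocessing module. This is a minimal sketch under the assumption of a hypothetical `preprocess_company()` function covering processes #1 to #3 (chABSA-dataset itself processes companies sequentially):

```python
# Parallelization sketch (assumption: preprocess_company() is a hypothetical
# per-company function covering processes #1 to #3; not chABSA-dataset code).
from multiprocessing import Pool

def preprocess_company(path):
    # XBRL analysis, sentence extraction, and word separation for one company
    ...

if __name__ == "__main__":
    company_files = []  # paths to the 2,260 per-company input files
    # Keep the worker count low enough that the total footprint
    # (workers x per-process memory) does not become a new bottleneck.
    with Pool(processes=4) as pool:
        results = pool.map(preprocess_company, company_files)
```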

Summary

- In this post, we presented performance verification results and their discussion for the data preprocessing in chABSA-dataset, an implementation example of sentiment polarity analysis.
- Most of the total execution time was spent analyzing the raw data. This is thought to be because the size of the data to be processed drops from the MB order to the KB order after the raw-data analysis. In addition, comparing the Janome and MeCab implementations of the word-separation process, MeCab had the smaller execution time and memory consumption.
- When the data size changes greatly before and after raw-data analysis, as in this case, the part handling the larger amount of data is likely to be the bottleneck. A speedup can be expected by parallelizing that part, as long as memory does not become a new bottleneck.
