When building machine learning systems, you need design know-how that accounts for the time and resources consumed by data preprocessing. In this post, we give an overview of data preprocessing for natural language and present performance verification results for the data preprocessing in chABSA-dataset, an implementation example of sentiment polarity analysis.
The table of contents of this post is as follows.
- 3. Example of resources and processing time required for preprocessing in natural language processing
  - 3.1 Verification environment
  - 3.2 Experiment contents
    - 3.2.1 Experimental flow
    - 3.2.2 Word segmentation library comparison
      - (1) Dependent libraries
      - (2) How the processing (functions) is called in code
      - (3) I/O data formats in code
  - 3.3 Experimental results
  - 3.4 Discussion of experimental results
- Summary
Here we execute the data preprocessing in chABSA-dataset (Table 2 of the first part) and measure the resources and processing time it requires. In particular, we change the library used for the word segmentation process from Janome to MeCab and compare resource usage and processing time between the two patterns.
Table 4 shows the specifications of the virtual machine (VM) used in this experiment.
Table 4 VM specifications used for verification
Item | Value |
---|---|
OS | Red Hat Enterprise Linux Server 7.7 (Maipo) |
CPU | Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.5GHz |
Number of CPU cores | 8 |
Hyper-threading | Off |
Memory capacity | 16 GB |
HDD capacity | 160 GB |
HDD sequential read | 166 MB/s |
HDD sequential write | 487 MB/s |
HDD random read (4K, QD32) | 68 MB/s |
HDD random write (4K, QD32) | 71 MB/s |
While GPU resources are often used for tasks such as image processing, the text-data preprocessing measured here does not require them, so no GPU is used in this experiment.
Table 5 shows the software versions used in the verification.
Table 5 Software versions used for verification
Software name | Version | Purpose |
---|---|---|
Python | 3.7.4 | Execution environment |
beautifulsoup4 | 4.9.0 | XML parsing |
lxml | 4.5.0 | XML parsing |
Janome | 0.3.10 | Word segmentation |
MeCab | 0.996.3 | Word segmentation |
We measure the amount of computer resources used and the processing time for each step of the chABSA-dataset data preprocessing shown in Table 2 of the [last post](https://qiita.com/mkakida/items/eba36f519b08dbda1d82#22-%E5%8F%96%E3%82%8A%E6%89%B1%E3%81%86%E5%89%8D%E5%87%A6%E7%90%86%E3%81%AE%E6%A6%82%E8%A6%81). Vectorization is excluded from the measurement because its processing depends heavily on the model implementation, so the results would be unlikely to be useful for other projects.
Table 6 lists again the processes measured in this experiment. Each process is executed independently and sequentially. Every process handles the data of 2,260 independent companies, but owing to the chABSA-dataset implementation, the companies are processed sequentially, not in parallel.
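The post does not reproduce the measurement code itself, but per-process execution time and resource usage can be captured along the following lines. This is a minimal sketch using the third-party `psutil` package; the function names passed to `measure` are hypothetical stand-ins for the steps in Table 6.

```python
import time

import psutil  # third-party package for CPU/memory statistics


def measure(label, func, *args):
    """Run func once and report its elapsed time, the system-wide average
    CPU utilization during the run, and this process's resident memory."""
    psutil.cpu_percent(interval=None)           # reset the CPU counter
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    cpu = psutil.cpu_percent(interval=None)     # average since the reset
    rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
    print(f"{label}: {elapsed:.1f} s, CPU {cpu:.0f}%, RSS {rss_mb:.0f} MB")
    return result


# Hypothetical usage for the steps in Table 6:
# measure("#1 performance information extraction", extract_performance, xbrl_dir)
```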
Table 6 Processes to be tested
# | Process name | Description |
---|---|---|
1 | Performance information extraction process | XBRL parsing. Parses the data in XBRL format (HTML-based) and extracts the performance data |
2 | Sentence extraction process | Removes the tags from the HTML data and extracts only the Japanese sentences (see the sketch below) |
3 | Analysis target data extraction process | Combines the word segmentation process and the normalization process. In the chABSA-dataset code both are called together inside the loop executed for each sentence to be analyzed, making it difficult to measure their resource usage separately, so the two are measured as one combined process |
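The chABSA-dataset implementation is not reproduced here, but the tag removal of process #2 can be pictured with a minimal sketch built on the beautifulsoup4 and lxml packages from Table 5; the function name `strip_tags` is hypothetical.

```python
from bs4 import BeautifulSoup


def strip_tags(html: str) -> str:
    """Remove all tags from the performance-section markup and keep only
    the text content, as a source of candidate sentences."""
    soup = BeautifulSoup(html, "lxml")  # parse with the lxml parser
    return soup.get_text(separator="\n", strip=True)
```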
In chABSA-dataset, the word segmentation part of the analysis target data extraction process is implemented with Janome. In this experiment, we rewrite that Janome implementation to use MeCab and compare the required processing time.
Rewriting the word segmentation process involves changes in the following three respects.
(1) Dependent libraries: Janome is provided as a Python package with its dictionary included, so it can be used simply by installing the package with pip. MeCab, in addition to installing its Python package with pip, requires installing the MeCab middleware (e.g. an rpm package) for the OS.
(2) How the processing (functions) is called in code: Although the import target and the name of the segmentation function differ, the calling convention itself is almost the same in both libraries.
(3) I/O data formats in code: In both Janome and MeCab the input is a string containing the sentence, but the output data types differ. Janome's segmentation function returns a list with one word per element, while MeCab's returns a single string in which the words are separated by spaces.
For example, code that takes the sentence string `sentence` as input and stores the segmentation result in `tokens` is rewritten from Code 1 to Code 2 below.
```python
## Code 1: Janome usage example
from janome.tokenizer import Tokenizer
# ...(omitted)...
# Create a tokenizer instance in wakati (word segmentation) mode
tokenizer = Tokenizer(wakati=True)
# Feed in the sentence string `sentence` and store the segmentation result in `tokens`
tokens = tokenizer.tokenize(sentence)
# `tokens` is a list with one word per element
# ...(omitted)...
```
```python
## Code 2: MeCab usage example
import MeCab
# ...(omitted)...
# Create a MeCab instance that outputs wakati (space-separated) text
mecab = MeCab.Tagger("-Owakati")
# Feed in the sentence string `sentence` and store the segmentation result in `tokens`
tokens = mecab.parse(sentence)
# `tokens` is a single string in which the words are separated by single-byte
# spaces (with a trailing newline); tokens.strip().split(" ") yields a list
# equivalent to the Janome output
# ...(omitted)...
```
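As a quick sanity check, the relationship between the two output formats can be confirmed with a small script like the one below. This is a sketch assuming both libraries are installed; the sample sentence is arbitrary, and the token lists can differ if the two libraries use different dictionaries.

```python
import MeCab
from janome.tokenizer import Tokenizer

# Arbitrary sample sentence (not taken from the dataset)
sentence = "当期の売上高は前期に比べて増加しました。"

# Janome returns a list with one word per element
janome_tokens = list(Tokenizer(wakati=True).tokenize(sentence))

# MeCab returns one space-separated string with a trailing newline,
# so strip it before splitting
mecab_tokens = MeCab.Tagger("-Owakati").parse(sentence).strip().split(" ")

print(janome_tokens)
print(mecab_tokens)
print(janome_tokens == mecab_tokens)  # usually True with matching dictionaries
```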
Figures 1 to 3 show the results of executing the chABSA-dataset data preprocessing shown in Table 2 of the first part. Figure 1 shows the execution time of each process, giving the total execution time and its breakdown across processes #1 to #3. Figure 2 shows the average CPU utilization during each process, and Figure 3 the average memory usage during each process, broken down by process. In each figure, process #3 is shown separately for the Janome implementation and the MeCab implementation.
Figure 1 Processing execution time
Figure 1 shows that, regardless of whether process #3 is implemented with Janome or MeCab, process #1 takes about 26 minutes while processes #2 and #3 combined take about 2 minutes at most; the execution times differ greatly between processes. In the time-consuming process #1, an XBRL-format data file of a few MB is read and parsed for each company, and the HTML and metadata of the performance data are extracted. Since the data extracted by process #1 amounts to at most several tens of KB per company, this difference in data volume is considered to explain the order-of-magnitude gap in processing time between process #1 and the others.
Furthermore, Figure 1 shows that in process #3, which performs word segmentation and normalization, the MeCab implementation of the segmentation is about 8 times faster than the Janome implementation. This is because Janome is implemented purely in Python, while MeCab is implemented in C++.
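The same tendency can be spot-checked with a micro-benchmark along these lines; the workload below is hypothetical, and absolute timings will vary with the environment and dictionaries.

```python
import time

import MeCab
from janome.tokenizer import Tokenizer

# Hypothetical workload standing in for the extracted sentences
sentences = ["当期の売上高は前期に比べて増加しました。"] * 1000

tokenizer = Tokenizer(wakati=True)
start = time.perf_counter()
for s in sentences:
    list(tokenizer.tokenize(s))
print(f"Janome: {time.perf_counter() - start:.2f} s")

tagger = MeCab.Tagger("-Owakati")
start = time.perf_counter()
for s in sentences:
    tagger.parse(s).strip().split(" ")
print(f"MeCab:  {time.perf_counter() - start:.2f} s")
```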
Figure 2 Average CPU usage
Figure 2 shows that the average CPU utilization of processes #1, #2, and #3-Janome is around 90% across the board, while that of #3-MeCab is around 70%. This difference reflects the fraction of each process's running time that is spent reading data.
For example, comparing #3-Janome and #3-MeCab, Figure 1 shows that #3-MeCab finishes in about 5 seconds overall while #3-Janome takes about 40 seconds. In both cases, besides reading the same input data from files, the dictionary (corpus) file for word segmentation is read internally. The CPU waits while these files are read, and since #3-MeCab's total time is much shorter, this I/O wait occupies a larger share of it, which is likely why its average CPU utilization is lower. Similarly, the CPU utilization of process #2 is lower than that of process #1 because data-read time accounts for a larger fraction of its processing time.
Figure 3 Average memory usage
First, Figure 3 shows that memory usage varies widely between processes. This is thought to be due to the type and amount of data handled, as well as the type and number of libraries loaded when each process runs. Process #1, which extracts performance information from raw data in XBRL format (a type of XML), and process #2, which removes tags from the performance (XML + HTML) data, use similar libraries, but the data sizes they handle differ: the raw securities-report data input to process #1 is several MB per company, while the input to process #2 is only the performance-information part, several to several tens of KB. This difference in input size, amplified by the in-program data representation, the way intermediate data is held, and differences in the detailed processing, is considered to produce the difference of several tens of MB in memory usage between processes #1 and #2. As for processes #3-Janome and #3-MeCab, the implementations of the libraries differ greatly: Janome is written purely in Python, while MeCab is written in C++. This difference in internal implementation language is thought to be reflected in the difference in memory usage.
In this post, we experimented with the data preprocessing of chABSA-dataset as an example of data preprocessing in natural language processing. The results cannot be taken as a general rule for all natural language processing, but as a single case study they support the following way of thinking.
- Premises
  - The total amount of raw data is on the order of GB, but it is split into files on the order of MB
  - No parallel processing is executed
- Observations
  - The bottleneck resource is the CPU, and the bottleneck process is the parsing of the raw data.
  - If memory usage during preprocessing fits within the MB order and the constraints on processing time are loose, an expensive server with a large amount of memory is unnecessary.
  - If time constraints are tight, parallelization can be expected to provide a speedup, as long as server memory does not become a new bottleneck (see the sketch after this list).
  - If the word segmentation process is implemented with Janome, it can easily be rewritten in MeCab, which can be expected to reduce both execution time and memory usage; Janome is a Python implementation, while MeCab is a C++ implementation.
  - Note, however, that as in this post's experiment, the share of word segmentation in the total preprocessing time can be small, in which case parallelizing the processing has the larger speedup effect. Here, the extracted segmentation-target data (the performance chapters) amounted to only 3.5 MB against 7.9 GB of raw data (securities reports), so parsing the raw data accounted for most of the processing time. Conversely, when the gap between the raw data and the segmentation-target data is small, segmentation occupies a larger share of the execution time, and speeding it up has a correspondingly larger effect.
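Since each company's data is independent, such parallelization could look like the following sketch using the standard `multiprocessing` module; the function name, directory layout, and worker count are hypothetical, and the pool size should be chosen so that memory does not become the new bottleneck.

```python
from multiprocessing import Pool
from pathlib import Path


def preprocess_company(path):
    # Hypothetical stand-in for processes #1-#3 applied to one company's file
    ...


if __name__ == "__main__":
    # Hypothetical layout: one XBRL file per company
    files = sorted(Path("data/raw").glob("*.xbrl"))
    # Keep the pool small enough that (per-worker memory) x (worker count)
    # still fits in RAM
    with Pool(processes=4) as pool:
        results = pool.map(preprocess_company, files)
```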
- In this post, we presented performance verification results and their discussion for the data preprocessing in chABSA-dataset, an implementation example of sentiment polarity analysis.
- Most of the execution time was spent parsing the raw data. This is thought to be because the size of the data handled shrinks from the MB order to the KB order across the raw-data parsing step. Comparing the Janome and MeCab implementations of the word segmentation process, MeCab showed the shorter execution time and the lower memory consumption.
- When the data size changes drastically across the raw-data parsing step, as in this case, the part that handles the larger amount of data is highly likely to be the bottleneck. Parallelizing that processing can be expected to provide a speedup, as long as memory does not become a new bottleneck.