Overview of natural language processing and its data preprocessing

Introduction

When building a machine learning system, design know-how is needed that accounts for the time and resources data preprocessing requires. This post gives an overview of data preprocessing for natural language and explains the preprocessing performed in chABSA-dataset, an implementation example of sentiment polarity analysis; a follow-up post will present performance verification results based on it.

Post list

  1. Overview of natural language processing and its data preprocessing ... This post
  2. Performance verification of data preprocessing in natural language processing

The table of contents of this post is as follows.

- 1. Natural language processing and its data preprocessing
  - 1.1 What is natural language processing?
  - 1.2 What is data preprocessing of natural language processing in machine learning systems?
- 2. A preprocessing example based on sentiment polarity analysis
  - 2.1 Use case selection for natural language processing
  - 2.2 Overview of the preprocessing covered
  - 2.3 Data volume estimation
  - 2.4 OSS selection for preprocessing
    - 2.4.1 About word-separation (word division)
    - 2.4.2 About vectorization
    - 2.4.3 Other steps
- Summary

1. Natural language processing and its data preprocessing

1.1 What is natural language processing?

Natural language refers to languages such as Japanese and English that have developed naturally and that humans use daily to communicate. Unlike artificial languages such as programming languages, natural languages contain ambiguities: the meaning or interpretation of a sentence is not uniquely determined.

Natural language processing refers to techniques that enable computers to handle huge amounts of text data written in natural language in a practical way, despite the ambiguity of words. Examples of applications include smart speakers, web search engines, machine translation, Japanese input systems, and sentiment polarity analysis.

1.2 What is data preprocessing of natural language processing in machine learning systems?

Image data (a set of pixel values) and time-series data that can be acquired from various sensors are data that can be expressed as numerical values. On the other hand, natural language is a set of words and cannot be treated as a numerical value as it is.

In order to handle natural language in machine learning, which is a statistical method for extracting rules from data, the natural language must be converted into numerical data in some way. This conversion is called vectorization, and the numerical representation obtained by vectorization is called a feature.

Preprocessing in natural language processing refers to the conversion (vectorization) from natural language text to numerical features, together with the processing performed beforehand, such as noise removal and decomposition into word strings. Table 1 shows the flow of preprocessing in the field of natural language processing.

Table 1 Preprocessing and data state transitions in the field of natural language processing

| Data state | Processing | Description |
| --- | --- | --- |
| Raw data | - | - |
| | Cleaning | Remove unnecessary non-text data, such as HTML tags, from the text data to be analyzed |
| Sentences | - | - |
| | Word-separation (word split) | Break sentences down into an array of words by part of speech |
| Word strings | - | - |
| | Normalization, stopword removal | Unify notational variants and remove words that are meaningless for the analysis |
| Word strings needed for analysis | - | - |
| | Vectorization | Convert word strings into numerical data |
| Feature vectors | - | - |
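The flow in Table 1 can be sketched end to end in plain Python. This is a minimal illustration, not the chABSA-dataset implementation: the helper names, the tiny stopword list, and the whitespace-style tokenizer (standing in for a real morphological analyzer, which Japanese text would require) are all invented for the example.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of"}  # illustrative stopword list, not a real one

def clean(raw: str) -> str:
    """Cleaning: strip HTML tags and other non-text noise."""
    return re.sub(r"<[^>]+>", "", raw)

def tokenize(sentence: str) -> list:
    """Word-separation: a regex split stands in for a morphological
    analyzer here; Japanese text would need e.g. Janome or MeCab."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def normalize(words: list) -> list:
    """Normalization / stopword removal: unify notation, drop noise words."""
    return [re.sub(r"\d+", "0", w) for w in words if w not in STOPWORDS]

def vectorize(words: list) -> dict:
    """Vectorization: a simple count-based (bag-of-words) feature vector."""
    return dict(Counter(words))

raw = "<p>Sales grew 12% over the previous year</p>"
features = vectorize(normalize(tokenize(clean(raw))))
```

Each function corresponds to one "Processing" row of Table 1, and its input and output correspond to the "Data state" rows before and after it.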

2. A preprocessing example based on sentiment polarity analysis

2.1 Use case selection for natural language processing

One use case of natural language processing is sentiment polarity analysis, which judges whether the content of a given text is positive or negative and uses the result as decision support. It is expected to be used in a wide range of fields, for example reputation analysis of in-house products on SNS in B2C business, or validity analysis of loans and investments based on corporate performance information in the financial business.

In this post, we focus on sentiment polarity analysis, which may be adopted across a wide range of industries, and walk through its preprocessing using chABSA-dataset, a Python implementation example of sentiment polarity analysis published on the Internet.

2.2 Overview of the preprocessing covered

In chABSA-dataset, securities report data for FY2016 (XBRL format [^1], 2,260 companies, approx. 7.9 GB) is treated as raw data. A supervised learning model using a support vector machine (SVM) is created to extract positive/negative sentiment polarity from the sentences describing business results in this data. The processing of chABSA-dataset can be roughly divided into three parts:

[^1]: XBRL-format data has a nested XML structure in which metadata is added around the HTML data of the securities report published by the company.

(1) Annotation processing that creates training data for a model that makes positive/negative judgments
(2) Model creation (training) processing based on the data created in (1)
(3) Sentiment polarity analysis processing using the model created in (2)

Data preprocessing for natural language is contained in process (1) and the first part of process (2) (i.e., before model creation). Table 2 shows the data preprocessing included in processes (1) and (2).

Table 2 Data preprocessing in chABSA-dataset

| Data state | Processing | Description |
| --- | --- | --- |
| Raw data (XBRL format) | - | - |
| | Cleaning | Extract the HTML data of the section describing business results in the securities report from the raw data in XBRL format, a kind of XML |
| Sentences (HTML format) | - | - |
| | Cleaning | Remove HTML tags |
| Sentences | - | - |
| | Word-separation | Break sentences down into word strings |
| Word strings | - | - |
| | Normalization, stopword removal | Replace numbers with 0; remove whitespace characters |
| Word strings needed for analysis | - | - |
| | Vectorization | Convert word strings into numerical data |
| Feature vectors | - | - |
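The normalization row of Table 2 (numbers replaced with 0, whitespace removed) is easy to express with the standard `re` module. This is a minimal sketch of that step, not the actual chABSA-dataset code; whether the real implementation replaces each digit or each run of digits may differ.

```python
import re

def normalize(text: str) -> str:
    """Normalization as in Table 2: digits -> 0, whitespace removed."""
    text = re.sub(r"\d+", "0", text)  # each run of digits becomes a single 0
    text = re.sub(r"\s+", "", text)   # remove whitespace characters
    return text

example = normalize("売上高は 1,234 百万円")  # -> "売上高は0,0百万円"
```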

2.3 Data volume estimation

In machine learning, the larger the amount of data, the more time and resources preprocessing and training require. The amount of raw data to be input should therefore be estimated in advance.

The content and amount of text in the securities reports (FY2016) that are the raw data of chABSA-dataset vary from company to company. The XBRL-format securities reports published on EDINET are separate data files per company, and each company's file was under 10 MB. chABSA-dataset handles data for 2,260 companies, about 7.9 GB in total.

Looking at the processing shown in Table 2, two things become apparent.

First, each company's data file can be processed individually and sequentially; there is no need to process the data for all companies at once. Second, the preprocessing results of one company's data file do not affect those of any other company's file, so the files can be processed independently without regard to order.

From these observations, it follows that a server with enough memory to load all the data at once is not necessary, and that processing can be sped up by applying parallel or distributed processing mechanisms as far as resources allow.

2.4 OSS selection for preprocessing

In preprocessing for natural language processing, the characteristic steps are word-separation (when the target language is Japanese) and vectorization. Below are examples of OSS selection for each of these steps.

2.4.1 About word-separation (word division)

Table 3 shows typical word-separation libraries (morphological analyzers). The libraries differ in internal implementation and development language, but their capabilities do not differ greatly.

Table 3 Main libraries for word-separation (morphological analyzer)

| # | Library name | Description |
| --- | --- | --- |
| 1 | MeCab | The mainstream Japanese morphological analyzer. A Japanese dictionary based on the IPA corpus is published alongside it. Runs fast thanks to its C++ implementation |
| 2 | Janome | A morphological analyzer implemented purely in Python with the dictionary bundled. Designed to be easy for Python programmers to use |
| 3 | CaboCha | An SVM-based analyzer. You must prepare a Japanese dictionary yourself and develop with attention to the rights of the dictionary data |

Janome is used in chABSA-dataset; data scientists often adopt Janome or similar word-separation libraries in preprocessing during model development for reasons such as ease of installation in the development environment.

If you receive a preprocessing program from a data scientist together with the model, you may need to replace the library, paying attention to the license and performance of its dictionary data.

In "Performance verification of data preprocessing in natural language processing", to be posted later, we will show what changes are needed and the performance difference when the word-separation written with Janome is replaced with MeCab.

2.4.2 About vectorization

How the feature vector is represented depends heavily on how the model is created. The code therefore basically reproduces whatever vectorization method the data scientist chose, and no particular library is recommended.

Most vectorization in recent years is based on the distributional hypothesis: the meaning of a word is formed by the words surrounding it. Approaches include the "count-based method", which builds vectors from occurrence frequencies, and the "inference-based method", which uses weight vectors to infer a word from its surrounding word sequence. chABSA-dataset uses the former; word2vec is a typical example of the latter.
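A count-based method in the sense of the distributional hypothesis can be illustrated with a small co-occurrence counter: a word is represented by how often other words appear within a window around it. This is a generic sketch of the technique, not the chABSA-dataset code; the function name, window size, and toy word sequence are invented for the example.

```python
from collections import Counter

def cooccurrence_vector(words, target, window=1):
    """Count-based vectorization: represent `target` by the frequency of
    words appearing within `window` positions of each of its occurrences."""
    counts = Counter()
    for i, w in enumerate(words):
        if w != target:
            continue
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[words[j]] += 1
    return dict(counts)

words = "i like nlp you like nlp".split()
vec = cooccurrence_vector(words, "like")  # -> {"i": 1, "nlp": 2, "you": 1}
```

An inference-based method such as word2vec instead learns a dense weight vector for each word by training it to predict (or be predicted from) its context words.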

2.4.3 Other steps

Most of the processing other than word-separation and vectorization, such as cleaning and normalization, consists of data-format-dependent parsing (e.g., of XML) and string replacement.

The functions required here are format-dependent parsers and the regular-expression support that programming languages provide as standard, so no particular library is recommended for these steps either.
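As an illustration of why no extra library is needed, the cleaning step of Table 2 (removing HTML tags) can be done with the standard library alone. This is a minimal sketch using `html.parser` plus a regular expression, not the chABSA-dataset implementation.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Cleaning: collect only text content, dropping HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # collapse whitespace left behind by removed tags
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

strip_tags("<p>当期の売上は<b>増加</b>した</p>")  # -> "当期の売上は増加した"
```

A proper HTML parser is preferable to a bare regex like `<[^>]+>` because it also handles entities and malformed markup more gracefully.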

Summary

- In this post, we outlined what natural language processing is and what kinds of preprocessing it involves.
- Focusing on sentiment polarity analysis within natural language processing, we explained a preprocessing example using chABSA-dataset as an implementation example.
