Overview of natural language processing and its data preprocessing

Introduction

When building a machine learning system, design know-how is needed that accounts for the time and resources data preprocessing requires. This post gives an overview of data preprocessing for natural language and explains the preprocessing performed in chABSA-dataset, an implementation example of sentiment polarity analysis; a follow-up post will present performance verification results based on it.

Post list

  1. Overview of natural language processing and its data preprocessing ... This post
  2. Performance verification of data preprocessing in natural language processing

The table of contents of this post is as follows.

- 1. Natural language processing and its data preprocessing
  - 1.1 What is natural language processing?
  - 1.2 What is data preprocessing of natural language processing in machine learning systems?
- 2. A preprocessing example based on sentiment polarity analysis
  - 2.1 Use case selection for natural language processing
  - 2.2 Overview of the preprocessing covered
  - 2.3 Data volume estimation
  - 2.4 OSS selection for preprocessing
    - 2.4.1 About word-separation (word division)
    - 2.4.2 About vectorization
    - 2.4.3 Other steps
- Summary

1. Natural language processing and its data preprocessing

1.1 What is natural language processing?

Natural language refers to languages such as Japanese and English that have developed naturally and that humans use daily to communicate. Unlike artificial languages such as programming languages, natural languages contain ambiguities: the meaning or interpretation of a sentence is not uniquely determined.

Natural language processing refers to techniques that enable computers to handle huge amounts of text data written in natural language in a practical way, despite the ambiguity of words. Examples of applications include smart speakers, web search engines, machine translation, Japanese input systems, and sentiment polarity analysis.

1.2 What is data preprocessing of natural language processing in machine learning systems?

Image data (a set of pixel values) and time-series data that can be acquired from various sensors are data that can be expressed as numerical values. On the other hand, natural language is a set of words and cannot be treated as a numerical value as it is.

In order to handle natural language in machine learning, which is a statistical method for extracting rules from data, the natural language must be converted into numerical data in some way. This conversion is called vectorization, and the numerical representation obtained by vectorization is called a feature.

Preprocessing in natural language processing refers to the conversion (vectorization) from natural language text to numerical features, together with the processing performed beforehand, such as noise removal and decomposition into word strings. Table 1 shows the flow of preprocessing in the field of natural language processing.

Table 1 Preprocessing and data state transitions in the field of natural language processing

| Data state | Processing | Description |
| --- | --- | --- |
| Raw data | - | - |
| | Cleaning | Remove unnecessary non-text data, such as HTML tags, from the text data to be analyzed |
| Sentences | - | - |
| | Word-separation (word split) | Break sentences down into an array of words by part of speech |
| Word strings | - | - |
| | Normalization, stopword removal | Unify notational variants and remove words that are meaningless for the analysis |
| Word strings needed for analysis | - | - |
| | Vectorization | Convert word strings into numerical data |
| Feature vectors | - | - |
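The flow in Table 1 can be sketched end to end in plain Python. This is a minimal illustration, not the chABSA-dataset implementation: the helper names, the tiny stopword list, and the whitespace-style tokenizer (standing in for a real morphological analyzer, which Japanese text would require) are all invented for the example.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of"}  # illustrative stopword list, not a real one

def clean(raw: str) -> str:
    """Cleaning: strip HTML tags and other non-text noise."""
    return re.sub(r"<[^>]+>", "", raw)

def tokenize(sentence: str) -> list:
    """Word-separation: a regex split stands in for a morphological
    analyzer here; Japanese text would need e.g. Janome or MeCab."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def normalize(words: list) -> list:
    """Normalization / stopword removal: unify notation, drop noise words."""
    return [re.sub(r"\d+", "0", w) for w in words if w not in STOPWORDS]

def vectorize(words: list) -> dict:
    """Vectorization: a simple count-based (bag-of-words) feature vector."""
    return dict(Counter(words))

raw = "<p>Sales grew 12% over the previous year</p>"
features = vectorize(normalize(tokenize(clean(raw))))
```

Each function corresponds to one "Processing" row of Table 1, and its input and output correspond to the "Data state" rows before and after it.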

2. A preprocessing example based on sentiment polarity analysis

2.1 Use case selection for natural language processing

One use case of natural language processing is sentiment polarity analysis, which judges whether the content of a given text is positive or negative and uses the result as decision support. It is expected to be used in a wide range of fields, for example reputation analysis of in-house products on SNS in B2C business, or validity analysis of loans and investments based on corporate performance information in the financial business.

In this post, we focus on sentiment polarity analysis, which may be adopted across a wide range of industries, and walk through its preprocessing using chABSA-dataset, a Python implementation example of sentiment polarity analysis published on the Internet.

2.2 Overview of the preprocessing covered

In chABSA-dataset, securities report data for FY2016 (XBRL format [^1], 2,260 companies, approx. 7.9 GB) is treated as raw data. A supervised learning model using a support vector machine (SVM) is created to extract positive/negative sentiment polarity from the sentences describing business results in this data. The processing of chABSA-dataset can be roughly divided into three parts:

[^1]: XBRL-format data has a nested XML structure in which metadata is added around the HTML data of the securities report published by the company.

(1) Annotation processing that creates training data for a model that makes positive/negative judgments
(2) Model creation (training) processing based on the data created in (1)
(3) Sentiment polarity analysis processing using the model created in (2)

Data preprocessing for natural language is contained in process (1) and the first part of process (2) (i.e., before model creation). Table 2 shows the data preprocessing included in processes (1) and (2).

Table 2 Data preprocessing in chABSA-dataset

| Data state | Processing | Description |
| --- | --- | --- |
| Raw data (XBRL format) | - | - |
| | Cleaning | Extract the HTML data of the section describing business results in the securities report from the raw data in XBRL format, a kind of XML |
| Sentences (HTML format) | - | - |
| | Cleaning | Remove HTML tags |
| Sentences | - | - |
| | Word-separation | Break sentences down into word strings |
| Word strings | - | - |
| | Normalization, stopword removal | Replace numbers with 0; remove whitespace characters |
| Word strings needed for analysis | - | - |
| | Vectorization | Convert word strings into numerical data |
| Feature vectors | - | - |
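The normalization row of Table 2 (numbers replaced with 0, whitespace removed) is easy to express with the standard `re` module. This is a minimal sketch of that step, not the actual chABSA-dataset code; whether the real implementation replaces each digit or each run of digits may differ.

```python
import re

def normalize(text: str) -> str:
    """Normalization as in Table 2: digits -> 0, whitespace removed."""
    text = re.sub(r"\d+", "0", text)  # each run of digits becomes a single 0
    text = re.sub(r"\s+", "", text)   # remove whitespace characters
    return text

example = normalize("売上高は 1,234 百万円")  # -> "売上高は0,0百万円"
```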

2.3 Data volume estimation

In machine learning, the larger the amount of data, the more time and resources preprocessing and training require. The amount of raw data to be input should therefore be estimated in advance.

The content and amount of text in the securities reports (FY2016) that are the raw data of chABSA-dataset vary from company to company. The XBRL-format securities reports published on EDINET are separate data files per company, and each company's file was under 10 MB. chABSA-dataset handles data for 2,260 companies, about 7.9 GB in total.

Looking at the processing shown in Table 2, two things become apparent.

First, each company's data file can be processed individually and sequentially; there is no need to process the data for all companies at once. Second, the preprocessing results of one company's data file do not affect those of any other company's file, so the files can be processed independently without regard to order.

From these observations, it follows that a server with enough memory to load all the data at once is not necessary, and that processing can be sped up by applying parallel or distributed processing mechanisms as far as resources allow.

2.4 OSS selection for preprocessing

In preprocessing for natural language processing, the characteristic steps are word-separation (when the target language is Japanese) and vectorization. Below are examples of OSS selection for each of these steps.

2.4.1 About word-separation (word division)

Table 3 shows typical word-separation libraries (morphological analyzers). The libraries differ in internal implementation and development language, but their capabilities do not differ greatly.

Table 3 Main libraries for word-separation (morphological analyzer)

| # | Library name | Description |
| --- | --- | --- |
| 1 | MeCab | The mainstream Japanese morphological analyzer. A Japanese dictionary based on the IPA corpus is published alongside it. Runs fast thanks to its C++ implementation |
| 2 | Janome | A morphological analyzer implemented purely in Python with the dictionary bundled. Designed to be easy for Python programmers to use |
| 3 | CaboCha | An SVM-based analyzer. You must prepare a Japanese dictionary yourself and develop with attention to the rights of the dictionary data |

Janome is used in chABSA-dataset; data scientists often adopt Janome or similar word-separation libraries in preprocessing during model development for reasons such as ease of installation in the development environment.

If you receive a preprocessing program from a data scientist together with the model, you may need to replace the library, paying attention to the license and performance of its dictionary data.

In "Performance verification of data preprocessing in natural language processing", to be posted later, we will show what changes are needed and the performance difference when the word-separation written with Janome is replaced with MeCab.

2.4.2 About vectorization

How the feature vector is represented depends heavily on how the model is created. The code therefore basically reproduces whatever vectorization method the data scientist chose, and no particular library is recommended.

Most vectorization in recent years is based on the distributional hypothesis: the meaning of a word is formed by the words surrounding it. Approaches include the "count-based method", which builds vectors from occurrence frequencies, and the "inference-based method", which uses weight vectors to infer a word from its surrounding word sequence. chABSA-dataset uses the former; word2vec is a typical example of the latter.
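A count-based method in the sense of the distributional hypothesis can be illustrated with a small co-occurrence counter: a word is represented by how often other words appear within a window around it. This is a generic sketch of the technique, not the chABSA-dataset code; the function name, window size, and toy word sequence are invented for the example.

```python
from collections import Counter

def cooccurrence_vector(words, target, window=1):
    """Count-based vectorization: represent `target` by the frequency of
    words appearing within `window` positions of each of its occurrences."""
    counts = Counter()
    for i, w in enumerate(words):
        if w != target:
            continue
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[words[j]] += 1
    return dict(counts)

words = "i like nlp you like nlp".split()
vec = cooccurrence_vector(words, "like")  # -> {"i": 1, "nlp": 2, "you": 1}
```

An inference-based method such as word2vec instead learns a dense weight vector for each word by training it to predict (or be predicted from) its context words.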

2.4.3 Other steps

Most of the processing other than word-separation and vectorization, such as cleaning and normalization, consists of data-format-dependent parsing (e.g., of XML) and string replacement.

The functions required here are format-dependent parsers and the regular-expression support that programming languages provide as standard, so no particular library is recommended for these steps either.
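As an illustration of why no extra library is needed, the cleaning step of Table 2 (removing HTML tags) can be done with the standard library alone. This is a minimal sketch using `html.parser` plus a regular expression, not the chABSA-dataset implementation.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Cleaning: collect only text content, dropping HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # collapse whitespace left behind by removed tags
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

strip_tags("<p>当期の売上は<b>増加</b>した</p>")  # -> "当期の売上は増加した"
```

A proper HTML parser is preferable to a bare regex like `<[^>]+>` because it also handles entities and malformed markup more gracefully.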

Summary

- In this post, we outlined what natural language processing is and what kinds of preprocessing it involves.
- Focusing on sentiment polarity analysis within natural language processing, we explained a preprocessing example using chABSA-dataset as an implementation example.
