Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 2 Step 02 Memo "Preprocessing"

Contents

This is a memo for myself as I read "Introduction to Natural Language Processing Application Development in 15 Steps". This time, in Chapter 2 Step 02, I write down my own key points.

Preparation

- Personal Mac: macOS Mojave 10.14.6
- docker version: 19.03.2 (both Client and Server)

Chapter overview

I created a simple dialogue agent in the previous chapter, but it cannot treat similar sentences in the same way, and it picks up differences that should not matter (particles, uppercase vs. lowercase letters of the alphabet, and so on) as features. This step covers the following techniques and applies them to the dialogue agent.

02.1 What is preprocessing?

Preprocessing means properly formatting the text before it enters the text classification process.

# Cannot be treated as the same sentence (the notation differs)
Do you like python
Do you like Ｐｙｔｈｏｎ

# Particles and auxiliary verbs become features despite carrying little meaning
# label,sentence
0,I like you
1,I like ramen!
# ↓
# The sentence "I like ramen" may be judged as label 0,
# though semantically we want it judged as label 1

02.2 Normalization

The process of absorbing fluctuations in notation and unifying them into a single notation is called normalization of character strings. The goal is to get the same word-separated result and the same BoW even when the notation fluctuates. Rough normalization is performed by neologdn, and the normalization that neologdn lacks (lowercasing and Unicode normalization) is handled individually.

neologdn

There is a library called neologdn that bundles multiple normalization processes. It is the normalization used when generating data for NEologd, a MeCab dictionary. The advantages of neologdn are that it is easy to use, because the normalization processes are integrated into one function, and that it is fast, because it is implemented in C.

Example of use


import neologdn

print(neologdn.normalize(text))  # text: the string to normalize
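As a quick illustrative check of what neologdn absorbs (the input string here is my own example, not from the book):

import neologdn

# Full-width alphanumerics become half-width, and runs of the
# long vowel mark "ー" are collapsed to one
print(neologdn.normalize('Ｐｙｔｈｏｎがすごーーーい'))
# => 'Pythonがすごーい'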

Lowercase and uppercase

neologdn.normalize does not include lowercase/uppercase conversion of the alphabet. To absorb this notational fluctuation, use .lower() or .upper(), built-in methods of Python's str type, to unify the notation to lowercase or uppercase.

However, **for proper nouns the distinction between lowercase and uppercase letters may be important, so take appropriate measures**.
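A minimal illustration with a made-up sample string:

text = 'Do you like Python?'
print(text.lower())  # => 'do you like python?'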

Unicode normalization overview

Unicode is now widely used as the de facto standard character encoding. The precomposed character "デ" (a single code point) and the combination "デ" of the single character "テ" plus the combining voiced sound mark "゙" look like the same "デ", but as-is they are treated as different strings, so the resulting BoW naturally differs.

Detailed explanation of Unicode normalization

In Unicode, each character is represented by a **code point** (usually written in hexadecimal notation). Characters and code points can be converted to each other with Python's built-in functions ord() and chr(), respectively.

Unicode and code point examples


>>> hex(ord('あ'))
'0x3042'
>>> chr(0x3042)
'あ'

# By the way, decimal notation also works
>>> ord('あ')
12354
>>> chr(12354)
'あ'

Next, for the character "デ", check the code point of the single precomposed character and of the combined character string (base character plus combining character).

Checking the code points of "デ"


# Single (precomposed) character
>>> chr(0x30C7)
'デ'

# Combined character string
>>> chr(0x30C6)
'テ'
>>> chr(0x3099)
'゙'
>>> chr(0x30C6) + chr(0x3099)
'デ'

As seen above, there are multiple ways to express the same character. Unicode addresses this problem by **defining sets of code points that should be treated as the same character**. This is called Unicode equivalence, and there are the following two kinds.

- Canonical equivalence
    - Characters with the same appearance and function are regarded as equivalent
    - e.g. "デ" and "テ" + "゙"
- Compatibility equivalence
    - Characters that may differ in appearance or function but are based on the same character are regarded as equivalent
    - Includes canonical equivalence
    - e.g. "ﾃ" (half-width) and "テ" (full-width)

Unicode normalization decomposes and composes precomposed characters based on these equivalences, and there are the following four forms ("Canonical" refers to canonical equivalence, "Compatibility" to compatibility equivalence):

- NFD: Canonical Decomposition
- NFC: Canonical Decomposition, followed by Canonical Composition
- NFKD: Compatibility Decomposition
- NFKC: Compatibility Decomposition, followed by Canonical Composition

When actually performing Unicode normalization, you need to **decide which normalization form to use according to the problem your application handles and the nature of your data**.
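As a quick check with Python's standard unicodedata module (the sample strings are my own illustration):

import unicodedata

# NFC composes "テ" + combining "゙" into the single code point "デ"
print(unicodedata.normalize('NFC', 'テ\u3099'))  # => 'デ'

# NFKC additionally applies compatibility equivalence,
# e.g. half-width "ﾃﾞ" also becomes "デ"
print(unicodedata.normalize('NFKC', 'ﾃﾞ'))  # => 'デ'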

02.3 Lemmatization (headword conversion)

Correcting a word form that has changed due to conjugation back to the form listed as the dictionary headword is called lemmatization (headword conversion). Note that at this point "本を読んだ" and "本を読みました" still do not yield the same features; handling the stop words in the next section lets them be treated as the same.

本を読んだ      (I read a book)
本を読みました  (I read a book, polite)

↓ word separation + lemmatization

本 を 読む だ
本 を 読む ます た

Notes on implementation

It is similar to the normalization described above in that it absorbs notational fluctuation, but it is usually implemented together with the word-separation process, since it is the word-separated tokens that get corrected.

If you use node.feature obtained from MeCab's parseToNode, you can get the **base form from the 7th comma-separated field (features[6])**.

However, **for words whose base form is not registered (the field is '*'), use the surface form (node.surface) instead**.

**BOS/EOS** is a pseudo part of speech in MeCab's output that represents the beginning and end of a sentence, and it should not be included in the word-separation result.
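Putting the three points above together, a minimal sketch (assuming a MeCab IPA-style dictionary whose 7th feature field is the base form):

import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('本を読んだ')
while node:
    features = node.feature.split(',')
    if features[0] != 'BOS/EOS':
        # Use the base form if registered, otherwise fall back to the surface form
        token = features[6] if len(features) > 6 and features[6] != '*' else node.surface
        print(token)
    node = node.next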

02.4 Stop words

In the previous section's result, the two sentences agree up to "本 を 読む" but differ after that ("だ" vs. "ます た"), so their BoW still differ. These words do not significantly affect the meaning of the sentence, and including them in the vocabulary is also undesirable from the viewpoint of memory and storage efficiency.

Dictionary-based stopword removal

Prepare a list of words to exclude in advance, as shown below, and filter with an if statement. In some cases you can obtain a ready-made stop word list from the net, such as SlothLib.

stop_words = ['て', 'に', 'を', 'は', 'です', 'ます']

if token not in stop_words:
    result.append(token)

Part-of-speech-based stopword removal

Particles and auxiliary verbs are important parts of speech for composing sentences, but they are not necessary for expressing the meaning of a sentence (in the dialogue agent, for acquiring the features needed for class ID classification).

if features[0] not in ['助詞', '助動詞']:  # particle, auxiliary verb

02.5 Word replacement

As in the previous section, some tokens are important for the sentence to be well-formed, but "numbers and dates" often carry little weight in expressing the meaning of the sentence, so replace them with a specific string (a sketch follows the list below).

# Before conversion
I bought 1 egg
I bought 2 eggs
I bought 10 eggs

# After conversion
I bought SOMENUMBER eggs
I bought SOMENUMBER eggs
I bought SOMENUMBER eggs

- Although the information about the number of eggs is lost, the meaning "I bought eggs" remains, and the differences in count are unified.
- Put a half-width space before and after "SOMENUMBER" so that it is not joined to the surrounding characters during word separation.
- Even without the surrounding spaces you may happen to get the same result, but if "SOMENUMBER" merges with adjacent characters into a separate token, the number of dimensions increases by one unnecessarily, so avoid it.
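A minimal sketch of this replacement using a simple regex over digit runs (the function name replace_numbers is my own, not from the book):

import re

def replace_numbers(text):
    # Replace each run of digits with ' SOMENUMBER '; the surrounding spaces
    # keep the token from joining adjacent characters during word separation
    return re.sub(r'[0-9]+', ' SOMENUMBER ', text)

print(replace_numbers('卵を10個買った'))
# => '卵を SOMENUMBER 個買った'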

02.6 Application to dialogue agent

As mentioned at the beginning, apply the techniques learned in this chapter to the dialogue agent.

# Improved _tokenize() (uses: import unicodedata, neologdn, MeCab)
    def _tokenize(self, text):
        text = unicodedata.normalize('NFKC', text)  # Unicode normalization
        text = neologdn.normalize(text)  # normalization with neologdn
        text = text.lower()  # lowercase the alphabet

        node = self.tagger.parseToNode(text)
        result = []
        while node:
            features = node.feature.split(',')

            if features[0] != 'BOS/EOS':
                if features[0] not in ['助詞', '助動詞']:  # stop word removal by part of speech
                    token = features[6] \
                            if features[6] != '*' \
                            else node.surface  # lemmatization
                    result.append(token)

            node = node.next

        return result

Execution result


# Fixed the imported module name in evaluate_dialogue_agent.py
from dialogue_agent import DialogueAgent
↓
from dialogue_agent_with_preprocessing import DialogueAgent

$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.43617021

- Baseline implementation (Step 01): 37.2%
- With preprocessing added (Step 02): 43.6%
