Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 2, Step 03 memo: "Morphological Analysis and Word Separation"

Contents

This is a memo to myself as I read "Introduction to Natural Language Processing Application Development in 15 Steps". This post collects my own notes on Chapter 2, Step 03.

Preparation

- Personal Mac: macOS Mojave 10.14.6
- docker: version 19.03.2 (both client and server)

Chapter overview

Understand how MeCab works and try tuning it. Also, check morphological analyzers other than MeCab.

03.1 MeCab

Dictionary

Word separation in MeCab is dictionary-based. The information you get from morphological analysis therefore depends on what is registered in the dictionary, and that information differs from dictionary to dictionary.

- IPAdic
  - The dictionary officially recommended by MeCab
  - Based on data called the IPA Corpus
- UniDic
  - Based on data called UniDic
  - Splits text into small units, close to strict "morphological analysis"
- jumandic
  - A MeCab port of the dictionary used by JUMAN, a morphological analyzer distinct from MeCab
  - Based on data called the Kyoto Corpus
  - Carries meta information such as representative spellings
- ipadic-NEologd
  - Greatly expands the number of words, based on the IPA dictionary
  - The vocabulary is updated frequently by crawling words from the web, so it handles new words very well
  - Normalization is recommended as preprocessing
- unidic-NEologd
  - Like ipadic-NEologd, but a dictionary whose word extensions are based on UniDic

Install and run ipadic-NEologd

The difference between IPAdic and ipadic-NEologd shows up, for example, in the analysis of a relatively new word such as "Deep Learning":

- IPAdic: split into "Deep" and "Learning"
- ipadic-NEologd: treated as the single word "Deep Learning"

Behavior of morphological analysis of MeCab

Besides the morpheme information that appears in analysis results, the dictionary also holds the following:

- the occurrence cost of each word
- the left context ID of each word
- the right context ID of each word
- the connection cost for each combination of context IDs

The analysis result is the segmentation whose total of occurrence costs and connection costs is the smallest for the given sentence. (In the example below, the cost of splitting as "Higashi-Osaka / love" is the lowest, so that split becomes the analysis result.)

Example) A sentence that can be split either as "Higashi-Osaka / love" or as "Univ. of Tokyo / Osaka Univ. / like"


# When splitting as "Higashi-Osaka / love"
  connection cost between the beginning of the sentence and "Higashi-Osaka"
+ occurrence cost of "Higashi-Osaka"
+ connection cost between "Higashi-Osaka" and "love"
+ occurrence cost of "love"
+ connection cost between "love" and the end of the sentence

# When splitting as "Univ. of Tokyo / Osaka Univ. / like"
  connection cost between the beginning of the sentence and "Univ. of Tokyo"
+ occurrence cost of "Univ. of Tokyo"
+ connection cost between "Univ. of Tokyo" and "Osaka Univ."
+ occurrence cost of "Osaka Univ."
+ connection cost between "Osaka Univ." and "like"
+ occurrence cost of "like"
+ connection cost between "like" and the end of the sentence

03.2 MeCab Dictionary Modification

If an existing dictionary does not give the results you expect, you can tune the dictionary yourself:

- Adding new words
- Adjusting morphological analysis

Build MeCab dictionary

# Convert the source files' encoding to UTF-8
$ nkf --overwrite -Ew ./mecab-ipadic-2.7.0-20070801/*

# Build the dictionary
$ mkdir build
$ $(mecab-config --libexecdir)/mecab-dict-index -d ./mecab-ipadic-2.7.0-20070801 -o build -f utf8 -t utf8
$ cp mecab-ipadic-2.7.0-20070801/dicrc ./build/.  # copy dicrc

nkf is an abbreviation for "Network Kanji Filter".

Add new word

Create a CSV file in the source file directory:


# surface form, left context ID, right context ID, occurrence cost, part of speech, POS subcategory 1, POS subcategory 2, POS subcategory 3, inflection type, inflected form, base form, reading, pronunciation

# Entry for adding the word 自然言語処理 ("natural language processing")
自然言語処理,1288,1288,0,名詞,固有名詞,一般,*,*,*,自然言語処理,シゼンゲンゴショリ,シゼンゲンゴショリ
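As a sanity check, such an entry can be assembled programmatically. A minimal sketch of the 13-field layout, assuming the standard IPA-dictionary Japanese POS tags and a katakana reading for this word:

```python
import csv
import io

# The 13 fields of a mecab-ipadic user-dictionary entry, in order.
fields = [
    "自然言語処理",          # surface form
    "1288", "1288",          # left / right context IDs
    "0",                     # occurrence cost (lower = preferred)
    "名詞", "固有名詞", "一般", "*",  # POS and subcategories 1-3
    "*", "*",                # inflection type / inflected form
    "自然言語処理",          # base form
    "シゼンゲンゴショリ",    # reading
    "シゼンゲンゴショリ",    # pronunciation
]

buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(fields)
row = buf.getvalue().strip()
print(row)
```

Writing the row with the `csv` module (rather than string concatenation) would also quote any field that happened to contain a comma.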

Adjustment of morphological analysis

- Adjusting the connection cost
  1. Find the context IDs of the target words in the .csv files
  2. Modify the connection cost for that pair of context IDs in matrix.def
- Adjusting the occurrence cost
  1. Modify the occurrence cost of the target word in the .csv file

However, when modifying these costs, **be aware that the change may affect results beyond the part you intended**.

There are also methods that adjust costs automatically, but **manually adjusting only the costs of the part you want to fix seems to keep the range of influence small**.
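To see why the side-effect warning matters, consider a toy comparison in which lowering one word's occurrence cost flips the winning segmentation. Connection costs are ignored here and all numbers are invented:

```python
def seg_cost(words, occurrence):
    # Simplified scoring: sum of occurrence costs only (no connection costs).
    return sum(occurrence[w] for w in words)

occurrence = {"Higashi-Osaka": 3000, "love": 2500,
              "Tokyo-Univ": 2800, "Osaka-Univ": 2900, "like": 2000}
candidates = [["Higashi-Osaka", "love"],
              ["Tokyo-Univ", "Osaka-Univ", "like"]]

before = min(candidates, key=lambda ws: seg_cost(ws, occurrence))

# Aggressively lowering one word's cost (e.g. to force it to appear)...
occurrence["Tokyo-Univ"] = -3000
after = min(candidates, key=lambda ws: seg_cost(ws, occurrence))

# ...changes the chosen segmentation here, and would do the same in every
# other sentence containing that word: the unintended side effect.
print(before, after)
```

The same mechanism is why a small manual tweak to just the problematic word, rather than a large or automatic global adjustment, tends to keep the blast radius contained.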

03.3 Various morphological analyzers

Get an overview of morphological analyzers other than MeCab.

- MeCab
  - Dictionary-based
  - The dictionary holds word information, occurrence costs, and connection costs
  - Fast execution
  - The dictionary is an external file, so it can be customized as needed
- JUMAN++
  - A relatively new morphological analyzer that uses a neural network
  - Considers not only grammatical correctness but also the meaning of words
  - Considers the information of all words preceding a word
  - Handles spelling variation
  - Many advantages over MeCab, but slower execution
- KyTea ("Cutie")
  - Uses an SVM to predict, from the surrounding characters, whether a word boundary falls between each pair of adjacent characters
  - A Python wrapper is provided by a third party
- Janome
  - Written in pure Python
  - Ships with the IPA dictionary built in and provides an API usable from Python
  - Slow execution
  - Limited dictionary options
- SudachiPy
  - Python binding of Sudachi, a morphological analyzer for Java
  - As of May 2019, an official release was still pending (has it been released by now?)
- Esanpy (Kuromoji)
  - Kuromoji is a morphological analyzer implemented in Java
  - Used from Python via Esanpy
  - Esanpy is a text-analysis library that uses Elasticsearch (a full-text search engine) internally
