[PYTHON] You become an engineer in 100 days ――Day 66 ――Programming ――About natural language processing

Previous articles in this series:

You will become an engineer in 100 days --Day 63 --Programming --Probability 1

You will become an engineer in 100 days-Day 59-Programming-Algorithms

You will become an engineer in 100 days --- Day 53 --Git --About Git

You will become an engineer in 100 days --Day 42 --Cloud --About cloud services

You will become an engineer in 100 days --Day 36 --Database --About the database

You will be an engineer in 100 days --Day 24 --Python --Basics of Python language 1

You will become an engineer in 100 days --Day 18 --Javascript --JavaScript basics 1

You will become an engineer in 100 days --Day 14 --CSS --CSS Basics 1

You will become an engineer in 100 days --Day 6 --HTML --HTML basics 1

Starting with this article, the topic is natural language processing.

What is natural language processing?

Languages that arose naturally through human use, such as English and Japanese, are called natural languages.

In contrast, rule-based artificial languages such as programming languages are called formal languages.

Natural language processing refers to the family of technologies that let a computer process the natural language humans use every day.

Many technologies are included in natural language processing.

Major natural language processing technologies

The main techniques in natural language processing are organized as follows.

| Name | Contents |
|---|---|
| Morphological analysis | Divides text into morphemes and identifies the part of speech of each morpheme |
| Parsing | Takes the morphemes and clarifies the syntactic relationships between them, typically by diagramming them |
| Semantic analysis | Interprets the meaning of a sentence using a concept dictionary, etc. |
| Context analysis | Examines how multiple sentences connect to each other |

When processing Japanese with a computer, morphological analysis is the basic technology. Because language changes day by day, it is difficult for computers to handle.

This is because humans do not process linguistic information exhaustively. They pick a plausible interpretation out of many candidates, and that sense of plausibility is hard to implement on a computer.

Semantic analysis and the stages beyond it are quite difficult, and further research is awaited.

About morphological analysis

Morphological analysis separates sentences into the smallest meaningful units of words, called morphemes, and identifies the part of speech of each morpheme.

** Word-separated writing **

This is a writing style that puts a space between words, as English does. For example, a Japanese sentence written this way in romanized form: watashi ga hentai desu

** English morphological analysis **

Morphological analysis is very easy in languages like English, where words are separated by spaces. The procedure for English morphological analysis is summarized below.

1. Lowercase the entire sentence so that the same word is not treated differently depending on its position in the sentence

2. Split contractions such as it's and don't (it's → it 's, don't → do n't)

3. Separate the period at the end of the sentence from the preceding word (do not separate periods that are unrelated to the end of a sentence, such as the one in Mr.)

4. Divide by spaces
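The four steps above can be sketched roughly as follows. This is only a minimal illustration, not a complete tokenizer: the function name `tokenize_english` and the small abbreviation list are assumptions made for this example, and the split on spaces is done before the period check so that each word can be inspected on its own.

```python
import re

# Hypothetical, deliberately incomplete list of abbreviations whose period is kept.
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def tokenize_english(sentence):
    # Step 1: lowercase the whole sentence.
    text = sentence.lower()
    # Step 2: split contractions (don't -> do n't, it's -> it 's).
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    text = re.sub(r"(\w)'(s|re|ve|ll|d|m)\b", r"\1 '\2", text)
    # Step 4 (done early): divide by spaces.
    tokens = []
    for token in text.split():
        # Step 3: detach a trailing period unless the token is a known abbreviation.
        if len(token) > 1 and token.endswith(".") and token not in ABBREVIATIONS:
            tokens.append(token[:-1])
            tokens.append(".")
        else:
            tokens.append(token)
    return tokens


print(tokenize_english("Mr. Smith doesn't know it's here."))
# ['mr.', 'smith', 'does', "n't", 'know', 'it', "'s", 'here', '.']
```

Real text has many more cases (other abbreviations, quotes, commas), so a library tokenizer is usually the better choice in practice.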

** Japanese morphological analysis **

Unlike English, Japanese rarely uses spaces, so the boundaries between words are not visible. Words therefore have to be split by rules that rely on a dedicated dictionary.

If you implement morphological analysis yourself, you need to define and implement these splitting rules yourself.

Several libraries have been developed for Japanese morphological analysis, and it is common to use one of them.

A typical library is called MeCab.

https://ja.wikipedia.org/wiki/MeCab

There is also a library called janome in the Python language.

https://mocobeta.github.io/janome/

If implemented using such a library, morphological analysis can be performed relatively easily.
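As a rough idea of what this looks like with janome, here is a small sketch. It assumes janome is installed (`pip install janome`); the wrapper name `tokenize_with_janome` is made up for this example, while `Tokenizer`, `tokenize`, `surface` and `part_of_speech` are the janome names as I understand them from its documentation.

```python
def tokenize_with_janome(text):
    """Return (surface form, part of speech) pairs for a Japanese sentence."""
    # Imported here so the function can be defined even where janome is not installed.
    from janome.tokenizer import Tokenizer

    tokenizer = Tokenizer()
    return [(token.surface, token.part_of_speech) for token in tokenizer.tokenize(text)]


# Example call (requires janome):
# for surface, pos in tokenize_with_janome("今日はいい天気"):
#     print(surface, pos)
```

The exact part-of-speech strings depend on the dictionary bundled with the library, so no expected output is shown here.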

How these libraries work internally is explained in the article "Peek behind the scenes of Japanese morphological analysis! How MeCab performs morphological analysis".

The basic idea is to build a lattice and select the best path.

A lattice is a graph of all the candidate ways a sentence could be split into words.

I think the following is an easy-to-understand example, so I will refer to it.

Reference: https://techlife.cookpad.com/entry/2016/05/11/170000

The reference above illustrates such a lattice. The optimal path through it is selected based on cost.

The cost depends on the dictionary used for morphological analysis.

Morphological analysis commonly uses the NAIST dictionary, which lists precomputed occurrence costs and connection costs. It seems that the cost values for the text being analyzed are calculated from these.

It seems that the path with the lowest total cost becomes the result of morphological analysis.
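To make the lattice idea concrete, here is a toy sketch of choosing the lowest-cost path. Everything in it is an assumption for illustration: the tiny dictionary, the part-of-speech tags, and all cost values are made up, and real analyzers such as MeCab use far larger dictionaries and their own cost tables.

```python
# Toy dictionary: surface form -> (part-of-speech tag, occurrence cost). Values are made up.
DICTIONARY = {
    "東": ("noun", 7),
    "京": ("noun", 8),
    "東京": ("noun", 3),
    "京都": ("noun", 3),
    "都": ("suffix", 4),
    "に": ("particle", 1),
    "住む": ("verb", 3),
}

# Toy connection costs between the tag of one word and the tag of the next. Values are made up.
CONNECTION_COST = {
    ("BOS", "noun"): 1,
    ("noun", "noun"): 5,
    ("noun", "suffix"): 1,
    ("noun", "particle"): 1,
    ("suffix", "particle"): 1,
    ("particle", "verb"): 1,
}
DEFAULT_CONNECTION_COST = 3


def best_segmentation(text):
    """Return (total cost, words) for the cheapest path, or None if no path exists."""
    # best[i][tag] = cheapest (cost, words) covering text[:i] whose last word has this tag.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, [])
    for start in range(len(text)):
        for prev_tag, (prev_cost, prev_words) in best[start].items():
            for end in range(start + 1, len(text) + 1):
                word = text[start:end]
                if word not in DICTIONARY:
                    continue  # only dictionary words become lattice nodes
                tag, word_cost = DICTIONARY[word]
                link_cost = CONNECTION_COST.get((prev_tag, tag), DEFAULT_CONNECTION_COST)
                cost = prev_cost + link_cost + word_cost
                if tag not in best[end] or cost < best[end][tag][0]:
                    best[end][tag] = (cost, prev_words + [word])
    if not best[len(text)]:
        return None
    return min(best[len(text)].values(), key=lambda entry: entry[0])


print(best_segmentation("東京都に住む"))  # (15, ['東京', '都', 'に', '住む'])
print(best_segmentation("大阪"))  # None, because no word in the toy dictionary matches
```

With these made-up costs, splitting into 東京 / 都 beats 東 / 京都 because the noun-to-noun connection is expensive. The second call shows what happens when a word is missing from the dictionary.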

Of course, a proper noun or other word that does not exist in the dictionary will be split into ordinary words. Maintaining the dictionary is indispensable for correct morphological analysis.

Newly coined words are sometimes called unknown words. In morphological analysis work, handling such unknown words and maintaining dictionaries takes up most of the development man-hours.

Companies that work on natural language processing register large numbers of words on their own and build databases to handle unknown words.

About parsing

Parsing (syntax analysis) is also called dependency analysis and is one kind of natural language processing technology. After the sentence is divided into morphemes, the modification relationships between words are analyzed.

There is a famous library called CaboCha.

https://taku910.github.io/cabocha/

It is not well suited to parsing very long sentences, so it is better to work with short sentences.

The result of the analysis looks like this.

Ichiro filled the holes made by Jiro with potatoes purchased in Hokkaido.

Ichiro-------------D
    Jiro-D         |
  Had made-D       |
 In the hole-------D
   In Hokkaido-D   |
       Purchased-D |
          Potatoes-D
             Stuffed

Dependency analysis is a technology that can be used when you want to analyze the meaning of a sentence. I think it can be used to analyze the grammatical structure and clarify the meaning of sentences.
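One simple way to hold a dependency result like the one above in a program is a list of phrases plus, for each phrase, the index of the phrase it modifies. The sketch below does not call CaboCha. The data is transcribed by hand from the example, and the names `CHUNKS`, `HEADS`, `dependency_pairs` and `modifiers_of` are made up for illustration.

```python
# Phrases from the example above, in sentence order.
CHUNKS = ["Ichiro", "Jiro", "Had made", "In the hole",
          "In Hokkaido", "Purchased", "Potatoes", "Stuffed"]
# HEADS[i] is the index of the phrase that CHUNKS[i] modifies (-1 for the final phrase).
HEADS = [7, 2, 3, 7, 5, 6, 7, -1]


def dependency_pairs(chunks, heads):
    """Return (modifier, modified) pairs."""
    return [(chunks[i], chunks[h]) for i, h in enumerate(heads) if h >= 0]


def modifiers_of(target, chunks, heads):
    """Return the phrases that directly modify the target phrase."""
    return [chunks[i] for i, h in enumerate(heads) if h >= 0 and chunks[h] == target]


for modifier, modified in dependency_pairs(CHUNKS, HEADS):
    print(modifier, "->", modified)

print(modifiers_of("Stuffed", CHUNKS, HEADS))  # ['Ichiro', 'In the hole', 'Potatoes']
```

Looking up what modifies the final verb gives a rough "who did what to what" summary of the sentence.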

Words that often appear in natural language processing

Regular expressions

A regular expression is a notation that describes a set of character strings with a single pattern. It is often used when processing large amounts of text according to fixed rules.

For details, see: You will become an engineer in 100 days --Day 46 --Programming --Regular expressions
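As a small illustration, the sketch below uses Python's standard `re` module to pull every date-like string out of a sentence with one pattern. The sample text and the pattern are made up for this example.

```python
import re

# Made-up sample text containing two dates in YYYY-MM-DD form.
text = "Released 2020-05-11, updated 2020-06-01."

# One pattern describes every string of the form 4 digits, 2 digits, 2 digits.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

dates = DATE_PATTERN.findall(text)
print(dates)  # ['2020-05-11', '2020-06-01']
```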

N-Gram

A text segmentation method that divides an arbitrary character string or document into runs of n consecutive characters. When n is 1 it is called a uni-gram, when n is 2 a bi-gram, and when n is 3 a tri-gram.

Character-based, for the sentence 今日はいい天気 ("The weather is nice today"):

# unigram
'今', '日', 'は', 'い', 'い', '天', '気'

# bigram
'今日', '日は', 'はい', 'いい', 'い天', '天気'

# trigram
'今日は', '日はい', 'はいい', 'いい天', 'い天気'

Word-based n-grams are concatenations of n consecutive words produced by morphological analysis (今日 / は / いい / 天気):

# unigram
'今日', 'は', 'いい', '天気'

# bigram
'今日は', 'はいい', 'いい天気'

# trigram
'今日はいい', 'はいい天気'
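Both the character-based and word-based variants can be produced by one small function. This is a minimal sketch using the sentence 今日はいい天気 ("The weather is nice today"). The function name `ngrams` is made up for this example, and the word list is written by hand rather than produced by a morphological analyzer.

```python
def ngrams(items, n):
    """Join every run of n consecutive items (characters or words) into one string."""
    return ["".join(items[i:i + n]) for i in range(len(items) - n + 1)]


sentence = "今日はいい天気"
words = ["今日", "は", "いい", "天気"]  # hand-written segmentation for illustration

print(ngrams(sentence, 2))  # ['今日', '日は', 'はい', 'いい', 'い天', '天気']
print(ngrams(words, 2))     # ['今日は', 'はいい', 'いい天気']
```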

** word vector **

After a sentence is divided into words, each distinct word is assigned to a column of a table and the sentence is converted into a row of data. If the word appears in the sentence the value is 1, otherwise it is 0.

[1,0,0,0,0,0,1,1,1],
[1,0,0,0,0,0,1,1,0], ...
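A minimal sketch of building such 0/1 vectors is shown below. The sample documents are made up and already split into words, and the function names `build_vocabulary` and `to_binary_vector` are assumptions for this example.

```python
def build_vocabulary(documents):
    """Collect each distinct word once, in order of first appearance (one column per word)."""
    vocabulary = []
    for words in documents:
        for word in words:
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def to_binary_vector(words, vocabulary):
    """Return 1 for each vocabulary word present in the document, otherwise 0."""
    present = set(words)
    return [1 if word in present else 0 for word in vocabulary]


# Made-up documents, already divided into words.
documents = [
    ["today", "is", "nice", "weather"],
    ["today", "is", "rainy"],
]
vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(words, vocabulary) for words in documents]

print(vocabulary)  # ['today', 'is', 'nice', 'weather', 'rainy']
print(vectors)     # [[1, 1, 1, 1, 0], [1, 1, 0, 0, 1]]
```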

TF-IDF

TF-IDF is a kind of weight assigned to the words in a document and is used in fields such as information retrieval and text summarization. It is calculated from word vectors and is used to judge how rare a word is.
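As a rough sketch of the calculation, the code below uses one common textbook variant: term frequency is the word count divided by the document length, and inverse document frequency is log(number of documents / number of documents containing the word). Libraries often use slightly different smoothing, so treat the formula, the function name `tf_idf`, and the sample documents as assumptions for illustration.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = {}
    for words in documents:
        for word in set(words):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for words in documents:
        doc_weights = {}
        for word in set(words):
            tf = words.count(word) / len(words)
            idf = math.log(n_docs / doc_freq[word])
            doc_weights[word] = tf * idf
        weights.append(doc_weights)
    return weights


# Made-up documents, already divided into words.
documents = [
    ["today", "is", "nice", "weather"],
    ["today", "is", "rainy"],
]
weights = tf_idf(documents)
print(weights[0]["today"])  # 0.0, because "today" appears in every document
print(weights[1]["rainy"])  # positive, because "rainy" appears in only one document
```

Words that appear in every document get a weight of zero, while words unique to one document get the largest weights, which is the "rarity" the text refers to.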

Summary

Natural language processing is one of the most difficult research fields, but a field where research has not yet progressed far is, conversely, a field with many opportunities.

Research on Japanese is particularly difficult, and implementing the part that analyzes meaning is very hard, so you need to settle in and work on it patiently.

If you are interested, let's learn natural language processing.

34 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython
