[PYTHON] Become an engineer in 100 days - Day 67 - Programming - About morphological analysis


Starting today, the topic is natural language processing.

What is morphological analysis?

Morphological analysis is a method that splits a sentence into its smallest units, called morphemes, and identifies the part of speech of each morpheme.

**Word-separated writing**

This is a writing style that puts a space between words, as English does. For example: Watashi / ga / hentai / desu.

**English morphological analysis**

Morphological analysis is very easy in languages like English, where words are separated by spaces. The procedure for English is summarized below.

1. Lowercase the entire sentence, so that the same word is not treated as different depending on its position in the sentence

2. Split contractions such as it's and don't (it's → it 's, don't → do n't)

3. Separate the period at the end of the sentence from the preceding word (do not separate periods that do not mark the end of a sentence, such as the one in "Mr.")

4. Split on spaces
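The four steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `tokenize_english` and the abbreviation list are assumptions made for this example, it splits on spaces first for convenience, and it treats every trailing period outside the abbreviation list as a sentence-ending period.

```python
# Abbreviations whose period must stay attached (assumed, far from complete)
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def tokenize_english(sentence):
    """Split an English sentence following the four steps above (rough sketch)."""
    tokens = []
    # Step 1: lowercase everything. Step 4: split on spaces.
    for word in sentence.lower().split():
        # Step 3: detach a trailing period unless the word is a known abbreviation
        if word.endswith(".") and word not in ABBREVIATIONS:
            core, period = word[:-1], "."
        else:
            core, period = word, ""
        # Step 2: split contractions such as don't -> do n't and it's -> it 's
        if core.endswith("n't"):
            parts = [core[:-3], "n't"]
        elif core.endswith("'s"):
            parts = [core[:-2], "'s"]
        else:
            parts = [core]
        tokens.extend(part for part in parts if part)
        if period:
            tokens.append(period)
    return tokens


print(tokenize_english("Mr. Smith doesn't know it's here."))
```

Real tokenizers handle many more cases (quotes, commas, other contractions), so treat this as a picture of the idea rather than a complete tokenizer.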

**Japanese morphological analysis**

Unlike English, Japanese is written without spaces between words, so the word boundaries are not visible. Segmentation therefore has to rely on rules built on a dedicated dictionary.

If you implement morphological analysis yourself, you have to define and implement these segmentation rules yourself.

Several libraries have been developed for Japanese morphological analysis, and it is common to use one of them.

A typical library is MeCab.

https://ja.wikipedia.org/wiki/MeCab

For Python, there is also a library called janome.

https://mocobeta.github.io/janome/

If implemented using such a library, morphological analysis can be performed relatively easily.

Today, we will perform morphological analysis using the janome library.

What is a morpheme?

A morpheme is, in linguistic terms, the smallest meaningful unit of expression: an element that loses its meaning if it is decomposed any further.

Example: how far should 行う (okonau, "to do") be divided?

行う is a verb, and dividing it any further no longer makes sense.

行う ⇒ verb 〇

行 ⇒ noun × う ⇒ interjection ×

Therefore, the decomposition has to stop at an appropriate point.

Japanese parts of speech

The Japanese parts of speech are as follows.

Problems in Japanese morphological analysis

Morphological analysis of Japanese has the following problems.

・ The word boundary problem
・ The part-of-speech problem
・ The unknown word problem
・ The loose grammar problem
・ The problem that the meaning changes depending on the presence or absence of modifiers

**The word boundary problem**

For example, the sentence "Uraniwa has a chicken" has several different grammatically correct readings, depending on where it is split.

・ There is / chicken / in the backyard /
・ There are / two / feathers / birds / in the backyard /
・ Back / on / crocodile / is / chicken / is /
・ Backyard / in / haniwa / tori / is /

"Sumomo mo momo mo momo no uchi" raises the same question of where to split. To get the right answer every time, you have to understand the background of the sentence and the intention of the writer.
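To see how quickly the ambiguity grows, here is a small sketch that lists every way to split an unspaced string into dictionary words. The romanized string, the toy dictionary, and the function name `all_segmentations` are assumptions made for this illustration; a real analyzer works on Japanese text with a far larger dictionary.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]  # one way to split an empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word and split the remainder recursively
            for rest in all_segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Toy dictionary in romanized form (assumed for this example)
toy_dictionary = {"uraniwa", "ni", "wa", "niwa", "niwatori", "tori", "ga", "iru"}

readings = all_segmentations("uraniwaniwaniwatorigairu", toy_dictionary)
for reading in readings:
    print(" / ".join(reading))
```

Even with eight dictionary entries, this short string has six possible splits. The dictionary alone cannot say which one the writer meant, which is why analyzers need some way to rank the candidates.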

**The part-of-speech problem**

For example, the English word "time" is a noun, but it can also be read as a verb. Depending on which reading you choose, the grammatical structure of the sentence and the meaning derived from it are completely different.

The part of speech of a word is closely tied to the structure of the sentence, so the two must be considered together.

**The unknown word problem**

Morphological analysis is usually done with a dictionary that contains the words of the language. Words in the text being analyzed that are not in that dictionary are called unknown words.

How unknown words are handled greatly affects the results of the subsequent analysis, so the dictionary should be updated regularly (especially for proper nouns: personal names, facility names, product names, gags, and so on).

**The loose grammar problem**

Text from conversations on SNS, in emails, and in chat apps is often far from any modeled Japanese grammar. To analyze such text, you need to think carefully about how to handle spelling variations in sentences and words, and how to normalize them beforehand.
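One common first step against spelling variation is Unicode normalization, which folds full-width letters and half-width katakana into a single form before analysis. The sketch below uses only Python's standard `unicodedata` module; the helper name `normalize_text` and the choice to also lowercase are assumptions made for this example, not a complete cleanup method.

```python
import unicodedata


def normalize_text(text):
    """Fold width variants with NFKC, then lowercase Latin letters."""
    return unicodedata.normalize("NFKC", text).lower()


# Full-width letters and digits become ordinary ASCII
print(normalize_text("ＡＢＣ１２３"))
# Half-width katakana become regular full-width katakana
print(normalize_text("ﾃﾚﾋﾞ"))
```

Normalization like this does not fix slang or typos. It only makes sure that the same word written in different character widths reaches the analyzer in one form.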

**The problem that the meaning changes depending on the presence or absence of modifiers**

Modifiers change the meaning of verbs and adjectives, but morphological analysis splits the text into individual words. Unless you interpret the result as a combination of multiple words, the meaning comes out completely different.

Morphological analysis using janome

Let's perform morphological analysis using Python's janome library.

If janome is not installed yet, install it with the following command.

pip install janome

This is an example of a simple morphological analysis.


from janome.tokenizer import Tokenizer
t = Tokenizer()

# "Not sweet, not spicy, not delicious"
tokens = t.tokenize('甘からず辛からず旨からず')
for token in tokens:
    print(token)

Output (the Japanese labels that janome prints are shown here in English):

Amakara	adjective, independent, *, *, adjective / auo-dan, nu connection, sweet, amakara, amakara
Zu	auxiliary verb, *, *, *, special / nu, continuative ni connection, nu, zu, zu
Karakara	adjective, independent, *, *, adjective / auo-dan, nu connection, spicy, karakara, karakara
Zu	auxiliary verb, *, *, *, special / nu, continuative ni connection, nu, zu, zu
Umakara	adjective, independent, *, *, adjective / auo-dan, nu connection, delicious, umakara, umakara
Zu	auxiliary verb, *, *, *, special / nu, continuative ni connection, nu, zu, zu

Import janome's Tokenizer and create an instance of it.

from janome.tokenizer import Tokenizer
variable_name = Tokenizer()

`variable_name.tokenize('sentence')` returns the result of the morphological analysis of the sentence.

Each result is made up of the following parts.

・ Word
・ Part of speech
・ Reading

The result comes back split into words, so if you want to process each word, loop over the result with a for statement or similar.

Let's change the sentence.

# "Sumomo mo momo mo momo no uchi" (plums and peaches are both kinds of peach)
words = 'すもももももももものうち'
tokens = t.tokenize(words)
for token in tokens:
    print(token)

Sumomo (plum)	noun, general, *, *, *, *, sumomo, sumomo, sumomo
Mo	particle, binding particle, *, *, *, *, mo, mo, mo
Momo (peach)	noun, general, *, *, *, *, momo, momo, momo
Mo	particle, binding particle, *, *, *, *, mo, mo, mo
Momo (peach)	noun, general, *, *, *, *, momo, momo, momo
No	particle, adnominal, *, *, *, *, no, no, no
Uchi	noun, non-independent, adverbial, *, *, *, uchi, uchi, uchi

What happens if the following sentence is morphologically analyzed?

Nippon Television Tokyo (日本テレビ東京)

tokens = t.tokenize('日本テレビ東京')
for token in tokens:
    print(token)

Nippon Television	noun, proper noun, organization, *, *, *, Nippon Television, Nihon Terebi, Nihon Terebi
Tokyo	noun, proper noun, region, general, *, *, Tokyo, Tokyo, Tokyo

This result is determined by the costs stored in the dictionary data used for morphological analysis. The analyzer chooses the segmentation with the lowest cost, and words that are better known or that connect more easily have lower costs, which is why the text is split as shown above.
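The idea of choosing the lowest-cost split can be sketched with a small dynamic-programming routine. Everything here is an assumption made for illustration: the romanized text, the made-up word costs, and the function name `min_cost_segmentation`. Connection costs between neighboring words are left out entirely, so this is a simplified picture and not janome's actual implementation.

```python
def min_cost_segmentation(text, word_costs):
    """Return (words, total_cost) for the cheapest split of `text`."""
    n = len(text)
    # best[i] = (lowest cost to segment text[:i], start index of the last word)
    best = [(float("inf"), None)] * (n + 1)
    best[0] = (0, None)
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in word_costs:
                cost = best[start][0] + word_costs[word]
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[n][0] == float("inf"):
        raise ValueError("text cannot be segmented with this dictionary")
    # Walk backwards from the end to recover the chosen words
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1], best[n][0]


# Made-up costs: lower means the word is more likely
toy_costs = {"nihon": 3, "terebi": 3, "nihonterebi": 4, "tokyo": 3, "terebitokyo": 5}

words, total = min_cost_segmentation("nihonterebitokyo", toy_costs)
print(words, total)
```

With these toy costs, "nihonterebi / tokyo" (cost 7) wins over "nihon / terebitokyo" (cost 8). Changing the costs in the dictionary changes the split that is chosen.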

Many recently coined proper nouns are not in the dictionary, which can lead to undesired results.

# "I saw a fish at Tokyo Skytree Station"
tokens = t.tokenize('とうきょうスカイツリー駅で魚を見てきました')
for token in tokens:
    print(token)

Tou	adverb, particle connection, *, *, *, *, tou, tou, tou
Kyo (today)	noun, adverbial, *, *, *, *, today, kyo, kyo
Sky	noun, general, *, *, *, *, sky, sky, sky
Tree	noun, general, *, *, *, *, tree, tree, tree
Station	noun, suffix, region, *, *, *, station, eki, eki
De	particle, case particle, general, *, *, *, de, de, de
Fish	noun, general, *, *, *, *, fish, fish, fish
Wo	particle, case particle, general, *, *, *, wo, wo, wo
Mi	verb, independent, *, *, ichidan, continuative form, see, mi, mi
Te	particle, conjunctive particle, *, *, *, *, te, te, te
Ki	verb, non-independent, *, *, kahen / kuru, continuative form, kuru, ki, ki
Mashi	auxiliary verb, *, *, *, special / masu, continuative form, masu, mashi, mashi
Ta	auxiliary verb, *, *, *, special / ta, basic form, ta, ta, ta

The intended word was "Tokyo Skytree Station" as a single proper noun, but because that word is not in the dictionary, it gets split into whatever words the dictionary does contain.

In such a case, use a user dictionary. Create a dictionary file and load it.

variable_name = Tokenizer('dictionary file name', udic_enc='character encoding')

`userdic.csv`


東京スカイツリー,1288,1288,4569,名詞,固有名詞,一般,*,*,*,東京スカイツリー,トウキョウスカイツリー,トウキョウスカイツリー
とうきょうスカイツリー駅,1288,1288,4143,名詞,固有名詞,一般,*,*,*,とうきょうスカイツリー駅,トウキョウスカイツリーエキ,トウキョウスカイツリーエキ
東武スカイツリーライン,1288,1288,4700,名詞,固有名詞,一般,*,*,*,東武スカイツリーライン,トウブスカイツリーライン,トウブスカイツリーライン
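Each row of the user dictionary has 13 comma-separated columns. The sketch below builds one such row with Python's standard `csv` module. The column names in `FIELDS` reflect my understanding of the MeCab IPADIC-style layout, and `make_userdic_row` is a hypothetical helper, not part of janome. The ID `1288` and the cost are simply copied from the rows above.

```python
import csv
import io

# Assumed meaning of the 13 columns (MeCab IPADIC-style layout)
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_detail_1", "pos_detail_2", "pos_detail_3",
    "inflection_type", "inflection_form",
    "base_form", "reading", "pronunciation",
]


def make_userdic_row(surface, reading, cost, context_id=1288):
    """Build one proper-noun row (hypothetical helper for illustration)."""
    return [
        surface, context_id, context_id, cost,
        "名詞", "固有名詞", "一般", "*",  # noun, proper noun, general
        "*", "*",                          # proper nouns do not inflect
        surface, reading, reading,
    ]


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(make_userdic_row("東京スカイツリー", "トウキョウスカイツリー", 4569))
csv_text = buffer.getvalue()
print(csv_text)
```

A lower cost makes the analyzer more likely to choose the entry, so a new word that still gets split may need a lower value in the fourth column.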

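# Load the user dictionary together with the built-in dictionary
t = Tokenizer("userdic.csv", udic_enc="utf8")
# list() keeps the tokens so they can be looped over more than once
# (tokenize() returns a generator in recent janome versions)
tokens = list(t.tokenize('とうきょうスカイツリー駅で魚を見てきました'))
for token in tokens:
    print(token)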

Tokyo Skytree Station	noun, proper noun, general, *, *, *, Tokyo Skytree Station, Tokyo Skytree Eki, Tokyo Skytree Eki
De	particle, case particle, general, *, *, *, de, de, de
Fish	noun, general, *, *, *, *, fish, fish, fish
Wo	particle, case particle, general, *, *, *, wo, wo, wo
Mi	verb, independent, *, *, ichidan, continuative form, see, mi, mi
Te	particle, conjunctive particle, *, *, *, *, te, te, te
Ki	verb, non-independent, *, *, kahen / kuru, continuative form, kuru, ki, ki
Mashi	auxiliary verb, *, *, *, special / masu, continuative form, masu, mashi, mashi
Ta	auxiliary verb, *, *, *, special / ta, basic form, ta, ta, ta

The proper noun registered in `userdic.csv` is now reflected. Unknown words are handled by creating a dictionary like this.

You can get the word itself (the surface form) with `token.surface`.

for token in tokens:
    print(token.surface)

Tokyo Skytree Station
De
Fish
Wo
Mi
Te
Ki
Mashi
Ta

You can extract only the part of speech with `token.part_of_speech`. The part of speech is a string of subcategories separated by `,`, so to get one particular level, split the string on `,` and take the element you need.

for token in tokens:
    print(token.part_of_speech)

Noun, proper noun, general, *
Particle, case particle, general, *
Noun, general, *, *
Particle, case particle, general, *
Verb, independent, *, *
Particle, conjunctive particle, *, *
Verb, non-independent, *, *
Auxiliary verb, *, *, *
Auxiliary verb, *, *, *
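Splitting the part-of-speech string might look like the sketch below. To keep the example runnable without janome, the `samples` list stands in for `(token.surface, token.part_of_speech)` pairs, and `main_pos` is a hypothetical helper name.

```python
# Stand-ins for (token.surface, token.part_of_speech) pairs
samples = [
    ("魚", "名詞,一般,*,*"),        # fish: noun, general
    ("を", "助詞,格助詞,一般,*"),   # wo: particle, case particle
    ("見", "動詞,自立,*,*"),        # mi: verb, independent
]


def main_pos(part_of_speech):
    """Return the top-level part of speech (the first comma-separated field)."""
    return part_of_speech.split(",")[0]


# Keep only the nouns (名詞)
nouns = [surface for surface, pos in samples if main_pos(pos) == "名詞"]
print(nouns)
```

With real tokens, the same filter would be written as `main_pos(token.part_of_speech) == "名詞"` inside the for loop.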

You can check the reading with token.reading.

for token in tokens:
    print(token.reading)

Tokyo Skytree Eki
De
Fish
Wo
Mi
Te
Ki
Mashi
Ta

Inside the for loop, use the part of speech or the reading as needed for whatever processing comes next.

Summary

Morphological analysis is one of the basics of language analysis. There are many libraries, so it's a good idea to experiment with them.

The only way to deal with new words is to create a dictionary. If you want morphological analysis to be accurate, preparing a dictionary for unknown words is essential.

33 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube： https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter： https://twitter.com/otupython