The 2020 edition of the Language Processing 100 Knock has been released, so I took the opportunity to work through it. This is a markdown export of a Jupyter notebook published on GitHub, so a package imported once is not imported again in later cells. I hope it serves as one code reference.
Since I am a beginner in natural language processing, I will proceed while studying, and record the materials and notes I referred to along the way.
The first chapter is here. The third chapter is here.
Using MeCab, morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions. For problems 37, 38, and 39, use matplotlib or Gnuplot.
What is morphological analysis? According to Wikipedia:
Morphological analysis is the task of dividing text data (sentences) in a natural language, without annotations of grammatical information, into a sequence of morphemes (roughly, the smallest meaningful units of the language), based on information such as the grammar of the target language and dictionaries recording the parts of speech of words, and then determining the part of speech of each morpheme.
"Peek behind the scenes of Japanese morphological analysis! How MeCab parses morphologically" gives a simple picture of how it works:
Enumerate every possible segmentation candidate for the sentence, compute the occurrence cost of each node (element) and the connection cost between the parts of speech on each edge, and find the path that minimizes the cumulative cost by dynamic programming. Unknown words are added as words per character type (katakana, etc.), and their costs appear to be determined per character type.
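To make this concrete, here is a minimal sketch of the minimum-cost-path computation. This is a toy illustration, not MeCab's actual implementation; the word_cost and conn_cost dictionaries and all cost values are invented for the example.

def min_cost_segmentation(sentence, word_cost, conn_cost):
    # best[i] = (cumulative cost, part of speech of the last word, backpointer)
    # for the cheapest analysis of sentence[:i]
    best = {0: (0, 'BOS', None)}
    n = len(sentence)
    for i in range(n):
        if i not in best:
            continue
        cost_i, pos_i, _ = best[i]
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            if word in word_cost:
                pos, occurrence_cost = word_cost[word]
                # occurrence cost of the node + connection cost of the edge
                total = cost_i + occurrence_cost + conn_cost.get((pos_i, pos), 10)
                if j not in best or total < best[j][0]:
                    best[j] = (total, pos, (i, word))
    # Backtrack from the end of the sentence to recover the cheapest path
    # (assumes the sentence can be fully segmented with the toy dictionary)
    result, j = [], n
    while j > 0:
        _, _, (i, word) = best[j]
        result.append(word)
        j = i
    return list(reversed(result))

word_cost = {'すもも': ('noun', 3), 'もも': ('noun', 3), 'も': ('particle', 2), 'す': ('noun', 5)}
conn_cost = {('BOS', 'noun'): 0, ('noun', 'particle'): 1, ('particle', 'noun'): 1, ('noun', 'noun'): 5}
print(min_cost_segmentation('すもももももも', word_cost, conn_cost))
# ['すもも', 'も', 'もも', 'も'] (plum, too, peach, too)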
Incidentally, these costs are fixed when MeCab is installed: they are modeled with a CRF (Conditional Random Field) and computed in advance. The article above explains this in detail, but I have also summarized some recommended articles below for understanding the basics a little more.
A CRF is a "discriminative model", and one of its characteristics is "structured learning". It is an old article, but "A CRF story in 30 minutes for people who don't understand CRFs and have a stomachache" seems good for grasping the outline on which it is based. This article was good for understanding it in a little more detail, and this slide was very easy to understand for the more mathematical content.
A conditional random field (CRF) is one of the probabilistic graphical models represented by undirected graphs, and is a discriminative model. (Wikipedia)
A CRF is a Markov random field (a set of random variables with the Markov property) trained on features. (A Markov random field is represented by an undirected graph.)
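As a reference for the next paragraph, a linear-chain CRF defines the conditional probability of a label (part-of-speech) sequence $Y$ given an input sequence $X$ in the standard form:

$$
P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, X, t) \right)
$$

where $Z(X)$ is the normalization term obtained by summing the same exponential over all possible label sequences.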
Here, $f_k$ is a feature function, and the features consist of transition features and observation features. $\lambda_k$ is a weight, and the costs are computed from these values. Since a CRF is based on a Markov random field, a transition feature depends only on the previous state and the current state. MeCab's occurrence cost is computed from the feature functions belonging to the observation features, and its connection cost from the feature functions belonging to the transition features. Also, since MeCab's transition features handle only part-of-speech states, the connection cost seems to consider only part-of-speech information.
Before starting on the problems, import matplotlib and run the morphological analysis with MeCab. I use the MeCab installed via Homebrew.
mecab < ../data/neko.txt > ../data/neko.txt.mecab
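Each line of neko.txt.mecab then has the form below (this assumes MeCab's default IPA dictionary; other dictionaries use different column layouts): the surface form, a tab, and nine comma-separated fields.

surface\tpos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation

For example: 吾輩	名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ

This is why the parsing code below reads attr[0] as pos, attr[1] as pos1, and attr[6] as base.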
import matplotlib.pyplot as plt
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with keys for the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1), and express one sentence as a list of morphemes (mapping types). Use this program for the remaining problems in Chapter 4.
The mapping type storing each morpheme looks like {'surface': 'de', 'base': 'da', 'pos': 'auxiliary verb', 'pos1': '*'}.
fname = '../data/neko.txt.mecab'
def read_mecab(fname):
    with open(fname, mode='rt', encoding='utf-8') as f:
        # Split on EOS markers so that each sentence is stored separately
        sentence_list = f.read().split('EOS\n')
        sentence_list = list(filter(lambda x: x != '', sentence_list))

    result_list = []
    for sentence in sentence_list:
        morpheme_list = []
        for word in sentence.split('\n'):
            if word != '':
                # Each line is "surface\tpos,pos1,...,base,reading,pronunciation"
                (surface, attr) = word.split('\t')
                attr = attr.split(',')
                component_dict = {
                    'surface': surface,
                    'base': attr[6],
                    'pos': attr[0],
                    'pos1': attr[1]
                }
                morpheme_list.append(component_dict)
        result_list.append(morpheme_list)
    return result_list
neko_morpheme = read_mecab(fname)
print(neko_morpheme[:4])
[[{'surface': 'one', 'base': 'one', 'pos': 'noun', 'pos1': 'number'}],
 [{'surface': '\u3000', 'base': '\u3000', 'pos': 'symbol', 'pos1': 'blank'}, {'surface': 'my', 'base': 'my', 'pos': 'noun', 'pos1': 'pronoun'}, {'surface': 'is', 'base': 'is', 'pos': 'particle', 'pos1': 'binding particle'}, {'surface': 'cat', 'base': 'cat', 'pos': 'noun', 'pos1': 'general'}, {'surface': 'at', 'base': 'da', 'pos': 'auxiliary verb', 'pos1': '*'}, {'surface': 'is', 'base': 'is', 'pos': 'auxiliary verb', 'pos1': '*'}, {'surface': '.', 'base': '.', 'pos': 'symbol', 'pos1': 'punctuation'}],
 [{'surface': 'name', 'base': 'name', 'pos': 'noun', 'pos1': 'general'}, {'surface': 'is', 'base': 'is', 'pos': 'particle', 'pos1': 'binding particle'}, {'surface': 'yet', 'base': 'yet', 'pos': 'adverb', 'pos1': 'particle connection'}, {'surface': 'no', 'base': 'no', 'pos': 'adjective', 'pos1': 'independent'}, {'surface': '.', 'base': '.', 'pos': 'symbol', 'pos1': 'punctuation'}],
 [{'surface': '\u3000', 'base': '\u3000', 'pos': 'symbol', 'pos1': 'blank'}, {'surface': 'where', 'base': 'where', 'pos': 'noun', 'pos1': 'pronoun'}, {'surface': 'in', 'base': 'in', 'pos': 'particle', 'pos1': 'case particle'}, {'surface': 'born', 'base': 'born', 'pos': 'verb', 'pos1': 'independent'}, {'surface': 'ta', 'base': 'ta', 'pos': 'auxiliary verb', 'pos1': '*'}, {'surface': 'ka', 'base': 'ka', 'pos': 'particle', 'pos1': 'adverbial/parallel/final particle'}, {'surface': 'tonto', 'base': 'tonto', 'pos': 'adverb', 'pos1': 'general'}, {'surface': 'register', 'base': 'register', 'pos': 'noun', 'pos1': 'sahen connection'}, {'surface': 'is', 'base': 'is', 'pos': 'particle', 'pos1': 'case particle'}, {'surface': 'tsuka', 'base': 'tsuku', 'pos': 'verb', 'pos1': 'independent'}, {'surface': 'nu', 'base': 'nu', 'pos': 'auxiliary verb', 'pos1': '*'}, {'surface': '.', 'base': '.', 'pos': 'symbol', 'pos1': 'punctuation'}]]
Extract all surface forms of verbs.
Filtering on pos picks out the surface forms of verbs. Since all that was needed here was to extract the verbs, I flattened the per-sentence lists once and processed the result.
The output is a list; if you don't need duplicate elements, just wrap it in set(verbose_list) (a short sketch follows the output below).
from itertools import chain
def extract_verbose(target):
    flatten_list = list(chain.from_iterable(target))
    return [word['surface'] for word in flatten_list if word['pos'] == 'verb']
verbose_list = extract_verbose(neko_morpheme)
print(verbose_list[:10])
['Birth','Tsuka','Shi','Crying','Shi','I','Beginning','See','Listen','Capture']
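As mentioned above, deduplication is a one-liner. A small sketch; the dict.fromkeys idiom is a standard way to deduplicate while keeping first-appearance order, in case order matters:

# set() removes duplicates but discards order
unique_verbs = set(verbose_list)
# dict.fromkeys() removes duplicates while preserving first-appearance order
ordered_unique_verbs = list(dict.fromkeys(verbose_list))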
Extract all base forms of verbs.
def extract_verb_base(target):
    flatten_list = list(chain.from_iterable(target))
    return [word['base'] for word in flatten_list if word['pos'] == 'verb']
verb_base_list = extract_verb_base(neko_morpheme)
print(verb_base_list[:10])
['Birth','Tsuku','Do','Cry','Do','I','Start','See','Listen','Catch']
Extract noun phrases in which two nouns are connected by "no" (の).
Here we have to process sentence by sentence. Each connected noun phrase is added to the list as a dictionary of the form {A: B}.
def extract_connected_by_of(target):
    result_list = []
    for sentence in target:
        # Check each triple (i-1, i, i+1) within the sentence
        for i in range(1, len(sentence) - 1):
            if sentence[i-1]['pos'] == 'noun' and sentence[i]['base'] == 'of' and sentence[i+1]['pos'] == 'noun':
                result_list.append({sentence[i-1]['surface']: sentence[i+1]['surface']})
    return result_list
a_of_b_list = extract_connected_by_of(neko_morpheme)
print(a_of_b_list[:10])
[{'He': 'Palm'}, {'Palm': 'Upper'}, {'Shosei': 'Face'}, {'Should': 'Face'}, {'Face': 'Middle'}, {'Hole': 'Middle'}, {'Shosei': 'Palm'}, {'Palm': 'Back'}, {'What': 'Thing'}, {'Important': 'Mother'}]
Extract concatenations of nouns (nouns that appear consecutively) with the longest match.
As long as the nouns continue, every noun in the run must be included in the result.
def extract_chain_noun(target_list):
    result_list = []
    for sentence in target_list:
        temp_list = []
        for word in sentence:
            if word['pos'] == 'noun':
                temp_list.append(word['surface'])
            else:
                # A run of two or more nouns has ended, so record it
                if len(temp_list) > 1:
                    result_list.append(temp_list)
                temp_list = []
        # Also record a run that reaches the end of the sentence
        if len(temp_list) > 1:
            result_list.append(temp_list)
    return result_list
chain_noun_list = extract_chain_noun(neko_morpheme)
for i in chain_noun_list[:10]:
    print(i)
['Human', 'Medium']
['Ichiban', 'Evil']
['Time', 'Mysterious']
['One', 'Hair']
['After', 'Cat']
['one time']
['Puuputo', 'Smoke']
['House', 'Inside']
['Three', 'Hair']
['Shosei', 'Other than']
Find all the words that appear in the text and their frequencies of appearance, and arrange them in descending order of frequency.
Use Counter from the Python standard library's collections to get a list of (word, count) tuples.
#from itertools import chain
from collections import Counter
def extract_word_frequency(target):
    flatten_list = list(chain.from_iterable(target))
    word_list = [word['surface'] for word in flatten_list]
    counted_list = Counter(word_list).most_common()
    return counted_list
word_counted_list = extract_word_frequency(neko_morpheme)
word, counts = zip(*word_counted_list)
print(word[:20])
('No', '.', 'Te', ',', 'is', 'to', 'to', 'and', 'is', '"', 'Also', 'not', 'da', 'shi', 'from', 'are', 'na')
Display the 10 most frequently appearing words and their frequencies of appearance in a graph (for example, a bar graph).
# import matplotlib.pyplot as plt
import japanize_matplotlib
plt.figure(figsize=(12, 8))
plt.barh(word[:10], counts[:10])
plt.title('Top 10 words by frequency')
plt.xlabel('Number of appearances')
plt.ylabel('word')
plt.show()
Display the 10 words that most often co-occur with "cat" (highest co-occurrence frequency) and their frequencies of appearance in a graph (for example, a bar graph).
First we have to decide how to define co-occurrence frequency; here I interpret it as extracting the words that appear in the same sentence as "cat".
It is done in three steps.
def extract_cooccurrence_list(target):
    cooccurrence_list = []
    for sentence in target:
        # If the sentence contains the surface form "Cat", add all the other words
        word_list = [word['surface'] for word in sentence]
        if "Cat" in word_list:
            cooccurrence_list.extend([word for word in word_list if word != "Cat"])
    return cooccurrence_list

def caluculate_word_frequence(target):
    counted_list = Counter(target).most_common()
    word, counts = zip(*counted_list)
    return word, counts

def main():
    cat_frequence_list = extract_cooccurrence_list(neko_morpheme)
    neko_word, neko_count = caluculate_word_frequence(cat_frequence_list)
    plt.figure(figsize=(12, 8))
    plt.barh(neko_word[:10], neko_count[:10])
    plt.title('Top 10 words that frequently co-occur with "cat"')
    plt.xlabel('Number of appearances')
    plt.ylabel('word')
    plt.show()

if __name__ == '__main__':
    main()
Draw a histogram of word frequencies (the horizontal axis is the frequency of occurrence, and the vertical axis is the number of word types that take that frequency, drawn as a bar graph).
I reuse the counts list computed in problem 35. If a variable is going to be referred to later like this, it deserves a better name. This time I plotted frequencies up to 100.
plt.figure(figsize = (12,8))
plt.hist(counts, bins=20, range=(1,100))
plt.show()
Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.
The counts list obtained in problem 35 is already sorted in descending order, i.e., in rank order.
# Create the list of frequency ranks (1, 2, 3, ...)
rank_list = [i + 1 for i in range(len(counts))]
plt.figure(figsize = (12,8))
plt.scatter(rank_list, counts)
plt.xscale('log')
plt.yscale('log')
plt.grid(which='major',color='black')
plt.show()
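The plot coming out close to a straight line is what Zipf's law predicts, with a slope of roughly -1. As a quick check, a minimal sketch (assuming numpy is available) that fits a line to the log-log data:

import numpy as np
# Fit a straight line to log(rank) vs. log(frequency); Zipf's law predicts a slope near -1
slope, intercept = np.polyfit(np.log(rank_list), np.log(counts), 1)
print(slope)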