Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
--This article contains the results of a student, an amateur in both language processing and Python, solving 100 Language Processing Knock 2020. I would be very pleased if you pointed out any mistakes or improvements. ~~I'm following PyCharm's inspections to study Python, so there may be a lot of superfluous code.~~
--This time I'm using Atom, because I couldn't solve a problem where "" could not be entered in PyCharm.
--Chapters 1 to 3 are skipped.
**Environment**

--MacBook Pro (13-inch, 2016, Thunderbolt 3 ports x 2)
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab.
So, execute the following program to create it.
pre_processing.py

```python
# -*- coding: utf-8 -*-
import MeCab

analyser = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")

with open('./neko.txt', "r") as infile:
    lines = infile.readlines()

with open('./neko.txt.mecab.txt', "w") as outfile:
    for line in lines:
        outfile.write(analyser.parse(line))
```
**Note**

--Looking at the output morphemes, "I am a cat" in the first line is identified as a proper noun. I don't mind.
--The output format (according to MeCab) is

    surface\tPOS,POS subdivision 1,POS subdivision 2,POS subdivision 3,conjugation type,conjugated form,base form,reading,pronunciation

so each line should have 10 elements, but occasionally there was a line with 8 elements, as shown below (there were no 9-element lines):

    ['Neck muscle', 'Noun', 'General', '*', '*', '*', '*', '*\n']
    ['Girigo', 'Noun', 'General', '*', '*', '*', '*', '*\n']
    ['Mug', 'Noun', 'General', '*', '*', '*', '*', '*\n']

I suppose the trailing reading and pronunciation fields are simply omitted when they are unknown.
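If one wanted to guard against such short lines when parsing, the fields could be padded to the full 10 elements before indexing. A minimal sketch (the function name and padding behavior are my own, not part of the scripts in this article):

```python
def parse_mecab_line(line):
    """Split one MeCab (ipadic-style) output line into exactly 10 elements,
    padding missing trailing fields (e.g. reading/pronunciation) with '*'."""
    surface, _, feature = line.rstrip("\n").partition("\t")
    fields = feature.split(",")
    fields += ["*"] * (9 - len(fields))  # pad the feature part to 9 fields
    return [surface] + fields

# An 8-element line like the ones above becomes a safe 10-element list.
print(parse_mecab_line("Mug\tNoun,General,*,*,*,*,*"))
```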
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1) as keys, and represent one sentence as a list of such mappings. Use the program created here for the rest of the problems in Chapter 4.
k30input.py

```python
#! /usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import sys


def input_macab(filename):
    with open(filename, "r") as infile:
        sentences = []
        sentence = []
        for line in infile.readlines():
            # surface\tPOS,POS subdivision 1,POS subdivision 2,POS subdivision 3,
            # conjugation type,conjugated form,base form,reading,pronunciation
            if line == 'EOS\n':
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
                continue
            sline = re.split('[,\t]', line)
            if len(sline) < 8:
                print("### Read error:\n", sline, "\n")
                sys.exit(1)
            sentence.append({'surface': sline[0], 'base': sline[7],
                             'pos': sline[1], 'pos1': sline[2]})
    print("** Loading completed **\n")
    return sentences


if __name__ == '__main__':
    filename = "neko.txt.mecab.txt"
    ss = input_macab(filename)
    print("")
    print("It was run as main.")
```
**Note**

--Since it is used in later problems, I made it a function.
--pos stands for part of speech.
Extract all the surface forms of the verb.
k31verb_surface.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
for sentence in sentences:
    for mor in sentence:
        if mor['pos'] == "動詞":  # verb
            print(mor['surface'])
```
Extract all the original forms of the verb.
k32verb_base.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
for sentence in sentences:
    for mor in sentence:
        if mor['pos'] == "動詞":  # verb
            print(mor['base'])
```
Extract noun phrases in which two nouns are connected by "の" (no).
k33noun_no_noun.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
noun_flag = 0
no_flag = 0
noun1 = ""
for sentence in sentences:
    for mor in sentence:
        if noun_flag == 0:
            if mor['pos'] == "名詞":  # noun
                noun_flag = 1
                noun1 = mor['surface']
        elif noun_flag == 1 and no_flag == 0:
            if mor['surface'] == "の":
                no_flag = 1
            elif mor['pos'] == "名詞":
                # Another noun follows directly; it becomes the new candidate.
                noun1 = mor['surface']
            else:
                noun1 = ""
                noun_flag = no_flag = 0
        elif noun_flag == 1 and no_flag == 1:
            if mor['pos'] == "名詞":
                print(noun1 + "の" + mor['surface'])
            noun_flag = no_flag = 0
```
Extract the longest runs of nouns (nouns that appear consecutively).
k34nounoun_longest.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
nouns = []
for sentence in sentences:
    for mor in sentence:
        if mor['pos'] == "名詞":  # noun
            nouns.append(mor['surface'])
        else:
            if len(nouns) > 1:
                print(" ".join(nouns))
            nouns = []
    # Flush a run that ends at the sentence boundary.
    if len(nouns) > 1:
        print(" ".join(nouns))
    nouns = []
```
**Note**

--The output puts a space between the nouns for readability.
--I was briefly confused because I had forgotten that MeCab judges the same surface form, such as "finally", as either an adverb or a noun depending on context.
Find the words that appear in the text and their frequencies, and arrange them in descending order of frequency.
k35word_freq.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
for i in ranking:
    print(i)
```
**Note**

--Morphemes with the same surface form but different parts of speech are counted separately: for example, the adverb "finally" and the noun "finally".
Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart).
k36word10_graph.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
top10 = ranking[0:10]
x = []
y = []
for i in top10:
    x.append(i[0][0])  # surface form
    y.append(i[1])     # frequency
pyplot.bar(x, y)
# Graph title
pyplot.title('Top 10 most frequent words')
# Axis labels
pyplot.xlabel('morpheme')
pyplot.ylabel('frequency')
pyplot.show()
```
**Note**

--The default matplotlib font cannot display Japanese, so I switched the font.
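For reference, a minimal sketch of such a font switch via rcParams; the font name 'Hiragino Sans' is my assumption for macOS, and you would substitute whatever Japanese-capable font is installed on your system:

```python
from matplotlib import pyplot, rcParams

# Assumed font name (macOS); any installed Japanese-capable font works here.
rcParams['font.family'] = 'Hiragino Sans'

pyplot.title('吾輩は猫である')  # Japanese now renders instead of tofu boxes
```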
Display 10 words that frequently co-occur with "cat" (high co-occurrence frequency) and their frequencies in a graph (for example, a bar chart).
k37co_cat.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")  # neko.txt.mecab.clean.txt
tmp_count = dict()
co_cat_count = dict()
cat_flag = 0
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        tmp_count[key] = tmp_count.get(key, 0) + 1
        if mor['surface'] == "猫":  # "cat"
            cat_flag = 1
    if cat_flag == 1:
        # This sentence contains "cat": add its counts to the co-occurrence totals.
        for k, v in tmp_count.items():
            co_cat_count[k] = co_cat_count.get(k, 0) + v
    cat_flag = 0
    tmp_count = {}
ranking = sorted(co_cat_count.items(), key=lambda i: i[1], reverse=True)
top10 = ranking[0:10]
x = []
y = []
for i in top10:
    x.append(i[0][0])  # surface form
    y.append(i[1])     # co-occurrence frequency
pyplot.bar(x, y)
# Graph title
pyplot.title('Top 10 words that frequently co-occur with "cat"')
# Axis labels
pyplot.xlabel('morpheme')
pyplot.ylabel('frequency')
pyplot.show()
```
**Note**

Co-occurrence means that when a certain word appears in a sentence (or text), certain other words tend to appear in the same sentence frequently. [^ 1]

--Here it means the top 10 words (morphemes) that appear most often in sentences that contain "cat".
--"Cat" itself should arguably be excluded, but I left it in: it is interesting as a reference point, and it is easy to remove if you want to.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| of | Is | 、 | Cat | To | To | hand | 。 | When | But |
Punctuation, particles, and auxiliary verbs are also counted as words, so the result doesn't feel specific to "cat": any word whatsoever will co-occur with these, so the ranking is uninteresting.
So I prepared a file, neko.txt.mecab.clean.txt, that excludes punctuation, particles, and auxiliary verbs from neko.txt.mecab, and got the following result instead:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| Cat | Shi | Thing | I | of | Is | is there | To do | Human | この |
--Although this is a little better, I still don't feel that the characteristics of "cat" have been captured.
--I wonder if there are many sentences along the lines of "yes, yes, do".
--"の" is judged as a noun, so it is annoying that it ranks in without being excluded.
--What kind of noun is "の"? A: cases like "field" (野) and the formal noun (as in "mine"), etc. [^ 2]
--If you computed the co-occurrence frequencies of all such words and excluded them as well, you could find more meaningful co-occurrences, but it is a hassle.
--The only meaningful information obtained from the current co-occurrence result is that in the famous book "I Am a Cat", "cat" and "human" co-occur often.
--I wonder if "I Am a Cat" contains many sentences that contrast humans and cats.
--The cat's first person being "I" (wagahai) is, of course, a very famous fact.
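For reference, the cleaning step mentioned above could be sketched roughly as follows; the function name and the exact IPAdic POS labels to drop (記号 = symbol/punctuation, 助詞 = particle, 助動詞 = auxiliary verb) are my assumptions about what was filtered:

```python
import re

EXCLUDED_POS = {'記号', '助詞', '助動詞'}  # punctuation, particles, auxiliary verbs

def clean_mecab_file(src, dst):
    """Copy a MeCab output file, dropping lines whose POS is excluded."""
    with open(src) as infile, open(dst, 'w') as outfile:
        for line in infile:
            if line == 'EOS\n':
                outfile.write(line)
                continue
            fields = re.split('[,\t]', line)
            if len(fields) < 2 or fields[1] not in EXCLUDED_POS:
                outfile.write(line)

# Usage: clean_mecab_file('neko.txt.mecab.txt', 'neko.txt.mecab.clean.txt')
```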
Draw a histogram of word frequencies. The horizontal axis is frequency of occurrence, on a linear scale from 1 to the maximum word frequency. The vertical axis is the number of distinct words (types) that occur with the frequency shown on the x-axis.
k38histogram.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")  # neko.txt.mecab.clean.txt
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
x = []
for i in ranking:
    x.append(i[1])  # one entry per word type: its frequency
pyplot.hist(x, range=(1, ranking[0][1]))
# Graph title
pyplot.title('Frequency of word occurrence')
# Axis labels
pyplot.xlabel('Frequency of appearance')
pyplot.ylabel('Number of types')
pyplot.show()
```
**Note**

--Another Qiita contributor used frequency of appearance for the vertical axis, but the vertical axis here is the number of distinct words (types) that appear with that frequency.
--A histogram with a logarithmic axis looks like this:
Plot a log-log graph with word frequency rank on the horizontal axis and frequency of occurrence on the vertical axis.
k39loglog_graph.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot
import numpy as np

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")  # neko.txt.mecab.clean.txt
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
y = []
for i in ranking:
    y.append(i[1])
x = range(1, len(ranking) + 1)  # ranks start at 1 so that log(x) is defined
print("size", len(ranking))
pyplot.title('Frequency of word occurrence')
pyplot.xlabel('Occurrence frequency rank, log(x)')
pyplot.ylabel('Frequency of appearance, log(y)')
# I didn't know how to give a scatter plot logarithmic axes, so I cheated
# only here by log-transforming the values. That's why numpy is used.
pyplot.scatter(np.log(x), np.log(y))
pyplot.show()
```
**Note**

--Zipf's law: an empirical law stating that the proportion of the k-th most frequent element in the whole is proportional to 1/k. [^ 3]
--Apparently this law is not just an empirical rule of natural language; it holds for a variety of phenomena.
--The word frequencies of Wikipedia (30 languages) show a similar shape. [^ 3]
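As a quick sanity check of the 1/k claim, rank × frequency should stay roughly constant; a sketch on toy frequencies (not the actual novel's counts):

```python
# Under Zipf's law, frequency(k) ≈ C / k, so k * frequency(k) ≈ C for all k.
freqs = [1000, 480, 340, 250, 195]  # toy descending frequencies
products = [k * f for k, f in enumerate(freqs, start=1)]
print(products)  # → [1000, 960, 1020, 1000, 975], i.e. roughly constant
```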
This is still within the scope of studying Python. Studying the numpy, pandas, and collections modules felt like a hassle, so I got by without them. But isn't it actually harder not to use them? Also, I wanted to factor the repeated processing into a function to make it cleaner. To be continued next time. (I will definitely do it.)
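Incidentally, the counting loops repeated in the scripts above could be collapsed with collections.Counter, one of the modules I skipped; a sketch with a toy two-sentence input in the same dict shape that input_macab produces:

```python
from collections import Counter

# Toy input shaped like input_macab's output: a list of sentences,
# each a list of morpheme dicts.
sentences = [[{'surface': '猫', 'pos': '名詞'}, {'surface': 'だ', 'pos': '助動詞'}],
             [{'surface': '猫', 'pos': '名詞'}]]

mor_freq = Counter((mor['surface'], mor['pos'])
                   for sentence in sentences for mor in sentence)
ranking = mor_freq.most_common()  # already sorted by descending count
print(ranking[0])  # → (('猫', '名詞'), 2)
```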
[^ 2]: Wiktionary, Japanese edition
[^ 3]: [Zipf's law - Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87)