[PYTHON] 100 Language Processing Knock 2020 with GiNZA v3.1 Chapter 4

Introduction

The 100 Language Processing Knock 2020 has been released, so I decided to try it right away.

In Chapter 4, morphological analysis is supposed to be done with MeCab. Since this is a good opportunity, however, I will do it with GiNZA instead (although, as it turned out, I only used GiNZA at the very beginning).

Chapter 4: Morphological analysis

Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

code


import spacy
import pandas as pd
import pprint
from functools import reduce
import collections
import matplotlib.pyplot as plt
import seaborn as sns

with open('neko.txt') as f:
    raw_text = f.read()
    # Remove the extra characters (the chapter marker) at the beginning
    raw_text = raw_text.replace('一\n\n ', '')
    nlp = spacy.load('ja_ginza')
    doc = nlp(raw_text)
    with open('neko.txt.ginza', 'w') as f2:
        for sent in doc.sents:
            for token in sent:
                # Write the token index, surface form, lemma, and POS tag, comma-separated
                f2.write(','.join([str(token.i), token.orth_, token.lemma_, token.tag_]) + '\n')

GiNZA can apparently produce MeCab-style output on the command line, but since I could not find a corresponding API in Python, I simply wrote out my own format.
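For reference, here is a minimal sketch of how one might approximate MeCab's "surface, tab, features" layout from GiNZA tokens in Python. The feature order is my own choice, not MeCab's exact dictionary format, so treat it purely as an illustration.


import spacy

nlp = spacy.load('ja_ginza')

def to_mecab_like(doc):
    """Yield lines roughly shaped like MeCab output: surface, a tab, then features."""
    for sent in doc.sents:
        for token in sent:
            # tag_ holds the UniDic-style POS hierarchy, e.g. '名詞-普通名詞-一般'
            features = token.tag_.replace('-', ',')
            yield f'{token.orth_}\t{features},{token.lemma_}'
        yield 'EOS'  # MeCab marks the end of each sentence with EOS

for line in to_mecab_like(nlp('吾輩は猫である。')):
    print(line)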

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping type whose keys are the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1), and represent each sentence as a list of such morphemes (mapping types). Use the program created here for the remaining problems in Chapter 4.

code


neko_df = pd.read_csv('neko.txt.ginza', header=None)

docs = []
sentence = []
for row in neko_df.itertuples():
    # The tag looks like '名詞-普通名詞-一般'; the first element is pos, the rest pos1
    pos, *pos1 = row[4].split('-')
    neko_dict = {
        'surface': row[2],
        'base': row[3],
        'pos': pos,
        'pos1': pos1
    }
    sentence.append(neko_dict)
    # Split sentences at the full stop '。'
    if row[2] == '。':
        docs.append(sentence)
        sentence = []
pprint.pprint(docs[0])

Output result


[{'base': '吾輩', 'pos': '代名詞', 'pos1': [], 'surface': '吾輩'},
 {'base': 'は', 'pos': '助詞', 'pos1': ['係助詞'], 'surface': 'は'},
 {'base': '猫', 'pos': '名詞', 'pos1': ['普通名詞', '一般'], 'surface': '猫'},
 {'base': 'だ', 'pos': '助動詞', 'pos1': [], 'surface': 'で'},
 {'base': 'ある', 'pos': '動詞', 'pos1': ['非自立可能'], 'surface': 'ある'},
 {'base': '。', 'pos': '補助記号', 'pos1': ['句点'], 'surface': '。'}]
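The problem statement treats pos1 as a single value, whereas the code above keeps the remaining tag elements as a list (an empty one when there is no subdivision). If a plain string is preferred, a minimal variant (the helper name is my own) could look like this:


def to_morpheme(surface, base, tag):
    """Build the morpheme mapping with pos1 as one string such as '普通名詞-一般'."""
    pos, *rest = tag.split('-')
    return {
        'surface': surface,
        'base': base,
        'pos': pos,
        'pos1': '-'.join(rest)  # empty string when there is no subdivision
    }

print(to_morpheme('猫', '猫', '名詞-普通名詞-一般'))
# {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '普通名詞-一般'}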

31. Verbs

Extract all the surface forms of verbs.

code


surfaces = []
for sentence in docs:
    for morpheme in sentence:
        # Note: this collects the surface form of every morpheme, not only verbs
        surfaces.append(morpheme['surface'])
print(surfaces[:30])

Output result


['吾輩', 'は', '猫', 'で', 'ある', '。', '名前', 'は', 'まだ', '無い', '。', 'どこ', 'で', '生れ', 'た', 'か', 'と', 'んと', '見当', 'が', 'つか', 'ぬ', '。', '何', 'で', 'も', '薄暗い', 'じめじめ', 'し', 'た']

32. Base forms of verbs

Extract all the base forms of verbs.

code


bases = []
for sentence in docs:
    for morpheme in sentence:
        # Note: this collects the base form of every morpheme, not only verbs
        bases.append(morpheme['base'])
print(bases[:30])

Output result


['吾輩', 'は', '猫', 'だ', 'ある', '。', '名前', 'は', '未だ', '無い', '。', 'どこ', 'で', '生れる', 'た', 'か', 'と', 'うんと', '見当', 'が', '付く', 'ず', '。', '何', 'で', 'も', '薄暗い', 'じめじめ', 'する', 'た']

33. "B of A"

Extract a noun phrase in which two nouns are connected by "no".

code


nouns = []
for sentence in docs:
    for i in range(len(sentence) - 2):
        # Noun + の + noun
        if sentence[i]['pos'] == '名詞' and sentence[i + 1]['surface'] == 'の' and sentence[i + 2]['pos'] == '名詞':
            nouns.append(sentence[i]['surface'] + sentence[i + 1]['surface'] + sentence[i + 2]['surface'])
print(nouns[:30])

Output result


['On the palm', 'Student's face', 'Seeing things', 'Should face', 'In the middle of the face', 'In the hole', 'Calligraphy palm', 'The back of the palm', 'So far', 'On the straw', 'In Sasahara', 'In front of the pond', 'On the pond', 'Thanks to Kazuki', 'Hedge hole', 'Neighboring calico cat', 'Passage of time', 'Momentary grace', 'Inside the house', 'Humans other than', 'Previous student', 'Your chance', 'Three of you', 'Chest itching', 'Housekeeper', 'Master', 'None little cat', 'Under the nose', 'My home', 'Home stuff']
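As an equivalent sketch, the index arithmetic can be replaced with zip over consecutive triples of morphemes, which does the same thing but avoids manual indexing:


nouns = []
for sentence in docs:
    # zip yields every consecutive (a, b, c) triple in the sentence
    for a, b, c in zip(sentence, sentence[1:], sentence[2:]):
        if a['pos'] == '名詞' and b['surface'] == 'の' and c['pos'] == '名詞':
            nouns.append(a['surface'] + b['surface'] + c['surface'])
print(nouns[:30])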

34. Noun concatenation

Extract the longest runs of consecutively occurring nouns (noun concatenations).

code


nouns2 = []
for sentence in docs:
    word = ''
    count = 0
    for morpheme in sentence:
        if morpheme['pos'] == '名詞':
            word += morpheme['surface']
            count += 1
        else:
            # A run of two or more nouns counts as a concatenation
            if count >= 2:
                nouns2.append(word)
            word = ''
            count = 0
print(nouns2[:30])

Output result


['Start', 'Timely', 'One hair', 'Rear cat', 'Up to now', 'Uchiike', 'Other than student', 'Mao', 'No inn', 'Mama back', 'All-day study', 'Almost', 'Sometimes stealth', 'A few pages', 'Other than my husband', 'Morning master', 'Sou side', 'One ken', 'Nerve stomach weakness', 'Sometimes the same', 'Language break', 'My wife', 'The other day ball', 'The whole story', 'No matter how human', 'Mr.', 'Sora Munemori', 'January', 'Monthly salary date', 'Watercolor paint']
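A minimal alternative sketch using itertools.groupby. Unlike the loop above, it also catches a noun run at the very end of a sentence, which only matters if a sentence does not end with '。':


from itertools import groupby

nouns2 = []
for sentence in docs:
    # Group consecutive morphemes by whether or not they are nouns
    for is_noun, group in groupby(sentence, key=lambda m: m['pos'] == '名詞'):
        run = list(group)
        if is_noun and len(run) >= 2:
            nouns2.append(''.join(m['surface'] for m in run))
print(nouns2[:30])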

35. Frequency of word occurrence

Find the words that appear in the text and their frequencies of occurrence, and arrange them in descending order of frequency.

code


# Flatten the two-dimensional list of sentences into one list of morphemes
words = reduce(list.__add__, docs)
# Count the occurrences of each surface form
words = collections.Counter(map(lambda e: e['surface'], words))
# most_common() already returns (word, count) pairs in descending order of count
words = words.most_common()
# This extra sort is therefore redundant, but harmless
words = sorted(words, key=lambda e: e[1], reverse=True)
print(words[:30])
print(words[:30])

Output result


[('の', 9546), ('。', 7486), ('て', 7401), ('に', 7047), ('、', 6772), ('は', 6485), ('と', 6150), ('を', 6118), ('が', 5395), ('で', 4542), ('た', 3975), ('「', 3238), ('」', 3238), ('も', 3229), ('だ', 2705), ('し', 2530), ('ない', 2423), ('から', 2213), ('か', 2041), ('ある', 1729), ('ん', 1625), ('な', 1600), ('い', 1255), ('事', 1214), ('する', 1056), ('もの', 1005), ('何', 998), ('です', 978), ('君', 967), ('云う', 937)]
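As an aside, reduce(list.__add__, ...) builds a new list at every step, so flattening this way is quadratic in the number of sentences. A sketch of the usual linear-time idiom with itertools:


import collections
import itertools

# chain.from_iterable flattens the list of sentences lazily, in linear time
counter = collections.Counter(
    m['surface'] for m in itertools.chain.from_iterable(docs)
)
print(counter.most_common(30))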

36. Top 10 most frequent words

Display the 10 most frequently occurring words and their frequencies of occurrence in a graph (for example, a bar graph).

code


words_df = pd.DataFrame(words[:10], columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(words_df['word'], words_df['count'])
# plt.show()
plt.savefig('36.png')

Output result 36.png
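The plot assumes the AppleMyungjo font, which ships with macOS; without a font that covers Japanese glyphs, matplotlib draws the axis labels as empty boxes. One portable alternative, assuming the third-party japanize-matplotlib package is installed, is a sketch like:


import matplotlib.pyplot as plt
import japanize_matplotlib  # registers a bundled Japanese font (IPAexGothic) with matplotlib

plt.bar(['猫', '犬'], [2, 1])
plt.savefig('sample.png')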

37. Top 10 words that frequently co-occur with "cat"

Display 10 words that often co-occur with "cat" (i.e., have a high co-occurrence frequency) together with their frequencies of occurrence in a graph (for example, a bar graph).

code


cats = []
for sentence in docs:
    # Keep only sentences that contain '猫'
    cat_list = list(filter(lambda e: e['surface'] == '猫', sentence))
    if len(cat_list) > 0:
        for morpheme in sentence:
            if morpheme['surface'] != '猫':
                cats.append(morpheme['surface'])
cats = collections.Counter(cats)
# most_common() already sorts by descending count
cats = cats.most_common()
# This extra sort is therefore redundant, but harmless
cats = sorted(cats, key=lambda e: e[1], reverse=True)

cats_df = pd.DataFrame(cats[:10], columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(cats_df['word'], cats_df['count'])
# plt.show()
plt.savefig('37.png')

Output result 37.png
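Counted this way, the top co-occurring "words" are dominated by particles and punctuation, as in problem 35. A minimal sketch that restricts the count to content words (the choice of noun, verb, and adjective here is my own, somewhat arbitrary, cutoff):


import collections

CONTENT_POS = {'名詞', '動詞', '形容詞'}  # an arbitrary choice of content-word classes

cat_counts = collections.Counter(
    m['surface']
    for sentence in docs
    if any(m['surface'] == '猫' for m in sentence)
    for m in sentence
    if m['surface'] != '猫' and m['pos'] in CONTENT_POS
)
print(cat_counts.most_common(10))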

38. Histogram

Draw a histogram of word frequency: the horizontal axis is the frequency of occurrence, and the vertical axis is the number of word types that have that frequency, drawn as a bar graph.

code


hist_df = pd.DataFrame(words, columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(hist_df['count'], range=(1, 100))
# plt.show()
plt.savefig('38.png')

Output result 38.png

The horizontal axis is capped at 100. Words that appear only once or a few times are overwhelmingly common, so if the histogram is drawn without this restriction, it collapses into a single bar and cannot be read, as shown below. 38_2.png
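To check this claim, here is a quick sketch that counts how many word types occur exactly once (hapax legomena), reusing the words list from problem 35:


# Count how many word types occur exactly once
hapax = sum(1 for _, count in words if count == 1)
print(hapax, 'of', len(words), 'word types appear only once')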

39. Zipf's Law

Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.

code


zipf_df = pd.DataFrame(words, columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_yscale('log')
ax.set_xscale('log')
# The DataFrame index serves as the (0-based) rank, so this plots rank vs. frequency
ax.plot(zipf_df['count'])
# plt.show()
plt.savefig('39.png')

Output result 39.png

Plotted on log-log axes, the curve comes out as a roughly straight line, which is what Zipf's law predicts.
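For a slightly more explicit version, the following sketch makes the rank axis explicit (1-based, so the first point is not dropped by the log scale) and labels both axes; the output file name is my own:


fig = plt.figure()
ax = fig.add_subplot(111)
# 1-based ranks: the default 0-based index would be dropped on a log axis
ranks = range(1, len(zipf_df) + 1)
ax.loglog(ranks, zipf_df['count'])
ax.set_xlabel('rank')
ax.set_ylabel('frequency')
plt.savefig('39_labeled.png')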

In conclusion

What you can learn in Chapter 4
