[PYTHON] 100 Language Processing Knock 2020 with GiNZA v3.1 Chapter 4

Introduction

The 100 Language Processing Knock 2020 has been released, so I decided to try it right away.

In Chapter 4, morphological analysis is supposed to be done with MeCab. Since this is a good opportunity, however, I will do it with GiNZA instead (although, as it turned out, I only used GiNZA at the very beginning).

Chapter 4: Morphological analysis

Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

code


import spacy
import pandas as pd
import pprint
from functools import reduce
import collections
import matplotlib.pyplot as plt
import seaborn as sns

with open('neko.txt') as f:
    raw_text = f.read()
    # Remove the extra characters (the chapter marker) at the beginning
    raw_text = raw_text.replace('一\n\n ', '')
    nlp = spacy.load('ja_ginza')
    doc = nlp(raw_text)
    with open('neko.txt.ginza', 'w') as f2:
        for sent in doc.sents:
            for token in sent:
                # Write the token index, surface form, lemma, and POS tag, comma-separated
                f2.write(','.join([str(token.i), token.orth_, token.lemma_, token.tag_]) + '\n')

GiNZA can apparently produce MeCab-style output on the command line, but since I could not find a corresponding API in Python, I simply wrote out my own format.
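For reference, here is a minimal sketch of how one might approximate MeCab's "surface, tab, features" layout from GiNZA tokens in Python. The feature order is my own choice, not MeCab's exact dictionary format, so treat it purely as an illustration.


import spacy

nlp = spacy.load('ja_ginza')

def to_mecab_like(doc):
    """Yield lines roughly shaped like MeCab output: surface, a tab, then features."""
    for sent in doc.sents:
        for token in sent:
            # tag_ holds the UniDic-style POS hierarchy, e.g. '名詞-普通名詞-一般'
            features = token.tag_.replace('-', ',')
            yield f'{token.orth_}\t{features},{token.lemma_}'
        yield 'EOS'  # MeCab marks the end of each sentence with EOS

for line in to_mecab_like(nlp('吾輩は猫である。')):
    print(line)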

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping type whose keys are the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1), and represent each sentence as a list of such morphemes (mapping types). Use the program created here for the remaining problems in Chapter 4.

code


neko_df = pd.read_csv('neko.txt.ginza', header=None)

docs = []
sentence = []
for row in neko_df.itertuples():
    # The tag looks like '名詞-普通名詞-一般'; the first element is pos, the rest pos1
    pos, *pos1 = row[4].split('-')
    neko_dict = {
        'surface': row[2],
        'base': row[3],
        'pos': pos,
        'pos1': pos1
    }
    sentence.append(neko_dict)
    # Split sentences at the full stop '。'
    if row[2] == '。':
        docs.append(sentence)
        sentence = []
pprint.pprint(docs[0])

Output result


[{'base': '吾輩', 'pos': '代名詞', 'pos1': [], 'surface': '吾輩'},
 {'base': 'は', 'pos': '助詞', 'pos1': ['係助詞'], 'surface': 'は'},
 {'base': '猫', 'pos': '名詞', 'pos1': ['普通名詞', '一般'], 'surface': '猫'},
 {'base': 'だ', 'pos': '助動詞', 'pos1': [], 'surface': 'で'},
 {'base': 'ある', 'pos': '動詞', 'pos1': ['非自立可能'], 'surface': 'ある'},
 {'base': '。', 'pos': '補助記号', 'pos1': ['句点'], 'surface': '。'}]
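The problem statement treats pos1 as a single value, whereas the code above keeps the remaining tag elements as a list (an empty one when there is no subdivision). If a plain string is preferred, a minimal variant (the helper name is my own) could look like this:


def to_morpheme(surface, base, tag):
    """Build the morpheme mapping with pos1 as one string such as '普通名詞-一般'."""
    pos, *rest = tag.split('-')
    return {
        'surface': surface,
        'base': base,
        'pos': pos,
        'pos1': '-'.join(rest)  # empty string when there is no subdivision
    }

print(to_morpheme('猫', '猫', '名詞-普通名詞-一般'))
# {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '普通名詞-一般'}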

31. Verbs

Extract all the surface forms of verbs.

code


surfaces = []
for sentence in docs:
    for morpheme in sentence:
        # Note: this collects the surface form of every morpheme, not only verbs
        surfaces.append(morpheme['surface'])
print(surfaces[:30])

Output result


['吾輩', 'は', '猫', 'で', 'ある', '。', '名前', 'は', 'まだ', '無い', '。', 'どこ', 'で', '生れ', 'た', 'か', 'と', 'んと', '見当', 'が', 'つか', 'ぬ', '。', '何', 'で', 'も', '薄暗い', 'じめじめ', 'し', 'た']

32. Base forms of verbs

Extract all the base forms of verbs.

code


bases = []
for sentence in docs:
    for morpheme in sentence:
        # Note: this collects the base form of every morpheme, not only verbs
        bases.append(morpheme['base'])
print(bases[:30])

Output result


['吾輩', 'は', '猫', 'だ', 'ある', '。', '名前', 'は', '未だ', '無い', '。', 'どこ', 'で', '生れる', 'た', 'か', 'と', 'うんと', '見当', 'が', '付く', 'ず', '。', '何', 'で', 'も', '薄暗い', 'じめじめ', 'する', 'た']

33. "B of A"

Extract a noun phrase in which two nouns are connected by "no".

code


nouns = []
for sentence in docs:
    for i in range(len(sentence) - 2):
        # Noun + の + noun
        if sentence[i]['pos'] == '名詞' and sentence[i + 1]['surface'] == 'の' and sentence[i + 2]['pos'] == '名詞':
            nouns.append(sentence[i]['surface'] + sentence[i + 1]['surface'] + sentence[i + 2]['surface'])
print(nouns[:30])

Output result


['On the palm', 'Student's face', 'Seeing things', 'Should face', 'In the middle of the face', 'In the hole', 'Calligraphy palm', 'The back of the palm', 'So far', 'On the straw', 'In Sasahara', 'In front of the pond', 'On the pond', 'Thanks to Kazuki', 'Hedge hole', 'Neighboring calico cat', 'Passage of time', 'Momentary grace', 'Inside the house', 'Humans other than', 'Previous student', 'Your chance', 'Three of you', 'Chest itching', 'Housekeeper', 'Master', 'None little cat', 'Under the nose', 'My home', 'Home stuff']
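As an equivalent sketch, the index arithmetic can be replaced with zip over consecutive triples of morphemes, which does the same thing but avoids manual indexing:


nouns = []
for sentence in docs:
    # zip yields every consecutive (a, b, c) triple in the sentence
    for a, b, c in zip(sentence, sentence[1:], sentence[2:]):
        if a['pos'] == '名詞' and b['surface'] == 'の' and c['pos'] == '名詞':
            nouns.append(a['surface'] + b['surface'] + c['surface'])
print(nouns[:30])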

34. Noun concatenation

Extract the longest runs of consecutively occurring nouns (noun concatenations).

code


nouns2 = []
for sentence in docs:
    word = ''
    count = 0
    for morpheme in sentence:
        if morpheme['pos'] == '名詞':
            word += morpheme['surface']
            count += 1
        else:
            # A run of two or more nouns counts as a concatenation
            if count >= 2:
                nouns2.append(word)
            word = ''
            count = 0
print(nouns2[:30])

Output result


['Start', 'Timely', 'One hair', 'Rear cat', 'Up to now', 'Uchiike', 'Other than student', 'Mao', 'No inn', 'Mama back', 'All-day study', 'Almost', 'Sometimes stealth', 'A few pages', 'Other than my husband', 'Morning master', 'Sou side', 'One ken', 'Nerve stomach weakness', 'Sometimes the same', 'Language break', 'My wife', 'The other day ball', 'The whole story', 'No matter how human', 'Mr.', 'Sora Munemori', 'January', 'Monthly salary date', 'Watercolor paint']
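A minimal alternative sketch using itertools.groupby. Unlike the loop above, it also catches a noun run at the very end of a sentence, which only matters if a sentence does not end with '。':


from itertools import groupby

nouns2 = []
for sentence in docs:
    # Group consecutive morphemes by whether or not they are nouns
    for is_noun, group in groupby(sentence, key=lambda m: m['pos'] == '名詞'):
        run = list(group)
        if is_noun and len(run) >= 2:
            nouns2.append(''.join(m['surface'] for m in run))
print(nouns2[:30])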

35. Frequency of word occurrence

Find the words that appear in the text and their frequencies of occurrence, and arrange them in descending order of frequency.

code


# Flatten the two-dimensional list of sentences into one list of morphemes
words = reduce(list.__add__, docs)
# Count the occurrences of each surface form
words = collections.Counter(map(lambda e: e['surface'], words))
# most_common() already returns (word, count) pairs in descending order of count
words = words.most_common()
# This extra sort is therefore redundant, but harmless
words = sorted(words, key=lambda e: e[1], reverse=True)
print(words[:30])
print(words[:30])

Output result


[('の', 9546), ('。', 7486), ('て', 7401), ('に', 7047), ('、', 6772), ('は', 6485), ('と', 6150), ('を', 6118), ('が', 5395), ('で', 4542), ('た', 3975), ('「', 3238), ('」', 3238), ('も', 3229), ('だ', 2705), ('し', 2530), ('ない', 2423), ('から', 2213), ('か', 2041), ('ある', 1729), ('ん', 1625), ('な', 1600), ('い', 1255), ('事', 1214), ('する', 1056), ('もの', 1005), ('何', 998), ('です', 978), ('君', 967), ('云う', 937)]
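As an aside, reduce(list.__add__, ...) builds a new list at every step, so flattening this way is quadratic in the number of sentences. A sketch of the usual linear-time idiom with itertools:


import collections
import itertools

# chain.from_iterable flattens the list of sentences lazily, in linear time
counter = collections.Counter(
    m['surface'] for m in itertools.chain.from_iterable(docs)
)
print(counter.most_common(30))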

36. Top 10 most frequent words

Display the 10 most frequently occurring words and their frequencies of occurrence in a graph (for example, a bar graph).

code


words_df = pd.DataFrame(words[:10], columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(words_df['word'], words_df['count'])
# plt.show()
plt.savefig('36.png')

Output result 36.png
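The plot assumes the AppleMyungjo font, which ships with macOS; without a font that covers Japanese glyphs, matplotlib draws the axis labels as empty boxes. One portable alternative, assuming the third-party japanize-matplotlib package is installed, is a sketch like:


import matplotlib.pyplot as plt
import japanize_matplotlib  # registers a bundled Japanese font (IPAexGothic) with matplotlib

plt.bar(['猫', '犬'], [2, 1])
plt.savefig('sample.png')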

37. Top 10 words that frequently co-occur with "cat"

Display 10 words that often co-occur with "cat" (i.e., have a high co-occurrence frequency) together with their frequencies of occurrence in a graph (for example, a bar graph).

code


cats = []
for sentence in docs:
    # Keep only sentences that contain '猫'
    cat_list = list(filter(lambda e: e['surface'] == '猫', sentence))
    if len(cat_list) > 0:
        for morpheme in sentence:
            if morpheme['surface'] != '猫':
                cats.append(morpheme['surface'])
cats = collections.Counter(cats)
# most_common() already sorts by descending count
cats = cats.most_common()
# This extra sort is therefore redundant, but harmless
cats = sorted(cats, key=lambda e: e[1], reverse=True)

cats_df = pd.DataFrame(cats[:10], columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(cats_df['word'], cats_df['count'])
# plt.show()
plt.savefig('37.png')

Output result 37.png
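Counted this way, the top co-occurring "words" are dominated by particles and punctuation, as in problem 35. A minimal sketch that restricts the count to content words (the choice of noun, verb, and adjective here is my own, somewhat arbitrary, cutoff):


import collections

CONTENT_POS = {'名詞', '動詞', '形容詞'}  # an arbitrary choice of content-word classes

cat_counts = collections.Counter(
    m['surface']
    for sentence in docs
    if any(m['surface'] == '猫' for m in sentence)
    for m in sentence
    if m['surface'] != '猫' and m['pos'] in CONTENT_POS
)
print(cat_counts.most_common(10))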

38. Histogram

Draw a histogram of word frequency: the horizontal axis is the frequency of occurrence, and the vertical axis is the number of word types that have that frequency, drawn as a bar graph.

code


hist_df = pd.DataFrame(words, columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(hist_df['count'], range=(1, 100))
# plt.show()
plt.savefig('38.png')

Output result 38.png

The horizontal axis is capped at 100. Words that appear only once or a few times are overwhelmingly common, so if the histogram is drawn without this restriction, it collapses into a single bar and cannot be read, as shown below. 38_2.png
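To check this claim, here is a quick sketch that counts how many word types occur exactly once (hapax legomena), reusing the words list from problem 35:


# Count how many word types occur exactly once
hapax = sum(1 for _, count in words if count == 1)
print(hapax, 'of', len(words), 'word types appear only once')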

39. Zipf's Law

Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.

code


zipf_df = pd.DataFrame(words, columns=['word', 'count'])
sns.set(font='AppleMyungjo')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_yscale('log')
ax.set_xscale('log')
# The DataFrame index serves as the (0-based) rank, so this plots rank vs. frequency
ax.plot(zipf_df['count'])
# plt.show()
plt.savefig('39.png')

Output result 39.png

Plotted on log-log axes, the curve comes out as a roughly straight line, which is what Zipf's law predicts.
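For a slightly more explicit version, the following sketch makes the rank axis explicit (1-based, so the first point is not dropped by the log scale) and labels both axes; the output file name is my own:


fig = plt.figure()
ax = fig.add_subplot(111)
# 1-based ranks: the default 0-based index would be dropped on a log axis
ranks = range(1, len(zipf_df) + 1)
ax.loglog(ranks, zipf_df['count'])
ax.set_xlabel('rank')
ax.set_ylabel('frequency')
plt.savefig('39_labeled.png')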

In conclusion

What you can learn in Chapter 4
