The NLP 100 Exercise 2020 (Language Processing 100 Knock) has been released, so I'm trying it right away.
Chapter 4 calls for morphological analysis with MeCab, but while I'm at it I decided to use GiNZA instead (although I only really used GiNZA at the very beginning).
Using MeCab, morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that answer the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
code
import spacy
import pandas as pd
import pprint
from functools import reduce
import collections
import matplotlib.pyplot as plt
import seaborn as sns

with open('neko.txt') as f:
    raw_text = f.read()
#Remove the chapter heading ("一") at the beginning
raw_text = raw_text.replace('一\n\n　', '')
nlp = spacy.load('ja_ginza')
doc = nlp(raw_text)
#Note: mode 'a' appends, so delete the file before re-running
with open('neko.txt.ginza', 'a') as f2:
    for sent in doc.sents:
        for token in sent:
            #Write the token index, surface form, lemma, and part of speech
            f2.write(','.join([str(token.i), token.orth_, token.lemma_, token.tag_]) + '\n')
GiNZA can apparently produce the same output as MeCab on the command line, but I couldn't find a Python API for that format, so I just wrote the token attributes out as-is.
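For reference, here is a minimal sketch of how one might approximate MeCab's `surface\tfeatures` style from the spaCy token attributes. This is my own approximation, not an official GiNZA feature, and it does not reproduce MeCab's exact column set:

code
import spacy

nlp = spacy.load('ja_ginza')
doc = nlp('吾輩は猫である。')
for token in doc:
    # token.tag_ is a UniDic-style POS path such as '名詞-普通名詞-一般'
    print(token.orth_ + '\t' + token.tag_.replace('-', ',') + ',' + token.lemma_)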
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1) as keys, and represent one sentence as a list of these morpheme mappings. Use this program for the remaining problems in Chapter 4.
code
neko_df = pd.read_csv('neko.txt.ginza', header=None)
docs = []
sentence = []
for row in neko_df.itertuples():
    #token.tag_ looks like '名詞-普通名詞-一般': the first element is pos, the rest pos1
    pos, *pos1 = row[4].split('-')
    neko_dict = {
        'surface': row[2],
        'base': row[3],
        'pos': pos,
        'pos1': pos1
    }
    sentence.append(neko_dict)
    #Split sentences at '。'
    if row[2] == '。':
        docs.append(sentence)
        sentence = []
pprint.pprint(docs[0])
Output result
[{'base': '吾輩', 'pos': '代名詞', 'pos1': [], 'surface': '吾輩'},
 {'base': 'は', 'pos': '助詞', 'pos1': ['係助詞'], 'surface': 'は'},
 {'base': '猫', 'pos': '名詞', 'pos1': ['普通名詞', '一般'], 'surface': '猫'},
 {'base': 'だ', 'pos': '助動詞', 'pos1': [], 'surface': 'で'},
 {'base': '有る', 'pos': '動詞', 'pos1': ['非自立可能'], 'surface': 'ある'},
 {'base': '。', 'pos': '補助記号', 'pos1': ['句点'], 'surface': '。'}]
Extract all surface forms of verbs.
code
surfaces = []
for sentence in docs:
    for morpheme in sentence:
        surfaces.append(morpheme['surface'])
print(surfaces[:30])
Output result
['吾輩', 'は', '猫', 'で', 'ある', '。', '名前', 'は', 'まだ', '無い', '。', 'どこ', 'で', '生れ', 'た', 'か', 'と', 'んと', '見当', 'が', 'つか', 'ぬ', '。', '何', 'で', 'も', '薄暗い', 'じめじめ', 'し', 'た']
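Note that the code above collects every token, while the problem itself asks only for verbs. A minimal filtering sketch over the same docs structure (assuming the UniDic POS tag '動詞'):

code
# Sketch: keep only surface forms whose POS is verb (動詞)
verb_surfaces = [m['surface'] for sentence in docs for m in sentence if m['pos'] == '動詞']
print(verb_surfaces[:30])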
Extract all base forms of verbs.
code
bases = []
for sentence in docs:
    for morpheme in sentence:
        bases.append(morpheme['base'])
print(bases[:30])
Output result
['吾輩', 'は', '猫', 'だ', '有る', '。', '名前', 'は', '未だ', '無い', '。', 'どこ', 'で', '生れる', 'た', 'か', 'と', 'うんと', '見当', 'が', '付く', 'ず', '。', '何', 'で', 'も', '薄暗い', 'じめじめ', 'する', 'た']
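As with problem 31, the same verb filter applies to the base forms; a one-line sketch:

code
# Sketch: base forms restricted to verbs (動詞)
verb_bases = [m['base'] for sentence in docs for m in sentence if m['pos'] == '動詞']
print(verb_bases[:30])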
Extract noun phrases in which two nouns are connected by "の".
code
nouns = []
for sentence in docs:
    for i in range(len(sentence) - 2):
        if sentence[i]['pos'] == '名詞' and sentence[i + 1]['surface'] == 'の' and sentence[i + 2]['pos'] == '名詞':
            nouns.append(sentence[i]['surface'] + sentence[i + 1]['surface'] + sentence[i + 2]['surface'])
print(nouns[:30])
Output result
['掌の上', '書生の顔', 'ものの見', 'はずの顔', '顔の真中', '穴の中', '書生の掌', '掌の裏', '今までの所', '藁の上', '笹原の中', '池の前', '池の上', '一樹の蔭', '垣根の穴', '隣家の三毛', '時の経過', '一刻の猶予', '家の内', '以外の人間', '前の書生', 'おさんの隙', 'おさんの三', '胸の痞', '家の主人', '主人の方', 'なしの小猫', '鼻の下', '自分の住家', '家のもの']
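As an aside, the index arithmetic can be avoided by zipping three offset views of each sentence; a sketch of the same check:

code
nouns_alt = []
for sentence in docs:
    # a, b, c walk over consecutive morpheme triples
    for a, b, c in zip(sentence, sentence[1:], sentence[2:]):
        if a['pos'] == '名詞' and b['surface'] == 'の' and c['pos'] == '名詞':
            nouns_alt.append(a['surface'] + b['surface'] + c['surface'])
print(nouns_alt[:30])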
Extract concatenations of nouns (the longest runs of consecutive nouns).
code
nouns2 = []
for sentence in docs:
    word = ''
    count = 0
    for morpheme in sentence:
        if morpheme['pos'] == '名詞':
            word += morpheme['surface']
            count += 1
        else:
            #A run of two or more nouns ends here
            if count >= 2:
                nouns2.append(word)
            word = ''
            count = 0
print(nouns2[:30])
Output result
['見始', '時妙', '一毛', '後猫', '今まで', '内池', '書生以外', '真中', '宿なし', 'まま奥', '終日書生', '大抵', '時々忍び足', '二三ページ', '主人以外', '朝主人', '椽側', '一間', '神経胃弱', '時々同衾', '言葉切り', '妻君', '先日玉', '一部始終', 'いくら人間', '迷亭君', '空宗盛', '一月', '月給日', '水彩絵具']
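The run-length bookkeeping can also be expressed with itertools.groupby, grouping consecutive morphemes by whether they are nouns; a sketch:

code
from itertools import groupby

nouns_runs = []
for sentence in docs:
    # Group consecutive morphemes by the boolean "is this a noun?"
    for is_noun, group in groupby(sentence, key=lambda m: m['pos'] == '名詞'):
        run = list(group)
        if is_noun and len(run) >= 2:
            nouns_runs.append(''.join(m['surface'] for m in run))
print(nouns_runs[:30])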
Find the words that appear in the text and their frequencies of appearance, and arrange them in descending order of frequency.
code
#Flatten the two-dimensional list of sentences into one list of morphemes
words = reduce(list.__add__, docs)
#Count the occurrences of each surface form
words = collections.Counter(map(lambda e: e['surface'], words))
#most_common() already returns (word, count) pairs in descending order of frequency
words = words.most_common()
#(so this extra sort is redundant, but harmless)
words = sorted(words, key=lambda e: e[1], reverse=True)
print(words[:30])
Output result
[('の', 9546), ('。', 7486), ('て', 7401), ('に', 7047), ('、', 6772), ('は', 6485), ('と', 6150), ('を', 6118), ('が', 5395), ('で', 4542), ('た', 3975), ('「', 3238), ('」', 3238), ('も', 3229), ('だ', 2705), ('し', 2530), ('ない', 2423), ('から', 2213), ('か', 2041), ('ある', 1729), ('ん', 1625), ('な', 1600), ('いる', 1255), ('事', 1214), ('する', 1056), ('もの', 1005), ('何', 998), ('です', 978), ('君', 967), ('云う', 937)]
Display the 10 words with the highest frequency of appearance and their frequencies in a graph (for example, a bar graph).
code
words_df = pd.DataFrame(words[:10], columns=['word', 'count'])
sns.set(font='AppleMyungjo')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(words_df['word'], words_df['count'])
plt.show()
Output result
Display the 10 words that most often co-occur with "猫" (i.e., have the highest co-occurrence frequency) together with their frequencies in a graph (for example, a bar graph).
code
cats = []
for sentence in docs:
    #Collect all other words from sentences that contain '猫'
    cat_list = list(filter(lambda e: e['surface'] == '猫', sentence))
    if len(cat_list) > 0:
        for morpheme in sentence:
            if morpheme['surface'] != '猫':
                cats.append(morpheme['surface'])
cats = collections.Counter(cats)
#most_common() already returns the co-occurring words in descending order of frequency
cats = cats.most_common()
cats = sorted(cats, key=lambda e: e[1], reverse=True)
cats_df = pd.DataFrame(cats[:10], columns=['word', 'count'])
sns.set(font='AppleMyungjo')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(cats_df['word'], cats_df['count'])
# plt.show()
plt.savefig('37.png')
Output result
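Because whole sentences are used as the co-occurrence window, the top 10 is dominated by particles and punctuation. One possible variation (my own tweak, not part of the problem) is to count only content words:

code
import collections

content_pos = {'名詞', '動詞', '形容詞'}
cats_content = []
for sentence in docs:
    if any(m['surface'] == '猫' for m in sentence):
        #Keep only nouns, verbs, and adjectives, excluding '猫' itself
        cats_content.extend(m['surface'] for m in sentence
                            if m['surface'] != '猫' and m['pos'] in content_pos)
print(collections.Counter(cats_content).most_common(10))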
Draw a histogram of word frequencies (the horizontal axis is the frequency of occurrence, and the vertical axis is the number of distinct words with that frequency, drawn as a bar graph).
code
hist_df = pd.DataFrame(words, columns=['word', 'count'])
sns.set(font='AppleMyungjo')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(hist_df['count'], range=(1, 100))
# plt.show()
plt.savefig('38.png')
Output result
The horizontal axis is limited to 100. Words that appear only around once are overwhelmingly numerous, so if you plot without this restriction, everything collapses into the first bin and the histogram cannot be read properly.
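Instead of clipping the range, one could also put the count axis on a log scale so the long tail stays visible; a sketch reusing hist_df (the output file name is my own choice):

code
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(hist_df['count'], bins=100)
ax.set_yscale('log')  # log counts keep both the head and the long tail visible
plt.savefig('38_log.png')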
Plot a log-log graph with the word frequency rank on the horizontal axis and the frequency of occurrence on the vertical axis.
code
zipf_df = pd.DataFrame(words, columns=['word', 'count'])
sns.set(font='AppleMyungjo')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_yscale('log')
ax.set_xscale('log')
#The DataFrame index (0, 1, 2, ...) serves as the frequency rank on the x-axis
ax.plot(zipf_df['count'])
# plt.show()
plt.savefig('39.png')
Output result
The shape of the distribution only becomes visible once it is plotted on logarithmic axes.
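This is Zipf's law: frequency is roughly proportional to 1/rank, so the log-log plot approaches a straight line with slope around -1. A quick sanity check (a sketch, assuming numpy is installed) fits that slope:

code
import numpy as np

counts = zipf_df['count'].to_numpy()
ranks = np.arange(1, len(counts) + 1)
# Zipf's law predicts a log-log slope near -1
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(slope)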