Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
--This article contains the results of a student, an amateur in both language processing and Python, solving 100 Language Processing Knock 2020. I would be very pleased if you pointed out any mistakes or improvements. ~~I'm following PyCharm's inspections to study Python, so there may be a lot of superfluous code.~~
--This time I'm using Atom, because I couldn't solve a problem where "" could not be entered in PyCharm.
--Chapters 1 to 3 are skipped.
**Environment**

--MacBook Pro (13-inch, 2016, Thunderbolt 3 ports x 2)
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab.
So, execute the following program to create it.
pre_processing.py

```python
# -*- coding: utf-8 -*-
import MeCab

analyser = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")

with open('./neko.txt', "r") as infile:
    lines = infile.readlines()

with open('./neko.txt.mecab.txt', "w") as outfile:
    for line in lines:
        outfile.write(analyser.parse(line))
```
**Note**

--Looking at the output morphemes, "I am a cat" in the first line is identified as a proper noun. I don't mind.
--The output format (according to MeCab) is

    surface\tPOS,POS subdivision 1,POS subdivision 2,POS subdivision 3,conjugation type,conjugated form,base form,reading,pronunciation

so each line should have 10 elements, but occasionally there was a line with 8 elements, as shown below (there were no 9-element lines):

    ['Neck muscle', 'Noun', 'General', '*', '*', '*', '*', '*\n']
    ['Girigo', 'Noun', 'General', '*', '*', '*', '*', '*\n']
    ['Mug', 'Noun', 'General', '*', '*', '*', '*', '*\n']

I suppose the trailing reading and pronunciation fields are simply omitted when they are unknown.
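If one wanted to guard against such short lines when parsing, the fields could be padded to the full 10 elements before indexing. A minimal sketch (the function name and padding behavior are my own, not part of the scripts in this article):

```python
def parse_mecab_line(line):
    """Split one MeCab (ipadic-style) output line into exactly 10 elements,
    padding missing trailing fields (e.g. reading/pronunciation) with '*'."""
    surface, _, feature = line.rstrip("\n").partition("\t")
    fields = feature.split(",")
    fields += ["*"] * (9 - len(fields))  # pad the feature part to 9 fields
    return [surface] + fields

# An 8-element line like the ones above becomes a safe 10-element list.
print(parse_mecab_line("Mug\tNoun,General,*,*,*,*,*"))
```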
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1) as keys, and represent one sentence as a list of such mappings. Use the program created here for the rest of the problems in Chapter 4.
k30input.py

```python
#! /usr/bin/env python3
# -*- coding: utf-8 -*-
import re
import sys


def input_macab(filename):
    with open(filename, "r") as infile:
        sentences = []
        sentence = []
        for line in infile.readlines():
            # surface\tPOS,POS subdivision 1,POS subdivision 2,POS subdivision 3,
            # conjugation type,conjugated form,base form,reading,pronunciation
            if line == 'EOS\n':
                if len(sentence) > 0:
                    sentences.append(sentence)
                    sentence = []
                continue
            sline = re.split('[,\t]', line)
            if len(sline) < 8:
                print("### Read error:\n", sline, "\n")
                sys.exit(1)
            sentence.append({'surface': sline[0], 'base': sline[7],
                             'pos': sline[1], 'pos1': sline[2]})
    print("** Loading completed **\n")
    return sentences


if __name__ == '__main__':
    filename = "neko.txt.mecab.txt"
    ss = input_macab(filename)
    print("")
    print("It was run as main.")
```
**Note**

--Since it is used in later problems, I made it a function.
--pos stands for part of speech.
Extract all the surface forms of the verb.
k31verb_surface.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
for sentence in sentences:
    for mor in sentence:
        if mor['pos'] == "動詞":  # verb
            print(mor['surface'])
```
Extract all the original forms of the verb.
k32verb_base.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
for sentence in sentences:
    for mor in sentence:
        if mor['pos'] == "動詞":  # verb
            print(mor['base'])
```
Extract noun phrases in which two nouns are connected by "の" (no).
k33noun_no_noun.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
noun_flag = 0
no_flag = 0
noun1 = ""
for sentence in sentences:
    for mor in sentence:
        if noun_flag == 0:
            if mor['pos'] == "名詞":  # noun
                noun_flag = 1
                noun1 = mor['surface']
        elif noun_flag == 1 and no_flag == 0:
            if mor['surface'] == "の":
                no_flag = 1
            elif mor['pos'] == "名詞":
                # Another noun follows directly; it becomes the new candidate.
                noun1 = mor['surface']
            else:
                noun1 = ""
                noun_flag = no_flag = 0
        elif noun_flag == 1 and no_flag == 1:
            if mor['pos'] == "名詞":
                print(noun1 + "の" + mor['surface'])
            noun_flag = no_flag = 0
```
Extract the longest runs of nouns (nouns that appear consecutively).
k34nounoun_longest.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
nouns = []
for sentence in sentences:
    for mor in sentence:
        if mor['pos'] == "名詞":  # noun
            nouns.append(mor['surface'])
        else:
            if len(nouns) > 1:
                print(" ".join(nouns))
            nouns = []
    # Flush a run that ends at the sentence boundary.
    if len(nouns) > 1:
        print(" ".join(nouns))
    nouns = []
```
**Note**

--The output puts a space between the nouns for readability.
--I was briefly confused because I had forgotten that MeCab judges the same surface form, such as "finally", as either an adverb or a noun depending on context.
Find the words that appear in the text and their frequencies, and arrange them in descending order of frequency.
k35word_freq.py

```python
# -*- coding: utf-8 -*-
import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
for i in ranking:
    print(i)
```
**Note**

--Morphemes with the same surface form but different parts of speech are counted separately: for example, the adverb "finally" and the noun "finally".
Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart).
k36word10_graph.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
top10 = ranking[0:10]
x = []
y = []
for i in top10:
    x.append(i[0][0])  # surface form
    y.append(i[1])     # frequency
pyplot.bar(x, y)
# Graph title
pyplot.title('Top 10 most frequent words')
# Axis labels
pyplot.xlabel('morpheme')
pyplot.ylabel('frequency')
pyplot.show()
```
**Note**

--The default matplotlib font cannot display Japanese, so I switched the font.
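For reference, a minimal sketch of such a font switch via rcParams; the font name 'Hiragino Sans' is my assumption for macOS, and you would substitute whatever Japanese-capable font is installed on your system:

```python
from matplotlib import pyplot, rcParams

# Assumed font name (macOS); any installed Japanese-capable font works here.
rcParams['font.family'] = 'Hiragino Sans'

pyplot.title('吾輩は猫である')  # Japanese now renders instead of tofu boxes
```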
Display 10 words that frequently co-occur with "cat" (high co-occurrence frequency) and their frequencies in a graph (for example, a bar chart).
k37co_cat.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")  # neko.txt.mecab.clean.txt
tmp_count = dict()
co_cat_count = dict()
cat_flag = 0
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        tmp_count[key] = tmp_count.get(key, 0) + 1
        if mor['surface'] == "猫":  # "cat"
            cat_flag = 1
    if cat_flag == 1:
        # This sentence contains "cat": add its counts to the co-occurrence totals.
        for k, v in tmp_count.items():
            co_cat_count[k] = co_cat_count.get(k, 0) + v
    cat_flag = 0
    tmp_count = {}
ranking = sorted(co_cat_count.items(), key=lambda i: i[1], reverse=True)
top10 = ranking[0:10]
x = []
y = []
for i in top10:
    x.append(i[0][0])  # surface form
    y.append(i[1])     # co-occurrence frequency
pyplot.bar(x, y)
# Graph title
pyplot.title('Top 10 words that frequently co-occur with "cat"')
# Axis labels
pyplot.xlabel('morpheme')
pyplot.ylabel('frequency')
pyplot.show()
```
**Note**

Co-occurrence means that when a certain word appears in a sentence (or text), certain other words tend to appear in the same sentence frequently. [^ 1]

--Here it means the top 10 words (morphemes) that appear most often in sentences that contain "cat".
--"Cat" itself should arguably be excluded, but I left it in: it is interesting as a reference point, and it is easy to remove if you want to.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| of | Is | 、 | Cat | To | To | hand | 。 | When | But |
Punctuation, particles, and auxiliary verbs are also counted as words, so the result doesn't feel specific to "cat": any word whatsoever will co-occur with these, so the ranking is uninteresting.
So I prepared a file, neko.txt.mecab.clean.txt, that excludes punctuation, particles, and auxiliary verbs from neko.txt.mecab, and got the following result instead:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| Cat | Shi | Thing | I | of | Is | is there | To do | Human | この |
--Although this is a little better, I still don't feel that the characteristics of "cat" have been captured.
--I wonder if there are many sentences along the lines of "yes, yes, do".
--"の" is judged as a noun, so it is annoying that it ranks in without being excluded.
--What kind of noun is "の"? A: cases like "field" (野) and the formal noun (as in "mine"), etc. [^ 2]
--If you computed the co-occurrence frequencies of all such words and excluded them as well, you could find more meaningful co-occurrences, but it is a hassle.
--The only meaningful information obtained from the current co-occurrence result is that in the famous book "I Am a Cat", "cat" and "human" co-occur often.
--I wonder if "I Am a Cat" contains many sentences that contrast humans and cats.
--The cat's first person being "I" (wagahai) is, of course, a very famous fact.
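For reference, the cleaning step mentioned above could be sketched roughly as follows; the function name and the exact IPAdic POS labels to drop (記号 = symbol/punctuation, 助詞 = particle, 助動詞 = auxiliary verb) are my assumptions about what was filtered:

```python
import re

EXCLUDED_POS = {'記号', '助詞', '助動詞'}  # punctuation, particles, auxiliary verbs

def clean_mecab_file(src, dst):
    """Copy a MeCab output file, dropping lines whose POS is excluded."""
    with open(src) as infile, open(dst, 'w') as outfile:
        for line in infile:
            if line == 'EOS\n':
                outfile.write(line)
                continue
            fields = re.split('[,\t]', line)
            if len(fields) < 2 or fields[1] not in EXCLUDED_POS:
                outfile.write(line)

# Usage: clean_mecab_file('neko.txt.mecab.txt', 'neko.txt.mecab.clean.txt')
```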
Draw a histogram of word frequencies. The horizontal axis is frequency of occurrence, on a linear scale from 1 to the maximum word frequency. The vertical axis is the number of distinct words (types) that occur with the frequency shown on the x-axis.
k38histogram.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")  # neko.txt.mecab.clean.txt
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
x = []
for i in ranking:
    x.append(i[1])  # one entry per word type: its frequency
pyplot.hist(x, range=(1, ranking[0][1]))
# Graph title
pyplot.title('Frequency of word occurrence')
# Axis labels
pyplot.xlabel('Frequency of appearance')
pyplot.ylabel('Number of types')
pyplot.show()
```
**Note**

--Another Qiita contributor used frequency of appearance for the vertical axis, but the vertical axis here is the number of distinct words (types) that appear with that frequency.
--A histogram with a logarithmic axis looks like this:
Plot a log-log graph with word frequency rank on the horizontal axis and frequency of occurrence on the vertical axis.
k39loglog_graph.py

```python
# -*- coding: utf-8 -*-
from matplotlib import pyplot
import numpy as np

import k30input

sentences = k30input.input_macab("neko.txt.mecab.txt")  # neko.txt.mecab.clean.txt
mor_freq = dict()
for sentence in sentences:
    for mor in sentence:
        # Key: a (surface form, part of speech) tuple; value: occurrence count.
        key = (mor['surface'], mor['pos'])
        mor_freq[key] = mor_freq.get(key, 0) + 1
ranking = sorted(mor_freq.items(), key=lambda i: i[1], reverse=True)
y = []
for i in ranking:
    y.append(i[1])
x = range(1, len(ranking) + 1)  # ranks start at 1 so that log(x) is defined
print("size", len(ranking))
pyplot.title('Frequency of word occurrence')
pyplot.xlabel('Occurrence frequency rank, log(x)')
pyplot.ylabel('Frequency of appearance, log(y)')
# I didn't know how to give a scatter plot logarithmic axes, so I cheated
# only here by log-transforming the values. That's why numpy is used.
pyplot.scatter(np.log(x), np.log(y))
pyplot.show()
```
**Note**

--Zipf's law: an empirical law stating that the proportion of the k-th most frequent element in the whole is proportional to 1/k. [^ 3]
--Apparently this law is not just an empirical rule of natural language; it holds for a variety of phenomena.
--The word frequencies of Wikipedia (30 languages) show a similar shape. [^ 3]
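As a quick sanity check of the 1/k claim, rank × frequency should stay roughly constant; a sketch on toy frequencies (not the actual novel's counts):

```python
# Under Zipf's law, frequency(k) ≈ C / k, so k * frequency(k) ≈ C for all k.
freqs = [1000, 480, 340, 250, 195]  # toy descending frequencies
products = [k * f for k, f in enumerate(freqs, start=1)]
print(products)  # → [1000, 960, 1020, 1000, 975], i.e. roughly constant
```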
This is still within the scope of studying Python. Studying the numpy, pandas, and collections modules felt like a hassle, so I got by without them. But isn't it actually harder not to use them? Also, I wanted to factor the repeated processing into a function to make it cleaner. To be continued next time. (I will definitely do it.)
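Incidentally, the counting loops repeated in the scripts above could be collapsed with collections.Counter, one of the modules I skipped; a sketch with a toy two-sentence input in the same dict shape that input_macab produces:

```python
from collections import Counter

# Toy input shaped like input_macab's output: a list of sentences,
# each a list of morpheme dicts.
sentences = [[{'surface': '猫', 'pos': '名詞'}, {'surface': 'だ', 'pos': '助動詞'}],
             [{'surface': '猫', 'pos': '名詞'}]]

mor_freq = Counter((mor['surface'], mor['pos'])
                   for sentence in sentences for mor in sentence)
ranking = mor_freq.most_common()  # already sorted by descending count
print(ranking[0])  # → (('猫', '名詞'), 2)
```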
[^ 2]: Wiktionary, Japanese edition
[^ 3]: [Zipf's law - Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87)