This is a record of solving "Chapter 4: Morphological Analysis (30-39)" of the 100 Language Processing Knock 2015.
import MeCab
import ngram
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
Using MeCab, morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result to a file called neko.txt.mecab. Use this file to implement programs that address the following questions. For problems 37, 38, and 39, use matplotlib or Gnuplot.
Create a function called make_analyzed_file that performs the morphological analysis and saves the result to a file. It is assumed that neko.txt has been downloaded in advance and placed in the same folder as the script.
def make_analyzed_file(input_file_name: str, output_file_name: str) -> None:
    """
    Morphologically analyze a plain Japanese text file and save the result to a file.
    :param input_file_name: name of the plain Japanese text file
    :param output_file_name: name of the morphologically analyzed output file
    """
    _m = MeCab.Tagger("-Ochasen")
    with open(input_file_name, encoding='utf-8') as input_file:
        with open(output_file_name, mode='w', encoding='utf-8') as output_file:
            output_file.write(_m.parse(input_file.read()))

make_analyzed_file('neko.txt', 'neko.txt.mecab')
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with keys for the surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1), and represent one sentence as a list of such morphemes (mapping type). Use this program for the rest of the problems in Chapter 4.
Simply convert each tab-delimited line of the morphological analysis result to a dictionary and store the dictionaries in morphemes, then store the grouping into sentences in sentences.
def tabbed_str_to_dict(tabbed_str: str) -> dict:
    """
    Convert a tab-delimited string representing one morpheme,
    e.g. "段々	シダイニ	段々	副詞-一般", to a dict.
    :param tabbed_str: tab-delimited string representing a morpheme
    :return: the morpheme as a dict
    """
    elements = tabbed_str.split()
    if 0 < len(elements) < 4:
        return {'surface': elements[0], 'base': '', 'pos': '', 'pos1': ''}
    else:
        return {'surface': elements[0], 'base': elements[1], 'pos': elements[2], 'pos1': elements[3]}
def morphemes_to_sentence(morphemes: list) -> list:
    """
    Group the list of dict-type morphemes into sentences, splitting at kuten (Japanese full stops).
    :param morphemes: list of morphemes represented as dicts
    :return: list of sentences
    """
    sentences = []
    sentence = []
    for morpheme in morphemes:
        sentence.append(morpheme)
        if morpheme['pos1'] == '記号-句点':
            sentences.append(sentence)
            sentence = []
    return sentences
with open('neko.txt.mecab', encoding='utf-8') as file_wrapper:
    morphemes = [tabbed_str_to_dict(line) for line in file_wrapper]

sentences = morphemes_to_sentence(morphemes)

# Check the result
print(morphemes[::100])
print(sentences[::100])
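As a quick sanity check, the same parse-then-group logic can be exercised on a few hand-written ChaSen-style lines. The sample lines and helper names below are invented for illustration; real neko.txt.mecab lines have the same tab-separated column layout (surface, reading, base form, part of speech, ...):

```python
def parse_line(line: str) -> dict:
    # Same splitting as tabbed_str_to_dict above: take the first four columns.
    cols = line.split()
    if 0 < len(cols) < 4:
        return {'surface': cols[0], 'base': '', 'pos': '', 'pos1': ''}
    return {'surface': cols[0], 'base': cols[1], 'pos': cols[2], 'pos1': cols[3]}

def group_sentences(morphemes: list) -> list:
    # Close the current sentence every time a kuten (記号-句点) is seen.
    sentences, sentence = [], []
    for m in morphemes:
        sentence.append(m)
        if m['pos1'] == '記号-句点':
            sentences.append(sentence)
            sentence = []
    return sentences

sample = [
    "吾輩\tワガハイ\t吾輩\t名詞-代名詞-一般",
    "は\tハ\tは\t助詞-係助詞",
    "猫\tネコ\t猫\t名詞-一般",
    "。\t。\t。\t記号-句点",
]
morphs = [parse_line(line) for line in sample]
grouped = group_sentences(morphs)
print(len(grouped))              # 1 sentence
print(grouped[0][0]['surface'])  # 吾輩
```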
Extract all surface forms of verbs. Extract all base forms of verbs. Extract all nouns of sa-hen conjugation (サ変接続).
This is easy using the morphemes list created in "30. Reading the morphological analysis result".
verbs_surface = [morpheme['surface'] for morpheme in morphemes if morpheme['pos1'].find('動詞') == 0]
verbs_base = [morpheme['base'] for morpheme in morphemes if morpheme['pos1'].find('動詞') == 0]
nouns_suru = [morpheme['surface'] for morpheme in morphemes if morpheme['pos1'] == '名詞-サ変接続']
#Check the result
print(verbs_surface[::100])
print(verbs_base[::100])
print(nouns_suru[::100])
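On a hand-made list of morpheme dicts (invented here for illustration), the same prefix-style filters behave like this; note that `pos1.find('動詞') == 0` is equivalent to `pos1.startswith('動詞')`:

```python
morphemes_demo = [
    {'surface': '生れ', 'base': '生れる', 'pos': '動詞', 'pos1': '動詞-自立'},
    {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '名詞-一般'},
    {'surface': '勉強', 'base': '勉強', 'pos': '名詞', 'pos1': '名詞-サ変接続'},
]
# Verbs: pos1 starts with 動詞 (e.g. 動詞-自立)
verbs = [m['surface'] for m in morphemes_demo if m['pos1'].find('動詞') == 0]
# Sa-hen nouns: pos1 is exactly 名詞-サ変接続
suru_nouns = [m['surface'] for m in morphemes_demo if m['pos1'] == '名詞-サ変接続']
print(verbs)       # ['生れ']
print(suru_nouns)  # ['勉強']
```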
Extract noun phrases in which two nouns are connected by "の".
def ngramed_list(lst: list, n: int = 3) -> list:
    """
    Convert a list to N-grams.
    :param lst: list to be N-grammed
    :param n: N (default N=3)
    :return: N-grammed list
    """
    index = ngram.NGram(N=n)
    return [term for term in index.ngrams(lst)]
def is_noun_no_noun(words: list) -> bool:
    """
    Determine whether a list of three words forms a "noun-の-noun" pattern.
    :param words: a list of 3 words
    :return: bool (True: it is "noun-の-noun" / False: it is not)
    """
    return (type(words) == list) and (len(words) == 3) and \
           (words[0]['pos1'].find('名詞') == 0) and \
           (words[1]['surface'] == 'の') and \
           (words[2]['pos1'].find('名詞') == 0)
# Extract only the N-grams that match "noun-の-noun"
noun_no_noun = [ngrams for ngrams in ngramed_list(morphemes) if is_noun_no_noun(ngrams)]
# Take out the surface forms and join them
noun_no_noun = [''.join([word['surface'] for word in ngram]) for ngram in noun_no_noun]
#Check the result
print(noun_no_noun[::100])
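The ngram package is doing very little here; if it is not installed, a list can be N-grammed with plain zip instead (a dependency-free sketch, not what the code above uses):

```python
def ngrams(seq, n=3):
    # Slide a window of size n over seq; zip stops at the shortest slice,
    # so the result has len(seq) - n + 1 tuples.
    return list(zip(*(seq[i:] for i in range(n))))

print(ngrams(['吾輩', 'は', '猫', 'で', 'ある'], 3))
# [('吾輩', 'は', '猫'), ('は', '猫', 'で'), ('猫', 'で', 'ある')]
```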
Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.
def morphemes_to_noun_array(morphemes: list) -> list:
    """
    Group the list of dict-type morphemes into runs of consecutive nouns,
    splitting at kuten or at any non-noun morpheme.
    :param morphemes: list of morphemes represented as dicts
    :return: list of noun concatenations
    """
    nouns_list = []
    nouns = []
    for morpheme in morphemes:
        if morpheme['pos1'].find('名詞') >= 0:
            nouns.append(morpheme)
        elif (morpheme['pos1'] == '記号-句点') | (morpheme['pos1'].find('名詞') < 0):
            nouns_list.append(nouns)
            nouns = []
    return [nouns for nouns in nouns_list if len(nouns) > 1]
noun_array = [''.join([noun['surface'] for noun in nouns]) for nouns in morphemes_to_noun_array(morphemes)]
#Check the result
print(noun_array[::100])
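The same longest-match extraction can also be written with itertools.groupby, which groups consecutive elements sharing a key. The (surface, pos1) pairs below are invented for illustration:

```python
from itertools import groupby

words = [
    ('夏', '名詞-一般'), ('目', '名詞-一般'), ('漱石', '名詞-固有名詞'),
    ('は', '助詞-係助詞'), ('猫', '名詞-一般'), ('だ', '助動詞'),
]
runs = []
# groupby yields maximal runs with the same key, i.e. noun runs are longest matches.
for is_noun, run in groupby(words, key=lambda w: w[1].startswith('名詞')):
    run = list(run)
    if is_noun and len(run) > 1:  # keep only concatenations of 2+ nouns
        runs.append(''.join(surface for surface, _ in run))
print(runs)  # ['夏目漱石']
```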
Find the words that appear in the text and their frequencies, and arrange them in descending order of frequency.
def get_frequency(words: list) -> dict:
    """
    Take a list of words and return a dict with words as keys and frequencies as values.
    :param words: list of words
    :return: dict with word as key and frequency as value
    """
    frequency = {}
    for word in words:
        if frequency.get(word):
            frequency[word] += 1
        else:
            frequency[word] = 1
    return frequency
frequency = get_frequency([morpheme['surface'] for morpheme in morphemes])
# Sort by frequency in descending order
frequency = [(k, v) for k, v in sorted(frequency.items(), key=lambda x: x[1], reverse=True)]
#Check the result
print(frequency[0:20])
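For reference, the standard library's collections.Counter does the counting and sorting in one step, equivalent to get_frequency plus the sort above:

```python
from collections import Counter

surfaces = ['猫', 'は', '猫', 'である', '猫', 'は']  # toy input
counter = Counter(surfaces)
# most_common(k) returns the k (word, count) pairs with the highest counts
print(counter.most_common(2))  # [('猫', 3), ('は', 2)]
```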
Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart). Draw a histogram of word frequencies (horizontal axis: frequency; vertical axis: the number of word types with that frequency, as a bar graph). Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency on the vertical axis.
All three graphs are produced together in one figure.
fig = plt.figure(figsize=(20, 6))
# 37. Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart).
words = [f[0] for f in frequency[0:10]]
x_pos = np.arange(len(words))
fp = FontProperties(fname=r'/Library/Fonts/Hiragino Maru Go ProN W4.ttc', size=14)
ax1 = fig.add_subplot(131)
ax1.bar(x_pos, [f[1] for f in frequency[0:10]], align='center', alpha=0.4)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(words, fontproperties=fp)
ax1.set_ylabel('Frequency')
ax1.set_title('Top 10 frequent words')
# 38. Draw a histogram of word frequencies (horizontal axis: frequency; vertical axis: number of word types with that frequency).
freq = list(dict(frequency).values())
freq.sort(reverse=True)
ax2 = fig.add_subplot(132)
ax2.hist(freq, bins=50, range=(0, 50))
ax2.set_title('Histogram of word count')
ax2.set_xlabel('Word count')
ax2.set_ylabel('Frequency')
# 39. Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency on the vertical axis.
rank = list(range(1, len(freq) + 1))
ax3 = fig.add_subplot(133)
ax3.plot(rank, freq)
ax3.set_xlabel('Rank')
ax3.set_ylabel('Frequency')
ax3.set_title("Zipf's law")
ax3.set_xscale('log')
ax3.set_yscale('log')
fig.savefig('morphological_analysis.png')