[PYTHON] 100 Language Processing Knock 2015 Chapter 4 Morphological Analysis (30-39)

100 Language Processing Knock 2015 "Chapter 4 Morphological Analysis (30 ~) It is a record of solving "39)".

Environment

Preparation

Libraries used

import MeCab
import ngram
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

Save the morphologically analyzed text to a file

Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions. For problems 37, 38, and 39, use matplotlib or Gnuplot.

Create a function called make_analyzed_file that performs the morphological analysis and saves the result to a file. It is assumed that neko.txt has been downloaded in advance and saved in the same folder as the script.

def make_analyzed_file(input_file_name: str, output_file_name: str) -> None:
    """
    Morphologically analyze a plain Japanese text file and save the result to a file.
    :param input_file_name: name of the plain Japanese text file
    :param output_file_name: name of the morphologically analyzed text file
    """
    _m = MeCab.Tagger("-Ochasen")
    with open(input_file_name, encoding='utf-8') as input_file:
        with open(output_file_name, mode='w', encoding='utf-8') as output_file:
            output_file.write(_m.parse(input_file.read()))

make_analyzed_file('neko.txt', 'neko.txt.mecab')
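To check what the analyzed file looks like, you can peek at the first few lines. With -Ochasen, each line should be tab-delimited as surface form, reading, base form, and part of speech (with subcategories joined by hyphens); the exact columns depend on the dictionary, so this is an assumption based on the standard IPA dictionary.

with open('neko.txt.mecab', encoding='utf-8') as f:
    for line in list(f)[:5]:  # print the first five analyzed lines
        print(line.rstrip())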

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1) as keys, and express one sentence as a list of such morphemes (mapping type). Use the program created here for the rest of the problems in Chapter 4.

Simply convert each tab-delimited line of the morphological analysis result into a dictionary and collect the results in morphemes, then group them by sentence into sentences.

def tabbed_str_to_dict(tabbed_str: str) -> dict:
    """
    Convert a tab-delimited string representing a morpheme, such as
    "次第に	シダイニ	次第に	副詞-一般", into a dict.
    :param tabbed_str: a tab-delimited string representing a morpheme
    :return: the morpheme represented as a dict
    """
    # ChaSen format: surface, reading, base form, POS (subcategories joined by '-'), ...
    elements = tabbed_str.split()
    if len(elements) < 4:
        # EOS markers and blank lines have fewer than four fields
        return {'surface': elements[0] if elements else '', 'base': '', 'pos': '', 'pos1': ''}
    else:
        return {'surface': elements[0], 'base': elements[2],
                'pos': elements[3].split('-')[0], 'pos1': elements[3]}
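
# Illustrative usage of tabbed_str_to_dict (assumption: IPA-dictionary ChaSen output;
# the example values are hypothetical, not taken from neko.txt.mecab):
# tabbed_str_to_dict('猫\tネコ\t猫\t名詞-一般')
# -> {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '名詞-一般'}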


def morphemes_to_sentence(morphemes: list) -> list:
    """
Group and list the list of morphemes represented by Dict type by kuten.
    :param morphemes List of morphemes represented by Dict type
    :return List of sentences
    """
    sentences = []
    sentence = []

    for morpheme in morphemes:
        sentence.append(morpheme)
        if morpheme['pos1'] == '記号-句点':
            sentences.append(sentence)
            sentence = []

    return sentences


with open('neko.txt.mecab', encoding='utf-8') as file_wrapper:
    morphemes = [tabbed_str_to_dict(line) for line in file_wrapper]

sentences = morphemes_to_sentence(morphemes)

# Check the result
print(morphemes[::100])
print(sentences[::100])

31. Verbs / 32. Base forms of verbs / 33. Sahen nouns

Extract all surface forms of verbs. Extract all base forms of verbs. Extract all nouns of sahen (s-irregular) conjugation.

These are easy if you use the morphemes created in "30. Reading morphological analysis results".

verbs_surface = [morpheme['surface'] for morpheme in morphemes if morpheme['pos1'].startswith('動詞')]
verbs_base = [morpheme['base'] for morpheme in morphemes if morpheme['pos1'].startswith('動詞')]
nouns_suru = [morpheme['surface'] for morpheme in morphemes if morpheme['pos1'] == '名詞-サ変接続']

# Check the result
print(verbs_surface[::100])
print(verbs_base[::100])
print(nouns_suru[::100])

34. "B of A"

Extract noun phrases in which two nouns are connected by 「の」.

def ngramed_list(lst: list, n: int = 3) -> list:
    """
    Convert a list into N-grams.
    :param lst: the list to convert into N-grams
    :param n: N (default is N=3)
    :return: the N-grammed list
    """
    index = ngram.NGram(N=n)
    return [term for term in index.ngrams(lst)]


def is_noun_no_noun(words: list) -> bool:
    """
    Determine whether a list of three words forms the pattern noun + の + noun.
    :param words: a list of 3 words
    :return: True if the list forms noun + の + noun, False otherwise
    """
    return isinstance(words, list) and (len(words) == 3) and \
           words[0]['pos1'].startswith('名詞') and \
           (words[1]['surface'] == 'の') and \
           words[2]['pos1'].startswith('名詞')


#"noun-of-名詞」を含むNグラムofみを抽出
noun_no_noun = [ngrams for ngrams in ngramed_list(morphemes) if is_noun_no_noun(ngrams)]

#Take out the surface layer and join
noun_no_noun = [''.join([word['surface'] for word in ngram]) for ngram in noun_no_noun]

#Check the result
print(noun_no_noun[::100])
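As a side note, the same trigrams could be produced without the ngram dependency; here is a minimal sketch using zip, assuming the morphemes list and is_noun_no_noun defined above (trigrams and noun_no_noun_alt are hypothetical names):

# Consecutive triples via zip instead of the ngram library
trigrams = zip(morphemes, morphemes[1:], morphemes[2:])
noun_no_noun_alt = [''.join(w['surface'] for w in tri)
                    for tri in trigrams if is_noun_no_noun(list(tri))]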

35. Noun concatenation

Extract concatenations of nouns (nouns that appear consecutively) using longest match.

def morphemes_to_noun_array(morphemes: list) -> list:
    """
    Group consecutive nouns in the list of morphemes (dicts), splitting on any non-noun morpheme.
    :param morphemes: list of morphemes represented as dicts
    :return: list of noun concatenations
    """
    nouns_list = []
    nouns = []

    for morpheme in morphemes:
        if morpheme['pos1'].startswith('名詞'):
            nouns.append(morpheme)
        else:
            # Any non-noun morpheme (including kuten) ends the current run of nouns
            nouns_list.append(nouns)
            nouns = []

    return [nouns for nouns in nouns_list if len(nouns) > 1]


noun_array = [''.join([noun['surface'] for noun in nouns]) for nouns in morphemes_to_noun_array(morphemes)]

# Check the result
print(noun_array[::100])
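For comparison, here is a minimal sketch of the same grouping using itertools.groupby from the standard library (noun_array_alt is a hypothetical name; this assumes the morphemes list from problem 30):

from itertools import groupby

# Group consecutive morphemes by whether they are nouns, keeping runs of two or more
noun_array_alt = []
for is_noun, run in groupby(morphemes, key=lambda m: m['pos1'].startswith('名詞')):
    run = list(run)
    if is_noun and len(run) > 1:
        noun_array_alt.append(''.join(m['surface'] for m in run))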

36. Frequency of word occurrence

Find the words that appear in the text and their frequencies of occurrence, and arrange them in descending order of frequency.

def get_frequency(words: list) -> dict:
    """
Takes a list of words and returns a dictionary with words as keys and frequency as value.
    :param words list of words
    :return dict A dictionary with word as key and frequency as value
    """
    frequency = {}
    for word in words:
        if frequency.get(word):
            frequency[word] += 1
        else:
            frequency[word] = 1

    return frequency


frequency = get_frequency([morpheme['surface'] for morpheme in morphemes])

# Sort by frequency in descending order
frequency = [(k, v) for k, v in sorted(frequency.items(), key=lambda x: x[1], reverse=True)]

# Check the result
print(frequency[0:20])
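The standard library can do the counting and sorting in one step; here is a sketch with collections.Counter that should be equivalent to get_frequency plus the sort above (frequency_alt is a hypothetical name):

from collections import Counter

# most_common() returns (word, count) pairs sorted by count in descending order
frequency_alt = Counter(morpheme['surface'] for morpheme in morphemes).most_common()
print(frequency_alt[0:20])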

37. Top 10 most frequently used words / 38. Histogram / 39. Zipf's law

Display the 10 most frequent words and their frequencies in a graph (for example, a bar graph). Draw a histogram of word frequencies (the horizontal axis is the frequency of occurrence and the vertical axis is the number of word types with that frequency, as a bar graph). Plot a log-log graph with the frequency rank of words on the horizontal axis and their frequency of occurrence on the vertical axis.

All three graphs are drawn together in a single figure.

fig = plt.figure(figsize=(20, 6))

# 37. Display the 10 most frequent words and their frequencies in a graph (for example, a bar graph).
words = [f[0] for f in frequency[0:10]]
x_pos = np.arange(len(words))
# Japanese-capable font (this path is macOS-specific; adjust for your environment)
fp = FontProperties(fname=r'/Library/Fonts/Hiragino Maru Go ProN W4.ttc', size=14)

ax1 = fig.add_subplot(131)
ax1.bar(x_pos, [f[1] for f in frequency[0:10]], align='center', alpha=0.4)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(words, fontproperties=fp)
ax1.set_ylabel('Frequency')
ax1.set_title('Top 10 frequent words')

# 38. Draw a histogram of word frequencies (the horizontal axis is the frequency of occurrence, the vertical axis is the number of word types with that frequency).
freq = list(dict(frequency).values())
freq.sort(reverse=True)

ax2 = fig.add_subplot(132)
ax2.hist(freq, bins=50, range=(0, 50))
ax2.set_title('Histogram of word count')
ax2.set_xlabel('Word count')
ax2.set_ylabel('Frequency')

# 39. Plot a log-log graph with word frequency rank on the horizontal axis and frequency of occurrence on the vertical axis.
rank = list(range(1, len(freq) + 1))

ax3 = fig.add_subplot(133)
ax3.plot(rank, freq)  # rank on x, frequency on y; freq is already sorted in descending order
ax3.set_xlabel('Rank')
ax3.set_ylabel('Frequency')
ax3.set_title("Zipf's law")
ax3.set_xscale('log')
ax3.set_yscale('log')

fig.savefig('morphological_analysis.png')
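If the hard-coded Hiragino path does not exist on your machine, one way to find a usable Japanese font is to list the fonts matplotlib has registered; a small sketch (filtering on 'Hiragino' is just an example):

import matplotlib.font_manager as fm

# Names of registered fonts whose name contains 'Hiragino'
print(sorted({f.name for f in fm.fontManager.ttflist if 'Hiragino' in f.name}))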

(Figure: morphological_analysis.png)
