[PYTHON] I tried the 100 Language Processing Knock 2020: Chapter 4

Introduction

I worked through the 100 Language Processing Knock 2020. Links to the other chapters are available from here, and the source code from here.

Chapter 4 Morphological analysis

No.30 Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Each morpheme should be stored in a mapping type with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1) as keys, and each sentence should be expressed as a list of morphemes (mapping types). Use the program created here for the remaining problems in Chapter 4.

Answer

030.py


import pandas as pd

with open(file="neko.txt.mecab", mode="rt", encoding="utf-8") as neko:
    nekotext = neko.readlines()

nekolist = []
for line in nekotext:  # "line"/"fields" avoid shadowing the built-ins str and list
    # "surface\tpos,pos1,...,base,..." -> ["surface", "pos", "pos1", ..., "base", ...]
    fields = line.replace("\n", "").replace(" ", "").replace("\t", ",").split(",")
    if fields[0] != "EOS":
        nekolist.append([fields[0], fields[7], fields[1], fields[2]])
    else:
        # Keep EOS as a row so that sentence boundaries survive (used in No.37)
        nekolist.append([fields[0], "*", "*", "*"])
pd.set_option('display.unicode.east_asian_width', True)
df_neko = pd.DataFrame(nekolist, columns=["surface", "base", "pos", "pos1"])
print(df_neko)


# ->   surface  base  pos   pos1
# 0    一       一    名詞  数
# 1    　       　    記号  空白
# 2    吾輩     吾輩  名詞  代名詞
# 3    は       は    助詞  係助詞
# 4    猫       猫    名詞  一般
# ...

Comments

I put everything into a pandas DataFrame because it is convenient. It looks kind of like a mapping type, doesn't it? (No, it doesn't.)
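For comparison, here is a minimal sketch of the mapping-type structure the problem statement describes: one dict per morpheme and one list per sentence. The function name `parse_mecab` and the `SAMPLE` text are assumptions made for illustration, and the field layout assumes IPADIC-style MeCab output (`surface<TAB>pos,pos1,...,base,...`).

```python
# Illustrative MeCab (IPADIC-style) output; a real run would read neko.txt.mecab.
SAMPLE = (
    "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "で\t助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ\n"
    "ある\t助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル\n"
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "EOS\n"
)


def parse_mecab(lines):
    """Return a list of sentences, each a list of morpheme dicts."""
    sentences, sentence = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line == "EOS":
            # EOS closes the current sentence; skip empty sentences
            if sentence:
                sentences.append(sentence)
            sentence = []
            continue
        if "\t" not in line:
            continue  # ignore lines that are not "surface<TAB>features"
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        sentence.append({
            "surface": surface,
            "base": fields[6] if len(fields) > 6 else "*",
            "pos": fields[0],
            "pos1": fields[1] if len(fields) > 1 else "*",
        })
    return sentences


sentences = parse_mecab(SAMPLE.splitlines())
print(sentences[0][2])
# {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '一般'}
```

Splitting on the first tab only keeps a surface form that itself contains a comma intact, which the replace-then-split approach cannot do.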

No.31 Verbs

Extract all the surface forms of verbs.

Answer

031.py


import input_neko as nk

df = nk.input()
print(df.query("pos == '動詞'")["surface"].values.tolist())  # 動詞 = verb, as it appears in the MeCab output

# -> ['生れ', 'つか', 'し', '泣い',...

Comments

I am using the result of No.30 via `import`. The `Series` type and the `list` type can be converted to each other, which is convenient.

No.32 Base forms of verbs

Extract all the base forms of verbs.

Answer

032.py


import input_neko as nk

df = nk.input()
print(df.query("pos == '動詞'")["base"].values.tolist())  # 動詞 = verb, as it appears in the MeCab output

# -> ['生れる', 'つく', 'する', '泣く',...

Comments

This is the same as No.31, with `surface` simply changed to `base`.

No.33 "A no B"

Extract noun phrases in which two nouns are connected by "no" (の).

Answer

033.py


import input_neko as nk

df = nk.input()
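df = df.reset_index(drop=True)  # reset_index returns a new frame, so assign the result
# の = "no" (connecting particle), 助詞 = particle, 名詞 = noun
list_index = df.query("surface == 'の' & pos == '助詞'").index
# Skip a の on the first or last row, which has no neighbour on one side
print([f"{df.iloc[item-1,1]}の{df.iloc[item+1,1]}" for item in list_index if 0 < item < len(df) - 1 and df.iloc[item-1, 2] == df.iloc[item+1, 2] == "名詞"])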

# -> ['彼の掌', '掌の上', '書生の顔', 'はずの顔',...

Comments

I thought I could get the output with `df.iloc[item-1:item+1, 1]`, but that did not work, so I ended up with rather long code.
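One possible reason the slice fell short, sketched on a tiny hand-made frame: the end of an `iloc` slice is exclusive, so `item-1:item+1` covers only two rows, and the slice returns a Series rather than a string. The data and the value of `item` below are illustrative assumptions, not taken from the actual neko.txt.mecab run.

```python
import pandas as pd

# Tiny hand-made frame with the same columns as the No.30 result (illustrative data)
df = pd.DataFrame(
    [
        ["彼", "彼", "名詞", "代名詞"],
        ["の", "の", "助詞", "連体化"],
        ["掌", "掌", "名詞", "一般"],
        ["に", "に", "助詞", "格助詞"],
    ],
    columns=["surface", "base", "pos", "pos1"],
)

item = 1  # row position of "の" in this toy frame

# The slice end is exclusive: item-1:item+1 yields only rows 0 and 1
too_short = df.iloc[item - 1:item + 1, 0]

# item-1:item+2 covers all three rows; join the Series into one string
phrase = "".join(df.iloc[item - 1:item + 2, 0])
print(phrase)  # 彼の掌
```

The part-of-speech check on both neighbours is still needed separately, so this only shortens the string-building step.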

No.34 Noun concatenation

Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.

Answer

034.py


import input_neko as nk

df = nk.input()
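df = df.reset_index(drop=True)  # reset_index returns a new frame, so assign the result
num = 0
phrase = ""  # named "phrase" to avoid shadowing the built-in str
ans = []
for i in range(len(df)):
    if df.iloc[i, 2] == "名詞":  # 名詞 = noun
        num = num + 1
        phrase = phrase + df.iloc[i, 0]
    else:
        if num >= 2:
            ans.append(phrase)
        num = 0
        phrase = ""
if num >= 2:  # flush a noun run that reaches the end of the data
    ans.append(phrase)
print(ans)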

# -> ['人間中', '一番獰悪', '時妙', '一毛',...

Comments

Consecutive nouns are concatenated into one string, which is added to `ans` when the run contains two or more nouns.
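The same longest-match idea can also be sketched with `itertools.groupby`, which groups consecutive items that share a key. The helper name `noun_runs` and the sample morphemes are assumptions for illustration; the input here is a plain list of `(surface, pos)` pairs rather than the DataFrame used above.

```python
from itertools import groupby

# Illustrative (surface, pos) pairs; real data would come from the No.30 result
morphemes = [
    ("人間", "名詞"), ("中", "名詞"), ("で", "助詞"),
    ("一番", "名詞"), ("獰悪", "名詞"), ("な", "助動詞"),
    ("種族", "名詞"),
]


def noun_runs(morphemes, min_len=2):
    """Concatenate each maximal run of consecutive nouns of at least min_len."""
    runs = []
    # groupby yields one group per maximal run of equal keys (noun / not noun)
    for is_noun, group in groupby(morphemes, key=lambda m: m[1] == "名詞"):
        surfaces = [surface for surface, _ in group]
        if is_noun and len(surfaces) >= min_len:
            runs.append("".join(surfaces))
    return runs


print(noun_runs(morphemes))  # ['人間中', '一番獰悪']
```

Because each group is a maximal run, a run that reaches the end of the data is handled without any extra flush step.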

No.35 Word frequency

Find the words that appear in the text and their frequencies, and arrange them in descending order of frequency.

Answer

035.py


import input_neko as nk

df = nk.input()
print(df["surface"].value_counts().to_dict())

# -> {'の': 9194, '。': 7486, 'て': 6868, '、': 6772,...

Comments

`Series` is also convenient in that it can be converted to the `dict` type.
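As a point of comparison, the standard library's `collections.Counter` gives the same kind of descending frequency table without pandas. This is only a sketch on a made-up list of surface forms, not the actual neko.txt data.

```python
from collections import Counter

# Illustrative surface forms; the real list would be df["surface"].tolist()
surfaces = ["の", "猫", "の", "。", "吾輩", "の", "。"]

freq = Counter(surfaces)
ranked = freq.most_common()  # (word, count) pairs in descending order of count
print(ranked)  # [('の', 3), ('。', 2), ('猫', 1), ('吾輩', 1)]
```

Words with equal counts keep the order in which they were first seen.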

No.36 Top 10 most frequent words

Display the 10 most frequent words and their frequencies in a graph (for example, a bar graph).

Answer

036.py


import input_neko as nk
import japanize_matplotlib
import matplotlib.pyplot as plt


df = nk.input()
df_dict = df["surface"].value_counts()[:10].to_dict()
left = list(df_dict.keys())
height = list(df_dict.values())
fig = plt.figure()
plt.bar(left, height)
fig.savefig("036_graph.png")  # save before show(); no trailing space in the file name
plt.show()

# ->

036_graph.png

Comments

Getting matplotlib to render Japanese characters took a long time, but importing japanize_matplotlib solved it.

No.37 Top 10 words that frequently co-occur with "cat"

Display the 10 words that most frequently co-occur with "cat" and their frequencies in a graph (for example, a bar graph).

Answer

037.py


from collections import defaultdict
import input_neko as nk
import japanize_matplotlib
import matplotlib.pyplot as plt

df = nk.input()
start = 0
neko_phrase = []
freq = defaultdict(int)
for i in range(len(df)):
    if df.iloc[i, 0] == "EOS":
        phrase = df.iloc[start:i, 0].to_list()
        if "猫" in phrase:  # 猫 = cat, as it appears in the MeCab output
            neko_phrase.append(phrase)
            for word in phrase:
                if word != "猫":
                    freq[word] += 1
        start = i + 1
neko_relation = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:10]
left = [item[0] for item in neko_relation]
height = [item[1] for item in neko_relation]
fig = plt.figure()
plt.bar(left, height)
fig.savefig("037_graph.png")

# -> 

037_graph.png

Comments

Words that often appear in sentences containing "cat" are counted as having a high co-occurrence frequency. A lambda expression is used to sort the `dict` items by value.
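The counting and sorting steps can be sketched on their own as follows. The three sentences are made up for illustration, and each sentence is assumed to be already split into a list of surface forms.

```python
from collections import defaultdict

# Illustrative sentences, each already split into surface forms
sentences = [
    ["吾輩", "は", "猫", "で", "ある"],
    ["名前", "は", "まだ", "無い"],
    ["猫", "は", "寝る"],
]

target = "猫"
freq = defaultdict(int)
for sentence in sentences:
    if target in sentence:
        # Every other word in a sentence containing the target co-occurs with it
        for word in sentence:
            if word != target:
                freq[word] += 1

# items() yields (word, count) pairs; x[1] picks the count as the sort key
top = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:10]
print(top[0])  # ('は', 2)
```

Particles and punctuation tend to dominate a count like this, so filtering by part of speech before counting is one option if content words are of more interest.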

No.38 Histogram

Draw a histogram of word frequencies (the horizontal axis represents the frequency of occurrence, and the vertical axis represents the number of distinct words that occur with that frequency, drawn as a bar graph).

Answer

038.py


import input_neko as nk
import pandas as pd
import japanize_matplotlib
import matplotlib.pyplot as plt

df = nk.input()
df_dict = df["surface"].value_counts().to_dict()
word = list(df_dict.keys())
count = list(df_dict.values())
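# Group words by their count: index = frequency, value = number of distinct words
d = pd.DataFrame(count, index=word).groupby(0).size()
height = list(d.iloc[:10])  # the 10 smallest frequencies
fig = plt.figure()
plt.bar(d.index[:10], height)  # use the actual frequencies as x positions
fig.savefig("038_graph.png")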

# -> 

038_graph.png

Comments

I thought `groupby` was the right tool, but I feel that converting between `dict`, `DataFrame`, `Series`, and `list` has made the code harder to follow...
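One way to avoid the chain of conversions is to count twice with `collections.Counter`: once for the words, and once for the resulting counts. This is a sketch on made-up data, offered as an alternative rather than as what the code above does.

```python
from collections import Counter

# Illustrative surface forms; the real list would be df["surface"].tolist()
surfaces = ["の", "猫", "の", "。", "吾輩", "の", "。", "は"]

word_freq = Counter(surfaces)               # word -> number of occurrences
freq_of_freq = Counter(word_freq.values())  # frequency -> number of distinct words

xs = sorted(freq_of_freq)                   # frequencies in ascending order
heights = [freq_of_freq[x] for x in xs]
print(list(zip(xs, heights)))  # [(1, 3), (2, 1), (3, 1)]
```

`xs` and `heights` can then be passed directly to `plt.bar`.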

No.39 Zipf's Law

Plot a log-log graph with the frequency rank of words on the horizontal axis and their frequency of occurrence on the vertical axis.

Answer

039.py


import input_neko as nk
import pandas as pd
import japanize_matplotlib
import matplotlib.pyplot as plt

df = nk.input()
# value_counts() is sorted in descending order, so position + 1 is the rank
count = df["surface"].value_counts().to_list()
rank = range(1, len(count) + 1)  # start at 1 because log(0) is undefined
fig = plt.figure()
plt.xscale("log")
plt.yscale("log")
plt.plot(rank, count)
fig.savefig("039_graph.png")

# -> 

039_graph.png

Comments

I looked up Zipf's law on [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87).

Zipf's law is an empirical rule stating that the proportion of the k-th most frequent element in the whole is proportional to 1/k.

Ideally, a graph with logarithmic scales on both axes is a straight line descending to the right, and I feel the output is roughly like that. It is curious that Zipf's law shows up in so many different situations.
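To see why the ideal case is a straight line: if the frequency is c/k for rank k, then log(frequency) = log(c) - log(k), which is a line of slope -1 in log-log coordinates. A small sketch that checks this numerically with a least-squares fit follows; the helper name `loglog_slope` and the synthetic counts are assumptions for illustration.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts that follow Zipf's law exactly: frequency = 1000 / rank
ideal = [1000 / k for k in range(1, 101)]
slope = loglog_slope(ideal)
print(round(slope, 3))  # -1.0
```

Passing real word counts to the same function gives a rough measure of how close a text is to the ideal slope of -1.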
