[Chapter 4] Introduction to Python with 100 knocks of language processing

This article is a sequel to my series Introduction to Python with 100 Knocks. This installment explains Chapter 4 of the 100 knocks.

First, install the morphological analyzer MeCab, download neko.txt, run the morphological analysis, and check the contents.

$ mecab < neko.txt > neko.txt.mecab

"I am a cat" from Aozora Bunko.

MeCab's default dictionary scheme is close to Japanese school grammar, except that adjectival verbs are treated as noun + auxiliary verb and sa-variable (suru) verbs as noun + verb. The output format should look like this:

surface form\tpart of speech,POS subclassification 1,POS subclassification 2,POS subclassification 3,conjugation type,conjugated form,base form,reading,pronunciation

Also, sentences are separated by `EOS`. By the way, many morphological analyzers assume full-width characters, so when analyzing Web text it is better to convert half-width characters to full-width first.
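
As a minimal sketch of that preprocessing (one common option, not the only one), Unicode NFKC normalization folds half-width katakana into full-width:

import unicodedata

# NFKC folds half-width katakana such as 'ﾈｺ' into full-width 'ネコ'
# (note it also folds full-width alphanumerics into half-width,
# so check that this matches what your dictionary expects)
print(unicodedata.normalize('NFKC', 'ﾈｺ'))  # ネコ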

itertools

While we're at it, let's get acquainted with the itertools module. It provides functions that create convenient iterators.

islice()

This one already appeared in Chapter 2. islice(iterable, start, stop, step) slices an iterator. If step is omitted it defaults to 1, and if start is omitted it defaults to 0. stop=None means "until the end". It's very convenient, so let's use it.
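
A minimal sketch of how it behaves, on toy data:

from itertools import islice

it = iter(range(10))
print(list(islice(it, 3)))        # first three elements: [0, 1, 2]
print(list(islice(it, 2, None)))  # skip two more, then take the rest: [5, 6, 7, 8, 9]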

groupby()

This lets you do something like the Unix command `uniq`.

from itertools import groupby

a = [1, 1, 1, 0, 0, 1]
for k, g in groupby(a):
    print(k, list(g))

1 [1, 1, 1]
0 [0, 0]
1 [1]

If you pass an iterable as the first argument like this, groupby() yields pairs of each run's value and an iterator over that run. As with sort, you can also specify a key as the second argument. I often use operator.itemgetter (which you'll know about if you read the Sort HOW TO mentioned in Chapter 2). Lambda expressions that return a Boolean are also handy as keys.

from operator import itemgetter

a = [(3, 0), (4, 0), (2, 1)]
for k, g in groupby(a, key=itemgetter(1)):
    print(k, list(g))

0 [(3, 0), (4, 0)]
1 [(2, 1)]

chain.from_iterable()

This flattens a two-dimensional list into one dimension.

from itertools import chain

a = [[1, 2], [3, 4], [5, 6]]
print(list(chain.from_iterable(a)))

[1, 2, 3, 4, 5, 6]

zip_longest()

The built-in zip() stops at the shortest iterable; use this one when you want to pad out to the longest. By default the gaps are filled with None, but you can specify the filler with the keyword argument fillvalue.
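
A small sketch with made-up data:

from itertools import zip_longest

a = [1, 2, 3]
b = ['x', 'y']
print(list(zip(a, b)))                       # stops short: [(1, 'x'), (2, 'y')]
print(list(zip_longest(a, b, fillvalue=0)))  # pads out: [(1, 'x'), (2, 'y'), (3, 0)]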

product()

This computes the Cartesian product. Its siblings permutations() and combinations() are tedious to implement yourself, so they're also worth knowing about.
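
A quick sketch of all three on toy inputs:

from itertools import product, permutations, combinations

print(list(product('AB', '12')))     # [('A', '1'), ('A', '2'), ('B', '1'), ('B', '2')]
print(list(permutations('ABC', 2)))  # all ordered pairs, 6 in total
print(list(combinations('ABC', 2)))  # unordered pairs: [('A', 'B'), ('A', 'C'), ('B', 'C')]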

These are just a few pieces of itertools; if you're interested, read the documentation (https://docs.python.org/ja/3/library/itertools.html).

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1) as keys, and represent one sentence as a list of such mappings. Use this program for the rest of the problems in Chapter 4.

If you're fond of object-oriented languages you'll be itching to use classes here, but those wait until the next chapter. You could also use pandas as in Chapter 2, but I'll refrain, since that seems to stray from the intent of the problem.

Below is an example of the answer.

q30.py


import argparse
from itertools import groupby, islice
from pprint import pprint
import sys


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('num', type=int)
    args = parser.parse_args()
    for sent_lis in islice(read_mecab(sys.stdin), args.num-1, args.num):
        pprint(sent_lis)

            
def read_mecab(fi):
    for is_eos, sentence in groupby(fi, lambda line: line.startswith('EOS')):
        if not is_eos:
            yield list(map(line2dic, sentence))


def line2dic(line):
    surface, info = line.rstrip().split('\t')
    col = info.split(',')
    dic = {'surface': surface,
           'pos': col[0],
           'pos1': col[1],
           'base': col[6]}
    return dic


if __name__ == '__main__':
    main()

$ python q30.py 2 < neko.txt.mecab

[{'base': '\u3000', 'pos': 'symbol', 'pos1': 'blank', 'surface': '\u3000'},
 {'base': "I'm", 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'I'},
 {'base': 'is', 'pos': 'particle', 'pos1': 'particle', 'surface': 'is'},
 {'base': 'cat', 'pos': 'noun', 'pos1': 'general', 'surface': 'cat'},
 {'base': 'da', 'pos': 'auxiliary verb', 'pos1': '', 'surface': 'in'},
 {'base': 'Are', 'pos': 'auxiliary verb', 'pos1': '', 'surface': 'Are'},
 {'base': '.', 'pos': 'sign', 'pos1': 'punctuation', 'surface': '.'}]

main() just narrows the output down to the sentence specified on the command line. Using pprint.pprint() instead of print() pretty-prints the data with line breaks nicely adjusted.

This kind of format can be handled elegantly by passing a function that returns whether a line is `EOS` as the key of groupby(). yield was covered in Chapter 2.

The other point of interest is list(map()). map(func, iterable) applies the function func to each element of iterable and returns an iterator. The result is the same as [line2dic(x) for x in sentence], but Python is said to be slow at calling user-defined functions inside a for comprehension, so I adopted this notation (reference: https://qiita.com/hi-asano/items/aa2976466739f280b887).
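
A toy illustration of the equivalence (not part of the answer program):

# map() and the equivalent list comprehension give the same result
print(list(map(int, '123')))    # [1, 2, 3]
print([int(x) for x in '123'])  # [1, 2, 3]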

31. Verbs

Extract all surface forms of verbs.

Below is an example of the answer.

q31.py


from itertools import islice
import sys

from q30 import read_mecab


def main():
    for sent_lis in islice(read_mecab(sys.stdin), 5):
        for word in filter(lambda x: x['pos'] == 'verb', sent_lis):
            print(word['surface'])

                
if __name__ == '__main__':
    main()

$ python q31.py < neko.txt.mecab

Born
Tsuka
Shi
Crying
Shi
Is

I fetch only the first few sentences. argparse is also omitted this time because it smelled like more trouble than it's worth.

I didn't have anything else to demonstrate, so I shoehorned in filter(). filter(cond, iterable) is equivalent to the generator expression (x for x in iterable if cond(x)) and returns only the elements that satisfy the condition. Honestly, an if statement is enough, so it doesn't get many outings (and in this case I suspect filter() is actually the slower option).
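
A toy comparison of the two forms, on made-up data:

nums = [1, -2, 3, -4]
print(list(filter(lambda x: x > 0, nums)))  # [1, 3]
print([x for x in nums if x > 0])           # same result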

32. Base forms of verbs

Extract all base forms of verbs.

Omitted, since it is almost the same as problem 31.

33. "B of A"

Extract noun phrases in which two nouns are connected by "no".

Brute force gets it done, so there isn't much to say. Below is an example of the answer.

q33.py


from itertools import islice
import sys


from q30 import read_mecab


def main():
    for sent_lis in islice(read_mecab(sys.stdin), 20):
        for i in range(len(sent_lis) - 2):
            if (sent_lis[i+1]['base'] == 'of' and sent_lis[i]['pos'] == 'noun'
                and sent_lis[i+2]['pos'] == 'noun'):
                print(''.join(x['surface'] for x in sent_lis[i: i+3]))

                
if __name__ == '__main__':
    main()

$ python q33.py < neko.txt.mecab

His palm
On the palm
Student's face
Should face
In the middle of the face
In the hole
Calligraphy palm
The back of the palm

I mentioned earlier that explicit line continuation in Python uses \, but inside parentheses you can break lines freely. The condition of the if statement above is wrapped in otherwise needless parentheses just so I could say that.

34. Noun concatenation

Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.

Just groupby() on the part of speech. Below is an example of the answer.

q34.py


import sys
from itertools import groupby, islice

from q30 import read_mecab


def main():
    for sent_lis in islice(read_mecab(sys.stdin), 20):
        for key, group in groupby(sent_lis, lambda word: word['pos']):
            if key == 'noun':
                words = [word['surface'] for word in group]
                if len(words) > 1:
                    print(''.join(words))

                
if __name__ == '__main__':
    main()

$ python q34.py < neko.txt.mecab

In humans
The worst
Timely
One hair
Then the cat
one time
Puupuu and smoke

35. Frequency of word occurrence

Find all the words that appear in the text and their frequencies of appearance, and arrange them in descending order of frequency.

Just use collections.Counter. Below is an example of the answer.

q35.py


import sys
from collections import Counter
from pprint import pprint

from q30 import read_mecab


def get_freq():
    word_freq = Counter(word['surface'] for sent_lis in read_mecab(sys.stdin) 
                            for word in sent_lis)
    return word_freq.most_common(10)


if __name__ == '__main__':
    pprint(get_freq())

$ python q35.py < neko.txt.mecab

[('No', 9194), ('。', 7486), ('Te', 6868), ('、', 6772), ('Ha', 6420), ('To', 6243), ('To', 6071), ('And', 5508), ('Ga', 5337), ('Ta', 3988)]

matplotlib

Now, the next problems require drawing graphs, so it's finally time for this one to come into play. Install it with pip. I don't really want to explain external modules in an article called "Introduction to Python", and a serious explanation of matplotlib could fill a book of its own. First, get a feel for matplotlib's hierarchical structure (Figure, Axes, and so on; there are good Qiita articles on it). Pretty messy, isn't it? Pyplot is the interface that takes care of that side for you. I won't fine-tune the appearance this time, so that's how I'll proceed. Here's a simple example.

import matplotlib.pyplot as plt

# Specify the graph type and the data
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

# Set labels and such
plt.title('example')
plt.ylabel('some numbers')

# Draw
plt.show()

output_28_0.png

The import statement means "import the pyplot submodule of the matplotlib package under the name plt".

First, decide on the type of graph: plt.plot() for line graphs, plt.barh() for horizontal bar graphs, plt.bar() for vertical bar graphs, and so on. You pass these functions the data (it is converted to an array internally, so it may be better to pass a numpy array in the first place).

Next, set the appearance. plt.yticks() lets you set the y-axis tick positions and the labels attached to them. plt.xlim() sets the minimum and maximum of the x-axis. plt.yscale("log") makes the y-axis logarithmic.
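
A minimal sketch combining these setters (the numbers are arbitrary dummy data):

import matplotlib.pyplot as plt

plt.plot([1, 10, 100, 1000])    # line graph of dummy data
plt.yscale('log')               # logarithmic y-axis
plt.xlim(0, 3)                  # x-axis range
plt.yticks([1, 10, 100, 1000])  # y tick positions
plt.show()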

Finally, draw. I'm working in Jupyter, so I call plt.show(). If you're running a script, use plt.savefig(filename) to write the graph to a file.

To be honest, this style is MATLAB-ish rather than Pythonic, but it is easy.

To display Japanese with matplotlib

The default font does not support Japanese, so Japanese labels come out as tofu (empty boxes). To display Japanese on a graph you can configure a font that supports it, but depending on your environment the Japanese font may not be found, or may not be installed in the first place. It's a pain. japanize-matplotlib will make your life easier.

36. Top 10 most frequent words

Display the 10 most frequent words and their frequencies of appearance in a graph (for example, a bar graph).

from collections import Counter

from q30 import read_mecab
import matplotlib.pyplot as plt
import japanize_matplotlib

word_freq = Counter(word['base'] for sent_lis in read_mecab(open('neko.txt.mecab')) 
                            for word in sent_lis)
word, count  = zip(*word_freq.most_common(10))
len_word = range(len(word))

plt.barh(len_word, count, align='center')
plt.yticks(len_word, word)
plt.xlabel('frequency')
plt.ylabel('word')
plt.title('36.Top 10 most frequent words')
plt.show()

output_31_0.png

What's that * inside zip()? What I want to do here is transpose: turning data like [[a, b], [c, d]] into [[a, c], [b, d]]. The easiest way to transpose is zip(*seq), which is equivalent to zip(seq[0], seq[1], ...) (see "Unpacking Argument Lists" in the Python tutorial). And zip([a, b], [c, d]) is [(a, c), (b, d)], right? Combined with unpacking assignment, you can bind the results to separate variables in one go.
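
A small sketch of the transpose trick on made-up pairs:

pairs = [('a', 1), ('b', 2), ('c', 3)]
letters, numbers = zip(*pairs)  # same as zip(('a', 1), ('b', 2), ('c', 3))
print(letters)  # ('a', 'b', 'c')
print(numbers)  # (1, 2, 3)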

(This also concludes the explanation of the alternative solution to the ngram function from Chapter 1.)

Keep in mind that the top words in word frequency are function words (particles, punctuation).

37. Top 10 words that frequently co-occur with "cat"

Display 10 words that frequently co-occur with "cat" (i.e., have a high co-occurrence frequency) and their frequencies of appearance in a graph (for example, a bar graph).

word_freq = Counter()
for sent_lis in read_mecab(open('neko.txt.mecab')):
    for word in sent_lis:
        if word['surface'] == 'Cat':
            word_freq.update(x['base'] for x in sent_lis if x['surface'] != 'Cat')
            break
            
words, count  = zip(*word_freq.most_common(10))
len_word = range(len(words))

plt.barh(len_word, count, align='center')
plt.yticks(len_word, words)
plt.xlabel('frequency')
plt.ylabel('word')
plt.title('37.Top 10 words that frequently co-occur with "cat"')
plt.show()

output_34_0.png

You can update a Counter object with Counter.update(iterable). However, unless you restrict attention to content (independent) words, the result is completely uninteresting.
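
A tiny sketch of how update() accumulates counts, on toy data:

from collections import Counter

c = Counter()
c.update(['a', 'b', 'a'])  # counts the elements of any iterable
c.update('ab')             # a string is an iterable of characters
print(c.most_common())     # [('a', 3), ('b', 2)]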

38. Histogram

Draw a histogram of word frequencies: the horizontal axis is the frequency of occurrence, and the vertical axis is the number of word types that have that frequency, drawn as a bar graph.

word_freq = Counter(word['base'] for sent_lis in read_mecab(open('neko.txt.mecab')) 
                        for word in sent_lis)
data = Counter(word_freq.values())  # frequency -> number of word types with that frequency

x, y = data.keys(), data.values()

plt.bar(x, y)
plt.title("38.histogram")
plt.xlabel("frequency")
plt.ylabel("number of the words")
plt.xlim(1, 30)
plt.show()

output_34_0.png

You can get all the keys with dict.keys() and all the values with dict.values().

Looking at the number of word types, we can see that most words are infrequent; the curve looks roughly like an inverse proportion. This is one reason why handling low-frequency words matters in deep learning.

39. Zipf's Law

Plot a log-log graph with the word's frequency rank on the horizontal axis and its frequency of occurrence on the vertical axis.

word_freq = Counter(word['base'] for sent_lis in read_mecab(open('neko.txt.mecab')) 
                        for word in sent_lis)
_, count = zip(*word_freq.most_common())
plt.plot(range(1, len(count)+1), count)
plt.yscale("log")
plt.xscale("log")
plt.title("39.Zipf's law")
plt.xlabel("log(rank)")
plt.ylabel("log(frequency)")
plt.show()

output_37_0.png

The slope of the log-log graph is about -1, which means freq ∝ rank^(-1). This seems related to the result of problem 38. Search for Zipf's law if you want the details.

Summary

Next is Chapter 5.

Classes finally make an appearance. Will the next installment be the last of this introduction to Python?

(5/16) Published → https://qiita.com/hi-asano/items/5e18e3a5a711a752ad99
