[Python] Challenge 100 knocks! (030-034)

About the history so far

Please refer to the first post.

Knock status

Added 9/24

Chapter 4: Morphological analysis

Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions. For problems 37, 38, and 39, use matplotlib or Gnuplot.

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with the keys surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1), and represent one sentence as a list of such morpheme mappings. Use this program for the rest of the problems in Chapter 4.
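The sentence-as-list-of-mappings structure the problem asks for can be sketched on a small sample. The helper name and the sample MeCab lines below are assumptions, not part of the original code:

```python
def load_sentences(mecab_lines):
    """Group MeCab (IPADic) output lines into sentences: one list of
    morpheme dicts per sentence, splitting at 'EOS' markers."""
    sentences, current = [], []
    for line in mecab_lines:
        line = line.rstrip('\n')
        if line == 'EOS':          # sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        if '\t' not in line:       # skip blank/malformed lines
            continue
        surface, feats = line.split('\t')
        f = feats.split(',')
        current.append({'surface': surface, 'base': f[6],
                        'pos': f[0], 'pos1': f[1]})
    return sentences

# Hypothetical sample (standard IPADic output for the opening sentence):
sample = ['吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ',
          'は\t助詞,係助詞,*,*,*,*,は,ハ,ワ',
          '猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ',
          'EOS']
print(load_sentences(sample))
```

The code below builds one flat list of morphemes instead of splitting per sentence, which is enough for knocks 031-034.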

Preparation: Creating neko.txt.mecab

file_analyze_mecab_030.py


import MeCab  # mecab-python bindings; provides the MeCab.Tagger used below
import codecs

def file_analyze_mecab(input_filename, output_filename):
    # Read the raw text of the novel
    with codecs.open(input_filename, 'r', 'utf-8') as f:
        text = f.read()

    # Run morphological analysis and write the result as-is
    m = MeCab.Tagger("mecabrc")
    wt = m.parse(text)

    with codecs.open(output_filename, 'w', 'utf-8') as wf:
        wf.write(wt)

if __name__=="__main__":
    file_analyze_mecab('neko.txt','neko.txt.mecab')

result


一	名詞,数,*,*,*,*,一,イチ,イチ
　	記号,空白,*,*,*,*,　,　,　
吾輩	名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
(Omitted because it is long)

Impression: This was my first time hearing about morphological analysis, so I started my research there. For MeCab's parameters, I referred to MeCab: Yet Another Part-of-Speech and Morphological Analyzer. The naming of the modules and so on is wonderful.

030. Read

mecab_030.py


#-*-coding:utf-8-*-

import codecs

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab",'r','utf-8') as f:
        data = f.readlines()

    mecab_list = []
    temp_dict = {}
    for temp_word in data:
        # Join the surface form and the feature fields into one comma-separated line
        temp_word = temp_word.replace('\t', ',')
        temp_word = temp_word.replace('\n', '')
        # IPADic lines have 9 commas (7 for unknown words, which lack readings);
        # anything else (e.g. 'EOS') is skipped
        if temp_word.count(',') == 9 or temp_word.count(',') == 7:
            temp_list = temp_word.split(',')
            temp_dict = {'surface': temp_list[0], 'base': temp_list[7],
                         'pos': temp_list[1], 'pos1': temp_list[2]}
            mecab_list.append(temp_dict)
        else:
            continue
    print(mecab_list)

    with codecs.open('neko.txt.mecab.analyze','w','utf-8') as wf:
        for line in mecab_list:
            wf.write(str(line)+'\n')

result


{'surface': '一', 'base': '一', 'pos': '名詞', 'pos1': '数'}
(Omitted because it is long)

Impressions: For easy viewing, the output file stores each morpheme as a plain string rather than as a list. It took me a very long time to notice that the number of ',' in a morphological-analysis result line is 7 or 9...
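The 7-or-9 comma count works because IPADic output has 9 comma-separated feature fields after the tab (7 for unknown words, which lack the reading and pronunciation fields). An alternative sketch that splits on the tab first, instead of counting commas, might look like this (function name and sample line are assumptions):

```python
def parse_mecab_line(line):
    """Parse one line of MeCab (IPADic) output into a morpheme dict.

    Line format: surface \t pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pron
    Unknown words omit the last two feature fields.
    """
    line = line.rstrip('\n')
    if line == 'EOS' or '\t' not in line:
        return None                    # sentence boundary or blank line
    surface, feature_str = line.split('\t')
    features = feature_str.split(',')
    return {
        'surface': surface,
        'base': features[6],           # 原形 (base form)
        'pos': features[0],            # 品詞 (part of speech)
        'pos1': features[1],           # 品詞細分類1 (POS subdivision 1)
    }

# Hypothetical sample line (standard IPADic output):
print(parse_mecab_line('吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'))
```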

031. Verb

Extract all surface forms of verbs.

vurb_031.py


#-*-coding:utf-8-*-

import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze",'r','utf-8') as f:
        temp_lines = f.readlines()

    # 動詞 = verb; this substring match also hits 助動詞 (auxiliary verbs)
    pattern = re.compile(r".*動詞.*")
    data = {}
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['surface'])
        else:
            continue

result


で
ある
生れ
た
つか
(Omitted because it is long)

Impression: The program reads the analysis result file, extracts the lines matching the verb keyword with a regular expression, converts each to a dictionary, and outputs only the surface form.
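One subtlety: because the regex scans the whole stringified dictionary for the verb keyword as a substring, auxiliary verbs (助動詞) such as で and ある are matched too, which is why they appear in the output above. Comparing the pos field exactly keeps only true verbs (動詞); a minimal sketch with assumed sample data:

```python
# Hypothetical morphemes, shaped like the dicts produced in knock 030
morphemes = [
    {'surface': 'で',   'base': 'だ',     'pos': '助動詞', 'pos1': '*'},
    {'surface': 'ある', 'base': 'ある',   'pos': '助動詞', 'pos1': '*'},
    {'surface': '生れ', 'base': '生れる', 'pos': '動詞',   'pos1': '自立'},
]

# Substring match (like the regex above) also catches auxiliary verbs:
loose = [m['surface'] for m in morphemes if '動詞' in m['pos']]

# Exact comparison keeps only true verbs:
strict = [m['surface'] for m in morphemes if m['pos'] == '動詞']

print(loose)   # all three surfaces
print(strict)  # only 生れ
```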

032. Base form of the verb

Extract all base forms of verbs.

base_vurb_032.py


# -*-coding:utf-8-*-

import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()

    pattern = re.compile(r".*動詞.*")  # 動詞 = verb (also matches 助動詞)
    data = {}
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['base'])
        else:
            continue

result


だ
ある
生れる
た
つく

Impressions: The procedure is the same as in 031; only the output is changed to base.

033. Sahen nouns

Extract all nouns of サ変接続 (sa-hen connection: nouns that form verbs with する).

sahen_noun_033.py


# -*-coding:utf-8-*-

import codecs
import re
import ast

if __name__ == "__main__":
    with codecs.open("neko.txt.mecab.analyze", 'r', 'utf-8') as f:
        temp_lines = f.readlines()

    pattern = re.compile(r".*サ変接続.*")  # サ変接続 = sa-hen connection
    data = {}
    for temp_line in temp_lines:
        if pattern.match(temp_line):
            data = ast.literal_eval(temp_line)
            print(data['surface'])
        else:
            continue

result


見当
記憶
話
装飾
突起
(Omitted because it is long)

Impressions: The procedure is the same as in 032; I just changed the extraction condition to サ変接続.

034. "B of A"

Extract a noun phrase in which two nouns are connected by "no".

no_noun_034.py


#-*-coding:utf-8-*-

import codecs
import ast

if __name__ == "__main__":

    with codecs.open('neko.txt.mecab.analyze','r','utf-8') as f:
        temp_lines = f.readlines()

    flag = 0
    temp_list = []
    for temp_line in temp_lines:
        temp_dict = ast.literal_eval(temp_line)
        # State 0: look for the first noun (名詞)
        if temp_dict['pos'] == '名詞' and flag == 0:
            temp_word = temp_dict['surface']
            flag = 1
            continue

        # State 1: the particle (助詞) 'の' must come next
        elif temp_dict['surface'] == 'の' and temp_dict['pos'] == '助詞' and flag == 1:
            temp_word += temp_dict['surface']
            flag = 2
            continue

        # State 2: a second noun completes the phrase
        elif temp_dict['pos'] == '名詞' and flag == 2:
            temp_word += temp_dict['surface']
            temp_list.append(temp_word)
            temp_word = ''
            flag = 0
            continue

        # Anything else breaks the sequence: reset
        else:
            temp_word = ''
            flag = 0
            continue

    no_noun_list = set(temp_list)


    for temp in no_noun_list:
        print(temp)

result


Child of
My year
Boredom too much
Left corner
Opponent's ability
On the forehead
For those
(Omitted because it is long)

Impression: My first attempt ran an n-gram analysis (N = 3) over the pos information to list the indices where noun-particle-noun sequences occur, extracted the surface information at the matching indices, and kept the strings that did not begin or end with 'の'. But it did not handle the target correctly, so I rewrote it into the current code. I have to check the target properly. Lesson learned.
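The window-of-three idea mentioned above can also be written directly over the morpheme list from knock 030: inspect each consecutive triple and keep noun + の + noun. A minimal sketch, with the function name and sample data as assumptions:

```python
def extract_no_phrases(morphemes):
    """Collect 'AのB' noun phrases from consecutive morpheme triples."""
    phrases = set()
    # zip over three offset views yields every consecutive (a, no, b) triple
    for a, no, b in zip(morphemes, morphemes[1:], morphemes[2:]):
        if (a['pos'] == '名詞' and b['pos'] == '名詞'
                and no['surface'] == 'の' and no['pos'] == '助詞'):
            phrases.add(a['surface'] + no['surface'] + b['surface'])
    return phrases

# Hypothetical sample covering two overlapping phrases
sample = [
    {'surface': '彼', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '掌', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '上', 'pos': '名詞'},
]
print(extract_no_phrases(sample))
```

Unlike the flag-based loop, which resets after each match, the windowed version also catches overlapping phrases (here both 彼の掌 and 掌の上, where 掌 ends one phrase and starts the next).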
