[Python] 100 Language Processing Knock-30 (using pandas): Reading morphological analysis results

This is a record of solving problem 30, "Reading morphological analysis results," from "Chapter 4: Morphological Analysis" of the 2015 edition of the 100 Language Processing Knocks. With the chapter on morphological analysis, the knocks start to feel like full-scale language processing. Morphological analysis is a technique that splits a sentence such as "お待ちしております" ("I am waiting") into morphemes like "お待ち", "し", "て", "おり", and "ます", and attaches information such as part of speech to each one. See Wikipedia and similar references for details.

Reference links

| Link | Remarks |
|:--|:--|
| 030. Reading morphological analysis results.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 30 | Source of many copy-and-pasted parts |
| MeCab official site | The first MeCab page to look at |

Environment

Starting with this post, I am using Python 3.8.1 (3.6.9 up to the previous one). In Chapter 3, "Regular Expressions," I used collections.OrderedDict for an ordered dictionary type, but since Python 3.7 the standard dictionary type is also guaranteed to preserve insertion order. There was no particular reason to stick with 3.6.9, so I renewed the environment. I have forgotten how MeCab was installed; I set it up a year ago and don't remember stumbling over it.

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |

In the above environment, I am using the following additional Python package, installed with regular pip.

| Type | Version |
|:--|:--|
| pandas | 1.0.1 |

Chapter 4: Morphological analysis

content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" and obtain statistics on the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Apply MeCab to the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" to perform morphological analysis, and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping type with keys for the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1), and express one sentence as a list of morphemes (mapping types). Use the program created here for the rest of the problems in Chapter 4.

Problem supplement (about MeCab)

MeCab is the de facto standard for morphological analysis. For a comparison with other morphological analyzers, I referred to the article "Comparison of morphological analyzers at the end of 2019" (after seeing the comparison results, I decided to go with MeCab). MeCab outputs information about each segmented word in the following format. Note that the delimiters are a tab (\t) and commas (why both?).

Surface form\tPart of speech,Part-of-speech subclassification 1,Part-of-speech subclassification 2,Part-of-speech subclassification 3,Conjugation type,Conjugation form,Base form,Reading,Pronunciation

For example, for the tongue twister "すもももももももものうち" ("Sumomo mo momo mo momo no uchi" -- plums and peaches are both in the peach family), the output is as follows.

| No | Surface | POS | POS subclass 1 | POS subclass 2 | POS subclass 3 | Conjugation type | Conjugation form | Base form | Reading | Pronunciation |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 1 | Plum | noun | General | * | * | * | * | Plum | Plum | Plum |
| 2 | Also | Particle | binding particle | * | * | * | * | Also | Mo | Mo |
| 3 | Peaches | noun | General | * | * | * | * | Peaches | peach | peach |
| 4 | Also | Particle | binding particle | * | * | * | * | Also | Mo | Mo |
| 5 | Peaches | noun | General | * | * | * | * | Peaches | peach | peach |
| 6 | of | Particle | Attributive | * | * | * | * | of | No | No |
| 7 | home | noun | Non-independent | Adverb-possible | * | * | * | home | Uchi | Uchi |
| 8 | EOS | | | | | | | | | |
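For reference, a single line in this tab-plus-comma format can be split with plain Python. The following is a minimal sketch (my own illustration, not from the original post) using one hypothetical MeCab output line; the field positions follow the format listed above.

```python
# A hypothetical single line of MeCab output: surface form, a tab,
# then nine comma-separated feature fields.
line = "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ"

surface, feature_str = line.split("\t")  # tab separates surface from features
features = feature_str.split(",")        # commas separate the nine features

morpheme = {
    "surface": surface,
    "pos": features[0],    # part of speech
    "pos1": features[1],   # part-of-speech subclassification 1
    "base": features[6],   # base (original) form
}
print(morpheme)
# → {'surface': 'すもも', 'pos': '名詞', 'pos1': '一般', 'base': 'すもも'}
```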

Answer

Answer program (run MeCab) [Chapter 4_ Morphological analysis.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/%E7%AC%AC4%E7%AB%A0_%20%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90.ipynb)

The following MeCab execution is the premise for all of Chapter 4.

Apply MeCab to the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" to perform morphological analysis, and save the result in a file called neko.txt.mecab

It's a simple process that ends with a single command.

```shell
mecab neko.txt -o neko.txt.mecab
```

For what it's worth, I also wrote a Python version using mecab-python3 (ver 0.996.3), shown below, but its result differs slightly from the command-line run. **Sentences were not separated by EOS (end-of-sentence) markers**, which was fatal for the subsequent knocks. I may be specifying the options badly, but I didn't want to dig deeper, so I did not use this Python program's output in the later knocks.

```python
import MeCab

mecab = MeCab.Tagger()

# Parse the whole file in one call and write the result.
# Note: produced this way, the output lacked the per-sentence EOS lines.
with open('./neko.txt') as in_file, \
     open('./neko.txt.mecab', mode='w') as out_file:
    out_file.write(mecab.parse(in_file.read()))
```

Answer program (list creation) [030. Reading morphological analysis results.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/030.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E7%B5%90%E6%9E%9C%E3%81%AE%E8%AA%AD%E3%81%BF%E8%BE%BC%E3%81%BF.ipynb)

```python
from pprint import pprint

import pandas as pd


def read_text():
    # Column indices in the MeCab output:
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # Rows whose surface form is a blank are misaligned (the blank is
    # really pos1), so exclude them.
    return df[df['pos'] != 'Blank']


df = read_text()
print(df.info())

target = []
morphemes = []

for i, row in df.iterrows():
    if row['surface'] == 'EOS' \
     and len(target) != 0:
        # End of sentence: convert the accumulated rows to a list of dicts
        morphemes.append(df.loc[target].to_dict(orient='records'))
        target = []
    else:
        target.append(i)

print(len(morphemes))
pprint(morphemes[:5])
```

Answer commentary

File reading section

The file created by MeCab is read with read_table. It is a little annoying that the delimiters are both a tab (\t) and commas (,). This is handled by passing a regular expression (an OR with |) as the sep parameter and setting engine to 'python'. After looking at the contents of the file, I set skiprows and skipfooter to drop the unneeded lines at the beginning and end.

```python
def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    return df
```
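To see what the sep='\t|,' regex actually does, here is a small sketch (my own illustration, not the original post's code) run on an inline stand-in for neko.txt.mecab:

```python
import io

import pandas as pd

# One stand-in line in MeCab's format: surface form, a tab, then nine
# comma-separated feature fields.
sample = "surface_form\tnoun,general,*,*,*,*,base_form,reading,pron\n"

# The regex sep splits on either the tab or a comma, so one read_table
# call yields ten columns; a regex sep requires engine='python'.
df = pd.read_table(io.StringIO(sample), sep='\t|,', header=None,
                   engine='python')
print(df.shape)  # (1, 10)
```

Column 0 is the surface form and column 7 the base form, which is why the answer program passes usecols=[0, 1, 2, 7].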

The resulting DataFrame holds the following information (shown as an image in the original post).

read_table problem with whitespace

It's hard to notice, but when a half-width space appears at the beginning of a line, the columns shift when reading with the read_table function: the leading space and the \t (tab) are ignored, and the first column is recognized as "symbol". I tried some trial and error, such as setting the skipinitialspace parameter, but could not solve it. It may be a pandas bug. Since it didn't matter for this task, I simply exclude the "Blank" rows. Such a row looks like this:

```
symbol,Blank,*,*,*,*, , , 
```

DataFrame information

df.info() outputs the following information for the DataFrame read from the file.

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 212143 entries, 0 to 212552
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   surface  212143 non-null  object
 1   pos      202182 non-null  object
 2   pos1     202182 non-null  object
 3   base     202182 non-null  object
dtypes: object(4)
memory usage: 8.1+ MB
None
```

Dictionary type list output

Store each morpheme in a mapping type with the surface form, base form, part of speech (pos), and part-of-speech subclassification 1 (pos1) as keys, and express one sentence as a list of morphemes (mapping types).

This builds the list of mapping types (dicts). However, **I did not use it in subsequent knocks**; it is purely Python practice (with pandas, this tedious process is unnecessary in the later knocks). Each time an EOS (end of sentence) appears, one sentence ends, so the morphemes accumulated up to that point are converted with the to_dict function.

```python
target = []
morphemes = []

for i, row in df.iterrows():
    if row['surface'] == 'EOS' \
     and len(target) != 0:
        morphemes.append(df.loc[target].to_dict(orient='records'))
        target = []
    else:
        target.append(i)
```
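For comparison, the same sentence grouping can be done without pandas by walking the MeCab output line by line. This is a minimal sketch of that idea (my own illustration, using an inline stand-in for the contents of neko.txt.mecab):

```python
# Inline stand-in for the contents of neko.txt.mecab (two sentences).
mecab_output = """\
吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ
EOS
名前\t名詞,一般,*,*,*,*,名前,ナマエ,ナマエ
EOS
"""

sentences, sentence = [], []
for line in mecab_output.splitlines():
    if line == "EOS":              # EOS marks the end of one sentence
        if sentence:
            sentences.append(sentence)
            sentence = []
        continue
    surface, feature_str = line.split("\t")
    f = feature_str.split(",")
    sentence.append({"surface": surface, "base": f[6],
                     "pos": f[0], "pos1": f[1]})

print(len(sentences))  # 2
```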

Output result (execution result)

When the program is executed, the following result is output (only the first five sentences). Incidentally, the reason "I Am a Cat" on the first line is a noun is that it is treated as the proper noun of the book title. Strictly, the opening sentence of the book should be decomposed further, but the analysis does not go that far.

Output result


```
[[{'base': 'I am a cat', 'pos': 'noun', 'pos1': 'proper noun', 'surface': 'I am a cat'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': 'name', 'pos': 'noun', 'pos1': 'General', 'surface': 'name'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'binding particle', 'surface': 'Is'},
  {'base': 'yet', 'pos': 'adverb', 'pos1': 'Particle connection', 'surface': 'yet'},
  {'base': 'No', 'pos': 'adjective', 'pos1': 'Independence', 'surface': 'No'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': None, 'pos': None, 'pos1': None, 'surface': 'EOS'},
  {'base': 'Where', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'Where'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'so'},
  {'base': 'Born', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Born'},
  {'base': 'Ta', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Ta'},
  {'base': 'Fire', 'pos': 'noun', 'pos1': 'General', 'surface': 'Katon'},
  {'base': 'When', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'When'},
  {'base': 'Register', 'pos': 'noun', 'pos1': 'Change connection', 'surface': 'Register'},
  {'base': 'But', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'But'},
  {'base': 'Tsukuri', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Tsuka'},
  {'base': 'Nu', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Nu'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': 'what', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'what'},
  {'base': 'But', 'pos': 'Particle', 'pos1': 'adverbial particle', 'surface': 'But'},
  {'base': 'dim', 'pos': 'adjective', 'pos1': 'Independence', 'surface': 'dim'},
  {'base': 'Damp', 'pos': 'adverb', 'pos1': 'General', 'surface': 'Damp'},
  {'base': 'did', 'pos': 'noun', 'pos1': 'General', 'surface': 'did'},
  {'base': 'Place', 'pos': 'noun', 'pos1': 'suffix', 'surface': 'Place'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'so'},
  {'base': 'Meow meow', 'pos': 'adverb', 'pos1': 'General', 'surface': 'Meow meow'},
  {'base': 'cry', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Crying'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'conjunctive particle', 'surface': 'hand'},
  {'base': 'What was there', 'pos': 'noun', 'pos1': 'General', 'surface': 'What was there'},
  {'base': 'Only', 'pos': 'Particle', 'pos1': 'adverbial particle', 'surface': 'Only'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'binding particle', 'surface': 'Is'},
  {'base': 'Memory', 'pos': 'noun', 'pos1': 'Change connection', 'surface': 'Memory'},
  {'base': 'To do', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Shi'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'conjunctive particle', 'surface': 'hand'},
  {'base': 'Is', 'pos': 'verb', 'pos1': 'Non-independent', 'surface': 'Is'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': 'I', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'I'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'binding particle', 'surface': 'Is'},
  {'base': 'here', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'here'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'so'},
  {'base': 'start', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'start'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'conjunctive particle', 'surface': 'hand'},
  {'base': 'Human', 'pos': 'noun', 'pos1': 'General', 'surface': 'Human'},
  {'base': 'That', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'That'},
  {'base': 'thing', 'pos': 'noun', 'pos1': 'Non-independent', 'surface': 'thing'},
  {'base': 'To', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'To'},
  {'base': 'to see', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'You see'},
  {'base': 'Ta', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Ta'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}]]
```
