This is the record of the 35th problem, "Noun Concatenation", from "Chapter 4: Morphological Analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). Like last time, this is something pandas cannot handle easily. However, the "noun concatenation" part itself is not a difficult process, about 10 lines of code.
Link | Remarks
---|---
035.Noun articulation.ipynb | GitHub link to the answer program
100 amateur language processing knocks:35 | Source from which many parts were copied
MeCab Official | The first MeCab page to look at
Type | Version | Contents
---|---|---
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv
MeCab | 0.996-5 | Installed with apt-get
In the above environment, I use the following additional Python package. Install it with regular pip.

Type | Version
---|---
pandas | 1.0.1
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Apply morphological analysis to the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) using MeCab, and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.
```python
import pandas as pd

def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    return df[df['pos'] == 'noun']

df = read_text()

nouns = []
for index in df['surface'].index:
    nouns.append(df['surface'][index])
    # The noun run ends if the next index is missing
    if (index + 1) not in df.index:
        # Output only runs of multiple nouns (concatenations)
        if len(nouns) > 1:
            print(len(nouns), '\t', index, '\t', ''.join(nouns))
        nouns = []
    # Limit the output because there are many matches
    if index > 2000:
        break
```
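As an aside, the same longest-match idea can also be sketched without relying on index gaps, for example with `itertools.groupby` over consecutive part-of-speech flags. This is not the approach used in the answer above; it is a minimal sketch assuming a hypothetical list of `(surface, pos)` tuples:

```python
from itertools import groupby

# Hypothetical token list of (surface, pos) pairs, as if read from neko.txt.mecab
tokens = [('I', 'noun'), ('am', 'verb'), ('cat', 'noun'), ('lover', 'noun'),
          ('and', 'particle'), ('dog', 'noun'), ('person', 'noun'), ('fan', 'noun')]

# groupby collects maximal runs of consecutive tokens with the same key,
# which is exactly the "longest match" requirement
concatenations = []
for is_noun, group in groupby(tokens, key=lambda t: t[1] == 'noun'):
    run = [surface for surface, _ in group]
    if is_noun and len(run) > 1:  # keep only runs of two or more nouns
        concatenations.append(''.join(run))

print(concatenations)  # ['catlover', 'dogpersonfan']
```

Because `groupby` only groups adjacent elements, each group is automatically the longest possible run, so no explicit boundary check is needed.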
This time only nouns are needed, so the entries are narrowed down to nouns immediately after reading the file.
```python
def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    return df[df['pos'] == 'noun']
```
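To see why `sep='\t|,'` yields those column positions: a MeCab output line is the surface form, a tab, and then comma-separated features. Splitting on either delimiter puts the surface form at index 0, the part of speech at index 1, its first subclassification at index 2, and the base form at index 7. A small sketch with a representative (illustrative) line:

```python
import re

# A representative MeCab output line: surface form, tab, comma-separated features
line = '猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ'

# Split on tab or comma, mirroring sep='\t|,' in read_table
fields = re.split(r'\t|,', line)

surface, pos, pos1, base = fields[0], fields[1], fields[2], fields[7]
print(surface, pos, pos1, base)  # 猫 名詞 一般 猫
```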
The rest is loop processing. The end of a run is judged by whether the index of the next entry is the current index plus one. A noun concatenation is output only when the list holds more than one noun at the end of a run.
```python
nouns = []
for index in df['surface'].index:
    nouns.append(df['surface'][index])
    # The noun run ends if the next index is missing
    if (index + 1) not in df.index:
        # Output only runs of multiple nouns (concatenations)
        if len(nouns) > 1:
            print(len(nouns), '\t', index, '\t', ''.join(nouns))
        nouns = []
```
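The index-gap test works because filtering with `df[df['pos'] == 'noun']` keeps the original row numbers, so any non-noun between two nouns leaves a hole in the index. A toy example with hypothetical data:

```python
import pandas as pd

# Toy DataFrame: rows 2 and 5 are non-nouns and get filtered out,
# leaving gaps in the original integer index (hypothetical data)
df_all = pd.DataFrame({
    'surface': ['I', 'cat', 'ran', 'noun1', 'noun2', 'fast'],
    'pos':     ['noun', 'noun', 'verb', 'noun', 'noun', 'adverb'],
})
df = df_all[df_all['pos'] == 'noun']  # remaining index: 0, 1, 3, 4

runs = []
nouns = []
for index in df.index:
    nouns.append(df['surface'][index])
    if (index + 1) not in df.index:  # a gap means the run of consecutive nouns ends
        if len(nouns) > 1:
            runs.append(''.join(nouns))
        nouns = []

print(runs)  # ['Icat', 'noun1noun2']
```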
Executing the full program above produces the following output.
Output result:

```
2 28
2 66 Human
2 69 The worst
2 172
2 190 Hair
2 209 Then the cat
2 222 once
2 688 House
2 860 Other than students
3 1001
2 1028 The other day
2 1031 Mima
2 1106 Midaidokoro
2 1150 as it is
2 1235 All-day study
2 1255 studyer
2 1266 Studyer
2 1288 Diligent
3 1392 Page 23
2 1515 Other than the master
2 1581 As long as I am
2 1599 Morning master
2 1690 Most mindful
2 1733 One floor
2 1781 Last hard
3 1829 Neurogastric weakness
2 1913 Language break
2 1961
3 1965 Total family
```