Language Processing 100 Knocks 2015, ["Chapter 4: Morphological analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch4), problem 34: "A no B". This time the challenge involves combining information across rows, so it is not as easy as the earlier problems, and I am not good at pandas or SQL. Still, it is not difficult, because it is just loop processing.
Link | Remarks |
---|---|
034."B of A".ipynb | Answer program GitHub link |
100 amateur language processing knocks:34 | Copy and paste source of many source parts |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
In the above environment, I use the following additional Python package. It can be installed with a plain pip install.
type | version |
---|---|
pandas | 1.0.1 |
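To double-check that the environment matches the tables above, a quick version check can be run (just a sketch; the printed strings depend on your own installation):

```python
import sys
import pandas as pd

# Print the interpreter and pandas versions; this article assumes
# Python 3.8.1 and pandas 1.0.1.
print(sys.version)
print(pd.__version__)
```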
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" with MeCab and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Extract noun phrases in which two nouns are connected by "の" (no).
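The answer program assumes that neko.txt.mecab already exists. One way to create it, as a minimal sketch (assuming neko.txt is in the current directory and the mecab command installed via apt-get is on PATH), is to call the MeCab CLI from Python:

```python
import subprocess

# Run the mecab CLI over neko.txt and write its default-format output
# to neko.txt.mecab. Assumes the mecab binary from apt-get is on PATH
# and neko.txt is in the current directory.
subprocess.run(['mecab', 'neko.txt', '-o', 'neko.txt.mecab'], check=True)
```

Running `mecab neko.txt -o neko.txt.mecab` directly in a shell does the same thing. With that file in place, the full answer program is below.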
```python
import pandas as pd


def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subdivision 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # EOS rows, symbols, and blank lines are left in
    return df


df = read_text()

POS_TARGET = '名詞'  # noun

for index in df['surface'].index:
    # No special logic for the first and last rows
    if df['surface'][index] == 'の' \
       and df['pos'][index-1] == POS_TARGET \
       and df['pos'][index+1] == POS_TARGET:
        print(index, '\t', df['surface'][index-1] + 'の' + df['surface'][index+1])

    # Limit the output because there are so many matches
    if index > 2000:
        break
```
Unlike the previous knock, EOS rows, symbols, and blank lines are not removed. This is because I wanted the "A の B" check to require that the three tokens are truly consecutive, with sentence-end markers (EOS) and symbols still in place.
```python
def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subdivision 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # EOS rows, symbols, and blank lines are left in
    return df
```
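For reference, each line of MeCab's default output is the surface form, a tab, and then a comma-separated feature list (part of speech, part-of-speech subdivisions, conjugation type and form, base form, reading, pronunciation). Splitting on either tab or comma, which is what `sep='\t|,'` does, therefore puts the base form in column 7. A small illustration (the sample line only shows the format and is not taken from neko.txt.mecab):

```python
# Default MeCab output:
# surface \t pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pronunciation
line = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'

# Splitting on tab or comma yields the same columns that read_text() selects.
fields = line.replace('\t', ',').split(',')
print(fields[0], fields[1], fields[2], fields[7])  # surface, pos, pos1, base
```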
Loop over the index of the pandas Series. For each row whose surface form is the particle "の", check whether the rows immediately before and after are nouns.
```python
for index in df['surface'].index:
    # No special logic for the first and last rows
    if df['surface'][index] == 'の' \
       and df['pos'][index-1] == POS_TARGET \
       and df['pos'][index+1] == POS_TARGET:
        print(index, '\t', df['surface'][index-1] + 'の' + df['surface'][index+1])
```
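Row-by-row indexing is easy to follow, but as a side note the same adjacency check can be written in a vectorized way with `shift()`. This is only a sketch of an alternative, not the post's answer:

```python
# Vectorized variant: mark rows whose surface is 'の' and whose
# neighbouring rows are both nouns, then build the "AのB" strings.
mask = (df['surface'] == 'の') \
    & (df['pos'].shift(1) == POS_TARGET) \
    & (df['pos'].shift(-1) == POS_TARGET)
phrases = df['surface'].shift(1)[mask] + 'の' + df['surface'].shift(-1)[mask]
print(phrases.head(20))
```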
When the program is executed, the following results are output. Since there are so many matches, the loop stops after index 2000, so only the beginning of the output is shown.
Output result
118 his palm
144 on the palm
151 Student's face
197 Face that should be
235 in the middle of the face
248 In the hole
292 The palm of the student
294 Behind the palm
382 what
421 Essential mother
478 on the straw
484 Inside Sasahara
498 Finally thoughts
516 In front of the pond
658 Finally thing
729 Thanks to Kazuki
742 Hedge hole
752 Neighboring calico cat
758 o'clock passage
806 Momentary grace
842 Inside the house
858 His student
861 Other than humans
892 Previous student
958 Your chance
1029 San no San
1046 Chest itching
1068 Housekeeper
1089 Master
1121 under the nose
1130 My face
1192 My home
1208 My master
1249 Home stuff
1281 of which
1300 his study
1326 On the books
1341 skin color
1402 Above
1411 his every night
1516 Other than
1588 Beside my husband
1608 his knee
1610 on the lap
1659 Experience
1665 On the rice bowl
1671 On the Kotatsu
1700 out of here
1702 Our small companion
1704 Bed of small companion
1747 middle of them
1773 One of the small companions
1826 nerves
1830 Sexual Master
1839 Next room
1919 Selfish
1953 For me
2000 between the kitchen boards