[PYTHON] 100 Language Processing Knock-34 (using pandas): "B of A"

This is my record of knock 34, "B of A", from ["Chapter 4: Morphological analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch4) of the 2015 edition of 100 Language Processing Knocks. This time the task combines information across rows, so it is not as easy as the earlier ones. I'm not strong at this kind of thing in pandas or SQL. Still, it is not difficult, because it comes down to a simple loop.

Reference link

Link Remarks
034."B of A".ipynb Answer program GitHub link
100 amateur language processing knocks:34 Copy and paste source of many source parts
MeCab Official The first MeCab page to look at

Environment

| Type | Version | Notes |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes work with multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| Type | Version |
|:--|:--|
| pandas | 1.0.1 |

Chapter 4: Morphological analysis

content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Run morphological analysis with MeCab on the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

34. "B of A"

Extract noun phrases in which two nouns are joined by "の" ("no", roughly "of").

Answer

Answer Program [034. "A B" .ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0 % E8% A7% A3% E6% 9E% 90/034.% E3% 80% 8CA% E3% 81% AEB% E3% 80% 8D.ipynb)

import pandas as pd

def read_text():
    # 0: Surface form (surface)
    # 1: Part of speech (pos)
    # 2: Part-of-speech subclassification 1 (pos1)
    # 7: Base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None, 
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'], 
                       skiprows=4, skipfooter=1, engine='python')
    # EOS rows, symbols, and blanks are deliberately left in
    return df

df = read_text()

# MeCab output is Japanese: nouns are tagged '名詞' and the particle is 'の'
POS_TARGET = '名詞'

for index in df['surface'].index:
    
    # No special handling for the first and last rows
    if df['surface'][index] == 'の' \
     and df['pos'][index-1] == POS_TARGET \
     and df['pos'][index+1] == POS_TARGET:
        print(index, '\t', df['surface'][index-1] + 'の' + df['surface'][index+1])
    
    # Stop early because there are many matches
    if index > 2000:
        break

Answer commentary

Read file

Unlike the previous knocks, the EOS, symbol, and blank rows are not removed. I wanted "B of A" to match only when the three tokens are truly consecutive, so the rows that mark sentence ends and symbols have to stay in the data.
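To make the column choice in `read_text()` concrete, here is a minimal sketch that applies the same `'\t|,'` pattern to a single line using only the standard library. The sample line is an illustrative assumption of what MeCab's IPA-dictionary output looks like, not a line copied from neko.txt.mecab.

```python
import re

# Illustrative MeCab-style line (assumed): surface form, a tab,
# then comma-separated features.
line = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'

# Same pattern as sep='\t|,' in read_text(): split on a tab or a comma.
fields = re.split('\t|,', line)

# Keep the positions that usecols=[0, 1, 2, 7] selects.
names = ['surface', 'pos', 'pos1', 'base']
record = dict(zip(names, (fields[i] for i in [0, 1, 2, 7])))

print(len(fields))
print(record)
```

Position 0 is the text before the tab, and positions 1 onward are the comma-separated features, which is why the base form ends up at position 7.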

python


def read_text():
    # 0: Surface form (surface)
    # 1: Part of speech (pos)
    # 2: Part-of-speech subclassification 1 (pos1)
    # 7: Base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None, 
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'], 
                       skiprows=4, skipfooter=1, engine='python')
    # EOS rows, symbols, and blanks are deliberately left in
    return df

"B of A" judgment

Loop over the index of the pandas Series. For each row whose surface form is "の", check whether the rows immediately before and after it are nouns.

python


for index in df['surface'].index:
    
    # No special handling for the first and last rows
    if df['surface'][index] == 'の' \
     and df['pos'][index-1] == POS_TARGET \
     and df['pos'][index+1] == POS_TARGET:
        print(index, '\t', df['surface'][index-1] + 'の' + df['surface'][index+1])
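As an alternative to the explicit loop, the same condition could probably be written with `Series.shift`, which lines up each row with its neighbours. This is only a sketch on a small hand-made frame (`df_demo` is an assumed stand-in for the real data), not the method used in the answer program.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by read_text().
df_demo = pd.DataFrame({
    'surface': ['彼', 'の', '掌', 'EOS', 'の', '猫'],
    'pos': ['名詞', '助詞', '名詞', None, '助詞', '名詞'],
})

# shift(1) brings the previous row's value down; shift(-1) brings the next row's value up.
is_no = df_demo['surface'].eq('の')
prev_is_noun = df_demo['pos'].shift(1).eq('名詞')
next_is_noun = df_demo['pos'].shift(-1).eq('名詞')
mask = is_no & prev_is_noun & next_is_noun

# Build the phrases only for the rows that matched.
hit_index = df_demo[mask].index
phrases = [df_demo.at[i - 1, 'surface'] + 'の' + df_demo.at[i + 1, 'surface']
           for i in hit_index]

for i, phrase in zip(hit_index, phrases):
    print(i, '\t', phrase)
```

Because shifting leaves a missing value at the first and last rows, those rows simply fail the noun comparison, so no separate boundary handling is needed in this version.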

Output result (execution result)

Running the program prints the following results (the phrases are shown here in English translation). Because there are so many matches, the loop stops shortly after row 2000.

Output result


118 his palm
144 on the palm
151 Student's face
197 Face that should be
235 in the middle of the face
248 In the hole
292 The palm of the student
294 Behind the palm
382 what
421 Essential mother
478 on the straw
484 Inside Sasahara
498 Finally thoughts
516 In front of the pond
658 Finally thing
729 Thanks to Kazuki
742 Hedge hole
752 Neighboring calico cat
758 o'clock passage
806 Momentary grace
842 Inside the house
858 His student
861 Humans other than
892 Previous student
958 Your chance
1029 San no San
1046 Chest itching
1068 Housekeeper
1089 Master
1121 under the nose
1130 My face
1192 My home
1208 My master
1249 Home stuff
1281 of which
1300 his study
1326 On books
1341 skin color
1402 Above
1411 his every night
1516 Other than
1588 Beside my husband
1608 his knee
1610 on the lap
1659 Experience
1665 On the rice bowl
1671 On the Kotatsu
1700 out of here
1702 Our small companion
1704 Bed of small companion
1747 middle of them
1773 One of the small companions
1826 nerves
1830 Sexual Master
1839 Next room
1919 Selfish
1953 For me
2000 between the kitchen boards
