Language Processing 100 Knocks 2015, ["Chapter 4: Morphological analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch4), problem 34: "A no B". This time the challenge involves combining information across rows, so it is not as easy as the earlier problems, and I am not good at pandas or SQL. Still, it is not difficult, because it is just loop processing.
Link | Remarks |
---|---|
034."B of A".ipynb | Answer program GitHub link |
100 amateur language processing knocks:34 | Copy and paste source of many source parts |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
In the above environment, I use the following additional Python package. It can be installed with a plain pip install.
type | version |
---|---|
pandas | 1.0.1 |
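To double-check that the environment matches the tables above, a quick version check can be run (just a sketch; the printed strings depend on your own installation):

```python
import sys
import pandas as pd

# Print the interpreter and pandas versions; this article assumes
# Python 3.8.1 and pandas 1.0.1.
print(sys.version)
print(pd.__version__)
```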
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" with MeCab and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Extract noun phrases in which two nouns are connected by "の" (no).
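The answer program assumes that neko.txt.mecab already exists. One way to create it, as a minimal sketch (assuming neko.txt is in the current directory and the mecab command installed via apt-get is on PATH), is to call the MeCab CLI from Python:

```python
import subprocess

# Run the mecab CLI over neko.txt and write its default-format output
# to neko.txt.mecab. Assumes the mecab binary from apt-get is on PATH
# and neko.txt is in the current directory.
subprocess.run(['mecab', 'neko.txt', '-o', 'neko.txt.mecab'], check=True)
```

Running `mecab neko.txt -o neko.txt.mecab` directly in a shell does the same thing. With that file in place, the full answer program is below.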
```python
import pandas as pd


def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subdivision 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # EOS rows, symbols, and blank lines are left in
    return df


df = read_text()

POS_TARGET = '名詞'  # noun

for index in df['surface'].index:
    # No special logic for the first and last rows
    if df['surface'][index] == 'の' \
       and df['pos'][index-1] == POS_TARGET \
       and df['pos'][index+1] == POS_TARGET:
        print(index, '\t', df['surface'][index-1] + 'の' + df['surface'][index+1])

    # Limit the output because there are so many matches
    if index > 2000:
        break
```
Unlike the previous knock, EOS rows, symbols, and blank lines are not removed. This is because I wanted the "A の B" check to require that the three tokens are truly consecutive, with sentence-end markers (EOS) and symbols still in place.
```python
def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subdivision 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # EOS rows, symbols, and blank lines are left in
    return df
```
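For reference, each line of MeCab's default output is the surface form, a tab, and then a comma-separated feature list (part of speech, part-of-speech subdivisions, conjugation type and form, base form, reading, pronunciation). Splitting on either tab or comma, which is what `sep='\t|,'` does, therefore puts the base form in column 7. A small illustration (the sample line only shows the format and is not taken from neko.txt.mecab):

```python
# Default MeCab output:
# surface \t pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pronunciation
line = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'

# Splitting on tab or comma yields the same columns that read_text() selects.
fields = line.replace('\t', ',').split(',')
print(fields[0], fields[1], fields[2], fields[7])  # surface, pos, pos1, base
```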
Loop over the index of the pandas Series. For each row whose surface form is the particle "の", check whether the rows immediately before and after are nouns.
```python
for index in df['surface'].index:
    # No special logic for the first and last rows
    if df['surface'][index] == 'の' \
       and df['pos'][index-1] == POS_TARGET \
       and df['pos'][index+1] == POS_TARGET:
        print(index, '\t', df['surface'][index-1] + 'の' + df['surface'][index+1])
```
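Row-by-row indexing is easy to follow, but as a side note the same adjacency check can be written in a vectorized way with `shift()`. This is only a sketch of an alternative, not the post's answer:

```python
# Vectorized variant: mark rows whose surface is 'の' and whose
# neighbouring rows are both nouns, then build the "AのB" strings.
mask = (df['surface'] == 'の') \
    & (df['pos'].shift(1) == POS_TARGET) \
    & (df['pos'].shift(-1) == POS_TARGET)
phrases = df['surface'].shift(1)[mask] + 'の' + df['surface'].shift(-1)[mask]
print(phrases.head(20))
```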
When the program is executed, the following results are output. Since there are so many matches, the loop stops after index 2000, so only the beginning of the output is shown.
Output result
118 his palm
144 on the palm
151 Student's face
197 Face that should be
235 in the middle of the face
248 In the hole
292 The palm of the student
294 Behind the palm
382 what
421 Essential mother
478 on the straw
484 Inside Sasahara
498 Finally thoughts
516 In front of the pond
658 Finally thing
729 Thanks to Kazuki
742 Hedge hole
752 Neighboring calico cat
758 o'clock passage
806 Momentary grace
842 Inside the house
858 His student
861 Other than humans
892 Previous student
958 Your chance
1029 San no San
1046 Chest itching
1068 Housekeeper
1089 Master
1121 under the nose
1130 My face
1192 My home
1208 My master
1249 Home stuff
1281 of which
1300 his study
1326 On the books
1341 skin color
1402 Above
1411 his every night
1516 Other than
1588 Beside my husband
1608 his knee
1610 on the lap
1659 Experience
1665 On the rice bowl
1671 On the Kotatsu
1700 out of here
1702 Our small companion
1704 Bed of small companion
1747 middle of them
1773 One of the small companions
1826 nerves
1830 Sexual Master
1839 Next room
1919 Selfish
1953 For me
2000 between the kitchen boards