This is the record of the 35th problem, "Noun Concatenation", from "Chapter 4: Morphological Analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). Like last time, this is something pandas cannot handle easily. However, the "noun concatenation" part itself is not a difficult process, about 10 lines of code.
Link | Remarks
---|---
035.Noun articulation.ipynb | GitHub link to the answer program
100 amateur language processing knocks:35 | Source from which many parts were copied
MeCab Official | The first MeCab page to look at
Type | Version | Contents
---|---|---
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv
MeCab | 0.996-5 | Installed with apt-get
In the above environment, I use the following additional Python package. Install it with regular pip.

Type | Version
---|---
pandas | 1.0.1
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Apply morphological analysis to the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) using MeCab, and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Extract the concatenation of nouns (nouns that appear consecutively) with the longest match.
```python
import pandas as pd

def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    return df[df['pos'] == 'noun']

df = read_text()

nouns = []
for index in df['surface'].index:
    nouns.append(df['surface'][index])
    # The noun run ends if the next index is missing
    if (index + 1) not in df.index:
        # Output only runs of multiple nouns (concatenations)
        if len(nouns) > 1:
            print(len(nouns), '\t', index, '\t', ''.join(nouns))
        nouns = []
    # Limit the output because there are many matches
    if index > 2000:
        break
```
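As an aside, the same longest-match idea can also be sketched without relying on index gaps, for example with `itertools.groupby` over consecutive part-of-speech flags. This is not the approach used in the answer above; it is a minimal sketch assuming a hypothetical list of `(surface, pos)` tuples:

```python
from itertools import groupby

# Hypothetical token list of (surface, pos) pairs, as if read from neko.txt.mecab
tokens = [('I', 'noun'), ('am', 'verb'), ('cat', 'noun'), ('lover', 'noun'),
          ('and', 'particle'), ('dog', 'noun'), ('person', 'noun'), ('fan', 'noun')]

# groupby collects maximal runs of consecutive tokens with the same key,
# which is exactly the "longest match" requirement
concatenations = []
for is_noun, group in groupby(tokens, key=lambda t: t[1] == 'noun'):
    run = [surface for surface, _ in group]
    if is_noun and len(run) > 1:  # keep only runs of two or more nouns
        concatenations.append(''.join(run))

print(concatenations)  # ['catlover', 'dogpersonfan']
```

Because `groupby` only groups adjacent elements, each group is automatically the longest possible run, so no explicit boundary check is needed.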
This time only nouns are needed, so the entries are narrowed down to nouns immediately after reading the file.
```python
def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    return df[df['pos'] == 'noun']
```
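To see why `sep='\t|,'` yields those column positions: a MeCab output line is the surface form, a tab, and then comma-separated features. Splitting on either delimiter puts the surface form at index 0, the part of speech at index 1, its first subclassification at index 2, and the base form at index 7. A small sketch with a representative (illustrative) line:

```python
import re

# A representative MeCab output line: surface form, tab, comma-separated features
line = '猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ'

# Split on tab or comma, mirroring sep='\t|,' in read_table
fields = re.split(r'\t|,', line)

surface, pos, pos1, base = fields[0], fields[1], fields[2], fields[7]
print(surface, pos, pos1, base)  # 猫 名詞 一般 猫
```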
The rest is loop processing. The end of a run is judged by whether the index of the next entry is the current index plus one. A noun concatenation is output only when the list holds more than one noun at the end of a run.
```python
nouns = []
for index in df['surface'].index:
    nouns.append(df['surface'][index])
    # The noun run ends if the next index is missing
    if (index + 1) not in df.index:
        # Output only runs of multiple nouns (concatenations)
        if len(nouns) > 1:
            print(len(nouns), '\t', index, '\t', ''.join(nouns))
        nouns = []
```
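The index-gap test works because filtering with `df[df['pos'] == 'noun']` keeps the original row numbers, so any non-noun between two nouns leaves a hole in the index. A toy example with hypothetical data:

```python
import pandas as pd

# Toy DataFrame: rows 2 and 5 are non-nouns and get filtered out,
# leaving gaps in the original integer index (hypothetical data)
df_all = pd.DataFrame({
    'surface': ['I', 'cat', 'ran', 'noun1', 'noun2', 'fast'],
    'pos':     ['noun', 'noun', 'verb', 'noun', 'noun', 'adverb'],
})
df = df_all[df_all['pos'] == 'noun']  # remaining index: 0, 1, 3, 4

runs = []
nouns = []
for index in df.index:
    nouns.append(df['surface'][index])
    if (index + 1) not in df.index:  # a gap means the run of consecutive nouns ends
        if len(nouns) > 1:
            runs.append(''.join(nouns))
        nouns = []

print(runs)  # ['Icat', 'noun1noun2']
```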
Executing the full program above produces the following output.
Output result:

```
2 28
2 66 Human
2 69 The worst
2 172
2 190 Hair
2 209 Then the cat
2 222 once
2 688 House
2 860 Other than students
3 1001
2 1028 The other day
2 1031 Mima
2 1106 Midaidokoro
2 1150 as it is
2 1235 All-day study
2 1255 studyer
2 1266 Studyer
2 1288 Diligent
3 1392 Page 23
2 1515 Other than the master
2 1581 As long as I am
2 1599 Morning master
2 1690 Most mindful
2 1733 One floor
2 1781 Last hard
3 1829 Neurogastric weakness
2 1913 Language break
2 1961
3 1965 Total family
```