This is a record of problem 40, "Reading the dependency analysis result (morpheme)", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of Language Processing 100 Knocks 2015. Chapter 5, which starts here, is troublesome as a whole because building the algorithms takes time, and it feels like the first real hurdle of the 100 knocks. This problem is more of a warm-up and is not very difficult. About the only new thing is that a class is used for the first time in the 100 knocks.
Link | Remarks |
---|---|
040.Reading the dependency analysis result (morpheme).ipynb | GitHub link to the answer program |
100 amateur language processing knocks:40 | Source from which much of the code was copied |
CaboCha official | CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how. Since neither package has been updated in a long time, I have not rebuilt the environment. All I remember is being frustrated when I tried to use CaboCha on Windows. I believe I could not get it to work on 64-bit Windows (my memory is vague, and the problem may have been my own lack of skill).
Type | Version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv. Packages are managed with venv |
Mecab | 0.996-5 | Installed with apt-get |
CRF++ | 0.58 | Installed too long ago to remember how (probably `make install`) |
CaboCha | 0.69 | Installed too long ago to remember how (probably `make install`) |
Apply the dependency analyzer CaboCha to "I am a cat" and get hands-on experience working with dependency trees and syntactic analysis.
Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Constructions, Dependency Path, [Graphviz](http://www.graphviz.org/)
Use CaboCha to run dependency analysis on the text of Natsume Soseki's novel "I am a cat" (neko.txt), and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
Implement the class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1) as member variables. In addition, read the analysis result of CaboCha (neko.txt.cabocha), express each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.
"Dependency" is a relationship between clauses. I touched on it briefly in a previous article, "[Play] Syntactic analysis of Shinkalion's ton demo mail"; dependency analysis lets you make the relationships between the clauses of a text explicit.
First, perform dependency analysis with CaboCha.
cabocha -f1 "../04.Morphological analysis/neko.txt" -o neko.txt.cabocha
The execution result is as follows. Dependency information is added to the result of MeCab. A line beginning with `*`, such as `* 0 -1D 0/0 0.000000` on the first line, carries the dependency information: the `0` in the third character position is the clause number, and the `-1` that follows is the number of the clause it depends on. Here `-1` means the clause depends on nothing, so this first line is a poor example.
text:Partial excerpt from neko.txt.cabocha
* 0 -1D 0/0 0.000000
One noun,number,*,*,*,*,one,Ichi,Ichi
EOS
EOS
* 0 -1D 1/1 0.000000
symbol,Blank,*,*,*,*, , ,
I am a cat noun,Proper noun,General,*,*,*,I am a cat,Wagamama High Spec,Wagamama High Spec
.. symbol,Kuten,*,*,*,*,。,。,。
EOS
* 0 2D 0/1 -1.911675
Name noun,General,*,*,*,*,name,Namae,Namae
Is a particle,Particle,*,*,*,*,Is,C,Wow
* 1 2D 0/0 -1.911675
Still adverb,Particle connection,*,*,*,*,yet,Mada,Mada
* 2 -1D 0/0 0.000000
No adjective,Independence,*,*,Adjective, Auoudan,Uninflected word,No,Nai,Nai
.. symbol,Kuten,*,*,*,*,。,。,。
EOS
EOS
* 0 1D 1/2 1.504358
symbol,Blank,*,*,*,*, , ,
Where noun,Pronoun,General,*,*,*,Where,Doco,Doco
Particles,Case particles,General,*,*,*,so,De,De
* 1 2D 0/1 1.076607
Born verb,Independence,*,*,One step,Continuous form,Born,Umale,Umale
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
* 2 4D 0/1 -0.197109
Katon noun,General,*,*,*,*,Fire,Katong,Katong
And particles,Case particles,General,*,*,*,When,To,To
* 3 4D 0/1 -0.197109
Register noun,Change connection,*,*,*,*,Register,Kentou,Kento
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
* 4 -1D 0/1 0.000000
Tsuka verb,Independence,*,*,Five-dan / Ka line,Imperfective form,Tsukuri,Tsuka,Tsuka
Nu auxiliary verb,*,*,*,Special,Uninflected word,Nu,Nu,Nu
.. symbol,Kuten,*,*,*,*,。,。,。
EOS
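The clause header lines in the excerpt above can be pulled apart with a small regular expression. This is only a sketch: the function name `parse_chunk_header` and the dictionary keys are hypothetical names chosen for illustration, and the field layout (clause number, destination followed by `D`, head/function-word positions, score) is assumed from the lines shown above.

```python
import re

# Pattern for a clause header line such as "* 0 2D 0/1 -1.911675".
# The field layout is an assumption based on the excerpt above.
CHUNK_HEADER = re.compile(r'^\* (\d+) (-?\d+)D (\d+)/(\d+) (-?\d+\.\d+)$')


def parse_chunk_header(line):
    """Return the header fields as a dict, or None for any other line."""
    m = CHUNK_HEADER.match(line.rstrip('\n'))
    if m is None:
        return None
    return {
        'index': int(m.group(1)),    # clause number
        'dst': int(m.group(2)),      # clause it depends on (-1: none)
        'head': int(m.group(3)),     # position of the head word
        'func': int(m.group(4)),     # position of the function word
        'score': float(m.group(5)),  # dependency score
    }


print(parse_chunk_header('* 0 2D 0/1 -1.911675\n'))
# {'index': 0, 'dst': 2, 'head': 0, 'func': 1, 'score': -1.911675}
```

Morpheme lines and `EOS` lines do not match the pattern, so the function returns `None` for them.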
So, this is the main Python program.
import re

morphs = []
sentences = []

# Delimiter: tab or comma
separator = re.compile('\t|,')

# Lines to exclude
exclude = re.compile(r'''EOS\n      # EOS followed by a line feed
                         |          # OR
                         \*\s\d+\s  # '*', whitespace, one or more digits, whitespace
                         ''', re.VERBOSE)


class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0]  # Surface form (surface)
        self.base = cols[7]     # Base form (base)
        self.pos = cols[1]      # Part of speech (pos)
        self.pos1 = cols[2]     # Part-of-speech subclassification 1 (pos1)


with open('./neko.txt.cabocha') as f:
    for line in f:
        # Every line that is neither EOS nor a clause header is a morpheme
        if not exclude.match(line):
            morphs.append(Morph(line))
        # EOS closes a sentence; skip empty sentences
        if line == 'EOS\n' and len(morphs) > 0:
            sentences.append(morphs)
            morphs = []

# Display the morphemes of the third sentence
for morph in sentences[2]:
    print(morph.__dict__)
I use the regular expressions I learned in Chapter 2, as practice. `separator` is the delimiter for the morphological analysis result, and `exclude` is a regular expression that excludes the EOS lines and the dependency information lines. For more information on regular expressions, see the article "Basics and Tips for Python Regular Expressions Learned from Zero".
python
# Delimiter: tab or comma
separator = re.compile('\t|,')

# Lines to exclude
exclude = re.compile(r'''EOS\n      # EOS followed by a line feed
                         |          # OR
                         \*\s\d+\s  # '*', whitespace, one or more digits, whitespace
                         ''', re.VERBOSE)
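To see what the two patterns do, here is a small sketch that applies them to three sample lines. The sample lines are hypothetical, written in the shape of CaboCha's `-f1` output in the original Japanese; nothing is read from neko.txt.cabocha.

```python
import re

separator = re.compile('\t|,')
exclude = re.compile(r'''EOS\n      # EOS followed by a line feed
                         |          # OR
                         \*\s\d+\s  # '*', whitespace, digits, whitespace
                         ''', re.VERBOSE)

# Hypothetical sample lines (assumed shape of the -f1 output)
lines = [
    '* 0 2D 0/1 -1.911675\n',
    '名前\t名詞,一般,*,*,*,*,名前,ナマエ,ナマエ\n',
    'EOS\n',
]

# Only the morpheme line survives the filter
kept = [line for line in lines if not exclude.match(line)]
print(len(kept))  # 1

# Splitting on tabs and commas yields ten columns
cols = separator.split(kept[0])
print(cols[0], cols[1], cols[2], cols[7])  # 名前 名詞 一般 名前
```

Note that the last column still carries the trailing line feed, because the line is split without being stripped first. That does not matter here since only columns 0, 1, 2, and 7 are used.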
This is the first class to appear in the 100 knocks. `__init__` is the constructor, which is called when an instance is created. It receives a whole line of the morphological analysis result, splits it on tabs and commas, and stores the fields in instance variables.
python
class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0]  # Surface form (surface)
        self.base = cols[7]     # Base form (base)
        self.pos = cols[1]      # Part of speech (pos)
        self.pos1 = cols[2]     # Part-of-speech subclassification 1 (pos1)
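The column indices in the constructor are easier to follow when the columns have names. The sketch below is a variant of the class above, not the original program: the `COLUMNS` list is a hypothetical helper, its order is assumed from MeCab's default (IPA dictionary) output format, and the sample line is made up in that shape.

```python
import re

separator = re.compile('\t|,')

# Column order assumed from MeCab's default (IPA dictionary) output format
COLUMNS = ['surface', 'pos', 'pos1', 'pos2', 'pos3',
           'conj_type', 'conj_form', 'base', 'reading', 'pronunciation']


class Morph:
    def __init__(self, line):
        # Strip the line feed, split on tabs and commas, then name the columns
        cols = separator.split(line.rstrip('\n'))
        fields = dict(zip(COLUMNS, cols))

        self.surface = fields['surface']  # same as cols[0]
        self.base = fields['base']        # same as cols[7]
        self.pos = fields['pos']          # same as cols[1]
        self.pos1 = fields['pos1']        # same as cols[2]


# Hypothetical sample line: an inflected verb, so surface and base differ
m = Morph('生まれ\t動詞,自立,*,*,一段,連用形,生まれる,ウマレ,ウマレ\n')
print(m.surface, m.base, m.pos, m.pos1)  # 生まれ 生まれる 動詞 自立
```

An inflected verb makes a good test case because it shows that `base` really comes from a different column than `surface`.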
Referencing `__dict__` returns the instance variables as a dictionary. I didn't know about it, but it's convenient.
python
# Each element of sentences[2] is a Morph object
for morph in sentences[2]:
    print(morph.__dict__)
When the program is executed, the following results will be output.
Output result
{'surface': 'name', 'base': 'name', 'pos': 'noun', 'pos1': 'General'}
{'surface': 'Is', 'base': 'Is', 'pos': 'Particle', 'pos1': 'Binding particle'}
{'surface': 'yet', 'base': 'yet', 'pos': 'adverb', 'pos1': 'Particle connection'}
{'surface': 'No', 'base': 'No', 'pos': 'adjective', 'pos1': 'Independence'}
{'surface': '。', 'base': '。', 'pos': 'symbol', 'pos1': 'Kuten'}
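As a side note on `__dict__`: the built-in `vars()` returns the same dictionary, and a `__repr__` method built on it makes `print` show the fields directly. The sketch below uses a simplified, hypothetical `Morph` whose constructor takes the four fields as arguments instead of parsing a line.

```python
class Morph:
    # Simplified constructor for illustration (not the one used above)
    def __init__(self, surface, base, pos, pos1):
        self.surface = surface
        self.base = base
        self.pos = pos
        self.pos1 = pos1

    def __repr__(self):
        # Reuse the instance dictionary for a readable representation
        return f'Morph({self.__dict__})'


m = Morph('name', 'name', 'noun', 'General')

print(m.__dict__)
# {'surface': 'name', 'base': 'name', 'pos': 'noun', 'pos1': 'General'}

print(vars(m) == m.__dict__)  # True

print(m)
# Morph({'surface': 'name', 'base': 'name', 'pos': 'noun', 'pos1': 'General'})
```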