This is a record of ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5), problem [41. "Reading the dependency analysis result (phrase / dependency)"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#sec41) of the 2015 Language Processing 100 Knocks. Since last time covered the preparatory work, this time we deal with the dependencies themselves. Overall, Chapter 5 is not short, package-driven code like Chapter 4 "Morphological Analysis"; we have to build the algorithms ourselves. This problem is not that complicated, but it still takes some thought.
Link | Remarks |
---|---|
041. Reading the dependency analysis result (phrase / dependency).ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 41 | Source I copied many parts from |
CaboCha official | CaboCha page to look at first |
I installed CRF++ and CaboCha so long ago that I have forgotten how. Since neither package has been updated since, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows; I believe I could not get it working on 64-bit Windows (my memory is vague, and it may have been a problem on my side).
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
Mecab | 0.996-5 | Installed with apt-get |
CRF++ | 0.58 | Too old; I forgot how I installed it (probably make install) |
CaboCha | 0.69 | Too old; I forgot how I installed it (probably make install) |
Apply the dependency parser CaboCha to "I Am a Cat" and experience the workings of dependency trees and syntactic analysis.

Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Syntax, Dependency Path, [Graphviz](http://www.graphviz.org/)
Using CaboCha, analyze the dependencies of the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
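The problems assume neko.txt.cabocha already exists. A minimal sketch of generating it, assuming CaboCha is on the PATH (`-f1` selects the lattice output format that the program below parses); the sample header line is illustrative, not taken from the actual file:

```shell
# Parse the novel with CaboCha in lattice format (-f1).
# Requires CaboCha to be installed, so it is commented out here:
# cabocha -f1 < neko.txt > neko.txt.cabocha

# In -f1 output, each chunk header looks like the sample line below;
# the third field holds the head index followed by "D":
printf '* 1 7D 1/2 1.81\n' | awk '{sub(/D$/, "", $3); print $3}'
```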
> In addition to Problem 40, implement the clause class Chunk. This class has a list of morphemes (Morph objects) (morphs), the index number of the head clause (dst), and a list of index numbers of the dependent clauses (srcs) as member variables. In addition, read the CaboCha analysis result of the input text, represent each sentence as a list of Chunk objects, and display the clause strings and heads of the eighth sentence. Use the program created here for the rest of the problems in Chapter 5.
```python
import re

# Delimiters: tab or comma
separator = re.compile('\t|,')

# Dependency (chunk header) line, e.g. "* 1 7D 1/2 1.81"
dependancy = re.compile(r'''(?:\*\s\d+\s)  # not captured: "* <chunk id> "
                            (-?\d+)        # head chunk index (-1 = root)
                         ''', re.VERBOSE)


class Morph:
    def __init__(self, line):
        # Split on tabs and commas
        cols = separator.split(line)
        self.surface = cols[0]  # surface form
        self.base = cols[7]     # base form
        self.pos = cols[1]      # part of speech
        self.pos1 = cols[2]     # part-of-speech subdivision 1


class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of dependent (source) clause indices
        self.dst = dst   # head (destination) clause index
        self.phrase = ''.join([morph.surface for morph in morphs])  # clause text


# Fill in the sources and append the Chunk list to the sentence list
def append_sentence(chunks, sentences):
    # Assign the source clause indices
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []


morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    for line in f:
        dependancies = dependancy.match(line)

        # Neither EOS nor a chunk header line: a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))
        # EOS or chunk header with accumulated morphemes: close the previous chunk
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # Chunk header line: remember the head index for the chunk being built
        if dependancies:
            dst = int(dependancies.group(1))

        # EOS with accumulated chunks: close the sentence
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

for i, chunk in enumerate(sentences[7]):
    print('{}: {}, head: {}, srcs: {}'.format(i, chunk.phrase, chunk.dst, chunk.srcs))
```
I use a regular expression to extract the head index; the `(-?\d+)` part captures the head clause number. For more information on regular expressions, see the article "Basics and Tips for Python Regular Expressions Learned from Zero". The head could also be extracted without regular expressions (e.g. by splitting on whitespace), but I use one for practice.
```python
# Dependency (chunk header) line
dependancy = re.compile(r'''(?:\*\s\d+\s)  # not captured: "* <chunk id> "
                            (-?\d+)        # head chunk index (-1 = root)
                         ''', re.VERBOSE)
```
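As a quick check, the same pattern can be run against sample chunk-header lines (the lines below are illustrative, not taken from neko.txt.cabocha):

```python
import re

# Same pattern as in the program above
dependancy = re.compile(r'''(?:\*\s\d+\s)  # not captured: "* <chunk id> "
                            (-?\d+)        # head chunk index (-1 = root)
                         ''', re.VERBOSE)

# A chunk whose head is chunk 7
print(dependancy.match('* 1 7D 1/2 1.81\n').group(1))   # → 7
# The root chunk of a sentence has head -1
print(dependancy.match('* 0 -1D 0/0 0.0\n').group(1))   # → -1
```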
The `Chunk` class also defines a `phrase` variable that the knock does not require; it is convenient for later output. `srcs` is only initialized in `__init__` and is not assigned values there; it is filled in later, once every chunk in the sentence exists.
```python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of dependent (source) clause indices
        self.dst = dst   # head (destination) clause index
        self.phrase = ''.join([morph.surface for morph in morphs])  # clause text
```
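How `srcs` gets filled can be seen with a tiny hand-built sentence. This is a minimal sketch: the `Chunk` here is simplified to take the clause text directly instead of `Morph` objects, and the three-clause sentence is hypothetical:

```python
class Chunk:
    # Simplified Chunk: takes the clause text directly for illustration
    def __init__(self, phrase, dst):
        self.phrase = phrase
        self.dst = dst    # head clause index (-1 = root)
        self.srcs = []    # filled in below

# Hypothetical sentence: clauses 0 and 1 both depend on clause 2 (the root)
chunks = [Chunk('I', 2), Chunk('cats', 2), Chunk('like.', -1)]

# The same inversion step as append_sentence() in the main program:
# each chunk registers itself as a source of its head
for i, chunk in enumerate(chunks):
    if chunk.dst != -1:
        chunks[chunk.dst].srcs.append(i)

print(chunks[2].srcs)   # → [0, 1]
```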
When the program is executed, the following results are output (clause texts translated from the Japanese).

Output result

```
0: this, head: 1, srcs: []
1: A student is, head: 7, srcs: [0]
2: Sometimes, head: 4, srcs: []
3: Us, head: 4, srcs: []
4: Catch, head: 5, srcs: [2, 3]
5: Boil, head: 6, srcs: [4]
6: To eat, head: 7, srcs: [5]
7: It's a story., head: -1, srcs: [1, 6]
```