[PYTHON] Language processing 100 knocks-46: Extraction of verb case frame information

This is my record of the 46th exercise, "Extraction of verb case frame information", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of Language Processing 100 Knocks 2015. Last time, only the particles were output as the case; this time, the clauses (case frames) are output as well. Naturally, this is even more work ...

Reference links

| Link | Remarks |
|:--|:--|
| [046. Extraction of verb case frame information.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/046.%E5%8B%95%E8%A9%9E%E3%81%AE%E6%A0%BC%E3%83%95%E3%83%AC%E3%83%BC%E3%83%A0%E6%83%85%E5%A0%B1%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb) | GitHub link to the answer program |
| 100 amateur language processing knocks: 46 | Source of many copied-and-pasted parts |
| CaboCha official | The CaboCha page to look at first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how I did it. Since neither package has been updated since, I have not rebuilt the environment. I only remember being frustrated when I once tried to use CaboCha on Windows: I believe I could not get it working on 64-bit Windows (my memory is vague, and the problem may have been on my side).

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I forgot how I installed it (probably make install) |
| CaboCha | 0.69 | Too old; I forgot how I installed it (probably make install) |

Chapter 5: Dependency analysis

Study content

Apply the dependency parser CaboCha to "I Am a Cat" and get hands-on experience with dependency trees and syntactic analysis.

Dependency parsing, dependency tree, CaboCha, clause, dependency, case, functional verb construction, dependency path, [Graphviz](http://www.graphviz.org/)

Knock content

Using CaboCha, parse the dependencies in the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.

46. Extraction of verb case frame information

Modify the program from problem 45 so that, in addition to the predicate and its case pattern, the terms (the clauses themselves that relate to the predicate) are output in tab-delimited format. On top of the problem-45 specifications, satisfy the following:

- The term should be the word string of the clause related to the predicate (there is no need to remove the trailing particle)
- If there are multiple clauses related to the predicate, arrange them in the same criterion and order as the particles, separated by spaces

Consider the example sentence 「吾輩はここで始めて人間というものを見た」 (the 8th sentence of neko.txt.cabocha), meaning "Here, for the first time, I saw a human being." This sentence contains two verbs, 始める (begin) and 見る (see). The clause related to 始める is analyzed as ここで (here), and the clauses related to 見る are analyzed as 吾輩は (I) and ものを (a thing). In this case, the program should produce the following output.

始める	で	ここで
見る	は を	吾輩は ものを

Task supplement (about "case frames (case grammar)")

If you are interested, see the Wikipedia article "Case grammar". You can solve the task without reading it; I did not understand it from a quick glance anyway.

Answer

Answer program: [046. Extraction of verb case frame information.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/046.%E5%8B%95%E8%A9%9E%E3%81%AE%E6%A0%BC%E3%83%95%E3%83%AC%E3%83%BC%E3%83%A0%E6%83%85%E5%A0%B1%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import re

# Delimiter (tab or comma)
separator = re.compile('\t|,')

# Dependency (clause header line)
dependency = re.compile(r'''(?:\*\s\d+\s) # Not captured (header prefix)
                            (-?\d+)       # Number (destination clause index)
                          ''', re.VERBOSE)
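
# Illustrative check (sample line of my own, values hypothetical):
# a CaboCha clause-header line such as '* 1 2D 0/1 1.816431' should
# yield its destination index '2':
# assert dependency.match('* 1 2D 0/1 1.816431\n').group(1) == '2'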

class Morph:
    def __init__(self, line):
        
        #Split with tabs and commas
        cols = separator.split(line)
        
        self.surface = cols[0]  # Surface form (surface)
        self.base = cols[7]     # Base form (base)
        self.pos = cols[1]      # Part of speech (pos)
        self.pos1 = cols[2]     # Part-of-speech subdivision 1 (pos1)

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # List of source clause index numbers
        self.dst  = dst  # Destination clause index number
        
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        
        for morph in morphs:
            if morph.pos != '記号':  # not a symbol
                self.phrase += morph.surface  # Build the clause text from non-symbol tokens
                self.joshi = ''  # Reset so joshi ends up as the particle of the last non-symbol token
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base

# Register source indices and append the Chunk list to the sentence list
def append_sentence(chunks, sentences):
    
    # Register each chunk's index in its destination chunk's srcs list
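    # (Illustrative: with dst values [2, 2, -1], chunks[2].srcs becomes [0, 1].)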
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    
    for line in f:
        dependencies = dependency.match(line)

        # Neither EOS nor a dependency-analysis result: a morpheme line
        if not (line == 'EOS\n' or dependencies):
            morphs.append(Morph(line))

        # EOS or a dependency line while morphemes are accumulated: close the current chunk
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # Dependency line: remember the destination index
        if dependencies:
            dst = int(dependencies.group(1))

        # EOS with accumulated chunks: close the sentence
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

def output_file(out_file, sentence, chunk):
    # Create a list of [particle, clause] pairs from the source chunks
    sources = [[sentence[source].joshi, sentence[source].phrase] \
                for source in chunk.srcs if sentence[source].joshi != '']
            
    if len(sources) > 0:
        sources.sort()
        joshi = ' '.join([row[0] for row in sources])
        phrase = ' '.join([row[1] for row in sources])
        out_file.write(('{}\t{}\t{}\n'.format(chunk.verb, joshi, phrase)))

with open('./046.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:
                output_file(out_file, sentence, chunk)
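
To spot-check the run, the following minimal snippet (an addition of mine, assuming the script above has already been executed and has written its output file) prints the head of the result:

# Print the first three lines of the result file for a quick look
with open('./046.result_python.txt') as f:
    for _ in range(3):
        print(f.readline(), end='')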

Answer commentary

Chunk class

As usual, the Chunk class, the backbone of Chapter 5, has changed from last time. I added the instance variable phrase, which holds the clause text. Everything else is the same.


class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # List of source clause index numbers
        self.dst  = dst  # Destination clause index number
        
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        
        for morph in morphs:
            if morph.pos != '記号':  # not a symbol
                self.phrase += morph.surface  # Build the clause text from non-symbol tokens
                self.joshi = ''  # Reset so joshi ends up as the particle of the last non-symbol token
            if morph.pos == '動詞':  # verb
                self.verb = morph.base
            if morph.pos == '助詞':  # particle
                self.joshi = morph.base
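
As a quick sanity check, this is how the class digests MeCab-style input (a sketch of mine; the two sample lines are hypothetical but follow the standard IPAdic layout of the surface form, a tab, then comma-separated features with the base form at index 7):

# Hypothetical MeCab lines for 吾輩 (noun) and は (particle)
m1 = Morph('吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n')
m2 = Morph('は\t助詞,係助詞,*,*,*,*,は,ハ,ハ\n')
chunk = Chunk([m1, m2], dst=2)
print(chunk.phrase, chunk.joshi)  # => 吾輩は は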

File output function

Since the logic became rather involved, I extracted the file-output part into a function. The first list comprehension builds a list of particle/clause pairs; after sorting it, the particles and the clauses are each joined with the join function and written out.


def output_file(out_file, sentence, chunk):
    # Create a list of [particle, clause] pairs from the source chunks
    sources = [[sentence[source].joshi, sentence[source].phrase] \
                for source in chunk.srcs if sentence[source].joshi != '']
            
    if len(sources) > 0:
        sources.sort()
        joshi = ' '.join([row[0] for row in sources])
        phrase = ' '.join([row[1] for row in sources])
        out_file.write(('{}\t{}\t{}\n'.format(chunk.verb, joshi, phrase)))
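
For example, for the two source clauses of 見る in the example sentence, the intermediate list and the joined strings work out like this (a worked illustration of mine, not part of the original program):

sources = [['は', '吾輩は'], ['を', 'ものを']]
sources.sort()  # sorts by particle; order is unchanged in this case
print(' '.join(row[0] for row in sources))  # => は を
print(' '.join(row[1] for row in sources))  # => 吾輩は ものを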

Output result (execution result)

Running the program produces the following results (only the first 10 lines are shown).

046.result_python.txt (first 10 lines)


生れる	で	どこで
つく	か が	生れたか 見当が
泣く	で	所で
する	て は	泣いて いた事だけは
始める	で	ここで
見る	は を	吾輩は ものを
聞く	で	あとで
捕える	を	我々を
煮る	て	捕えて
食う	て	煮て
