[PYTHON] 100 Language Processing Knock-47: Functional Verb Syntax Mining

This is a record of solving problem 47, "Mining of functional verb syntax", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 100 Language Processing Knock 2015. Compared with the previous knock, the extraction conditions become more complicated. It takes some time just to understand the problem statement, and of course it also takes time to solve it.

Reference link

| Link | Remarks |
|---|---|
| [047. Mining of functional verb syntax.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/047.%E6%A9%9F%E8%83%BD%E5%8B%95%E8%A9%9E%E6%A7%8B%E6%96%87%E3%81%AE%E3%83%9E%E3%82%A4%E3%83%8B%E3%83%B3%E3%82%B0.ipynb) | GitHub link to the answer program |
| 100 amateur language processing knocks: 47 | Source from which many parts were copied and pasted |
| CaboCha official | The CaboCha page to check first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how to install them. Since they are packages that have not been updated at all, I have not rebuilt the environment. I only have frustrating memories of trying to use CaboCha on Windows: I believe I could not get it to work on 64-bit Windows (my memory is vague, and it may have been a technical problem on my side).

| Type | Version | Contents |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Runs virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | So old that I forgot how I installed it (probably `make install`) |
| CaboCha | 0.69 | So old that I forgot how I installed it (probably `make install`) |

Chapter 5: Dependency analysis

Content of study

Apply the dependency parser CaboCha to Natsume Soseki's novel "I am a cat" and get hands-on experience with dependency trees and syntactic analysis.

Class, dependency parsing, CaboCha, clause, dependency, case, functional verb syntax, dependency path, [Graphviz](http://www.graphviz.org/)

Knock content

Using CaboCha, run dependency analysis on the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.cabocha. Using this file, implement programs that address the following questions.

47. Mining of functional verb syntax

We want to focus only on cases where the ヲ (wo) case of a verb contains a sahen-connection noun (a noun that can form a verb with する). Modify the program of problem 46 to meet the following specifications:

- Consider only cases where a clause consisting of "sahen-connection noun + を (particle)" relates to a verb
- The predicate is "sahen-connection noun + を + base form of the verb"; when a clause contains multiple verbs, use the leftmost verb
- When multiple particles (clauses) relate to the predicate, arrange all the particles in lexicographic order, separated by spaces
- When multiple clauses relate to the predicate, arrange all their terms separated by spaces (aligned with the order of the particles)

For example, the following output should be obtained from the sentence 「別段くるにも及ばんさと、主人は手紙に返事をする。」 (roughly, "There is no particular need to come; the master replies to the letter."):

返事をする　と に は　及ばんさと 手紙に 主人は

Save the output of this program to a file and check the following items using UNIX commands:

- Predicates that frequently appear in the corpus (sahen-connection noun + を + verb)
- Predicates and particle patterns that frequently appear in the corpus

Task supplement (about "functional verbs")

According to the article "Functional verbs / idiomatic verbs", a functional verb is defined as quoted below. In other words, a verb such as する ("do") means little by itself and only expresses a concrete meaning when attached to a noun, as in 食事をする ("to have a meal").

A functional verb is a verb that has lost its original meaning and, combined with an action noun, expresses a verbal meaning.

Answer

Answer program [047. Mining of functional verb syntax.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/047.%E6%A9%9F%E8%83%BD%E5%8B%95%E8%A9%9E%E6%A7%8B%E6%96%87%E3%81%AE%E3%83%9E%E3%82%A4%E3%83%8B%E3%83%B3%E3%82%B0.ipynb)

import re

# delimiters: tab and comma
separator = re.compile('\t|,')

# dependency (chunk header) line
dependancy = re.compile(r'''(?:\*\s\d+\s) # not captured
                            (-?\d+)       # number (destination chunk index)
                          ''', re.VERBOSE)
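# Illustrative example: a CaboCha chunk header line such as '* 0 2D 0/0 -0.764522'
# matches this pattern, capturing '2' (the index of the destination chunk)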

class Morph:
    def __init__(self, line):
        
        # split on tabs and commas
        cols = separator.split(line)
        
        self.surface = cols[0] # surface form
        self.base = cols[7]    # base form
        self.pos = cols[1]     # part of speech
        self.pos1 = cols[2]    # part-of-speech subdivision 1
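        # Illustrative example: the MeCab line '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'
        # yields surface='吾輩', base='吾輩', pos='名詞', pos1='代名詞'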

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source chunk indexes
        self.dst  = dst  # destination chunk index
        
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        self.sahen = '' # whether this chunk is a target of the "sahen noun + を + verb" pattern
        
        for i, morph in enumerate(morphs):
            if morph.pos != '記号':
                self.phrase += morph.surface # build the clause from non-symbol morphemes
                self.joshi = ''  # cleared for non-symbols, to pick up the particle at the end of the clause excluding symbols
            
            if morph.pos == '動詞' and self.verb == '':
                self.verb = morph.base # keep only the leftmost verb
            
            if morphs[-1].pos == '助詞':
                self.joshi = morphs[-1].base
            
            try:
                if morph.pos1 == 'サ変接続' and \
                   morphs[i+1].surface == 'を':
                    self.sahen = morph.surface + morphs[i+1].surface
            except IndexError:
                pass

# fill in the source indexes and add the Chunk list to the sentence list
def append_sentence(chunks, sentences):
    
    # register each chunk as a source of its destination chunk
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    
    for line in f:
        dependancies = dependancy.match(line)
        
        # neither EOS nor a dependency-analysis line: a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))
            
        # EOS or a dependency line reached while morphemes are pending: close the chunk
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []
        
        # dependency line: remember the destination index
        if dependancies:
            dst = int(dependancies.group(1))
        
        # EOS reached with chunks pending: close the sentence
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

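# At this point, sentences is a list of sentences, each a list of Chunk
# objects whose dst (destination index) and srcs (source indexes) are filled in.
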
def output_file(out_file, sahen, sentence, chunk):
    # build a list of [particle, phrase] pairs for the source chunks that have a particle
    sources = [[sentence[source].joshi, sentence[source].phrase] \
                for source in chunk.srcs if sentence[source].joshi != '']
    
    if len(sources) > 0:
        sources.sort()
        joshi = ' '.join([row[0] for row in sources])
        phrase = ' '.join([row[1] for row in sources])
        out_file.write(('{}\t{}\t{}\n'.format(sahen, joshi, phrase)))
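        # Illustrative example: sources such as [['は', '主人は'], ['に', '手紙に']]
        # sort to joshi='に は' and phrase='手紙に 主人は', giving the lexicographic
        # particle order required by the problem statement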

with open('./047.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            
            if chunk.sahen != '' and \
               chunk.dst != -1 and \
               sentence[chunk.dst].verb != '':
                output_file(out_file, chunk.sahen+sentence[chunk.dst].verb, 
                            sentence, sentence[chunk.dst])

# extract the predicate (field 1), deduplicate with counts, and sort by count
cut --fields=1 047.result_python.txt | sort | uniq --count \
| sort --numeric-sort --reverse > 047.result_unix1.txt

# extract predicate and particles (fields 1 and 2), deduplicate with counts, and sort by count
cut --fields=1,2 047.result_python.txt | sort | uniq --count \
| sort --numeric-sort --reverse > 047.result_unix2.txt
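
For reference, the same two aggregations can also be done in Python; the following is a minimal sketch (not part of the original answer) that assumes the tab-separated layout of 047.result_python.txt produced above.

python

from collections import Counter

# Tally predicates (field 1) and predicate + particle patterns (fields 1 and 2),
# mirroring the two cut | sort | uniq --count pipelines above
predicate_counts = Counter()
pattern_counts = Counter()

with open('./047.result_python.txt') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        predicate_counts[cols[0]] += 1
        pattern_counts[(cols[0], cols[1])] += 1

# top 10 of each, corresponding to 047.result_unix1.txt / 047.result_unix2.txt
for predicate, count in predicate_counts.most_common(10):
    print(count, predicate)
for (predicate, joshi), count in pattern_counts.most_common(10):
    print(count, predicate, joshi)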

Answer commentary

Chunk class

As usual, the heart of the change is the Chunk class. When the value of the part-of-speech subdivision pos1 is サ変接続 (sahen connection) and the surface of the next morpheme is を, the instance variable sahen is set to the concatenation of the two (example: 返事を).

python


class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source chunk indexes
        self.dst  = dst  # destination chunk index
        
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        self.sahen = '' # whether this chunk is a target of the "sahen noun + を + verb" pattern
        
        for i, morph in enumerate(morphs):
            if morph.pos != '記号':
                self.phrase += morph.surface # build the clause from non-symbol morphemes
                self.joshi = ''  # cleared for non-symbols, to pick up the particle at the end of the clause excluding symbols
            
            if morph.pos == '動詞' and self.verb == '':
                self.verb = morph.base # keep only the leftmost verb
            
            if morphs[-1].pos == '助詞':
                self.joshi = morphs[-1].base
            
            try:
                if morph.pos1 == 'サ変接続' and \
                   morphs[i+1].surface == 'を':
                    self.sahen = morph.surface + morphs[i+1].surface
            except IndexError:
                pass
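
As a quick sanity check (an illustrative snippet, not part of the original notebook), feeding the class a サ変接続 noun followed by the particle を in MeCab's output format sets sahen as intended:

python

# Hypothetical input: the sahen-connection noun 「返事」 followed by 「を」,
# written as MeCab/CaboCha morpheme lines (surface\tpos,pos1,...,base,reading,pronunciation)
lines = ['返事\t名詞,サ変接続,*,*,*,*,返事,ヘンジ,ヘンジ',
         'を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ']
chunk = Chunk([Morph(line) for line in lines], dst=1)
print(chunk.sahen)  # -> 返事を
print(chunk.joshi)  # -> を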

Output section

The conditional branch of the output section is changed.

python


with open('./047.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            
            if chunk.sahen != '' and \
               chunk.dst != -1 and \
               sentence[chunk.dst].verb != '':
                output_file(out_file, chunk.sahen+sentence[chunk.dst].verb, 
                            sentence, sentence[chunk.dst])
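
Note that the predicate passed to output_file is simply the concatenation chunk.sahen + sentence[chunk.dst].verb (for example, 返事を + する gives 返事をする), and the "leftmost verb" requirement is already satisfied because Chunk stores only the first verb it encounters (self.verb is assigned only while it is still empty).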

Output result (execution result)

Python execution result

When the Python script is executed, the following results are output.

text: 047.result_python.txt (first 10 lines only)


Decide to make a decision
In return, in memory of the return
Take a nap Take a nap
He takes a nap
Persecution by chasing after persecution
Living a family life
Talk talk talk
Make a letter to the editor Make a letter to the editor
Sometimes talk to talk
To make a sketch

UNIX command execution result

Executing the UNIX commands outputs the "predicates that frequently appear in the corpus (sahen-connection noun + を + verb)".

text: 047.result_unix1.txt (first 10 lines only)


29 reply
21 Say hello
16 talk
15 imitate
13 quarrel
9 Exercise
9 Ask a question
6 Be careful
6 Take a nap
6 Ask questions

Executing the UNIX commands outputs the "predicates and particle patterns that frequently appear in the corpus".

text: 047.result_unix2.txt (first 10 lines only)


14 When you reply
9 Exercise
9 Do the imitation
8 What is a reply?
7 To quarrel
6 To talk
6 When you say hello
5 to talk
5 To say hello
4 Ask a question
