[PYTHON] 100 Language Processing Knock-47: Functional Verb Syntax Mining

This is a record of solving problem 47, "Mining of functional verb syntax", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 100 Language Processing Knock 2015. Compared with the previous knock, the extraction conditions become more complicated. It takes some time just to understand the problem statement, and of course it also takes time to solve it.

Reference link

| Link | Remarks |
|---|---|
| [047. Mining of functional verb syntax.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/047.%E6%A9%9F%E8%83%BD%E5%8B%95%E8%A9%9E%E6%A7%8B%E6%96%87%E3%81%AE%E3%83%9E%E3%82%A4%E3%83%8B%E3%83%B3%E3%82%B0.ipynb) | GitHub link to the answer program |
| 100 amateur language processing knocks: 47 | Source from which many parts were copied and pasted |
| CaboCha official | The CaboCha page to check first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how to install them. Since they are packages that have not been updated at all, I have not rebuilt the environment. I only have frustrating memories of trying to use CaboCha on Windows: I believe I could not get it to work on 64-bit Windows (my memory is vague, and it may have been a technical problem on my side).

| Type | Version | Contents |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Runs virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | So old that I forgot how I installed it (probably `make install`) |
| CaboCha | 0.69 | So old that I forgot how I installed it (probably `make install`) |

Chapter 5: Dependency analysis

Content of study

Apply the dependency parser CaboCha to Natsume Soseki's novel "I am a cat" and get hands-on experience with dependency trees and syntactic analysis.

Class, dependency parsing, CaboCha, clause, dependency, case, functional verb syntax, dependency path, [Graphviz](http://www.graphviz.org/)

Knock content

Using CaboCha, run dependency analysis on the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.cabocha. Using this file, implement programs that address the following questions.

47. Mining of functional verb syntax

We want to focus only on cases where the ヲ (wo) case of a verb contains a sahen-connection noun (a noun that can form a verb with する). Modify the program of problem 46 to meet the following specifications:

- Consider only cases where a clause consisting of "sahen-connection noun + を (particle)" relates to a verb
- The predicate is "sahen-connection noun + を + base form of the verb"; when a clause contains multiple verbs, use the leftmost verb
- When multiple particles (clauses) relate to the predicate, arrange all the particles in lexicographic order, separated by spaces
- When multiple clauses relate to the predicate, arrange all their terms separated by spaces (aligned with the order of the particles)

For example, the following output should be obtained from the sentence 「別段くるにも及ばんさと、主人は手紙に返事をする。」 (roughly, "There is no particular need to come; the master replies to the letter."):

返事をする　と に は　及ばんさと 手紙に 主人は

Save the output of this program to a file and check the following items using UNIX commands:

- Predicates that frequently appear in the corpus (sahen-connection noun + を + verb)
- Predicates and particle patterns that frequently appear in the corpus

Task supplement (about "functional verbs")

According to the article "Functional verbs / idiomatic verbs", a functional verb is defined as quoted below. In other words, a verb such as する ("do") means little by itself and only expresses a concrete meaning when attached to a noun, as in 食事をする ("to have a meal").

A functional verb is a verb that has lost its original meaning and, combined with an action noun, expresses a verbal meaning.

Answer

Answer program [047. Mining of functional verb syntax.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/047.%E6%A9%9F%E8%83%BD%E5%8B%95%E8%A9%9E%E6%A7%8B%E6%96%87%E3%81%AE%E3%83%9E%E3%82%A4%E3%83%8B%E3%83%B3%E3%82%B0.ipynb)

import re

# delimiters: tab and comma
separator = re.compile('\t|,')

# dependency (chunk header) line
dependancy = re.compile(r'''(?:\*\s\d+\s) # not captured
                            (-?\d+)       # number (destination chunk index)
                          ''', re.VERBOSE)
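# Illustrative example: a CaboCha chunk header line such as '* 0 2D 0/0 -0.764522'
# matches this pattern, capturing '2' (the index of the destination chunk)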

class Morph:
    def __init__(self, line):
        
        # split on tabs and commas
        cols = separator.split(line)
        
        self.surface = cols[0] # surface form
        self.base = cols[7]    # base form
        self.pos = cols[1]     # part of speech
        self.pos1 = cols[2]    # part-of-speech subdivision 1
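        # Illustrative example: the MeCab line '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'
        # yields surface='吾輩', base='吾輩', pos='名詞', pos1='代名詞'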

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source chunk indexes
        self.dst  = dst  # destination chunk index
        
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        self.sahen = '' # whether this chunk is a target of the "sahen noun + を + verb" pattern
        
        for i, morph in enumerate(morphs):
            if morph.pos != '記号':
                self.phrase += morph.surface # build the clause from non-symbol morphemes
                self.joshi = ''  # cleared for non-symbols, to pick up the particle at the end of the clause excluding symbols
            
            if morph.pos == '動詞' and self.verb == '':
                self.verb = morph.base # keep only the leftmost verb
            
            if morphs[-1].pos == '助詞':
                self.joshi = morphs[-1].base
            
            try:
                if morph.pos1 == 'サ変接続' and \
                   morphs[i+1].surface == 'を':
                    self.sahen = morph.surface + morphs[i+1].surface
            except IndexError:
                pass

# fill in the source indexes and add the Chunk list to the sentence list
def append_sentence(chunks, sentences):
    
    # register each chunk as a source of its destination chunk
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    
    for line in f:
        dependancies = dependancy.match(line)
        
        # neither EOS nor a dependency-analysis line: a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))
            
        # EOS or a dependency line reached while morphemes are pending: close the chunk
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []
        
        # dependency line: remember the destination index
        if dependancies:
            dst = int(dependancies.group(1))
        
        # EOS reached with chunks pending: close the sentence
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

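# At this point, sentences is a list of sentences, each a list of Chunk
# objects whose dst (destination index) and srcs (source indexes) are filled in.
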
def output_file(out_file, sahen, sentence, chunk):
    # build a list of [particle, phrase] pairs for the source chunks that have a particle
    sources = [[sentence[source].joshi, sentence[source].phrase] \
                for source in chunk.srcs if sentence[source].joshi != '']
    
    if len(sources) > 0:
        sources.sort()
        joshi = ' '.join([row[0] for row in sources])
        phrase = ' '.join([row[1] for row in sources])
        out_file.write(('{}\t{}\t{}\n'.format(sahen, joshi, phrase)))
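        # Illustrative example: sources such as [['は', '主人は'], ['に', '手紙に']]
        # sort to joshi='に は' and phrase='手紙に 主人は', giving the lexicographic
        # particle order required by the problem statement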

with open('./047.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            
            if chunk.sahen != '' and \
               chunk.dst != -1 and \
               sentence[chunk.dst].verb != '':
                output_file(out_file, chunk.sahen+sentence[chunk.dst].verb, 
                            sentence, sentence[chunk.dst])

# extract the predicate (field 1), deduplicate with counts, and sort by count
cut --fields=1 047.result_python.txt | sort | uniq --count \
| sort --numeric-sort --reverse > 047.result_unix1.txt

# extract predicate and particles (fields 1 and 2), deduplicate with counts, and sort by count
cut --fields=1,2 047.result_python.txt | sort | uniq --count \
| sort --numeric-sort --reverse > 047.result_unix2.txt
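
For reference, the same two aggregations can also be done in Python; the following is a minimal sketch (not part of the original answer) that assumes the tab-separated layout of 047.result_python.txt produced above.

python

from collections import Counter

# Tally predicates (field 1) and predicate + particle patterns (fields 1 and 2),
# mirroring the two cut | sort | uniq --count pipelines above
predicate_counts = Counter()
pattern_counts = Counter()

with open('./047.result_python.txt') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        predicate_counts[cols[0]] += 1
        pattern_counts[(cols[0], cols[1])] += 1

# top 10 of each, corresponding to 047.result_unix1.txt / 047.result_unix2.txt
for predicate, count in predicate_counts.most_common(10):
    print(count, predicate)
for (predicate, joshi), count in pattern_counts.most_common(10):
    print(count, predicate, joshi)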

Answer commentary

Chunk class

As usual, the heart of the change is the Chunk class. When the value of the part-of-speech subdivision pos1 is サ変接続 (sahen connection) and the surface of the next morpheme is を, the instance variable sahen is set to the concatenation of the two (example: 返事を).

python


class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # list of source chunk indexes
        self.dst  = dst  # destination chunk index
        
        self.phrase = ''
        self.verb = ''
        self.joshi = ''
        self.sahen = '' # whether this chunk is a target of the "sahen noun + を + verb" pattern
        
        for i, morph in enumerate(morphs):
            if morph.pos != '記号':
                self.phrase += morph.surface # build the clause from non-symbol morphemes
                self.joshi = ''  # cleared for non-symbols, to pick up the particle at the end of the clause excluding symbols
            
            if morph.pos == '動詞' and self.verb == '':
                self.verb = morph.base # keep only the leftmost verb
            
            if morphs[-1].pos == '助詞':
                self.joshi = morphs[-1].base
            
            try:
                if morph.pos1 == 'サ変接続' and \
                   morphs[i+1].surface == 'を':
                    self.sahen = morph.surface + morphs[i+1].surface
            except IndexError:
                pass
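
As a quick sanity check (an illustrative snippet, not part of the original notebook), feeding the class a サ変接続 noun followed by the particle を in MeCab's output format sets sahen as intended:

python

# Hypothetical input: the sahen-connection noun 「返事」 followed by 「を」,
# written as MeCab/CaboCha morpheme lines (surface\tpos,pos1,...,base,reading,pronunciation)
lines = ['返事\t名詞,サ変接続,*,*,*,*,返事,ヘンジ,ヘンジ',
         'を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ']
chunk = Chunk([Morph(line) for line in lines], dst=1)
print(chunk.sahen)  # -> 返事を
print(chunk.joshi)  # -> を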

Output section

The conditional branch of the output section is changed.

python


with open('./047.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            
            if chunk.sahen != '' and \
               chunk.dst != -1 and \
               sentence[chunk.dst].verb != '':
                output_file(out_file, chunk.sahen+sentence[chunk.dst].verb, 
                            sentence, sentence[chunk.dst])
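
Note that the predicate passed to output_file is simply the concatenation chunk.sahen + sentence[chunk.dst].verb (for example, 返事を + する gives 返事をする), and the "leftmost verb" requirement is already satisfied because Chunk stores only the first verb it encounters (self.verb is assigned only while it is still empty).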

Output result (execution result)

Python execution result

When the Python script is executed, the following results are output.

text: 047.result_python.txt (first 10 lines only)


Decide to make a decision
In return, in memory of the return
Take a nap Take a nap
He takes a nap
Persecution by chasing after persecution
Living a family life
Talk talk talk
Make a letter to the editor Make a letter to the editor
Sometimes talk to talk
To make a sketch

UNIX command execution result

Executing the UNIX commands outputs the "predicates that frequently appear in the corpus (sahen-connection noun + を + verb)".

text: 047.result_unix1.txt (first 10 lines only)


29 reply
21 Say hello
16 talk
15 imitate
13 quarrel
9 Exercise
9 Ask a question
6 Be careful
6 Take a nap
6 Ask questions

Executing the UNIX commands outputs the "predicates and particle patterns that frequently appear in the corpus".

text: 047.result_unix2.txt (first 10 lines only)


14 When you reply
9 Exercise
9 Do the imitation
8 What is a reply?
7 To quarrel
6 To talk
6 When you say hello
5 to talk
5 To say hello
4 Ask a question
