[PYTHON] 100 Language Processing Knock-45: Extraction of verb case patterns

This is a record of the 45th exercise, "Extraction of verb case patterns", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 2015 edition of 100 Language Processing Knocks. The number of `if` branches has grown, and the program is becoming more and more complicated. Working out the algorithm is a little tedious.

Reference links

| Link | Remarks |
|:--|:--|
| 045. Extraction of verb case patterns.ipynb | Answer program GitHub link |
| 100 amateur language processing knocks: 45 | Copy-and-paste source of many parts |
| CaboCha official | CaboCha page to look at first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how to install them. Since these packages are no longer updated, I have not rebuilt the environment. I only remember being frustrated when I decided to use CaboCha on Windows. I believe I could not use it on 64-bit Windows (my memory is vague, and it may have been a technical problem on my side).

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running as a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| Mecab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I forgot how to install it (probably `make install`) |
| CaboCha | 0.69 | Too old; I forgot how to install it (probably `make install`) |

Chapter 5: Dependency analysis

Content of study

Apply the dependency parser CaboCha to "I Am a Cat" and experience the operation of dependency trees and syntactic analysis.

Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Syntax, Dependency Path, [Graphviz](http://www.graphviz.org/)

Knock content

Using CaboCha, parse the dependencies of the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
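
For reference, one way to produce neko.txt.cabocha is to run CaboCha in lattice output format (`-f1`). The snippet below is only a minimal sketch of that preprocessing step (my own illustration, not part of the original answer; invoking the CLI through `subprocess` is an assumption):

python

# Minimal sketch (assumption, not the original answer): run the cabocha CLI
# in lattice format (-f1) over neko.txt and save the result.
import subprocess

with open('./neko.txt') as src, open('./neko.txt.cabocha', 'w') as dst:
    subprocess.run(['cabocha', '-f1'], stdin=src, stdout=dst, check=True)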

45. Extraction of verb case patterns

I would like to treat the sentences used here as a corpus and investigate the cases that Japanese predicates can take. Consider a verb as a predicate and the particles of the clauses depending on the verb as cases, and output the predicate and cases in tab-delimited format. However, make sure that the output meets the following specifications.

- In a clause containing a verb, the base form of the leftmost verb is used as the predicate.
- The cases are the particles related to the predicate.
- If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces.

Consider the example sentence 「吾輩はここで始めて人間というものを見た」 ("Here I saw a human being for the first time"; the 8th sentence of neko.txt.cabocha). This sentence contains two verbs, 「始める」 (begin) and 「見る」 (see). If the clause depending on 「始める」 is analyzed as 「ここで」 (here) and the clauses depending on 「見る」 are analyzed as 「吾輩は」 (I) and 「ものを」 (a thing), the program should produce the following output.

始める	で
見る	は を

Save the output of this program to a file and check the following items using UNIX commands.

- Combinations of predicates and case patterns that appear frequently in the corpus
- The case patterns of the verbs 「する」 (do), 「見る」 (see), and 「与える」 (give), arranged in descending order of frequency in the corpus

Problem supplement (about "case")

I did not pay much attention to "case" beyond what was needed to complete the program, but the Japanese notion of "case" seems to be deep. If you are curious, take a look at the Wikipedia article on "Case"; I only skimmed it. I remember being asked, while doing a language exchange in Australia, what the difference between 「は」 (wa) and 「が」 (ga) was.

Answer

Answer program [045. Extraction of verb case patterns.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/045.%E5%8B%95%E8%A9%9E%E3%81%AE%E6%A0%BC%E3%83%91%E3%82%BF%E3%83%BC%E3%83%B3%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import re

# Delimiter (tab or comma) for splitting morpheme lines
separator = re.compile('\t|,')

# Dependency analysis line, e.g. "* 0 2D 0/1 -0.76..."
dependancy = re.compile(r'''(?:\*\s\d+\s) # Clause header, not captured
                            (-?\d+)       # Number (dependency destination index)
                          ''', re.VERBOSE)

class Morph:
    def __init__(self, line):
        
        # Split on tabs and commas
        cols = separator.split(line)
        
        self.surface = cols[0] # Surface form (surface)
        self.base = cols[7]    # Base form (base)
        self.pos = cols[1]     # Part of speech (pos)
        self.pos1 = cols[2]    # Part of speech subcategory 1 (pos1)

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # List of source clause index numbers
        self.dst  = dst  # Destination clause index number
        
        self.verb = ''
        self.joshi = ''
        
        for morph in morphs:            
            if morph.pos != '記号':
                self.joshi = ''  # Reset for non-symbols so the last particle, excluding trailing symbols, is kept
            if morph.pos == '動詞':
                self.verb = morph.base
            if morph.pos == '助詞':
                self.joshi = morph.base

# Assign the source clause indexes and add the Chunk list to the sentence list
def append_sentence(chunks, sentences):
    
    # Assign the source clause indexes (srcs)
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    
    for line in f:
        dependancies = dependancy.match(line)
        
        # If the line is neither EOS nor a dependency analysis result, it is a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))
            
        # On EOS or a dependency line, flush the accumulated morphemes into a Chunk
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []
       
        # On a dependency line, remember the destination clause index
        if dependancies:
            dst = int(dependancies.group(1))
        
        # On EOS, finish the sentence if chunks have been accumulated
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

with open('./045.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:
                
                #Create a list of particles
                sources = [sentence[source].joshi for source in chunk.srcs if sentence[source].joshi != '']
            
                if len(sources) > 0:
                    sources.sort()
                    out_file.write(('{}\t{}\n'.format(chunk.verb, ' '.join(sources))))

The following is the UNIX command part. This was my first time using the grep command, and it turned out to be convenient.

UNIX command section


# Sort, deduplicate with counts, then sort in descending order
sort 045.result_python.txt | uniq --count | sort --numeric-sort --reverse > "045.result_1_all.txt"

# Extract lines starting with 「する」 followed by whitespace, then sort, deduplicate with counts, and sort in descending order
grep "^する\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_2_する.txt"

# Extract lines starting with 「見る」 followed by whitespace, then sort, deduplicate with counts, and sort in descending order
grep "^見る\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_3_見る.txt"

# Extract lines starting with 「与える」 followed by whitespace, then sort, deduplicate with counts, and sort in descending order
grep "^与える\s" 045.result_python.txt | sort | uniq --count | sort --numeric-sort --reverse > "045.result_4_与える.txt"

Answer commentary

Chunk class

The Chunk class stores the base forms of the verb and the particle. If there is more than one verb in a single clause, the later one wins, because the loop simply overwrites self.verb (note that this differs slightly from the specification, which asks for the leftmost verb). The case particle should appear at the end of the clause, but there is a conditional branch that takes symbols (punctuation) into account. A small usage check follows the code below.

python


class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # List of source clause index numbers
        self.dst  = dst  # Destination clause index number
        
        self.verb = ''
        self.joshi = ''
        
        for morph in morphs:            
            if morph.pos != '記号':
                self.joshi = ''  # Reset for non-symbols so the last particle, excluding trailing symbols, is kept
            if morph.pos == '動詞':
                self.verb = morph.base
            if morph.pos == '助詞':
                self.joshi = morph.base
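
To illustrate the "last particle excluding trailing symbols" behavior, here is a small check, assuming the Morph and Chunk classes above are already defined. The morpheme lines are my own hand-written imitation of CaboCha's lattice output for the hypothetical clause 「ものを。」:

python

# Small check (my own illustration, assuming Morph and Chunk are defined above).
# Each string imitates a morpheme line: surface\tpos,pos1,...,base,reading,pronunciation
sample = [
    'もの\t名詞,非自立,一般,*,*,*,もの,モノ,モノ',
    'を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ',
    '。\t記号,句点,*,*,*,*,。,。,。',
]

chunk = Chunk([Morph(line) for line in sample], dst=-1)
print(chunk.joshi)  # -> を (the trailing symbol does not clear the particle)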

Output part

The particles of the source clauses are collected with a list comprehension and sorted to satisfy the "lexicographic order" requirement. Finally, the join function outputs them separated by spaces. The nesting is deep, and it feels awkward to write; a slightly flatter variant is sketched after the code.

python


with open('./045.result_python.txt', 'w') as out_file:
    for sentence in sentences:
        for chunk in sentence:
            if chunk.verb != '' and len(chunk.srcs) > 0:

                #Create a list of particles
                sources = [sentence[source].joshi for source in chunk.srcs if sentence[source].joshi != '']

                if len(sources) > 0:
                    sources.sort()
                    out_file.write(('{}\t{}\n'.format(chunk.verb, ' '.join(sources))))
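
As a purely optional refactoring (my own sketch, not the original answer; the helper name case_line and the output file name 045.result_flat.txt are hypothetical), the innermost logic can be pulled into a helper to reduce the nesting:

python

# Optional, flatter variant (my own sketch, not the original answer).
def case_line(chunk, sentence):
    # Collect the particles of the source clauses and sort them lexicographically
    particles = sorted(sentence[i].joshi for i in chunk.srcs if sentence[i].joshi != '')
    if chunk.verb != '' and particles:
        return '{}\t{}\n'.format(chunk.verb, ' '.join(particles))
    return None

with open('./045.result_flat.txt', 'w') as out_file:  # hypothetical file name
    for sentence in sentences:
        for chunk in sentence:
            line = case_line(chunk, sentence)
            if line is not None:
                out_file.write(line)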

Output result (execution result)

When the program is executed, the following results are output. Since there are many lines, only the first 10 are shown here.

Python output

bash:045.result_python.txt (first 10 lines)


Be born
Tsukugato
By crying
Or
At the beginning
To see
Listen
To catch
Boil
Eat

UNIX command output

Since there are many lines, only the first 10 of each file are shown here.

bash:045.result_1_all.txt (first 10 lines)


There are 3176
1997 Tsukugato
800
721
464 to be
330
309 I think
305 see
301
Until there are 262

bash:045.result_2_する.txt (first 10 lines)


1099
651
221
109 But
Until 86
59 What is
41
27 What is it?
Up to 24
18 as

bash:045.result_3_見る.txt (first 10 lines)


305 see
99 see
31 to see
24 Seeing
19 from seeing
11 Seeing
7 Because I see
5 to see
2 While watching
2 Just by looking

"Give" has a low frequency of appearance, and this is all.

bash:045.result_4_与える.txt


7 to give
4 to give
3 What to give
Give 1 but give
1 As to give
1 to give
