[PYTHON] 100 Language Processing Knock-41: Reading Parsing Results (Phrase / Dependency)

Language processing 100 knocks 2015 ["Chapter 5: Dependency analysis"](http://www.cl.ecei. tohoku.ac.jp/nlp100/#ch5) [41st "Reading of dependency analysis result (phrase / dependency)"](http://www.cl.ecei.tohoku.ac.jp/nlp100/# sec41) This is a record. Since the last time was the content of the preparatory movement, this time it is the actual performance of the dependency. Overall, Chapter 5 is not a short code that utilizes packages as in Chapter 4, "Morphological Analysis", and we have to create an algorithm. This time it's not that complicated, but it still makes me think to some extent.

Reference link

Link Remarks
041.Reading the dependency analysis result(Phrase / dependency).ipynb Answer program GitHub link
100 amateur language processing knocks:41 Copy and paste source of many source parts
CaboCha official CaboCha page to look at first

environment

I installed CRF ++ and CaboCha too long ago and forgot how to install them. Since it is a package that has not been updated at all, we have not rebuilt the environment. I only remember being frustrated when I decided to use CaboCha on Windows. I think I couldn't use it on 64-bit Windows (I have a vague memory and maybe I have a technical problem).

type version Contents
OS Ubuntu18.04.01 LTS It is running virtually
pyenv 1.2.16 I use pyenv because I sometimes use multiple Python environments
Python 3.8.1 python3 on pyenv.8.I'm using 1
Packages are managed using venv
Mecab 0.996-5 apt-Install with get
CRF++ 0.58 It's too old and I forgot how to install(Perhapsmake install)
CaboCha 0.69 It's too old and I forgot how to install(Perhapsmake install)

Chapter 5: Dependency analysis

content of study

Apply the dependency analyzer CaboCha to "I am a cat" and experience the operation of the dependency tree and syntactic analysis.

Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Parsing, Dependency Path, [Graphviz](http: / /www.graphviz.org/)

Knock content

Using CaboCha for the text (neko.txt) of Natsume Soseki's novel "I am a cat" Analyze the dependency and save the result in a file called neko.txt.cabocha. Use this file to implement a program that addresses the following questions.

41. Reading the dependency analysis result (phrase / dependency)

In addition to> 40, implement the clause class Chunk. This class has a list of morphemes (Morph objects) (morphs), a list of related clause index numbers (dst), and a list of related original clause index numbers (srcs) as member variables. In addition, read the analysis result of CaboCha of the input text, express one sentence as a list of Chunk objects, and display the character string and the contact of the phrase of the eighth sentence. For the rest of the problems in Chapter 5, use the program created here.

Answer

Answer program [041. Reading the dependency analysis result (phrase / dependency) .ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82% 8A% E5% 8F% 97% E3% 81% 91% E8% A7% A3% E6% 9E% 90 / 041.% E4% BF% 82% E3% 82% 8A% E5% 8F% 97% E3% 81 % 91% E8% A7% A3% E6% 9E% 90% E7% B5% 90% E6% 9E% 9C% E3% 81% AE% E8% AA% AD% E3% 81% BF% E8% BE% BC % E3% 81% BF (% E6% 96% 87% E7% AF% 80% E3% 83% BB% E4% BF% 82% E3% 82% 8A% E5% 8F% 97% E3% 81% 91) .ipynb)

import re

#Delimiter
separator = re.compile('\t|,')

#Dependency
dependancy = re.compile(r'''(?:\*\s\d+\s) #Not subject to capture
                            (-?\d+)       #Numbers(Contact)
                          ''', re.VERBOSE)

class Morph:
    def __init__(self, line):
        
        #Split with tabs and commas
        cols = separator.split(line)
        
        self.surface = cols[0] #Surface type(surface)
        self.base = cols[7]    #Uninflected word(base)
        self.pos = cols[1]     #Part of speech(pos)
        self.pos1 = cols[2]    #Part of speech subclassification 1(pos1)

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   #List of original clause index numbers
        self.dst  = dst  #Contact clause index number
        self.phrase = ''.join([morph.surface for morph in morphs]) #Phrase

#Substitute the origin and add the Chunk list to the statement list
def append_sentence(chunks, sentences):
    
    #Substitute the entrepreneur
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    
    for line in f:
        dependancies = dependancy.match(line)
        
        #If it is not EOS or dependency analysis result
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))
            
        #When there is a morphological analysis result in the EOS or dependency analysis result
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []
       
        #In the case of dependency result
        if dependancies:
            dst = int(dependancies.group(1))
        
        #When there is a dependency result in EOS
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

for i, chunk in enumerate(sentences[7]):
    print('{}: {},Contact:{},Person in charge:{}'.format(i, chunk.phrase, chunk.dst, chunk.srcs))

Answer commentary

Regular expression for getting a contact

I'm using a regular expression that can get the contact. (-? \ D +) is the part to get the number of the contact. For more information on regular expressions, see the article "Basics and Tips for Python Regular Expressions Learned from Zero". I think you can get it without using regular expressions, but I use it for practice.

python


#Dependency
dependancy = re.compile(r'''(?:\*\s\d+\s) #Not subject to capture
                            (-?\d+)       #Numbers(Contact)
                          ''', re.VERBOSE)

Chunk class

It also defines a variable called phrase that is not specified for knocking. It's convenient when you output it later. srcs is only defined, and __init__ does not assign values.

python


class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   #List of original clause index numbers
        self.dst  = dst  #Contact clause index number
        self.phrase = ''.join([morph.surface for morph in morphs]) #Phrase

Output result (execution result)

When the program is executed, the following results will be output.

Output result


0:this,Contact:1,Person in charge:[]
1:A student is,Contact:7,Person in charge:[0]
2:Sometimes,Contact:4,Person in charge:[]
3:Us,Contact:4,Person in charge:[]
4:Catch,Contact:5,Person in charge:[2, 3]
5:Boil,Contact:6,Person in charge:[4]
6:To eat,Contact:7,Person in charge:[5]
7:It's a story.,Contact:-1,Person in charge:[1, 6]

Recommended Posts

100 Language Processing Knock-41: Reading Parsing Results (Phrase / Dependency)
Language processing 100 knocks-40: Reading dependency analysis results (morpheme)
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-30 (using pandas): reading morphological analysis results
100 Language Processing Knock 2015 Chapter 5 Dependency Analysis (40-49)
100 Language Processing Knock 2020 Chapter 5: Dependency Analysis
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock 2020 [00 ~ 69 answer]
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 language processing knock 2020 [00 ~ 49 answer]
100 Language Processing Knock-52: Stemming
100 Language Processing Knock Chapter 1
100 Amateur Language Processing Knock: 07
100 language processing knock-20 (using pandas): reading JSON data
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 language processing knock 2020 [00 ~ 59 answer]
100 Amateur Language Processing Knock: 67
100 Language Processing Knock-49: Extraction of Dependency Paths Between Nouns
100 Language Processing with Python Knock 2015
100 Language Processing Knock-51: Word Clipping
100 Language Processing Knock-58: Tuple Extraction
100 language processing knock-50: sentence break
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
100 Language Processing Knock-25: Template Extraction
100 Language Processing Knock-87: Word Similarity
I tried 100 language processing knock 2020
100 language processing knock-56: co-reference analysis
Solving 100 Language Processing Knock 2020 (01. "Patatokukashi")
100 Amateur Language Processing Knock: Summary
100 language processing knock-42: Display of the phrase of the person concerned and the person concerned
100 Language Processing Knock 2020 Chapter 2: UNIX Commands
100 Language Processing Knock with Python (Chapter 1)
100 Language Processing Knock Chapter 1 in Python
100 Language Processing Knock 2020 Chapter 4: Morphological Analysis
100 Language Processing Knock 2020 Chapter 9: RNN, CNN
[Language processing 100 knocks 2020] Chapter 5: Dependency analysis
100 language processing knock-76 (using scikit-learn): labeling
100 language processing knock-55: named entity extraction
I tried 100 language processing knock 2020: Chapter 3
100 Language Processing Knock-82 (Context Word): Context Extraction
100 Language Processing Knock with Python (Chapter 3)
100 Language Processing Knock: Chapter 1 Preparatory Movement
100 Language Processing Knock 2020 Chapter 6: Machine Learning
100 Language Processing Knock Chapter 4: Morphological Analysis
Language processing 100 knock-86: Word vector display
100 Language Processing Knock 2020 Chapter 10: Machine Translation (90-98)
100 Language Processing Knock-28: MediaWiki Markup Removal
100 Language Processing Knock 2020 Chapter 7: Word Vector
100 Language Processing Knock 2020 Chapter 8: Neural Net
100 Language Processing Knock-59: Analysis of S-expressions
Python beginner tried 100 language processing knock 2015 (05 ~ 09)