[PYTHON] Language processing 100 knocks-48: Extraction of paths from nouns to roots

This is a record of knock 48, "Extraction of paths from nouns to roots", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of Language Processing 100 Knocks 2015. It is a little simpler than the previous knock, because there are fewer conditions and the program only has to keep printing dependency targets one after another.

Reference link

| Link | Remarks |
|:--|:--|
| 048.Extracting paths from nouns to roots.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:48 | The source I copied and pasted many parts from |
| CaboCha official | The CaboCha page to look at first |

environment

I installed CRF++ and CaboCha so long ago that I have forgotten how. Since neither package has been updated at all, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows. I think I could not get it to work on 64-bit Windows (my memory is vague, and the problem may have been my own lack of skill).

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu18.04.01 LTS | Running as a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv. Packages are managed using venv |
| Mecab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I forgot how I installed it (probably make install) |
| CaboCha | 0.69 | Too old; I forgot how I installed it (probably make install) |

Chapter 5: Dependency analysis

content of study

Apply the dependency parser CaboCha to "I am a cat" and get hands-on experience with manipulating dependency trees and with syntactic analysis.

Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Parsing, Dependency Path, [Graphviz](http://www.graphviz.org/)

Knock content

Use CaboCha to run dependency analysis on the text (neko.txt) of Natsume Soseki's novel "I am a cat", and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.

48. Extraction of paths from nouns to roots

For every clause in the sentence that contains a noun, extract the path from that clause to the root of the syntax tree. However, the path on the syntax tree shall satisfy the following specifications.

- Each clause is represented by its sequence of morphemes in surface form
- The expressions of the clauses, from the start clause to the end clause of the path, are concatenated with " -> "

From the sentence "I saw a human being for the first time here" (8th sentence of neko.txt.cabocha), the following output should be obtained.

I am->saw
here->Start with->Human->Things->saw
Human->Things->saw
Things->saw

Answer

Answer program [048. Extraction of path from noun to root.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/048.%E5%90%8D%E8%A9%9E%E3%81%8B%E3%82%89%E6%A0%B9%E3%81%B8%E3%81%AE%E3%83%91%E3%82%B9%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import re

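# Delimiters: tab and comma
separator = re.compile('\t|,')

# Dependency header line ("* <clause index> <target index>D ...")
dependancy = re.compile(r'''(?:\*\s\d+\s) # Marker and clause index (not captured)
                            (-?\d+)       # Index of the dependency target clause
                          ''', re.VERBOSE)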

class Morph:
    def __init__(self, line):
        
        # Split on tabs and commas
        cols = separator.split(line)
        
        self.surface = cols[0] # Surface form (surface)
        self.base = cols[7]    # Base form (base)
        self.pos = cols[1]     # Part of speech (pos)
        self.pos1 = cols[2]    # Part-of-speech subcategory 1 (pos1)

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst  = dst  # Index of the dependency target clause
        
        self.phrase = ''
        self.noun = False
        
        for morph in morphs:
            # The analyzer emits part-of-speech tags in Japanese:
            # '記号' is "symbol" and '名詞' is "noun"
            if morph.pos != '記号':
                self.phrase += morph.surface # Build the clause text from non-symbols
            if morph.pos == '名詞':
                self.noun = True

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:
    
    for line in f:
        dependancies = dependancy.match(line)
        
        # Neither EOS nor a dependency header: this is a morpheme line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))
            
        # EOS or a dependency header with morphemes pending: close the current clause
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []
       
        # Dependency header: remember the target index for the clause that follows
        if dependancies:
            dst = int(dependancies.group(1))
        
        # EOS with clauses pending: close the current sentence
        if line == 'EOS\n' and len(chunks) > 0:
            sentences.append(chunks)
            chunks = []

for i, sentence in enumerate(sentences):
    for chunk in sentence:
        if chunk.noun and chunk.dst != -1:
            line = chunk.phrase
            current_chunk = chunk
            while current_chunk.dst != -1:
                line = line + ' -> ' + sentence[current_chunk.dst].phrase
                current_chunk = sentence[current_chunk.dst]
            print(i, '\t',line)
    # Stop early because there is a lot of output
    if i > 10:
        break
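
To see what the reading loop reacts to, here is a minimal sketch that applies the same dependency regular expression to a few lines. The sample lines are assumptions written in the lattice layout that CaboCha produces (a `*`, the clause index, then the target index followed by `D`); the scores are made up, and `parse_dst` is a hypothetical helper that does not appear in the answer program.

```python
import re

# Same pattern as in the answer program: skip "* <clause index> ",
# then capture the (possibly negative) index of the dependency target.
dependancy = re.compile(r'''(?:\*\s\d+\s) # Marker and clause index (not captured)
                            (-?\d+)       # Index of the dependency target clause
                          ''', re.VERBOSE)

# Hypothetical header lines (scores are made up for illustration).
sample_lines = [
    '* 0 5D 0/1 -1.514009\n',  # clause 0 depends on clause 5
    '* 5 -1D 0/1 0.000000\n',  # clause 5 is the root, so the target is -1
    'EOS\n',                   # sentence terminator: not a header line
]

def parse_dst(line):
    """Return the dependency target index, or None if the line is not a header."""
    m = dependancy.match(line)
    return int(m.group(1)) if m else None

dst_values = [parse_dst(line) for line in sample_lines]
print(dst_values)
```

A target of -1 marks the root clause, which is why the output loop stops following the chain when it sees `dst == -1`.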

Answer commentary

Chunk class

The Chunk class is cleaner than last time. It carries a flag, `noun`, that records whether the clause contains a noun.

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst  = dst  # Index of the dependency target clause
        
        self.phrase = ''
        self.noun = False
        
        for morph in morphs:
            # The analyzer emits part-of-speech tags in Japanese:
            # '記号' is "symbol" and '名詞' is "noun"
            if morph.pos != '記号':
                self.phrase += morph.surface # Build the clause text from non-symbols
            if morph.pos == '名詞':
                self.noun = True
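
As an illustration of what the class receives, the sketch below splits two morpheme lines on tabs and commas and derives the clause text and the noun flag from them. The sample lines are assumptions written in the MeCab IPA-dictionary column layout (surface form, a tab, then comma-separated features, with part-of-speech tags in Japanese such as 名詞 for noun and 記号 for symbol). `SimpleMorph`, `parse_morph` and `build_phrase` are hypothetical names used only here.

```python
import re
from collections import namedtuple

separator = re.compile('\t|,')

# Hypothetical stand-in for a morpheme record: surface, base form, POS, POS subcategory.
SimpleMorph = namedtuple('SimpleMorph', 'surface base pos pos1')

def parse_morph(line):
    """Split one morpheme line on tabs and commas and pick the needed columns."""
    cols = separator.split(line.rstrip('\n'))
    return SimpleMorph(surface=cols[0], base=cols[7], pos=cols[1], pos1=cols[2])

def build_phrase(morphs):
    """Join the surfaces of non-symbol morphemes and report whether any is a noun."""
    phrase = ''.join(m.surface for m in morphs if m.pos != '記号')
    has_noun = any(m.pos == '名詞' for m in morphs)
    return phrase, has_noun

# Assumed sample lines for one clause (a pronoun followed by a particle).
sample = [
    '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ\n',
    'は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n',
]

morphs = [parse_morph(line) for line in sample]
phrase, has_noun = build_phrase(morphs)
print(phrase, has_noun)
```

Because the comparison is made against the tag text itself, the strings in the condition have to match what the analyzer writes, not an English translation of it.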

Output section

The preparation is a lot simpler than last time, but in exchange the output section below is a little more involved. For each clause that contains a noun and has a dependency target, the while loop appends the target clauses one after another until it reaches the end of the dependency chain (`dst == -1`).

for i, sentence in enumerate(sentences):
    for chunk in sentence:
        if chunk.noun and chunk.dst != -1:
            line = chunk.phrase
            current_chunk = chunk
            while current_chunk.dst != -1:
                line = line + ' -> ' + sentence[current_chunk.dst].phrase
                current_chunk = sentence[current_chunk.dst]
            print(i, '\t',line)
    # Stop early because there is a lot of output
    if i > 10:
        break
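
The same traversal can be tried without CaboCha on hand-built data. This is only a toy sketch: the clause texts, target indices and noun flags are reconstructed from the expected output for the sample sentence in the task statement, and `ToyChunk` and `paths_to_root` are hypothetical names, not part of the answer program.

```python
from collections import namedtuple

# Hypothetical stand-in for Chunk: clause text, target index, noun flag.
ToyChunk = namedtuple('ToyChunk', 'phrase dst noun')

# Clauses of the sample sentence, reconstructed from the expected output.
sentence = [
    ToyChunk('I am', 5, True),
    ToyChunk('here', 2, True),
    ToyChunk('Start with', 3, False),
    ToyChunk('Human', 4, True),
    ToyChunk('Things', 5, True),
    ToyChunk('saw', -1, False),   # root clause: no dependency target
]

def paths_to_root(sentence):
    """Return one ' -> '-joined path per noun clause that has a target."""
    paths = []
    for chunk in sentence:
        if not chunk.noun or chunk.dst == -1:
            continue
        phrases = [chunk.phrase]
        current = chunk
        # Follow the dependency targets until the root (dst == -1) is reached.
        while current.dst != -1:
            current = sentence[current.dst]
            phrases.append(current.phrase)
        paths.append(' -> '.join(phrases))
    return paths

for path in paths_to_root(sentence):
    print(path)
```

Collecting the phrases in a list and joining them at the end gives the same text as repeated string concatenation, and it keeps the separator in one place.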

Output result (execution result)

When the program is executed, the following results will be output.

2 The name is->No
3 where->Born->Katonto->Do not use
3 Katonto->Do not use
3 I have a clue->Do not use
4 anything->dim->In tears->I remember
4 at the place->In tears->I remember
4 Only what was there->I remember
5 I am->saw
5 here->Start with->Human->Things->saw
5 human->Things->saw
5 things->saw
6 Later->When you hear->That's it
6 it is->That's it
6 Called a student->In humans->Was a race->That's it
6 in humans->Was a race->That's it
6 Ichiban->Evil->Was a race->That's it
6 Evil->Was a race->That's it
6 Was a race->That's it
7 A student is->Is a story
7 we->Catch->Boil->To eat->Is a story
8 At that time->I didn't->did not think
8 what->Thoughts->I didn't->did not think
8 Thoughts->I didn't->did not think
9 his->In the palm->Be placed->Lifted->Time->soft->Feeling->Just warmed up
9 in the palm->Be placed->Lifted->Time->soft->Feeling->Just warmed up
9 Sue->Lifted->Time->soft->Feeling->Just warmed up
9 Time->soft->Feeling->Just warmed up
9 Feeling->Just warmed up
10 palms->Above->calm down->I saw->Human->Of things->Will be the beginning
10 Above->calm down->I saw->Human->Of things->Will be the beginning
10 students->Face->I saw->Human->Of things->Will be the beginning
10 faces->I saw->Human->Of things->Will be the beginning
10 I saw->Human->Of things->Will be the beginning
10 humans->Of things->Will be the beginning
10 things->Will be the beginning
11 Time->If it's a thing->thought->Feeling->Remaining
11 strange->If it's a thing->thought->Feeling->Remaining
11 things->thought->Feeling->Remaining
11 Feeling->Remaining
11 Still->Remaining
