[PYTHON] 100 Language Processing Knock-49: Extraction of Dependency Paths Between Nouns

This is a record of solving knock 49, "Extraction of dependency paths between nouns", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 2015 edition of 100 Language Processing Knocks. We are finally at the halfway point of the 100 knocks, and **this one is a vicious knock that acts as a gatekeeper**. **The problem statement is hard to understand, and even once you understand it, it is hard to solve.** Having finished all 100, I can say this was the most troublesome knock for writing my own algorithm (there are more complicated ones, but for those I mostly left the work to a package, so there was little to worry about).

Reference links

| Link | Remarks |
|------|---------|
| [049. Extraction of dependency paths between nouns.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/049.%E5%90%8D%E8%A9%9E%E9%96%93%E3%81%AE%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E3%83%91%E3%82%B9%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb) | GitHub link to the answer program |
| [100 amateur language processing knocks: 49](https://qiita.com/segavvy/items/dfbf9d5dd5885e807d49) | The source of many copy-and-paste parts |
| [CaboCha official](https://taku910.github.io/cabocha/) | The CaboCha page to look at first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how I installed them. Since neither package has been updated in a long time, I have not rebuilt the environment. I only remember being frustrated when I tried to use CaboCha on Windows: I believe I could not use it on 64-bit Windows (my memory is vague, and it may have been a problem on my end rather than a technical limitation).

| Type | Version | Notes |
|------|---------|-------|
| OS | Ubuntu 18.04.01 LTS | Runs in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I forgot how I installed it (probably `make install`) |
| CaboCha | 0.69 | Too old; I forgot how I installed it (probably `make install`) |

Chapter 5: Dependency analysis

Content of study

Apply the dependency parser CaboCha to "I Am a Cat" and experience how dependency trees and syntactic analysis work.

Class, dependency parsing, CaboCha, clause, dependency, case, functional verb construction, dependency path, [Graphviz](http://www.graphviz.org/)

Knock content

Apply the dependency parser CaboCha to the text (neko.txt) of Natsume Soseki's novel "I Am a Cat", save the result in a file called neko.txt.cabocha, and use this file to implement programs that answer the following questions.
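The programs below assume that neko.txt.cabocha already exists. As a minimal sketch of this preprocessing step (assuming the `cabocha` command-line tool from the environment above is on PATH, and that `-f1` selects the lattice output format the parser below reads), the file can be generated like this:

```python
import subprocess

# Run CaboCha in lattice format (-f1) over neko.txt and save the result.
# This is a sketch of the preprocessing, not part of the original answer.
with open('neko.txt') as src, open('neko.txt.cabocha', 'w') as dst:
    subprocess.run(['cabocha', '-f1'], stdin=src, stdout=dst, check=True)
```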

49. Extraction of dependency paths between nouns

Extract the shortest dependency path that connects every pair of noun phrases in a sentence. When the clause numbers of a noun phrase pair are *i* and *j* (*i* < *j*), the dependency path must satisfy the following specifications:

- As in problem 48, a path is expressed by concatenating the representations (surface morpheme strings) of each clause, from the start clause to the end clause, with " -> ".
- Replace the noun phrases in clauses *i* and *j* with X and Y, respectively.

In addition, the shape of a dependency path falls into one of the following two cases:

- If clause *j* exists on the path from clause *i* to the root of the syntax tree: show the path from clause *i* to clause *j*.
- Otherwise, when the paths from clause *i* and clause *j* meet at a common clause *k* on the way to the root of the syntax tree: show the path from clause *i* to just before clause *k*, the path from clause *j* to just before clause *k*, and the contents of clause *k*, concatenated with " | ".

For example, the sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha) should produce the following output.

```
X is|In Y->Start with->Human->Things|saw
X is|Called Y->Things|saw
X is|Y|saw
In X->Start with-> Y
In X->Start with->Human-> Y
Called X-> Y
```
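To make the two cases concrete, here is a tiny sketch of how the case is decided for this sentence. The destination lists are the ones the program below stores in `dsts` (they also appear in the execution results near the end):

```python
# Clause 1 ("here") depends on 2 -> 3 -> 4 -> 5, so its path to the root is:
x_dsts = [2, 3, 4, 5]

# Case 1: clause j lies on clause i's path to the root (e.g. j = 3, "human"),
# so the path from i to j is printed directly (the P1 pattern below).
print(3 in x_dsts)  # True

# Case 2: clause 0 ("I am") has the path [5], and clause 1 is not on it,
# so the two paths are joined at a common clause with " | " (the P2 pattern).
print(1 in [5])     # False
```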

Problem supplement

The knock statement is long and hard to understand. I finally understood it after reading the article ["Amateur language processing 100 knocks: 49"](https://qiita.com/segavvy/items/dfbf9d5dd5885e807d49#%E5%95%8F%E9%A1%8C%E3%81%AE%E6%84%8F%E5%91%B3), so I will omit my own explanation...

Answer

Answer program: [049. Extraction of dependency paths between nouns.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/049.%E5%90%8D%E8%A9%9E%E9%96%93%E3%81%AE%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E3%83%91%E3%82%B9%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

It is about 150 lines... It could probably be simplified somewhat with more time, but I ran out of ideas.

```python
import re

# Delimiters in the MeCab output (tab and comma)
separator = re.compile('\t|,')

# Dependency line, e.g. "* 0 5D ..."; captures the destination number
dependancy = re.compile(r'''(?:\*\s\d+\s) # not captured (clause header)
                            (-?\d+)       # number (destination)
                          ''', re.VERBOSE)
NOUN_REPLACE = '@@noun@@'
SEP1 = ' -> '
SEP2 = ' | '

class Morph:
    def __init__(self, line):

        # Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0]  # surface form
        self.base = cols[7]     # base form
        self.pos = cols[1]      # part of speech
        self.pos1 = cols[2]     # part-of-speech subdivision 1

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst = dst   # index of the destination clause
        self.dsts = []

        self.phrase = ''
        self.phrase_noun = ''
        self.noun = False

        for morph in morphs:
            if morph.pos != '記号':  # MeCab POS tags are Japanese; skip symbols
                self.phrase += morph.surface  # build the clause from non-symbol morphemes
                if morph.pos == '名詞':  # noun

                    # Treat the first noun as the noun phrase
                    if self.noun == False:
                        self.noun = True
                        self.phrase_noun += NOUN_REPLACE
                        continue

                    # If the previous morpheme was a noun, do not add the surface form again
                    elif self.noun and self.phrase_noun.endswith(NOUN_REPLACE):
                        continue

                # Unless replaced or skipped, add the surface form as is
                self.phrase_noun += morph.surface

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:

    for line in f:
        dependancies = dependancy.match(line)

        # Neither EOS nor a dependency-analysis line
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))

        # EOS or a dependency line, with accumulated morphemes
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # Dependency line
        if dependancies:
            dst = int(dependancies.group(1))

        # EOS with accumulated chunks
        if line == 'EOS\n' and len(chunks) > 0:

            # Build the destination lists (start from the last clause for efficiency)
            for chunk in chunks[::-1]:
                if chunk.dst != -1:
                    chunk.dsts.extend([chunk.dst])
                    if chunks[chunk.dst].dst != -1:
                        chunk.dsts.extend(chunks[chunk.dst].dsts)
            sentences.append(chunks)
            chunks = []

def output_result(i, chunk, sentence):
    print('\tX:', i, chunk.phrase, chunk.dsts)

    # Look for noun phrases (Y) after the noun phrase X
    for i_y, y_chunk in enumerate(sentence[i+1:], i+1):

        # Y must be a noun phrase
        if y_chunk.noun:

            # Output Y information
            print('\t\tY:', i_y, y_chunk.phrase)

            result = ''

            # Y is on the path (destinations) of X
            if i_y in chunk.dsts:

                result = '\t\t\tP1:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X') + SEP1

                # Output from X to Y
                for i_path in chunk.dsts:

                    # Stop when Y is reached
                    if i_path == i_y:
                        result += 'Y'
                        break
                    else:
                        result += sentence[i_path].phrase + SEP1

            # Y is not on the path (destinations) of X
            else:
                result = '\t\t\tP2:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X')

                # The common clause is the smallest index in the intersection of the two paths
                i_goal = min(set(chunk.dsts).intersection(set(sentence[i_y].dsts)))

                # From X up to the common clause, then Y
                for i_path in chunk.dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_y].phrase_noun.replace(NOUN_REPLACE, 'Y')
                        break
                    else:
                        result += SEP1 + sentence[i_path].phrase

                # From Y's destinations to the common clause
                for i_path in sentence[i_y].dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_goal].phrase
                    else:
                        result += SEP1 + sentence[i_path].phrase
            print(result)

for i_sentence, sentence in enumerate(sentences):

    # Output the sentence
    print(i_sentence, '\t', ''.join([chunk.phrase for chunk in sentence]))
    for i_chunk, chunk in enumerate(sentence):

        # The clause must be a noun phrase with a destination
        if chunk.noun and chunk.dst != -1:
            output_result(i_chunk, chunk, sentence)

    # Limit the output because there are many sentences
    if i_sentence > 5:
        break
```

Answer commentary

Chunk class

First, the Chunk class is significantly extended. The flag `noun` is set to True if the clause contains a noun. Then `phrase_noun` is built so that clauses like "X is" and "Y is" can be produced later. At this point the clause could become either X or Y, so the placeholder `NOUN_REPLACE` is put where X or Y will go. To handle consecutive nouns (e.g. compounds such as "dove" + "clock"), `endswith` checks whether the string already ends with `NOUN_REPLACE`, which makes sure `NOUN_REPLACE` is not appended twice in a row.

```python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.dst = dst   # index of the destination clause
        self.dsts = []

        self.phrase = ''
        self.phrase_noun = ''
        self.noun = False

        for morph in morphs:
            if morph.pos != '記号':  # MeCab POS tags are Japanese; skip symbols
                self.phrase += morph.surface  # build the clause from non-symbol morphemes
                if morph.pos == '名詞':  # noun

                    # Treat the first noun as the noun phrase
                    if self.noun == False:
                        self.noun = True
                        self.phrase_noun += NOUN_REPLACE
                        continue

                    # If the previous morpheme was a noun, do not add the surface form again
                    elif self.noun and self.phrase_noun.endswith(NOUN_REPLACE):
                        continue

                # Unless replaced or skipped, add the surface form as is
                self.phrase_noun += morph.surface
```
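As a quick check of the consecutive-noun handling, here is a minimal sketch. The two MeCab-style input lines are hand-made stand-ins (not real parser output), showing that two adjacent nouns collapse into a single placeholder:

```python
# Hand-made MeCab-style lines:
# surface \t pos,pos1,pos2,pos3,conj-type,conj-form,base,reading,pronunciation
hato = Morph('鳩\t名詞,一般,*,*,*,*,鳩,ハト,ハト')          # "dove"
tokei = Morph('時計\t名詞,一般,*,*,*,*,時計,トケイ,トケイ')  # "clock"

chunk = Chunk([hato, tokei], dst=-1)
print(chunk.phrase)       # 鳩時計 ("cuckoo clock")
print(chunk.phrase_noun)  # @@noun@@ -- a single placeholder, not two
```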

File-reading loop

What has changed from the previous knock is the last `if` branch, "EOS with accumulated chunks". It loops over the clause list `chunks` so that every clause gets the full list of its destinations. The reason for looping backwards from the last clause is that each destination list can then be built in a single step by reusing the already-completed list of the clause it depends on.

```python
for line in f:
    dependancies = dependancy.match(line)

    # Neither EOS nor a dependency-analysis line
    if not (line == 'EOS\n' or dependancies):
        morphs.append(Morph(line))

    # EOS or a dependency line, with accumulated morphemes
    elif len(morphs) > 0:
        chunks.append(Chunk(morphs, dst))
        morphs = []

    # Dependency line
    if dependancies:
        dst = int(dependancies.group(1))

    # EOS with accumulated chunks
    if line == 'EOS\n' and len(chunks) > 0:

        # Build the destination lists (start from the last clause for efficiency)
        for chunk in chunks[::-1]:
            if chunk.dst != -1:
                chunk.dsts.extend([chunk.dst])
                if chunks[chunk.dst].dst != -1:
                    chunk.dsts.extend(chunks[chunk.dst].dsts)
        sentences.append(chunks)
        chunks = []
```
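To see why iterating in reverse lets each destination list be built in one step, here is a standalone sketch with a toy dependency array (the `dst` values are invented for illustration):

```python
class Node:
    def __init__(self, dst):
        self.dst = dst
        self.dsts = []

# Toy sentence: clause 0 depends on 1, clauses 1 and 2 depend on 3 (the root)
toy = [Node(1), Node(3), Node(3), Node(-1)]

for node in toy[::-1]:
    if node.dst != -1:
        node.dsts.extend([node.dst])
        if toy[node.dst].dst != -1:
            # The destination's own list is already complete, so reuse it
            node.dsts.extend(toy[node.dst].dsts)

print([node.dsts for node in toy])  # [[1, 3], [3], [3], []]
```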

Output section

Although the assignment does not ask for it, the whole sentence and the X and Y information are printed with indentation so the results are easy to check later. The output falls into roughly two patterns:

- Pattern 1: X and Y lie on the same path. Y replaces the entire end clause; tagged P1 in the output.
- Pattern 2: the paths of X and Y are separate and joined with `|`. Y replaces only the noun part of its clause; tagged P2 in the output.

In pattern 1, since X and Y are on the same path, the clauses from X to Y are simply output in order. Pattern 2 is more complicated because X and Y are on different paths, so the `intersection` set operation is used to find the clause where the two paths first meet; this is the same kind of set manipulation as in knock 06 of Chapter 1.
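As a minimal sketch of the `intersection` step, here is the computation for one pair from sentence 6 of the execution results below, where X is clause 1 with path [2, 9] and Y is clause 4 with path [5, 8, 9]:

```python
x_dsts = [2, 9]     # path from X (clause 1) to the root
y_dsts = [5, 8, 9]  # path from Y (clause 4) to the root

# The smallest shared clause index is where the two paths first meet
i_goal = min(set(x_dsts).intersection(set(y_dsts)))
print(i_goal)  # 9
```

The full `output_result` function is shown below.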

```python
def output_result(i, chunk, sentence):
    print('\tX:', i, chunk.phrase, chunk.dsts)

    # Look for noun phrases (Y) after the noun phrase X
    for i_y, y_chunk in enumerate(sentence[i+1:], i+1):

        # Y must be a noun phrase
        if y_chunk.noun:

            # Output Y information
            print('\t\tY:', i_y, y_chunk.phrase)

            result = ''

            # Y is on the path (destinations) of X
            if i_y in chunk.dsts:

                result = '\t\t\tP1:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X') + SEP1

                # Output from X to Y
                for i_path in chunk.dsts:

                    # Stop when Y is reached
                    if i_path == i_y:
                        result += 'Y'
                        break
                    else:
                        result += sentence[i_path].phrase + SEP1

            # Y is not on the path (destinations) of X
            else:
                result = '\t\t\tP2:' + '\t' + sentence[i].phrase_noun.replace(NOUN_REPLACE, 'X')

                # The common clause is the smallest index in the intersection of the two paths
                i_goal = min(set(chunk.dsts).intersection(set(sentence[i_y].dsts)))

                # From X up to the common clause, then Y
                for i_path in chunk.dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_y].phrase_noun.replace(NOUN_REPLACE, 'Y')
                        break
                    else:
                        result += SEP1 + sentence[i_path].phrase

                # From Y's destinations to the common clause
                for i_path in sentence[i_y].dsts:
                    if i_path == i_goal:
                        result += SEP2 + sentence[i_goal].phrase
                    else:
                        result += SEP1 + sentence[i_path].phrase
            print(result)
```

Output result (execution result)

Running the program produces the output below. It is probably right, but I have not checked it closely.

Output result:

```
0 one
1 I am a cat
2 No name yet
	X:0 The name is[2]
3 I have no idea where I was born
	X:0 where[1, 2, 4]
		Y:2 Katonto
			P1:In X->Born-> Y
		Y:3 I have a clue
			P2:In X->Born->Katonto|Y|Do not use
	X:2 Katonto[4]
		Y:3 I have a clue
			P2:With X|Y|Do not use
	X:3 I have a clue[4]
4 I remember only crying in a dim and damp place
	X:0 anything[1, 5, 7]
		Y:At 3
			P2:Even with X->dim|In Y|In tears->I remember
		Y:6 Only what was there
			P2:Even with X->dim->In tears|Only Y|I remember
		Y:7 remember
			P1:Even with X->dim->In tears-> Y
	X:At 3[5, 7]
		Y:6 Only what was there
			P2:In X->In tears|Only Y|I remember
		Y:7 remember
			P1:In X->In tears-> Y
	X:6 Only what was there[7]
		Y:7 remember
			P1:Only X-> Y
5 I saw human beings for the first time here
	X:0 I am[5]
		Y:1 where
			P2:X is|In Y->Start with->Human->Things|saw
		Y:3 human
			P2:X is|Called Y->Things|saw
		Y:4 things
			P2:X is|Y|saw
	X:1 where[2, 3, 4, 5]
		Y:3 human
			P1:In X->Start with-> Y
		Y:4 things
			P1:In X->Start with->Human-> Y
	X:3 human[4, 5]
		Y:4 things
			P1:Called X-> Y
	X:4 things[5]
6 Moreover, I heard later that it was the most evil race of human beings called Shosei.
	X:1 later[2, 9]
		Y:3 it is
			P2:In X->When you hear|Y is|That's it
		Y:4 Called a student
			P2:In X->When you hear|Called Y->In humans->Was a race|That's it
		Y:5 in humans
			P2:In X->When you hear|In Y->Was a race|That's it
		Y:6 Ichiban
			P2:In X->When you hear| Y ->Evil->Was a race|That's it
		Y:7 Evil
			P2:In X->When you hear|Y->Was a race|That's it
		Y:Was 8 races
			P2:In X->When you hear|Was Y|That's it
		Y:9 That's right
			P1:In X->When you hear-> Y
	X:3 it is[9]
		Y:4 Called a student
			P2:X is|Called Y->In humans->Was a race|That's it
		Y:5 in humans
			P2:X is|In Y->Was a race|That's it
		Y:6 Ichiban
			P2:X is| Y ->Evil->Was a race|That's it
		Y:7 Evil
			P2:X is|Y->Was a race|That's it
		Y:Was 8 races
			P2:X is|Was Y|That's it
		Y:9 That's right
			P1:X is-> Y
	X:4 Called a student[5, 8, 9]
		Y:5 in humans
			P1:Called X-> Y
		Y:6 Ichiban
			P2:Called X->In humans| Y ->Evil|Was a race->That's it
		Y:7 Evil
			P2:Called X->In humans|Y|Was a race->That's it
		Y:Was 8 races
			P1:Called X->In humans-> Y
		Y:9 That's right
			P1:Called X->In humans->Was a race-> Y
	X:5 in humans[8, 9]
		Y:6 Ichiban
			P2:In X| Y ->Evil|Was a race->That's it
		Y:7 Evil
			P2:In X|Y|Was a race->That's it
		Y:Was 8 races
			P1:In X-> Y
		Y:9 That's right
			P1:In X->Was a race-> Y
	X:6 Ichiban[7, 8, 9]
		Y:7 Evil
			P1:	X -> Y
		Y:Was 8 races
			P1:	X ->Evil-> Y
		Y:9 That's right
			P1:	X ->Evil->Was a race-> Y
	X:7 Evil[8, 9]
		Y:Was 8 races
			P1:X-> Y
		Y:9 That's right
			P1:X->Was a race-> Y
	X:Was 8 races[9]
		Y:9 That's right
			P1:Was X-> Y
```
