[PYTHON] 100 Language Processing Knock-42: Display of the Source and Destination Clauses

This is a record of the 42nd task, "Display of the source and destination clauses", from the 2015 edition of the Language Processing 100 Knocks, ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5). Since it outputs the source (modifier) clause together with its destination (modified) clause, it feels like real dependency parsing. Technically, though, only the output method changes slightly, so it is not much different from the previous knock.

Reference links

| Link | Remarks |
|:--|:--|
| 042. Display of the source and destination clauses.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 42 | Source from which many parts were copied and pasted |
| CaboCha official | The CaboCha page to look at first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how to install them. Since they are packages that have not been updated at all, I have not rebuilt the environment. I only remember being frustrated when I decided to use CaboCha on Windows. I believe I could not use it on 64-bit Windows (my memory is vague, and it may have been a technical problem on my side).

| Type | Version | Notes |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed using venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I have forgotten how I installed it (probably make install) |
| CaboCha | 0.69 | Too old; I have forgotten how I installed it (probably make install) |

Chapter 5: Dependency analysis

Content of study

Apply the dependency parser CaboCha to "I Am a Cat" and experience how dependency trees and syntactic analysis work.

Class, dependency parsing, CaboCha, clause, dependency, case, functional verb construction, dependency path, [Graphviz](http://www.graphviz.org/)

Knock content

Using CaboCha, parse the dependencies of the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.

42. Display of the source and destination clauses

Extract all pairs of the source clause text and the destination clause text in tab-delimited format. However, do not output symbols such as punctuation marks.

Answer

Answer program [042. Display of the source and destination clauses.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/042.%E4%BF%82%E3%82%8A%E5%85%83%E3%81%A8%E4%BF%82%E3%82%8A%E5%85%88%E3%81%AE%E6%96%87%E7%AF%80%E3%81%AE%E8%A1%A8%E7%A4%BA.ipynb)

```python
import re

# Delimiter: split on tabs and commas
separator = re.compile('\t|,')

# Dependency: matches CaboCha chunk header lines such as "* 0 2D 0/1 -1.91"
dependancy = re.compile(r'''(?:\*\s\d+\s) # chunk marker and chunk index (not captured)
                            (-?\d+)       # number (dependency destination)
                          ''', re.VERBOSE)

class Morph:
    def __init__(self, line):

        # Split on tabs and commas
        cols = separator.split(line)

        self.surface = cols[0] # Surface form (surface)
        self.base = cols[7]    # Base form (base)
        self.pos = cols[1]     # Part of speech (pos)
        self.pos1 = cols[2]    # Part-of-speech subdivision 1 (pos1)

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # List of source clause index numbers
        self.dst  = dst  # Index number of the destination clause
        self.phrase = ''.join([morph.surface for morph in morphs if morph.pos != '記号']) # Clause text, excluding symbols

# Assign the sources and append the Chunk list to the sentence list
def append_sentence(chunks, sentences):

    # Assign the source clause indexes
    for i, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(i)
    sentences.append(chunks)
    return sentences, []

morphs = []
chunks = []
sentences = []

with open('./neko.txt.cabocha') as f:

    for line in f:
        dependancies = dependancy.match(line)

        # If the line is neither EOS nor a dependency analysis result
        if not (line == 'EOS\n' or dependancies):
            morphs.append(Morph(line))

        # At an EOS or dependency line, when morphological analysis results have accumulated
        elif len(morphs) > 0:
            chunks.append(Chunk(morphs, dst))
            morphs = []

        # For a dependency analysis result line
        if dependancies:
            dst = int(dependancies.group(1))

        # At an EOS line, when chunks have accumulated
        if line == 'EOS\n' and len(chunks) > 0:
            sentences, chunks = append_sentence(chunks, sentences)

for si, sentence in enumerate(sentences):
    print('-----', si, '-----')
    for ci, chunk in enumerate(sentence):
        if chunk.dst != -1:
            print('{}:{}\t{}'.format(ci, chunk.phrase, sentence[chunk.dst].phrase))

    # Limit the output because there are many sentences
    if si > 5:
        break
```
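As a quick sanity check of the `dependancy` regex, here is a minimal sketch. The sample header lines are illustrative of CaboCha's lattice format (`-f1`), not taken from neko.txt.cabocha itself:

```python
import re

# Same pattern as the answer program: captures the dependency destination
# index from a CaboCha chunk header line such as "* 0 2D 0/1 -1.911675".
dependancy = re.compile(r'''(?:\*\s\d+\s) # chunk marker and chunk index (not captured)
                            (-?\d+)       # number (dependency destination)
                          ''', re.VERBOSE)

line = '* 0 2D 0/1 -1.911675'   # sample chunk header line (illustrative)
print(dependancy.match(line).group(1))   # → 2

line = '* 2 -1D 0/2 0.000000'   # a sentence-final chunk depends on nothing (-1)
print(dependancy.match(line).group(1))   # → -1
```

Because the capture group allows an optional minus sign, the sentinel value -1 used for "no destination" is parsed by the same expression.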

Answer commentary

Exclude symbols from Chunk clauses

Slightly different from the previous Chunk class: symbols are now excluded from the clause text.

```python
class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs
        self.srcs = []   # List of source clause index numbers
        self.dst  = dst  # Index number of the destination clause
        self.phrase = ''.join([morph.surface for morph in morphs if morph.pos != '記号']) # Clause text, excluding symbols

Output section

The task says "extract in tab-delimited format", which suggests plain tab-delimited text, but dumping everything at once is hard to read, and it is easier to follow with a delimiter between sentences. So I interpreted the task loosely: each pair is printed tab-delimited, with a separator line and a clause index added per sentence.

```python
for si, sentence in enumerate(sentences):
    print('-----', si, '-----')
    for ci, chunk in enumerate(sentence):
        if chunk.dst != -1:
            print('{}:{}\t{}'.format(ci, chunk.phrase, sentence[chunk.dst].phrase))

    # Limit the output because there are many sentences
    if si > 5:
        break
```

Output result (execution result)

When the program is executed, the following results are output (limited to the first 7 sentences).

Output result

```text
----- 0 -----
----- 1 -----
----- 2 -----
0:No name
1:Not yet
----- 3 -----
0:Where was born
1:Born
2:I don't get it
3:I have no idea
----- 4 -----
0:Anything dim
1:Dim crying
2:Weeping
3:Crying where you did
4:Meow meow crying
5:I cry and remember
6:I remember only what I was
----- 5 -----
0:I saw
1:For the first time here
2:For the first time called human
3:Human beings
4:I saw something
----- 6 -----
0:And that's right
1:I will ask you later
2:I heard that
3:That's right
4:In the human being called Shosei
5:Was a race in humans
6:The worst
7:Was an evil race
8:It seems that it was a race
```
