[PYTHON] 100 Language Processing Knock-44: Visualization of Dependent Tree

This is a record of the 44th task, "Visualization of Dependency Trees", from ["Chapter 5: Dependency analysis"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch5) of the 2015 edition of the 100 Language Processing Knocks. Visualization makes it very easy to see how the clauses of a sentence depend on each other. By visualizing dependencies, you can also do fun things like the article "I tried to linguistically analyze Karen Takizawa's incomprehensible sentences.".

Reference links

| Link | Remarks |
|------|---------|
| 044. Visualization of dependent trees.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks: 44 | Source from which I copied and pasted many parts |
| CaboCha official | CaboCha page to check first |

Environment

I installed CRF++ and CaboCha so long ago that I have forgotten how to install them. Since these packages have not been updated in a long time, I did not rebuild the environment. I only have frustrating memories of trying to use CaboCha on Windows; I believe I could not get it to work on 64-bit Windows (my memory is vague, and it may have been a technical limitation).

| Type | Version | Contents |
|------|---------|----------|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |
| CRF++ | 0.58 | Too old; I forgot how to install it (probably `make install`) |
| CaboCha | 0.69 | Too old; I forgot how to install it (probably `make install`) |

In the environment above, I also use the following Python package. Just install it with regular pip.

| Type | Version |
|------|---------|
| pydot | 1.4.1 |

Chapter 5: Dependency analysis

Content of study

Apply the dependency parser CaboCha to "I am a cat" and experience working with dependency trees and syntactic parsing.

Class, Dependency Parsing, CaboCha, Clause, Dependency, Case, Functional Verb Parsing, Dependency Path, [Graphviz](http://www.graphviz.org/)

Knock content

Apply CaboCha to the text (neko.txt) of Natsume Soseki's novel "I am a cat" to parse its dependencies, and save the result to a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.

44. Visualization of the dependency tree

Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree into the DOT language and use [Graphviz](http://www.graphviz.org/). Also, to visualize directed graphs directly from Python, use pydot.

Problem supplement (about "visualization" and "directed graphs")

Visualization

The knock mentions two ways to do the visualization. I ignored the first method; I never even checked how easy it is, since it was not used in "Amateur language processing 100 knocks: 44", which I always refer to.

For visualization, convert the dependency tree into the DOT language and use [Graphviz](http://www.graphviz.org/).

This time I used the second method, quoted below. With it, all you have to do is install pydot with pip and call a function from Python.

Also, to visualize directed graphs directly from Python, use pydot.
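
As a minimal sketch of that approach (a toy graph with made-up node labels, not the knock's data), pydot's graph_from_edges turns a plain list of edges into a directed graph and renders it to PNG, assuming Graphviz is installed and on the PATH:

```python
import pydot

# Toy edge list: each pair is (tail, head) of a directed edge
edges = [('I', 'am'), ('a cat', 'am')]

# directed=True draws arrows instead of plain lines
graph = pydot.graph_from_edges(edges, directed=True)
graph.write_png('toy.png')  # rendering requires Graphviz to be installed
```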

Directed graph

First, there is something called [**graph theory**](https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%A9%E3%83%95%E7%90%86%E8%AB%96).

Graph theory is a mathematical theory of graphs, which consist of a set of nodes (vertices) and a set of edges (branches/sides).

The [definitions of directed and undirected graphs](https://ja.wikipedia.org/wiki/%E3%82%B0%E3%83%A9%E3%83%95%E7%90%86%E8%AB%96#%E6%A6%82%E8%A6%81) are roughly as follows (a "directed graph" is one whose edges have a direction). Please follow the link for details.

If you want to express not only which nodes are connected but also "from which to which", add an arrow to each edge. Such a graph is called a directed graph or digraph. A graph without arrows is called an undirected graph.
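
Concretely (a toy illustration of my own, not part of the knock), a directed graph can be represented in Python simply as a list of (tail, head) pairs, which is exactly the shape the program later feeds to pydot:

```python
# A directed graph as a list of (tail, head) edges
edges = [('A', 'B'), ('B', 'C'), ('A', 'C')]

# The same graph as an adjacency mapping: node -> nodes it points to
adjacency = {}
for tail, head in edges:
    adjacency.setdefault(tail, []).append(head)

print(adjacency)  # {'A': ['B', 'C'], 'B': ['C']}
```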

Answer

Answer program: [044. Visualization of dependent trees.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/05.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90/044.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E6%9C%A8%E3%81%AE%E5%8F%AF%E8%A6%96%E5%8C%96.ipynb)

```python
import re
from subprocess import run, PIPE

import pydot

# Delimiter: morpheme lines are split on tabs and commas
separator = re.compile('\t|,')

# Dependency header, e.g. "* 0 1D 0/4 0.285960"
dependancy = re.compile(r'''(?:\*\s\d+\s) # not captured: "* <clause index> "
                            (-?\d+)       # captured: head (destination) clause index
                          ''', re.VERBOSE)

text = input('Please enter text')

# Default value when nothing is entered
if len(text) == 0:
    text = ("I don't remember exactly whether I said it or not, but I think "
            "I probably said it when I had a hand-wound party the other day, "
            "without feeling like I said it a little. I tried it, but I came "
            "to think that it doesn't matter whether I say it or not.")

cmd = 'echo {} | cabocha -f1'.format(text)
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
print(proc.stdout.decode('UTF-8'))

class Chunk:
    def __init__(self, phrase, dst):
        self.phrase = phrase
        self.dst = dst  # index of the head (destination) clause

phrase = ''
chunks = []
for line in proc.stdout.decode('UTF-8').splitlines():
    dependancies = dependancy.match(line)

    # Neither EOS nor a dependency header: a morpheme line
    if not (line == 'EOS' or dependancies):
        # Split on tabs and commas
        cols = separator.split(line)
        phrase += cols[0]  # surface form

    # EOS or a dependency header while a clause has been accumulated: flush it
    elif phrase != '':
        chunks.append(Chunk(phrase, dst))
        phrase = ''

    # Dependency header: remember the head index for the clause that follows
    if dependancies:
        dst = int(dependancies.group(1))

# Convert to the (tail, head) edge format that pydot expects
edges = []
for i, chunk in enumerate(chunks):
    if chunk.dst != -1 and \
       chunk.phrase != '' and \
       chunks[chunk.dst].phrase != '':
        edges.append(((i, chunk.phrase), (chunk.dst, chunks[chunk.dst].phrase)))

# Save the image as a directed graph with pydot
if len(edges) > 0:
    graph = pydot.graph_from_edges(edges, directed=True)
    graph.write_png('044.dot.png')
```

Answer commentary

Text input

The "given sentence" part of the knock is given by the ʻinput` function (does it conform to the question intention?). If nothing is entered, the initial value will be used.

```python
text = input('Please enter text')

# Default value when nothing is entered
if len(text) == 0:
    text = ("I don't remember exactly whether I said it or not, but I think "
            "I probably said it when I had a hand-wound party the other day, "
            "without feeling like I said it a little. I tried it, but I came "
            "to think that it doesn't matter whether I say it or not.")
```

CaboCha execution part

The CaboCha execution part uses the `run` function from the `subprocess` package to run a shell command. I did not use CaboCha's Python wrapper, simply because it felt like too much trouble.

```python
cmd = 'echo {} | cabocha -f1'.format(text)
proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
print(proc.stdout.decode('UTF-8'))
```
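
As an aside, building the command with echo means shell quoting can bite when the text contains special characters; a variant that passes the sentence to cabocha via standard input avoids that. This is just a sketch, not what the answer program uses:

```python
from subprocess import run, PIPE

text = 'Any sentence to parse'  # in the answer program this comes from input()

# Pass the sentence on stdin instead of interpolating it into a shell command
proc = run(['cabocha', '-f1'], input=text.encode('UTF-8'),
           stdout=PIPE, stderr=PIPE)
print(proc.stdout.decode('UTF-8'))
```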

The first part of the content output by the print function is as follows.

Part of the print result:

```
* 0 1D 0/4 0.285960
Say verb,Independence,*,*,Godan / Wa line reminder,Continuous connection,To tell,It,It
Particles,Connection particle,*,*,*,*,hand,Te,Te
A verb,Non-independent,*,*,Five steps, La line,Continuous connection,is there,Ah,Ah
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
Ka particle,Sub-particles / parallel particles / final particles,*,*,*,*,Or,Mosquito,Mosquito
* 1 4D 0/4 2.230543
Say verb,Independence,*,*,Godan / Wa line reminder,Continuous connection,To tell,It,It
Verb,Non-independent,*,*,One step,Imperfective form,Teru,Te,Te
No auxiliary verb,*,*,*,Special Nai,Continuous connection,Absent,Naka,Naka
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
Ka particle,Sub-particles / parallel particles / final particles,*,*,*,*,Or,Mosquito,Mosquito
* 2 4D 0/3 2.418727
Which noun,Pronoun,General,*,*,*,Which,Dotch,Dotch
Auxiliary verb,*,*,*,Special,Continuous connection,Is,Dad,Dad
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
Ka particle,Sub-particles / parallel particles / final particles,*,*,*,*,Or,Mosquito,Mosquito
```
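
The lines beginning with `*` are the clause headers of the -f1 lattice format: the second field is the clause index and the third field (e.g. `1D`) gives the index of the head clause, followed by the head/function word positions and a score. A quick check (my own illustration, not part of the answer program) of how the `dependancy` pattern pulls out that head index:

```python
import re

# Same pattern as in the answer program
dependancy = re.compile(r'''(?:\*\s\d+\s) # skipped: "* <clause index> "
                            (-?\d+)       # head clause index; -1 means the clause is the root
                          ''', re.VERBOSE)

m = dependancy.match('* 0 1D 0/4 0.285960')
print(m.group(1))  # -> '1' : clause 0 depends on clause 1
```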

Passing edges to pydot

The parsed clauses are then converted into the edge format that pydot expects. Each edge is a pair of (index, phrase) tuples; keeping the index alongside the phrase ensures that identical phrases in different clauses remain separate nodes. Clauses with no head (`dst` of -1) and empty phrases are skipped.

```python
# Convert to the (tail, head) edge format that pydot expects
edges = []
for i, chunk in enumerate(chunks):
    if chunk.dst != -1 and \
       chunk.phrase != '' and \
       chunks[chunk.dst].phrase != '':
        edges.append(((i, chunk.phrase), (chunk.dst, chunks[chunk.dst].phrase)))
```

By the way, `edges` ends up with contents like this:

```
((0, 'Did you say'), (1, 'Didn't you say'))
((1, 'Didn't you say'), (4, 'I don't remember'))
((2, 'Which was'), (4, 'I don't remember'))
((3, 'Properly'), (4, 'I don't remember'))
((4, 'I don't remember'), (19, 'I thought about it,'))
((5, 'Certainly'), (7, 'Hooray'))
((6, 'A hand-wound party during this time'), (7, 'Hooray'))
((7, 'Hooray'), (8, 'Sometimes'))
((8, 'Sometimes'), (10, 'Said'))
((9, 'A little bit'), (10, 'Said'))
((10, 'Said'), (11, 'Feeling'))
((11, 'Feeling'), (12, 'Without'))
((12, 'Without'), (14, 'Nishimo'))
((13, 'Without'), (14, 'Nishimo'))
((14, 'Nishimo'), (15, 'Without'))
((15, 'Without'), (17, 'I think I said'))
((16, 'Perhaps'), (17, 'I think I said'))
((17, 'I think I said'), (19, 'I thought about it,'))
((18, 'To here'), (19, 'I thought about it,'))
((19, 'I thought about it,'), (28, 'It depends.'))
((20, 'Oh dear'), (21, 'I'll tell you'))
((21, 'I'll tell you'), (28, 'It depends.'))
((22, 'Say'), (23, 'I don't care'))
((23, 'I don't care'), (25, 'There is no problem,'))
((24, 'Up to that point'), (25, 'There is no problem,'))
((25, 'There is no problem,'), (26, 'I think'))
((26, 'I think'), (27, 'Reached'))
((27, 'Reached'), (28, 'It depends.'))
```

Directed graphing

Finally, the graph_from_edges function creates a directed graph and the write_png function saves the image. Passing directed=True when creating the graph makes the lines between clauses arrows.

```python
# Save the image as a directed graph with pydot
if len(edges) > 0:
    graph = pydot.graph_from_edges(edges, directed=True)
    graph.write_png('044.dot.png')
```
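
If you also want the DOT source mentioned in the first method (to feed Graphviz directly or tweak the styling), the same pydot graph object can be dumped as text. This is an optional sketch of my own; the toy edges and the file name 044.dot are made up for illustration:

```python
import pydot

# The real edges from above would work the same way; toy edges here for illustration
graph = pydot.graph_from_edges([('clause A', 'clause B')], directed=True)

# to_string() returns the DOT-language source, which Graphviz's dot command can render
with open('044.dot', 'w', encoding='UTF-8') as f:
    f.write(graph.to_string())
```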

Output result (execution result)

When the program is executed, the following results will be output.

(Image: the dependency tree rendered to 044.dot.png)

By the way, the original inspiration for this post was the article "[Play] Syntactic analysis of an outrageous Shinkalion email". (Image: the dependency tree from that article)
