[PYTHON] 100 Language Processing Knock-57: Dependency Analysis

This is my record of the 57th task, "Dependency analysis", from "Chapter 6: Processing English text" of Language Processing 100 Knocks 2015. It is the Stanford Core NLP version of "100 Language Processing Knock-44: Visualization of dependency trees", and I reuse a lot of code from that post.

Reference links

| Link | Remarks |
|:--|:--|
| 057. Dependency analysis.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 57 | Source I copied and pasted many parts from |
| Stanford Core NLP | Official Stanford Core NLP page to check first |

Environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.1 LTS | Running virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| Stanford CoreNLP | 3.9.2 | Installed a year ago and I don't remember the details... It was still the latest a year later, so I used it as-is |
| openJDK | 1.8.0_242 | I reused the JDK that was already installed for other purposes |

In the environment above, I use the following additional Python package. Just install it with regular pip.

| Type | Version |
|:--|:--|
| pydot | 1.4.1 |

Chapter 6: Processing English Text

Content of study

An overview of various basic techniques of natural language processing through English text processing with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, co-reference resolution, dependency analysis, phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

57. Dependency analysis

Visualize the collapsed-dependencies of Stanford Core NLP as a directed graph. For visualization, convert the dependency tree to the DOT language and use [Graphviz](http://www.graphviz.org/). Also, use pydot to visualize directed graphs directly from Python.
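As a rough illustration of what that conversion produces, here is a minimal, stdlib-only sketch that emits DOT text for a two-edge directed graph. The words and edges are made-up examples, not taken from nlp.txt:

```python
# Minimal sketch of the DOT language for a directed graph.
# The words and edges below are hypothetical examples, not from nlp.txt.
edges = [('saw', 'John'), ('saw', 'dog')]

lines = ['digraph dependencies {']
for governor, dependent in edges:
    # One "governor -> dependent" statement per dependency edge
    lines.append('  "{}" -> "{}";'.format(governor, dependent))
lines.append('}')
dot_source = '\n'.join(lines)

print(dot_source)
# digraph dependencies {
#   "saw" -> "John";
#   "saw" -> "dog";
# }
```

Both the Graphviz `dot` command and pydot can consume text like this; the answer program skips the manual step by letting pydot build the graph directly from an edge list.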

Problem supplement (about "dependency")

In Stanford Core NLP, dependency relations are called "Dependencies", and how they work is described in the Stanford Dependencies documentation.

There seem to be two variants, and this time the target is the collapsed-dependencies format.

After finishing, I realized the graph would have been easier to understand if I had added the relation labels (prep_on etc.) to the edges of the directed graph.

Answer

Answer program [057. Dependency Analysis.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/057.%E4%BF%82%E3%82%8A%E5%8F%97%E3%81%91%E8%A7%A3%E6%9E%90.ipynb)

```python
import xml.etree.ElementTree as ET

import pydot

for i, sentence in enumerate(ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence')):
    edges = []
    for dependency in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

        # Exclude punctuation
        if dependency.get('type') != 'punct':
            governor = dependency.find('./governor')
            dependent = dependency.find('./dependent')
            edges.append(((governor.get('idx'), governor.text),
                          (dependent.get('idx'), dependent.text)))

    # Render the sentence only if it has at least one non-punctuation dependency
    if len(edges) > 0:
        graph = pydot.graph_from_edges(edges, directed=True)
        graph.write_jpeg('057.graph_{}.jpeg'.format(i))

    # Stop after the first few sentences
    if i > 5:
        break
```

Answer commentary

XML file path

This is the mapping between the XML file paths and the extraction targets (dependency source and destination). The dependencies tag at the 5th level is filtered to entries whose type attribute is collapsed-dependencies.

| Output | 1st level | 2nd level | 3rd level | 4th level | 5th level | 6th level | 7th level |
|:--|:--|:--|:--|:--|:--|:--|:--|
| Dependency source (governor) | root | document | sentences | sentence | dependencies | dep | governor |
| Dependency destination (dependent) | root | document | sentences | sentence | dependencies | dep | dependent |

The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).

xml: nlp.txt.xml (excerpt)

```xml
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

<!-- Omission -->

        <dependencies type="collapsed-dependencies">
          <dep type="root">
            <governor idx="0">ROOT</governor>
            <dependent idx="18">field</dependent>
          </dep>
          <dep type="amod">
            <governor idx="3">processing</governor>
            <dependent idx="1">Natural</dependent>
          </dep>
          <dep type="compound">
            <governor idx="3">processing</governor>
            <dependent idx="2">language</dependent>
          </dep>
```
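To check this XPath access pattern in isolation, the following self-contained sketch runs the same query against an inline fragment shaped like the excerpt above. The fragment content is abbreviated and made up for illustration:

```python
import xml.etree.ElementTree as ET

# Inline fragment shaped like nlp.txt.xml (made up for illustration).
xml_text = '''<root><document><sentences><sentence id="1">
  <dependencies type="collapsed-dependencies">
    <dep type="amod">
      <governor idx="3">processing</governor>
      <dependent idx="1">Natural</dependent>
    </dep>
    <dep type="punct">
      <governor idx="3">processing</governor>
      <dependent idx="4">.</dependent>
    </dep>
  </dependencies>
</sentence></sentences></document></root>'''

root = ET.fromstring(xml_text)
for sentence in root.iterfind('./document/sentences/sentence'):
    # The [@type="..."] predicate selects only the collapsed-dependencies block
    for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):
        if dep.get('type') != 'punct':  # exclude punctuation, as in the answer
            governor = dep.find('./governor')
            dependent = dep.find('./dependent')
            print(governor.get('idx'), governor.text, '->',
                  dependent.get('idx'), dependent.text)
# 3 processing -> 1 Natural
```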

Directed graph display using Pydot

The code below does the same thing as "100 Language Processing Knock-44: Visualization of dependency trees", so I won't explain it again. However, I regret not adding the dependency relation labels (prep_on etc.) to the edges, which could be done with Graphviz or networkx. I think it could be written by referring to the articles "Drawing multigraphs and beautiful graphs with networkx [python]" and "Drawing beautiful graphs using Graphviz on Python". In the first place, pydot has not been updated since December 2018, so I am worried about its future.

```python
for i, sentence in enumerate(ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence')):
    edges = []
    for dependency in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

        # Exclude punctuation
        if dependency.get('type') != 'punct':
            governor = dependency.find('./governor')
            dependent = dependency.find('./dependent')
            edges.append(((governor.get('idx'), governor.text),
                          (dependent.get('idx'), dependent.text)))
    if len(edges) > 0:
        graph = pydot.graph_from_edges(edges, directed=True)
        graph.write_jpeg('057.graph_{}.jpeg'.format(i))
```

Output result (execution result)

When the program is executed, the following results are output (only the first three sentences are shown).

(The original post shows the three generated dependency-graph images here.)
