[PYTHON] 100 Language Processing Knock-58: Tuple Extraction

This is my record of the 58th knock, "Tuple extraction", from "Chapter 6: Processing English text" of the 2015 edition of 100 Language Processing Knocks. The previous knock visualized the entire dependency structure; this time we extract and output only a specific kind of dependency. About 80% of the code is the same as last time.

Reference links

| Link | Remarks |
| --- | --- |
| 058. Extraction of tuples.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks: 58 | Source of many copied-and-pasted parts |
| Stanford Core NLP | Official Stanford Core NLP page to look at first |

Environment

| Type | Version | Notes |
| --- | --- | --- |
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| Stanford CoreNLP | 3.9.2 | Installed a year ago and I don't remember the details; it was still the latest a year later, so I kept using it |
| openJDK | 1.8.0_242 | I reused the JDK that was already installed for other purposes |

Chapter 6: Processing English Text

Study content

An overview of various basic natural language processing technologies through English text processing with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure parsing, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

58. Extraction of tuples

Based on the result of Stanford Core NLP's dependency parsing (collapsed-dependencies), output all "subject, predicate, object" triples in tab-delimited format. Use the following definitions of subject, predicate, and object, illustrated in the sketch after this list:

- Predicate: a word that has children (dependents) with both nsubj and dobj relations
- Subject: the child (dependent) related to the predicate by nsubj
- Object: the child (dependent) related to the predicate by dobj
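
As a minimal, self-contained illustration of these definitions (the relation triples below are hypothetical stand-ins for CoreNLP output, loosely based on sentence 5 of the results further down, not the answer program itself):

python

# Hypothetical (relation, governor, dependent) triples standing in for CoreNLP output
relations = [
    ('nsubj', 'published', 'Turing'),    # subject: nsubj child of the predicate
    ('dobj',  'published', 'article'),   # object: dobj child of the predicate
    ('det',   'article',   'an'),        # unrelated relation, ignored
]

triples = {}  # predicate -> [predicate, subject, object]
for rel, governor, dependent in relations:
    if rel in ('nsubj', 'dobj'):
        entry = triples.setdefault(governor, [governor, '', ''])
        entry[1 if rel == 'nsubj' else 2] = dependent

# Print in the task's stated subject-predicate-object order,
# only for predicates that have both children
for predicate, subject, obj in triples.values():
    if subject and obj:
        print(subject, predicate, obj, sep='\t')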

Problem supplement (about "tuple")

When I heard "tuple", I thought of Python's tuple, but here it means something different. First, the Wikipedia article "Tuple" describes it as a "set consisting of multiple components":

A tuple is a general concept that collectively refers to a set consisting of multiple components.

Stanford CoreNLP mentions tuples on the Stanford Open Information Extraction page:

Open information extraction (open IE) refers to the extraction of relation tuples, typically binary relations, from plain text, such as (Mark Zuckerberg; founded; Facebook).

The figure on the same page also makes the notion of a "tuple" easy to understand.
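
So the "tuple" here is simply an ordered triple of subject, predicate, and object. In Python terms, the Open IE example above would indeed fit in a plain 3-tuple (an illustration only, not part of the answer program):

python

# The Open IE example expressed as a plain Python 3-tuple (a triple)
triple = ('Mark Zuckerberg', 'founded', 'Facebook')
subject, predicate, obj = triple
print(subject, predicate, obj, sep='\t')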

Answer

Answer program: [058. Tuple Extraction.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/058.%E3%82%BF%E3%83%97%E3%83%AB%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import xml.etree.ElementTree as ET

# Enumerate sentences, processing one sentence at a time
for sentence in ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence'):

    output = {}

    # Enumerate dependencies
    for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

        # Check the relation type
        dep_type = dep.get('type')
        if dep_type == 'nsubj' or dep_type == 'dobj':

            # Add to the predicate dictionary
            governor = dep.find('./governor')
            index = governor.get('idx')
            if index in output:
                texts = output[index]
            else:
                texts = [governor.text, '', '']

            # Set the subject or object (for the same predicate, the later one wins)
            if dep_type == 'nsubj':
                texts[1] = dep.find('./dependent').text
            else:
                texts[2] = dep.find('./dependent').text
            output[index] = texts

    # Print only predicates that have both a subject and an object
    for key, texts in output.items():
        if texts[1] != '' and texts[2] != '':
            print(sentence.get('id'), '\t', '\t'.join(texts))

Answer commentary

XML file path

The table below maps the XML file paths to the dependency source and destination being extracted. At the 5th level, only dependencies tags whose type attribute is collapsed-dependencies are targeted; at the 6th level, only dep tags whose type attribute is nsubj or dobj.

| Output | 1st level | 2nd level | 3rd level | 4th level | 5th level | 6th level | 7th level |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Source (governor) | root | document | sentences | sentence | dependencies | dep | governor |
| Destination (dependent) | root | document | sentences | sentence | dependencies | dep | dependent |

The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).

xml: nlp.txt.xml (excerpt)


<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--Omission--

        <dependencies type="collapsed-dependencies">
          <dep type="root">
            <governor idx="0">ROOT</governor>
            <dependent idx="18">field</dependent>
          </dep>

--Omission--

          <dep type="nsubj">
            <governor idx="18">field</governor>
            <dependent idx="12">processing</dependent>
          </dep>

--Omission--

          <dep type="dobj">
            <governor idx="13">enabling</governor>
            <dependent idx="14">computers</dependent>
          </dep>
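
The iterfind calls in the answer program follow exactly these paths. As a self-contained check, the snippet below runs them against an abbreviated XML string modeled on the excerpt above (test data for illustration, not the real nlp.txt.xml):

python

import xml.etree.ElementTree as ET

# Abbreviated test data modeled on the nlp.txt.xml excerpt above
xml = '''<root><document><sentences>
  <sentence id="1">
    <dependencies type="collapsed-dependencies">
      <dep type="nsubj">
        <governor idx="18">field</governor>
        <dependent idx="12">processing</dependent>
      </dep>
      <dep type="dobj">
        <governor idx="13">enabling</governor>
        <dependent idx="14">computers</dependent>
      </dep>
    </dependencies>
  </sentence>
</sentences></document></root>'''

root = ET.fromstring(xml)
for sentence in root.iterfind('./document/sentences/sentence'):
    # The [@type="..."] predicate filters on the 5th-level attribute
    for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):
        print(dep.get('type'),
              dep.find('./governor').text,
              dep.find('./dependent').text)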

Creating the output dictionary

This is where the dictionary variable `output` is built for each sentence. The index of the predicate (`governor`) is used as the dictionary key; the value is a list whose contents are the predicate text, subject text, and object text. When multiple subjects or objects appear for the same predicate, the later one wins.

python


# Enumerate dependencies
for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

    # Check the relation type
    dep_type = dep.get('type')
    if dep_type == 'nsubj' or dep_type == 'dobj':

        # Add to the predicate dictionary
        governor = dep.find('./governor')
        index = governor.get('idx')
        if index in output:
            texts = output[index]
        else:
            texts = [governor.text, '', '']

        # Set the subject or object (for the same predicate, the later one wins)
        if dep_type == 'nsubj':
            texts[1] = dep.find('./dependent').text
        else:
            texts[2] = dep.find('./dependent').text
        output[index] = texts
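
As a design note, the if/else lookup-or-initialize above could be condensed with dict.setdefault, which returns the stored list (inserting the default first if the key is new) so later assignments mutate it in place. A tiny self-contained illustration of that same later-wins behavior, using hypothetical values taken from the XML excerpt:

python

# setdefault returns the existing list, or inserts the default and returns it;
# mutating the returned list updates the dictionary entry in place
output = {}
texts = output.setdefault('18', ['field', '', ''])
texts[1] = 'processing'  # a later nsubj for the same predicate would overwrite this
print(output)            # {'18': ['field', 'processing', '']}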

Output section

A line is output only when both the subject and the object are present.

python


for key, texts in output.items():
    if texts[1] != '' and texts[2] != '':
        print(sentence.get('id'), '\t', '\t'.join(texts))
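
One small formatting note: print's default separator is a space, so passing '\t' as its own argument produces a space-tab-space sequence between the sentence id and the predicate, which is why the results below show that extra spacing. A strictly tab-delimited line would use the sep parameter instead (a sketch using row 3 of the results as hypothetical data, changing the answer program's formatting):

python

texts = ['involve', 'understanding', 'generation']  # hypothetical row
print(3, '\t', '\t'.join(texts))  # "3 <tab> involve..." with extra spaces around the tab
print(3, *texts, sep='\t')        # "3<tab>involve<tab>..." strictly tab-delimited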

Output result (execution result)

Running the program outputs the following results.

Output result


3 	 involve	understanding	generation
5 	 published	Turing	article
6 	 involved	experiment	translation
11 	 provided	ELIZA	interaction
12 	 exceeded	patient	base
12 	 provide	ELIZA	response
14 	 structured	which	information
19 	 discouraged	underpinnings	sort
19 	 underlies	that	approach
20 	 produced	Some	systems
21 	 make	which	decisions
23 	 contains	that	errors
34 	 involved	implementations	coding
38 	 take	algorithms	set
39 	 produced	Some	systems
40 	 make	which	decisions
41 	 have	models	advantage
41 	 express	they	certainty
42 	 have	Systems	advantages
43 	 make	procedures	use
44 	 make	that	decisions
