[PYTHON] 100 Language Processing Knock-58: Tuple Extraction

This is my record of the 58th knock, "Tuple extraction", from "Chapter 6: Processing English text" of the 2015 edition of 100 Language Processing Knocks. The previous knock visualized the entire dependency structure; this time we extract and output only a specific kind of dependency. About 80% of the code is the same as last time.

Reference links

| Link | Remarks |
| --- | --- |
| 058. Extraction of tuples.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks: 58 | Source of many copied-and-pasted parts |
| Stanford Core NLP | Official Stanford Core NLP page to look at first |

Environment

| Type | Version | Notes |
| --- | --- | --- |
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| Stanford CoreNLP | 3.9.2 | Installed a year ago and I don't remember the details; it was still the latest a year later, so I kept using it |
| openJDK | 1.8.0_242 | I reused the JDK that was already installed for other purposes |

Chapter 6: Processing English Text

Study content

An overview of various basic natural language processing technologies through English text processing with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure parsing, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

58. Extraction of tuples

Based on the result of Stanford Core NLP's dependency parsing (collapsed-dependencies), output all "subject, predicate, object" triples in tab-delimited format. Use the following definitions of subject, predicate, and object, illustrated in the sketch after this list:

- Predicate: a word that has children (dependents) with both nsubj and dobj relations
- Subject: the child (dependent) related to the predicate by nsubj
- Object: the child (dependent) related to the predicate by dobj
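
As a minimal, self-contained illustration of these definitions (the relation triples below are hypothetical stand-ins for CoreNLP output, loosely based on sentence 5 of the results further down, not the answer program itself):

python

# Hypothetical (relation, governor, dependent) triples standing in for CoreNLP output
relations = [
    ('nsubj', 'published', 'Turing'),    # subject: nsubj child of the predicate
    ('dobj',  'published', 'article'),   # object: dobj child of the predicate
    ('det',   'article',   'an'),        # unrelated relation, ignored
]

triples = {}  # predicate -> [predicate, subject, object]
for rel, governor, dependent in relations:
    if rel in ('nsubj', 'dobj'):
        entry = triples.setdefault(governor, [governor, '', ''])
        entry[1 if rel == 'nsubj' else 2] = dependent

# Print in the task's stated subject-predicate-object order,
# only for predicates that have both children
for predicate, subject, obj in triples.values():
    if subject and obj:
        print(subject, predicate, obj, sep='\t')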

Problem supplement (about "tuple")

When I heard "tuple", I thought of Python's tuple, but here it means something different. First, the Wikipedia article "Tuple" describes it as a "set consisting of multiple components":

A tuple is a general concept that collectively refers to a set consisting of multiple components.

Stanford CoreNLP mentions tuples on the Stanford Open Information Extraction page:

Open information extraction (open IE) refers to the extraction of relation tuples, typically binary relations, from plain text, such as (Mark Zuckerberg; founded; Facebook).

The figure on the same page also makes the notion of a "tuple" easy to understand.
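
So the "tuple" here is simply an ordered triple of subject, predicate, and object. In Python terms, the Open IE example above would indeed fit in a plain 3-tuple (an illustration only, not part of the answer program):

python

# The Open IE example expressed as a plain Python 3-tuple (a triple)
triple = ('Mark Zuckerberg', 'founded', 'Facebook')
subject, predicate, obj = triple
print(subject, predicate, obj, sep='\t')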

Answer

Answer program: [058. Tuple Extraction.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/058.%E3%82%BF%E3%83%97%E3%83%AB%E3%81%AE%E6%8A%BD%E5%87%BA.ipynb)

import xml.etree.ElementTree as ET

# Enumerate sentences, processing one sentence at a time
for sentence in ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence'):

    output = {}

    # Enumerate dependencies
    for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

        # Check the relation type
        dep_type = dep.get('type')
        if dep_type == 'nsubj' or dep_type == 'dobj':

            # Add to the predicate dictionary
            governor = dep.find('./governor')
            index = governor.get('idx')
            if index in output:
                texts = output[index]
            else:
                texts = [governor.text, '', '']

            # Set the subject or object (for the same predicate, the later one wins)
            if dep_type == 'nsubj':
                texts[1] = dep.find('./dependent').text
            else:
                texts[2] = dep.find('./dependent').text
            output[index] = texts

    # Print only predicates that have both a subject and an object
    for key, texts in output.items():
        if texts[1] != '' and texts[2] != '':
            print(sentence.get('id'), '\t', '\t'.join(texts))

Answer commentary

XML file path

The table below maps the XML file paths to the dependency source and destination being extracted. At the 5th level, only dependencies tags whose type attribute is collapsed-dependencies are targeted; at the 6th level, only dep tags whose type attribute is nsubj or dobj.

| Output | 1st level | 2nd level | 3rd level | 4th level | 5th level | 6th level | 7th level |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Source (governor) | root | document | sentences | sentence | dependencies | dep | governor |
| Destination (dependent) | root | document | sentences | sentence | dependencies | dep | dependent |

The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).

xml: nlp.txt.xml (excerpt)


<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--Omission--

        <dependencies type="collapsed-dependencies">
          <dep type="root">
            <governor idx="0">ROOT</governor>
            <dependent idx="18">field</dependent>
          </dep>

--Omission--

          <dep type="nsubj">
            <governor idx="18">field</governor>
            <dependent idx="12">processing</dependent>
          </dep>

--Omission--

          <dep type="dobj">
            <governor idx="13">enabling</governor>
            <dependent idx="14">computers</dependent>
          </dep>
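
The iterfind calls in the answer program follow exactly these paths. As a self-contained check, the snippet below runs them against an abbreviated XML string modeled on the excerpt above (test data for illustration, not the real nlp.txt.xml):

python

import xml.etree.ElementTree as ET

# Abbreviated test data modeled on the nlp.txt.xml excerpt above
xml = '''<root><document><sentences>
  <sentence id="1">
    <dependencies type="collapsed-dependencies">
      <dep type="nsubj">
        <governor idx="18">field</governor>
        <dependent idx="12">processing</dependent>
      </dep>
      <dep type="dobj">
        <governor idx="13">enabling</governor>
        <dependent idx="14">computers</dependent>
      </dep>
    </dependencies>
  </sentence>
</sentences></document></root>'''

root = ET.fromstring(xml)
for sentence in root.iterfind('./document/sentences/sentence'):
    # The [@type="..."] predicate filters on the 5th-level attribute
    for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):
        print(dep.get('type'),
              dep.find('./governor').text,
              dep.find('./dependent').text)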

Creating the output dictionary

This is where the dictionary variable `output` is built for each sentence. The index of the predicate (`governor`) is used as the dictionary key; the value is a list whose contents are the predicate text, subject text, and object text. When multiple subjects or objects appear for the same predicate, the later one wins.

python


# Enumerate dependencies
for dep in sentence.iterfind('./dependencies[@type="collapsed-dependencies"]/dep'):

    # Check the relation type
    dep_type = dep.get('type')
    if dep_type == 'nsubj' or dep_type == 'dobj':

        # Add to the predicate dictionary
        governor = dep.find('./governor')
        index = governor.get('idx')
        if index in output:
            texts = output[index]
        else:
            texts = [governor.text, '', '']

        # Set the subject or object (for the same predicate, the later one wins)
        if dep_type == 'nsubj':
            texts[1] = dep.find('./dependent').text
        else:
            texts[2] = dep.find('./dependent').text
        output[index] = texts
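
As a design note, the if/else lookup-or-initialize above could be condensed with dict.setdefault, which returns the stored list (inserting the default first if the key is new) so later assignments mutate it in place. A tiny self-contained illustration of that same later-wins behavior, using hypothetical values taken from the XML excerpt:

python

# setdefault returns the existing list, or inserts the default and returns it;
# mutating the returned list updates the dictionary entry in place
output = {}
texts = output.setdefault('18', ['field', '', ''])
texts[1] = 'processing'  # a later nsubj for the same predicate would overwrite this
print(output)            # {'18': ['field', 'processing', '']}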

Output section

A line is output only when both the subject and the object are present.

python


for key, texts in output.items():
    if texts[1] != '' and texts[2] != '':
        print(sentence.get('id'), '\t', '\t'.join(texts))
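
One small formatting note: print's default separator is a space, so passing '\t' as its own argument produces a space-tab-space sequence between the sentence id and the predicate, which is why the results below show that extra spacing. A strictly tab-delimited line would use the sep parameter instead (a sketch using row 3 of the results as hypothetical data, changing the answer program's formatting):

python

texts = ['involve', 'understanding', 'generation']  # hypothetical row
print(3, '\t', '\t'.join(texts))  # "3 <tab> involve..." with extra spaces around the tab
print(3, *texts, sep='\t')        # "3<tab>involve<tab>..." strictly tab-delimited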

Output result (execution result)

Running the program outputs the following results.

Output result


3 	 involve	understanding	generation
5 	 published	Turing	article
6 	 involved	experiment	translation
11 	 provided	ELIZA	interaction
12 	 exceeded	patient	base
12 	 provide	ELIZA	response
14 	 structured	which	information
19 	 discouraged	underpinnings	sort
19 	 underlies	that	approach
20 	 produced	Some	systems
21 	 make	which	decisions
23 	 contains	that	errors
34 	 involved	implementations	coding
38 	 take	algorithms	set
39 	 produced	Some	systems
40 	 make	which	decisions
41 	 have	models	advantage
41 	 express	they	certainty
42 	 have	Systems	advantages
43 	 make	procedures	use
44 	 make	that	decisions
