[PYTHON] 100 Language Processing Knock-59: Analysis of S-expressions

This is my record of knock 59, "Analysis of S-expressions", from "Chapter 6: Processing English Text" of Language Processing 100 Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). The task is to write a parser for a format called "S-expressions". It was the first time I had thought about how a parser works, and it turned out to be a deep topic. This knock took me a very long time. The finished program is only about 50 lines, but there is a lot of room to make it more efficient. This time I gave up on efficiency and kept it as simple as possible.

Reference link

| Link | Remarks |
|---|---|
| 059.Analysis of S-expressions.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks: 59 | The source from which I copied and pasted many parts |
| Stanford Core NLP Official | The Stanford Core NLP page to look at first |

Environment

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed using venv |
| Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details; it was still the latest version a year later, so I used it as is |
| OpenJDK | 1.8.0_242 | I used the JDK that was already installed for other purposes |

Chapter 6: Processing English Text

Content of study

Get an overview of the basic technologies of natural language processing by processing English text with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

59. Analysis of S-expression

Read the result of phrase structure analysis (S-expression) of Stanford Core NLP and display all noun phrases (NP) in the sentence. Display all nested noun phrases as well.

Problem supplement (about "S-expression")

Wikipedia's "S-expression" article explains it as follows.

A formal notation for binary trees or list structures, introduced in Lisp and mainly used in Lisp. The "S" is derived from "Symbol".

How natural language is expressed as S-expressions is described on the Stanford Parser page, and there is also an [online test tool](http://nlp.stanford.edu:8080/parser/). There was also a Python package that parses S-expressions, but it did not seem to be widely used, so I did my best and wrote my own.
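To make the list structure concrete, here is a minimal sketch that turns an S-expression string into nested Python lists. It is not the approach used in the answer program below; `parse_sexp` is a hypothetical helper name, and the sketch assumes well-formed input with balanced parentheses.

```python
import re

def parse_sexp(text):
    """Parse an S-expression string into nested Python lists (assumes balanced input)."""
    # A token is "(", ")" or a run of characters that is neither a parenthesis nor whitespace.
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    stack = [[]]  # stack[0] collects the top-level expressions
    for token in tokens:
        if token == '(':
            stack.append([])             # open a new nested list
        elif token == ')':
            finished = stack.pop()       # close the current list ...
            stack[-1].append(finished)   # ... and attach it to its parent
        else:
            stack[-1].append(token)      # a label or a word
    return stack[0][0]

tree = parse_sexp('(NP (DT the) (NN area))')
print(tree)  # ['NP', ['DT', 'the'], ['NN', 'area']]
```

Each node becomes a list whose first element is the label, which is the same shape Lisp would give the expression.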

Answer

Answer program [059. Analysis of S-expression.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/059.S%E5%BC%8F%E3%81%AE%E8%A7%A3%E6%9E%90.ipynb)

import re
import xml.etree.ElementTree as ET

# Split on parentheses; the capturing group keeps them in the result
reg_split = re.compile(r'''
                         (      # Start of capturing group
                         \(|\)  # Delimiter: opening or closing parenthesis
                         )      # End of capturing group
                         ''', re.VERBOSE)

def output_np(chunks):
    depth = 1
    output = []

    for chunk in chunks:

        # An opening parenthesis increases the depth by 1
        if chunk == '(':
            depth += 1

        # A closing parenthesis decreases the depth by 1
        elif chunk == ')':
            depth -= 1
        else:

            # If the chunk is a "part-of-speech word" pair, keep the word
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])

        # Print once the depth returns to 0 (the NP has closed)
        if depth == 0:
            print('\t', ' '.join(output))
            break

for parse in \
 ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):

    print(parse.text)

    # Split at parentheses into a list (dropping blank and empty elements)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text)
                if chunk.strip() != '']

    # Start output whenever an NP label is found
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])
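The splitting step relies on a detail of `re.split`: when the pattern contains a capturing group, the matched delimiters are kept in the result. The following small sketch shows what the chunk list looks like for a short S-expression. `to_chunks` is a hypothetical helper name, and the pattern is a compact equivalent of `reg_split` above.

```python
import re

# Same idea as reg_split in the answer program: a capturing group around the delimiters.
reg_split = re.compile(r'(\(|\))')

def to_chunks(sexp):
    """Split on parentheses, keep them as chunks, and drop blank pieces."""
    return [piece.strip() for piece in reg_split.split(sexp) if piece.strip() != '']

print(to_chunks('(NP (JJ such))'))
# ['(', 'NP', '(', 'JJ such', ')', ')']
```

Phrase labels such as `NP` end up as chunks of their own, while a part-of-speech tag and its word stay together in one chunk such as `JJ such`. That is why the program can test `chunk == 'NP'` and `len(sets) == 2`.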

Answer commentary

XML file path

The following is the mapping between the XML file path and the target S-expression. The S-expression is the content of the parse tag.

| Output | 1st level | 2nd level | 3rd level | 4th level | 5th level |
|---|---|---|---|---|---|
| S-expression | root | document | sentences | sentence | parse |
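As a quick check of this path, the sketch below runs the same `iterfind` expression against a tiny inline document instead of the real file. `SAMPLE_XML` is a made-up stand-in that keeps only the elements on the path.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for nlp.txt.xml (assumption: only the elements on the path are kept).
SAMPLE_XML = """<root>
  <document>
    <sentences>
      <sentence id="1">
        <parse>(ROOT (NP (NN NLP))) </parse>
      </sentence>
    </sentences>
  </document>
</root>"""

root = ET.fromstring(SAMPLE_XML)

# The path is relative to <root>, so it starts at ./document.
parses = [parse.text for parse in root.iterfind('./document/sentences/sentence/parse')]
print(parses)
```

The path does not mention `root` because the search starts from the root element itself.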

The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).

nlp.txt.xml (excerpt)


<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--Omission--

        <parse>(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .))) </parse>

Let's indent the S-expression above to make it a little easier to read. The sentence is relatively short, but the result is still long. The program joins the words inside each part enclosed by NP (noun phrase) and outputs them.

(ROOT 
	(S 
		(PP 
			(IN As) 
			(NP 
				(JJ such)
			)
		) 
		(, ,) 
		(NP 
			(NN NLP)
		) 
		(VP 
			(VBZ is) 
			(ADJP 
				(VBN related) 
				(PP 
					(TO to) 
					(NP 
						(NP 
							(DT the) 
							(NN area)
						) 
						(PP 
							(IN of) 
							(NP 
								(JJ humani-computer) 
								(NN interaction)
							)
						)
					)
				)
			)
		) 
		(. .)
	)
)
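The indented form above can also be produced by a short script. This is a sketch rather than part of the answer program. `indent_sexp` is a hypothetical helper, and it assumes every word is wrapped in a `(TAG word)` leaf, as in the Stanford Core NLP output.

```python
import re

def indent_sexp(sexp, indent='\t'):
    """Return the S-expression with one node per line, indented by depth."""
    tokens = re.findall(r'\(|\)|[^\s()]+', sexp)
    lines = []
    depth = 0
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == '(':
            # A leaf looks like "( TAG word )": keep it on a single line.
            is_leaf = (i + 3 < len(tokens) and tokens[i + 3] == ')'
                       and '(' not in tokens[i + 1:i + 3])
            if is_leaf:
                lines.append(indent * depth + '(' + tokens[i + 1] + ' ' + tokens[i + 2] + ')')
                i += 4
            else:
                # A phrase: print "(LABEL" and go one level deeper.
                lines.append(indent * depth + '(' + tokens[i + 1])
                depth += 1
                i += 2
        elif token == ')':
            # Close the current phrase on its own line.
            depth -= 1
            lines.append(indent * depth + ')')
            i += 1
        else:
            raise ValueError('unexpected token outside a (TAG word) leaf: ' + token)
    return '\n'.join(lines)

print(indent_sexp('(NP (NP (DT the) (NN area)) (PP (IN of) (NP (NN interaction))))'))
```

The layout matches the hand-indented version: phrase labels open a new level, leaves stay on one line, and each closing parenthesis lines up with the phrase it closes.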

Search for NP in the main loop

This is where the XML is read, looped over, and searched for NP. Each S-expression is split into a list with a regular expression, and the blank and empty elements that would get in the way are dropped. Whenever an NP appears, the output function `output_np` is called. NPs are output in the order they are found from the top, but for nested NPs this is inefficient, because the inner chunks pass through the same logic several times. I wanted to keep the code simple, so I left it inefficient.

for parse in \
 ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):

    print(parse.text)

    # Split at parentheses into a list (dropping blank and empty elements)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text)
                if chunk.strip() != '']

    # Start output whenever an NP label is found
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])
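The repeated scanning of nested NPs mentioned above could be avoided by keeping a stack of open nodes and visiting each token once. The following is only a sketch of that alternative, not the author's method. `collect_nps` is a hypothetical name, and it tokenizes into separate labels and words rather than reusing the chunk format above.

```python
import re

def collect_nps(sexp):
    """Collect the text of every NP in one pass over the tokens."""
    tokens = re.findall(r'\(|\)|[^\s()]+', sexp)
    stack = []    # one entry per open node: (label, words gathered so far, opening order)
    found = []    # (opening order, text) for each completed NP
    order = 0
    expect_label = False
    for token in tokens:
        if token == '(':
            expect_label = True             # the next token is a node label
        elif token == ')':
            label, words, opened = stack.pop()
            if label == 'NP':
                found.append((opened, ' '.join(words)))
            if stack:
                stack[-1][1].extend(words)  # pass the words up to the parent node
        elif expect_label:
            stack.append((token, [], order))
            order += 1
            expect_label = False
        else:
            stack[-1][1].append(token)      # a word inside a (TAG word) leaf
    # Sort by opening order so outer NPs come before the NPs nested inside them.
    return [text for _, text in sorted(found)]

print(collect_nps('(NP (NP (DT the) (NN area)) (PP (IN of) (NP (NN interaction))))'))
# ['the area of interaction', 'the area', 'interaction']
```

Sorting by opening order reproduces the order in which the original program prints nested NPs, at the cost of a little more bookkeeping.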

NP output section

The function tracks the depth of the S-expression by counting opening and closing parentheses, and prints the collected words when the NP closes.

def output_np(chunks):
    depth = 1
    output = []

    for chunk in chunks:

        # An opening parenthesis increases the depth by 1
        if chunk == '(':
            depth += 1

        # A closing parenthesis decreases the depth by 1
        elif chunk == ')':
            depth -= 1
        else:

            # If the chunk is a "part-of-speech word" pair, keep the word
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])

        # Print once the depth returns to 0 (the NP has closed)
        if depth == 0:
            print('\t', ' '.join(output))
            break
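To watch the depth counter at work on a small input, here is a variant that returns the phrase instead of printing it. `np_text` is a hypothetical name; the logic mirrors `output_np`, with an added `None` result for input in which the NP never closes.

```python
def np_text(chunks):
    """Mirror of output_np that returns the phrase text instead of printing it."""
    depth = 1          # we start just inside the "(" that opened the NP
    words = []
    for chunk in chunks:
        if chunk == '(':
            depth += 1
        elif chunk == ')':
            depth -= 1
        else:
            parts = chunk.split(' ')
            if len(parts) == 2:      # "TAG word" pair: keep the word
                words.append(parts[1])
        if depth == 0:               # the NP's own ")" has been reached
            return ' '.join(words)
    return None                      # unbalanced input: the NP never closed

# Chunks that follow the outer NP label in "(NP (NP (DT the) (NN area))) (IN of)".
chunks = ['(', 'NP', '(', 'DT the', ')', '(', 'NN area', ')', ')', ')', '(', 'IN of', ')']
print(np_text(chunks))  # the area
```

The trailing `(IN of)` chunks are never reached, because the loop stops as soon as the depth returns to 0.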

Output result (execution result)

Running the program prints the following result (an excerpt from the top of the output).

Output result (excerpt from the top)


(ROOT (S (PP (NP (JJ Natural) (NN language) (NN processing)) (IN From) (NP (NNP Wikipedia))) (, ,) (NP (NP (DT the) (JJ free) (NN encyclopedia) (JJ Natural) (NN language) (NN processing)) (PRN (-LRB- -LRB-) (NP (NN NLP)) (-RRB- -RRB-))) (VP (VBZ is) (NP (NP (NP (DT a) (NN field)) (PP (IN of) (NP (NN computer) (NN science)))) (, ,) (NP (JJ artificial) (NN intelligence)) (, ,) (CC and) (NP (NP (NNS linguistics)) (VP (VBN concerned) (PP (IN with) (NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages)))))))))) (. .))) 
	 Natural language processing
	 Wikipedia
	 the free encyclopedia Natural language processing -LRB- NLP -RRB-
	 the free encyclopedia Natural language processing
	 NLP
	 a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
	 a field of computer science
	 a field
	 computer science
	 artificial intelligence
	 linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
	 linguistics
	 the interactions between computers and human -LRB- natural -RRB- languages
	 the interactions
	 computers and human -LRB- natural -RRB- languages
	 computers
	 human -LRB- natural -RRB- languages
(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .))) 
	 such
	 NLP
	 the area of humani-computer interaction
	 the area
	 humani-computer interaction
(ROOT (S (S (NP (NP (JJ Many) (NNS challenges)) (PP (IN in) (NP (NN NLP)))) (VP (VBP involve) (S (NP (NP (JJ natural) (NN language) (NN understanding)) (, ,) (SBAR (WHNP (WDT that)) (S (VP (VBZ is)))) (, ,)) (VP (VBG enabling) (NP (NNS computers)) (S (VP (TO to) (VP (VB derive) (NP (NN meaning)) (PP (IN from) (NP (ADJP (JJ human) (CC or) (JJ natural)) (NN language) (NN input)))))))))) (, ,) (CC and) (S (NP (NNS others)) (VP (VBP involve) (NP (JJ natural) (NN language) (NN generation)))) (. .))) 
	 Many challenges in NLP
	 Many challenges
	 NLP
	 natural language understanding , that is ,
	 natural language understanding
	 computers
	 meaning
	 human or natural language input
	 others
	 natural language generation
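The `-LRB-` and `-RRB-` tokens in the output are the Penn Treebank style escapes for left and right round brackets. Because literal parentheses in the text are escaped this way, splitting the S-expression on `(` and `)` is safe. If readable output is preferred, they can be mapped back afterwards. This is a small sketch, with `restore_brackets` as a hypothetical helper name.

```python
# Penn Treebank style escapes for round brackets and the characters they stand for.
BRACKETS = {'-LRB-': '(', '-RRB-': ')'}

def restore_brackets(phrase):
    """Replace bracket escape tokens in a space-separated phrase."""
    return ' '.join(BRACKETS.get(word, word) for word in phrase.split(' '))

print(restore_brackets('human -LRB- natural -RRB- languages'))
# human ( natural ) languages
```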
