This is a record of knock 59, "Analysis of S-expressions", from "Chapter 6: Processing English Text" of 100 Language Processing Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). The task is to write a parser for a format called the "S-expression". It was the first time I had thought about writing a parser, and the topic turned out to be very deep. This knock took me a very long time. The finished program is only about 50 lines, but it leaves a lot of room for efficiency improvements. This time I gave up on efficiency and kept the code as simple as possible.

Reference link

Link	Remarks
059.Analysis of S-expressions.ipynb	Answer program GitHub link
100 amateur language processing knocks:59	Source from which I copied and pasted many parts of the code
Stanford Core NLP Official	Stanford Core NLP page to look at first

Environment

Type	Version	Remarks
OS	Ubuntu18.04.01 LTS	Running on a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	I'm using Python 3.8.1 on pyenv. Packages are managed with venv
Stanford CoreNLP	3.9.2	I installed it a year ago and don't remember the details ... It was still the latest version a year later, so I used it as it was
openJDK	1.8.0_242	I used the JDK that was already installed for other purposes

Chapter 6: Processing English Text

content of study

An overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

59. Analysis of S-expression

Read the result of phrase structure analysis (S-expression) of Stanford Core NLP and display all noun phrases (NP) in the sentence. Display all nested noun phrases as well.

Problem supplement (about "S-expression")

Wikipedia's "S-expression" article explains it as follows.

A formal notation for binary trees or list structures that was introduced in Lisp and is mainly used in Lisp. The "S" comes from "symbol".

How natural language is represented with S-expressions is described on the Stanford Parser page, and there is also an [online test tool](http://nlp.stanford.edu:8080/parser/). There is a Python package that parses S-expressions as well, but it did not seem to be widely used, so I did my best and wrote the parser myself.
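
To make the "list structure" idea concrete, here is a minimal sketch that turns an S-expression into nested Python lists. It is not the approach used in the answer below; the helper names `tokenize` and `parse_sexp` are my own, and the sketch assumes well-formed input with balanced parentheses.

```python
import re


def tokenize(sexp):
    """Split an S-expression into '(', ')' and bare word tokens."""
    return re.findall(r'\(|\)|[^\s()]+', sexp)


def parse_sexp(tokens):
    """Build a nested list by recursive descent, consuming tokens from the front."""
    token = tokens.pop(0)
    if token != '(':
        return token              # a bare word (label or surface text)
    node = []
    while tokens[0] != ')':
        node.append(parse_sexp(tokens))
    tokens.pop(0)                 # discard the closing ')'
    return node


tree = parse_sexp(tokenize('(NP (DT the) (NN area))'))
print(tree)  # ['NP', ['DT', 'the'], ['NN', 'area']]
```

Each pair of parentheses becomes one list whose first element is the label, so a phrase such as an NP is simply a list that contains the lists of its words.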

Answer

Answer program [059. Analysis of S-expression.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/059.S%E5%BC%8F%E3%81%AE%E8%A7%A3%E6%9E%90.ipynb)

import re
import xml.etree.ElementTree as ET

# Split on parentheses; the capturing group keeps "(" and ")" in the result
reg_split = re.compile(r'''
                         (      # Group start
                         \(|\)  # Split characters (opening or closing parenthesis)
                         )      # Group end
                         ''', re.VERBOSE)


def output_np(chunks):
    # chunks starts right after an 'NP' label, so we are already one level deep
    depth = 1
    output = []

    for chunk in chunks:

        # An opening parenthesis increases the depth by 1
        if chunk == '(':
            depth += 1

        # A closing parenthesis decreases the depth by 1
        elif chunk == ')':
            depth -= 1
        else:

            # If the chunk is a "part-of-speech word" pair, split it and collect the word
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])

        # When the depth reaches 0 the NP is closed, so print it
        if depth == 0:
            print('\t', ' '.join(output))
            break


for parse in \
 ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):

    print(parse.text)

    # Split at opening and closing parentheses into a list (blank and empty elements are excluded)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text)
                if chunk.strip() != '']

    # Start output whenever an NP label is reached
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])

Answer commentary

XML file path

The table below maps the XML path to the target S-expression. The S-expression itself is the content of the `parse` tag.

Output	1st level	2nd level	3rd level	4th level	5th level
S-expression	root	document	sentences	sentence	parse

The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
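
As a small illustration of that path, the sketch below reads the `parse` elements from an in-memory string instead of the real file. `SAMPLE_XML` is a made-up miniature of the CoreNLP output that I wrote for this example; only the element names follow the path in the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the CoreNLP XML, just enough to show the path
SAMPLE_XML = """<root>
  <document>
    <sentences>
      <sentence id="1">
        <parse>(ROOT (NP (NN NLP)))</parse>
      </sentence>
      <sentence id="2">
        <parse>(ROOT (NP (DT the) (NN area)))</parse>
      </sentence>
    </sentences>
  </document>
</root>"""

# fromstring() returns the <root> element, so the path starts at its child
root = ET.fromstring(SAMPLE_XML)
parses = [p.text for p in root.iterfind('./document/sentences/sentence/parse')]

for text in parses:
    print(text)
```

With the real file, `ET.parse('./nlp.txt.xml')` takes the place of `ET.fromstring(...)`, and the same path expression applies.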

`xml:nlp.txt.xml(Excerpt)`


<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--Omission--

        <parse>(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .))) </parse>

Let's indent the S-expression above to make it a little easier to read. It is a relatively short sentence, but the tree is still long ... The program joins the words inside each NP (noun phrase) and outputs them.

(ROOT 
	(S 
		(PP 
			(IN As) 
			(NP 
				(JJ such)
			)
		) 
		(, ,) 
		(NP 
			(NN NLP)
		) 
		(VP 
			(VBZ is) 
			(ADJP 
				(VBN related) 
				(PP 
					(TO to) 
					(NP 
						(NP 
							(DT the) 
							(NN area)
						) 
						(PP 
							(IN of) 
							(NP 
								(JJ humani-computer) 
								(NN interaction)
							)
						)
					)
				)
			)
		) 
		(. .)
	)
)
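
An indented view like the one above can also be produced mechanically. The following is a rough sketch, assuming CoreNLP-style input in which every opening parenthesis is followed by a label; `indent_sexp` is a hypothetical helper and is not part of the answer program.

```python
import re


def indent_sexp(sexp):
    """Return the S-expression with one tab of indentation per nesting level."""
    tokens = re.findall(r'\(|\)|[^\s()]+', sexp)
    lines, depth, i = [], 0, 0
    while i < len(tokens):
        if tokens[i] == '(':
            pair = tokens[i + 1:i + 3]
            is_leaf = (len(pair) == 2
                       and '(' not in pair and ')' not in pair
                       and tokens[i + 3:i + 4] == [')'])
            if is_leaf:
                # "(POS word)" stays on a single line
                lines.append('\t' * depth + '(' + ' '.join(pair) + ')')
                i += 4
            else:
                # A phrase label opens a new, deeper level
                lines.append('\t' * depth + '(' + tokens[i + 1])
                depth += 1
                i += 2
        elif tokens[i] == ')':
            depth -= 1
            lines.append('\t' * depth + ')')
            i += 1
        else:
            # Stray bare word (not expected in CoreNLP output)
            lines.append('\t' * depth + tokens[i])
            i += 1
    return '\n'.join(lines)


print(indent_sexp('(NP (NP (DT the) (NN area)) (PP (IN of) (NP (NN interaction))))'))
```

Leaves of the form `(POS word)` are kept on one line, which matches the layout used above.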

Search for NP in the main loop

This is where the XML is read and each sentence is searched for NPs. The S-expression is split into a list with a regular expression, and blank and empty elements, which would get in the way, are dropped. Whenever an NP appears, the output function `output_np` is called. NPs are output in the order they are found from the top, but for nested NPs this is inefficient because the same chunks pass through the same logic several times. I wanted to keep the code simple, so I left it inefficient.

`python`


for parse in \
 ET.parse('./nlp.txt.xml').iterfind('./document/sentences/sentence/parse'):

    print(parse.text)

    # Split at opening and closing parentheses into a list (blank and empty elements are excluded)
    chunks = [chunk.strip() for chunk in reg_split.split(parse.text)
                if chunk.strip() != '']

    # Start output whenever an NP label is reached
    for i, chunk in enumerate(chunks):
        if chunk == 'NP':
            output_np(chunks[i+1:])
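
To see what the splitting step produces, here is a small standalone example on a short S-expression. The regular expression is a compact equivalent of `reg_split`; the sample string is made up for the illustration.

```python
import re

# Same pattern as reg_split, written without re.VERBOSE
reg_split = re.compile(r'(\(|\))')

sexp = '(NP (DT the) (NN area))'

# Because of the capturing group, the parentheses themselves stay in the list
raw = reg_split.split(sexp)

# Strip whitespace and drop the blank/empty pieces left between parentheses
chunks = [c.strip() for c in raw if c.strip() != '']

print(chunks)
# ['(', 'NP', '(', 'DT the', ')', '(', 'NN area', ')', ')']
```

A phrase label such as `NP` ends up as a chunk of its own, while a word stays together with its part-of-speech tag (`'DT the'`). This is why the main loop can test `chunk == 'NP'` directly.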

NP output section

The depth within the S-expression is tracked by counting opening and closing parentheses, and the collected words are output when the NP closes.

`python`


def output_np(chunks):
    # chunks starts right after an 'NP' label, so we are already one level deep
    depth = 1
    output = []

    for chunk in chunks:

        # An opening parenthesis increases the depth by 1
        if chunk == '(':
            depth += 1

        # A closing parenthesis decreases the depth by 1
        elif chunk == ')':
            depth -= 1
        else:

            # If the chunk is a "part-of-speech word" pair, split it and collect the word
            sets = chunk.split(' ')
            if len(sets) == 2:
                output.append(sets[1])

        # When the depth reaches 0 the NP is closed, so print it
        if depth == 0:
            print('\t', ' '.join(output))
            break
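
For comparison, the re-scanning of nested NPs could probably be avoided with a stack. The sketch below is one possible single-pass variant, not the method used in this article; `extract_nps` is a hypothetical name, and it assumes, as in the CoreNLP output shown here, that every opening parenthesis is followed by exactly one label chunk.

```python
import re


def extract_nps(sexp):
    """Return the text of every NP, in the order the NPs open, in one pass."""
    chunks = [c.strip() for c in re.split(r'(\(|\))', sexp) if c.strip()]
    stack = []   # one entry per open node: index into words for an NP, else None
    words = []   # words[i] collects the words of the i-th NP
    for chunk in chunks:
        if chunk == '(':
            continue                      # the node is pushed when its label arrives
        if chunk == ')':
            stack.pop()                   # the innermost open node is closed
            continue
        if chunk == 'NP':
            words.append([])
            stack.append(len(words) - 1)
        else:
            stack.append(None)
            parts = chunk.split(' ')
            if len(parts) == 2:
                # A word belongs to every NP that is currently open
                for idx in stack:
                    if idx is not None:
                        words[idx].append(parts[1])
    return [' '.join(w) for w in words]


nps = extract_nps('(ROOT (NP (NP (DT the) (NN area)) (PP (IN of) (NP (NN interaction)))))')
print(nps)  # ['the area of interaction', 'the area', 'interaction']
```

The list is returned rather than printed, so the caller decides how to display it. The order matches `output_np` because each NP gets its slot at the moment it opens.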

Output result (execution result)

Running the program prints the following result (excerpt from the beginning).

`Output result(Top excerpt)`


(ROOT (S (PP (NP (JJ Natural) (NN language) (NN processing)) (IN From) (NP (NNP Wikipedia))) (, ,) (NP (NP (DT the) (JJ free) (NN encyclopedia) (JJ Natural) (NN language) (NN processing)) (PRN (-LRB- -LRB-) (NP (NN NLP)) (-RRB- -RRB-))) (VP (VBZ is) (NP (NP (NP (DT a) (NN field)) (PP (IN of) (NP (NN computer) (NN science)))) (, ,) (NP (JJ artificial) (NN intelligence)) (, ,) (CC and) (NP (NP (NNS linguistics)) (VP (VBN concerned) (PP (IN with) (NP (NP (DT the) (NNS interactions)) (PP (IN between) (NP (NP (NNS computers)) (CC and) (NP (JJ human) (-LRB- -LRB-) (JJ natural) (-RRB- -RRB-) (NNS languages)))))))))) (. .))) 
	 Natural language processing
	 Wikipedia
	 the free encyclopedia Natural language processing -LRB- NLP -RRB-
	 the free encyclopedia Natural language processing
	 NLP
	 a field of computer science , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
	 a field of computer science
	 a field
	 computer science
	 artificial intelligence
	 linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages
	 linguistics
	 the interactions between computers and human -LRB- natural -RRB- languages
	 the interactions
	 computers and human -LRB- natural -RRB- languages
	 computers
	 human -LRB- natural -RRB- languages
(ROOT (S (PP (IN As) (NP (JJ such))) (, ,) (NP (NN NLP)) (VP (VBZ is) (ADJP (VBN related) (PP (TO to) (NP (NP (DT the) (NN area)) (PP (IN of) (NP (JJ humani-computer) (NN interaction))))))) (. .))) 
	 such
	 NLP
	 the area of humani-computer interaction
	 the area
	 humani-computer interaction
(ROOT (S (S (NP (NP (JJ Many) (NNS challenges)) (PP (IN in) (NP (NN NLP)))) (VP (VBP involve) (S (NP (NP (JJ natural) (NN language) (NN understanding)) (, ,) (SBAR (WHNP (WDT that)) (S (VP (VBZ is)))) (, ,)) (VP (VBG enabling) (NP (NNS computers)) (S (VP (TO to) (VP (VB derive) (NP (NN meaning)) (PP (IN from) (NP (ADJP (JJ human) (CC or) (JJ natural)) (NN language) (NN input)))))))))) (, ,) (CC and) (S (NP (NNS others)) (VP (VBP involve) (NP (JJ natural) (NN language) (NN generation)))) (. .))) 
	 Many challenges in NLP
	 Many challenges
	 NLP
	 natural language understanding , that is ,
	 natural language understanding
	 computers
	 meaning
	 human or natural language input
	 others
	 natural language generation
