This is a record of problem 55, "Named entity recognition", from "Chapter 6: Processing English texts" (.tohoku.ac.jp/nlp100/#ch6) of Language Processing 100 Knocks 2015. This time we extract named entities, specifically person names, which are one type of named entity. The program is simple, with only three steps, because Stanford Core NLP makes this easy.

Reference link

Link	Remarks
055.Named entity recognition.ipynb	GitHub link to the answer program
100 amateur language processing knocks:55	Source from which many parts of the code were copied
Stanford Core NLP Official	Stanford Core NLP page to look at first

Environment

Type	Version	Contents
OS	Ubuntu18.04.01 LTS	Running as a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	Python 3.8.1 on pyenv. Packages are managed using venv
Stanford CoreNLP	3.9.2	I installed it a year ago and don't remember the details. It was still the latest version a year later, so I used it as is
openJDK	1.8.0_242	I used the JDK that was already installed for other purposes

Chapter 6: Processing English Text

Content of study

Get an overview of the basic techniques of natural language processing by processing English text with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure parsing, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

55. Named entity recognition

Extract all personal names in the input text.

Answer

Answer program: [055. Named entity extraction.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/055.%E5%9B%BA%E6%9C%89%E8%A1%A8%E7%8F%BE%E6%8A%BD%E5%87%BA.ipynb)

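import xml.etree.ElementTree as ET

# Select every <token> element whose <NER> child has the text "PERSON"
XPATH = './document/sentences/sentence/tokens/token[NER="PERSON"]'
# Print the <word> text of each matching token as a list
print([token.findtext('word') for token in ET.parse('./nlp.txt.xml').iterfind(XPATH)])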

Answer commentary

XML file path

In the "Named entity" path below, if the value of the <NER> tag is PERSON, the value of the <word> tag at the same level of the hierarchy is part of a person's name. The mechanism behind named entity recognition in Stanford Core NLP is described in Stanford Named Entity Recognizer (NER).

Output	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7
Word	root	document	sentences	sentence	tokens	token	word
Named entity	root	document	sentences	sentence	tokens	token	NER

The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).

`nlp.txt.xml (excerpt)`


<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--Omission--

        <tokens>
          <token id="4">
            <word>Alan</word>
            <lemma>Alan</lemma>
            <CharacterOffsetBegin>636</CharacterOffsetBegin>
            <CharacterOffsetEnd>640</CharacterOffsetEnd>
            <POS>NNP</POS>
            <NER>PERSON</NER>
            <Speaker>PER0</Speaker>

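All you have to do is pass the XPath expression to the `iterfind` function of the `xml` package. The `[NER="PERSON"]` predicate at the end of the path keeps only the `token` elements whose `NER` child has the text PERSON.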
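To make the effect of the predicate visible without the full XML file, here is a minimal sketch that runs the same path against a small hand-written sample. The `SAMPLE` string and the names `TOKEN_PATH`, `with_predicate` and `by_hand` are assumptions made for illustration; the sample only imitates the layout of the excerpt above and is not taken from the real nlp.txt.xml. It compares the predicate with an explicit check of the `NER` sibling, which should select the same tokens.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating the CoreNLP layout (not the real nlp.txt.xml).
SAMPLE = """<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1"><word>Alan</word><NER>PERSON</NER></token>
          <token id="2"><word>Turing</word><NER>PERSON</NER></token>
          <token id="3"><word>published</word><NER>O</NER></token>
          <token id="4"><word>1950</word><NER>DATE</NER></token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>"""

# Path to every token, without any filtering.
TOKEN_PATH = './document/sentences/sentence/tokens/token'

root = ET.fromstring(SAMPLE)

# Variant 1: let the XPath predicate do the filtering, as in the answer program.
with_predicate = [t.findtext('word')
                  for t in root.iterfind(TOKEN_PATH + '[NER="PERSON"]')]

# Variant 2: visit every token and compare its <NER> sibling by hand.
by_hand = [t.findtext('word')
           for t in root.iterfind(TOKEN_PATH)
           if t.findtext('NER') == 'PERSON']

print(with_predicate)             # ['Alan', 'Turing']
print(with_predicate == by_hand)  # True
```

The predicate form is shorter, while the explicit loop is easier to extend, for example when more than one tag value should be accepted.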

Output result (execution result)

When the program is executed, the following results will be output.

`Output result`


['Alan', 'Turing', 'Joseph', 'Weizenbaum', 'MARGIE', 'Schank', 'Wilensky', 'Meehan', 'Lehnert', 'Carbonell', 'Lehnert', 'Racter', 'Jabberwacky', 'Moore']
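Each token is listed separately, so a full name such as "Alan Turing" appears as two entries. If whole names are wanted, one possible approach, which is not part of the original answer, is to join consecutive PERSON tokens within each sentence. The sketch below assumes that adjacent PERSON tokens belong to the same name, which may not always hold. It uses a hand-written `SAMPLE` rather than the real nlp.txt.xml, and `person_names` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hand-written sample imitating the CoreNLP layout (not the real nlp.txt.xml).
SAMPLE = """<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1"><word>Alan</word><NER>PERSON</NER></token>
          <token id="2"><word>Turing</word><NER>PERSON</NER></token>
          <token id="3"><word>published</word><NER>O</NER></token>
        </tokens>
      </sentence>
      <sentence id="2">
        <tokens>
          <token id="1"><word>Schank</word><NER>PERSON</NER></token>
          <token id="2"><word>and</word><NER>O</NER></token>
          <token id="3"><word>Wilensky</word><NER>PERSON</NER></token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>"""


def person_names(xml_text):
    """Join runs of adjacent PERSON tokens into one name per run."""
    root = ET.fromstring(xml_text)
    names = []
    # Work sentence by sentence so that a name never spans two sentences.
    for sentence in root.iterfind('./document/sentences/sentence'):
        tokens = sentence.iterfind('./tokens/token')
        # groupby yields runs of consecutive tokens sharing the same key.
        for is_person, run in groupby(
                tokens, key=lambda t: t.findtext('NER') == 'PERSON'):
            if is_person:
                names.append(' '.join(t.findtext('word') for t in run))
    return names


print(person_names(SAMPLE))  # ['Alan Turing', 'Schank', 'Wilensky']
```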
