This is a record of knock 55, "Named entity recognition", from "Chapter 6: Processing English texts" (.tohoku.ac.jp/nlp100/#ch6) of Language Processing 100 Knocks 2015. This time the task is named entity extraction: specifically, extracting personal names, which are one type of named entity. The program is simple, just three steps, because Stanford Core NLP makes this easy.
Link | Remarks |
---|---|
055.Named entity recognition.ipynb | Answer program GitHub link |
100 amateur language processing knocks:55 | Source from which many parts of the code were copied |
Stanford Core NLP Official | Stanford Core NLP page to look at first |
Type | Version | Notes |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | I use Python 3.8.1 on pyenv; packages are managed with venv |
Stanford CoreNLP | 3.9.2 | I installed it a year ago and don't remember the details. It was still the latest version a year later, so I used it as is |
openJDK | 1.8.0_242 | I used the JDK that was already installed for other purposes |
This chapter gives an overview of the basic techniques of natural language processing through English text processing with Stanford Core NLP.
Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure analysis, S-expressions
For the English text (nlp.txt), execute the following processing.
Extract all personal names in the input text.
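import xml.etree.ElementTree as ET
# Select every <token> element whose <NER> child has the text PERSON
XPATH = './document/sentences/sentence/tokens/token[NER="PERSON"]'
# Print the <word> text of each matching token
print([token.findtext('word') for token in ET.parse('./nlp.txt.xml').iterfind(XPATH)])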
If the value of the `<NER>` tag (the "Named entity" row in the table below) is `PERSON`, the value of the `<word>` tag at the same level of the hierarchy is part of a person's name. The mechanism of named entity recognition in Stanford Core NLP is described in Stanford Named Entity Recognizer (NER).
Output | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 | Level 7 |
---|---|---|---|---|---|---|---|
Word | root | document | sentences | sentence | tokens | token | word |
Named entity | root | document | sentences | sentence | tokens | token | NER |
The XML file is located on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).
nlp.txt.xml (excerpt)
<root>
<document>
<docId>nlp.txt</docId>
<sentences>
<sentence id="1">
--Omission--
<tokens>
<token id="4">
<word>Alan</word>
<lemma>Alan</lemma>
<CharacterOffsetBegin>636</CharacterOffsetBegin>
<CharacterOffsetEnd>640</CharacterOffsetEnd>
<POS>NNP</POS>
<NER>PERSON</NER>
<Speaker>PER0</Speaker>
All you have to do is pass `XPATH` to the `iterfind` function in the `xml` package.
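To make the effect of the `[NER="PERSON"]` predicate concrete, here is a minimal sketch that runs the same `XPATH` against a tiny in-memory document and compares it with a hand-written filter. The `SAMPLE` string is a hypothetical miniature modeled on the excerpt above, not the real `nlp.txt.xml`, and the names `by_predicate` and `by_hand` exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the CoreNLP output, modeled on the excerpt above.
SAMPLE = """
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1"><word>Alan</word><NER>PERSON</NER></token>
          <token id="2"><word>Turing</word><NER>PERSON</NER></token>
          <token id="3"><word>published</word><NER>O</NER></token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>
"""

XPATH = './document/sentences/sentence/tokens/token[NER="PERSON"]'

root = ET.fromstring(SAMPLE.strip())

# The predicate keeps only <token> elements whose <NER> child text is PERSON.
by_predicate = [token.findtext('word') for token in root.iterfind(XPATH)]

# The same selection written as an explicit loop condition, for comparison.
by_hand = [
    token.findtext('word')
    for token in root.iter('token')
    if token.findtext('NER') == 'PERSON'
]

print(by_predicate)
print(by_hand)
```

Both lists should contain the same words, which suggests the predicate is simply a compact way of writing the `if` condition inside the path itself.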
When the program is executed, the following results will be output.
Output result
['Alan', 'Turing', 'Joseph', 'Weizenbaum', 'MARGIE', 'Schank', 'Wilensky', 'Meehan', 'Lehnert', 'Carbonell', 'Lehnert', 'Racter', 'Jabberwacky', 'Moore']
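Note that the result lists each token separately, so a first name and a last name appear as two entries. If full names are wanted, one possible approach (my own assumption, not part of the original answer) is to join consecutive `PERSON` tokens within each sentence. In the sketch below, the helper `person_names` and the `SAMPLE` document are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical two-sentence sample in the same layout as nlp.txt.xml.
SAMPLE = """
<root>
  <document>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1"><word>Alan</word><NER>PERSON</NER></token>
          <token id="2"><word>Turing</word><NER>PERSON</NER></token>
          <token id="3"><word>published</word><NER>O</NER></token>
        </tokens>
      </sentence>
      <sentence id="2">
        <tokens>
          <token id="1"><word>Joseph</word><NER>PERSON</NER></token>
          <token id="2"><word>Weizenbaum</word><NER>PERSON</NER></token>
          <token id="3"><word>wrote</word><NER>O</NER></token>
        </tokens>
      </sentence>
    </sentences>
  </document>
</root>
"""


def is_person(token):
    """Return True when the token's <NER> child is PERSON."""
    return token.findtext('NER') == 'PERSON'


def person_names(root):
    """Join runs of adjacent PERSON tokens within each sentence."""
    names = []
    for tokens in root.iterfind('./document/sentences/sentence/tokens'):
        # groupby splits the token sequence into runs with the same flag,
        # so a name never spans a sentence boundary.
        for flag, run in groupby(tokens.iterfind('token'), key=is_person):
            if flag:
                names.append(' '.join(t.findtext('word') for t in run))
    return names


names = person_names(ET.fromstring(SAMPLE.strip()))
print(names)
```

For the real file, the same helper could be called as `person_names(ET.parse('./nlp.txt.xml').getroot())`. Single-token entries would pass through unchanged, since a run of length one is joined to itself.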