[PYTHON] 100 language processing knock-55: named entity extraction

Language processing 100 knocks 2015 "Chapter 6: Processing English texts" .tohoku.ac.jp/nlp100/#ch6) 55th "Named entity recognition" record. This time, the named entity is extracted. Extract a person's name, which is one type of named entity. The program is simple with 3 steps. Stanford Core NLP can do this easily.

Reference link

Link Remarks
055.Named entity recognition.ipynb Answer program GitHub link
100 amateur language processing knocks:55 Copy and paste source of many source parts
Stanford Core NLP Official Stanford Core NLP page to look at first

environment

type version Contents
OS Ubuntu18.04.01 LTS It is running virtually
pyenv 1.2.16 I use pyenv because I sometimes use multiple Python environments
Python 3.8.1 python3 on pyenv.8.I'm using 1
Packages are managed using venv
Stanford CoreNLP 3.9.2 I installed it a year ago and I don't remember in detail ...
It was the latest even after a year, so I used it as it was
openJDK 1.8.0_242 I used the JDK that was installed for other purposes as it is

Chapter 6: Processing English Text

content of study

An overview of various basic technologies of natural language processing through English text processing using Stanford Core NLP.

Stanford Core NLP, Stemming, Part-of-speech tagging, Named entity recognition, Co-reference analysis, Parsing analysis, Phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

55. Named entity recognition

Extract all personal names in the input text.

Answer

Answer program [055. Named entity extraction.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3 % 82% AD% E3% 82% B9% E3% 83% 88% E3% 81% AE% E5% 87% A6% E7% 90% 86 / 055.% E5% 9B% BA% E6% 9C% 89% E8% A1% A8% E7% 8F% BE% E6% 8A% BD% E5% 87% BA.ipynb)

import xml.etree.ElementTree as ET

XPATH = './document/sentences/sentence/tokens/token[NER="PERSON"]'
print([ token.findtext('word') for token in ET.parse('./nlp.txt.xml').iterfind(XPATH)])

Answer commentary

XML file path

If the value of the <NER> tag is PERSON in the" named entity "below, the value of the<word>tag in the same hierarchy indicates the person's name. The mechanism of "named entity recognition" in Stanford Core NLP is described in Stanford Named Entity Recognizer (NER).

output 1st level Second level Third level 4th level 5th level 6th level 7th level
word root document sentences sentence tokens token word
Named entity root document sentences sentence tokens token NER

The XML file is [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD% It is located at E3% 82% B9% E3% 83% 88% E3% 81% AE% E5% 87% A6% E7% 90% 86 / nlp.txt.xml).

xml:nlp.txt.xml(Excerpt)


<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">

--Omission--

        <tokens>
          <token id="4">
            <word>Alan</word>
            <lemma>Alan</lemma>
            <CharacterOffsetBegin>636</CharacterOffsetBegin>
            <CharacterOffsetEnd>640</CharacterOffsetEnd>
            <POS>NNP</POS>
            <NER>PERSON</NER>
            <Speaker>PER0</Speaker>

All you have to do is pass XPATH to the ʻiterfind function in the xml` package.

Output result (execution result)

When the program is executed, the following results will be output.

Output result


['Alan', 'Turing', 'Joseph', 'Weizenbaum', 'MARGIE', 'Schank', 'Wilensky', 'Meehan', 'Lehnert', 'Carbonell', 'Lehnert', 'Racter', 'Jabberwacky', 'Moore']

Recommended Posts

100 language processing knock-55: named entity extraction
100 Language Processing Knock-25: Template Extraction
100 Language Processing Knock (2020): 28
100 Language Processing Knock-82 (Context Word): Context Extraction
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock 2020 [00 ~ 69 answer]
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 Language Processing Knock-52: Stemming
100 Language Processing Knock Chapter 1
100 Amateur Language Processing Knock: 07
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 language processing knock 2020 [00 ~ 59 answer]
100 Amateur Language Processing Knock: 67
100 Language Processing Knock-45: Extraction of verb case patterns
100 language processing knock-72 (using Stanford NLP): feature extraction
100 Language Processing with Python Knock 2015
100 Language Processing Knock-51: Word Clipping
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-50: sentence break
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
100 Language Processing Knock-87: Word Similarity
I tried 100 language processing knock 2020
100 language processing knock-56: co-reference analysis
Solving 100 Language Processing Knock 2020 (01. "Patatokukashi")
100 Amateur Language Processing Knock: Summary
100 Language Processing Knock-49: Extraction of Dependency Paths Between Nouns
100 Language Processing Knock 2020 Chapter 2: UNIX Commands
100 Language Processing Knock 2015 Chapter 5 Dependency Analysis (40-49)
100 Language Processing Knock with Python (Chapter 1)
100 Language Processing Knock Chapter 1 in Python
100 Language Processing Knock 2020 Chapter 4: Morphological Analysis
100 language processing knock-76 (using scikit-learn): labeling
I tried 100 language processing knock 2020: Chapter 3
100 Language Processing Knock with Python (Chapter 3)
100 Language Processing Knock: Chapter 1 Preparatory Movement
100 Language Processing Knock 2020 Chapter 6: Machine Learning
100 Language Processing Knock Chapter 4: Morphological Analysis
Language processing 100 knock-86: Word vector display
100 Language Processing Knock 2020 Chapter 10: Machine Translation (90-98)
100 Language Processing Knock 2020 Chapter 5: Dependency Analysis
100 Language Processing Knock-28: MediaWiki Markup Removal
100 Language Processing Knock 2020 Chapter 7: Word Vector
100 Language Processing Knock 2020 Chapter 8: Neural Net
100 Language Processing Knock-59: Analysis of S-expressions
Python beginner tried 100 language processing knock 2015 (05 ~ 09)
100 Language Processing Knock-31 (using pandas): Verb
100 language processing knock 2020 "for Google Colaboratory"
I tried 100 language processing knock 2020: Chapter 1
100 Language Processing Knock 2020 Chapter 1: Preparatory Movement
100 language processing knock-73 (using scikit-learn): learning
100 Language Processing Knock Chapter 1 by Python