[PYTHON] 100 Language Processing Knock-53: Tokenization

This is the record of the 53rd task, "Tokenization", from "Chapter 6: Processing English texts" of the 2015 Language Processing 100 Knocks. Stanford Core NLP, the main subject of Chapter 6, finally makes its appearance. This time the installation is the main effort; the Stanford Core NLP execution and the Python part are not a big deal.

Reference link

| Link | Remarks |
|------|---------|
| [053_1.Tokenization.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/053_1.Tokenization.ipynb) | Answer program GitHub link (Stanford Core NLP execution part in Bash) |
| [053_2.Tokenization.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/053_2.Tokenization.ipynb) | Answer program GitHub link (Python) |
| 100 amateur language processing knocks: 53 | Copy-and-paste source for many parts |
| Stanford Core NLP | Official Stanford Core NLP page to look at first |

Environment

| Type | Version | Contents |
|------|---------|----------|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| Stanford CoreNLP | 3.9.2 | Installed a year ago, and I don't remember the details... it was still the latest a year later, so I used it as it was |
| openJDK | 1.8.0_242 | I used the JDK that was already installed for other purposes |

Chapter 6: Processing English Texts

Content of study

An overview of various basic natural language processing technologies through English text processing with Stanford Core NLP.

Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, co-reference analysis, parsing, phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

53. Tokenization

Use Stanford Core NLP to get the analysis result of the input text in XML format. Also, read this XML file and output the input text in the form of one word per line.

Problem supplement (About "Stanford Core NLP")

"Stanford Core NLP" is a library for natural language processing. There is a similar one called "Stanford NLP", which supports Japanese. "Stanford NLP" has been used since 70th knock. The difference is clearly described in Article "Introduction to Stanford NLP with Python". Looking at Stanford CoreNLP's Release History, it hasn't been updated much recently.

Answer

Answer program (Stanford Core NLP execution part in Bash): [053_1.Tokenization.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/053_1.Tokenization.ipynb)

I am using it according to the official page. If you don't use the -annotators option, you will get stuck later (around the 57th knock, if I remember correctly). I allocate 5 GB of memory with -Xmx5g; when it was smaller, an error occurred. When executed, the result is output to the same location as the input file nlp.txt, with the extension xml appended.

```bash
java -cp "/usr/local/lib/stanford-corenlp-full-2018-10-05/*" \
 -Xmx5g \
 edu.stanford.nlp.pipeline.StanfordCoreNLP \
 -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
 -file nlp.txt
```
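
As a side note, the same command can also be launched from Python with the standard subprocess module. This is just a minimal sketch of that idea (not part of the original notebook); the paths and options simply mirror the Bash command above.

```python
import subprocess

# Run Stanford CoreNLP exactly as in the Bash command above.
# The JVM itself expands the "*" classpath wildcard, so no shell is needed.
subprocess.run([
    'java', '-cp', '/usr/local/lib/stanford-corenlp-full-2018-10-05/*',
    '-Xmx5g',
    'edu.stanford.nlp.pipeline.StanfordCoreNLP',
    '-annotators', 'tokenize,ssplit,pos,lemma,ner,parse,dcoref',
    '-file', 'nlp.txt',
], check=True)  # raises CalledProcessError if CoreNLP exits with an error
```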

By the way, if you put the output XML file in the same directory as the CoreNLP-to-HTML.xsl that comes with /usr/local/lib/stanford-corenlp-full-2018-10-05 and open it in a browser, you can see a result like the following (it displayed in IE and Edge, but not in Firefox or Chrome).

[Screenshot: the analysis result rendered in the browser via CoreNLP-to-HTML.xsl]

Answer program (Python part): [053_2.Tokenization.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/053_2.Tokenization.ipynb)

```python
import xml.etree.ElementTree as ET

# Extract only the <word> elements
for i, word in enumerate(ET.parse('./nlp.txt.xml').iter('word')):
    print(i, '\t', word.text)

    # Stop early because there are many words
    if i > 30:
        break
```

Answer commentary

Parsing the XML

I am using the Python standard package xml.etree.ElementTree as the XML parser. It is easy to use: just read the nlp.txt.xml output by Stanford CoreNLP with the parse function and iterate over the word tags.

```python
for i, word in enumerate(ET.parse('./nlp.txt.xml').iter('word')):
    print(i, '\t', word.text)
```
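
Incidentally, if the XML file were much larger, the same output could be produced without building the whole tree in memory, for example with iterparse. A sketch of that alternative (not used in the answer program):

```python
import xml.etree.ElementTree as ET

# iterparse streams (event, element) pairs as the file is read,
# so the whole document never has to sit in memory at once.
i = 0
for _, elem in ET.iterparse('./nlp.txt.xml'):
    if elem.tag == 'word':
        print(i, '\t', elem.text)
        i += 1
```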

The contents of the XML look like this (excerpt from the beginning). The whole XML file is on [GitHub](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/nlp.txt.xml).

xml: nlp.txt.xml (excerpt from the beginning)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
  <document>
    <docId>nlp.txt</docId>
    <sentences>
      <sentence id="1">
        <tokens>
          <token id="1">
            <word>Natural</word>
            <lemma>natural</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>7</CharacterOffsetEnd>
            <POS>JJ</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
          <token id="2">
            <word>language</word>
            <lemma>language</lemma>
            <CharacterOffsetBegin>8</CharacterOffsetBegin>
            <CharacterOffsetEnd>16</CharacterOffsetEnd>
            <POS>NN</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>
```
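
As the excerpt shows, each <token> also carries <lemma>, <POS> and other elements alongside <word>, so the same parse can pull those out too. A small sketch of that extension (my addition, not part of the answer program):

```python
import xml.etree.ElementTree as ET

# Hypothetical extension: print word, lemma and POS for each <token>.
for token in ET.parse('./nlp.txt.xml').iter('token'):
    word = token.find('word').text
    lemma = token.find('lemma').text
    pos = token.find('POS').text
    print(f'{word}\t{lemma}\t{pos}')
```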

Output result (execution result)

When the program is executed, the following results will be output.

Output result

```
0 	 Natural
1 	 language
2 	 processing
3 	 From
4 	 Wikipedia
5 	 ,
6 	 the
7 	 free
8 	 encyclopedia
9 	 Natural
10 	 language
11 	 processing
12 	 -LRB-
13 	 NLP
14 	 -RRB-
15 	 is
16 	 a
17 	 field
18 	 of
19 	 computer
20 	 science
21 	 ,
22 	 artificial
23 	 intelligence
24 	 ,
25 	 and
26 	 linguistics
27 	 concerned
28 	 with
29 	 the
30 	 interactions
31 	 between
```

By the way, -LRB- and -RRB- are parentheses, converted by Stanford Core NLP:

- -LRB-: left round bracket "("
- -RRB-: right round bracket ")"
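
If you want the original characters back, a small mapping is enough. A sketch (my addition, not in the answer program):

```python
# Map CoreNLP's bracket tokens back to the original characters.
BRACKETS = {'-LRB-': '(', '-RRB-': ')'}

def restore_brackets(token):
    """Return '(' / ')' for -LRB- / -RRB-; other tokens are unchanged."""
    return BRACKETS.get(token, token)

print(restore_brackets('-LRB-'))  # (
print(restore_brackets('NLP'))    # NLP
```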
