Use Stanford CoreNLP from Python

Introduction

Stanford CoreNLP is a comprehensive library for natural language processing of English text. In this article, I will show how to use CoreNLP from Python.

Download and unzip Stanford CoreNLP

Download

Download Version 3.2.0 (released 2013-06-20) from the link below rather than the latest version; the reason for using this older version is explained in the installation section. http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip

$ curl -L -O http://nlp.stanford.edu/software/stanford-corenlp-full-2013-06-20.zip

Unzip

In my case, I put it in /usr/local/lib.

$ unzip ./stanford-corenlp-full-2013-06-20.zip -d /usr/local/lib/

Install corenlp-python

We will use corenlp-python, which was developed by Torotoki based on dasmith's corenlp-python and is registered on PyPI. However, the version of corenlp-python on PyPI only supports CoreNLP Version 3.2.0 (as of this writing), which is why we downloaded that version above.

Installation

$ pip install corenlp-python

Basic usage

Generate a parser by specifying the path of the directory where CoreNLP was unzipped, then parse some text; the result is returned as JSON.

corenlp_example.py


import pprint
import json
import corenlp

# Generate the parser
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
parser = corenlp.StanfordCoreNLP(corenlp_path=corenlp_dir)

# Parse the text and pretty-print the result
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)

Execution result:

{u'coref': [[[[u'I', 0, 0, 0, 1], [u'Alice', 0, 2, 2, 3]]]],
 u'sentences': [{u'dependencies': [[u'nsubj', u'Alice', u'I'],
                                   [u'cop', u'Alice', u'am'],
                                   [u'root', u'ROOT', u'Alice']],
                 u'parsetree': u'(ROOT (S (NP (PRP I)) (VP (VBP am) (NP (NNP Alice))) (. .)))',
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1',
                              u'Lemma': u'I',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'PRP'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'be',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'VBP'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'Alice',
                              u'NamedEntityTag': u'PERSON',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11',
                              u'Lemma': u'.',
                              u'NamedEntityTag': u'O',
                              u'PartOfSpeech': u'.'}]]}]}
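
Since the result is an ordinary structure of Python lists and dicts, individual annotations are easy to pull out. Here is a minimal sketch based on the output above, printing each token's lemma, part-of-speech tag, and named entity tag, then collecting the named entities:


# Continuing from corenlp_example.py: walk the structure shown above
for sentence in result_json["sentences"]:
    for word, attrs in sentence["words"]:
        print("%s\t%s\t%s\t%s" % (word, attrs["Lemma"],
                                  attrs["PartOfSpeech"],
                                  attrs["NamedEntityTag"]))

# Tokens whose NamedEntityTag is not "O" are named entities
entities = [word
            for sentence in result_json["sentences"]
            for word, attrs in sentence["words"]
            if attrs["NamedEntityTag"] != "O"]
print(entities)  # => [u'Alice']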

Narrow down the annotators

By default, the parser runs everything from tokenization and morphological analysis to parsing and named entity recognition. If you only need some of these functions, specify them in a properties file. Narrowing down the annotators makes processing faster (ner in particular is heavy).

For example, if you only want word and sentence splitting, create a user.properties file like the following.

user.properties


annotators = tokenize, ssplit

Pass the path of this file to the properties parameter when creating the parser.

corenlp_example2.py


import pprint
import json
import corenlp

# Generate the parser
corenlp_dir = "/usr/local/lib/stanford-corenlp-full-2013-06-20/"
properties_file = "./user.properties"
parser = corenlp.StanfordCoreNLP(
    corenlp_path=corenlp_dir,
    properties=properties_file)  # set the properties file

# Parse the text and pretty-print the result
result_json = json.loads(parser.parse("I am Alice."))
pprint.pprint(result_json)

Execution result:

{u'sentences': [{u'dependencies': [],
                 u'parsetree': [],
                 u'text': u'I am Alice.',
                 u'words': [[u'I',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'1'}],
                            [u'am',
                             {u'CharacterOffsetBegin': u'2',
                              u'CharacterOffsetEnd': u'4'}],
                            [u'Alice',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'10',
                              u'CharacterOffsetEnd': u'11'}]]}]}
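
Even with only tokenize and ssplit enabled, sentence and token boundaries are fully available. A small sketch, based on the reduced output above, that flattens the result into one token list per sentence:


for sentence in result_json["sentences"]:
    tokens = [word for word, attrs in sentence["words"]]
    print(tokens)  # => [u'I', u'am', u'Alice', u'.']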

Annotator list

The examples above used only tokenize and ssplit, but various other annotators are available, so here is a brief summary.

annotator   function                                           dependent annotators
tokenize    word splitting                                     (none)
cleanxml    XML tag removal                                    tokenize
ssplit      sentence splitting                                 tokenize
pos         part-of-speech tagging                             tokenize, ssplit
lemma       lemmatization                                      tokenize, ssplit, pos
ner         named entity recognition                           tokenize, ssplit, pos, lemma
regexner    named entity recognition via regular expressions   tokenize, ssplit
sentiment   sentiment analysis                                 (unknown)
truecase    case normalization                                 tokenize, ssplit, pos, lemma
parse       syntactic parsing                                  tokenize, ssplit
dcoref      coreference resolution                             tokenize, ssplit, pos, lemma, ner, parse
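
Note that an annotator's dependencies must themselves be listed in the properties file. For example, a pipeline that goes as far as named entity recognition needs the whole chain, as in the sketch below (options beyond annotators may depend on your CoreNLP version):

user.properties


annotators = tokenize, ssplit, pos, lemma, ner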
