[PYTHON] Train Stanford NER Tagger with your own data

NER stands for Named Entity Recognition, a natural language processing task that finds names in text (people, organizations, locations, and so on) and labels them by type. Stanford NER Tagger is a tool for this task. In this post, I train it on my own data.

Advance preparation

First, download some example training data (the eng.train file). https://github.com/synalp/NER/blob/master/corpus/CoNLL-2003/eng.train

Then download Stanford NER Tagger. https://nlp.stanford.edu/software/CRF-NER.shtml#Download

Then install a JDK, since Stanford NER Tagger runs on Java.

apt install default-jdk

Preparation of training data

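Convert the downloaded eng.train into a two-column, tab-separated file: the word in the first column and its label in the second, with the I- and B- prefixes removed from the labels. Blank lines separate sentences.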

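out = []
with open("eng.train", "r") as f:
    for line in f:
        fields = line.split()
        if len(fields) > 2:
            # Keep the word (first column) and the NER tag (last column),
            # dropping the I-/B- prefix from the tag.
            tag = fields[-1].replace("I-", "").replace("B-", "")
            out.append(fields[0] + "\t" + tag + "\n")
        else:
            # Blank lines separate sentences.
            out.append("\n")

# Open in write mode; the default mode is read-only.
with open("train.tsv", "w") as f:
    f.write("".join(out))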
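
Before training, it can help to check which labels ended up in the converted file. The following is a minimal sketch, assuming the two-column, tab-separated layout produced above. The helper name `count_labels` is hypothetical, and the sample lines are made up for illustration rather than copied from eng.train.

```python
from collections import Counter

def count_labels(lines):
    """Count the label column of word<TAB>label lines.

    Blank lines (sentence separators) are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, label = line.split("\t")
        counts[label] += 1
    return counts

# Illustrative sample in the same format as train.tsv
sample = [
    "EU\tORG\n", "rejects\tO\n", "German\tMISC\n", "call\tO\n",
    "\n",
    "Peter\tPER\n", "Blackburn\tPER\n",
]
label_counts = count_labels(sample)
print(label_counts)
```

To run it on the real file, pass the open file object instead of the sample: `with open("train.tsv") as f: print(count_labels(f))`.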

Training

Training preparation

  1. Unzip the Stanford NER Tagger archive and move into the unzipped directory.
  2. Put train.tsv in that directory.

Creating a property file

train.prop


trainFile = train.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

Save this as a file named train.prop.
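
In this file, `map = word=0,answer=1` says that column 0 of train.tsv holds the word and column 1 holds the label. As a rough check that the file says what you intended, the properties can be read back in Python. This is only a sketch: `parse_props` is a hypothetical helper that handles plain `key = value` lines and is not a full parser for the Java properties format.

```python
def parse_props(text):
    """Parse simple key=value lines, skipping blank lines and # comments.

    Simplified on purpose: escapes and line continuations of the full
    Java .properties format are not handled.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so "map = word=0,answer=1" keeps its value intact
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

text = """
trainFile = train.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useWord=true
"""
props = parse_props(text)

# Turn "word=0,answer=1" into {"word": 0, "answer": 1}
columns = {name: int(idx)
           for name, idx in (item.split("=") for item in props["map"].split(","))}
print(props["trainFile"], props["serializeTo"], columns)
```

For the real file, replace `text` with the contents of train.prop, for example `open("train.prop").read()`.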

Execution of training

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop train.prop

When training finishes, the model is written to ner-model.ser.gz, the path given by serializeTo in train.prop.
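
If you prefer to start training from a Python script, the same command can be launched with the standard `subprocess` module. This is a sketch under the assumption that `java` is on the PATH and that the script runs from the unzipped directory. The function names `build_train_command` and `run_training` are hypothetical.

```python
import subprocess

def build_train_command(jar="stanford-ner.jar", prop="train.prop"):
    """Return the training command from above as an argument list."""
    return ["java", "-cp", jar,
            "edu.stanford.nlp.ie.crf.CRFClassifier",
            "-prop", prop]

def run_training(jar="stanford-ner.jar", prop="train.prop"):
    """Run the training command; needs Java and the jar in the working directory."""
    # check=True raises CalledProcessError if java exits with a non-zero status
    subprocess.run(build_train_command(jar, prop), check=True)

# Only print the command here; call run_training() to actually train.
command = build_train_command()
print(" ".join(command))
```

Passing the command as a list avoids shell quoting issues if the jar or property file path contains spaces.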

Using the model from Python

You can use the model from Python with NLTK's Stanford NER Tagger wrapper.


import nltk
from nltk.tag.stanford import StanfordNERTagger
sent = "Balack Obama kills people by AK47"
model = "./ner-model.ser.gz"
jar = "./stanford-ner.jar"
tagger = StanfordNERTagger(model, jar, encoding='utf-8')
print(tagger.tag(sent.split()))

[output]

[('Balack', 'PER'),
 ('Obama', 'PER'),
 ('kills', 'O'),
 ('people', 'O'),
 ('by', 'O'),
 ('AK47', 'O')]
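
The tagger returns one label per token, so a multi-word name such as the one above comes back as separate tuples. A minimal sketch for merging consecutive tokens that share a label is shown below. The helper `group_entities` is hypothetical and not part of NLTK. Because the B- and I- prefixes were removed during data preparation, two different entities of the same type that stand next to each other will be merged into one.

```python
from itertools import groupby

def group_entities(tagged, outside="O"):
    """Merge runs of consecutive tokens with the same label into one entity."""
    entities = []
    for label, group in groupby(tagged, key=lambda pair: pair[1]):
        if label == outside:
            continue  # skip tokens that are not part of any entity
        entities.append((" ".join(word for word, _ in group), label))
    return entities

# The tagger output shown above
tagged = [('Balack', 'PER'), ('Obama', 'PER'), ('kills', 'O'),
          ('people', 'O'), ('by', 'O'), ('AK47', 'O')]
entities = group_entities(tagged)
print(entities)  # → [('Balack Obama', 'PER')]
```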

References

[0] https://nlp.stanford.edu/software/crf-faq.html#a
[1] https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
