[PYTHON] Me, NER, and Flair

This article is the day 23 entry of the MYJLab Advent Calendar.

Introduction

Hello. I'm marutaku, an M1 student at MYJLab. I recently had the opportunity to use a technique called named entity recognition, so I would like to introduce the library I used for it.

About named entity recognition (NER)

Named entity recognition (NER) is a technique that extracts proper nouns, such as company names and place names, from text and labels each one with its type (for example, organization or location). Consider the following news excerpt:

Boeing, the US aircraft giant, announced on the 23rd that CEO Muilenburg had resigned, effective immediately.
He is believed to have taken responsibility for the problems with the state-of-the-art "737 MAX" aircraft, whose operation was suspended after two crashes.
Chairman Calhoun will take over as CEO on January 13 next year.

Source: https://news.yahoo.co.jp/pickup/6346205

In this excerpt, the person names are Muilenburg and Calhoun, and the organization name is Boeing. The task of named entity recognition is to extract proper nouns like these from a sentence.
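To make the input and output of the task concrete, here is a minimal sketch of how labeled entities are often represented as one tag per token. The BIO scheme (B- for the first token of an entity, I- for the following tokens, O for everything else) and the label names PER and ORG are common conventions that I am assuming here; the helper name spans_to_bio and the tokenized sentence are made up for illustration.

```python
from typing import List, Tuple

# An entity span: (first token index, end index exclusive, label)
EntitySpan = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[EntitySpan]) -> List[str]:
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens  # tokens outside any entity stay "O"
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


# Hypothetical tokenized sentence based on the news excerpt
tokens = ["Calhoun", "will", "lead", "Boeing", "."]
spans = [(0, 1, "PER"), (3, 4, "ORG")]
tags = spans_to_bio(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

An NER model does the reverse of what this helper does by hand: given only the tokens, it predicts the tag sequence.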

Datasets

Datasets for NER are organized in this repository. I think CoNLL 2003 and OntoNotes v5 are the most commonly used. Others include medical data and annotated Wikipedia data. Few Japanese datasets are publicly available, although some appear to be sold commercially.
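For readers who have not seen this kind of data, the sketch below reads text laid out in the CoNLL 2003 column style: one token per line, whitespace-separated columns (token, part of speech, chunk tag, NER tag), a blank line between sentences, and -DOCSTART- lines between documents. The sample lines are made up for illustration and are not taken from the real dataset, and read_conll is a hypothetical helper name.

```python
from typing import List, Tuple

# Made-up sample in CoNLL-2003-style layout: token, POS, chunk, NER tag
SAMPLE = """\
-DOCSTART- -X- -X- O

Boeing NNP B-NP B-ORG
named VBD B-VP O
Calhoun NNP B-NP B-PER
CEO NN I-NP O
. . O O

He PRP B-NP O
starts VBZ B-VP O
in IN B-PP O
January NNP B-NP O
. . O O
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Return a list of sentences, each a list of (token, ner_tag) pairs."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers both end the current sentence
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag
        current.append((columns[0], columns[-1]))
    if current:  # flush the last sentence if the text lacks a trailing blank line
        sentences.append(current)
    return sentences


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

Libraries such as Flair ship their own corpus loaders, so a parser like this is mainly useful for inspecting the data by hand.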

What makes named entity recognition difficult

Recent NER papers are very difficult to implement. As you can see on Papers with Code, they often stack a large number of models or combine the word representations of several pretrained models, which is a real hassle. When I see something like LSTM-CRF + ELMo + BERT + Flair, I no longer have any idea what is being described. That is when I came across Flair, a library that makes it easy to implement models that achieve state-of-the-art results on NER.

What is Flair?

Flair is a natural language processing library designed to make state-of-the-art models easy to implement. It ships with a wealth of trained models, so you can quickly incorporate past SoTA models into your system. As shown in the tutorial, trying Flair's trained NER model takes only the following code.

from flair.data import Sentence
from flair.models import SequenceTagger

# make a sentence
sentence = Sentence('I love Berlin .')

# load the NER tagger
tagger = SequenceTagger.load('ner')

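# run NER over sentence
tagger.predict(sentence)

# print the sentence with its predicted tags
print(sentence.to_tagged_string())

# iterate over the recognized entities
for entity in sentence.get_spans('ner'):
    print(entity)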

Besides the ner model used here, Flair provides other models, such as ner-fast, which runs quickly even on a CPU.
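As a rough sketch of how switching models might look in practice, the helper below takes the model name as a parameter and collects the recognized entities as plain tuples. The function names extract_entities and format_entities are hypothetical, and the tag and score attributes on each span are an assumption that may differ between Flair versions, so check them against the version you have installed.

```python
from typing import List, Tuple

Entity = Tuple[str, str, float]  # (entity text, predicted label, confidence)


def extract_entities(text: str, model_name: str = "ner-fast") -> List[Entity]:
    """Tag `text` with a pretrained Flair model and return its entities."""
    # Imported inside the function so the module loads without Flair installed
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load(model_name)  # downloads the model on first use
    sentence = Sentence(text)
    tagger.predict(sentence)

    # Assumption: spans expose `text`, `tag` and `score` in your Flair version
    return [(span.text, span.tag, span.score) for span in sentence.get_spans("ner")]


def format_entities(entities: List[Entity]) -> str:
    """Render entities as tab-separated lines: text, label, score."""
    return "\n".join(f"{text}\t{label}\t{score:.2f}" for text, label, score in entities)
```

With Flair installed, calling print(format_entities(extract_entities('I love Berlin .'))) would run the ner-fast model, and passing model_name='ner' would switch back to the larger model.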

Building your own model with Flair

Flair can not only run trained models but also lets you build your own. In addition to its trained models, Flair makes it easy to work with the embeddings of well-known models (https://github.com/flairNLP/flair/blob/master/resources/docs/STRUCT_4_ELMO_BERT_FLAIR_EMBEDDING.md). An example of building your own model is shown below.

from flair.data import Corpus
from flair.datasets import WNUT_17
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings
from typing import List

# 1. get the corpus
corpus: Corpus = WNUT_17().downsample(0.1)
print(corpus)

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

# 4. initialize embeddings
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('glove'),

    # uncomment to add character embeddings
    # (also import CharacterEmbeddings from flair.embeddings)
    # CharacterEmbeddings(),

    # uncomment to add Flair embeddings
    # (also import FlairEmbeddings from flair.embeddings)
    # FlairEmbeddings('news-forward'),
    # FlairEmbeddings('news-backward'),
]

embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

# 5. initialize sequence tagger
from flair.models import SequenceTagger

tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type,
                                        use_crf=True)

# 6. initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# 7. start training
trainer.train('resources/taggers/example-ner',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=150)

# 8. plot weight traces (optional)
from flair.visual.training_curves import Plotter
plotter = Plotter()
plotter.plot_weights('resources/taggers/example-ner/weights.txt')

The model above uses only GloVe word embeddings, but by adding more entries to the embedding list you can combine multiple word embeddings. I haven't tried it yet, but it seems that fine-tuning the embeddings is also possible.
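Conceptually, stacking embeddings means concatenating the vectors that each embedding produces for the same token, so the tagger sees one longer vector per token. The toy sketch below illustrates only that idea with plain Python lists. It is not Flair's implementation, and the lookup tables, their values, and the function names are made up.

```python
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical toy lookup tables standing in for two embedding models
WORD_TABLE: Dict[str, Vector] = {"Berlin": [0.1, 0.2, 0.3], "love": [0.4, 0.5, 0.6]}
CHAR_TABLE: Dict[str, Vector] = {"Berlin": [1.0, 0.0], "love": [0.0, 1.0]}

# Each entry pairs a table with its vector size (needed for unknown tokens)
EMBEDDINGS: List[Tuple[Dict[str, Vector], int]] = [(WORD_TABLE, 3), (CHAR_TABLE, 2)]


def embed(token: str, table: Dict[str, Vector], dim: int) -> Vector:
    """Look up a token, falling back to a zero vector for unknown tokens."""
    return table.get(token, [0.0] * dim)


def stacked_embedding(token: str) -> Vector:
    """Concatenate the token's vector from every embedding in the list."""
    stacked: Vector = []
    for table, dim in EMBEDDINGS:
        stacked.extend(embed(token, table, dim))
    return stacked


print(stacked_embedding("Berlin"))  # word vector followed by character vector
print(len(stacked_embedding("Tokyo")))  # size is the sum of both dimensions
```

Adding a third embedding to the list simply makes every token vector longer, which is why combining several embeddings in a stacked setup takes so little code.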

Conclusion

In this article, I talked about NER and Flair. Besides NER, Flair can be applied to tasks such as sentiment analysis (positive/negative classification) and text classification. If you want to use natural language processing but find it daunting, please give it a try.
