[PYTHON] [Tutorial] Make a named entity extractor in 30 minutes using machine learning

Introduction

**Named entity recognition** is a technology that extracts proper nouns, such as **personal names** and **place names**, and numerical expressions, such as **dates** and **times**, that appear in text. Named entity recognition is also used as a building block in natural language processing applications such as **question answering**, **dialogue systems**, and **information extraction**.

This time, I will build a named entity extractor using **machine learning**.

※Note: This article does not cover any theory. If you want to learn the theory, please look at other resources.

Target audience

  • Those who know a little about named entity extraction
  • Those who want to make a named entity extractor
  • Those who can read Python code

What is named entity recognition?

This section gives an overview of named entity recognition and the method used in this article.

Overview

Named entity extraction is a technology that extracts proper nouns, such as personal names and place names, and numerical expressions, such as dates and times, that appear in text. Let's look at a concrete example and extract the named entities from the following sentence.

Taro went to see Hanako at 9 am on May 18th.

From this sentence we can extract **Taro** and **Hanako** as **personal names**, **May 18th** as a **date**, and **9 am** as a **time**.
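
Represented as data, the goal is to turn the sentence into something like the following (a purely illustrative representation, not the format used later):

[('Taro', 'PSN'), ('Hanako', 'PSN'), ('May 18th', 'DAT'), ('9 am', 'TIM')]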

In the example above, personal name, date, and time were the extracted named entity classes. In general, the following eight classes, defined in the Information Retrieval and Extraction Exercise (IREX) named entity extraction task (http://nlp.cs.nyu.edu/irex/NE/), are often used.

Class  Meaning              Examples
ART    Proper product name  Nobel Prize in Literature; Windows 7
LOC    Place name           Chiba; USA
ORG    Organization         Liberal Democratic Party; NHK
PSN    Personal name        Shinzo Abe; Merkel
DAT    Date                 January 29; 2016/01/29
TIM    Time                 3 pm; 10:30
MNY    Monetary amount      241 yen; $8
PNT    Percentage           10%; 30%

Method

One way to extract named entities is to label sentences that have been morphologically analyzed. The following is an example of labeling the sentence "Taro went to see Hanako at 9 am on May 18th." after morphological analysis.

[Figure: a morphologically analyzed sentence labeled with IOB2 tags]

The labels B-XXX and I-XXX indicate that the labeled strings are named entities. B-XXX marks the beginning of a named entity, and I-XXX marks its continuation. The XXX part is a named entity class such as ORG or PSN. Tokens that are not part of a named entity are labeled O.
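
For example, labeling the earlier sample sentence token by token would look roughly like this (English tokens are used here for readability; the actual data uses Japanese morphemes):

Taro    B-PSN
went    O
to      O
see     O
Hanako  B-PSN
at      O
9       B-TIM
am      I-TIM
on      O
May     B-DAT
18th    I-DAT
.       O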

Labeling can be done with hand-written rules, but this time we will do it with **machine learning**. That is, we create a model from pre-labeled training data and use that model to label unlabeled sentences. Specifically, we will train with an algorithm called CRF (Conditional Random Fields).

Now, let's get hands-on.

Installation

Start by installing the required Python modules. Execute the following commands in a terminal. CRFsuite is used as the CRF library, via the python-crfsuite bindings.

pip install numpy
pip install scipy
pip install scikit-learn
pip install python-crfsuite

Once installed, import the required modules. Execute the following code.

from itertools import chain
import pycrfsuite
import sklearn
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelBinarizer

Data used to build named entity extractor

Since CRF is a supervised learning method, we need tagged training data. This time, I prepared the tagged data in advance. Please download it from here. The file name is "hironsan.txt".

Now, let's first define a class to read the downloaded data.

import codecs

class CorpusReader(object):

    def __init__(self, path):
        with codecs.open(path, encoding='utf-8') as f:
            sent = []
            sents = []
            for line in f:
                # A blank line marks a sentence boundary.
                if line == '\n':
                    sents.append(sent)
                    sent = []
                    continue
                morph_info = line.strip().split('\t')
                sent.append(morph_info)
            # Keep the last sentence if the file does not end with a blank line.
            if sent:
                sents.append(sent)
        # Hold out the last 10% of the sentences as test data.
        train_num = int(len(sents) * 0.9)
        self.__train_sents = sents[:train_num]
        self.__test_sents = sents[train_num:]

    def iob_sents(self, name):
        if name == 'train':
            return self.__train_sents
        elif name == 'test':
            return self.__test_sents
        else:
            return None

Next, load the downloaded data using the class we just defined. This gives 450 training sentences and 50 test sentences.

c = CorpusReader('hironsan.txt')
train_sents = c.iob_sents('train')
test_sents = c.iob_sents('test')

The loaded data has the following format. Each sentence was morphologically analyzed with the morphological analyzer MeCab and then annotated with IOB2 tags. The data is divided into sentences, and each sentence is a list of morpheme-information entries.

>>> train_sents[0]
[['2005', 'noun', 'number', '*', '*', '*', '*', '*', 'B-DAT'],
 ['Year', 'noun', 'suffix', 'Classifier', '*', '*', '*', 'Year', 'Nen', 'Nen', 'I-DAT'],
 ['7', 'noun', 'number', '*', '*', '*', '*', '*', 'I-DAT'],
 ['Month', 'noun', 'General', '*', '*', '*', '*', 'Month', 'Moon', 'Moon', 'I-DAT'],
 ['14', 'noun', 'number', '*', '*', '*', '*', '*', 'I-DAT'],
 ['Day', 'noun', 'suffix', 'Classifier', '*', '*', '*', 'Day', 'Nichi', 'Nichi', 'I-DAT'],
 ['、', 'symbol', 'Comma', '*', '*', '*', '*', '、', '、', '、', 'O'],
...
]
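
As a quick sanity check, the 90/10 split matches the sizes stated above:

>>> len(train_sents)
450
>>> len(test_sents)
50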

Next, I will explain the features used for named entity extraction.

Features to use

Here we give an overview of the features to be used and then implement their extraction.

Overview

Next, I will explain the features to be used. This time we use, for the current token and the two tokens on either side of it, the word itself, its part-of-speech subcategory, and its character type, plus the named entity tags of the two preceding tokens. An example of these features is shown below; the framed parts are the features used.

[Figure: the feature window around each token (ner.png)]

The classification of character types is as follows. There are 7 types in all.

Character type tag  Description
ZSPACE              Whitespace
ZDIGIT              Arabic numerals
ZLLET               Lowercase letters
ZULET               Uppercase letters
HIRAG               Hiragana
KATAK               Katakana
OTHER               Other

The character type used as a feature is the combination of all the character types contained in a word. For example, a word like 多い (Japanese for "many") contains a kanji and a hiragana character. The hiragana character type tag is HIRAG, and kanji falls under OTHER, so the character type of 多い is "HIRAG-OTHER".

Coding of feature extraction

Judgment of character type

The code for determining the character type is as follows. All the character types contained in the string are joined with a hyphen (-).

def is_hiragana(ch):
    return 0x3040 <= ord(ch) <= 0x309F

def is_katakana(ch):
    return 0x30A0 <= ord(ch) <= 0x30FF

def get_character_type(ch):
    if ch.isspace():
        return 'ZSPACE'
    elif ch.isdigit():
        return 'ZDIGIT'
    elif ch.islower():
        return 'ZLLET'
    elif ch.isupper():
        return 'ZULET'
    elif is_hiragana(ch):
        return 'HIRAG'
    elif is_katakana(ch):
        return 'KATAK'
    else:
        return 'OTHER'

def get_character_types(string):
    character_types = map(get_character_type, string)
    character_types_str = '-'.join(sorted(set(character_types)))

    return character_types_str
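
As a quick check of the character-type logic (the example inputs here are my own):

>>> get_character_types('2005')
'ZDIGIT'
>>> get_character_types('Word2')
'ZDIGIT-ZLLET-ZULET'
>>> get_character_types('クリス')
'KATAK'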

Extraction of part of speech subclassification

The code to extract the part of speech subclassification from the morpheme information is as follows.

def extract_pos_with_subtype(morph):
    # Join the POS fields (morph[1], morph[2], ...) up to the first '*' placeholder.
    idx = morph.index('*')

    return '-'.join(morph[1:idx])
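
For example, applying it to the first morpheme entry shown earlier joins the part-of-speech fields up to the first '*':

>>> extract_pos_with_subtype(['2005', 'noun', 'number', '*', '*', '*', '*', '*', 'B-DAT'])
'noun-number'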

Feature extraction from sentences

Based on the above, the code to extract the features for each word is as follows. It's a bit verbose, but easy to follow.

def word2features(sent, i):
    # Features of the current token: surface form, character type, and POS with subtype.
    word = sent[i][0]
    chtype = get_character_types(sent[i][0])
    postag = extract_pos_with_subtype(sent[i])
    features = [
        'bias',
        'word=' + word,
        'type=' + chtype,
        'postag=' + postag,
    ]
    # Preceding context: tokens at i-2 and i-1, including their IOB tags (read from the data).
    if i >= 2:
        word2 = sent[i-2][0]
        chtype2 = get_character_types(sent[i-2][0])
        postag2 = extract_pos_with_subtype(sent[i-2])
        iobtag2 = sent[i-2][-1]
        features.extend([
            '-2:word=' + word2,
            '-2:type=' + chtype2,
            '-2:postag=' + postag2,
            '-2:iobtag=' + iobtag2,
        ])
    else:
        features.append('BOS')

    if i >= 1:
        word1 = sent[i-1][0]
        chtype1 = get_character_types(sent[i-1][0])
        postag1 = extract_pos_with_subtype(sent[i-1])
        iobtag1 = sent[i-1][-1]
        features.extend([
            '-1:word=' + word1,
            '-1:type=' + chtype1,
            '-1:postag=' + postag1,
            '-1:iobtag=' + iobtag1,
        ])
    else:
        features.append('BOS')

    # Following context: tokens at i+1 and i+2 (surface form, character type, and POS only).
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        chtype1 = get_character_types(sent[i+1][0])
        postag1 = extract_pos_with_subtype(sent[i+1])
        features.extend([
            '+1:word=' + word1,
            '+1:type=' + chtype1,
            '+1:postag=' + postag1,
        ])
    else:
        features.append('EOS')

    if i < len(sent)-2:
        word2 = sent[i+2][0]
        chtype2 = get_character_types(sent[i+2][0])
        postag2 = extract_pos_with_subtype(sent[i+2])
        features.extend([
            '+2:word=' + word2,
            '+2:type=' + chtype2,
            '+2:postag=' + postag2,
        ])
    else:
        features.append('EOS')

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [morph[-1] for morph in sent]


def sent2tokens(sent):
    return [morph[0] for morph in sent]

Extract features from sentences with sent2features. The features that are actually extracted are as follows.

>>> sent2features(train_sents[0])[0]
['bias',
 'word=2005',
 'type=ZDIGIT',
 'postag=noun-number',
 'BOS',
 'BOS',
 '+1:word=Year',
 '+1:type=OTHER',
 '+1:postag=noun-suffix-Classifier',
 '+2:word=7',
 '+2:type=ZDIGIT',
 '+2:postag=noun-number']

We have confirmed that features can be extracted from the data. Now extract the features and labels for the training and test data for later use.

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

Model learning

To train the model, create a pycrfsuite.Trainer object, load the training data, and then call the train method. First, create a Trainer object and read the training data.

trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

Next, set the training parameters. Ideally these should be tuned on development data, but here we keep them fixed.

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

Now that we're ready, let's train the model. Specify the file name and execute the train method.

trainer.train('model.crfsuite')

When execution finishes, a file with the specified name is created. The trained model is stored in it.

Test data prediction

To use the trained model, create a pycrfsuite.Tagger object, load the trained model, and use the tag method. First, create a Tagger object and load the trained model.

tagger = pycrfsuite.Tagger()
tagger.open('model.crfsuite')

Now, let's actually tag the sentence.

example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)))

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

You should get the following result. Predicted is the tag sequence predicted by the model, and Correct is the correct tag sequence. For this sentence, the model's prediction matches the correct data exactly.

In October last year, 34 people were killed in an explosion in Taba, Egypt, near the site.
Predicted: B-DAT I-DAT I-DAT O O O O O O O O O O O O B-LOC O B-LOC O O O O O O O O O O
Correct:   B-DAT I-DAT I-DAT O O O O O O O O O O O O B-LOC O B-LOC O O O O O O O O O O

This completes the construction of the named entity extractor.
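
To turn a predicted tag sequence into actual entity strings, you can group consecutive B-XXX/I-XXX tags into spans. Here is a minimal sketch; the helper extract_entities is my own addition, not part of the article's pipeline.

def extract_entities(tokens, tags):
    # Group consecutive B-XXX / I-XXX tags into (entity, class) pairs.
    entities = []
    span, span_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            if span:
                entities.append((''.join(span), span_type))
            span, span_type = [token], tag[2:]
        elif tag.startswith('I-') and span_type == tag[2:]:
            span.append(token)
        else:
            if span:
                entities.append((''.join(span), span_type))
            span, span_type = [], None
    if span:
        entities.append((''.join(span), span_type))
    return entities

# Entities are joined without spaces because the tokens are Japanese morphemes.
print(extract_entities(sent2tokens(example_sent),
                       tagger.tag(sent2features(example_sent))))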

Model evaluation

We have created a model, but we don't yet know whether it is any good, so it is important to evaluate it. The evaluation uses precision, recall, and F1 score. The evaluation code is below.

def bio_classification_report(y_true, y_pred):
    # Flatten the per-sentence tag sequences and binarize the labels.
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))

    # Evaluate every tag except 'O', sorted by entity class.
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}

    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels=[class_indices[cls] for cls in tagset],
        target_names=tagset,
    )

Tag the sentences in the test set so they can be used for evaluation.

y_pred = [tagger.tag(xseq) for xseq in X_test]

Pass the tags predicted by the trained model, together with the correct tags, to the evaluation function to display the results. For each tag, the precision, recall, F1 score, and support (the number of tags) are shown.

>>> print(bio_classification_report(y_test, y_pred))
             precision    recall  f1-score   support

      B-ART       1.00      0.89      0.94         9
      I-ART       0.92      1.00      0.96        12
      B-DAT       1.00      1.00      1.00        12
      I-DAT       1.00      1.00      1.00        22
      B-LOC       1.00      0.95      0.97        55
      I-LOC       0.94      0.94      0.94        17
      B-ORG       0.75      0.86      0.80        14
      I-ORG       1.00      0.90      0.95        10
      B-PSN       0.00      0.00      0.00         3
      B-TIM       1.00      0.71      0.83         7
      I-TIM       1.00      0.81      0.90        16

avg / total       0.95      0.91      0.93       177

The results look a little too good; the data probably contains similar sentences. Note also that the -1:/-2: iobtag features are read straight from the data, so at test time they use the gold tags of the neighboring tokens, which can further inflate the scores.

※Caution: You may see an UndefinedMetricWarning. Precision and related metrics are undefined for labels that never appear in the predicted samples; this happens because the amount of data is small.

Conclusion

This time, we were able to build a named entity extractor quite easily using the Python library python-crfsuite. The tagging follows the eight named entity classes defined by IREX. However, the IREX definitions are often too coarse for practical use, so if you want to apply named entity recognition to a specific task, you will need to prepare data tagged according to that task.

You may also want to look for better features and model parameters.
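
As a starting point for parameter tuning, a simple grid search might look like the sketch below. The c1/c2 values are arbitrary guesses, and strictly speaking you should score on separate development data rather than the test set.

import itertools

# Try a few combinations of the L1/L2 regularization coefficients.
for c1, c2 in itertools.product([0.1, 1.0, 10.0], [1e-4, 1e-3, 1e-2]):
    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    trainer.set_params({
        'c1': c1,
        'c2': c2,
        'max_iterations': 50,
        'feature.possible_transitions': True,
    })
    model_file = 'model_c1-{}_c2-{}.crfsuite'.format(c1, c2)
    trainer.train(model_file)

    # Evaluate each model with the report function defined above.
    tagger = pycrfsuite.Tagger()
    tagger.open(model_file)
    y_pred = [tagger.tag(xseq) for xseq in X_test]
    print('c1={}, c2={}'.format(c1, c2))
    print(bio_classification_report(y_test, y_pred))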
