Perform entity analysis using spaCy / GiNZA in Python

What is this?

Let's use spaCy / GiNZA, a very convenient toolkit for natural language processing, to try "entity analysis" (named entity recognition), one of the real pleasures of text analysis. The GiNZA page is here: https://megagonlabs.github.io/ginza/

Entity analysis is a technique for finding chunks (entities) in a sentence. For example, given "Play FINAL FANTASY VII REMAKE on Sony PlayStation", it finds "PlayStation = game console" and "FINAL FANTASY VII REMAKE = game title".

Building a dictionary of game titles is very difficult, because the number of games keeps growing without limit. Instead, we find entities by inferring them from the surrounding context.

First try using GiNZA

First, let's use GiNZA. Simply put, GiNZA is a library for Japanese analysis that ships with a pre-trained model and everything else you need.

Anyway, it's easy enough to use.

First, install it with pip.

pip install -U ginza

Actually, I stumbled in various places and it took me a long time before pip install succeeded, but it may well work for you on the first try. If it doesn't, let me know in the comments.
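If you are not sure whether the install really went through, a quick check like the following may help. This is only a minimal sketch: it assumes the importable module names are spacy, ginza and ja_ginza, and the helper is_installed is a name I made up for illustration. Depending on the GiNZA release, the ja_ginza model may need to be installed as a separate package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Assumed module names; adjust them if your GiNZA version uses different ones.
for name in ("spacy", "ginza", "ja_ginza"):
    status = "found" if is_installed(name) else "NOT found"
    print(name, status)
```

Anything reported as "NOT found" is what spacy.load('ja_ginza') would trip over later.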

Sample code

Here is a first, simple example.

import spacy

# Load the pre-trained GiNZA model.
nlp = spacy.load('ja_ginza')

# Analyze a short text (shown here in English translation).
doc = nlp('"FINAL FANTASY VII Remake" is a game software released by Square Enix. It was pre-sold on PlayStation 4 and is an exclusive title until April 2021. Initially scheduled to be released worldwide on March 3, 2020, the release was postponed on April 10, 2020.')

# Per-token output: index, surface form, lemma, POS, detailed tag, dependency label, head index.
print("*** token ***")
for token in doc:
    print(token.i, token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.i)

# Named entities found in the text, with their labels.
print("*** entity ***")
for ent in doc.ents:
    print(ent.text, ent.label_)

The result looks like this. I will skip the detailed meaning here, but you can see that each word is analyzed through morphological analysis.

*** token ***
0 『 『 PUNCT auxiliary symbol-Open parentheses punct 4
1 Final Final NOUN Noun-Appellative-General compound 4
2 fantasy fantasy NOUN noun-Appellative-General compound 4
3 VII vii NOUN noun-Appellative-General compound 4
4 Remake Remake NOUN noun-Appellative-Changeable ROOT 4
5 』 』 PUNCT auxiliary symbol-Parentheses closed punct 4
6 is the ADP particle-Particle case 4
7 , , PUNCT auxiliary symbol-Comma punct 4
8 Square Enix Square Enix PROPN Noun-Proper noun-General compound 10
:(omitted)

*** entity ***
FINAL FANTASY VII Remake Book
Square Enix Person
PlayStation 4 Product_Other
April 2021 Date
March 3, 2020 Date
April 10, the same year Date

It feels pretty good.

The token output shows that morphological analysis is done cleanly, and Square Enix is recognized as a proper noun.

The entity I was after this time, "FINAL FANTASY VII Remake", is also recognized. The label Book is a little odd, but it comes from general-purpose dictionary data, so that can't be helped. Even a slightly tricky date expression such as "April 10 of the same year" is recognized as a Date. If you only want to extract common entities, this should be enough.
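Once you have the entities, you will often want them grouped by label. Here is a small sketch of that, using the (text, label) pairs copied from the output above so it runs without a model. The helper name group_by_label is my own; with a real doc you would build the list as [(ent.text, ent.label_) for ent in doc.ents].

```python
from collections import defaultdict

# (text, label) pairs copied from the "*** entity ***" output above.
entities = [
    ("FINAL FANTASY VII Remake", "Book"),
    ("Square Enix", "Person"),
    ("PlayStation 4", "Product_Other"),
    ("April 2021", "Date"),
    ("March 3, 2020", "Date"),
    ("April 10, the same year", "Date"),
]


def group_by_label(pairs):
    """Collect entity texts under their label, keeping the order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
for label, texts in by_label.items():
    print(label, texts)
```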

Create a custom dictionary

However, there are times when you want "FINAL FANTASY VII Remake" to be labeled as Game_Title instead.

When you do natural language processing in real work, each business domain has its own technical terms. In this case, I want a game title to be treated as a game title, so let's do something about it.

Since I want to train a model myself this time, I will train on spaCy's plain ja instead of GiNZA's ja_ginza. The code is almost the same as the spaCy sample code, and looks like this.

# Note: this script is written against the spaCy v2 training API
# (create_pipe, begin_training, nlp.update with raw texts).
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

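# new entity label
LABEL = "Game_Title"

# Each example is (text, {"entities": [(start_char, end_char, label)]}).
# The offsets are recomputed for the English strings shown here; the offsets
# in the original listing, (1, 20) and (0, 19), do not match these strings.
TRAIN_DATA = [
    (
        '"FINAL FANTASY VII Remake" is a game software released by Square Enix.',
        {"entities": [(1, 25, LABEL)]}
    ),
    (
        'This is the official website of the remake work of "FINAL FANTASY VII Remake".',
        {"entities": [(52, 76, LABEL)]}
    ),
    (
        "FINAL FANTASY VII Remake-PS4 is always a bargain at the game store.",
        {"entities": [(0, 24, LABEL)]}
    )
]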

random.seed(0)

# Start from a blank Japanese pipeline and add an NER component with the new label.
nlp = spacy.blank("ja")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label(LABEL)
optimizer = nlp.begin_training()

# Train only the NER component; any other pipes are disabled during training.
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
with nlp.disable_pipes(*other_pipes):
    for itn in range(30):  # 30 passes over the training data
        random.shuffle(TRAIN_DATA)
        losses = {}
        # Batch size starts at 1.0 and compounds by 1.001 per batch, capped at 4.0.
        batches = minibatch(TRAIN_DATA, size=compounding(1.0, 4.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

print()

test_text = 'Following "FINAL FANTASY VII Remake", "FINAL FANTASY II"!'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.text, ent.label_) 

output_dir = Path(r"Appropriate folder name")
nlp.meta["name"] = "GameTitleModel"
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

The result looks like this. For now, "FINAL FANTASY VII Remake" was recognized as Game_Title.

Losses {'ner': 36.54436391592026}
Losses {'ner': 28.74292328953743}
Losses {'ner': 16.96098183095455}
   :
Entities in 'Following "FINAL FANTASY VII Remake", "FINAL FANTASY II"!'
FINAL FANTASY VII Remake Game_Title
FINAL FANTASY II Game_Title

It worked, at least for now. "FINAL FANTASY II", which was not in the training data, is also labeled Game_Title.
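One thing that is easy to get wrong in TRAIN_DATA is the entity offsets. They are character positions in the text (start inclusive, end exclusive), so counting them by hand breaks as soon as a sentence changes. As a rough sketch, they can be computed instead. The helper annotate and the TITLE constant are names I made up for illustration, and the helper only marks the first occurrence of the phrase.

```python
LABEL = "Game_Title"
TITLE = "FINAL FANTASY VII Remake"


def annotate(text, phrase, label):
    """Build one (text, {"entities": [...]}) pair for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError("phrase not found in text: %r" % phrase)
    end = start + len(phrase)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})


texts = [
    '"FINAL FANTASY VII Remake" is a game software released by Square Enix.',
    'This is the official website of the remake work of "FINAL FANTASY VII Remake".',
    "FINAL FANTASY VII Remake-PS4 is always a bargain at the game store.",
]

TRAIN_DATA = [annotate(text, TITLE, LABEL) for text in texts]

# Check that every span really slices out the title.
for text, annotation in TRAIN_DATA:
    start, end, label = annotation["entities"][0]
    print(start, end, label, repr(text[start:end]))
```

The resulting list has the same shape as the hand-written TRAIN_DATA, so it could be passed to the training loop unchanged.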

Sorry, this is still a little rough. I will update it from time to time now that it is published.
