Recently, while looking for an easy way to do Named Entity Recognition (NER) in Japanese with Python, I came across `GiNZA`, so I tried it out. By the way, named entity recognition is one of the tasks of natural language processing: it detects specific expressions such as dates and person names, as shown in the figure below.
GiNZA is one of the libraries for natural language processing (NLP), and it can handle various tasks besides named entity recognition.
To be precise, GiNZA provides the Japanese-processing part on top of the natural language processing library `spaCy`. So if you already know how to use `spaCy`, you should pick it up quickly.
For details, please check the following. This time, we will introduce only named entity extraction.
GitHub:[megagonlabs/ginza](https://github.com/megagonlabs/ginza)
[4th Natural Language Processing Using spaCy / GiNZA](https://www.ogis-ri.co.jp/otc/hiroba/technical/similar-document-search/part4.html)
# Preparation
First, prepare `GiNZA` and `spaCy`.
```python
!pip install -U ginza
import spacy
from spacy import displacy
nlp = spacy.load('ja_ginza')
```
Now that the GiNZA model has been loaded on the spaCy side, everything is ready for processing.
If you are using Jupyter Notebook, I hear you may stumble at this step; in that case, please refer to the [GitHub](https://github.com/megagonlabs/ginza) page.
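If the installation went through, a quick sanity check (a minimal sketch of my own; the sample sentence is not from the corpus) is to run a short Japanese sentence through the pipeline and list the entities that come back:

```python
# Quick sanity check: run a short sentence through the pipeline
# and print every named entity with its label and character span.
doc = nlp("山田さんは4月1日に東京へ出張した。")  # sample sentence of my own
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```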
# Data preparation
Next, prepare the sample data to be analyzed.
This time, we will use the livedoor news corpus published by RONDHUIT.
The data consists of news articles divided into nine genres, so this time we will take one article from each genre and run named entity extraction on those nine texts.
This should give a feel for which text genres the model handles well and which it handles poorly.
Download the data from here and unzip it.
[livedoor news corpus](https://www.rondhuit.com/download.html)
```python
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
!tar zxvf ldcc-20140209.tar.gz
```
I wrote code that reads the path of a text file and analyzes its contents. As a test, I analyzed one article from dokujo-tsushin (独女通信).
```python
# Read the text of one article
filepath = "./text/dokujo-tsushin/dokujo-tsushin-4778030.txt"
with open(filepath) as f:
    s = f.read()

# Run the GiNZA/spaCy pipeline on the whole article
doc = nlp(s)

# Render the result of named entity extraction
displacy.render(doc, style="ent", jupyter=True)
```
The result looks like this.
It extracts entities nicely! However, purely as a matter of appearance, the output is a little hard to read because colors are assigned only to PERSON and TIME. So I will specify the colors as an option.
```python
# Specify a color for each extracted entity type
colors = {"COUNTRY":"#00cc00", "CITY":"#00cc00", "GPE_OTHER":"#00cc00","OCCASION_OTHER":"#00cc00",
"LOCATION":"#00cc00", "LOCATION_OTHER":"#00cc00","DOMESTIC_REGION":"#00cc00","PROVINCE":"#00cc00",
"STATION":"#00cc00", "CONTINENTAL_REGION":"#00cc00","THEATER":"#00cc00",
"TIME":"#adff2f","DATE":"#adff2f","DAY_OF_WEEK":"#adff2f",
"PERIOD_YEAR":"#adff2f", "PERIOD_MONTH":"#adff2f", "PERIOD_DAY":"#adff2f",
"FLORA":"#adff99","FLORA_PART":"#adff99",
"DISH":"#ffeb99","FOOD_OTHER":"#ffeb99",
"AGE":"#3385ff","N_PERSON":"#3385ff","N_EVENT":"#3385ff","N_LOCATION_OTHER":"#3385ff","RANK":"#3385ff",
"N_PRODUCT":"#3385ff","":"#3385ff","":"#3385ff","":"#3385ff","MEASUREMENT_OTHER":"#3385ff","PERCENT":"#3385ff",
"N_ORGANIZATION":"#3385ff", "ORDINAL_NUMBER":"#3385ff", "N_FACILITY":"#3385ff","SPEED":"#3385ff",
"PHONE_NUMBER":"#3385ff",
"MONEY":"#ffff00",
"COMPANY":"#99c2ff", "SCHOOL":"#99c2ff", "INTERNATIONAL_ORGANIZATION":"#99c2ff",
"GOE_OTHER":"#99c2ff", "SHOW_ORGANIZATION":"#99c2ff","CORPORATION_OTHER":"#99c2ff",
"CLOTHING":"#ff66a3",
"PRODUCT_OTHER":"#ff66a3",
"PERSON":"#c266ff",
"POSITION_VOCATION":"#ebccff",
"MUSIC":"#ff7f50", "MOVIE":"#ff7f50", "GAME":"#ff7f50", "SPORT":"#ff7f50", "BOOK":"#ff7f50",
"BROADCAST_PROGRAM":"#ff7f50",
"ANIMAL_DISEASE":"#cd5c5c"
}
options = {"colors": colors}
displacy.render(doc, style="ent", options=options, jupyter=True)
Place names are green, organization names are blue, and so on. There is still plenty of room for improvement, but since this is just an experiment I am happy with this, and I used it to run named entity extraction on the nine news articles. The results are as follows.
(Screenshots of the extraction results for each genre, including livedoor HOMME, MOVIE ENTER, Peachy, and Sports Watch.)
Setting the colors makes the output easier to read than the default (although it is a bit dazzling on the eyes...). Depending on the news genre, you can clearly see that entertainment articles contain many person names, while gadget articles contain many product and company names.
As for accuracy, there are some questionable extractions depending on the genre, but on the whole the entities are extracted quite well.
Personally, the one that bothered me was "Gecko" being classified as FOOD in the IT Lifehack article.
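For reference, the per-genre loop can be reproduced with something like the sketch below. It is my own sketch, not the exact code behind the screenshots above: it assumes the corpus's `text/<genre>/` layout shown earlier, skips the `LICENSE.txt` that ships in each genre directory, and counts the entity labels in the first article of each genre.

```python
import glob
import os
from collections import Counter

# One article per genre: run NER and show the most frequent entity labels.
for genre_dir in sorted(d for d in glob.glob("./text/*") if os.path.isdir(d)):
    articles = sorted(p for p in glob.glob(os.path.join(genre_dir, "*.txt"))
                      if not p.endswith("LICENSE.txt"))
    if not articles:
        continue
    with open(articles[0]) as f:
        doc = nlp(f.read())
    counts = Counter(ent.label_ for ent in doc.ents)
    print(os.path.basename(genre_dir), counts.most_common(5))
```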
# References
- GitHub: [megagonlabs/ginza](https://github.com/megagonlabs/ginza)
- Object Square: [4th Natural Language Processing Using spaCy / GiNZA](https://www.ogis-ri.co.jp/otc/hiroba/technical/similar-document-search/part4.html)
- RONDHUIT Co., Ltd.: [livedoor news corpus](https://www.rondhuit.com/download.html)
- spaCy: Visualizing the entity recognizer
- spaCy: Named Entity Recognition