[PYTHON] I tried to extract named entities with the natural language processing library GiNZA

# Purpose of this article

Recently I was looking for a way to do Japanese Named Entity Recognition (NER) easily in Python, came across `GiNZA`, and tried it out. Named entity recognition is one of the tasks of natural language processing: it detects specific expressions such as dates and person names, as shown in the figure below.

(figure: example of named entity recognition)

# About GiNZA

GiNZA is one of the libraries for natural language processing (NLP), and it can perform a variety of tasks besides named entity recognition.


To be precise, there is a natural language processing library called `spaCy`, and GiNZA handles its Japanese-processing side. So if you already understand how to use `spaCy`, you should be able to pick it up quickly.


For details, please check the links below. This time I will cover only named entity extraction.
- GitHub: [megagonlabs/ginza](https://github.com/megagonlabs/ginza)
- [4th Natural Language Processing Using spaCy / GiNZA](https://www.ogis-ri.co.jp/otc/hiroba/technical/similar-document-search/part4.html)


# Preparation

First, install `GiNZA` and load it through `spaCy`.

```python
!pip install -U ginza
import spacy
from spacy import displacy

# Load the GiNZA Japanese model via spaCy
nlp = spacy.load('ja_ginza')
```

Now that the `GiNZA` model has been loaded on the spaCy side, we are ready to process text.

If you are using Jupyter Notebook, you may stumble here; in that case, please refer to the [GitHub](https://github.com/megagonlabs/ginza) page.
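Once the model is loaded, extracting entities programmatically is just a matter of calling `nlp` and reading `doc.ents`. A minimal sketch (the helper name `extract_entities` is my own, not part of GiNZA or spaCy):

```python
def extract_entities(nlp, text):
    """Run the pipeline and return (surface, label, start, end) per entity.

    `nlp` is any spaCy-style callable, e.g. the result of
    spacy.load('ja_ginza'); each entity in doc.ents exposes
    .text, .label_, .start_char and .end_char.
    """
    doc = nlp(text)
    return [(ent.text, ent.label_, ent.start_char, ent.end_char)
            for ent in doc.ents]

# With the GiNZA model loaded as `nlp` above, this returns a list of
# (surface, label, start, end) tuples for a Japanese sentence.
```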



# Data preparation

Next, prepare the sample data to be analyzed.
This time I will use the livedoor news corpus published by RONDHUIT.
The data consists of news articles divided into nine genres, so I will pick one article from each genre and run named entity extraction on those nine texts.
This should give a feel for which text genres the extraction handles well and which it handles poorly.
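Picking one article per genre can be scripted. A sketch under the assumption that the archive has been extracted to a `./text` folder with one subdirectory per genre (the helper name is my own; each genre directory in this corpus also ships a `LICENSE.txt`, which is skipped):

```python
from pathlib import Path

def first_article_per_genre(base_dir):
    """Return {genre: path of its first .txt article} for each genre
    directory under base_dir (e.g. the extracted ./text folder).

    LICENSE.txt files inside the genre directories are skipped.
    """
    result = {}
    for genre_dir in sorted(Path(base_dir).iterdir()):
        if not genre_dir.is_dir():
            continue
        articles = sorted(p for p in genre_dir.glob("*.txt")
                          if p.name != "LICENSE.txt")
        if articles:
            result[genre_dir.name] = articles[0]
    return result
```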

 Download the data from here and unzip it.
 [livedoor news corpus](https://www.rondhuit.com/download.html)

```python
!wget https://www.rondhuit.com/download/ldcc-20140209.tar.gz
!tar zxvf ldcc-20140209.tar.gz
```

# Named entity recognition

I wrote code that reads a text file from its path and analyzes it. As a test, I analyzed one article from Dokujo Tsushin.

```python
# Read the text
filepath = "./text/dokujo-tsushin/dokujo-tsushin-4778030.txt"
with open(filepath) as f:
    s = f.read()

# The full pipeline (tokenization, tagging, NER, ...) runs here
doc = nlp(s)

# Render the result of named entity extraction
displacy.render(doc, style="ent", jupyter=True)
```

The result looks like this: (screenshot)

The entities are extracted nicely! However, purely as a matter of appearance, the output is a little hard to read because colors are specified only for PERSON and TIME by default. So I will change the colors via an option.

```python
# Specify a color for each extracted entity type
colors = {"COUNTRY": "#00cc00", "CITY": "#00cc00", "GPE_OTHER": "#00cc00", "OCCASION_OTHER": "#00cc00",
          "LOCATION": "#00cc00", "LOCATION_OTHER": "#00cc00", "DOMESTIC_REGION": "#00cc00", "PROVINCE": "#00cc00",
          "STATION": "#00cc00", "CONTINENTAL_REGION": "#00cc00", "THEATER": "#00cc00",

          "TIME": "#adff2f", "DATE": "#adff2f", "DAY_OF_WEEK": "#adff2f",
          "PERIOD_YEAR": "#adff2f", "PERIOD_MONTH": "#adff2f", "PERIOD_DAY": "#adff2f",

          "FLORA": "#adff99", "FLORA_PART": "#adff99",
          "DISH": "#ffeb99", "FOOD_OTHER": "#ffeb99",

          "AGE": "#3385ff", "N_PERSON": "#3385ff", "N_EVENT": "#3385ff", "N_LOCATION_OTHER": "#3385ff", "RANK": "#3385ff",
          "N_PRODUCT": "#3385ff", "MEASUREMENT_OTHER": "#3385ff", "PERCENT": "#3385ff",
          "N_ORGANIZATION": "#3385ff", "ORDINAL_NUMBER": "#3385ff", "N_FACILITY": "#3385ff", "SPEED": "#3385ff",
          "PHONE_NUMBER": "#3385ff",

          "MONEY": "#ffff00",

          "COMPANY": "#99c2ff", "SCHOOL": "#99c2ff", "INTERNATIONAL_ORGANIZATION": "#99c2ff",
          "GOE_OTHER": "#99c2ff", "SHOW_ORGANIZATION": "#99c2ff", "CORPORATION_OTHER": "#99c2ff",

          "CLOTHING": "#ff66a3",
          "PRODUCT_OTHER": "#ff66a3",

          "PERSON": "#c266ff",
          "POSITION_VOCATION": "#ebccff",

          "MUSIC": "#ff7f50", "MOVIE": "#ff7f50", "GAME": "#ff7f50", "SPORT": "#ff7f50", "BOOK": "#ff7f50",
          "BROADCAST_PROGRAM": "#ff7f50",

          "ANIMAL_DISEASE": "#cd5c5c"
          }

options = {"colors": colors}
displacy.render(doc, style="ent", options=options, jupyter=True)
```
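Incidentally, writing out one hex code per label is repetitive. The same `{label: color}` dict that displacy expects can be built by inverting a color-to-labels table; a sketch using a few of the groups above (the grouping is my own):

```python
# Map each display color to the entity labels that should use it,
# then invert into the {label: color} dict displacy expects.
color_groups = {
    "#00cc00": ["COUNTRY", "CITY", "GPE_OTHER", "LOCATION", "PROVINCE"],
    "#adff2f": ["TIME", "DATE", "DAY_OF_WEEK"],
    "#c266ff": ["PERSON"],
}

colors = {label: color
          for color, labels in color_groups.items()
          for label in labels}
```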

Location words are green, organization words are blue, and so on. There is still plenty of room for improvement, but since this is just an experiment, this is good enough. I then used this setup to extract named entities from the nine news articles. The results are as follows.

- Dokujo Tsushin: (screenshot)
- IT Life Hack: (screenshot)
- Kaden Channel: (screenshot)
- livedoor HOMME: (screenshot)
- MOVIE ENTER: (screenshot)
- Peachy: (screenshot)
- S-MAX: (screenshot)
- Sports Watch: (screenshot)
- Topic News: (screenshot)

Setting the colors makes the output easier to read than the default (though it is a bit dazzling...). The differences between genres are obvious: entertainment articles contain many person names, while gadget articles contain many product and company names.

Accuracy matters, and depending on the genre there are a few questionable extractions, but on the whole the entities are extracted well.

Personally, I was bothered that "Gecko" was classified as FOOD in the IT Life Hack article.
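To compare genres less impressionistically, you could count label frequencies per article instead of eyeballing the colors. A small sketch (the helper name is my own) that works on any sequence of spaCy-style entities such as `doc.ents`:

```python
from collections import Counter

def label_counts(ents):
    """Count how often each entity label occurs, most common first.

    `ents` is any iterable of spaCy-style entities, e.g. doc.ents.
    """
    return Counter(ent.label_ for ent in ents).most_common()
```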

# Reference

- GitHub: [megagonlabs/ginza](https://github.com/megagonlabs/ginza)
- Object Square: 4th Natural Language Processing Using spaCy / GiNZA
- RONDHUIT Co., Ltd.: livedoor news corpus
- spaCy: Visualizing the entity recognizer
- spaCy: Named Entity Recognition
