[PYTHON] Extract and list personal names and place names in the text

Motivation

  1. Named entity recognition is a mature, highly reliable method of text analysis ("well-seasoned" in the good sense).
  2. Another family of techniques for extracting meaning from text is automatic summarization. A huge number of papers are published on it every day, for example on deep learning models that generate summary sentences.
  3. Automatic summarization is impressive, but I think the generated summary often omits the specific person names and place names contained in the original text.
  4. By listing every word labeled as a person name in a text you are interested in, you can see at a glance who the document mentions and who it does not.
  5. In this way, extracting named entities may recover information that slips out of a summary.
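The check described in points 4 and 5 can be sketched in a few lines. This is only an illustration and not part of the original method: the helper name `entities_missing_from_summary` is hypothetical, and it assumes the entity strings have already been extracted from both the source text and its summary (for example with the `extract_words_by_entity_label` function defined in this article). The sample values are made up for the demonstration.

```python
def entities_missing_from_summary(source_entities, summary_entities):
    """Return entities found in the source text but absent from the summary.

    Both arguments are lists of entity strings (hypothetical inputs, e.g.
    the PERSON entities extracted from each text). The order of first
    appearance in the source is kept and duplicates are dropped.
    """
    in_summary = set(summary_entities)
    missing = []
    for name in source_entities:
        if name not in in_summary and name not in missing:
            missing.append(name)
    return missing


# Illustrative values only, not real model output.
source_people = ["Kato", "Suga", "Kato"]
summary_people = ["Kato"]
print(entities_missing_from_summary(source_people, summary_people))
# ['Suga']
```

Exact string comparison is the simplest possible choice here; in practice, name variants (with or without a title, for instance) would probably need normalizing first.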

__(Reference)__ There are two types of automatic summarization models: *extractive summarization* and *abstractive summarization*.

- @Koreyou's Qiita article "Introduction of thesis: Neural Latent Extractive Document Summarization"
- [[DL Round Reading] Abstractive Summarization of Reddit Posts with Multi-level Memory Networks](https://www.slideshare.net/DeepLearningJP2016/dlabstractive-summarization-of-reddit-posts-with-multilevel-memory-networks-132350977)

__So this time I defined a function that, given the label name of a named entity, returns a list containing only the words in the target text that carry that label.__


(Advance preparation)

Terminal


pip install spacy
pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"

Function defined this time

extract_words_by_entity_label

Python3


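import spacy

# Load the GiNZA Japanese model once, since loading is slow.
nlp = spacy.load("ja_ginza")

# Label names accepted by the function (described in the label list below).
ENTITY_LABELS = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]


def extract_words_by_entity_label(text, label):
    """Return the text of every entity in `text` whose label equals `label`."""
    if label not in ENTITY_LABELS:
        print("Its named entity label does not exist.")
        return []
    # Remove line breaks before parsing.
    doc = nlp(text.replace("\n", ""))
    return [ent.text for ent in doc.ents if ent.label_ == label]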

Named entity label types that can be specified for __*label*__

__The official spaCy website lists the *Entity* label names defined by spaCy.__
- spaCy *Named Entity Recognition*

Label type


PERSON	People, including fictional.
NORP	Nationalities or religious or political groups.
FAC	Buildings, airports, highways, bridges, etc.
ORG	Companies, agencies, institutions, etc.
GPE	Countries, cities, states.
LOC	Non-GPE locations, mountain ranges, bodies of water.
PRODUCT	Objects, vehicles, foods, etc. (Not services.)
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART	Titles of books, songs, etc.
LAW	Named documents made into laws.
LANGUAGE	Any named language.
DATE	Absolute or relative dates or periods.
TIME	Times smaller than a day.
PERCENT	Percentage, including "%".
MONEY	Monetary values, including unit.
QUANTITY	Measurements, as of weight or distance.
ORDINAL	“first”, “second”, etc.
CARDINAL	Numerals that do not fall under another type.

(Example of use)

__Text to prepare__

Chief Cabinet Secretary Kato emphasized that he would make every effort to return to Japan as soon as possible by recording a radio program that the government and others are broadcasting to abductees in North Korea, saying, "Hug each other with your family. Please continue to have a strong feeling that the day will come and survive. "
On the 16th, Secretary of State Kato, who also serves as the minister in charge of the abduction issue, is investigating the so-called specific disappearances who cannot be ruled out by the government and North Korea. We recorded a radio program that is being broadcast to the abductees in Japan.
In this, Chief Cabinet Secretary Kato said, "The abduction issue is regarded as the most important issue in the Suga Cabinet. I met with my family with Prime Minister Suga and shared the earnest desire to" get results at all costs. " ".
He said, "We are still determined to break the shell of mutual distrust, settle the unfortunate past, and normalize diplomatic relations with North Korea." The government is working together to return the abductees as soon as possible. He emphasized that he would do his best.
And he said, "Keep in mind that the day will come when you will step on the soil of your country again and hug your family who are waiting for you to return home. Please take good care of yourself and survive."

Python3


>>> text = """Chief Cabinet Secretary Kato emphasized that he would make every effort to return to Japan as soon as possible by recording a radio program that the government and others are broadcasting to abductees in North Korea, saying, "Hug each other with your family. Please continue to have a strong feeling that the day will come and survive. "
... On the 16th, Secretary of State Kato, who also serves as the minister in charge of the abduction issue, is investigating the so-called specific disappearances who cannot be ruled out by the government and North Korea. We recorded a radio program that is being broadcast to the abductees in Japan.
... In this, Chief Cabinet Secretary Kato said, "The abduction issue is regarded as the most important issue in the Suga Cabinet. I met with my family with Prime Minister Suga and shared the earnest desire to" get results at all costs. " ".
... He said, "We are still determined to break the shell of mutual distrust, settle the unfortunate past, and normalize diplomatic relations with North Korea." The government is working together to return the abductees as soon as possible. He emphasized that he would do his best.
... And he said, "Keep in mind that the day will come when you will step on the soil of your country again and hug your family who are waiting for you to return home. Please take good care of yourself and survive." """
>>>
>>> text = text.replace("\n", "")
>>> text
'Chief Cabinet Secretary Kato emphasized that he would make every effort to return to Japan as soon as possible by recording a radio program that the government and others are broadcasting to abductees in North Korea, saying, "Hug each other with your family. Please continue to have a strong feeling that the day will come and survive. "On the 16th, Secretary of State Kato, who also serves as the minister in charge of the abduction issue, is investigating the so-called specific disappearances who cannot be ruled out by the government and North Korea. We recorded a radio program that is being broadcast to the abductees in Japan.In this, Chief Cabinet Secretary Kato said, "The abduction issue is regarded as the most important issue in the Suga Cabinet. I met with my family with Prime Minister Suga and shared the earnest desire to" get results at all costs. " ".He said, "We are still determined to break the shell of mutual distrust, settle the unfortunate past, and normalize diplomatic relations with North Korea." The government is working together to return the abductees as soon as possible. He emphasized that he would do his best.And he said, "Keep in mind that the day will come when you will step on the soil of your country again and hug your family who are waiting for you to return home. Please take good care of yourself and survive." '

(Named entity recognition included in the above text)

Python3


>>> import spacy
>>> nlp = spacy.load('ja_ginza')
>>> doc = nlp(text)
>>>
>>> tmp = ["Label name:{label} word:{word}".format(label=ent.label_, word=ent.text) for ent in doc.ents]
>>> tmp
['Label name:PERSON word:Chief Cabinet Secretary Kato', 'Label name:LOC word:North Korea', 'Label name:PERSON word:Chief Cabinet Secretary Kato', 'Label name:DATE word:16th', 'Label name:LOC word:North Korea', 'Label name:LOC word:North Korea', 'Label name:PERSON word:Chief Cabinet Secretary Kato', 'Label name:PERSON word:Suga', 'Label name:PERSON word:Suga']
>>>
>>> from pprint import pprint
>>> pprint(tmp)
['Label name:PERSON word:Chief Cabinet Secretary Kato',
 'Label name:LOC word:North Korea',
 'Label name:PERSON word:Chief Cabinet Secretary Kato',
 'Label name:DATE word:16th',
 'Label name:LOC word:North Korea',
 'Label name:LOC word:North Korea',
 'Label name:PERSON word:Chief Cabinet Secretary Kato',
 'Label name:PERSON word:Suga',
 'Label name:PERSON word:Suga']
>>>
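The flat list printed above can also be regrouped by label in a single pass, which avoids re-parsing the text once per label. The following is a minimal sketch, not part of the original article: `group_entities_by_label` is a hypothetical helper that works on plain `(label, word)` pairs, so it runs without spaCy. With a parsed document, the pairs would presumably be built as `[(ent.label_, ent.text) for ent in doc.ents]`; here they are copied from the output shown above.

```python
from collections import defaultdict


def group_entities_by_label(pairs):
    """Group (label, word) pairs into {label: [word, ...]}, keeping order."""
    grouped = defaultdict(list)
    for label, word in pairs:
        grouped[label].append(word)
    return dict(grouped)


# Pairs copied from the entity list printed in the article.
pairs = [
    ("PERSON", "Chief Cabinet Secretary Kato"),
    ("LOC", "North Korea"),
    ("PERSON", "Chief Cabinet Secretary Kato"),
    ("DATE", "16th"),
    ("LOC", "North Korea"),
    ("LOC", "North Korea"),
    ("PERSON", "Chief Cabinet Secretary Kato"),
    ("PERSON", "Suga"),
    ("PERSON", "Suga"),
]

grouped = group_entities_by_label(pairs)
for label, words in grouped.items():
    print(label, ":", words)
```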

(Behavior of the function defined at the beginning)

Python3


>>> words_list = extract_words_by_entity_label(text, "aaa")
Its named entity label does not exist.
>>>
>>> print(words_list)
[]
>>>
>>> label = "LOC"
>>> words_list = extract_words_by_entity_label(text, label)
>>> print(words_list)
['North Korea', 'North Korea', 'North Korea']
>>>
>>> for label in ["LOC", "DATE", "PERSON"]:
...     print(label, " : ", extract_words_by_entity_label(text, label))
...
LOC  :  ['North Korea', 'North Korea', 'North Korea']
DATE  :  ['16th']
PERSON  :  ['Chief Cabinet Secretary Kato', 'Chief Cabinet Secretary Kato', 'Chief Cabinet Secretary Kato', 'Suga', 'Suga']
>>>
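Because the returned list keeps every occurrence, the same name appears several times. If a frequency table is more useful than the raw list, the result can be passed through `collections.Counter`. This is a small sketch under that assumption; `count_entities` is a hypothetical helper name, and the input list is copied from the PERSON output above.

```python
from collections import Counter


def count_entities(words_list):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(words_list).most_common()


# The PERSON result shown above.
people = [
    "Chief Cabinet Secretary Kato",
    "Chief Cabinet Secretary Kato",
    "Chief Cabinet Secretary Kato",
    "Suga",
    "Suga",
]
print(count_entities(people))
# [('Chief Cabinet Secretary Kato', 3), ('Suga', 2)]
```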

(Reference site)

  1. @moriyamanaoto's Qiita article "Simplify rule-based description using spaCy!"

__(Application)__

__It may be useful to extract the words with a specific entity label (*Entity*) from the target text and then perform the following processing.__

  1. Visualize, __by dependency analysis__, the context in which a notable person name or place name appears in a sentence.
  2. For the place names that stand out, visualize the latitude and longitude, the facility type, and so on, by referring to the following article.

- Geocoding tool that returns the facility type, address, etc. when you enter a location name
