[PYTHON] I applied SIFRank to a Japanese document and extracted key phrases

As of August 18, 2020, I could not find any article applying SIFRank to Japanese documents, so this post walks through the process up to actually extracting key phrases. There are probably still some rough edges, so I would appreciate any feedback.

Introduction

The paper that proposed SIFRank and the original repository are here.

• SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Models
• unyilgdx/SIFRank

The code used this time is stored in the following repository.

• tanajp/SIFRank_ja_model

Environment

• Google Colaboratory
• Python 3.6.9
• allennlp 0.8.4
• nltk 3.4.3
• torch 1.2.0
• stanza 1.0.0

Initial setup

First, clone the repository linked above to your machine. Then download the Japanese version of ELMo from the AllenNLP site and place it under auxiliary_data in the SIFRank_ja_model folder. (Only the weights file needs to be downloaded here.)

This time we will work with the folder placed under My Drive, so put the cloned folder under My Drive in Google Drive. Next, configure Google Colab: select "Change runtime type" from the "Runtime" menu, set the hardware accelerator to GPU, and save.

Mounting Google Drive

from google.colab import drive
drive.mount('/content/drive')

If the output is as follows, it is successful.

Enter your authorization code:
··········
Mounted at /content/drive

Installation

Install the required libraries.

!pip install -r '/content/drive/My Drive/SIFRank_ja_model/requirements.txt'

Download WordNet and the Japanese model for stanza.


import nltk
import stanza

nltk.download('wordnet')
stanza.download('ja')

Implementation of SIFRank

test.py



import sys
sys.path.append('/content/drive/My Drive/SIFRank_ja_model')
sys.path.append('/content/drive/My Drive/SIFRank_ja_model/embeddings')
import stanza
import sent_emb_sif, word_emb_elmo
from model.method import SIFRank, SIFRank_plus

#download from https://allennlp.org/elmo
options_file = "https://exawizardsallenlp.blob.core.windows.net/data/options.json"
weight_file = "/content/drive/My Drive/SIFRank_ja_model/auxiliary_data/weights.hdf5"

ELMO = word_emb_elmo.WordEmbeddings(options_file, weight_file, cuda_device=0)
SIF = sent_emb_sif.SentEmbeddings(ELMO, lamda=1.0)
# loads the default Japanese processors: tokenize, pos, lemma, depparse
ja_model = stanza.Pipeline(lang="ja", use_gpu=True)
elmo_layers_weight = [0.0, 1.0, 0.0]

text = "Please enter the text here."
keyphrases = SIFRank(text, SIF, ja_model, N=5, elmo_layers_weight=elmo_layers_weight)
keyphrases_ = SIFRank_plus(text, SIF, ja_model, N=5, elmo_layers_weight=elmo_layers_weight)

print(keyphrases)
print(keyphrases_)
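The `elmo_layers_weight` list controls how the three ELMo layers (the token embedding layer and the two biLSTM layers) are mixed; `[0.0, 1.0, 0.0]` uses only the first biLSTM layer. Conceptually this is just a weighted average over layers, which can be sketched with NumPy (the shapes and function name here are illustrative, not SIFRank's actual internals):

```python
import numpy as np

def mix_elmo_layers(layer_embs, weights):
    """Weighted average of per-layer token embeddings.

    layer_embs: array of shape (n_layers, n_tokens, dim)
    weights:    one weight per layer; normalized so they sum to 1
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # contract the layer axis: (n_layers,) x (n_layers, n_tokens, dim) -> (n_tokens, dim)
    return np.tensordot(w, layer_embs, axes=1)

# Three fake "layers" for 4 tokens with 8-dim embeddings
layers = np.stack([np.zeros((4, 8)), np.ones((4, 8)), np.full((4, 8), 2.0)])
mixed = mix_elmo_layers(layers, [0.0, 1.0, 0.0])
print(mixed.shape)  # (4, 8) — only the middle layer contributes
```

With `[0.0, 1.0, 0.0]` the result equals the middle layer exactly, which matches the intuition that SIFRank relies on the contextualized (biLSTM) representations rather than the static token layer.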

Execution result

As an example, I extracted key phrases from the text of [ANA, 500 billion yen capital raising talks - Wikinews](https://ja.wikinews.org/wiki/ANA%E3%80%815000%E5%84%84%E5%86%86%E8%A6%8F%E6%A8%A1%E3%81%AE%E8%B3%87%E6%9C%AC%E8%AA%BF%E9%81%94%E5%8D%94%E8%AD%B0).

2020-08-17 17:21:13 INFO: Loading these models for language: ja (Japanese):
=======================
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| pos       | gsd     |
| lemma     | gsd     |
| depparse  | gsd     |
=======================

2020-08-17 17:21:13 INFO: Use device: gpu
2020-08-17 17:21:13 INFO: Loading: tokenize
2020-08-17 17:21:13 INFO: Loading: pos
2020-08-17 17:21:14 INFO: Loading: lemma
2020-08-17 17:21:14 INFO: Loading: depparse
2020-08-17 17:21:15 INFO: Done loading processors!
(['Development Bank of Japan', 'Capital raising', 'ana holdings', 'Loan', 'Private financial institution'], [0.8466373488741734, 0.8303728302151282, 0.7858931046897192, 0.7837600983935882, 0.7821878670623081])
(['Development Bank of Japan', 'Nihon Keizai Shimbun', 'All Nippon Airways', 'Capital raising', 'ana holdings'], [0.8480482653338678, 0.8232344465718657, 0.8218706097094447, 0.8100789955114978, 0.8053839380458278])

These are the results with N = 5. The final output is the key phrases together with their scores: the first tuple is the output of SIFRank and the second is the output of SIFRank+. Since the article reports that ANA has started financing discussions with the Development Bank of Japan and private financial institutions, the key phrase extraction appears to have been successful.
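Each call returns a pair of parallel lists: the key phrases and their scores, sorted in descending order. A small helper (my own, assuming only that tuple format) makes the result easier to consume, for example to keep phrases above a score threshold:

```python
def to_ranking(result, min_score=0.0):
    """Pair each phrase with its score and drop low-scoring ones.

    result: a (phrases, scores) tuple as returned by SIFRank / SIFRank+
    """
    phrases, scores = result
    return [(p, round(s, 4)) for p, s in zip(phrases, scores) if s >= min_score]

# Example with (truncated) SIFRank output like the one shown above
result = (['Development Bank of Japan', 'Capital raising', 'ana holdings'],
          [0.8466, 0.8304, 0.7859])
print(to_ranking(result, min_score=0.80))
# [('Development Bank of Japan', 0.8466), ('Capital raising', 0.8304)]
```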

By the way, tokenize, pos, lemma, and depparse refer to tokenization, part-of-speech tagging, lemmatization, and dependency parsing, respectively; these are the stages of stanza's processing pipeline.

In conclusion

I adapted SIFRank so that it can be applied to Japanese documents and actually ran it. Any parser with a Japanese model will do, but this time I used stanza. The Japanese stopword dictionary comes from Slothlib; you can edit the stopwords by rewriting japanese_stopwords.txt under auxiliary_data in the SIFRank_ja_model folder.
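Assuming the stopword file is a plain UTF-8 list with one word per line (the filename comes from the repository; the helper functions are my own sketch), loading and extending it could look like this:

```python
import os
import tempfile
from pathlib import Path

def load_stopwords(path):
    """Read one stopword per line, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}

def add_stopwords(path, words):
    """Append new stopwords that are not already in the file."""
    current = load_stopwords(path)
    new = [w for w in words if w not in current]
    with open(path, "a", encoding="utf-8") as f:
        for w in new:
            f.write(w + "\n")
    return current | set(new)

# Demo with a temporary file standing in for japanese_stopwords.txt
tmp = os.path.join(tempfile.mkdtemp(), "japanese_stopwords.txt")
Path(tmp).write_text("これ\nそれ\n", encoding="utf-8")
print(sorted(add_stopwords(tmp, ["あれ", "これ"])))  # ['あれ', 'これ', 'それ']
```

In practice you would point these helpers at auxiliary_data/japanese_stopwords.txt instead of the temporary file.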

References

• SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Models
• unyilgdx/SIFRank
• AllenNLP
• Introduction of the ELMo (using MeCab) model trained on a large-scale Japanese business news corpus
• Usage and accuracy comparison verification of the ELMo (using MeCab) model trained on a large-scale Japanese business news corpus
• Load trained ELMo with AllenNLP - Rittanzu!
• Slothlib
