Easy keyword extraction with TermExtract for Python

TermExtract is a module for extracting technical terms (keywords) from text data.

Technical term (keyword) automatic extraction system

Until now it was provided only as a Perl module, but a beta version for Python appears to have been released at the end of last year. I thought it could help with unknown words when analyzing text, so I tried it.

Installation

Download the zip file from the official site, unzip it to a suitable location, and run the following command there.

python setup.py install

Unfortunately, it does not appear to be installable via pip or conda.

Keyword extraction using the output result of the morphological analyzer

The official documentation describes it as follows:

Receives the morphological analysis result of MeCab, a Japanese morphological analysis software, and returns a list of compound words (single nouns separated by spaces) or a dictionary (the key is the compound word and the value is its number of occurrences).

The morphological analysis result is passed in the following format (the official sample text is shown below).

sample.txt


Natural language processing nouns,General,*,*,*,*,dummy,dummy,dummy
(Symbol,Open parentheses,*,*,*,*,(,(,(
Auxiliary verb,*,*,*,Literary language,Uninflected word,Ri,Li,Li
, Symbol,Comma,*,*,*,*,、,、,、
English noun,General,*,*,*,*,dummy,dummy,dummy
Word noun,General,*,*,*,*,dummy,dummy,dummy
:noun,Change connection,*,*,*,*,*
natural noun,General,*,*,*,*,*
language noun,General,*,*,*,*,*
processing noun,General,*,*,*,*,*
・
・
・

This is MeCab's morphological analysis output, with one morpheme per line. Load it with the following Python script.

import termextract.mecab
import termextract.core
import collections

# Read the file (the with block closes it automatically)
with open("sample.txt", "r", encoding="utf-8") as f:
    tagged_text = f.read()

# Extract compound words and calculate importance
frequency = termextract.mecab.cmp_noun_dict(tagged_text)
LR = termextract.core.score_lr(frequency,
         ignore_words=termextract.mecab.IGNORE_WORDS,
         lr_mode=1, average_rate=1
     )
term_imp = termextract.core.term_importance(frequency, LR)

# Sort and output in descending order of importance
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    print(termextract.core.modify_agglutinative_lang(cmp_noun), value, sep="\t")

The output looks like this:

Natural language processing 31.843366656181313
they  11.618950038622252
Meaning 10.392304845413264
English 10.059467437463484
Basic technology 9.361389277282864
Statistical natural language processing 9.085602964160698
Analysis 8.485281374238571
・
・
・

The results are reasonable, and compound words are extracted as they are (although quite a few of them are obviously odd...).

However, the input must be MeCab's morphological analysis result, which makes it slightly awkward to use. It would be easier to use if it accepted plain text or pre-segmented text.
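To make the expected input format more concrete, here is a minimal sketch that uses only the standard library. It is not TermExtract's implementation. It assumes each line has the form surface, TAB, comma-separated features, with the part of speech as the first feature (名詞 for nouns), and it joins consecutive nouns into space-separated compound words and counts them. The function name count_compound_nouns and the sample lines are made up for illustration.

```python
import collections


def count_compound_nouns(tagged_text):
    """Count runs of consecutive nouns in MeCab-style output (illustrative only)."""
    counts = collections.Counter()
    run = []  # surface forms of the nouns in the current run

    def flush():
        # Store the current run as one space-separated compound word.
        if run:
            counts[" ".join(run)] += 1
            run.clear()

    for line in tagged_text.splitlines():
        surface, sep, features = line.partition("\t")
        # Lines without a tab (such as "EOS") and non-noun lines end the run.
        if sep and features.split(",")[0] == "名詞":
            run.append(surface)
        else:
            flush()
    flush()
    return counts


# Hypothetical sample in the usual MeCab (IPA dictionary) layout.
sample = "\n".join([
    "自然\t名詞,一般,*,*,*,*,自然,シゼン,シゼン",
    "言語\t名詞,一般,*,*,*,*,言語,ゲンゴ,ゲンゴ",
    "処理\t名詞,サ変接続,*,*,*,*,処理,ショリ,ショリ",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ",
    "技術\t名詞,一般,*,*,*,*,技術,ギジュツ,ギジュツ",
    "EOS",
])
compound_counts = count_compound_nouns(sample)
print(compound_counts)
```

The real module also computes importance scores on top of such counts, which this sketch leaves out.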

Keyword extraction with the Japanese stop word method

Other extraction methods are also provided. One of them, Japanese stop word method terminology extraction, is officially described as follows.

Receives plain text in Japanese and returns a list of compound words (single nouns separated by spaces) or a dictionary (the key is the compound word and the value is its number of occurrences). Compound words are cut out by splitting sentences at hiragana and symbols.

I take this to mean that the text is tokenized using hiragana and symbols as delimiters (sorry, I have not read it closely...).
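As a rough illustration of that idea, the sketch below cuts a sentence at every run of hiragana or punctuation and counts what is left. This is only my reading of the description, not TermExtract's actual code. The delimiter set is an assumption: hiragana is taken as the Unicode block U+3040 to U+309F, and "symbols" are approximated by whitespace and a handful of common punctuation marks. The name split_on_hiragana is hypothetical.

```python
import collections
import re

# Assumption: hiragana is U+3040-U+309F; "symbols" are approximated as
# whitespace plus a few common ASCII and Japanese punctuation marks.
DELIMITERS = re.compile(r"[\u3040-\u309F\s、。,.・「」()()!?!?]+")


def split_on_hiragana(text):
    """Return the non-empty chunks left after cutting at hiragana and symbols."""
    return [chunk for chunk in DELIMITERS.split(text) if chunk]


text = "人工知能は人間の知能を研究する。人工知能の開発が進む。"
chunks = split_on_hiragana(text)
chunk_counts = collections.Counter(chunks)
print(chunk_counts.most_common())
```

Note that a verb stem such as 進 survives as a chunk here, so a real implementation needs extra filtering (for example an ignore-word list) on top of the plain split.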

This method simply takes plain text as input.

import collections
import termextract.japanese_plaintext
import termextract.core

# Read the file (the with block closes it automatically)
with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Extract compound words and calculate importance
frequency = termextract.japanese_plaintext.cmp_noun_dict(text)
LR = termextract.core.score_lr(frequency,
         ignore_words=termextract.japanese_plaintext.IGNORE_WORDS,
         lr_mode=1, average_rate=1
     )
term_imp = termextract.core.term_importance(frequency, LR)

# Sort and output in descending order of importance
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    print(termextract.core.modify_agglutinative_lang(cmp_noun), value, sep="\t")

The output looks like this:

Artificial intelligence 1226.4288753047445
Human 277.1173032591193
Intelligence 185.75930965317852
Development 88.6373649378885
Awareness 60.00624902367479
Artificial 57.917332434843445
Possible 55.20783921098894
・
・
・

This one looks easy to use.

Comparison of both methods

So far I have introduced two methods: the morphological analysis method and the stop word method. Let's look at the top 20 or so scores from each.

Morphological analysis method

Intelligence 12.649110640673518
Computational intelligence 5.029733718731742
Fighter 4.7381372205375865
Combat 4.58257569495584
For fighter pilots 4.4406237146062955
Calculator 4.426727678801286
Artificial intelligence 4.355877174692862
Study 4.0
Calculation 4.0
Autopilot 3.9359793425308607
Learning 3.872983346207417
Automatic combat system 3.802742902833608
Artificial intelligence technology 3.7719455481170785
Logical operation 3.7224194364083982
Machine learning 3.6628415014847064
Symbolic AI 3.6342411856642793
Autopilot possible 3.5254687665352296
Logic 3.4641016151377544
Machine 3.4641016151377544
Mechanical calculator 3.413473673690155

Stop word method

Artificial intelligence 1226.4288753047445
Human 277.1173032591193
Intelligence 185.75930965317852
Development 88.6373649378885
Awareness 60.00624902367479
Artificial 57.917332434843445
Possible 55.20783921098894
Study 51.27978102078589
Learning 49.31317739277511
Japanese Society for Artificial Intelligence 48.855373993311964
Realization 48.748063633179314
Theory 40.51490946041508
Announcement 39.39438441683934
Computational intelligence 35.98098913381863
Possibility 34.82443169313786
Method 34.6517883306879
Use 32.82677759681713
Intellectual 31.52620185751426
Operation 30.582796407248203
Action 30.582796407248203
Appearance 29.146786564179294

At first glance, the morphological analysis method appears to rank more meaningful keywords near the top.
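Because the two methods produce scores on very different scales, comparing ranks is more informative than comparing raw values. The following is a small sketch of one way to do that, using a few scores copied from the two lists above rather than the full output. The helper top_terms and the variable names are made up for this example.

```python
def top_terms(scores, n):
    """Return the n highest-scoring terms, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# A few entries copied from the two result lists (not the full output).
morphological = {
    "Intelligence": 12.649110640673518,
    "Computational intelligence": 5.029733718731742,
    "Fighter": 4.7381372205375865,
    "Artificial intelligence": 4.355877174692862,
}
stop_word = {
    "Artificial intelligence": 1226.4288753047445,
    "Human": 277.1173032591193,
    "Intelligence": 185.75930965317852,
    "Computational intelligence": 35.98098913381863,
}

top_morph = top_terms(morphological, 3)
top_stop = top_terms(stop_word, 3)
# Terms that both methods place in their top 3.
shared = [term for term in top_morph if term in top_stop]
print(top_morph, top_stop, shared, sep="\n")
```

With the full dictionaries returned by term_importance, the same helper can be called with n=20 to reproduce the comparison above.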

Impressions

It may work as an easy way to handle unknown words that MeCab + NEologd cannot detect. However, the morphological analysis method is awkward to use as a module, so you would need to write a thin wrapper around it yourself. Proper verification is also still needed.
