[PYTHON] I tried summarizing sentences with summpy

I tried summpy, a text summarization tool published by Recruit Technologies.

summpy https://github.com/recruit-tech/summpy

The environment is Ubuntu 16.04. summpy requires Python 2.7 to work. Since 2.7 is not included by default, we will prepare a 2.7 environment with Anaconda.

$ conda create -n 2.7 python=2.7 anaconda

Check if it was installed properly

$ source activate 2.7
(2.7)$
(2.7)$ conda info -e
# conda environments:
#
base                     /home/croso/anaconda3
2.7                   *  /home/croso/anaconda3/envs/2.7
3.5                      /home/croso/anaconda3/envs/3.5
3.6                      /home/croso/anaconda3/envs/3.6

summpy requires MeCab or Janome for morphological analysis, so install mecab-python:

(2.7)$ pip install mecab-python

Then install summpy with pip

(2.7)$ pip install summpy 

Create a sample script


# -*- coding: utf-8 -*-
from summpy.lexrank import summarize

text=u'''
The unemployment rate (seasonally adjusted) for September announced by the Ministry of Internal Affairs and Communications was 2.4%, worsening 0.2 points from the previous month.

According to a Reuters survey, 2.3% was expected.

The unemployment rate has been below 2.5% since January 2018.

An executive of the Ministry of Internal Affairs and Communications summed it up: "Although the unemployment rate has risen, it remains at its lowest level in about 26 years, and the employment situation is steadily improving."

The number of employees (seasonally adjusted) was 67.3 million, a decrease of 50,000 from the previous month.

The number of unemployed (same as above) was 1.67 million, an increase of 130,000 from the previous month.

The increase in the number of unemployed people is the first in six months.

Looking at the breakdown, the number of involuntary job leavers was the same as the previous month, but the number of voluntary job leavers (leaving for personal reasons) increased by 10,000, and the number of new job seekers increased by 90,000. "The number of people who want to start working anew is increasing," he said.

According to the original figures, the number of employees increased by 530,000 from the same month of the previous year to 67.68 million.

It has increased for 81 consecutive months, the longest streak since comparable records began in 1953.

The employment rate for those aged 15-64 was 77.9%, tied for the highest on record.

The active job openings-to-applicants ratio (seasonally adjusted) for September announced by the Ministry of Health, Labor and Welfare was 1.57 times, down from the previous month.

According to a Reuters survey, it was expected to be 1.59 times.
'''

sentences, debug_info = summarize(
    text, sent_limit=2  # number of sentences to keep in the summary
)

for sent in sentences:
    print sent.strip().encode('utf-8')  # Python 2 print statement


Let's have it analyze an article taken from a news site. sent_limit appears to specify how many sentences the summary should contain. Up to this point, I am simply following summpy's README.md.

When I ran it, an error occurred:

"error": "add_edge() takes exactly 3 arguments (4 given)"

When I looked into it, it turned out to be a version mismatch with networkx (https://teratail.com/questions/114565), so I pinned the versions to match:

(2.7)$ pip install multiqc==1.2
(2.7)$ pip install networkx==1.11

Install multiqc first. Installing multiqc automatically pulls in networkx 2.2, so the environment cannot be reproduced correctly unless you then overwrite it with networkx 1.11.
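For context, the error comes from an API change between networkx 1.x and 2.x: 1.x's add_edge accepted a positional attribute argument, while 2.x accepts edge attributes only as keywords. The stub below is an illustration of that signature change, not networkx's real code, and it assumes summpy passes the edge weight positionally.

```python
class Graph2x(object):
    """Stub mimicking the networkx 2.x Graph.add_edge signature.

    Illustrative only; not networkx's actual implementation.
    """
    def __init__(self):
        self.edges = {}

    def add_edge(self, u, v, **attr):  # attributes are keyword-only in 2.x
        self.edges[(u, v)] = attr

g = Graph2x()
try:
    g.add_edge(0, 1, 0.5)       # positional weight, the 1.x-era call style
    ok = True
except TypeError:               # "add_edge() takes exactly 3 arguments (4 given)"
    ok = False

g.add_edge(0, 1, weight=0.5)    # the 2.x keyword form works
```

Pinning networkx back to 1.11 sidesteps the change without touching summpy's source.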

The resulting summary is below:

The number of employees (seasonally adjusted) was 67.3 million, a decrease of 50,000 from the previous month.
The number of unemployed (same as above) was 1.67 million, an increase of 130,000 from the previous month.

What do you think? From the summary you can read that the Japanese economy has cooled: the number of employees decreased and the number of unemployed increased. However, I feel it overlooked the sentence that states the main point, that the rate "worsened 0.2 points from the previous month."

I had misunderstood this too: summpy does not "summarize" text in the sense of generating new sentences. It is more accurate to call it a tool that extracts only the important sentences from a text.
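To make the extractive idea concrete, here is a toy sketch in the LexRank spirit (an illustration only, not summpy's actual implementation): build a sentence-similarity graph, run PageRank over it, and keep the top-scoring sentences. The naive whitespace/punctuation splitting assumes English text.

```python
import math
import re
from collections import Counter

def _cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def extractive_summary(text, sent_limit=2, damping=0.85, iters=50):
    # Naive sentence splitting on terminal punctuation
    sents = [s.strip()
             for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    bags = [Counter(re.findall(r'\w+', s.lower())) for s in sents]
    n = len(sents)
    # Sentence-similarity graph (no self-loops)
    sim = [[_cosine(bags[i], bags[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    rowsum = [sum(row) for row in sim]
    # Power iteration, i.e. PageRank over the similarity graph
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n + damping * sum(
                      scores[i] * sim[i][j] / rowsum[i]
                      for i in range(n) if rowsum[i])
                  for j in range(n)]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:sent_limit]
    return [sents[i] for i in sorted(top)]  # keep original sentence order
```

Note that every returned sentence appears verbatim in the input; nothing is rewritten, which is exactly the behavior observed above.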

By the way, I suspect few people install MeCab and use it as-is; without an additional dictionary it is not very useful. So, install the following:

https://github.com/neologd/mecab-ipadic-neologd

It is a MeCab dictionary that covers recent words and neologisms. After installation, the dictionary files should be present below:

(2.7)$ ls /usr/local/lib/mecab/dic/mecab-ipadic-neologd
char.bin  dicrc  left-id.def  matrix.bin  pos-id.def  rewrite.def  right-id.def  sys.dic  unk.dic

I wanted the dictionary to be loaded automatically, but summpy has no such option, so I rewrote part of its source to handle it. (It might be more correct to modify mecab-python...)

(2.7)$ vi ~/anaconda3/envs/2.7/lib/python2.7/site-packages/summpy/misc/mecab_segmenter.py 

On line 8, change

_mecab = MeCab.Tagger()

to

_mecab = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

This should make the analysis a little more accurate...

On an unrelated note, machine learning requires a huge amount of training data, and text summarization struck me as a task for which training data is relatively easy to collect: use article titles as the reference summaries and article bodies as the input, and news sites will give you as many samples as you want. It might be a good subject for study.
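A hedged sketch of that idea: treat the article title as the reference summary and score candidate body sentences by ROUGE-1 recall (word overlap with the title). The function name and the sample data below are illustrative only, not part of summpy.

```python
import re
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of the reference's words that the candidate also contains."""
    cand = Counter(re.findall(r'\w+', candidate.lower()))
    ref = Counter(re.findall(r'\w+', reference.lower()))
    if not ref:
        return 0.0
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / float(sum(ref.values()))

# Example: pick the body sentence that best matches the title
title = "Unemployment rate worsens to 2.4% in September"
body = [
    "The unemployment rate for September was 2.4%.",
    "A Reuters survey had expected 2.3%.",
]
best = max(body, key=lambda s: rouge1_recall(s, title))
```

With pairs like this scraped at scale, the best-matching sentences could serve as weak labels for training an extractive summarizer.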
