[PYTHON] I tried to summarize various sentences using the automatic summarization API "summpy"

What is summpy

summpy is an automatic text summarization API published by Recruit Technologies. It condenses the input text into a specified number of lines (sentences).

GitHub repository: https://github.com/recruit-tech/summpy

In this article, I feed various texts into the API and check what comes out.

Verification environment

EC2 (Amazon Linux release 2), Python 2.7

Installation

Install pip, summpy, and mecab-python3.

For mecab-python3, pin version 0.996.5. Without the pin, the error "no such file or directory: /usr/local/etc/mecabrc" is displayed.

$ sudo easy_install pip
$ sudo pip install summpy
$ sudo pip install mecab-python3==0.996.5

Also pin networkx to version 1.11. Otherwise, the error "add_edge() takes exactly 3 arguments (4 given)" occurs at runtime.

$ sudo pip install multiqc==1.2
$ sudo pip install networkx==1.11

Server execution

The server starts on port 8080. nohup and the trailing & keep it running in the background.

nohup python -m summpy.server -h 127.0.0.1 -p 8080 &

Source code

summpy_test.py


#!/usr/bin/env python2
# coding:utf-8
import requests

limit = 3  # maximum number of lines (sentences) in the summary
text = 'Enter the text you want to summarize here.'

p = {'sent_limit':limit, 'text':text}

r = requests.get('http://localhost:8080/summarize', params=p)

print(r.text)

Execution result

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "Enter the text you want to summarize here."
  ]
}

Since the text is a single sentence, the result is also a single line. From here on, I will change the text in various ways.
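
Since the same request is repeated with different texts, it can help to wrap it in a small function. The following is only a sketch: it assumes the server is reachable at the URL used in the script above, and the names `summarize`, `parse_summary` and `SUMMPY_URL` are made up for illustration. It uses the standard library instead of requests so that it runs without extra packages.

```python
import json

try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import urlopen
except ImportError:  # Python 2.7, as used in this article
    from urllib import urlencode
    from urllib2 import urlopen

# Endpoint taken from the script above; adjust host and port if yours differ.
SUMMPY_URL = 'http://localhost:8080/summarize'


def parse_summary(body):
    """Return the list of summary sentences from a response body (JSON text)."""
    return json.loads(body).get('summary', [])


def summarize(text, sent_limit=3, url=SUMMPY_URL):
    """Hypothetical helper: query the running server and return the sentences."""
    query = urlencode({'sent_limit': sent_limit, 'text': text})
    response = urlopen(url + '?' + query)
    try:
        return parse_summary(response.read().decode('utf-8'))
    finally:
        response.close()
```

With the server running, `summarize('Enter the text you want to summarize here.')` should return the same single sentence that appears under "summary" in the output above.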

I tried with various sentences

From here, I summarize various texts, one theme at a time. The text to be summarized is taken from the following article.

How do you interpret "Adler Psychology" from an engineer's perspective? https://qiita.com/keki/items/0542d9d121cf89d6154e

First of all, summarize without thinking about anything

First, let's summarize the following text. I removed the line breaks before passing it to the summarization API.

Original

At this age, people become more interested in feelings, feelings, and ways of thinking.

Meanwhile, I came across "Adler Psychology" in this title a few years ago.

Engineers sometimes concentrate on the specialized work of programming, and it is often said that they are not good at communicating with people and that they are not good at joining the circle of teams.

Also, I'm worried about relationships and suffering from depression....I think that there are many cases of this.

I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.

This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field.

By the way, it is quite a long sentence.
I hope you will see it with the intention of reading a small book.
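
The line breaks were removed from this text before it was sent to the API. A minimal sketch of that step is shown here; the function name `join_lines` and the `sep` parameter are my own, not part of summpy.

```python
def join_lines(raw, sep=' '):
    """Collapse a multi-line text into one line, dropping blank lines."""
    parts = [line.strip() for line in raw.splitlines()]
    return sep.join(part for part in parts if part)
```

For text written without spaces between sentences, such as Japanese, passing `sep=''` avoids inserting extra spaces.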

Source code

summpy_test.py


#!/usr/bin/env python2
# coding:utf-8
import requests

limit = 3  # maximum number of lines (sentences) in the summary
# The original text above with its line breaks removed.
text = (
    'At this age, people become more interested in feelings, feelings, and ways of thinking. '
    'Meanwhile, I came across "Adler Psychology" in this title a few years ago. '
    'Engineers sometimes concentrate on the specialized work of programming, and it is often said that they are not good at communicating with people and that they are not good at joining the circle of teams. '
    "Also, I'm worried about relationships and suffering from depression....I think that there are many cases of this. "
    'I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships. '
    'This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field. '
    'By the way, it is quite a long sentence. '
    'I hope you will see it with the intention of reading a small book.'
)

p = {'sent_limit':limit, 'text':text}

r = requests.get('http://localhost:8080/summarize', params=p)

print(r.text)

Execution result

$ python summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, people become more interested in feelings, feelings, and ways of thinking.",
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field."
  ]
}

Hmm. The connection between the first and second lines is hard to follow, but the text has indeed been reduced to three lines. It also appears that summpy does not rewrite the original text; it simply picks sentences out of it.
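
The impression that sentences are only extracted, never rewritten, can be checked mechanically. The helper below is a small sketch (the name `is_extractive` is hypothetical): it reports whether every sentence of a summary occurs verbatim in the input text.

```python
def is_extractive(text, summary):
    """Return True if every summary sentence appears verbatim in the input text."""
    return all(sentence in text for sentence in summary)
```

Passing the request text and the "summary" list from the response is enough; a result of True supports the extraction hypothesis for that input, but it does not prove it in general.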

Try removing punctuation

To find out what role punctuation plays in summpy, I deliberately removed all punctuation.

Original

At this age, I'm interested in people's feelings and emotional thinking.

A few years ago, I met "Adler Psychology," which is also in this title.

Engineers are not good at communicating with people because they concentrate on the specialized work of programming, and I think it is sometimes said that they are not good at joining the circle of teams.

Also, I'm worried about relationships and suffering from depression....I think that there are many cases of

I personally feel that "Adler Psychology" is the idea itself for solving such problems of human relationships.

This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, who is actually an engineer in the field.

By the way, it will be quite a long sentence
I hope you will see it with the intention of reading a little book.

Execution result

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, I'm interested in people's feelings and emotional thinking. I met a few years ago in "Adler Psychology," which is also in this title. Engineers specialize in programming. I'm not good at communicating with people because I concentrate on my work. I think it's sometimes said that I'm not good at joining the circle of teams. Also, I'm worried about relationships and suffering from depression....I think that there are many cases of this, and I feel that "Adler Psychology" is the idea itself to solve such problems of human relationships. I would like to write an article about what it would be like if I interpret it from an engineer's point of view. By the way, it will be quite a long sentence. I hope you will read a little book."
  ]
}

The whole text came back as a single line. Apparently, summpy treats punctuation as the sentence break.

Adjust the number of summary lines

So what happens if we increase the number of summary lines? I summarized the original text (with its punctuation) with the limit set to 100 lines.

Execution result

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, people become more interested in feelings, feelings, and ways of thinking.",
    "Meanwhile, I came across "Adler Psychology" in this title a few years ago.",
    "Engineers sometimes concentrate on the specialized work of programming, and it is often said that they are not good at communicating with people and that they are not good at joining the circle of teams.",
    "Also, I'm worried about relationships and suffering from depression....I think that there are many cases of this.",
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field.",
    "By the way, it is quite a long sentence.",
    "I hope you will see it with the intention of reading a small book."
  ]
}

The output is the original text unchanged. The comma (,) is not treated as a sentence break; sentences are split at the kuten (the Japanese full stop, rendered here as a period). It also seems to split at periods (.), question marks (?), and exclamation marks (!).
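
The splitting behavior observed here can be imitated with a regular expression. This is a sketch of the observed behavior only, not summpy's actual splitter. The delimiter rule is an assumption: a sentence ends at the Japanese full stop (U+3002), or at ".", "?" or "!" when followed by whitespace or the end of the text, which keeps a run such as "depression....I think" in one piece, as in the output above.

```python
import re

# Assumed boundary rule (not taken from summpy's source):
# the Japanese full stop, or [.?!] followed by whitespace / end of text.
_BOUNDARY = re.compile(u'(\u3002|[.?!]+(?=\\s|$))')


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    pieces = _BOUNDARY.split(text)
    sentences = []
    # pieces alternates: sentence body (even index), delimiter (odd index).
    for i in range(0, len(pieces), 2):
        end = pieces[i + 1] if i + 1 < len(pieces) else u''
        sentence = (pieces[i] + end).strip()
        if sentence:
            sentences.append(sentence)
    return sentences
```

Commas never appear in the boundary pattern, so a sentence with several commas stays whole.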

Now, let's gradually reduce the number of lines.

Execution result (7 lines)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, people become more interested in feelings, feelings, and ways of thinking.",
    "Meanwhile, I came across "Adler Psychology" in this title a few years ago.",
    "Engineers sometimes concentrate on the specialized work of programming, and it is often said that they are not good at communicating with people and that they are not good at joining the circle of teams.",
    "Also, I'm worried about relationships and suffering from depression....I think that there are many cases of this.",
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field.",
    "By the way, it is quite a long sentence."
  ]
}

Execution result (6 lines)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, people become more interested in feelings, feelings, and ways of thinking.",
    "Meanwhile, I came across "Adler Psychology" in this title a few years ago.",
    "Also, I'm worried about relationships and suffering from depression....I think that there are many cases of this.",
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field.",
    "By the way, it is quite a long sentence."
  ]
}

Execution result (5 lines)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, people become more interested in feelings, feelings, and ways of thinking.",
    "Meanwhile, I came across "Adler Psychology" in this title a few years ago.",
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field.",
    "By the way, it is quite a long sentence."
  ]
}

Execution result (4 lines)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, people become more interested in feelings, feelings, and ways of thinking.",
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field.",
    "By the way, it is quite a long sentence."
  ]
}

Execution result (3 lines)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "At this age, people become more interested in feelings, feelings, and ways of thinking.",
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field."
  ]
}

Execution result (2 lines)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "I personally feel that "Adler Psychology" is the way of thinking and thought itself to solve such problems of human relationships.",
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field."
  ]
}

Execution result (1 line)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "This time, I would like to write an article about what it would be like to interpret such "Adler psychology" from the perspective of an engineer, as I am an engineer in the field."
  ]
}

As the limit decreases, the sentences judged least important are dropped one by one. I don't know the exact criteria, but the behavior appears to be:

  1. Split the text into sentences at punctuation
  2. Output the N most important sentences, where N is the specified number of lines

I think this reading is correct.

Looking at the summary results, I personally think the three-line summary is the easiest to read and the most to the point. However, as the amount of text grows, three lines may no longer be enough to follow the story. The number of lines needs to be tuned to the amount of text.
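
The second step can be sketched as follows, assuming each sentence already has an importance score. The scores and the function name `select_top` are hypothetical; how summpy actually computes importance is not covered here. The sketch keeps the selected sentences in their original order, which matches the outputs above.

```python
def select_top(sentences, scores, limit):
    """Keep the `limit` highest-scoring sentences, in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:limit])  # restore document order
    return [sentences[i] for i in keep]
```

A limit larger than the number of sentences simply returns the whole text, which is consistent with the 100-line result.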

Summarize disorganized sentences

To see what happens when unconnected sentences are summarized, let's summarize the table of contents of the article above.

Original

Reference book
Premise
1.People can change
1-1.There is no trauma
1-2.Don't be afraid to get hurt
1-3.Harm that occurs when the feeling of inferiority becomes too strong
1-4.Accept self
2.Separation of issues
2-1.You don't have to meet the expectations of others
2-2.Don't step into the challenges of others
2-3.Separation of issues
3.How to interact with others
3-1.Do not compete with others
3-2.Admit non-defeat = not lose
4.About raising people
4-1 Don't be scolded, don't praise
4-2 Thank you, not evaluate
5.Community sense
Finally

Execution result (3 lines)

$ python ./summpy_test.py
{
  "debug_info": {},
  "summary": [
    "2-1.You don't have to meet the expectations of others.",
    "2-2.Don't step into the challenges of others.",
    "3.How to relate to others."
  ]
}

The input has little context to begin with, so it is natural that the summary is also disorganized. Still, it is interesting that not only the top-level headings but also second-level ones (2-1 and 2-2) were selected.

Let's take a look inside the API

I wondered what logic produces the summary, so I took a quick look at the API's source code on GitHub.

The core summarization logic is probably here: https://github.com/recruit-tech/summpy/blob/master/summpy/lexrank.py

It uses DictVectorizer and pairwise_distances, so it appears to split the text into sentences, extract features from each one, compute a distance matrix over those features, and score the sentences from the result.
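
The file name lexrank.py suggests a LexRank-style method, in which sentences that are similar to many other sentences receive a high score. The following is a pure-Python sketch of that general idea, not summpy's code: it swaps in whitespace tokenization and a hand-written cosine similarity where summpy uses its own tokenizer, DictVectorizer and pairwise_distances, and the `damping` and `iterations` values are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank-style iteration over a similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Similarity between every pair of distinct sentences.
    sim = [[0.0 if i == j else cosine(bags[i], bags[j]) for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to sim[j][i].
            incoming = sum(sim[j][i] / row_sums[j] * scores[j]
                           for j in range(n) if row_sums[j] > 0)
            updated.append((1.0 - damping) / n + damping * incoming)
        scores = updated
    return scores
```

A sentence that shares no words with any other sentence receives only the baseline score, which fits the observation that isolated remarks are the first to be dropped.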

Summary

- summpy does not rewrite the text. It only splits the original text at punctuation marks and extracts the sentences it judges most important.
- As long as the number of lines is balanced against the amount of text, the summary comes out reasonably well (though I admit that is a vague way to put it ...).
- Summaries of disorganized text are no good. For example, if the text contains bulleted information such as a table of contents, it seems better to remove it first.
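
Removing list-like lines before summarizing could look like the sketch below. The rule is a rough heuristic of my own, not something summpy provides: a line is dropped if it starts with a section number such as "1." or "2-3" or with a bullet mark. It will miss unnumbered headings and may drop ordinary sentences that happen to start with a number.

```python
import re

# Heuristic (an assumption): section numbers like "1." / "2-3 " or a bullet mark.
_LIST_LINE = re.compile(r'^\s*(\d+(-\d+)*[.\s]|[-*])')


def drop_list_lines(raw):
    """Remove lines that look like table-of-contents entries or bullets."""
    kept = [line for line in raw.splitlines() if not _LIST_LINE.match(line)]
    return '\n'.join(kept)
```

The filter works on the text while it still has its line breaks, so it has to run before the line breaks are removed.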

At the end

Thank you for reading to the end.

If you have a request such as "What happens if you summarize this text?" or "Please summarize this text!", I would be grateful if you left a comment.
