[PYTHON] Let's summarize Donald Trump's speech in three lines with the automatic summarization algorithm LexRank

This is the article for day 15 of the Nextremer Advent Calendar 2016.

2016 is almost over, and it was quite a year for news. Wasn't Donald Trump's victory in the presidential election one of the most impactful events?

I often see him on TV, but I have never properly listened to one of his speeches, so I decided to read one. However, since I am not good at English, I will use a summarization algorithm to boil it down to about three lines.

Introduction

There are two main approaches to summarization: generative (abstractive) summarization, which produces new sentences, and extractive summarization, which selects important sentences from the original text.

Generative summarization is very difficult, and most current research focuses on the extractive kind. LexRank, implemented here, is also an extractive method: it extracts the important sentences from a document.

What is LexRank?

LexRank is a summarization algorithm inspired by PageRank. PageRank is the algorithm devised by Google founders L. Page and S. Brin to determine the importance of web pages. It determines a page's importance by modeling the link structure of the web as the following graph.

Node: web page
Edge: link
Node importance:
	1. Pages linked from many pages are important pages
	2. Pages linked from important pages are important pages
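
As a toy illustration (my own sketch, not from the original article), these importance scores can be computed by power iteration over the link graph; the same power method reappears in the LexRank implementation later in this article.

import numpy

# Row-stochastic link matrix for three pages:
# A links to B and C, B links to C, C links to A.
M = numpy.array([
    [0.0, 0.5, 0.5],  # A -> B, A -> C
    [0.0, 0.0, 1.0],  # B -> C
    [1.0, 0.0, 0.0],  # C -> A
])

p = numpy.ones(3) / 3  # start from a uniform distribution over pages
for _ in range(50):
    p = numpy.dot(M.T, p)  # one step of the random walk over links

print(p)  # converges to about [0.4, 0.2, 0.4]:
          # C is linked from two pages, and A from the "important" page C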

Like PageRank, LexRank also captures sentences as a graph.

Node: sentence
Edge: similarity between two sentences (1 or 0)
Node importance:
	1. Sentences that are similar to many sentences are important sentences
	2. Sentences that are similar to important sentences are important sentences

The similarity here is simply a measure of how many words the two sentences have in common.
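
For example, with two tokenized sentences, counting the common words looks like this:

s1 = ["we", "will", "make", "america", "great"]
s2 = ["america", "will", "be", "great", "again"]
print(set(s1) & set(s2))  # {'america', 'will', 'great'}: 3 words in common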

LexRank evaluates importance using only the graph structure, yet when it was proposed in 2004 it achieved the best accuracy on the DUC 2004 benchmark. It is very interesting that it can summarize with high accuracy without any semantic analysis.

The LexRank algorithm is as follows.

Input: Array S of n sentences, cosine similarity threshold t
Output: Array L of LexRank scores for each sentence

Array CosineMatrix[n][n];
Array Degree[n];
Array L[n];

for i <- 1 to n do
	for j <- 1 to n do
		CosineMatrix[i][j] = idf-modified-cosine(S[i], S[j]);
		if CosineMatrix[i][j] > t then
			CosineMatrix[i][j] = 1;
			Degree[i]++;
		end
		else
			CosineMatrix[i][j] = 0;
		end
	end
end
for i <- 1 to n do
	for j <- 1 to n do
		CosineMatrix[i][j] = CosineMatrix[i][j] / Degree[i]
	end
end

L = PowerMethod(CosineMatrix, n, ε);

return L

(Quote: "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" Alogorithm 3)

`idf-modified-cosine` computes the similarity between two sentences, and `PowerMethod` performs the PageRank-style score calculation.

Now, let's actually implement this algorithm in Python.

Implementation

import math
import numpy

def lex_rank(sentences, n, t):
    """
    Summarize a document with LexRank.
    @param  sentences: list
        Sentences (word lists like [[w1, w2, w3], [w1, w3, w4, w5], ...])
    @param  n: int
        Number of sentences in the document
    @param  t: float
        Cosine similarity threshold (e.g. 0.1)
    @return : zip
        Pairs of (sentence, LexRank score)
    """
    cosine_matrix = numpy.zeros((n, n))
    degrees = numpy.zeros((n,))

    # 1. Creating an adjacency matrix
    for i in range(n):
        for j in range(n):
            cosine_matrix[i][j] = idf_modified_cosine(sentences, sentences[i], sentences[j])
            if cosine_matrix[i][j] > t:
                cosine_matrix[i][j] = 1
                degrees[i] += 1
            else:
                cosine_matrix[i][j] = 0

    # 2. LexRank calculation: normalize each row by its degree
    #    to turn the adjacency matrix into a stochastic matrix
    for i in range(n):
        for j in range(n):
            cosine_matrix[i][j] = cosine_matrix[i][j] / degrees[i]

    # ε (convergence tolerance) is the pseudocode's ε; the value here is an assumption
    ratings = power_method(cosine_matrix, n, 1e-4)

    return zip(sentences, ratings)

This program is divided into two blocks:

  1. Creating an adjacency matrix
  2. LexRank calculation

Let's look at each block in turn.

1. Creating an adjacency matrix

The adjacency matrix is a matrix representation of a graph: 1 is assigned if there is an edge between two nodes, and 0 if there is not. In LexRank, two sentences are connected by an edge when their similarity exceeds the threshold t.

The implementation of similarity is as follows.

def idf_modified_cosine(sentences, sentence1, sentence2):
    """
    Calculate the cosine similarity between two sentences.
    @param  sentences: list
        All sentences (word lists like [[w1, w2, w3], [w1, w3, w4, w5], ...])
    @param  sentence1: list
        Sentence 1 (word list like [w1, w2, w3])
    @param  sentence2: list
        Sentence 2 (word list like [w1, w2, w3])
    @return : float
        Cosine similarity
    """
    tf1 = compute_tf(sentence1)
    tf2 = compute_tf(sentence2)
    idf_metrics = compute_idf(sentences)
    return cosine_similarity(sentence1, sentence2, tf1, tf2, idf_metrics)

TF is the frequency of a word within a sentence (the higher it is, the more important the word). IDF is based on the frequency of a word across the entire document (the rarer the word, the more important it is). Cosine similarity is the cosine of the angle between the two sentence vectors, used as their similarity.
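
For reference, the idf-modified-cosine defined in the LexRank paper, which the code below implements, is

$$\mathrm{idf\text{-}modified\text{-}cosine}(x,y)=\frac{\sum_{w\in x,y}\mathrm{tf}_{w,x}\,\mathrm{tf}_{w,y}\,(\mathrm{idf}_w)^2}{\sqrt{\sum_{x_i\in x}(\mathrm{tf}_{x_i,x}\,\mathrm{idf}_{x_i})^2}\times\sqrt{\sum_{y_i\in y}(\mathrm{tf}_{y_i,y}\,\mathrm{idf}_{y_i})^2}}$$

where tf_{w,s} is the frequency of word w in sentence s and idf_w is the IDF value of w.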

I think many people are already familiar with these, so I will just post the implementations briefly.

from collections import Counter


def compute_tf(sentence):
    """
    Calculate TF values.
    @param  sentence: list
        Sentence (word list like [w1, w2, w3])
    @return : dict
        Word -> TF value
    """
    tf_values = Counter(sentence)
    tf_metrics = {}

    # Normalize each raw count by the maximum count in the sentence
    max_tf = find_tf_max(tf_values)

    for term, tf in tf_values.items():
        tf_metrics[term] = tf / max_tf

    return tf_metrics


def find_tf_max(terms):
    """
    Find the highest word frequency in a sentence.
    @param  terms: Counter
        Word -> frequency counts
    @return : int
        Maximum word frequency
    """
    return max(terms.values()) if terms else 1


def compute_idf(sentences):
    """
    Calculate the IDF value of each word in the document.
    @param  sentences: list
        All sentences (word lists like [[w1, w2, w3], [w1, w3, w4, w5], ...])
    @return : dict
        Word -> IDF value
    """
    idf_metrics = {}
    sentences_count = len(sentences)

    for sentence in sentences:
        for term in sentence:
            if term not in idf_metrics:
                # n_j: the number of sentences containing the term
                n_j = sum(1 for s in sentences if term in s)
                idf_metrics[term] = math.log(sentences_count / (1 + n_j))

    return idf_metrics
    
    
def cosine_similarity(sentence1, sentence2, tf1, tf2, idf_metrics):
    """
    Calculate the idf-modified cosine similarity.
    @param  sentence1: list
        Sentence 1 (word list like [w1, w2, w3])
    @param  sentence2: list
        Sentence 2 (word list like [w1, w2, w3])
    @param  tf1: dict
        TF values of sentence 1
    @param  tf2: dict
        TF values of sentence 2
    @param  idf_metrics: dict
        IDF values of the document
    @return : float
        Cosine similarity
    """
    unique_words1 = set(sentence1)
    unique_words2 = set(sentence2)
    common_words = unique_words1 & unique_words2

    numerator = sum((tf1[t] * tf2[t] * idf_metrics[t] ** 2) for t in common_words)
    denominator1 = sum((tf1[t] * idf_metrics[t]) ** 2 for t in unique_words1)
    denominator2 = sum((tf2[t] * idf_metrics[t]) ** 2 for t in unique_words2)

    if denominator1 > 0 and denominator2 > 0:
        return numerator / (math.sqrt(denominator1) * math.sqrt(denominator2))
    else:
        return 0.0

Finally, 1 is substituted wherever the cosine similarity exceeds t, completing the adjacency matrix.

Next, we will find the LexRank based on the matrix created here.

2. LexRank calculation

From here, LexRank is calculated from the adjacency matrix built in step 1. First, each element of the adjacency matrix is divided by the node's degree, converting it into a stochastic matrix.

Then, based on this stochastic matrix, we find the stationary distribution of the Markov chain. Since the chain is aperiodic, convergence is guaranteed, so the power method repeatedly multiplies the transposed matrix by the current vector until the change between iterations drops below ε. The eigenvector computed this way is the LexRank score vector.
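
In equation form, for the stochastic matrix $M$ we are looking for the vector $p$ satisfying

$$p = M^{\top}p$$

which the power method approximates by iterating $p \leftarrow M^{\top}p$ until $\lVert p_{t+1} - p_t \rVert < \epsilon$.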

The implementation is as follows.

def power_method(cosine_matrix, n, e):
    """
    Run the power method.
    @param  cosine_matrix: numpy.ndarray
        Stochastic matrix
    @param  n: int
        Number of sentences in the document
    @param  e: float
        Convergence tolerance ε
    @return : numpy.ndarray
        LexRank scores
    """
    transposed_matrix = cosine_matrix.T
    sentences_count = n

    # Start from the uniform distribution
    p_vector = numpy.array([1.0 / sentences_count] * sentences_count)

    lambda_val = 1.0

    # Iterate until the change between successive vectors drops below ε
    while lambda_val > e:
        next_p = numpy.dot(transposed_matrix, p_vector)
        lambda_val = numpy.linalg.norm(numpy.subtract(next_p, p_vector))
        p_vector = next_p

    return p_vector

This completes the importance calculation with LexRank. The larger a sentence's rating, the more important the sentence.
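
Since lex_rank returns (sentence, score) pairs, building a three-line summary is just a matter of sorting. A minimal usage sketch (my own; the article itself does not show this step, and tokenized_sentences stands for the speech already split into word lists, as in the preprocessing sketch below):

scored = list(lex_rank(tokenized_sentences, len(tokenized_sentences), 0.1))
top3 = sorted(scored, key=lambda pair: pair[1], reverse=True)[:3]
for sentence, score in top3:
    print(" ".join(sentence))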

Summary result

Now, let's actually run the summarization on [Mr. Trump's victory speech from the presidential election](http://www.vox.com/policy-and-politics/2016/11/9/13569124/donald-trump-wins-2016-presidential-election-victory-speech-transcript).
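
The raw transcript first needs to be split into sentences and word lists. The article does not show this preprocessing step, but a minimal sketch using nltk (an assumed choice of tokenizer) could look like this:

import nltk  # requires: pip install nltk; nltk.download('punkt')

def tokenize(text):
    # Split the text into sentences, then each sentence into lowercase words
    return [nltk.word_tokenize(s.lower()) for s in nltk.sent_tokenize(text)]

tokenized_sentences = tokenize(speech_text)  # speech_text: the raw transcript string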

# lex_rank(document)
Great people.
Great brothers, sisters, great, unbelievable parents.
We're going to get to work immediately for the American people, and we're going to be doing a job that hopefully you will be so proud of your president.


Yes, you can really feel the enthusiasm: he is going to get right to work for the American people!

In closing

In this article, we implemented an automatic summarization algorithm and used it to summarize Mr. Trump's speech. LexRank seems useful for text analysis in general, since it captures sentences as a graph. Tomorrow I will continue writing about summarization + translation.

References

LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
miso-belica/sumy
Document Summarization PART-2: LEXRank (Modified Pagerank) based Document Summarization
Random walk on graph and Google page rank
Automatic summarization (Wikipedia)
Automatic text summarization method using natural language processing
