[PYTHON] Why do eigenvectors appear in LexRank for document summarization and in Google's PageRank algorithm?

In this article, I explain the procedure of **LexRank**, a document summarization algorithm (a method for extracting important sentences) that applies the PageRank method originally used by Google to measure the importance of each website, and why the eigenvectors and eigenvalue decomposition of linear algebra appear in it.

Introduction

On August 1, 2020, the "Deep Learning Digital Conference" will be held.

https://dllab.connpass.com/event/178714/

In the individual sessions of the above event, I will give an LT titled **"I tried analyzing the free-comment section of the G test passers' questionnaire by making full use of AI and deep learning"** together with my colleague Mr. Mitarai, who joined the company mid-career in February.

(Overview) We will introduce the results of analyzing, with machine learning and deep learning, the opinions received in the free-comment sections of the questionnaire conducted in May 2020 for those who passed the 2nd G test of the year, such as "AI/DL books recommended by passers" and "Opinions and requests to the Deep Learning Association". In this presentation, we give an overview of natural language processing techniques such as word clouds, explainable AI (XAI), clustering, summarization, and ALBERT, and show what kinds of analysis results were obtained when these techniques were actually applied to the G test passers' "opinions and requests to the Deep Learning Association" data.

Since I (Ogawa) am a member of the Japan Deep Learning Association, I need to analyze the free comments of the questionnaire, report the opinions to the management team, and use them for kaizen (improvement). Analyzing free comments by hand is difficult.

Therefore, **let's make full use of machine learning, like a deep learning engineer should!** (Although, of course, I did also sort through them manually...)

LexRank

During the analysis of the free comments, there was a step where documents needed to be automatically summarized.

As the summarization method, I decided to use the **LexRank** algorithm, which is a simple approach.

LexRank is a document summarization (important-sentence extraction) algorithm that applies the PageRank method used by early Google to measure the importance of each website.

The following are examples of explanations of LexRank:

https://ohke.hateblo.jp/entry/2018/11/17/230000

https://www.ai-shift.co.jp/techblog/938

Mr. Mitarai joined the company mid-career in February, right in the middle of the COVID-19 situation, so we have been pair programming from home the whole time.

Although I call it pair programming, in practice I tell him to do his best while I do other work, and he asks me whenever something is unclear. So rather than pair programming, it is more like a setup where he can always ask me questions...

He implemented LexRank as well, and while we were preparing the presentation slides, he asked me three questions:

**[A] Why does eigenvalue decomposition appear in the algorithm of LexRank, which applies the PageRank method of measuring the importance of each website?**

**[B] The importance of each sentence, which corresponds to PageRank, is given by an eigenvector, but the eigenvector of which eigenvalue is it?**

**[C] Why does the eigenvector correspond to PageRank, i.e., the importance of each sentence, in LexRank?**

My explanation was as follows.

Why eigenvalue decomposition appears in LexRank for document summarization and in Google's PageRank algorithm

This is a matter of linear algebra: eigenvalues, eigenvectors, and eigenvalue decomposition.

First of all, please read the following site.

Why Learn Linear Algebra? Concept of eigenvalues and eigenvectors used in Google's PageRank https://math-fun.net/20180809/1195/

The information on the above site alone cannot (I think) answer questions [A] to [C], so I supplement it as follows.

[1] PageRank is the eigenvector with eigenvalue 1 of the matrix A that represents the transition probabilities between pages.

[2] Since it is an eigenvector with eigenvalue 1, even if that vector is multiplied by the transition probability matrix A again, the result is a vector with the same values.

In other words, `vec_after = A @ eigen_vec`, and `vec_after` is identical to `eigen_vec`: because the eigenvalue is 1, multiplying by the eigenvalue changes nothing.
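The statement in [2] can be checked numerically. The 3x3 transition matrix below is a made-up example (not from the article); each column sums to 1, and the eigenvector with eigenvalue 1 really is left unchanged when multiplied by A:

```python
import numpy as np

# Hypothetical 3-page transition matrix (illustrative, not from the article).
# A[i, j] = probability of jumping from page j to page i; each column sums to 1.
A = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])

# Eigenvalue decomposition of A.
eigenvalues, eigenvectors = np.linalg.eig(A)

# Select the eigenvector whose eigenvalue is (numerically) 1.
idx = np.argmin(np.abs(eigenvalues - 1.0))
eigen_vec = np.real(eigenvectors[:, idx])
eigen_vec = eigen_vec / eigen_vec.sum()  # normalize so the entries sum to 1

# [2] in action: multiplying by A again returns the same vector.
vec_after = A @ eigen_vec
print(np.allclose(vec_after, eigen_vec))  # True
```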

[3] In other words, each entry of the eigenvector with eigenvalue 1 indicates the "power" of the corresponding site. And the pattern of how power is distributed across all sites (that is, the eigenvector) does not change even after a transition with the transition probability matrix A. This state, in which site power is unchanged by transitions under A, represents the static power (PageRank) of each site.
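The "unchanged under transitions" picture in [3] can also be seen by starting from an arbitrary power distribution and applying A over and over (power iteration). Using the same made-up transition matrix, any starting distribution settles into the eigenvalue-1 eigenvector and then stops changing:

```python
import numpy as np

# Same hypothetical 3-page transition matrix (each column sums to 1).
A = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])

# Start from an arbitrary, very uneven distribution of "site power".
v = np.array([0.9, 0.05, 0.05])

# Apply the transition matrix repeatedly (power iteration).
for _ in range(50):
    v = A @ v

# The distribution has settled: one more transition changes nothing.
# This stationary vector is the eigenvalue-1 eigenvector, i.e., PageRank.
print(np.round(v, 3))         # -> [0.333 0.333 0.333]
print(np.allclose(A @ v, v))  # True
```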

[4] This [3] is the reason why the eigenvector with eigenvalue 1 corresponds to PageRank.

[5] This time it is LexRank, so the transition matrix between pages (the matrix built from backlinks) and the strength of each page in search (PageRank) are replaced, respectively, by the sentence-to-sentence similarity matrix and the strength (importance) of each sentence.
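To make [5] concrete, here is a minimal LexRank-style sketch on three toy sentences (my own made-up data, not the questionnaire). As an assumption for brevity, it uses plain bag-of-words cosine similarity rather than the TF-IDF cosine of the original LexRank, and mixes in a PageRank-style damping factor so the eigenvalue-1 eigenvector is unique; the entries of that eigenvector are then the importance scores of the sentences:

```python
import numpy as np

# Toy sentences (made-up illustrative data).
sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "the mat was red",
]

# Bag-of-words count vectors (LexRank proper uses TF-IDF; raw counts are
# enough to demonstrate the eigenvector step).
vocab = sorted({w for s in sentences for w in s.split()})
counts = np.array([[s.split().count(w) for w in vocab] for s in sentences],
                  dtype=float)

# Cosine similarity matrix between sentences: the LexRank analogue of the
# page-to-page transition matrix.
norms = np.linalg.norm(counts, axis=1)
S = counts @ counts.T / np.outer(norms, norms)

# Normalize columns so the matrix is column-stochastic, then add a damping
# factor (as in PageRank) so the eigenvalue-1 eigenvector is unique.
P = S / S.sum(axis=0, keepdims=True)
d = 0.15
M = d / len(sentences) + (1 - d) * P

# The importance of each sentence = the eigenvector with eigenvalue 1.
eigenvalues, eigenvectors = np.linalg.eig(M)
idx = np.argmin(np.abs(eigenvalues - 1.0))
scores = np.real(eigenvectors[:, idx])
scores = scores / scores.sum()

# Sentence 0 overlaps with both of the others, so it scores highest.
print(scores.argmax())  # -> 0
```

Extracting a summary is then just taking the top-scoring sentences in order of `scores`.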

That is why eigenvectors and eigenvalue decomposition appear in LexRank, and the above answers the original three questions:

[A] Why does eigenvalue decomposition appear in the algorithm of LexRank, which applies the PageRank method of measuring the importance of each website?

[B] The importance of each sentence, which corresponds to PageRank, is given by an eigenvector, but the eigenvector of which eigenvalue is it?

[C] Why does the eigenvector correspond to PageRank, i.e., the importance of each sentence, in LexRank?

Summary

As described above, this article explained why the eigenvectors and eigenvalue decomposition of linear algebra appear in the algorithm of LexRank, a document summarization (important-sentence extraction) method that applies the PageRank method used by early Google to measure the importance of each website.

Recently, I have been posting information about AI, business, and management on Twitter, such as articles and sites I found interesting and impressions of books I have read.

Yutaro Ogawa @ISID_AI_team https://twitter.com/ISID_AI_team

I share information that I found interesting or important. I would be glad if you also made use of this account.


[Remarks] The AI Technology Department development team, which I lead, is looking for members. If you are interested, please click here

[Disclaimer] The content of this article itself is the opinion / transmission of the author, not the official opinion of the company to which the author belongs.

