Summarizing product value on an EC site using the automatic summarization algorithm LexRank

Introduction

Research and development of artificial intelligence has boomed in recent years, with results in many different fields. Automatic document summarization also belongs to AI in the broad sense, and technically to the field of natural language processing. Document summaries are often used for the electronic bulletin boards on the Shinkansen and for web news headlines, but I don't think their use is limited to well-formed documents such as news.

In this article, I would like to use the document summarization algorithm LexRank to summarize product reviews on an EC site (Rakuten) and quickly visualize typical buyer opinions.

LexRank

LexRank is a document summarization algorithm proposed by Erkan et al., based on the concept of PageRank.

Original paper: LexRank: Graph-based Lexical Centrality as Salience in Text Summarization

I will omit the details, but LexRank rests on just two key ideas:

- A sentence that is similar to many sentences is an important sentence.
- A sentence that is similar to an important sentence is an important sentence.

The graph below illustrates these two ideas. (dXsY: the Y-th sentence of document X)

Figure: Example of a similarity graph (taken from [Erkan 04])

Nodes with many thick edges are more important, and nodes that receive edges from important nodes are also more important.
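As a minimal numerical sketch of these two ideas (the similarity values and the threshold of 0.3 below are my own illustration, not taken from the paper), cosine similarities between sentences are thresholded into an adjacency matrix, and a sentence's degree in that graph already gives a rough centrality:


import numpy as np

# Toy cosine-similarity matrix for 3 sentences (values made up for illustration)
sim = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

threshold = 0.3  # assumed threshold
adjacency = (sim > threshold).astype(float)

# Degree centrality: sentences connected to many others are more important
degree = adjacency.sum(axis=1)
print(degree)  # -> [2. 2. 1.]: sentences 0 and 1 are more central than sentence 2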

EC site review summary

When using Amazon or Rakuten, I often feel the following:

- There are so many reviews that the product seems trustworthy, but I have no idea which review to read!
- You can vote on whether a review was helpful, but the helpful reviews tend to be long...

Therefore, the goal this time is to **output an n-sentence summary that covers what everyone is reviewing, from a huge number of review sentences**. This should let you get a feel for the reviews with far less effort, whether there are hundreds of them or each one is long!

Dependencies

This script depends on the packages imported in the code below (numpy, scipy, and fasttext). All of them can be installed easily with pip install / conda install.

Implementation

The main LexRank code is shown below. The algorithm follows Algorithm 3 of the original paper.

lexrank.py


import numpy as np

import tfidf  # sentence vectorization and cosine similarity helpers (see the bonus section)


def lexrank(sentences, N, threshold, vectorizer):
    """Compute LexRank scores for N tokenized sentences (Algorithm 3 in the original paper)."""

    CosineMatrix = np.zeros([N, N])
    degree = np.zeros(N)

    # Vectorize the sentences with the chosen model
    if vectorizer == "tf-idf":
        vector = tfidf.compute_tfidf(sentences)
    elif vectorizer == "word2vec":
        vector = tfidf.compute_word2vec(sentences)

    # Computing Adjacency Matrix: connect sentence pairs whose cosine similarity exceeds the threshold
    for i in range(N):
        for j in range(N):
            CosineMatrix[i, j] = tfidf.compute_cosine(vector[i], vector[j])
            if CosineMatrix[i, j] > threshold:
                CosineMatrix[i, j] = 1
                degree[i] += 1
            else:
                CosineMatrix[i, j] = 0

    # Normalize each row by its degree so the matrix becomes a transition matrix
    for i in range(N):
        for j in range(N):
            CosineMatrix[i, j] = CosineMatrix[i, j] / degree[i]

    # Computing LexRank Score: the stationary distribution of the transition matrix
    L = PowerMethod(CosineMatrix, N, err_tol=10e-6)

    return L

The LexRank score L is the eigenvector of this transition matrix (its stationary distribution). Following the original paper, it is computed with the power method.

PowerMethod


def PowerMethod(CosineMatrix, N, err_tol):
    """Compute the principal eigenvector of the transition matrix by power iteration."""

    # Start from the uniform distribution
    p_old = np.array([1.0 / N] * N)
    err = 1

    # Iterate until the change between successive score vectors is small enough
    while err > err_tol:
        p = np.dot(CosineMatrix.T, p_old)
        err = np.linalg.norm(p - p_old)
        p_old = p

    return p
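As a rough usage sketch (the tokenized toy reviews, the threshold of 0.1, and the top-n selection below are my own illustration, not part of the original script), the scores returned by lexrank can be used to pick the n highest-ranked review sentences as the summary:


# Reviews are assumed to be already split into sentences and tokenized
# into word lists (e.g. with MeCab for Japanese text).
sentences = [
    ["高", "性能", "で", "驚き", "ました"],
    ["子供", "の", "プレゼント", "に", "購入"],
    ["画面", "が", "きれい", "です"],
]

scores = lexrank(sentences, N=len(sentences), threshold=0.1, vectorizer="tf-idf")

# Take the n highest-scoring sentences, in their original order, as the summary
n = 2
top_idx = sorted(np.argsort(scores)[::-1][:n])
summary = [sentences[i] for i in top_idx]
print(summary)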

Execution result

This time, I would like to summarize the Rakuten reviews of two game consoles. (For now, I will not say exactly which models they are.)

tf-idf model

Here is the summary from the model that builds the adjacency matrix by vectorizing sentences with tf-idf. Incidentally, the original paper also builds the adjacency matrix with tf-idf (word2vec did not exist at that time).

1: I had a video game from another manufacturer a long time ago, and I'm surprised that it has evolved so much! Not to mention the beautiful image, it has various functions, so I thought that this price could not be helped. This shop has a lot of points and coupons, so it was cheaper than I expected and it was good to buy it.
2: I bought it here.
3: My family bought it, but I was surprised at its high performance.

1: I ordered Nintendo New3DS for my brother's Christmas present. It arrived earlier than expected after I ordered it. Since it is a game machine, I thought that I would like the packaging to be a little tighter, but it is okay because there were no initial defects or scratches on the box (laugh) Also, if there is anything, thank you.
2: I had a 3DS, but the LCD broke and I repaired it once, but it broke again, and I had to buy a new 3DS. I bought LL because I thought it would be better to buy one with a larger screen next time.
3: We purchased it for a child's Christmas present.

You can see that the selling point of the PS4 is its high performance, and that the 3DS is often purchased as a gift for children. On the other hand, the first sentence of each summary is redundant, so the sentence segmentation probably needs more work. Also, the second sentence of the PS4 summary carries no particularly useful information, so ideally it should not be extracted for this task.

word2vec model

Here is the summary from the model that builds the adjacency matrix by vectorizing sentences with word2vec. The centroid of the pre-trained vectors of all the words in a sentence is used as the sentence vector.
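As a minimal sketch of this centroid computation (the tiny word-vector dictionary and the 3-dimensional vectors below are made up for illustration; the actual version, sent2vec, appears in the bonus section and uses 100-dimensional fastText vectors):


import numpy as np

# Hypothetical pre-trained word vectors (3-dimensional, for illustration only)
word_vectors = {
    "screen": np.array([0.2, 0.1, 0.7]),
    "clean": np.array([0.4, 0.3, 0.1]),
}

def sentence_centroid(words, vectors, dim=3):
    # Average the vectors of the words in the sentence; unknown words are skipped
    v = np.zeros(dim)
    for w in words:
        if w in vectors:
            v += vectors[w]
    return v / len(words)

print(sentence_centroid(["screen", "is", "clean"], word_vectors))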

1: I used points but felt like I lost something
2: Can be used without problems ♪
3: The screen on the TV is pretty clean.

1: For children, it would be nice if there was a foldable type like the 3DS, with a simple function like the 2DS that came out a while ago, and a little cheaper.
2: I am glad that you shipped it immediately.
3: It is a replacement for LL in a hurry.

Compared to the tf-idf summary, this one gives the impression of being more concise. However, some of the extracted sentences are not reviews of the product itself. Since sentences that are semantically similar to many other sentences get high scores, many of the outputs are plausible in a sense, but some additional ingenuity seems necessary for this task.

Summary

This time, we summarized EC-site reviews with the automatic summarization algorithm LexRank. By summarizing a large number of reviews, it seems possible to grasp the characteristics of a product efficiently. On the other hand, I felt that more work is needed on sentence segmentation and on reducing redundancy; I would like to tackle these points next time.

Bonus

Below is the code for the vector computations (tf-idf, word2vec) used to calculate the similarity of each sentence. This script is called from lexrank.py (vector = tfidf.compute_tfidf(sentences)), so you can use it simply by adding import tfidf to lexrank.py. The word-vector model loaded by the word2vec model ('../models/wiki_sg_d100.bin') was trained on the full text of Japanese Wikipedia.

tfidf.py


#!/usr/bin/env python                                                                                                                                                    
# -*- coding: utf-8 -*-                                                                                                                                                  

import sys
import numpy as np
import fasttext as ft
from scipy.spatial import distance


def word2id(bow, word_id):
    # Assign a sequential integer id to each unseen word
    for w in bow:
        if w not in word_id:
            word_id[w] = len(word_id)

    return word_id

def compute_tf(sentences, word_id):
    # Term frequency: how many times each word appears in each sentence
    tf = np.zeros([len(sentences), len(word_id)])

    for i in range(len(sentences)):
        for w in sentences[i]:
            tf[i][word_id[w]] += 1

    return tf

def compute_df(sentences, word_id):
    # Document frequency: number of sentences in which each word appears
    df = np.zeros(len(word_id))

    for i in range(len(sentences)):
        exist = {}
        for w in sentences[i]:
            if w not in exist:
                df[word_id[w]] += 1
                exist[w] = 1

    return df

def compute_idf(sentences, word_id):

    idf = np.zeros(len(word_id))
    df = compute_df(sentences, word_id)

    for i in range(len(df)):
        idf[i] = np.log(len(sentences)/df[i]) + 1

    return idf

def compute_tfidf(sentences):

    word_id = {}

    for sent in sentences:
        word_id = word2id(sent, word_id)

    tf = compute_tf(sentences, word_id)
    idf = compute_idf(sentences, word_id)

    tf_idf = np.zeros([len(sentences), len(word_id)])

    for i in range(len(sentences)):
        tf_idf[i] = tf[i] * idf

    return tf_idf

def compute_cosine(v1, v2):

    return 1 - distance.cosine(v1, v2)
    
def sent2vec(bow, model_w):
    # Sentence vector: centroid of the word vectors of the words in the sentence
    vector = np.zeros(100)
    N = len(bow)

    for b in bow:
        try:
            vector += model_w[b]
        except Exception:
            # Words missing from the word-vector model are skipped
            continue

    vector = vector / float(N)

    return vector

def compute_word2vec(sentences):

    model_w = ft.load_model('../models/wiki_sg_d100.bin')
    vector = np.zeros([len(sentences), 100])

    for i in range(len(sentences)):
        vector[i] = sent2vec(sentences[i], model_w)

    return vector
 
if __name__ == "__main__":

    pass       
