For college students who don't want to read English papers (Python)

I want to enjoy reading English papers!!!

My English is too weak to get through a journal paper, so I looked for a way to make it more enjoyable. What I came up with was to generate something like a summary and then translate the English text...

Environment: Google Colaboratory

What is Colaboratory?

Colaboratory is a Jupyter notebook environment that runs entirely in the cloud. It requires no setup and can be used free of charge. With Colaboratory you can write and execute code, store and share analyses, and access powerful computing resources, all for free from your browser.

Procedure

  1. Extract only the body text from the paper (remove the title, chapter names, figures and tables, and references)
  2. Apply LexRank to the text of each chapter and extract the key sentences
  3. Combine the key sentences from each chapter and use a language model to generate sentences close to a summary

About LexRank

Sentence summarization can be broadly divided into extractive and abstractive (generative) approaches. LexRank is a summarization algorithm classified as extractive. It builds a graph structure from a document, ranks the important sentences, and outputs the sentences that can serve as a summary. It was proposed by Gunes Erkan and Dragomir R. Radev in 2004.

From the abstract of the LexRank paper:

We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences.

(Figure: sentence similarity graph, taken from Figure 2 of the LexRank paper)

LexRank is explained in detail in ohke's article:

LexRank is a derivative of TextRank (proposal paper PDF), which was itself inspired by PageRank, and it has two key points:

  1. Create an undirected graph with sentences as nodes and the similarities between sentences as edges. In the proposal paper the similarity is the cosine similarity of TF-IDF vectors (nowadays word2vec and the like could also be used).
  2. Iterate until the transition probability matrix (M) and the probability vector (P) obtained from this graph become stable (MP = P), and select the sentences with the largest final probability values as the summary.

In the figure above (taken from Figure 2 of the proposal paper), which visualizes this, d5s1 and d4s1, which have many edges (= are similar to many sentences) and thick edges (= high similarity), are good summary candidates.
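
As a minimal sketch of this idea (not the sumy implementation used below; it assumes scikit-learn is available and uses the thresholded variant with a damping factor), the score computation looks roughly like this:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_scores(sentences, threshold=0.1, damping=0.85, tol=1e-6):
    # cosine similarity between TF-IDF vectors of the sentences
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    # keep only sufficiently similar pairs as edges of the graph
    adj = (sim >= threshold).astype(float)
    # normalize rows to get the transition probability matrix M
    M = adj / adj.sum(axis=1, keepdims=True)
    n = len(sentences)
    p = np.ones(n) / n
    # power iteration until the probability vector is stable (MP = P)
    while True:
        p_next = damping * M.T @ p + (1 - damping) / n
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

sents = ["The cloud offers virtually unlimited storage and processing.",
         "IoT devices produce large amounts of heterogeneous data.",
         "Cloud resources can compensate for the constraints of IoT devices."]
scores = lexrank_scores(sents)
print(sorted(zip(scores, sents), reverse=True)[0][1])  # highest-ranked sentence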

Implementation

The library imports are as follows. We use the libraries needed to run LexRank (via sumy), tokenize the text, and build the LSTM language model.

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
from googletrans import Translator
import nltk
import numpy as np
import random
import sys
import io
import os
import glob

Supplement: when using nltk, you first need to download the punkt tokenizer package:

!python -c "import nltk; nltk.download('punkt')"

LexRank part (implementation using sumy)

#Number of sentences to extract from each section (choose as needed; 5 is just an example value)
SENTENCES_COUNT = 5

def sectionLex():
  #Language is set to English
  LANG = "english"
  #Select all .txt files (the body text of each section)
  files = glob.glob('*.txt')
  ex = []
  for path in files:
    parser = PlaintextParser.from_file(path, Tokenizer(LANG))
    stemmer = Stemmer(LANG)
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANG)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
      ex.append(str(sentence) + '\n')
  #Output with utf-8 encoding
  with open('output.txt', mode='w', encoding='utf-8') as f:
    f.writelines(ex)
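
For example, after placing the per-section .txt files in the working directory, the extraction is run simply as:

#Assumes the section .txt files are in the current directory
sectionLex()  # writes the extracted key sentences to output.txt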

Declaration of variables used in the dictionary

#Dictionary mapping each word to its index
chr_index = {}
#Reverse dictionary mapping each index back to its word
rvs_index = {}
#List of input substrings
sentences = []
#The word that follows each substring
next_word = []

Tokenization (word segmentation)

#Read with utf-8 encoding and store in text
with io.open('output.txt', encoding='utf-8') as f:
    text = f.read().lower()

#Break the text into words (tokenize)
text = nltk.word_tokenize(text)
#Vocabulary: the list of unique words
chars = sorted(set(text))

Creating a dictionary


#Assign an index to each word
count = 0
for c in chars:
    if not c in chr_index:
        chr_index[c] = count
        count += 1
        print(count, c)
#Create the reverse mapping (index -> word)
rvs_index = dict([(value, key) for (key, value) in chr_index.items()])
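
For illustration, with a tiny token list the two dictionaries end up looking like this (a toy example, not the actual paper data):

toks = ["iot", "and", "cloud", "and"]
idx = {}
cnt = 0
for c in sorted(set(toks)):   # plays the same role as chars above
    if not c in idx:
        idx[c] = cnt
        cnt += 1
rvs = dict([(value, key) for (key, value) in idx.items()])
print(idx)  # {'and': 0, 'cloud': 1, 'iot': 2}
print(rvs)  # {0: 'and', 1: 'cloud', 2: 'iot'}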

Creating substrings

#maxlen: number of words per substring (4, matching the 4-word seed used later); step: stride (assumed to be 1)
maxlen = 4
step = 1
for i in range(0, len(text) - maxlen, step):
    #Store a substring of maxlen words
    sentences.append(text[i: i + maxlen])
    #Store the word that follows the substring
    next_word.append(text[i + maxlen])
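
With the assumed maxlen = 4 and step = 1, a tokenized text is split into overlapping windows like this (toy example):

toks = "iot can benefit from the cloud".split()
for i in range(0, len(toks) - 4, 1):
    print(toks[i: i + 4], "->", toks[i + 4])
# ['iot', 'can', 'benefit', 'from'] -> the
# ['can', 'benefit', 'from', 'the'] -> cloud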

Word vectorization

#Boolean 3D array: (number of substrings, substring length, vocabulary size)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
#Boolean 2D array: (number of substrings, vocabulary size)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
#One-hot encode each substring and the word that follows it
for i, sentence in enumerate(sentences):
    for t, ch in enumerate(sentence):
        x[i, t, chr_index[ch]] = 1
    y[i, chr_index[next_word[i]]] = 1

Creating a model

This time I am using the Sequential model. Regarding softmax, @rtok's article was easy to understand.

#Build a simple sequential model
model = Sequential()
#LSTM layer with 128 units
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
#Output a probability for each word in the vocabulary via softmax
model.add(Dense(len(chars), activation='softmax'))

I used RMSprop as the optimizer (gradient method). For RMSprop, @tokkuman's article was easy to understand.

optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Calculate the appearance rate of each word and select the word to output

#preds: output from the model (float32)
#temperature: diversity; the lower it is, the more likely the highest-probability word is selected
#The model output is a probability distribution, so it sums to 1.0
def selectWD(preds, temperature=1.0):
    #Convert to float64
    preds = np.asarray(preds).astype('float64')
    #Divide the log probabilities by the temperature; a higher temperature makes low-probability words easier to pick
    preds = np.log(preds) / temperature
    #Exponentiate back to undo the logarithm
    exp_preds = np.exp(preds)
    #Normalize so the values sum to 1
    preds = exp_preds / np.sum(exp_preds)
    #Draw one sample from the resulting multinomial distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
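
A quick way to see the effect of temperature is to sample repeatedly from a toy distribution (not actual model output):

probs = np.array([0.6, 0.3, 0.1], dtype='float32')
for t in [0.2, 1.0]:
    picks = [selectWD(probs, t) for _ in range(1000)]
    print(t, [picks.count(i) for i in range(3)])
# at temperature 0.2 index 0 is chosen almost every time;
# at 1.0 the counts roughly follow the original 6:3:1 ratio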

Processing for each epoch

#Number of sentences to output per generation (5, matching the result shown below)
OUTSEN = 5

def on_epoch_end(epoch, _):
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    #Start the seed at the beginning of the text (the first maxlen words)
    start_index = 0
    #diversity: same as the temperature in selectWD; the higher it is, the more likely lower-probability words are chosen
    for diversity in [0.2, 0.5, 0.8, 1.0]:
        print('----- diversity:', diversity)
        #Accumulates the generated text for display
        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += " ".join(sentence)
        print(" ".join(sentence))

        print('----- Generating with seed: "' + " ".join(sentence)+ '"')
        sys.stdout.write(generated)
        
        #Output OUTSEN sentences, or stop after generating 1000 words
        flag = OUTSEN
        for i in range(1000):
            #One-hot encode the current window of words
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, ch in enumerate(sentence):
                x_pred[0, t, chr_index[ch]] = 1.
            #Predict the next word
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = selectWD(preds, diversity)
            next_char = rvs_index[next_index]
            #Slide the window: drop the first word and append the predicted word
            sentence = sentence[1:]
            sentence.append(next_char)
            #Format the output
            if next_char == '.':
                flag -= 1
                generated += next_char + "\n"
                sys.stdout.write(next_char+"\n")
            elif next_char == ',':
                generated += next_char
                sys.stdout.write(next_char)
            else:
                generated += " " + next_char
                sys.stdout.write(" "+next_char)
            sys.stdout.flush()
            if flag <= 0:
                break
        sys.stdout.flush()
        print()
#Call the above process at the end of each epoch
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

Fitting process

#Batch size 128, 100 epochs, calling the callback defined above at the end of each epoch
model.fit(x, y, batch_size=128, epochs=100, callbacks=[print_callback])

Result

This is the result of using Botta et al.'s "Integration of Cloud computing and Internet of Things: A survey" (2016) as the input data.

----- Generating text after Epoch: 99
----- diversity: 0.2
in general , iot
----- Generating with seed: "in general , iot"
in general , iot can benefit from the virtually unlimited capabilities and resources of cloud to compensate its technological constraints ( e.g., storage, processing, communication ). being iot characterized by a very high heterogeneity of devices, technologies, and protocols, it lacks different important properties such as scalability, interoperability, flexibility, reliability, efficiency, availability, and security. as a consequence, analyses of unprecedented complexity are possible, and data-driven decision making and prediction algorithms can be employed at low cost, providing means for increasing revenues and reduced risks. the availability of high speed networks enables effective monitoring and control of remote things, their coordination, their communications, and real-time access to the produced data. this represents another important cloudiot driver : iot processing needs can be properly satisfied for performing real-time data analysis ( on-the-fly ), for implementing scalable, real-time, collaborative, sensor-centric applications, for managing complex events, and for supporting task offloading for energy saving.

I output 5 sentences; translated with Google Translate, it reads like this.

In general, iot can benefit from virtually unlimited features and resources in the cloud to compensate for technical constraints (storage, processing, communication, etc.). It features very high non-uniformity of devices, technologies, and protocols, and lacks various important properties such as scalability, interoperability, flexibility, reliability, efficiency, availability, and security. The result is unprecedented complexity analysis and low cost adoption of data-driven decision and forecasting algorithms to increase revenue and reduce risk. High-speed network availability enables effective monitoring and control of remote things, their coordination, communication, and real-time access to generated data. This represents another important cloudiot driver: performing real-time data analytics (on-the-fly), implementing scalable, real-time collaboration-centric sensor-centric applications, managing complex events, and offloading tasks to save money. Supports.

It feels like it comes together as a summary!

The entire source code is available on GitHub. There are still some odd parts in the Japanese output, so I will keep improving it...
