For college students who don't want to read English papers (Python)

I want to enjoy reading English papers!!!

My English is too weak to get through a journal paper, so I looked for a way to make it more enjoyable. What I came up with was to generate something like a summary and then translate the English text...

Environment: Google Colaboratory

What is Colaboratory?

Colaboratory is a Jupyter notebook environment that runs entirely in the cloud. It requires no setup and can be used free of charge. With Colaboratory you can write and execute code, store and share analyses, and access powerful computing resources, all for free from your browser.

Procedure

  1. Extract only the body text from the paper (remove the title, chapter names, figures and tables, and references)
  2. Apply LexRank to the text of each chapter and extract the key sentences
  3. Combine the key sentences from each chapter and use a language model to generate sentences close to a summary

About LexRank

Sentence summarization can be broadly divided into extractive and abstractive (generative) approaches. LexRank is a summarization algorithm classified as extractive. It builds a graph structure from a document, ranks the important sentences, and outputs the sentences that can serve as a summary. It was proposed by Gunes Erkan and Dragomir R. Radev in 2004.

From the abstract of the LexRank paper:

We consider a new approach, LexRank, for computing sentence importance based on the concept of eigenvector centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences.

(Figure: sentence similarity graph, taken from Figure 2 of the LexRank paper)

LexRank is explained in detail in ohke's article:

LexRank is a derivative of TextRank (proposal paper PDF), which was itself inspired by PageRank, and it has two key points:

  1. Create an undirected graph with sentences as nodes and the similarities between sentences as edges. In the proposal paper the similarity is the cosine similarity of TF-IDF vectors (nowadays word2vec and the like could also be used).
  2. Iterate until the transition probability matrix (M) and the probability vector (P) obtained from this graph become stable (MP = P), and select the sentences with the largest final probability values as the summary.

In the figure above (taken from Figure 2 of the proposal paper), which visualizes this, d5s1 and d4s1, which have many edges (= are similar to many sentences) and thick edges (= high similarity), are good summary candidates.
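
As a minimal sketch of this idea (not the sumy implementation used below; it assumes scikit-learn is available and uses the thresholded variant with a damping factor), the score computation looks roughly like this:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_scores(sentences, threshold=0.1, damping=0.85, tol=1e-6):
    # cosine similarity between TF-IDF vectors of the sentences
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    # keep only sufficiently similar pairs as edges of the graph
    adj = (sim >= threshold).astype(float)
    # normalize rows to get the transition probability matrix M
    M = adj / adj.sum(axis=1, keepdims=True)
    n = len(sentences)
    p = np.ones(n) / n
    # power iteration until the probability vector is stable (MP = P)
    while True:
        p_next = damping * M.T @ p + (1 - damping) / n
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

sents = ["The cloud offers virtually unlimited storage and processing.",
         "IoT devices produce large amounts of heterogeneous data.",
         "Cloud resources can compensate for the constraints of IoT devices."]
scores = lexrank_scores(sents)
print(sorted(zip(scores, sents), reverse=True)[0][1])  # highest-ranked sentence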

Implementation

The library imports are as follows. We use the libraries needed to run LexRank (via sumy), tokenize the text, and build the LSTM language model.

from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
from googletrans import Translator
import nltk
import numpy as np
import random
import sys
import io
import os
import glob

Supplement: when using nltk, you first need to download the punkt tokenizer package:

!python -c "import nltk; nltk.download('punkt')"

LexRank part (implementation using sumy)

#Number of sentences to extract from each section (choose as needed; 5 is just an example value)
SENTENCES_COUNT = 5

def sectionLex():
  #Language is set to English
  LANG = "english"
  #Select all .txt files (the body text of each section)
  files = glob.glob('*.txt')
  ex = []
  for path in files:
    parser = PlaintextParser.from_file(path, Tokenizer(LANG))
    stemmer = Stemmer(LANG)
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANG)
    for sentence in summarizer(parser.document, SENTENCES_COUNT):
      ex.append(str(sentence) + '\n')
  #Output with utf-8 encoding
  with open('output.txt', mode='w', encoding='utf-8') as f:
    f.writelines(ex)
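
For example, after placing the per-section .txt files in the working directory, the extraction is run simply as:

#Assumes the section .txt files are in the current directory
sectionLex()  # writes the extracted key sentences to output.txt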

Declaration of variables used in the dictionary

#Dictionary mapping each word to its index
chr_index = {}
#Reverse dictionary mapping each index back to its word
rvs_index = {}
#List of input substrings
sentences = []
#The word that follows each substring
next_word = []

Tokenization (word segmentation)

#Read with utf-8 encoding and store in text
with io.open('output.txt', encoding='utf-8') as f:
    text = f.read().lower()

#Break the text into words (tokenize)
text = nltk.word_tokenize(text)
#Vocabulary: the list of unique words
chars = sorted(set(text))

Creating a dictionary


#Assign an index to each word
count = 0
for c in chars:
    if not c in chr_index:
        chr_index[c] = count
        count += 1
        print(count, c)
#Create the reverse mapping (index -> word)
rvs_index = dict([(value, key) for (key, value) in chr_index.items()])
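
For illustration, with a tiny token list the two dictionaries end up looking like this (a toy example, not the actual paper data):

toks = ["iot", "and", "cloud", "and"]
idx = {}
cnt = 0
for c in sorted(set(toks)):   # plays the same role as chars above
    if not c in idx:
        idx[c] = cnt
        cnt += 1
rvs = dict([(value, key) for (key, value) in idx.items()])
print(idx)  # {'and': 0, 'cloud': 1, 'iot': 2}
print(rvs)  # {0: 'and', 1: 'cloud', 2: 'iot'}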

Creating substrings

#maxlen: number of words per substring (4, matching the 4-word seed used later); step: stride (assumed to be 1)
maxlen = 4
step = 1
for i in range(0, len(text) - maxlen, step):
    #Store a substring of maxlen words
    sentences.append(text[i: i + maxlen])
    #Store the word that follows the substring
    next_word.append(text[i + maxlen])
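
With the assumed maxlen = 4 and step = 1, a tokenized text is split into overlapping windows like this (toy example):

toks = "iot can benefit from the cloud".split()
for i in range(0, len(toks) - 4, 1):
    print(toks[i: i + 4], "->", toks[i + 4])
# ['iot', 'can', 'benefit', 'from'] -> the
# ['can', 'benefit', 'from', 'the'] -> cloud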

Word vectorization

#Boolean 3D array: (number of substrings, substring length, vocabulary size)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
#Boolean 2D array: (number of substrings, vocabulary size)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
#One-hot encode each substring and the word that follows it
for i, sentence in enumerate(sentences):
    for t, ch in enumerate(sentence):
        x[i, t, chr_index[ch]] = 1
    y[i, chr_index[next_word[i]]] = 1

Creating a model

This time I am using the Sequential model. Regarding softmax, @rtok's article was easy to understand.

#Build a simple sequential model
model = Sequential()
#LSTM layer with 128 units
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
#Output a probability for each word in the vocabulary via softmax
model.add(Dense(len(chars), activation='softmax'))

I used RMSprop as the optimizer (gradient method). For RMSprop, @tokkuman's article was easy to understand.

optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Calculate the appearance rate of each word and select the word to output

#preds: output from the model (float32)
#temperature: diversity; the lower it is, the more likely the highest-probability word is selected
#The model output is a probability distribution, so it sums to 1.0
def selectWD(preds, temperature=1.0):
    #Convert to float64
    preds = np.asarray(preds).astype('float64')
    #Divide the log probabilities by the temperature; a higher temperature makes low-probability words easier to pick
    preds = np.log(preds) / temperature
    #Exponentiate back to undo the logarithm
    exp_preds = np.exp(preds)
    #Normalize so the values sum to 1
    preds = exp_preds / np.sum(exp_preds)
    #Draw one sample from the resulting multinomial distribution
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
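
A quick way to see the effect of temperature is to sample repeatedly from a toy distribution (not actual model output):

probs = np.array([0.6, 0.3, 0.1], dtype='float32')
for t in [0.2, 1.0]:
    picks = [selectWD(probs, t) for _ in range(1000)]
    print(t, [picks.count(i) for i in range(3)])
# at temperature 0.2 index 0 is chosen almost every time;
# at 1.0 the counts roughly follow the original 6:3:1 ratio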

Processing for each epoch

#Number of sentences to output per generation (5, matching the result shown below)
OUTSEN = 5

def on_epoch_end(epoch, _):
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    #Start the seed at the beginning of the text (the first maxlen words)
    start_index = 0
    #diversity: same as the temperature in selectWD; the higher it is, the more likely lower-probability words are chosen
    for diversity in [0.2, 0.5, 0.8, 1.0]:
        print('----- diversity:', diversity)
        #Accumulates the generated text for display
        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += " ".join(sentence)
        print(" ".join(sentence))

        print('----- Generating with seed: "' + " ".join(sentence)+ '"')
        sys.stdout.write(generated)
        
        #Output OUTSEN sentences, or stop after generating 1000 words
        flag = OUTSEN
        for i in range(1000):
            #One-hot encode the current window of words
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, ch in enumerate(sentence):
                x_pred[0, t, chr_index[ch]] = 1.
            #Predict the next word
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = selectWD(preds, diversity)
            next_char = rvs_index[next_index]
            #Slide the window: drop the first word and append the predicted word
            sentence = sentence[1:]
            sentence.append(next_char)
            #Format the output
            if next_char == '.':
                flag -= 1
                generated += next_char + "\n"
                sys.stdout.write(next_char+"\n")
            elif next_char == ',':
                generated += next_char
                sys.stdout.write(next_char)
            else:
                generated += " " + next_char
                sys.stdout.write(" "+next_char)
            sys.stdout.flush()
            if flag <= 0:
                break
        sys.stdout.flush()
        print()
#Call the above process at the end of each epoch
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

Fitting process

#Batch size 128, 100 epochs, calling the callback defined above at the end of each epoch
model.fit(x, y, batch_size=128, epochs=100, callbacks=[print_callback])

Result

This is the result of using Botta et al.'s "Integration of Cloud computing and Internet of Things: A survey" (2016) as the input data.

----- Generating text after Epoch: 99
----- diversity: 0.2
in general , iot
----- Generating with seed: "in general , iot"
in general , iot can benefit from the virtually unlimited capabilities and resources of cloud to compensate its technological constraints ( e.g., storage, processing, communication ). being iot characterized by a very high heterogeneity of devices, technologies, and protocols, it lacks different important properties such as scalability, interoperability, flexibility, reliability, efficiency, availability, and security. as a consequence, analyses of unprecedented complexity are possible, and data-driven decision making and prediction algorithms can be employed at low cost, providing means for increasing revenues and reduced risks. the availability of high speed networks enables effective monitoring and control of remote things, their coordination, their communications, and real-time access to the produced data. this represents another important cloudiot driver : iot processing needs can be properly satisfied for performing real-time data analysis ( on-the-fly ), for implementing scalable, real-time, collaborative, sensor-centric applications, for managing complex events, and for supporting task offloading for energy saving.

I output 5 sentences; translated with Google Translate, it reads like this.

In general, iot can benefit from virtually unlimited features and resources in the cloud to compensate for technical constraints (storage, processing, communication, etc.). It features very high non-uniformity of devices, technologies, and protocols, and lacks various important properties such as scalability, interoperability, flexibility, reliability, efficiency, availability, and security. The result is unprecedented complexity analysis and low cost adoption of data-driven decision and forecasting algorithms to increase revenue and reduce risk. High-speed network availability enables effective monitoring and control of remote things, their coordination, communication, and real-time access to generated data. This represents another important cloudiot driver: performing real-time data analytics (on-the-fly), implementing scalable, real-time collaboration-centric sensor-centric applications, managing complex events, and offloading tasks to save money. Supports.

It feels like it comes together as a summary!

The entire source code is available on GitHub. There are still some odd parts in the Japanese output, so I will keep improving it...
