[PYTHON] Author estimation using neural network and Doc2Vec (Aozora Bunko)

Introduction

I tried the task of estimating the author of works pulled from Aozora Bunko, so I wrote it up as an article. The code is available [here](https://github.com/minnsou/aozora_pred).

The overall flow this time is as follows.

  1. Download the text from Aozora Bunko using wget
  2. Use MeCab to format the text
  3. Vectorize the text using Doc2Vec (creating data with the vector made from each text as x and the author label as y)
  4. Construct a neural network using keras and train it as a classification problem (supervised learning)

Environment

The main libraries used are BeautifulSoup, keras, MeCab, and gensim. I will omit the installation steps since they are beside the point; basically everything went fine with pip.
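For reference, a minimal installation sketch is below (the PyPI package names are my assumption of the usual ones; MeCab itself and a dictionary such as IPAdic also need to be installed on the system separately).

# Minimal installation sketch (assumed package names; MeCab and a dictionary must also be installed separately)
!pip install beautifulsoup4 keras gensim mecab-python3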

Preparation

First, decide whose works to download. This time, for the time being, I targeted authors who satisfy "__their name appears in the あ row (a, i, u, e, o) of the writer list__" and "__they have 20 or more published works__".

Get the author IDs needed for downloading the works from the Aozora Bunko writer list. For example, Ryunosuke Akutagawa's ID is 879.

![Screenshot of the Aozora Bunko writer list](スクリーンショット 2020-01-10 14.29.29.png)

Create `authors.txt` summarizing these. I could have generated it automatically, but since there were so few authors, I made it by hand.

authors.txt


Ryunosuke Akutagawa 879
Takeo Arishima 25
Andersen Hans Christian 19
Ishikawa Takuboku 153
Jun Ishiwara 1429
Kyoka Izumi 50
Mansaku Itami 231
Sachio Ito 58
Noe Ito 416
Bin Ueda 235
Uemura Shoen 355
Uchida Roan 165
Unno Juza 160
Edogawa Ranpo 1779
Yu Okubo 10
Shigenobu Okuma 1879
Keigetsu Omachi 237
Asajiro Oka 1474
Kanoko Okamoto 76
Kido Okamoto 82
Mimei Ogawa 1475
Hideo Oguma 124
Mushitaro Oguri 125
Sakunosuke Oda 40
Shinobu Orikuchi 933

25 authors in total. Each line contains the name and the author ID separated by a space. In addition, one author (Teruko Okura) was left out of this list due to an issue described later.

All that remains is to import the required libraries. The Python scripts below are the same as those published in author_prediction.ipynb.

from bs4 import BeautifulSoup
import re
import MeCab
from gensim.models.doc2vec import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import numpy as np
import matplotlib.pyplot as plt
from keras import layers
from keras import models
from keras import optimizers
from keras.utils import np_utils

This completes the preparation.

1. Download the text from Aozora Bunko using wget

1.1 Acquisition of work ID

First, use authors.txt to get the work IDs for each author and save them as personID??.txt (?? is the author ID).

# Based on authors.txt, wget each author's index page and generate personID??.txt
# listing that author's work IDs (?? is the author ID)
# Also collect the author IDs in personID_list

personID_list = []
memo = open('./authors.txt')
for line in memo:
    line = line.rstrip()
    line = line.split()
    #print(line)
    author = ' '.join(line[:-1]) # the name may contain spaces, so everything except the last field is the name
    personID = line[-1]
    personID_list.append(personID)
    
    # wget the index page for each personID in authors.txt (not necessary, since the files are already created)
    #!wget https://www.aozora.gr.jp/index_pages/person{personID}.html -O ./data/index{personID}.html
    #!sleep 1
    
    # Open the saved index??.html
    with open("./data/index{}.html".format(personID), encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
        ol = soup.find("ol").text
        bookID = re.findall('ID:[0-9]*', ol) # extract the parts of index??.html where the work IDs are written
        #print(bookID)
        bookID_list = []
        for b in bookID:
            b = b[3:] # remove the leading 'ID:'
            bookID_list.append(b) # add the work ID
        #print(bookID_list)
        
        print('author {}\tpersonID {}\tnumber of cards {}'.format(author, personID, len(bookID_list)))
        
        # Write this author's work IDs to a text file based on bookID_list (not necessary, since the files are already created)
        #with open('./data/personID{}.txt'.format(personID), mode='w') as f:
        #    for b in bookID_list:
        #        f.write(b + ' ')

When you run it, you get output like the following, and a file called personID??.txt is created for each of the 25 authors (?? is the author ID).

author Ryunosuke Akutagawa personID 879 number of cards 376
author Takeo Arishima personID 25 number of cards 44
author Andersen Hans Christian personID 19 number of cards 23
author Takuboku Ishikawa personID 153 number of cards 78
author Jun Ishiwara personID 1429 number of cards 24
author Kyoka Izumi personID 50 number of cards 208
author Mansaku Itami personID 231 number of cards 23
author Sachio Ito personID 58 number of cards 39
author Ito Noe personID 416 number of cards 80
author Bin Ueda personID 235 number of cards 53
author Uemura Shoen personID 355 number of cards 83
author Roan Uchida personID 165 number of cards 26
author Unno Juza personID 160 number of cards 177
author Edogawa Ranpo personID 1779 number of cards 91
author Yu Okubo personID 10 number of cards 68
author Shigenobu Okuma personID 1879 number of cards 31
author Keigetsu Omachi personID 237 number of cards 60
author Asajiro Oka personID 1474 number of cards 25
author Kanoko Okamoto personID 76 number of cards 119
author Kido Okamoto personID 82 number of cards 247
author Mimei Ogawa personID 1475 number of cards 521
author Hideo Oguma personID 124 number of cards 33
author Mushitaro Oguri personID 125 number of cards 22
author Sakunosuke Oda personID 40 number of cards 70
author Shinobu Orikuchi personID 933 number of cards 197

Caution

Note that if you uncomment the wget lines in the script above and create personID??.txt yourself, you will get files containing more work IDs than the personID??.txt files I uploaded. This is because I manually removed the work IDs that cause errors when extracting the body text with the scripts below.

For example, Yu Okubo's work "Aozora Bunko" has a link to an external site pasted in, in addition to the usual Aozora Bunko page, and that [external site](http://p.booklog.jp/book/35337) ends up being fetched instead of the Aozora Bunko page I actually want. Also, some works, like Hideo Oguma's tanka collection, have no body text (no `<div class="main_text">` tag), which also causes an error.
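As a sketch of how such works could be skipped automatically instead of editing personID??.txt by hand, one could check for the `main_text` div before extracting the body (this helper is hypothetical and not part of the original repository).

# Hypothetical helper (not in the original repo): return the body text of a saved work,
# or None if the page has no <div class="main_text"> (e.g. some tanka collections)
def extract_main_text(path):
    with open(path, encoding="shift_jis", errors="ignore") as f:
        soup = BeautifulSoup(f, "html.parser")
    div = soup.find("div", "main_text")
    return div.text if div is not None else None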

I have uploaded personID??.txt files with these exceptional work IDs already removed, so if you just want to run the code, it is safest __not to uncomment__ those lines. If you want to verify the behavior yourself, __uncomment them and change the save destination directory__.

1.2 Download the works with wget

Next, download the works using the work IDs written in personID??.txt. I used pubserver2 to download them, but now that I think about it, I probably could have just used wget against Aozora Bunko directly without going through pubserver2 (a rough sketch of that idea appears after the code below).

# Get the work IDs from personID??.txt and fetch each work with wget (up to 50 works per author)
# The html containing a work is saved as text{x}_{y}.html (x is the personID, y is the bookID)

for personID in personID_list:
    print('personID', personID)
    with open("./data/personID{}.txt".format(personID), encoding="utf-8") as f:
        for bookID_str in f:
            bookID_list = bookID_str.split()
            print('number of cards', len(bookID_list))
            
            # Downloading every work would take too long, so limit it to 50 works per author
            if len(bookID_list) >= 50:
                bookID_list = bookID_list[:50]
            for bookID in bookID_list:
                print('ID', bookID)
                
                # wget the html containing the body text, based on the bookID (not necessary, since the files are already created)
                #!wget http://pubserver2.herokuapp.com/api/v0.1/books/{bookID}/content?format=html -O ./data/text{personID}_{bookID}.html
                #!sleep 1
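As an aside regarding the pubserver2 question above, here is a rough sketch of how a work might be downloaded directly from Aozora Bunko. It assumes the card page lives at `https://www.aozora.gr.jp/cards/{author ID zero-padded to 6 digits}/card{work ID}.html` and that the XHTML version of the text is linked from that page under `files/`; I have not verified this for every work, so treat it as an untested idea rather than a drop-in replacement.

import urllib.parse
import requests  # assumed extra dependency, not used elsewhere in this article

# Rough sketch (unverified assumptions about Aozora Bunko's URL layout):
# scrape the card page for the XHTML link and download it directly
def fetch_from_aozora(personID, bookID):
    card_url = "https://www.aozora.gr.jp/cards/{:0>6}/card{}.html".format(personID, bookID)
    card = BeautifulSoup(requests.get(card_url).content, "html.parser")
    for a in card.find_all("a", href=True):
        if "files/" in a["href"] and a["href"].endswith(".html"):
            return requests.get(urllib.parse.urljoin(card_url, a["href"])).content
    return None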

2. Use MeCab to format the text

Create a function that formats the body text and attaches a tag. This is based on here. The tags are numbers from 0 to 24 assigned in the order of `authors.txt` (Ryunosuke Akutagawa is 0, Takeo Arishima is 1, ..., Shinobu Orikuchi is 24). Using the Aozora author IDs directly as tag numbers would be inconvenient when generating the data, so they are renumbered here.

# Turn doc (the body text of a work) into a list containing only verbs, adjectives, and nouns,
# and generate a TaggedDocument consisting of that word list and a tag

def split_into_words(doc, name=''):
    mecab = MeCab.Tagger("-Ochasen")
    lines = mecab.parse(doc).splitlines() # morphological analysis
    words = []
    for line in lines:
        chunks = line.split('\t')
        # Keep only nouns (excluding numerals), verbs, and adjectives
        # (in the ChaSen output format, column 3 is the part of speech, written in Japanese)
        if len(chunks) > 3 and (chunks[3].startswith('動詞') or chunks[3].startswith('形容詞') or (chunks[3].startswith('名詞') and not chunks[3].startswith('名詞-数'))):
            words.append(chunks[0])
    #print(words)
    return TaggedDocument(words=words, tags=[name])

Generate train_text containing the training data and test_text containing the evaluation data.

# Generate train_text for training (fixed at 20 works per author)
# Also generate test_text for evaluation (all the remaining data not used for training)
# Since the number of works varies by author, the number of works in test_text also varies by author

train_text = []
test_text = []

for i, personID in enumerate(personID_list):
    print('personID', personID)

    with open("./data/personID{}.txt".format(personID), encoding="utf-8") as f:
        for bookID_str in f:
            #print(bookID_str)
            bookID_list = bookID_str.split()
            
            # Only up to 50 works per author were downloaded, so cut the list here as well
            if len(bookID_list) >= 50:
                bookID_list = bookID_list[:50]
            print('number of cards', len(bookID_list))
            
            for j, bookID in enumerate(bookID_list):
                
                # Open the html containing the body text saved earlier
                soup = BeautifulSoup(open("./data/text{}_{}.html".format(personID, bookID), encoding="shift_jis"), "html.parser")

                # Extract the <div> where the body text is written
                main_text = soup.find("div", "main_text").text
                #print(main_text)
                
                # The first 20 works go into train_text, the rest into test_text
                if j < 20:
                    train_text.append(split_into_words(main_text, str(i)))
                    print('bookID\t{}\ttrain'.format(bookID))
                else:
                    test_text.append(split_into_words(main_text, str(i)))
                    print('bookID\t{}\ttest'.format(bookID))

3. Vectorize the text using Doc2Vec

Create the Doc2Vec model and train it. Hyperparameters such as alpha and epochs were chosen fairly arbitrarily.

# Create the Doc2Vec model and train it
# vector_size is set to the number of training documents (25 authors x 20 works = 500),
# which matches the input_shape of the neural network defined later
model = Doc2Vec(vector_size=len(train_text), dm=0, alpha=0.05, min_count=5)
model.build_vocab(train_text)
model.train(train_text, total_examples=len(train_text), epochs=5)

# Save the trained model (optional)
#model.save('./data/doc2vec.model')

Convert the texts into numerical vectors to create the data for training the neural network.

# Create data for the neural network from the trained Doc2Vec model and a list of TaggedDocuments
def text2xy(model, text):
    x = []
    y = []
    for i in range(len(text)):
        #print(i)
        vec = model.infer_vector(text[i].words) # convert the word list into a numerical vector
        x.append(vec.tolist())
        y.append(int(text[i].tags[0]))

    x = np.array(x)
    y = np_utils.to_categorical(y) # convert the tag numbers to one-hot vectors
    return x, y

# Create the training data and the evaluation data
x_train, y_train = text2xy(model, train_text)
x_test, y_test = text2xy(model, test_text)
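As a quick sanity check (assuming 20 training works for each of the 25 authors and the vector size of 500 set above), the shapes should come out as follows.

# Expected shapes under the assumptions above:
# x_train: (500, 500) -> 25 authors x 20 works, each a 500-dimensional Doc2Vec vector
# y_train: (500, 25)  -> one-hot author labels
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)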

4. Construct a neural network using keras and train it as a classification problem (supervised learning)

Prepare a function (dense_train) that handles everything from model definition to training, plus functions for plotting (draw_acc and draw_loss).

The neural network is a fairly simple model consisting of three fully connected layers.

For the loss function I used categorical cross-entropy, which is commonly used for classification problems. The number of units and the learning rate were chosen fairly arbitrarily.

def dense_train(epochs):    
    # Model definition
    kmodel = models.Sequential()
    kmodel.add(layers.Dense(512, activation='relu', input_shape=(500,)))
    kmodel.add(layers.Dense(256, activation='relu'))
    kmodel.add(layers.Dense(25, activation='softmax'))
    kmodel.summary()

    # Compile the model
    kmodel.compile(loss='categorical_crossentropy', optimizer=optimizers.RMSprop(lr=1e-4), metrics=['acc'])
    
    # Train the model
    history = kmodel.fit(x=x_train, y=y_train, epochs=epochs, validation_data=(x_test, y_test))

    # Save the model (optional)
    #kmodel.save('./data/dense.h5')
    return history, kmodel

# Plot the accuracy
def draw_acc(history):
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    epochs = range(1, len(acc) + 1)

    fig = plt.figure()
    fig1 = fig.add_subplot(111)
    fig1.plot(epochs, acc, 'bo', label='Training acc')
    fig1.plot(epochs, val_acc, 'b', label='Validation acc')

    fig1.set_xlabel('epochs')
    fig1.set_ylabel('accuracy')
    fig.legend(bbox_to_anchor=(0., 0.19, 0.86, 0.102), loc=5) # bbox_to_anchor is (x, y, width, height)

    # Save the figure
    fig.savefig('./acc.pdf')
    plt.show()

# Plot the loss
def draw_loss(history):
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs = range(1, len(loss) + 1)
    
    fig = plt.figure()
    fig1 = fig.add_subplot(111)
    fig1.plot(epochs, loss, 'bo', label='Training loss')
    fig1.plot(epochs, val_loss, 'b', label='Validation loss')

    fig1.set_xlabel('epochs')
    fig1.set_ylabel('loss')
    fig.legend(bbox_to_anchor=(0., 0.73, 0.86, 0.102), loc=5) # bbox_to_anchor is (x, y, width, height)

    # Save the figure (optional)
    #fig.savefig('./loss.pdf')
    plt.show()

Train the model and plot the accuracy. This time I trained for 10 epochs.

history, kmodel = dense_train(10)
draw_acc(history)

The output obtained is as follows.

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               256512    
_________________________________________________________________
dense_2 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_3 (Dense)              (None, 25)                6425      
=================================================================
Total params: 394,265
Trainable params: 394,265
Non-trainable params: 0
_________________________________________________________________
Train on 500 samples, validate on 523 samples
Epoch 1/10
500/500 [==============================] - 0s 454us/step - loss: 3.0687 - acc: 0.1340 - val_loss: 2.9984 - val_acc: 0.3308
Epoch 2/10
500/500 [==============================] - 0s 318us/step - loss: 2.6924 - acc: 0.6300 - val_loss: 2.8255 - val_acc: 0.5698
Epoch 3/10
500/500 [==============================] - 0s 315us/step - loss: 2.3527 - acc: 0.8400 - val_loss: 2.6230 - val_acc: 0.6864
Epoch 4/10
500/500 [==============================] - 0s 283us/step - loss: 1.9961 - acc: 0.9320 - val_loss: 2.4101 - val_acc: 0.7610
Epoch 5/10
500/500 [==============================] - 0s 403us/step - loss: 1.6352 - acc: 0.9640 - val_loss: 2.1824 - val_acc: 0.8088
Epoch 6/10
500/500 [==============================] - 0s 237us/step - loss: 1.2921 - acc: 0.9780 - val_loss: 1.9504 - val_acc: 0.8337
Epoch 7/10
500/500 [==============================] - 0s 227us/step - loss: 0.9903 - acc: 0.9820 - val_loss: 1.7273 - val_acc: 0.8432
Epoch 8/10
500/500 [==============================] - 0s 220us/step - loss: 0.7424 - acc: 0.9840 - val_loss: 1.5105 - val_acc: 0.8642
Epoch 9/10
500/500 [==============================] - 0s 225us/step - loss: 0.5504 - acc: 0.9840 - val_loss: 1.3299 - val_acc: 0.8623
Epoch 10/10
500/500 [==============================] - 0s 217us/step - loss: 0.4104 - acc: 0.9840 - val_loss: 1.1754 - val_acc: 0.8719

Here is a plot of the learning results.

![Training and validation accuracy](Unknown.png)

The accuracy seems to level off after around 6 epochs, but in the end I got an accuracy of __87.16%__. Considering that I did hardly any hyperparameter tuning, that seems like a reasonable result.
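As a follow-up, here is a minimal sketch of how the trained models could be used to predict the author of a single work from the evaluation set. It reuses `model`, `kmodel`, and `test_text` defined above; mapping the predicted tag back to an author name via the order of `authors.txt` is left out.

# Minimal sketch: predict the author of one evaluation work with the trained Doc2Vec model
# and the keras classifier defined above
sample = test_text[0]                       # a TaggedDocument from the evaluation set
vec = model.infer_vector(sample.words)      # 500-dimensional Doc2Vec vector
probs = kmodel.predict(np.array([vec]))[0]  # class probabilities over the 25 authors
predicted = int(np.argmax(probs))           # tag number, i.e. position in authors.txt
print('true tag {}, predicted tag {}'.format(sample.tags[0], predicted))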

In closing

There is probably room for more hyperparameter tuning and model selection (maybe try an LSTM or something?). Please comment if you have any suggestions. Thank you for reading.
