[Python] [Introduction to PyTorch] I want to generate sentences from news articles

Overview

I have collected news articles about the coronavirus and would like to try sentence generation with them. I started studying deep learning with PyTorch during my time at home, so this post is my attempt to put it to use. I'm still learning, so please understand that there may be some mistakes.

Environment

Libraries used, etc.

import torch
import torch.nn as nn
import torch.optim as optimizers
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.utils import shuffle
import random
from tqdm import tqdm_notebook as tqdm
import pickle
import matplotlib.pyplot as plt
import logging
import numpy as np

Data preparation

Load the scraped coronavirus news text, which has already been pre-processed and split into words. The preprocessing is nothing special.
Coronavirus news from roughly the last week was collected on Yahoo News, so I scraped it from there. I didn't notice that it only covered one week, so the amount of data I actually got is far too small. I'm worried whether the model can learn anything useful from this, but let's try it anyway.

data_news = pickle.load(open("Destination/corona_wakati.pickle", "rb"))

The data looks like this:

data_news[0]

['patient', 'Ya', 'Health care workers', 'La', 'But', 'New Coronavirus', 'To', 'infection', 'Shi', 'Ta', 'thing', 'But', 'Revealed', 'Shi', 'Ta', 'Ikuno-ku, Osaka', 'of', '「', 'Nami', 'Ha', 'Ya', 'Rehabilitation', 'hospital', '」', 'about', '、', 'Osaka Prefecture', 'Ha', '0', 'Day', 'Night', '、', 'further', '0', 'Man', 'of', 'infection', 'But', 'clear', 'To', 'Now', 'Ta', 'When', 'Presentation', 'Shi', 'Ta', '。']

Convert words to IDs

Words cannot be fed to a neural network as they are, so we convert them to IDs. Since we need to go back from IDs to words when actually generating sentences, we also implement a decoder.
I wanted the class to be general purpose, so it can add symbols at the beginning and end of a sentence, but this time every sentence already ends with a punctuation mark, so that is not actually needed here.

class EncoderDecoder(object):
    def __init__(self):
        # word_to_id dictionary
        self.w2i = {}
        # id_to_word dictionary
        self.i2w = {}
        # Reserved tokens (padding, beginning/end of sentence, unknown word)
        self.special_chars = ['<pad>', '<s>', '</s>', '<unk>']
        self.bos_char = self.special_chars[1]
        self.eos_char = self.special_chars[2]
        self.oov_char = self.special_chars[3]

    # Make the instance callable
    def __call__(self, sentence):
        return self.transform(sentence)

    # Build the dictionaries
    def fit(self, sentences):
        self._words = set()

        # Collect the set of all words that appear in the data
        for sentence in sentences:
            self._words.update(sentence)

        # Assign IDs, offset by the number of reserved tokens
        self.w2i = {w: (i + len(self.special_chars))
                    for i, w in enumerate(self._words)}

        # Add the reserved tokens to the dictionary (<pad>:0, <s>:1, </s>:2, <unk>:3)
        for i, w in enumerate(self.special_chars):
            self.w2i[w] = i

        # Build the id_to_word dictionary from word_to_id
        self.i2w = {i: w for w, i in self.w2i.items()}

    # Convert all sentences to ID sequences at once
    def transform(self, sentences, bos=False, eos=False):
        output = []
        #Add start and end symbols if specified
        for sentence in sentences:
            if bos:
                sentence = [self.bos_char] + sentence
            if eos:
                sentence = sentence + [self.eos_char]
            output.append(self.encode(sentence))

        return output

    # Convert a single sentence to IDs
    def encode(self, sentence):
        output = []
        for w in sentence:
            if w not in self.w2i:
                idx = self.w2i[self.oov_char]
            else:
                idx = self.w2i[w]
            output.append(idx)

        return output

    # Convert an ID sequence back into a list of words
    def decode(self, sentence):
        return [self.i2w[id] for id in sentence]

Use the defined class as follows

en_de = EncoderDecoder()
en_de.fit(data_news)
data_news_id = en_de(data_news)
data_news_id[0]
[7142,
 5775,
 3686,
 4630,
 5891,
 4003,
 358,
 3853,
 4139,
 4604,
 4591,
 5891,
 2233,
 4139,
 4604,
 5507,
 7378,
 2222,
 6002,
 3277,
 5775,
 7380,
 7234,
 5941,
 5788,
 2982,
 4901,
 3277,
 6063,
 5812,
 4647,
 2982,
 1637,
 6063,
 6125,
 7378,
 3853,
 5891,
 1071,
 358,
 7273,
 4604,
 5835,
 1328,
 4139,
 4604,
 1226]

Decoding returns the original sentence:

en_de.decode(data_news_id[0])

['patient', 'Ya', 'Health care workers', 'La', 'But', 'New Coronavirus', 'To', 'infection', 'Shi', 'Ta', 'thing', 'But', 'Revealed', 'Shi', 'Ta', 'Ikuno-ku, Osaka', 'of', '「', 'Nami', 'Ha', 'Ya', 'Rehabilitation', 'hospital', '」', 'about', '、', 'Osaka Prefecture', 'Ha', '0', 'Day', 'Night', '、', 'further', '0', 'Man', 'of', 'infection', 'But', 'clear', 'To', 'Now', 'Ta', 'When', 'Presentation', 'Shi', 'Ta', '。']

Create data and labels

In this sentence-generation task, the model learns as shown in the image below: the label is simply the input sequence shifted by one word. This time I create a custom PyTorch Dataset that builds the data and labels internally.
Each sequence is also padded with 0 up to a specified length so that all sequences have the same length, and then returned as a LongTensor.
By the way, keras's pad_sequences is used for the padding. PyTorch provides a similar utility, but it does not let you specify the target length to pad to, so I use the keras one here (see the pure-PyTorch sketch after the Dataset example below).
blog_rnnlm.png

class MyDataset(torch.utils.data.Dataset):

    def __init__(self, data, max_length=50):
        self.data_num = len(data)
        # The input is the sequence without its last word;
        # the label is the same sequence shifted by one word
        self.x = [d[:-1] for d in data]
        self.y = [d[1:] for d in data]
        # Length to pad every sequence to
        self.max_length = max_length

    def __len__(self):
        return self.data_num

    def __getitem__(self, idx):

        out_data = self.x[idx]
        out_label =  self.y[idx]

        #Pad to match length
        out_data = pad_sequences([out_data], padding='post', maxlen=self.max_length)[0]
        out_label = pad_sequences([out_label], padding='post', maxlen=self.max_length)[0]

        #Convert to LongTensor type
        out_data = torch.LongTensor(out_data)
        out_label = torch.LongTensor(out_label)

        return out_data, out_label

dataset = MyDataset(data_news_id, max_length=50)
dataset[0]
(tensor([7142, 5775, 3686, 4630, 5891, 4003,  358, 3853, 4139, 4604, 4591, 5891,
         2233, 4139, 4604, 5507, 7378, 2222, 6002, 3277, 5775, 7380, 7234, 5941,
         5788, 2982, 4901, 3277, 6063, 5812, 4647, 2982, 1637, 6063, 6125, 7378,
         3853, 5891, 1071,  358, 7273, 4604, 5835, 1328, 4139, 4604,    0,    0,
            0,    0]),
 tensor([5775, 3686, 4630, 5891, 4003,  358, 3853, 4139, 4604, 4591, 5891, 2233,
         4139, 4604, 5507, 7378, 2222, 6002, 3277, 5775, 7380, 7234, 5941, 5788,
         2982, 4901, 3277, 6063, 5812, 4647, 2982, 1637, 6063, 6125, 7378, 3853,
         5891, 1071,  358, 7273, 4604, 5835, 1328, 4139, 4604, 1226,    0,    0,
            0,    0]))
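
As an aside, the same fixed-length padding can be written without keras. Here is a minimal sketch in plain PyTorch (pad_to_length is a hypothetical helper, not part of the original code):

import torch

# Hypothetical helper: zero-pad (or truncate) an ID sequence to a fixed length,
# padding at the end, like pad_sequences(..., padding='post')
def pad_to_length(ids, max_length=50, pad_id=0):
    ids = ids[:max_length]
    return torch.LongTensor(ids + [pad_id] * (max_length - len(ids)))

pad_to_length([7142, 5775, 3686], max_length=5)
# tensor([7142, 5775, 3686,    0,    0])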

Batching with DataLoader

Finally, split the data into batches with PyTorch's DataLoader. If the number of samples is not divisible by the batch size, the last batch ends up with a different size, so set drop_last=True.

data_loader = DataLoader(dataset, batch_size=50, drop_last=True)

Check only the first batch

for (x, y) in data_loader:
    print("x_dim: {}, y_dim: {}".format(x.shape, y.shape))
    break
x_dim: torch.Size([50, 50]), y_dim: torch.Size([50, 50])

Modeling / learning

It is a little confusing here because the batch size and the sequence length happen to be the same (both 50). With the data pipeline above, each batch has the shape (batch size, sequence length, input dimension), treating a sentence as a time series of words, whereas PyTorch's LSTM defaults to (sequence length, batch size, input dimension), so batch_first=True must be specified. Other than that, there is nothing special to mention.
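
As a quick sanity check of this shape convention (a standalone illustration, not part of the original pipeline), an LSTM built with batch_first=True takes input of shape (batch, sequence, features):

import torch
import torch.nn as nn

# Dummy dimensions, chosen only for illustration
lstm = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
dummy = torch.randn(50, 50, 256)   # (batch_size, seq_len, embedding_dim)
out, (h, c) = lstm(dummy)
print(out.shape)                   # torch.Size([50, 50, 256]) = (batch_size, seq_len, hidden_dim)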

class RNNLM(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, batch_size=100, num_layers=1, device="cuda"):
        super().__init__()
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.hidden_dim = hidden_dim
        self.device = device

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.dropout1 = nn.Dropout(0.5)
        self.lstm1 = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, num_layers=self.num_layers)
        self.dropout2 = nn.Dropout(0.5)
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, num_layers=self.num_layers)
        self.dropout3 = nn.Dropout(0.5)
        self.lstm3 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, num_layers=self.num_layers)
        self.linear = nn.Linear(hidden_dim, vocab_size)

        nn.init.xavier_normal_(self.lstm1.weight_ih_l0)
        nn.init.orthogonal_(self.lstm1.weight_hh_l0)
        nn.init.xavier_normal_(self.lstm2.weight_ih_l0)
        nn.init.orthogonal_(self.lstm2.weight_hh_l0)
        nn.init.xavier_normal_(self.lstm3.weight_ih_l0)
        nn.init.orthogonal_(self.lstm3.weight_hh_l0)
        nn.init.xavier_normal_(self.linear.weight)
        

    def init_hidden(self):
        self.hidden_state = (torch.zeros(self.num_layers, self.batch_size, self.hidden_dim, device=self.device), torch.zeros(self.num_layers, self.batch_size, self.hidden_dim, device=self.device))

    def forward(self, x):
        # x: (batch, seq_len) -> (batch, seq_len, embedding_dim)
        x = self.embedding(x)
        x = self.dropout1(x)
        # each LSTM returns (batch, seq_len, hidden_dim) because batch_first=True
        h, self.hidden_state = self.lstm1(x, self.hidden_state)
        h = self.dropout2(h)
        h, self.hidden_state = self.lstm2(h, self.hidden_state)
        h = self.dropout3(h)
        h, self.hidden_state = self.lstm3(h, self.hidden_state)
        # project onto the vocabulary: (batch, seq_len, vocab_size)
        y = self.linear(h)
        return y
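
Before training, a quick way to confirm that the model produces the expected output shape (a standalone check, reusing the x batch from the DataLoader check above and running on the CPU for simplicity):

VOCAB_SIZE = len(en_de.i2w)
model = RNNLM(256, 256, VOCAB_SIZE, batch_size=50, device="cpu")
model.init_hidden()
print(model(x).shape)   # torch.Size([50, 50, VOCAB_SIZE]) = (batch_size, seq_len, vocab_size)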

Training

The hyperparameters are chosen fairly arbitrarily. Sorry about that.
Since the amount of data is small, I increased the number of epochs, ran many iterations, and saved checkpoints frequently.
In other deep learning tasks you might use validation data to check for overfitting and stop training early, but a task like this one, which outputs a probability distribution, is hard to evaluate quantitatively. Here I track training progress and evaluate the model with a metric called perplexity. Perplexity looks a little complicated when written as a formula, but intuitively it is the reciprocal of the predicted probability and can be read as a branching factor. In other words, for a next-word prediction task like this one, a perplexity of 2 means the model has effectively narrowed the next word down to 2 choices.
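
For reference, perplexity here is just the exponential of the average cross-entropy loss per predicted word, which is exactly what np.exp(train_loss / loss_count) computes in the training loop below:

import numpy as np

# Perplexity from an average cross-entropy loss (natural log base).
# e.g. the final mean loss of about 0.077 corresponds to a perplexity of about 1.08
def perplexity(mean_cross_entropy):
    return np.exp(mean_cross_entropy)

perplexity(0.077)   # ~1.080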

if __name__ == '__main__':
    np.random.seed(123)
    torch.manual_seed(123)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    EMBEDDING_DIM = HIDDEN_DIM = 256
    VOCAB_SIZE = len(en_de.i2w)
    BATCH_SIZE=50

    model = RNNLM(EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE, batch_size=BATCH_SIZE).to(device)

    criterion = nn.CrossEntropyLoss(reduction='mean', ignore_index=0)
    optimizer = optimizers.Adam(model.parameters(),
                                lr=0.001,
                                betas=(0.9, 0.999), amsgrad=True)
    
    hist = {'train_loss': [], 'ppl':[]}
    epochs = 1000

    def compute_loss(label, pred):
        return criterion(pred, label)

    def train_step(x, t):
        model.train()
        model.init_hidden()
        preds = model(x)
        loss = compute_loss(t.view(-1),
                            preds.view(-1, preds.size(-1)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        return loss, preds

    for epoch in tqdm(range(epochs)):
        print('-' * 20)
        print('epoch: {}'.format(epoch+1))

        train_loss = 0.
        loss_count = 0
        
        for (x, t) in data_loader:
            x, t = x.to(device), t.to(device)
            loss, _ = train_step(x, t)
            train_loss += loss.item()
            loss_count += 1

        # perplexity
        ppl = np.exp(train_loss / loss_count)    
        train_loss /= len(data_loader)

        print('train_loss: {:.3f}, ppl: {:.3f}'.format(
            train_loss, ppl
        ))
        
        hist["train_loss"].append(train_loss)
        hist["ppl"].append(ppl)

        
        # Save a checkpoint every 20 epochs
        if epoch % 20 == 0:
            model_name = "Destination/embedding{}_v{}.pt".format(EMBEDDING_DIM, epoch)
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': train_loss
            }, model_name)
            logging.info("Saving the checkpoint...")


    torch.save(model.state_dict(), "Destination/embedding{}_v{}.model".format(EMBEDDING_DIM, epoch))
--------------------
epoch: 1
train_loss: 6.726, ppl: 833.451
--------------------
epoch: 2
train_loss: 6.073, ppl: 433.903
--------------------
epoch: 3
train_loss: 6.014, ppl: 409.209
--------------------
epoch: 4
train_loss: 5.904, ppl: 366.649
--------------------
epoch: 5
train_loss: 5.704, ppl: 300.046

・・・
--------------------
epoch: 995
train_loss: 0.078, ppl: 1.081
--------------------
epoch: 996
train_loss: 0.077, ppl: 1.081
--------------------
epoch: 997
train_loss: 0.076, ppl: 1.079
--------------------
epoch: 998
train_loss: 0.077, ppl: 1.080
--------------------
epoch: 999
train_loss: 0.077, ppl: 1.080
--------------------
epoch: 1000
train_loss: 0.077, ppl: 1.080
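
As an aside, the periodic .pt checkpoints saved above store a dict rather than a bare state_dict, so resuming from one of them would look like this (a sketch; the file name is just an example of the pattern used above):

# Sketch: resuming from one of the periodic checkpoints (example file name)
checkpoint = torch.load("Destination/embedding256_v980.pt")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1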

Evaluation

Let's look at how train_loss and perplexity changed over training.

# Visualize the training loss
train_loss = hist['train_loss']

fig = plt.figure(figsize=(10, 5))
plt.plot(range(len(train_loss)), train_loss,
         linewidth=1,
         label='train_loss')
plt.xlabel('epochs')
plt.ylabel('train_loss')
plt.legend()
plt.savefig('output.jpg')
plt.show()

output_26_0.png

ppl = hist['ppl']

fig = plt.figure(figsize=(10, 5))
plt.plot(range(len(ppl)), ppl,
         linewidth=1,
         label='perplexity')
plt.xlabel('epochs')
plt.ylabel('perplexity')
plt.legend()
plt.show()

output_27_0.png

train_loss decreases steadily. Perplexity drops sharply in the early epochs and stays low in the second half, eventually reaching a fairly low value of 1.08. Since it barely changes in the second half, I probably did not need to train for 1000 epochs.

Next, let's actually generate sentences. We pick a seed word and have the model keep guessing the word that follows. Because the model's output probabilities are used as weights in np.random.choice, the generated words change every time the function is run.

def generate_sentence(morphemes, model_path, embedding_dim, hidden_dim, vocab_size, batch_size=1):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = RNNLM(embedding_dim, hidden_dim, vocab_size, batch_size).to(device)
    checkpoint = torch.load(model_path)
    model.load_state_dict(checkpoint)

    model.eval()

    with torch.no_grad():
        for morpheme in morphemes:
            model.init_hidden()
            sentence = [morpheme]
            for _ in range(50):
                input_index = en_de.encode([morpheme])
                input_tensor = torch.tensor([input_index], device=device)
                outputs = model(input_tensor)
                probs = F.softmax(torch.squeeze(outputs), dim=-1)
                p = probs.cpu().detach().numpy()
                morpheme = en_de.i2w[np.random.choice(len(p), p=p)]
                sentence.append(morpheme)
                if morpheme in ["。", "<pad>"]:
                    break
            print("".join(sentence))
            print('-' * 50)


EMBEDDING_DIM = HIDDEN_DIM = 256
VOCAB_SIZE = len(en_de.i2w)
model_path ="Destination/embedding{}_v{}.model"
morphemes = ["Prime Minister", "Governor of Tokyo", "corona", "New modelcoronaウイルス", "New model", "Japan", "Tokyo", "Infected person", "Emergency"]

generate_sentence(morphemes, model_path, EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE)

The prime minister's sprinting change voice is too much to raise the year! "There is no medium fever and important and joint caution. Tell the doctor and the secretary-general is off
--------------------------------------------------
The Governor of Tokyo also asked for it, but even though there was no bed, I thought it was a store store decision What should I hurry out of the new coronavirus infection, related measures with the Liberal Democratic Party New coronavirus infection speed Why 0 years old, practice also announced and reviewed This number apology practice breaks are prohibited Anxiety Past Cremona fever at least
--------------------------------------------------
Psychology that can provide corona benefits The government does not have the scale and required resources nationwide. From the new coronavirus infection
--------------------------------------------------
The new coronavirus fund clerk also reported that he was enthusiastic and responded jointly from the past Tokyo.
--------------------------------------------------
There are no stores announced as new designations, and more than residents (explain ...). "It is reported that the number of Yuriko will decrease. Why is the government calling for an increase?
--------------------------------------------------
The country right next to Japan is calling for a match. Disinfecting T-shirts. Telling how to proceed. Johnny's office Past swords Past hospitalization Joint Feeling that this medical collapse has reduced the feeling.
--------------------------------------------------
From Tokyo or above, from 0 yen to 0 round trips per evaluation, the number of collapses (the timing of wearing at home and the comments requested also decreased There is no prospect of midnight, and the past
--------------------------------------------------
0 round-trip medical institutions from infected people, etc. LINE Tokyo Decrease transition expression A little practice for customers Approximately 0 medical institutions Confirmed at midnight in the past report Tele-East There is no wonderful infection speed Trump administration There is no Komei Party Stores to reduce or foreigners ... "With the new coronavirus infection
--------------------------------------------------
Confirmation of emergencies for newcomers Coping speed Estimated (Thank you for reducing the number of people in the year, disinfecting the number of deaths in the prefecture Issued leader Past issue
--------------------------------------------------

The generated sentences do not really make sense, but the model does seem to have learned a little about where different parts of speech go.
Sentence generation turned out to be difficult. I want to keep studying and improve.
If you have any advice or suggestions, I would appreciate your comments!
