[Python] [Introduction to PyTorch] I want to generate sentences from news articles

Overview

I have collected news articles about the coronavirus and would like to try sentence generation with them. I started studying deep learning with PyTorch during my time at home, so this post is my attempt to put it to use. I'm still learning, so please understand that there may be some mistakes.

Environment

Libraries used, etc.

import torch
import torch.nn as nn
import torch.optim as optimizers
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.utils import shuffle
import random
from tqdm import tqdm_notebook as tqdm
import pickle
import matplotlib.pyplot as plt
import logging
import numpy as np

Data preparation

Load the scraped coronavirus news text, which has already been pre-processed and split into words. The preprocessing is nothing special.
Coronavirus news from roughly the last week was collected on Yahoo News, so I scraped it from there. I didn't notice that it only covered one week, so the amount of data I actually got is far too small. I'm worried whether the model can learn anything useful from this, but let's try it anyway.

data_news = pickle.load(open("Destination/corona_wakati.pickle", "rb"))

The data looks like this:

data_news[0]

['patient', 'Ya', 'Health care workers', 'La', 'But', 'New Coronavirus', 'To', 'infection', 'Shi', 'Ta', 'thing', 'But', 'Revealed', 'Shi', 'Ta', 'Ikuno-ku, Osaka', 'of', '「', 'Nami', 'Ha', 'Ya', 'Rehabilitation', 'hospital', '」', 'about', '、', 'Osaka Prefecture', 'Ha', '0', 'Day', 'Night', '、', 'further', '0', 'Man', 'of', 'infection', 'But', 'clear', 'To', 'Now', 'Ta', 'When', 'Presentation', 'Shi', 'Ta', '。']

Convert words to IDs

Words cannot be fed to a neural network as they are, so we convert them to IDs. Since we need to go back from IDs to words when actually generating sentences, we also implement a decoder.
I wanted the class to be general purpose, so it can add symbols at the beginning and end of a sentence, but this time every sentence already ends with a punctuation mark, so that is not actually needed here.

class EncoderDecoder(object):
    def __init__(self):
        # word_to_id dictionary
        self.w2i = {}
        # id_to_word dictionary
        self.i2w = {}
        # Reserved tokens (padding, beginning/end of sentence, unknown word)
        self.special_chars = ['<pad>', '<s>', '</s>', '<unk>']
        self.bos_char = self.special_chars[1]
        self.eos_char = self.special_chars[2]
        self.oov_char = self.special_chars[3]

    # Make the instance callable
    def __call__(self, sentence):
        return self.transform(sentence)

    # Build the dictionaries
    def fit(self, sentences):
        self._words = set()

        # Collect the set of all words that appear in the data
        for sentence in sentences:
            self._words.update(sentence)

        # Assign IDs, offset by the number of reserved tokens
        self.w2i = {w: (i + len(self.special_chars))
                    for i, w in enumerate(self._words)}

        # Add the reserved tokens to the dictionary (<pad>:0, <s>:1, </s>:2, <unk>:3)
        for i, w in enumerate(self.special_chars):
            self.w2i[w] = i

        # Build the id_to_word dictionary from word_to_id
        self.i2w = {i: w for w, i in self.w2i.items()}

    # Convert all sentences to ID sequences at once
    def transform(self, sentences, bos=False, eos=False):
        output = []
        #Add start and end symbols if specified
        for sentence in sentences:
            if bos:
                sentence = [self.bos_char] + sentence
            if eos:
                sentence = sentence + [self.eos_char]
            output.append(self.encode(sentence))

        return output

    # Convert a single sentence to IDs
    def encode(self, sentence):
        output = []
        for w in sentence:
            if w not in self.w2i:
                idx = self.w2i[self.oov_char]
            else:
                idx = self.w2i[w]
            output.append(idx)

        return output

    # Convert an ID sequence back into a list of words
    def decode(self, sentence):
        return [self.i2w[id] for id in sentence]

Use the defined class as follows

en_de = EncoderDecoder()
en_de.fit(data_news)
data_news_id = en_de(data_news)
data_news_id[0]
[7142,
 5775,
 3686,
 4630,
 5891,
 4003,
 358,
 3853,
 4139,
 4604,
 4591,
 5891,
 2233,
 4139,
 4604,
 5507,
 7378,
 2222,
 6002,
 3277,
 5775,
 7380,
 7234,
 5941,
 5788,
 2982,
 4901,
 3277,
 6063,
 5812,
 4647,
 2982,
 1637,
 6063,
 6125,
 7378,
 3853,
 5891,
 1071,
 358,
 7273,
 4604,
 5835,
 1328,
 4139,
 4604,
 1226]

Decoding returns the original sentence:

en_de.decode(data_news_id[0])

['patient', 'Ya', 'Health care workers', 'La', 'But', 'New Coronavirus', 'To', 'infection', 'Shi', 'Ta', 'thing', 'But', 'Revealed', 'Shi', 'Ta', 'Ikuno-ku, Osaka', 'of', '「', 'Nami', 'Ha', 'Ya', 'Rehabilitation', 'hospital', '」', 'about', '、', 'Osaka Prefecture', 'Ha', '0', 'Day', 'Night', '、', 'further', '0', 'Man', 'of', 'infection', 'But', 'clear', 'To', 'Now', 'Ta', 'When', 'Presentation', 'Shi', 'Ta', '。']

Create data and labels

In this sentence-generation task, the model learns as shown in the image below: the label is simply the input sequence shifted by one word. This time I create a custom PyTorch Dataset that builds the data and labels internally.
Each sequence is also padded with 0 up to a specified length so that all sequences have the same length, and then returned as a LongTensor.
By the way, keras's pad_sequences is used for the padding. PyTorch provides a similar utility, but it does not let you specify the target length to pad to, so I use the keras one here (see the pure-PyTorch sketch after the Dataset example below).
blog_rnnlm.png

class MyDataset(torch.utils.data.Dataset):

    def __init__(self, data, max_length=50):
        self.data_num = len(data)
        # The input is the sequence without its last word;
        # the label is the same sequence shifted by one word
        self.x = [d[:-1] for d in data]
        self.y = [d[1:] for d in data]
        # Length to pad every sequence to
        self.max_length = max_length

    def __len__(self):
        return self.data_num

    def __getitem__(self, idx):

        out_data = self.x[idx]
        out_label =  self.y[idx]

        #Pad to match length
        out_data = pad_sequences([out_data], padding='post', maxlen=self.max_length)[0]
        out_label = pad_sequences([out_label], padding='post', maxlen=self.max_length)[0]

        #Convert to LongTensor type
        out_data = torch.LongTensor(out_data)
        out_label = torch.LongTensor(out_label)

        return out_data, out_label

dataset = MyDataset(data_news_id, max_length=50)
dataset[0]
(tensor([7142, 5775, 3686, 4630, 5891, 4003,  358, 3853, 4139, 4604, 4591, 5891,
         2233, 4139, 4604, 5507, 7378, 2222, 6002, 3277, 5775, 7380, 7234, 5941,
         5788, 2982, 4901, 3277, 6063, 5812, 4647, 2982, 1637, 6063, 6125, 7378,
         3853, 5891, 1071,  358, 7273, 4604, 5835, 1328, 4139, 4604,    0,    0,
            0,    0]),
 tensor([5775, 3686, 4630, 5891, 4003,  358, 3853, 4139, 4604, 4591, 5891, 2233,
         4139, 4604, 5507, 7378, 2222, 6002, 3277, 5775, 7380, 7234, 5941, 5788,
         2982, 4901, 3277, 6063, 5812, 4647, 2982, 1637, 6063, 6125, 7378, 3853,
         5891, 1071,  358, 7273, 4604, 5835, 1328, 4139, 4604, 1226,    0,    0,
            0,    0]))
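
As an aside, the same fixed-length padding can be written without keras. Here is a minimal sketch in plain PyTorch (pad_to_length is a hypothetical helper, not part of the original code):

import torch

# Hypothetical helper: zero-pad (or truncate) an ID sequence to a fixed length,
# padding at the end, like pad_sequences(..., padding='post')
def pad_to_length(ids, max_length=50, pad_id=0):
    ids = ids[:max_length]
    return torch.LongTensor(ids + [pad_id] * (max_length - len(ids)))

pad_to_length([7142, 5775, 3686], max_length=5)
# tensor([7142, 5775, 3686,    0,    0])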

Batching with DataLoader

Finally, split the data into batches with PyTorch's DataLoader. If the number of samples is not divisible by the batch size, the last batch ends up with a different size, so set drop_last=True.

data_loader = DataLoader(dataset, batch_size=50, drop_last=True)

Check only the first batch

for (x, y) in data_loader:
    print("x_dim: {}, y_dim: {}".format(x.shape, y.shape))
    break
x_dim: torch.Size([50, 50]), y_dim: torch.Size([50, 50])

Modeling / learning

It is a little confusing here because the batch size and the sequence length happen to be the same (both 50). With the data pipeline above, each batch has the shape (batch size, sequence length, input dimension), treating a sentence as a time series of words, whereas PyTorch's LSTM defaults to (sequence length, batch size, input dimension), so batch_first=True must be specified. Other than that, there is nothing special to mention.
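
As a quick sanity check of this shape convention (a standalone illustration, not part of the original pipeline), an LSTM built with batch_first=True takes input of shape (batch, sequence, features):

import torch
import torch.nn as nn

# Dummy dimensions, chosen only for illustration
lstm = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)
dummy = torch.randn(50, 50, 256)   # (batch_size, seq_len, embedding_dim)
out, (h, c) = lstm(dummy)
print(out.shape)                   # torch.Size([50, 50, 256]) = (batch_size, seq_len, hidden_dim)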

class RNNLM(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, batch_size=100, num_layers=1, device="cuda"):
        super().__init__()
        self.num_layers = num_layers
        self.batch_size = batch_size
        self.hidden_dim = hidden_dim
        self.device = device

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.dropout1 = nn.Dropout(0.5)
        self.lstm1 = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, num_layers=self.num_layers)
        self.dropout2 = nn.Dropout(0.5)
        self.lstm2 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, num_layers=self.num_layers)
        self.dropout3 = nn.Dropout(0.5)
        self.lstm3 = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, num_layers=self.num_layers)
        self.linear = nn.Linear(hidden_dim, vocab_size)

        nn.init.xavier_normal_(self.lstm1.weight_ih_l0)
        nn.init.orthogonal_(self.lstm1.weight_hh_l0)
        nn.init.xavier_normal_(self.lstm2.weight_ih_l0)
        nn.init.orthogonal_(self.lstm2.weight_hh_l0)
        nn.init.xavier_normal_(self.lstm3.weight_ih_l0)
        nn.init.orthogonal_(self.lstm3.weight_hh_l0)
        nn.init.xavier_normal_(self.linear.weight)
        

    def init_hidden(self):
        self.hidden_state = (torch.zeros(self.num_layers, self.batch_size, self.hidden_dim, device=self.device), torch.zeros(self.num_layers, self.batch_size, self.hidden_dim, device=self.device))

    def forward(self, x):
        # x: (batch, seq_len) -> (batch, seq_len, embedding_dim)
        x = self.embedding(x)
        x = self.dropout1(x)
        # each LSTM returns (batch, seq_len, hidden_dim) because batch_first=True
        h, self.hidden_state = self.lstm1(x, self.hidden_state)
        h = self.dropout2(h)
        h, self.hidden_state = self.lstm2(h, self.hidden_state)
        h = self.dropout3(h)
        h, self.hidden_state = self.lstm3(h, self.hidden_state)
        # project onto the vocabulary: (batch, seq_len, vocab_size)
        y = self.linear(h)
        return y
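
Before training, a quick way to confirm that the model produces the expected output shape (a standalone check, reusing the x batch from the DataLoader check above and running on the CPU for simplicity):

VOCAB_SIZE = len(en_de.i2w)
model = RNNLM(256, 256, VOCAB_SIZE, batch_size=50, device="cpu")
model.init_hidden()
print(model(x).shape)   # torch.Size([50, 50, VOCAB_SIZE]) = (batch_size, seq_len, vocab_size)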

Training

The hyperparameters are chosen fairly arbitrarily. Sorry about that.
Since the amount of data is small, I increased the number of epochs, ran many iterations, and saved checkpoints frequently.
In other deep learning tasks you might use validation data to check for overfitting and stop training early, but a task like this one, which outputs a probability distribution, is hard to evaluate quantitatively. Here I track training progress and evaluate the model with a metric called perplexity. Perplexity looks a little complicated when written as a formula, but intuitively it is the reciprocal of the predicted probability and can be read as a branching factor. In other words, for a next-word prediction task like this one, a perplexity of 2 means the model has effectively narrowed the next word down to 2 choices.
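
For reference, perplexity here is just the exponential of the average cross-entropy loss per predicted word, which is exactly what np.exp(train_loss / loss_count) computes in the training loop below:

import numpy as np

# Perplexity from an average cross-entropy loss (natural log base).
# e.g. the final mean loss of about 0.077 corresponds to a perplexity of about 1.08
def perplexity(mean_cross_entropy):
    return np.exp(mean_cross_entropy)

perplexity(0.077)   # ~1.080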

if __name__ == '__main__':
    np.random.seed(123)
    torch.manual_seed(123)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    EMBEDDING_DIM = HIDDEN_DIM = 256
    VOCAB_SIZE = len(en_de.i2w)
    BATCH_SIZE=50

    model = RNNLM(EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE, batch_size=BATCH_SIZE).to(device)

    criterion = nn.CrossEntropyLoss(reduction='mean', ignore_index=0)
    optimizer = optimizers.Adam(model.parameters(),
                                lr=0.001,
                                betas=(0.9, 0.999), amsgrad=True)
    
    hist = {'train_loss': [], 'ppl':[]}
    epochs = 1000

    def compute_loss(label, pred):
        return criterion(pred, label)

    def train_step(x, t):
        model.train()
        model.init_hidden()
        preds = model(x)
        loss = compute_loss(t.view(-1),
                            preds.view(-1, preds.size(-1)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        return loss, preds

    for epoch in tqdm(range(epochs)):
        print('-' * 20)
        print('epoch: {}'.format(epoch+1))

        train_loss = 0.
        loss_count = 0
        
        for (x, t) in data_loader:
            x, t = x.to(device), t.to(device)
            loss, _ = train_step(x, t)
            train_loss += loss.item()
            loss_count += 1

        # perplexity
        ppl = np.exp(train_loss / loss_count)    
        train_loss /= len(data_loader)

        print('train_loss: {:.3f}, ppl: {:.3f}'.format(
            train_loss, ppl
        ))
        
        hist["train_loss"].append(train_loss)
        hist["ppl"].append(ppl)

        
        # Save a checkpoint every 20 epochs
        if epoch % 20 == 0:
            model_name = "Destination/embedding{}_v{}.pt".format(EMBEDDING_DIM, epoch)
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': train_loss
            }, model_name)
            logging.info("Saving the checkpoint...")


    torch.save(model.state_dict(), "Destination/embedding{}_v{}.model".format(EMBEDDING_DIM, epoch))
--------------------
epoch: 1
train_loss: 6.726, ppl: 833.451
--------------------
epoch: 2
train_loss: 6.073, ppl: 433.903
--------------------
epoch: 3
train_loss: 6.014, ppl: 409.209
--------------------
epoch: 4
train_loss: 5.904, ppl: 366.649
--------------------
epoch: 5
train_loss: 5.704, ppl: 300.046

・・・
--------------------
epoch: 995
train_loss: 0.078, ppl: 1.081
--------------------
epoch: 996
train_loss: 0.077, ppl: 1.081
--------------------
epoch: 997
train_loss: 0.076, ppl: 1.079
--------------------
epoch: 998
train_loss: 0.077, ppl: 1.080
--------------------
epoch: 999
train_loss: 0.077, ppl: 1.080
--------------------
epoch: 1000
train_loss: 0.077, ppl: 1.080
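
As an aside, the periodic .pt checkpoints saved above store a dict rather than a bare state_dict, so resuming from one of them would look like this (a sketch; the file name is just an example of the pattern used above):

# Sketch: resuming from one of the periodic checkpoints (example file name)
checkpoint = torch.load("Destination/embedding256_v980.pt")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1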

Evaluation

Let's look at how train_loss and perplexity changed over training.

# Visualize the training loss
train_loss = hist['train_loss']

fig = plt.figure(figsize=(10, 5))
plt.plot(range(len(train_loss)), train_loss,
         linewidth=1,
         label='train_loss')
plt.xlabel('epochs')
plt.ylabel('train_loss')
plt.legend()
plt.savefig('output.jpg')
plt.show()

output_26_0.png

ppl = hist['ppl']

fig = plt.figure(figsize=(10, 5))
plt.plot(range(len(ppl)), ppl,
         linewidth=1,
         label='perplexity')
plt.xlabel('epochs')
plt.ylabel('perplexity')
plt.legend()
plt.show()

output_27_0.png

train_loss decreases steadily. Perplexity drops sharply in the early epochs and stays low in the second half, eventually reaching a fairly low value of 1.08. Since it barely changes in the second half, I probably did not need to train for 1000 epochs.

Next, let's actually generate sentences. We pick a seed word and have the model keep guessing the word that follows. Because the model's output probabilities are used as weights in np.random.choice, the generated words change every time the function is run.

def generate_sentence(morphemes, model_path, embedding_dim, hidden_dim, vocab_size, batch_size=1):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = RNNLM(embedding_dim, hidden_dim, vocab_size, batch_size).to(device)
    checkpoint = torch.load(model_path)
    model.load_state_dict(checkpoint)

    model.eval()

    with torch.no_grad():
        for morpheme in morphemes:
            model.init_hidden()
            sentence = [morpheme]
            for _ in range(50):
                input_index = en_de.encode([morpheme])
                input_tensor = torch.tensor([input_index], device=device)
                outputs = model(input_tensor)
                probs = F.softmax(torch.squeeze(outputs), dim=-1)
                p = probs.cpu().detach().numpy()
                morpheme = en_de.i2w[np.random.choice(len(p), p=p)]
                sentence.append(morpheme)
                if morpheme in ["。", "<pad>"]:
                    break
            print("".join(sentence))
            print('-' * 50)


EMBEDDING_DIM = HIDDEN_DIM = 256
VOCAB_SIZE = len(en_de.i2w)
model_path ="Destination/embedding{}_v{}.model"
morphemes = ["Prime Minister", "Governor of Tokyo", "corona", "New modelcoronaウイルス", "New model", "Japan", "Tokyo", "Infected person", "Emergency"]

generate_sentence(morphemes, model_path, EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE)

The prime minister's sprinting change voice is too much to raise the year! "There is no medium fever and important and joint caution. Tell the doctor and the secretary-general is off
--------------------------------------------------
The Governor of Tokyo also asked for it, but even though there was no bed, I thought it was a store store decision What should I hurry out of the new coronavirus infection, related measures with the Liberal Democratic Party New coronavirus infection speed Why 0 years old, practice also announced and reviewed This number apology practice breaks are prohibited Anxiety Past Cremona fever at least
--------------------------------------------------
Psychology that can provide corona benefits The government does not have the scale and required resources nationwide. From the new coronavirus infection
--------------------------------------------------
The new coronavirus fund clerk also reported that he was enthusiastic and responded jointly from the past Tokyo.
--------------------------------------------------
There are no stores announced as new designations, and more than residents (explain ...). "It is reported that the number of Yuriko will decrease. Why is the government calling for an increase?
--------------------------------------------------
The country right next to Japan is calling for a match. Disinfecting T-shirts. Telling how to proceed. Johnny's office Past swords Past hospitalization Joint Feeling that this medical collapse has reduced the feeling.
--------------------------------------------------
From Tokyo or above, from 0 yen to 0 round trips per evaluation, the number of collapses (the timing of wearing at home and the comments requested also decreased There is no prospect of midnight, and the past
--------------------------------------------------
0 round-trip medical institutions from infected people, etc. LINE Tokyo Decrease transition expression A little practice for customers Approximately 0 medical institutions Confirmed at midnight in the past report Tele-East There is no wonderful infection speed Trump administration There is no Komei Party Stores to reduce or foreigners ... "With the new coronavirus infection
--------------------------------------------------
Confirmation of emergencies for newcomers Coping speed Estimated (Thank you for reducing the number of people in the year, disinfecting the number of deaths in the prefecture Issued leader Past issue
--------------------------------------------------

The generated sentences do not really make sense, but the model does seem to have learned a little about where different parts of speech go.
Sentence generation turned out to be difficult. I want to keep studying and improve.
If you have any advice or suggestions, I would appreciate your comments!
