100 Language Processing Knock 2020 was released the other day. I have only been working in natural language processing for about a year and there is a lot I don't know, but I will solve all the problems and publish my solutions to improve my technical skills.

Everything is run in a Jupyter notebook, and I bend the constraints of the problem statements where convenient. The source code is also on GitHub.

Chapter 8 is here.

The environment is Python 3.8.2 and Ubuntu 18.04.

Chapter 9: RNN, CNN

As in Chapter 8, I use PyTorch.

80. Converting to ID numbers

Assign a unique ID number to each word in the training data built in Problem 51. For words that appear at least twice in the training data, give ID 1 to the most frequent word, ID 2 to the second most frequent word, and so on. Then implement a function that returns the sequence of ID numbers for a given word sequence. All words that appear fewer than twice get ID 0.

`code`


import re
import spacy
import torch

Prepare the spaCy model and the labels.

`code`


# 'en' is a spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
categories = ['b', 't', 'e', 'm']
category_names = ['business', 'science and technology', 'entertainment', 'health']

Read the files and tokenize them with spaCy.

`code`


def tokenize(x):
    x = re.sub(r'\s+', ' ', x)
    x = nlp.make_doc(x)
    x = [d.text for d in x]
    return x

def read_feature_dataset(filename):
    with open(filename) as f:
        dataset = f.read().splitlines()
    dataset = [line.split('\t') for line in dataset]
    dataset_t = [categories.index(line[0]) for line in dataset]
    dataset_x = [tokenize(line[1]) for line in dataset]
    return dataset_x, torch.tensor(dataset_t, dtype=torch.long)

`code`


train_x, train_t = read_feature_dataset('data/train.txt')
valid_x, valid_t = read_feature_dataset('data/valid.txt')
test_x, test_t = read_feature_dataset('data/test.txt')

Extract the vocabulary. Only words that appear at least twice are kept.

`code`


from collections import Counter

`code`


counter = Counter([
    x
    for sent in train_x
    for x in sent
])

vocab_in_train = [
    token
    for token, freq in counter.most_common()
    if freq > 1
]
len(vocab_in_train)

`output`

Convert a word sequence into a sequence of ID numbers.

`code`


vocab_list = ['[UNK]'] + vocab_in_train
vocab_dict = {x:n for n, x in enumerate(vocab_list)}

`code`


def sent_to_ids(sent):
    return torch.tensor([vocab_dict[x if x in vocab_dict else '[UNK]'] for x in sent], dtype=torch.long)

Let's convert the first sentence of the training data to ID numbers.

`code`


print(train_x[0])
print(sent_to_ids(train_x[0]))

`output`


['Kathleen', 'Sebelius', "'", 'LGBT', 'legacy']
tensor([   0,    0,    2, 2648,    0])
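
The same ID assignment can be checked by hand on a tiny corpus. The following is a minimal sketch using only the standard library; `toy_train`, `word_to_id`, and `to_ids` are hypothetical names for illustration, and the IDs start at 1 with 0 reserved for rare or unseen words, as the problem statement asks.

```python
from collections import Counter

# Toy corpus standing in for train_x (hypothetical data, not from the article).
toy_train = [["a", "b", "a"], ["b", "c", "a"]]

# Count every token, then keep tokens seen at least twice, most frequent first.
counter = Counter(tok for sent in toy_train for tok in sent)
vocab = [tok for tok, freq in counter.most_common() if freq > 1]

# IDs start at 1; ID 0 is reserved for rare or unseen words.
word_to_id = {tok: i for i, tok in enumerate(vocab, start=1)}


def to_ids(sent):
    """Map a token list to a list of IDs, falling back to 0."""
    return [word_to_id.get(tok, 0) for tok in sent]


print(to_ids(["a", "c", "b", "z"]))  # [1, 0, 2, 0]
```

Here "a" appears three times and "b" twice, so they get IDs 1 and 2, while "c" (once) and "z" (never) both map to 0.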

Convert every dataset into sequences of ID numbers.

`code`


def dataset_to_ids(dataset):
    return [sent_to_ids(x) for x in dataset]

`code`


train_s = dataset_to_ids(train_x)
valid_s = dataset_to_ids(valid_x)
test_s = dataset_to_ids(test_x)

train_s[:3]

`output`


[tensor([   0,    0,    2, 2648,    0]),
 tensor([   9, 6740, 1445, 2076,  583,   10,  547,   32,   51,  873, 6741]),
 tensor([   0,  205, 4198,  315, 1899, 1232,    0])]

81. Prediction with an RNN

There is a word sequence $\boldsymbol{x} = (x_1, x_2, \dots, x_T)$ represented by ID numbers. Here, $T$ is the length of the word sequence, and $x_t \in \mathbb{R}^{V}$ is the one-hot representation of a word's ID number ($V$ is the total number of words). Using a recurrent neural network (RNN), implement the following equation as a model that predicts the category $y$ from the word sequence $\boldsymbol{x}$.

$\overrightarrow{h}_0 = 0, \ \overrightarrow{h}_t = \overrightarrow{\mathrm{RNN}}(\mathrm{emb}(x_t), \overrightarrow{h}_{t-1}), \ y = \mathrm{softmax}(W^{(yh)} \overrightarrow{h}_T + b^{(y)})$

Here, $\mathrm{emb}(x) \in \mathbb{R}^{d_w}$ is the word embedding (a function that converts a word from its one-hot representation to a word vector), $\overrightarrow{h}_t \in \mathbb{R}^{d_h}$ is the hidden state vector at time $t$, $\overrightarrow{\mathrm{RNN}}(x, h)$ is the RNN unit that computes the next state from the input $x$ and the hidden state $h$ of the previous time step, $W^{(yh)} \in \mathbb{R}^{L \times d_h}$ is the matrix for predicting the category from the hidden state vector, and $b^{(y)} \in \mathbb{R}^{L}$ is a bias term ($d_w, d_h, L$ are the number of word embedding dimensions, the number of hidden state dimensions, and the number of labels, respectively). The RNN unit $\overrightarrow{\mathrm{RNN}}(x, h)$ can have various configurations, and the following equation is a typical example.

$\overrightarrow{\mathrm{RNN}}(x, h) = g(W^{(hx)} x + W^{(hh)} h + b^{(h)})$

Here, $W^{(hx)} \in \mathbb{R}^{d_h \times d_w}$, $W^{(hh)} \in \mathbb{R}^{d_h \times d_h}$, and $b^{(h)} \in \mathbb{R}^{d_h}$ are the parameters of the RNN unit, and $g$ is the activation function (for example, $\tanh$ or ReLU). In this problem the parameters are not trained; it is enough to compute $y$ with randomly initialized parameters.
Hyperparameters such as the number of dimensions should be set to appropriate values such as $d_w = 300, d_h = 50$ (the same applies to the following problems).
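
To make the recurrence in the problem statement concrete, here is a minimal pure-Python sketch of the simple RNN unit with $g = \tanh$ and randomly initialized parameters. It is only an illustration of the equations, not the PyTorch model used in this article; the helper names (`matvec`, `rand_matrix`, `rnn_predict`) and the tiny sizes are assumptions chosen for readability.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries drawn uniformly from [-0.1, 0.1]."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rnn_predict(ids, emb, W_hx, W_hh, b_h, W_yh, b_y):
    h = [0.0] * len(b_h)  # h_0 = 0
    for i in ids:
        x = emb[i]  # a row lookup is the same as one-hot times the embedding matrix
        pre = [p + q + r for p, q, r in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(v) for v in pre]  # h_t = g(W_hx x + W_hh h_{t-1} + b_h)
    logits = [p + q for p, q in zip(matvec(W_yh, h), b_y)]
    return softmax(logits)  # y = softmax(W_yh h_T + b_y)


rng = random.Random(0)
V, d_w, d_h, L = 10, 4, 3, 4  # toy sizes, far smaller than d_w = 300, d_h = 50
emb = rand_matrix(V, d_w, rng)
W_hx, W_hh = rand_matrix(d_h, d_w, rng), rand_matrix(d_h, d_h, rng)
W_yh = rand_matrix(L, d_h, rng)
b_h, b_y = [0.0] * d_h, [0.0] * L

y = rnn_predict([3, 1, 4], emb, W_hx, W_hh, b_h, W_yh, b_y)
print(y)  # four probabilities that sum to 1
```

With untrained parameters the probabilities are close to uniform, which matches the near-constant predictions one would expect before training.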

Unlike Chapter 8, the input length differs from sentence to sentence. I use the utilities in torch.nn.utils.rnn to pad the end of each variable-length sequence so that a batch can be handled as a single tensor.

`code`


import random as rd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence as pad
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as unpack

Create a Dataset class that holds the dataset. In addition to the input sentences `source` and the target labels `target`, it keeps the lengths of the input sentences, `lengths`, as a member.

`code`


class Dataset(torch.utils.data.Dataset):
    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.lengths = torch.tensor([len(x) for x in source])
        self.size = len(source)
    
    def __len__(self):
        return self.size
            
    def __getitem__(self, index):
        return {
            'src':self.source[index],
            'trg':self.target[index],
            'lengths':self.lengths[index],
        }
    
    def collate(self, xs):
        return {
            'src':pad([x['src'] for x in xs]),
            'trg':torch.stack([x['trg'] for x in xs], dim=-1),
            'lengths':torch.stack([x['lengths'] for x in xs], dim=-1)
        }
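
The `collate` above relies on `pad_sequence`, which with its default `batch_first=False` returns a time-major tensor of shape (T, B) filled with zeros where a sequence has ended. The following pure-Python sketch shows only that layout; `pad_time_major` is a hypothetical stand-in written for illustration and is not part of PyTorch.

```python
def pad_time_major(seqs, pad_value=0):
    """Return a T x B nested list: row t holds the t-th token of every sequence."""
    max_len = max(len(s) for s in seqs)
    return [
        [s[t] if t < len(s) else pad_value for s in seqs]
        for t in range(max_len)
    ]


batch = [[5, 6, 7], [8, 9], [4]]
lengths = [len(s) for s in batch]
padded = pad_time_major(batch)

print(lengths)  # [3, 2, 1]
print(padded)   # [[5, 8, 4], [6, 9, 0], [7, 0, 0]]
```

Each column is one sentence and each row is one time step, which is the layout `nn.LSTM` expects by default.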

Prepare the datasets.

`code`


train_dataset = Dataset(train_s, train_t)
valid_dataset = Dataset(valid_s, valid_t)
test_dataset = Dataset(test_s, test_t)

Define the same Sampler as in Chapter 8.

`code`


class Sampler(torch.utils.data.Sampler):
    def __init__(self, dataset, width, shuffle = False):
        self.dataset = dataset
        self.width = width
        self.shuffle = shuffle
        if not shuffle:
            self.indices = torch.arange(len(dataset))
            
    def __iter__(self):
        if self.shuffle:
            self.indices = torch.randperm(len(self.dataset))
        index = 0
        while index < len(self.dataset):
            yield self.indices[index : index + self.width]
            index += self.width

pack_padded_sequence is easiest to use when the sequences in a batch are sorted by length in descending order, so I define a sampler that guarantees this.

All it has to do is sort the indices by length in descending order up front and fill the batches from the front.

`code`


class DescendingSampler(Sampler):
    def __init__(self, dataset, width, shuffle = False):
        assert not shuffle
        super().__init__(dataset, width, shuffle)
        self.indices = self.indices[self.dataset.lengths[self.indices].argsort(descending=True)]
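
The idea behind this sampler can be sketched without PyTorch: sort the indices by sequence length in descending order, then cut them into fixed-size chunks. `descending_batches` below is a hypothetical helper for illustration, using plain lists instead of tensors.

```python
def descending_batches(lengths, width):
    """Sort indices by length (longest first) and split them into chunks of `width`."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    return [order[i:i + width] for i in range(0, len(order), width)]


lengths = [2, 5, 1, 4, 2, 4]
batches = descending_batches(lengths, 4)
print(batches)  # [[1, 3, 5, 0], [4, 2]]
```

Within every batch the lengths never increase, which is the ordering that packing expects by default.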

Also, during training, less padding in a batch means less wasted computation and faster training, so I implement a batching scheme that packs sequences that way. The two samplers above split the indices by a fixed number of examples per batch, whereas the next one caps the number of tokens per batch (padding included), so the number of examples per batch is not constant.

`code`


class MaxTokensSampler(Sampler):
    def __iter__(self):
        self.indices = torch.randperm(len(self.dataset))
        self.indices = self.indices[self.dataset.lengths[self.indices].argsort(descending=True)]
        for batch in self.generate_batches():
            yield batch

    def generate_batches(self):
        batches = []
        batch = []
        acc = 0
        max_len = 0
        for index in self.indices:
            acc += 1
            this_len = self.dataset.lengths[index]
            max_len = max(max_len, this_len)
            if acc * max_len > self.width:
                batches.append(batch)
                batch = [index]
                acc = 1
                max_len = this_len
            else:
                batch.append(index)
        if batch != []:
            batches.append(batch)
        rd.shuffle(batches)
        return batches
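
The greedy rule used here (count a batch's cost as number of examples times the longest example, and start a new batch once that exceeds the budget) can be tried on toy data. This is a simplified sketch rather than the sampler itself: `max_token_batches` is a hypothetical name, it skips the random shuffling, and it adds a guard so that an empty batch is never emitted.

```python
def max_token_batches(lengths, max_tokens):
    """Group indices so that len(batch) * longest_in_batch stays within max_tokens."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, batch, max_len = [], [], 0
    for idx in order:
        new_max = max(max_len, lengths[idx])
        # Padded cost if idx joined the current batch.
        if batch and (len(batch) + 1) * new_max > max_tokens:
            batches.append(batch)
            batch, max_len = [idx], lengths[idx]
        else:
            batch.append(idx)
            max_len = new_max
    if batch:
        batches.append(batch)
    return batches


lengths = [2, 5, 1, 4, 2, 4]
batches = max_token_batches(lengths, 8)
print(batches)  # [[1], [3, 5], [0, 4, 2]]
```

The long sentence ends up alone, while the short ones are grouped three to a batch, so every batch costs at most 8 padded tokens.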

Prepare functions that create DataLoaders.

`code`


def gen_loader(dataset, width, sampler=Sampler, shuffle=False, num_workers=8):
    return torch.utils.data.DataLoader(
        dataset, 
        batch_sampler = sampler(dataset, width, shuffle),
        collate_fn = dataset.collate,
        num_workers = num_workers,
    )

def gen_descending_loader(dataset, width, num_workers=0):
    return gen_loader(dataset, width, sampler = DescendingSampler, shuffle = False, num_workers = num_workers)

def gen_maxtokens_loader(dataset, width, num_workers=0):
    return gen_loader(dataset, width, sampler = MaxTokensSampler, shuffle = True, num_workers = num_workers)

Define a one-layer unidirectional LSTM classifier. Packing a padded tensor requires the length of each sentence, which is why the dataset's collate also returns the lengths.

`code`


class LSTMClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, num_layers = 1)
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        self.embed.weight.data.uniform_(-0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                param.data.uniform_(-0.1, 0.1)
        self.out.weight.data.uniform_(-0.1, 0.1)
    
    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        # Recent PyTorch versions require the lengths to be a CPU tensor.
        x = pack(x, batch['lengths'].cpu())
        x, (h, c) = self.rnn(x, h)
        h = self.out(h)
        return h.squeeze(0)
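
What packing buys can be seen from the `batch_sizes` that a packed sequence carries: for each time step, the number of sequences that are still running. The sketch below computes that list in plain Python for lengths sorted in descending order; `packed_batch_sizes` is a hypothetical helper for illustration, not a PyTorch function.

```python
def packed_batch_sizes(lengths):
    """For descending lengths, count how many sequences are still active at each step."""
    assert all(a >= b for a, b in zip(lengths, lengths[1:])), "sort lengths descending"
    return [sum(1 for n in lengths if n > t) for t in range(lengths[0])]


sizes = packed_batch_sizes([5, 3, 2])
print(sizes)       # [3, 3, 2, 1, 1]
print(sum(sizes))  # 10
```

The RNN processes only 10 positions here instead of the 15 in the padded 5 x 3 tensor, and the final hidden state of each sequence is taken at its true last token rather than at a padding position.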

Predict.

`code`


model = LSTMClassifier(len(vocab_dict), 300, 50, 4)
loader = gen_loader(test_dataset, 10, DescendingSampler, False)
model(next(iter(loader))).argmax(dim=-1)

`output`


tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

82. Training with stochastic gradient descent

Train the model built in Problem 81 using stochastic gradient descent (SGD). Train the model while displaying the loss and accuracy on the training data and on the evaluation data, and stop on a suitable criterion (for example, after 10 epochs).

All that is needed is to define Task and Trainer and run the training.

`code`


class Task:
    def __init__(self):
        self.criterion = nn.CrossEntropyLoss()
    
    def train_step(self, model, batch):
        model.zero_grad()
        loss = self.criterion(model(batch), batch['trg'])
        loss.backward()
        return loss.item()
    
    def valid_step(self, model, batch):
        with torch.no_grad():
            loss = self.criterion(model(batch), batch['trg'])
        return loss.item()

`code`


class Trainer:
    def __init__(self, model, loaders, task, optimizer, max_iter, device = None):
        self.model = model
        self.model.to(device)
        self.train_loader, self.valid_loader = loaders
        self.task = task
        self.optimizer = optimizer
        self.max_iter = max_iter
        self.device = device
    
    def send(self, batch):
        for key in batch:
            batch[key] = batch[key].to(self.device)
        return batch
        
    def train_epoch(self):
        self.model.train()
        acc = 0
        for n, batch in enumerate(self.train_loader):
            batch = self.send(batch)
            acc += self.task.train_step(self.model, batch)
            self.optimizer.step()
        return acc / (n + 1)  # n is the last 0-based batch index
            
    def valid_epoch(self):
        self.model.eval()
        acc = 0
        for n, batch in enumerate(self.valid_loader):
            batch = self.send(batch)
            acc += self.task.valid_step(self.model, batch)
        return acc / (n + 1)  # n is the last 0-based batch index
    
    def train(self):
        for epoch in range(self.max_iter):
            train_loss = self.train_epoch()
            valid_loss = self.valid_epoch()
            print('epoch {}, train_loss:{:.5f}, valid_loss:{:.5f}'.format(epoch, train_loss, valid_loss))

Run the training.

`code`


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
loaders = (
    gen_loader(train_dataset, 1),
    gen_loader(valid_dataset, 1),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 3, device)
trainer.train()
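
The optimizer here is SGD with Nesterov momentum. As a rough illustration of what one update does, the sketch below applies the rule as I understand PyTorch implements it (with dampening 0 and no weight decay) to the one-dimensional function $f(x) = x^2$. `sgd_nesterov_step` is a hypothetical helper, and the learning rate and starting point are arbitrary.

```python
def sgd_nesterov_step(param, grad, buf, lr, momentum):
    """One SGD step with Nesterov momentum (assumed rule: dampening 0, no weight decay)."""
    buf = momentum * buf + grad      # update the momentum buffer
    update = grad + momentum * buf   # Nesterov look-ahead term
    return param - lr * update, buf


x, buf = 5.0, 0.0
for _ in range(200):
    grad = 2.0 * x  # derivative of f(x) = x^2
    x, buf = sgd_nesterov_step(x, grad, buf, lr=0.1, momentum=0.9)

print(abs(x) < 1e-6)  # True
```

The parameter oscillates around the minimum at 0 while shrinking quickly, which is the behavior momentum is meant to provide on a smooth loss.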

Let's try predicting.

`code`


import numpy as np

`code`


class Predictor:
    def __init__(self, model, loader, device=None):
        self.model = model
        self.loader = loader
        self.device = device
        
    def send(self, batch):
        for key in batch:
            batch[key] = batch[key].to(self.device)
        return batch
        
    def infer(self, batch):
        self.model.eval()
        batch = self.send(batch)
        return self.model(batch).argmax(dim=-1).item()
        
    def predict(self):
        lst = []
        for batch in self.loader:
            lst.append(self.infer(batch))
        return lst

`code`


def accuracy(true, pred):
    return np.mean([t == p for t, p in zip(true, pred)])

predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

`output`


Correct answer rate in training data: 0.7592661924372894
Correct answer rate in evaluation data: 0.6384730538922155

Running a few more epochs would improve the accuracy a little, but I'll leave it here.

83. Mini-batches / training on the GPU

Modify the code from Problem 82 so that the loss and gradient are computed for every $B$ examples (choose the value of $B$ as appropriate). Also, run the training on a GPU.

`code`


model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
loaders = (
    gen_maxtokens_loader(train_dataset, 4000),
    gen_descending_loader(valid_dataset, 128),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.2, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

`output`


epoch 0, train_loss:1.22489, valid_loss:1.26302
epoch 1, train_loss:1.11631, valid_loss:1.19404
epoch 2, train_loss:1.07750, valid_loss:1.18451
epoch 3, train_loss:0.96149, valid_loss:1.06748
epoch 4, train_loss:0.81597, valid_loss:0.86547
epoch 5, train_loss:0.74748, valid_loss:0.81049
epoch 6, train_loss:0.80179, valid_loss:0.89621
epoch 7, train_loss:0.60231, valid_loss:0.78494
epoch 8, train_loss:0.52551, valid_loss:0.73272
epoch 9, train_loss:0.97286, valid_loss:1.05034

`code`


predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

`output`


Correct answer rate in training data: 0.7202358667165856
Correct answer rate in evaluation data: 0.6773952095808383

For some reason the loss went back up in the last epoch and the accuracy on the training data dropped, but I'll carry on without worrying about it.

84. Introducing word vectors

Initialize the word embedding $\mathrm{emb}(x)$ with pre-trained word vectors (for example, the [word vectors trained on the Google News dataset (about 100 billion words)](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)) and train the model.

`code`


from gensim.models import KeyedVectors
vectors = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

Initialize the word embedding before training.

`code`


def init_embed(embed):
    for i, token in enumerate(vocab_list):
        if token in vectors:
            embed.weight.data[i] = torch.from_numpy(vectors[token])
    return embed
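
The same pattern (copy a pre-trained vector when the token has one, keep the random initialization otherwise) can be sketched with plain dictionaries. The names `init_rows` and `pretrained` and the two-dimensional vectors are made up for illustration; counting the hits is an extra step that is not in the code above, added because it shows how much of the vocabulary the pre-trained vectors cover.

```python
import random


def init_rows(vocab, pretrained, dim, rng):
    """Build one embedding row per token, copying pre-trained vectors where available."""
    rows, hits = [], 0
    for tok in vocab:
        if tok in pretrained:
            rows.append(list(pretrained[tok]))
            hits += 1
        else:
            # No pre-trained vector: fall back to a small random vector.
            rows.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return rows, hits


vocab = ["[UNK]", "market", "xyzzy"]   # toy vocabulary
pretrained = {"market": [1.0, 2.0]}    # toy stand-in for the word2vec vectors
rows, hits = init_rows(vocab, pretrained, 2, random.Random(0))

print(rows[1])           # [1.0, 2.0]
print(hits, len(vocab))  # 1 3
```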

`code`


model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
init_embed(model.embed)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

`output`


epoch 0, train_loss:1.21390, valid_loss:1.19333
epoch 1, train_loss:0.88751, valid_loss:0.74930
epoch 2, train_loss:0.57240, valid_loss:0.65822
epoch 3, train_loss:0.50240, valid_loss:0.62686
epoch 4, train_loss:0.45800, valid_loss:0.59535
epoch 5, train_loss:0.44051, valid_loss:0.55849
epoch 6, train_loss:0.38251, valid_loss:0.51837
epoch 7, train_loss:0.35731, valid_loss:0.47709
epoch 8, train_loss:0.30278, valid_loss:0.43797
epoch 9, train_loss:0.25518, valid_loss:0.41287

`code`


predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

`output`


Correct answer rate in training data: 0.925028079371022
Correct answer rate in evaluation data: 0.8839820359281437

85. Bidirectional RNN / multiple layers

Encode the input text using both a forward and a backward RNN and train the model.

$\overleftarrow{h}_{T+1} = 0, \ \overleftarrow{h}_t = \overleftarrow{\mathrm{RNN}}(\mathrm{emb}(x_t), \overleftarrow{h}_{t+1}), \ y = \mathrm{softmax}(W^{(yh)} [\overrightarrow{h}_T; \overleftarrow{h}_1] + b^{(y)})$

Here, $\overrightarrow{h}_t \in \mathbb{R}^{d_h}$ and $\overleftarrow{h}_t \in \mathbb{R}^{d_h}$ are the hidden state vectors at time $t$ obtained by the forward and backward RNNs, respectively, $\overleftarrow{\mathrm{RNN}}(x, h)$ is the RNN unit that computes the previous state from the input $x$ and the hidden state $h$ of the next time step, $W^{(yh)} \in \mathbb{R}^{L \times 2d_h}$ is the matrix for predicting the category from the hidden state vectors, and $b^{(y)} \in \mathbb{R}^{L}$ is a bias term. Also, $[a; b]$ denotes the concatenation of the vectors $a$ and $b$. Furthermore, experiment with multi-layer bidirectional RNNs.

With a small change to the arguments of nn.LSTM, the model becomes multi-layer or bidirectional.

Because there are now more hidden states, I take the last two (the forward and backward hidden states of the last layer).

`code`


class BiLSTMClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, num_layers = 2, bidirectional = True)
        self.out = nn.Linear(h_size * 2, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.uniform_(self.embed.weight, -0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                nn.init.uniform_(param, -0.1, 0.1)
        nn.init.uniform_(self.out.weight, -0.1, 0.1)
    
    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        # Recent PyTorch versions require the lengths to be a CPU tensor.
        x = pack(x, batch['lengths'].cpu())
        x, (h, c) = self.rnn(x, h)
        h = h[-2:]
        h = h.transpose(0,1)
        h = h.contiguous().view(-1, h.size(1) * h.size(2))
        h = self.out(h)
        return h
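
Why `h[-2:]` picks the right states: the final hidden state returned by `nn.LSTM` has `num_layers * num_directions` entries along its first axis, ordered layer by layer with the forward direction before the backward one. The sketch below only enumerates that ordering; `hidden_state_labels` is a hypothetical helper for illustration.

```python
def hidden_state_labels(num_layers, bidirectional=True):
    """List (layer, direction) in the order of the first axis of the final hidden state."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, d) for layer in range(num_layers) for d in directions]


labels = hidden_state_labels(2)
print(labels)       # [(0, 'forward'), (0, 'backward'), (1, 'forward'), (1, 'backward')]
print(labels[-2:])  # [(1, 'forward'), (1, 'backward')]
```

So the last two entries are the forward and backward states of the top layer, which the model then concatenates into a vector of size `2 * h_size` per sentence.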

`code`


model = BiLSTMClassifier(len(vocab_dict), 300, 128, 4)
init_embed(model.embed)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

`code`


predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

86. Convolutional Neural Network (CNN)

There is a word sequence $\boldsymbol{x} = (x_1, x_2, \dots, x_T)$ represented by ID numbers. Here, $T$ is the length of the word sequence, and $x_t \in \mathbb{R}^{V}$ is the one-hot representation of a word's ID number ($V$ is the total number of words). Implement a model that predicts the category $y$ from the word sequence $\boldsymbol{x}$ using a convolutional neural network (CNN). The configuration of the convolutional neural network is as follows.

Word embedding dimensions: $d_w$
Convolution filter size: 3 tokens
Convolution stride: 1 token
Convolution padding: yes
Number of dimensions of the vector at each time step after the convolution: $d_h$
After the convolution, apply max pooling and represent the input sentence as a $d_h$-dimensional hidden vector.

That is, the feature vector $p_t \in \mathbb{R}^{d_h}$ at time $t$ is given by the following equation.

$p_t = g(W^{(px)} [\mathrm{emb}(x_{t-1}); \mathrm{emb}(x_t); \mathrm{emb}(x_{t+1})] + b^{(p)})$

Here, $W^{(px)} \in \mathbb{R}^{d_h \times 3d_w}$ and $b^{(p)} \in \mathbb{R}^{d_h}$ are the CNN parameters, $g$ is the activation function (for example, $\tanh$ or ReLU), and $[a; b; c]$ is the concatenation of the vectors $a, b, c$. The matrix $W^{(px)}$ has $3d_w$ columns because the linear transformation is applied to the concatenated word embeddings of three tokens. Max pooling takes the maximum over all time steps for each dimension of the feature vector, giving the feature vector $c \in \mathbb{R}^{d_h}$ of the input document. If $c[i]$ denotes the value of the $i$-th dimension of the vector $c$, max pooling is expressed by the following equation.

$c[i] = \max_{1 \leq t \leq T} p_t[i]$

Finally, apply the linear transformation given by the matrix $W^{(yc)} \in \mathbb{R}^{L \times d_h}$ and the bias term $b^{(y)} \in \mathbb{R}^{L}$, followed by the softmax function, to the input document feature vector $c$ to predict the category $y$.

$y = \mathrm{softmax}(W^{(yc)} c + b^{(y)})$

Note that this problem does not train the model; it is enough to compute $y$ with randomly initialized weight matrices.
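
The convolution and pooling equations in the problem statement can be traced by hand on a tiny input. The following pure-Python sketch uses ReLU for $g$, zero vectors as padding at both ends, and hand-picked weights so that the result is easy to verify; `conv_max_pool` and the toy numbers are assumptions for illustration only.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def conv_max_pool(embs, W, b):
    """embs: T vectors of size d_w; W: d_h rows of size 3 * d_w; b: d_h biases."""
    d_w = len(embs[0])
    zero = [0.0] * d_w
    padded = [zero] + embs + [zero]  # one padding position at each end
    feats = []
    for t in range(1, len(padded) - 1):
        window = padded[t - 1] + padded[t] + padded[t + 1]  # [emb(x_{t-1}); emb(x_t); emb(x_{t+1})]
        pre = [sum(w * x for w, x in zip(row, window)) + bias for row, bias in zip(W, b)]
        feats.append(relu(pre))  # p_t
    # c[i] = max over t of p_t[i]
    return [max(p[i] for p in feats) for i in range(len(b))]


embs = [[1.0], [2.0], [3.0]]               # T = 3, d_w = 1
W = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]    # d_h = 2
b = [0.0, 0.0]
c = conv_max_pool(embs, W, b)
print(c)  # [6.0, 2.0]
```

The first filter sums each window (3, 6, 5, so the maximum is 6), and the second computes left minus right (-2, -2, 2, clipped by ReLU to 0, 0, 2).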

I want to pad the input data with PAD tokens, so I build a vocabulary in which ID 0 is reserved for `[PAD]` and use a dataset based on it.

`code`


cnn_vocab_list = ['[PAD]', '[UNK]'] + vocab_in_train
cnn_vocab_dict = {x:n for n, x in enumerate(cnn_vocab_list)}

def cnn_sent_to_ids(sent):
    return torch.tensor([cnn_vocab_dict[x if x in cnn_vocab_dict else '[UNK]'] for x in sent], dtype=torch.long)

print(train_x[0])
print(cnn_sent_to_ids(train_x[0]))

`output`


['Kathleen', 'Sebelius', "'", 'LGBT', 'legacy']
tensor([   1,    1,    3, 2649,    1])

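Since the window width is 3, one extra position is needed at each end of a sequence. The ID sequences shown here are left as they are, and the padding is added by `padding=1` in `nn.Conv1d`.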

`code`


def cnn_dataset_to_ids(dataset):
    return [cnn_sent_to_ids(x) for x in dataset]

cnn_train_s = cnn_dataset_to_ids(train_x)
cnn_valid_s = cnn_dataset_to_ids(valid_x)
cnn_test_s = cnn_dataset_to_ids(test_x)

cnn_train_s[:3]

`output`


[tensor([   1,    1,    3, 2649,    1]),
 tensor([  10, 6741, 1446, 2077,  584,   11,  548,   33,   52,  874, 6742]),
 tensor([   1,  206, 4199,  316, 1900, 1233,    1])]

`code`


class CNNDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        src = [torch.cat([x['src'], torch.zeros(max_seq_len - x['lengths'], dtype=torch.long)], dim=-1) for x in xs]
        src = torch.stack(src)
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src':src,
            'trg':torch.tensor([x['trg'] for x in xs]),
            'mask':mask,
        }

cnn_train_dataset = CNNDataset(cnn_train_s, train_t)
cnn_valid_dataset = CNNDataset(cnn_valid_s, valid_t)
cnn_test_dataset = CNNDataset(cnn_test_s, test_t)
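
Unlike the RNN datasets, this `collate` builds a batch-first tensor and a matching mask that marks real tokens with 1 and padding with 0. A minimal pure-Python sketch of the same layout, with `pad_with_mask` as a hypothetical helper name:

```python
def pad_with_mask(seqs, pad_id=0):
    """Return batch-first padded ID lists and a 0/1 mask of the same shape."""
    max_len = max(len(s) for s in seqs)
    src = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return src, mask


src, mask = pad_with_mask([[3, 4, 5], [6]])
print(src)   # [[3, 4, 5], [6, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Each row is one sentence, so the result has shape (B, T), which is what the embedding followed by a transpose to (B, d_w, T) expects before the convolution.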

Create a CNN model.

`code`


class CNNClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.conv = nn.Conv1d(e_size, h_size, 3, padding=1)
        self.act = nn.ReLU()
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.normal_(self.embed.weight, 0, 0.1)
        nn.init.kaiming_normal_(self.conv.weight)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_normal_(self.out.weight)
        nn.init.constant_(self.out.bias, 0)
    
    def forward(self, batch):
        x = self.embed(batch['src'])
        x = self.dropout(x)
        x = self.conv(x.transpose(-1, -2))
        x = self.act(x)
        x = self.dropout(x)
        x.masked_fill_(batch['mask'].unsqueeze(-2) == 0, -1e4)
        x = F.max_pool1d(x, x.size(-1)).squeeze(-1)
        x = self.out(x)
        return x

Since `padding=1` is specified for `nn.Conv1d`, one padding position is inserted at each end of the sequence. These positions are filled with zeros by default, and the padding ID used to bring the sequences to the same length is also 0, so this causes no problem. To convolve along the time direction, the tensor is transposed so that the last dimension is the time axis. The values at padded positions are set to a large negative number (-1e4 in the code) so that max pooling never picks them. `max_pool1d` requires a kernel size, so the length of the time axis is passed to pool over the whole sequence.
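
The effect of masking before max pooling is easy to see on a single sentence. In this sketch the last time step is padding whose features happen to be large; `masked_max_pool` is a hypothetical helper for illustration, and the fill value mirrors the -1e4 used in the model.

```python
def masked_max_pool(features, mask, fill=-1e4):
    """features: T vectors of size d_h for one sentence; mask: T flags (1 = real token)."""
    d_h = len(features[0])
    return [
        max(f[i] if m else fill for f, m in zip(features, mask))
        for i in range(d_h)
    ]


features = [[0.5, 0.0], [0.2, 0.3], [0.9, 0.8]]  # the last step is padding
print(masked_max_pool(features, [1, 1, 0]))  # [0.5, 0.3]
print(masked_max_pool(features, [1, 1, 1]))  # [0.9, 0.8]
```

Without the mask the padded position would win the maximum in both dimensions, so the sentence vector would depend on how much padding the batch happened to contain.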

87. Training the CNN with stochastic gradient descent

Train the model built in Problem 86 using stochastic gradient descent (SGD). Train the model while displaying the loss and accuracy on the training data and on the evaluation data, and stop on a suitable criterion (for example, after 10 epochs).

Let's train it.

`code`


def init_cnn_embed(embed):
    for i, token in enumerate(cnn_vocab_list):
        if token in vectors:
            embed.weight.data[i] = torch.from_numpy(vectors[token])
    return embed

`code`


model = CNNClassifier(len(cnn_vocab_dict), 300, 128, 4)
init_cnn_embed(model.embed)
loaders = (
    gen_maxtokens_loader(cnn_train_dataset, 4000),
    gen_descending_loader(cnn_valid_dataset, 32),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()

`output`


epoch 0, train_loss:1.03501, valid_loss:0.85454
epoch 1, train_loss:0.68068, valid_loss:0.70825
epoch 2, train_loss:0.56784, valid_loss:0.60257
epoch 3, train_loss:0.50570, valid_loss:0.55611
epoch 4, train_loss:0.45707, valid_loss:0.52386
epoch 5, train_loss:0.42078, valid_loss:0.48479
epoch 6, train_loss:0.38858, valid_loss:0.45933
epoch 7, train_loss:0.36667, valid_loss:0.43547
epoch 8, train_loss:0.34746, valid_loss:0.41509
epoch 9, train_loss:0.32849, valid_loss:0.40350

`code`


predictor = Predictor(model, gen_loader(cnn_train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(cnn_test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

`output`


Correct answer rate in training data: 0.9106140022463497
Correct answer rate in evaluation data: 0.8847305389221557

88. Parameter tuning

Build a high-performance category classifier by modifying the code from Problem 85 and Problem 87 and adjusting the shape and hyperparameters of the neural network.

While I'm at it, I wrote a model that passes the LSTM outputs through a CNN and applies max pooling.

`code`


class BiLSTMCNNDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        # One mask entry per time step so the mask width matches the padded length.
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src':pad([x['src'] for x in xs]),
            'trg':torch.stack([x['trg'] for x in xs], dim=-1),
            'mask':mask,
            'lengths':torch.stack([x['lengths'] for x in xs], dim=-1)
        }

rnncnn_train_dataset = BiLSTMCNNDataset(cnn_train_s, train_t)
rnncnn_valid_dataset = BiLSTMCNNDataset(cnn_valid_s, valid_t)
rnncnn_test_dataset = BiLSTMCNNDataset(cnn_test_s, test_t)

`code`


class BiLSTMCNNClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, bidirectional = True)
        self.conv = nn.Conv1d(h_size* 2, h_size, 3, padding=1)
        self.act = nn.ReLU()
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.uniform_(self.embed.weight, -0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                nn.init.uniform_(param, -0.1, 0.1)
        nn.init.kaiming_normal_(self.conv.weight)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_normal_(self.out.weight)
        nn.init.constant_(self.out.bias, 0)
    
    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        x = self.dropout(x)
        # Recent PyTorch versions require the lengths to be a CPU tensor.
        x = pack(x, batch['lengths'].cpu())
        x, (h, c) = self.rnn(x, h)
        x, _ = unpack(x)
        x = self.dropout(x)
        x = self.conv(x.permute(1, 2, 0))
        x = self.act(x)
        x = self.dropout(x)
        x.masked_fill_(batch['mask'].unsqueeze(-2) == 0, -1)
        x = F.max_pool1d(x, x.size(-1)).squeeze(-1)
        x = self.out(x)
        return x

`code`


loaders = (
    gen_maxtokens_loader(rnncnn_train_dataset, 4000),
    gen_descending_loader(rnncnn_valid_dataset, 32),
)
task = Task()
for h in [32, 64, 128, 256, 512]:
    model = BiLSTMCNNClassifier(len(cnn_vocab_dict), 300, h, 4)
    init_cnn_embed(model.embed)
    optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9, nesterov=True)
    trainer = Trainer(model, loaders, task, optimizer, 10, device)
    trainer.train()
    predictor = Predictor(model, gen_loader(rnncnn_test_dataset, 1), device)
    pred = predictor.predict()
    print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

`output`


epoch 0, train_loss:1.21905, valid_loss:1.12725
epoch 1, train_loss:0.95913, valid_loss:0.84094
epoch 2, train_loss:0.66851, valid_loss:0.66997
epoch 3, train_loss:0.57141, valid_loss:0.61373
epoch 4, train_loss:0.52795, valid_loss:0.59354
epoch 5, train_loss:0.49844, valid_loss:0.57013
epoch 6, train_loss:0.47408, valid_loss:0.55163
epoch 7, train_loss:0.44922, valid_loss:0.52349
epoch 8, train_loss:0.41864, valid_loss:0.49231
epoch 9, train_loss:0.38975, valid_loss:0.46807
Correct answer rate in evaluation data: 0.8690119760479041
epoch 0, train_loss:1.16516, valid_loss:1.06582
epoch 1, train_loss:0.81246, valid_loss:0.71224
epoch 2, train_loss:0.58068, valid_loss:0.61988
epoch 3, train_loss:0.52451, valid_loss:0.58465
epoch 4, train_loss:0.48807, valid_loss:0.55663
epoch 5, train_loss:0.45712, valid_loss:0.52742
epoch 6, train_loss:0.41639, valid_loss:0.50089
epoch 7, train_loss:0.38595, valid_loss:0.46442
epoch 8, train_loss:0.35262, valid_loss:0.43459
epoch 9, train_loss:0.32527, valid_loss:0.40692
Correct answer rate in evaluation data: 0.8772455089820359
epoch 0, train_loss:1.12191, valid_loss:0.97533
epoch 1, train_loss:0.71378, valid_loss:0.66554
epoch 2, train_loss:0.55280, valid_loss:0.59733
epoch 3, train_loss:0.50526, valid_loss:0.57163
epoch 4, train_loss:0.46889, valid_loss:0.53955
epoch 5, train_loss:0.43500, valid_loss:0.50500
epoch 6, train_loss:0.40006, valid_loss:0.47222
epoch 7, train_loss:0.36444, valid_loss:0.43941
epoch 8, train_loss:0.33329, valid_loss:0.41224
epoch 9, train_loss:0.30588, valid_loss:0.39965
Correct answer rate in evaluation data: 0.8839820359281437
epoch 0, train_loss:1.04536, valid_loss:0.84626
epoch 1, train_loss:0.61410, valid_loss:0.62255
epoch 2, train_loss:0.49830, valid_loss:0.55984
epoch 3, train_loss:0.44190, valid_loss:0.51720
epoch 4, train_loss:0.39713, valid_loss:0.46718
epoch 5, train_loss:0.35052, valid_loss:0.43181
epoch 6, train_loss:0.32145, valid_loss:0.39898
epoch 7, train_loss:0.30279, valid_loss:0.37586
epoch 8, train_loss:0.28171, valid_loss:0.37333
epoch 9, train_loss:0.26904, valid_loss:0.37849
Correct answer rate in evaluation data: 0.8884730538922155
epoch 0, train_loss:0.93974, valid_loss:0.71999
epoch 1, train_loss:0.53687, valid_loss:0.58747
epoch 2, train_loss:0.44848, valid_loss:0.52432
epoch 3, train_loss:0.38761, valid_loss:0.46509
epoch 4, train_loss:0.34431, valid_loss:0.43651
epoch 5, train_loss:0.31699, valid_loss:0.39881
epoch 6, train_loss:0.28963, valid_loss:0.38732
epoch 7, train_loss:0.27550, valid_loss:0.37152
epoch 8, train_loss:0.26003, valid_loss:0.36476
epoch 9, train_loss:0.24991, valid_loss:0.36012
Correct answer rate in evaluation data: 0.8944610778443114

I trained the model with several hidden-state sizes. Within the range tried (32 to 512), accuracy on the evaluation data rose steadily with the hidden size, from about 0.869 at 32 to about 0.894 at 512.
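
The comparison above reads the accuracy on the evaluation data. A more conservative way to pick a hyperparameter is to compare on the validation data and use the evaluation data only once, at the end. A minimal sketch follows, using the final-epoch validation losses copied from the logs above. `select_hidden_size` is a hypothetical helper, not part of the notebook.

```python
# Final-epoch validation losses copied from the training logs above,
# keyed by hidden size.
final_valid_loss = {
    32: 0.46807,
    64: 0.40692,
    128: 0.39965,
    256: 0.37849,
    512: 0.36012,
}


def select_hidden_size(losses):
    """Return the hidden size with the lowest validation loss."""
    return min(losses, key=losses.get)


best_hidden = select_hidden_size(final_valid_loss)
print(best_hidden)
```

In this run both criteria agree: 512 has the lowest validation loss and the highest accuracy on the evaluation data.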

89. Transfer learning from a pre-trained language model

Build a model that classifies news article headlines into categories, starting from a pre-trained language model (e.g., BERT).

I use the Hugging Face transformers library.

`code`


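# Import only the names used below; a wildcard import can shadow
# names already defined in the notebook, such as Trainer.
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer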

BERT takes WordPiece tokens as input, so the text has to go through BERT's tokenizer first. Prepare the tokenizer.

`code`


tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
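
To show what splitting into WordPieces means, here is a minimal sketch of greedy longest-match-first splitting of a single word. Pieces that continue a word carry the `##` prefix. The vocabulary `toy_vocab` and the function `wordpiece` are hypothetical and only for illustration; the real tokenizer loads its vocabulary from the pre-trained model and also handles punctuation and special tokens.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {'un', '##aff', '##able', 'play', '##ing'}
print(wordpiece('unaffable', toy_vocab))
print(wordpiece('playing', toy_vocab))
print(wordpiece('xyz', toy_vocab))
```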

Tokenize the headlines.

`code`


def read_for_bert(filename):
    with open(filename) as f:
        dataset = f.read().splitlines()
    dataset = [line.split('\t') for line in dataset]
    dataset_t = [categories.index(line[0]) for line in dataset]
    dataset_x = [torch.tensor(tokenizer.encode(line[1]), dtype=torch.long) for line in dataset]
    return dataset_x, torch.tensor(dataset_t, dtype=torch.long)

bert_train_x, bert_train_t = read_for_bert('data/train.txt')
bert_valid_x, bert_valid_t = read_for_bert('data/valid.txt')
bert_test_x, bert_test_t = read_for_bert('data/test.txt')

Prepare a dataset class for BERT. It handles padding and so on. `mask` is the attention mask, which keeps the model from attending to padding tokens.

`code`


class BertDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        src = [torch.cat([x['src'], torch.zeros(max_seq_len - x['lengths'], dtype=torch.long)], dim=-1) for x in xs]
        src = torch.stack(src)
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src':src,
            'trg':torch.tensor([x['trg'] for x in xs]),
            'mask':mask,
        }
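
The padding and the attention mask built in `collate` can be sketched with plain lists. This is a minimal illustration, assuming a padding ID of 0 as in the `torch.zeros` call above; `pad_batch` is a hypothetical helper and the token IDs are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Pad ID sequences to the same length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    # Append pad_id until every sequence has max_len entries.
    src = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return src, mask


# Made-up token IDs for two headlines of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102]]
src, mask = pad_batch(batch)
print(src)
print(mask)
```

The mask is needed because the padding ID alone does not tell the model which positions to ignore; the model receives that information only through `attention_mask`.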

`code`


bert_train_dataset = BertDataset(bert_train_x, bert_train_t)
bert_valid_dataset = BertDataset(bert_valid_x, bert_valid_t)
bert_test_dataset = BertDataset(bert_test_x, bert_test_t)

Load the pre-trained model.

`code`


class BertClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        config = BertConfig.from_pretrained('bert-base-cased', num_labels=4)
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-cased', config=config)
    
    def forward(self, batch):
        x = self.bert(batch['src'], attention_mask=batch['mask'])
        # No labels are passed, so the first element of the output is the logits.
        return x[0]

`code`


model = BertClassifier()
loaders = (
    gen_maxtokens_loader(bert_train_dataset, 1000),
    gen_descending_loader(bert_valid_dataset, 32),
)
task = Task()
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
trainer = Trainer(model, loaders, task, optimizer, 5, device)
trainer.train()

`code`


predictor = Predictor(model, gen_loader(bert_train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))

predictor = Predictor(model, gen_loader(bert_test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))

`output`


Correct answer rate in training data: 0.9927929614376638
Correct answer rate in evaluation data: 0.9229041916167665

Looks good: about 0.923 on the evaluation data, up from about 0.894 for the best BiLSTM-CNN setting above.

Next is Chapter 10

Language processing 100 knocks 2020 Chapter 10: Machine translation (90-98)

