The 2020 edition of 100 Language Processing Knock was released the other day. I have only been working in natural language processing for a year and there is a lot I don't know, but I will solve all the problems and publish my solutions to improve my technical skills.
Everything is run in a Jupyter notebook, and I may bend the restrictions in the problem statements when convenient. The source code is also on GitHub.
Chapter 8 is here.
The environment is Python 3.8.2 on Ubuntu 18.04.
As in Chapter 8, I use PyTorch.
I want to assign unique ID numbers to the words in the training data built in Problem 51. For every word that appears at least twice in the training data, assign `1` to the most frequent word, `2` to the second most frequent word, and so on. Then implement a function that returns the sequence of ID numbers for a given word sequence. Words that appear fewer than twice should all get the ID number `0`.
code
import re
import spacy
import torch
Prepare the spaCy model and the labels.
code
nlp = spacy.load('en')
categories = ['b', 't', 'e', 'm']
category_names = ['business', 'science and technology', 'entertainment', 'health']
Read the file and tokenize it with spacy.
code
def tokenize(x):
    # Collapse runs of whitespace, then split with spaCy's tokenizer only
    x = re.sub(r'\s+', ' ', x)
    x = nlp.make_doc(x)
    x = [d.text for d in x]
    return x

def read_feature_dataset(filename):
    with open(filename) as f:
        dataset = f.read().splitlines()
    dataset = [line.split('\t') for line in dataset]
    dataset_t = [categories.index(line[0]) for line in dataset]
    dataset_x = [tokenize(line[1]) for line in dataset]
    return dataset_x, torch.tensor(dataset_t, dtype=torch.long)
code
train_x, train_t = read_feature_dataset('data/train.txt')
valid_x, valid_t = read_feature_dataset('data/valid.txt')
test_x, test_t = read_feature_dataset('data/test.txt')
Extract the vocabulary, keeping only words that appear more than once.
code
from collections import Counter
code
counter = Counter([
x
for sent in train_x
for x in sent
])
vocab_in_train = [
token
for token, freq in counter.most_common()
if freq > 1
]
len(vocab_in_train)
output
9700
Convert a word sequence into a sequence of ID numbers.
code
vocab_list = ['[UNK]'] + vocab_in_train
vocab_dict = {x:n for n, x in enumerate(vocab_list)}
code
def sent_to_ids(sent):
    return torch.tensor([vocab_dict[x if x in vocab_dict else '[UNK]'] for x in sent], dtype=torch.long)
Let's convert the first sentence of the training data to ID numbers.
code
print(train_x[0])
print(sent_to_ids(train_x[0]))
output
['Kathleen', 'Sebelius', "'", 'LGBT', 'legacy']
tensor([ 0, 0, 2, 2648, 0])
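To make the ID assignment rule concrete, here is a minimal pure-Python sketch on a toy corpus. The names `toy_sents`, `toy_vocab`, `toy_ids`, and `toy_sent_to_ids` are made up for this illustration and are not part of the code above; it returns plain lists instead of tensors.

```python
from collections import Counter

# Toy corpus (hypothetical data, not the article's dataset)
toy_sents = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'bird', 'flew'],
]

toy_counter = Counter(tok for sent in toy_sents for tok in sent)

# Keep words seen at least twice, most frequent first; ID 0 is reserved for unknowns
toy_vocab = ['[UNK]'] + [tok for tok, freq in toy_counter.most_common() if freq > 1]
toy_ids = {tok: i for i, tok in enumerate(toy_vocab)}

def toy_sent_to_ids(sent):
    # dict.get falls back to 0 ('[UNK]') for rare or unseen words
    return [toy_ids.get(tok, 0) for tok in sent]

print(toy_vocab)                                          # ['[UNK]', 'the', 'sat']
print(toy_sent_to_ids(['the', 'cat', 'sat', 'quietly']))  # [1, 0, 2, 0]
```

Words seen only once ("cat") and words never seen ("quietly") both map to `0`, which is the behavior the problem statement asks for.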
Convert the whole dataset to sequences of ID numbers.
code
def dataset_to_ids(dataset):
    return [sent_to_ids(x) for x in dataset]
code
train_s = dataset_to_ids(train_x)
valid_s = dataset_to_ids(valid_x)
test_s = dataset_to_ids(test_x)
train_s[:3]
output
[tensor([ 0, 0, 2, 2648, 0]),
tensor([ 9, 6740, 1445, 2076, 583, 10, 547, 32, 51, 873, 6741]),
tensor([ 0, 205, 4198, 315, 1899, 1232, 0])]
There is a word sequence $ \boldsymbol{x} = (x_1, x_2, \dots, x_T) $ represented by ID numbers. Here $ T $ is the length of the word sequence, and $ x_t \in \mathbb{R}^{V} $ is the one-hot representation of a word ID ($ V $ is the total number of words). Using a recurrent neural network (RNN), implement the following equations as a model that predicts the category $ y $ from the word sequence $ \boldsymbol{x} $.
$ \overrightarrow h_0 = 0, \ \overrightarrow h_t = {\rm \overrightarrow{RNN}}(\mathrm{emb}(x_t), \overrightarrow h_{t-1}), \ y = {\rm softmax}(W^{(yh)} \overrightarrow h_T + b^{(y)}) $
Here $ \mathrm{emb}(x) \in \mathbb{R}^{d_w} $ is the word embedding (a function that converts a word from its one-hot representation to a word vector), $ \overrightarrow h_t \in \mathbb{R}^{d_h} $ is the hidden state vector at time $ t $, $ {\rm \overrightarrow{RNN}}(x, h) $ is the RNN unit that computes the next state from the input $ x $ and the hidden state $ h $ of the previous time step, $ W^{(yh)} \in \mathbb{R}^{L \times d_h} $ is the matrix for predicting the category from the hidden state vector, and $ b^{(y)} \in \mathbb{R}^{L} $ is a bias term ($ d_w, d_h, L $ are the word embedding dimension, the hidden state dimension, and the number of labels, respectively). The RNN unit $ {\rm \overrightarrow{RNN}}(x, h) $ can take various forms; the following equation is a typical example.
$ {\rm \overrightarrow{RNN}}(x,h) = g(W^{(hx)} x + W^{(hh)}h + b^{(h)}) $
Here $ W^{(hx)} \in \mathbb{R}^{d_h \times d_w}, W^{(hh)} \in \mathbb{R}^{d_h \times d_h}, b^{(h)} \in \mathbb{R}^{d_h} $ are the parameters of the RNN unit, and $ g $ is the activation function (for example, $ \tanh $ or ReLU). In this problem the parameters are not trained; it is enough to compute $ y $ with randomly initialized parameters.
Hyperparameters such as the number of dimensions should be set to appropriate values such as $ d_w = 300, d_h = 50 $ (the same applies to the following problems).
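As a rough illustration of the equations above, the following pure-Python sketch runs the simple RNN unit with $ g = \tanh $ over a short ID sequence and applies softmax to the last hidden state. It is not the solution used in this article (which relies on `nn.LSTM`); the helper names `matvec`, `softmax`, `rand_matrix`, and `rnn_classify` and the toy sizes are assumptions made only for this example.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def rnn_classify(ids, emb, W_hx, W_hh, b_h, W_yh, b_y):
    h = [0.0] * len(b_h)                         # h_0 = 0
    for i in ids:
        x = emb[i]                               # emb(x_t): a row lookup replaces the one-hot product
        a = [p + q + r for p, q, r in zip(matvec(W_hx, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(v) for v in a]            # h_t = g(W_hx x + W_hh h_{t-1} + b_h)
    logits = [p + q for p, q in zip(matvec(W_yh, h), b_y)]
    return softmax(logits)                       # y = softmax(W_yh h_T + b_y)

rng = random.Random(0)
V, d_w, d_h, L = 10, 4, 3, 4                     # toy sizes, much smaller than d_w = 300, d_h = 50
emb = rand_matrix(V, d_w, rng)
W_hx = rand_matrix(d_h, d_w, rng)
W_hh = rand_matrix(d_h, d_h, rng)
b_h = [0.0] * d_h
W_yh = rand_matrix(L, d_h, rng)
b_y = [0.0] * L

y = rnn_classify([1, 0, 2, 5], emb, W_hx, W_hh, b_h, W_yh, b_y)
print(y)
```

With untrained parameters the four probabilities come out close to uniform, which is expected for this problem.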
Unlike Chapter 8, the length of the input differs from sentence to sentence. Use the utilities in `torch.nn.utils.rnn` to pad the ends of the variable-length sequences so that they can be handled as a batch.
code
import random as rd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence as pad
from torch.nn.utils.rnn import pack_padded_sequence as pack
from torch.nn.utils.rnn import pad_packed_sequence as unpack
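To show what end-padding does to a batch, here is a small pure-Python sketch. It mimics the time-major `(max_len, batch)` layout that `pad_sequence` returns by default, but `pad_time_major` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
def pad_time_major(seqs, pad_id=0):
    """Pad lists of IDs to a common length and return a T x B nested list."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # Transpose so that the first axis is time and the second is the batch
    return [list(step) for step in zip(*padded)]

toy_batch = [[5, 3, 8], [7, 2], [4]]
toy_lengths = [len(s) for s in toy_batch]

print(pad_time_major(toy_batch))  # [[5, 7, 4], [3, 2, 0], [8, 0, 0]]
print(toy_lengths)                # [3, 2, 1]
```

The original lengths have to be kept alongside the padded batch, because packing needs them to know where each sequence really ends.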
Create a `Dataset` class that holds the dataset.
In addition to the input sentences `source` and the target labels `target`, it keeps the lengths of the input sentences, `lengths`, as a member.
code
class Dataset(torch.utils.data.Dataset):
    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.lengths = torch.tensor([len(x) for x in source])
        self.size = len(source)

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return {
            'src': self.source[index],
            'trg': self.target[index],
            'lengths': self.lengths[index],
        }

    def collate(self, xs):
        # pad() returns a (max_len, batch) tensor padded with 0
        return {
            'src': pad([x['src'] for x in xs]),
            'trg': torch.stack([x['trg'] for x in xs], dim=-1),
            'lengths': torch.stack([x['lengths'] for x in xs], dim=-1)
        }
Prepare the data set.
code
train_dataset = Dataset(train_s, train_t)
valid_dataset = Dataset(valid_s, valid_t)
test_dataset = Dataset(test_s, test_t)
Define the same `Sampler` as in Chapter 8.
code
class Sampler(torch.utils.data.Sampler):
    def __init__(self, dataset, width, shuffle = False):
        self.dataset = dataset
        self.width = width
        self.shuffle = shuffle
        if not shuffle:
            self.indices = torch.arange(len(dataset))

    def __iter__(self):
        if self.shuffle:
            self.indices = torch.randperm(len(self.dataset))
        index = 0
        while index < len(self.dataset):
            yield self.indices[index : index + self.width]
            index += self.width
Packing a padded batch is convenient when the sequences in the batch are sorted in descending order of length, so define a sampler that guarantees this.
All it has to do is sort the indices by length in descending order in advance and fill the batches from the front.
code
class DescendingSampler(Sampler):
    def __init__(self, dataset, width, shuffle = False):
        assert not shuffle
        super().__init__(dataset, width, shuffle)
        # Reorder the indices so that the longest sentences come first
        self.indices = self.indices[self.dataset.lengths[self.indices].argsort(descending=True)]
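The idea behind the descending sampler can be sketched without PyTorch as follows. `descending_batches` is a hypothetical helper for illustration only; it works on a plain list of lengths rather than on a dataset object.

```python
def descending_batches(lengths, width):
    # Sort example indices so that the longest sequence comes first
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    # Cut the sorted order into consecutive batches of at most `width` examples
    return [order[i:i + width] for i in range(0, len(order), width)]

sample_lengths = [3, 7, 5, 2, 6]
print(descending_batches(sample_lengths, 2))  # [[1, 4], [2, 0], [3]]
```

Because each batch is a consecutive slice of the sorted order, the lengths inside every batch are non-increasing, which is the ordering that packing expects by default.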
Also, during training, the less padding a batch contains, the fewer wasted computations there are and the faster training runs, so I implement a batching method that keeps padding small. The two samplers above split the indices by a fixed number of examples per batch, while the next one splits them by the maximum number of tokens per batch, so the number of examples in a batch is not constant.
code
class MaxTokensSampler(Sampler):
    def __iter__(self):
        # Shuffle first so that ties between equal lengths are broken randomly
        self.indices = torch.randperm(len(self.dataset))
        self.indices = self.indices[self.dataset.lengths[self.indices].argsort(descending=True)]
        for batch in self.generate_batches():
            yield batch

    def generate_batches(self):
        batches = []
        batch = []
        acc = 0
        max_len = 0
        for index in self.indices:
            acc += 1
            this_len = self.dataset.lengths[index]
            max_len = max(max_len, this_len)
            # acc * max_len is the padded size of the batch if this example is added
            if acc * max_len > self.width:
                batches.append(batch)
                batch = [index]
                acc = 1
                max_len = this_len
            else:
                batch.append(index)
        if batch != []:
            batches.append(batch)
        rd.shuffle(batches)
        return batches
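Here is a simplified pure-Python sketch of the same max-token batching idea, using a plain list of lengths. `max_token_batches` is a hypothetical helper; unlike the sampler above it skips the random shuffling so that the result is deterministic, and it never emits an empty batch.

```python
def max_token_batches(lengths, max_tokens):
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches, batch, longest = [], [], 0
    for i in order:
        candidate_longest = max(longest, lengths[i])
        # Padded size of the batch if this example joined it
        if batch and (len(batch) + 1) * candidate_longest > max_tokens:
            batches.append(batch)
            batch, longest = [i], lengths[i]
        else:
            batch.append(i)
            longest = candidate_longest
    if batch:
        batches.append(batch)
    return batches

demo_lengths = [9, 2, 4, 8, 3, 3]
print(max_token_batches(demo_lengths, 12))  # [[0], [3], [2, 4, 5], [1]]
```

Long sentences end up in small batches and short sentences in large ones, so every batch costs roughly the same number of padded tokens.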
Prepare functions that create a `DataLoader`.
code
def gen_loader(dataset, width, sampler=Sampler, shuffle=False, num_workers=8):
    return torch.utils.data.DataLoader(
        dataset,
        batch_sampler = sampler(dataset, width, shuffle),
        collate_fn = dataset.collate,
        num_workers = num_workers,
    )

def gen_descending_loader(dataset, width, num_workers=0):
    return gen_loader(dataset, width, sampler = DescendingSampler, shuffle = False, num_workers = num_workers)

def gen_maxtokens_loader(dataset, width, num_workers=0):
    return gen_loader(dataset, width, sampler = MaxTokensSampler, shuffle = True, num_workers = num_workers)
Define a one-layer unidirectional LSTM classifier. To `pack` the tensor that the dataset's `collate` has `pad`ded, the length of each sentence is required.
code
class LSTMClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, num_layers = 1)
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        self.embed.weight.data.uniform_(-0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                param.data.uniform_(-0.1, 0.1)
        self.out.weight.data.uniform_(-0.1, 0.1)

    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        # Recent PyTorch versions require the lengths given to pack() to be on the CPU
        x = pack(x, batch['lengths'].cpu())
        x, (h, c) = self.rnn(x, h)
        # h has shape (1, batch, h_size): the last hidden state of each sentence
        h = self.out(h)
        return h.squeeze(0)
Predict.
code
model = LSTMClassifier(len(vocab_dict), 300, 50, 4)
loader = gen_loader(test_dataset, 10, DescendingSampler, False)
model(next(iter(loader))).argmax(dim=-1)
output
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Train the model built in Problem 81 using stochastic gradient descent (SGD). Train the model while displaying the loss and accuracy on the training data and the loss and accuracy on the evaluation data, and stop by an appropriate criterion (for example, 10 epochs).
All you have to do is define Task and Trainer and run the training.
code
class Task:
    def __init__(self):
        self.criterion = nn.CrossEntropyLoss()

    def train_step(self, model, batch):
        model.zero_grad()
        loss = self.criterion(model(batch), batch['trg'])
        loss.backward()
        return loss.item()

    def valid_step(self, model, batch):
        with torch.no_grad():
            loss = self.criterion(model(batch), batch['trg'])
        return loss.item()
code
class Trainer:
    def __init__(self, model, loaders, task, optimizer, max_iter, device = None):
        self.model = model
        self.model.to(device)
        self.train_loader, self.valid_loader = loaders
        self.task = task
        self.optimizer = optimizer
        self.max_iter = max_iter
        self.device = device

    def send(self, batch):
        for key in batch:
            batch[key] = batch[key].to(self.device)
        return batch

    def train_epoch(self):
        self.model.train()
        acc = 0
        for n, batch in enumerate(self.train_loader):
            batch = self.send(batch)
            acc += self.task.train_step(self.model, batch)
            self.optimizer.step()
        # enumerate() is 0-based, so n + 1 batches were processed
        return acc / (n + 1)

    def valid_epoch(self):
        self.model.eval()
        acc = 0
        for n, batch in enumerate(self.valid_loader):
            batch = self.send(batch)
            acc += self.task.valid_step(self.model, batch)
        return acc / (n + 1)

    def train(self):
        for epoch in range(self.max_iter):
            train_loss = self.train_epoch()
            valid_loss = self.valid_epoch()
            print('epoch {}, train_loss:{:.5f}, valid_loss:{:.5f}'.format(epoch, train_loss, valid_loss))
Run the training.
code
device = torch.device('cuda')
model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
loaders = (
gen_loader(train_dataset, 1),
gen_loader(valid_dataset, 1),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 3, device)
trainer.train()
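For readers curious what `momentum=0.9, nesterov=True` does, the following is a scalar sketch of the update, assuming the form described in the `torch.optim.SGD` documentation with no dampening and no weight decay. `sgd_nesterov` is a hypothetical helper for illustration and it minimizes a toy function, not the classifier's loss.

```python
def sgd_nesterov(grad_fn, x, lr=0.1, momentum=0.9, steps=200):
    buf = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        buf = momentum * buf + g             # accumulate the velocity
        x = x - lr * (g + momentum * buf)    # Nesterov step: gradient plus look-ahead velocity
    return x

# Minimize f(x) = x^2, whose gradient is 2x
x_final = sgd_nesterov(lambda x: 2.0 * x, x=5.0)
print(abs(x_final) < 1e-3)  # True
```

The velocity term lets the parameter keep moving in a consistent direction, which is why a larger learning rate combined with momentum can become unstable on noisy mini-batches.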
Now let's try predicting.
code
import numpy as np
code
class Predictor:
    def __init__(self, model, loader, device=None):
        self.model = model
        self.loader = loader
        self.device = device

    def send(self, batch):
        for key in batch:
            batch[key] = batch[key].to(self.device)
        return batch

    def infer(self, batch):
        self.model.eval()
        batch = self.send(batch)
        # .item() assumes a batch size of 1
        return self.model(batch).argmax(dim=-1).item()

    def predict(self):
        lst = []
        for batch in self.loader:
            lst.append(self.infer(batch))
        return lst
code
def accuracy(true, pred):
    return np.mean([t == p for t, p in zip(true, pred)])
predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))
predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))
output
Correct answer rate in training data: 0.7592661924372894
Correct answer rate in evaluation data: 0.6384730538922155
Training for a few more epochs would improve the accuracy a little, but I won't bother.
Modify the code of Problem 82 so that the loss and gradient are computed for every $ B $ examples (choose the value of $ B $ appropriately). Also run the training on the GPU.
code
model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
loaders = (
gen_maxtokens_loader(train_dataset, 4000),
gen_descending_loader(valid_dataset, 128),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.2, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()
output
epoch 0, train_loss:1.22489, valid_loss:1.26302
epoch 1, train_loss:1.11631, valid_loss:1.19404
epoch 2, train_loss:1.07750, valid_loss:1.18451
epoch 3, train_loss:0.96149, valid_loss:1.06748
epoch 4, train_loss:0.81597, valid_loss:0.86547
epoch 5, train_loss:0.74748, valid_loss:0.81049
epoch 6, train_loss:0.80179, valid_loss:0.89621
epoch 7, train_loss:0.60231, valid_loss:0.78494
epoch 8, train_loss:0.52551, valid_loss:0.73272
epoch 9, train_loss:0.97286, valid_loss:1.05034
code
predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))
predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))
output
Correct answer rate in training data: 0.7202358667165856
Correct answer rate in evaluation data: 0.6773952095808383
For some reason the loss jumped in the final epoch and the training accuracy dropped, but let's stay strong and not worry about it.
Initialize the word embedding $ \mathrm{emb}(x) $ with pre-trained word vectors (for example, [word vectors trained on the Google News dataset (about 100 billion words)](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)) and train the model.
code
from gensim.models import KeyedVectors
vectors = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)
Initialize the word embedding before training.
code
def init_embed(embed):
    # Overwrite only the rows whose token has a pre-trained vector
    for i, token in enumerate(vocab_list):
        if token in vectors:
            embed.weight.data[i] = torch.from_numpy(vectors[token])
    return embed
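The initialization rule is simple: copy the pre-trained vector when the token is known, and leave the random row otherwise. Here is a sketch of that rule in plain Python; `init_rows`, `demo_vocab`, and `demo_pretrained` are hypothetical names with made-up two-dimensional vectors.

```python
import random

def init_rows(vocab, pretrained, dim, rng):
    rows, hits = [], 0
    for token in vocab:
        if token in pretrained:
            rows.append(list(pretrained[token]))   # copy the pre-trained vector
            hits += 1
        else:
            # Tokens without a pre-trained vector keep a small random row
            rows.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return rows, hits

demo_vocab = ['[UNK]', 'the', 'stock', 'zzzz']
demo_pretrained = {'the': [0.1, 0.2], 'stock': [0.3, -0.4]}  # hypothetical vectors

rows, hits = init_rows(demo_vocab, demo_pretrained, 2, random.Random(0))
print(hits, len(rows))  # 2 4
```

Counting the hits is a cheap sanity check: if very few tokens are found, the tokenization or casing probably does not match the pre-trained vocabulary.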
code
model = LSTMClassifier(len(vocab_dict), 300, 128, 4)
init_embed(model.embed)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()
output
epoch 0, train_loss:1.21390, valid_loss:1.19333
epoch 1, train_loss:0.88751, valid_loss:0.74930
epoch 2, train_loss:0.57240, valid_loss:0.65822
epoch 3, train_loss:0.50240, valid_loss:0.62686
epoch 4, train_loss:0.45800, valid_loss:0.59535
epoch 5, train_loss:0.44051, valid_loss:0.55849
epoch 6, train_loss:0.38251, valid_loss:0.51837
epoch 7, train_loss:0.35731, valid_loss:0.47709
epoch 8, train_loss:0.30278, valid_loss:0.43797
epoch 9, train_loss:0.25518, valid_loss:0.41287
code
predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))
predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))
output
Correct answer rate in training data: 0.925028079371022
Correct answer rate in evaluation data: 0.8839820359281437
Encode the input text with both a forward and a backward RNN and train the model.
$ \overleftarrow h_{T+1} = 0, \ \overleftarrow h_t = {\rm \overleftarrow{RNN}}(\mathrm{emb}(x_t), \overleftarrow h_{t+1}), \ y = {\rm softmax}(W^{(yh)} [\overrightarrow h_T; \overleftarrow h_1] + b^{(y)}) $
Here $ \overrightarrow h_t \in \mathbb{R}^{d_h} $ and $ \overleftarrow h_t \in \mathbb{R}^{d_h} $ are the hidden state vectors at time $ t $ obtained by the forward and backward RNNs, respectively, $ {\rm \overleftarrow{RNN}}(x, h) $ is the RNN unit that computes the previous state from the input $ x $ and the hidden state $ h $ of the next time step, $ W^{(yh)} \in \mathbb{R}^{L \times 2d_h} $ is the matrix for predicting the category from the hidden state vectors, and $ b^{(y)} \in \mathbb{R}^{L} $ is a bias term. Also, $ [a; b] $ denotes the concatenation of the vectors $ a $ and $ b $. Furthermore, experiment with multi-layer bidirectional RNNs.
By changing the arguments of `nn.LSTM` slightly, you can make it multi-layer or bidirectional.
Since this increases the number of returned hidden states, take the last two (the forward and backward hidden states of the last layer).
code
class BiLSTMClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, num_layers = 2, bidirectional = True)
        self.out = nn.Linear(h_size * 2, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.uniform_(self.embed.weight, -0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                nn.init.uniform_(param, -0.1, 0.1)
        nn.init.uniform_(self.out.weight, -0.1, 0.1)

    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        # Recent PyTorch versions require the lengths given to pack() to be on the CPU
        x = pack(x, batch['lengths'].cpu())
        x, (h, c) = self.rnn(x, h)
        # Keep the forward and backward states of the last layer: (2, batch, h_size)
        h = h[-2:]
        h = h.transpose(0,1)
        # Concatenate both directions: (batch, 2 * h_size)
        h = h.contiguous().view(-1, h.size(1) * h.size(2))
        h = self.out(h)
        return h
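Why `h[-2:]` picks the right states can be seen from how the first axis of the returned hidden state is ordered. The sketch below assumes the layout described in the `nn.LSTM` documentation, where the first axis has size `num_layers * num_directions` with layers as the outer index and directions as the inner index. `hidden_state_layout` is a hypothetical helper that only lists the labels; it does not touch PyTorch.

```python
def hidden_state_layout(num_layers, bidirectional=True):
    directions = ['forward', 'backward'] if bidirectional else ['forward']
    # Layers are the outer loop and directions the inner loop
    return [(layer, d) for layer in range(num_layers) for d in directions]

layout = hidden_state_layout(2)
print(layout)       # [(0, 'forward'), (0, 'backward'), (1, 'forward'), (1, 'backward')]
print(layout[-2:])  # [(1, 'forward'), (1, 'backward')]
```

Under that assumption the last two entries are exactly the forward and backward states of the top layer, which correspond to $ \overrightarrow h_T $ and $ \overleftarrow h_1 $ in the problem statement.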
code
model = BiLSTMClassifier(len(vocab_dict), 300, 128, 4)
init_embed(model.embed)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()
code
predictor = Predictor(model, gen_loader(train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))
predictor = Predictor(model, gen_loader(test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))
There is a word sequence $ \boldsymbol x = (x_1, x_2, \dots, x_T) $ represented by ID numbers. Here $ T $ is the length of the word sequence, and $ x_t \in \mathbb{R}^{V} $ is the one-hot representation of a word ID ($ V $ is the total number of words). Implement a model that predicts the category $ y $ from the word sequence $ \boldsymbol x $ using a convolutional neural network (CNN). The configuration of the convolutional neural network is as follows.
- Word embedding dimensions: $ d_w $
That is, the feature vector $ p_t \in \mathbb{R}^{d_h} $ at time $ t $ is given by the following equation.
$ p_t = g(W^{(px)} [\mathrm{emb}(x_{t-1}); \mathrm{emb}(x_t); \mathrm{emb}(x_{t+1})] + b^{(p)}) $
Here $ W^{(px)} \in \mathbb{R}^{d_h \times 3d_w}, b^{(p)} \in \mathbb{R}^{d_h} $ are the CNN parameters, $ g $ is the activation function (for example, $ \tanh $ or ReLU), and $ [a; b; c] $ is the concatenation of the vectors $ a, b, c $. The matrix $ W^{(px)} $ has $ 3d_w $ columns because the linear transformation is applied to the concatenated word embeddings of three tokens. In max pooling, the maximum over all time steps is taken for each dimension of the feature vector, which gives the feature vector $ c \in \mathbb{R}^{d_h} $ of the input document. If $ c[i] $ denotes the value of the $ i $-th dimension of the vector $ c $, max pooling is expressed by the following equation.
$ c[i] = \max_{1 \leq t \leq T} p_t[i] $
Finally, apply the linear transformation given by the matrix $ W^{(yc)} \in \mathbb{R}^{L \times d_h} $ and the bias term $ b^{(y)} \in \mathbb{R}^{L} $, followed by the softmax function, to the feature vector $ c $ of the input document to predict the category $ y $.
$ y = {\rm softmax}(W^{(yc)} c + b^{(y)}) $
Note that the model is not trained in this problem; it is enough to compute $ y $ with randomly initialized weight matrices.
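The window-of-three convolution and the max pooling can be worked through by hand on tiny numbers. The sketch below is a pure-Python illustration with $ g = \tanh $, zero vectors as padding at both ends, and hand-picked weights; `conv3_maxpool` and all values are assumptions made for this example, not the model used later.

```python
import math

def conv3_maxpool(embs, W, b):
    d_w = len(embs[0])
    zero = [0.0] * d_w
    padded = [zero] + embs + [zero]              # one zero vector at each end
    feats = []
    for t in range(1, len(padded) - 1):
        # [emb(x_{t-1}); emb(x_t); emb(x_{t+1})] as one vector of length 3 * d_w
        window = padded[t - 1] + padded[t] + padded[t + 1]
        p_t = [math.tanh(sum(w * x for w, x in zip(row, window)) + bi)
               for row, bi in zip(W, b)]
        feats.append(p_t)
    # c[i] = max over t of p_t[i]
    return [max(col) for col in zip(*feats)]

embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # T = 3, d_w = 2
W = [[0.5, 0.0, 0.0, 0.0, 0.0, 0.0],             # d_h = 2 rows, 3 * d_w = 6 columns
     [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]

c = conv3_maxpool(embs, W, b)
print(c)
```

The first filter only looks at the first component of the left neighbor and the second filter at the last component of the right neighbor, so `c` comes out as `[tanh(0.5), tanh(1.0)]`.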
I want to have a `PAD` token available for padding the input data, so I build a vocabulary and dataset that include it.
code
cnn_vocab_list = ['[PAD]', '[UNK]'] + vocab_in_train
cnn_vocab_dict = {x:n for n, x in enumerate(cnn_vocab_list)}
def cnn_sent_to_ids(sent):
    return torch.tensor([cnn_vocab_dict[x if x in cnn_vocab_dict else '[UNK]'] for x in sent], dtype=torch.long)
print(train_x[0])
print(cnn_sent_to_ids(train_x[0]))
output
['Kathleen', 'Sebelius', "'", 'LGBT', 'legacy']
tensor([ 1, 1, 3, 2649, 1])
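Since the window width is 3, one padding position is needed at each end of a sentence. These are not added to the ID sequences here; the convolution layer supplies them later.
code
def cnn_dataset_to_ids(dataset):
    return [cnn_sent_to_ids(x) for x in dataset]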
cnn_train_s = cnn_dataset_to_ids(train_x)
cnn_valid_s = cnn_dataset_to_ids(valid_x)
cnn_test_s = cnn_dataset_to_ids(test_x)
cnn_train_s[:3]
output
[tensor([ 1, 1, 3, 2649, 1]),
tensor([ 10, 6741, 1446, 2077, 584, 11, 548, 33, 52, 874, 6742]),
tensor([ 1, 206, 4199, 316, 1900, 1233, 1])]
code
class CNNDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        # Pad every sentence to max_seq_len with ID 0 ([PAD]); the result is (batch, max_seq_len)
        src = [torch.cat([x['src'], torch.zeros(max_seq_len - x['lengths'], dtype=torch.long)], dim=-1) for x in xs]
        src = torch.stack(src)
        # 1 for real tokens, 0 for padding
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src': src,
            'trg': torch.tensor([x['trg'] for x in xs]),
            'mask': mask,
        }
cnn_train_dataset = CNNDataset(cnn_train_s, train_t)
cnn_valid_dataset = CNNDataset(cnn_valid_s, valid_t)
cnn_test_dataset = CNNDataset(cnn_test_s, test_t)
Create a CNN model.
code
class CNNClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.conv = nn.Conv1d(e_size, h_size, 3, padding=1)
        self.act = nn.ReLU()
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.normal_(self.embed.weight, 0, 0.1)
        nn.init.kaiming_normal_(self.conv.weight)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_normal_(self.out.weight)
        nn.init.constant_(self.out.bias, 0)

    def forward(self, batch):
        x = self.embed(batch['src'])
        x = self.dropout(x)
        # (batch, time, e_size) -> (batch, e_size, time) for Conv1d
        x = self.conv(x.transpose(-1, -2))
        x = self.act(x)
        x = self.dropout(x)
        # Push padded positions far below any ReLU output before pooling
        x.masked_fill_(batch['mask'].unsqueeze(-2) == 0, -1e4)
        x = F.max_pool1d(x, x.size(-1)).squeeze(-1)
        x = self.out(x)
        return x
Because `padding=1` is specified for `nn.Conv1d`, one zero vector is inserted at each end of the sequence. This padding is applied to the embedded vectors rather than to token IDs; the ID `0` (`[PAD]`) that `collate` appends to equalize lengths is handled separately by the mask.
`nn.Conv1d` convolves along the last dimension, so `transpose` is used to make the last dimension the time axis.
The values at padded positions are set to `-1e4` so that max pooling does not pick them up.
`max_pool1d` requires a kernel size, so the full sequence length `x.size(-1)` is passed to pool over all time steps.
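The effect of the mask on max pooling is easy to check on a single toy sentence. In this pure-Python sketch, `masked_max_pool` is a hypothetical helper and `feats` holds made-up feature values for three time steps, the last of which is padding.

```python
def masked_max_pool(features, mask, fill=-1e4):
    # features: T rows of d_h values for one sentence; mask: T flags (1 = real token)
    masked = [row if keep else [fill] * len(row) for row, keep in zip(features, mask)]
    # Take the maximum over time for each feature dimension
    return [max(col) for col in zip(*masked)]

feats = [[0.2, 0.0], [0.5, 0.1], [0.9, 0.7]]  # the last step is padding
print(masked_max_pool(feats, [1, 1, 0]))       # [0.5, 0.1]
print(masked_max_pool(feats, [1, 1, 1]))       # [0.9, 0.7]
```

Without the mask, the padded step would win the maximum in both dimensions, so the document vector would depend on how much padding the batch happened to contain.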
Train the model built in Problem 86 using stochastic gradient descent (SGD). Train the model while displaying the loss and accuracy on the training data and the loss and accuracy on the evaluation data, and stop by an appropriate criterion (for example, 10 epochs).
Let's train it.
code
def init_cnn_embed(embed):
    for i, token in enumerate(cnn_vocab_list):
        if token in vectors:
            embed.weight.data[i] = torch.from_numpy(vectors[token])
    return embed
code
model = CNNClassifier(len(cnn_vocab_dict), 300, 128, 4)
init_cnn_embed(model.embed)
loaders = (
gen_maxtokens_loader(cnn_train_dataset, 4000),
gen_descending_loader(cnn_valid_dataset, 32),
)
task = Task()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
trainer = Trainer(model, loaders, task, optimizer, 10, device)
trainer.train()
output
epoch 0, train_loss:1.03501, valid_loss:0.85454
epoch 1, train_loss:0.68068, valid_loss:0.70825
epoch 2, train_loss:0.56784, valid_loss:0.60257
epoch 3, train_loss:0.50570, valid_loss:0.55611
epoch 4, train_loss:0.45707, valid_loss:0.52386
epoch 5, train_loss:0.42078, valid_loss:0.48479
epoch 6, train_loss:0.38858, valid_loss:0.45933
epoch 7, train_loss:0.36667, valid_loss:0.43547
epoch 8, train_loss:0.34746, valid_loss:0.41509
epoch 9, train_loss:0.32849, valid_loss:0.40350
code
predictor = Predictor(model, gen_loader(cnn_train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))
predictor = Predictor(model, gen_loader(cnn_test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))
output
Correct answer rate in training data: 0.9106140022463497
Correct answer rate in evaluation data: 0.8847305389221557
Build a high-performance category classifier by modifying the code of Problem 85 and Problem 87 and adjusting the shape and hyperparameters of the neural network.
Since this is a good opportunity, I wrote a model that passes the LSTM output through a CNN and then applies max pooling.
code
class BiLSTMCNNDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        # The convolution uses padding=1 and keeps the sequence length,
        # so the mask must cover all max_seq_len positions
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src': pad([x['src'] for x in xs]),
            'trg': torch.stack([x['trg'] for x in xs], dim=-1),
            'mask': mask,
            'lengths': torch.stack([x['lengths'] for x in xs], dim=-1)
        }
rnncnn_train_dataset = BiLSTMCNNDataset(cnn_train_s, train_t)
rnncnn_valid_dataset = BiLSTMCNNDataset(cnn_valid_s, valid_t)
rnncnn_test_dataset = BiLSTMCNNDataset(cnn_test_s, test_t)
code
class BiLSTMCNNClassifier(nn.Module):
    def __init__(self, v_size, e_size, h_size, c_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(v_size, e_size)
        self.rnn = nn.LSTM(e_size, h_size, bidirectional = True)
        self.conv = nn.Conv1d(h_size* 2, h_size, 3, padding=1)
        self.act = nn.ReLU()
        self.out = nn.Linear(h_size, c_size)
        self.dropout = nn.Dropout(dropout)
        nn.init.uniform_(self.embed.weight, -0.1, 0.1)
        for name, param in self.rnn.named_parameters():
            if 'weight' in name or 'bias' in name:
                nn.init.uniform_(param, -0.1, 0.1)
        nn.init.kaiming_normal_(self.conv.weight)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_normal_(self.out.weight)
        nn.init.constant_(self.out.bias, 0)

    def forward(self, batch, h=None):
        x = self.embed(batch['src'])
        x = self.dropout(x)
        # Recent PyTorch versions require the lengths given to pack() to be on the CPU
        x = pack(x, batch['lengths'].cpu())
        x, (h, c) = self.rnn(x, h)
        x, _ = unpack(x)
        x = self.dropout(x)
        # (time, batch, 2 * h_size) -> (batch, 2 * h_size, time) for Conv1d
        x = self.conv(x.permute(1, 2, 0))
        x = self.act(x)
        x = self.dropout(x)
        x.masked_fill_(batch['mask'].unsqueeze(-2) == 0, -1)
        x = F.max_pool1d(x, x.size(-1)).squeeze(-1)
        x = self.out(x)
        return x
code
loaders = (
gen_maxtokens_loader(rnncnn_train_dataset, 4000),
gen_descending_loader(rnncnn_valid_dataset, 32),
)
task = Task()
for h in [32, 64, 128, 256, 512]:
    model = BiLSTMCNNClassifier(len(cnn_vocab_dict), 300, h, 4)
    init_cnn_embed(model.embed)
    optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9, nesterov=True)
    trainer = Trainer(model, loaders, task, optimizer, 10, device)
    trainer.train()
    predictor = Predictor(model, gen_loader(rnncnn_test_dataset, 1), device)
    pred = predictor.predict()
    print('Correct answer rate in evaluation data:', accuracy(test_t, pred))
output
epoch 0, train_loss:1.21905, valid_loss:1.12725
epoch 1, train_loss:0.95913, valid_loss:0.84094
epoch 2, train_loss:0.66851, valid_loss:0.66997
epoch 3, train_loss:0.57141, valid_loss:0.61373
epoch 4, train_loss:0.52795, valid_loss:0.59354
epoch 5, train_loss:0.49844, valid_loss:0.57013
epoch 6, train_loss:0.47408, valid_loss:0.55163
epoch 7, train_loss:0.44922, valid_loss:0.52349
epoch 8, train_loss:0.41864, valid_loss:0.49231
epoch 9, train_loss:0.38975, valid_loss:0.46807
Correct answer rate in evaluation data: 0.8690119760479041
epoch 0, train_loss:1.16516, valid_loss:1.06582
epoch 1, train_loss:0.81246, valid_loss:0.71224
epoch 2, train_loss:0.58068, valid_loss:0.61988
epoch 3, train_loss:0.52451, valid_loss:0.58465
epoch 4, train_loss:0.48807, valid_loss:0.55663
epoch 5, train_loss:0.45712, valid_loss:0.52742
epoch 6, train_loss:0.41639, valid_loss:0.50089
epoch 7, train_loss:0.38595, valid_loss:0.46442
epoch 8, train_loss:0.35262, valid_loss:0.43459
epoch 9, train_loss:0.32527, valid_loss:0.40692
Correct answer rate in evaluation data: 0.8772455089820359
epoch 0, train_loss:1.12191, valid_loss:0.97533
epoch 1, train_loss:0.71378, valid_loss:0.66554
epoch 2, train_loss:0.55280, valid_loss:0.59733
epoch 3, train_loss:0.50526, valid_loss:0.57163
epoch 4, train_loss:0.46889, valid_loss:0.53955
epoch 5, train_loss:0.43500, valid_loss:0.50500
epoch 6, train_loss:0.40006, valid_loss:0.47222
epoch 7, train_loss:0.36444, valid_loss:0.43941
epoch 8, train_loss:0.33329, valid_loss:0.41224
epoch 9, train_loss:0.30588, valid_loss:0.39965
Correct answer rate in evaluation data: 0.8839820359281437
epoch 0, train_loss:1.04536, valid_loss:0.84626
epoch 1, train_loss:0.61410, valid_loss:0.62255
epoch 2, train_loss:0.49830, valid_loss:0.55984
epoch 3, train_loss:0.44190, valid_loss:0.51720
epoch 4, train_loss:0.39713, valid_loss:0.46718
epoch 5, train_loss:0.35052, valid_loss:0.43181
epoch 6, train_loss:0.32145, valid_loss:0.39898
epoch 7, train_loss:0.30279, valid_loss:0.37586
epoch 8, train_loss:0.28171, valid_loss:0.37333
epoch 9, train_loss:0.26904, valid_loss:0.37849
Correct answer rate in evaluation data: 0.8884730538922155
epoch 0, train_loss:0.93974, valid_loss:0.71999
epoch 1, train_loss:0.53687, valid_loss:0.58747
epoch 2, train_loss:0.44848, valid_loss:0.52432
epoch 3, train_loss:0.38761, valid_loss:0.46509
epoch 4, train_loss:0.34431, valid_loss:0.43651
epoch 5, train_loss:0.31699, valid_loss:0.39881
epoch 6, train_loss:0.28963, valid_loss:0.38732
epoch 7, train_loss:0.27550, valid_loss:0.37152
epoch 8, train_loss:0.26003, valid_loss:0.36476
epoch 9, train_loss:0.24991, valid_loss:0.36012
Correct answer rate in evaluation data: 0.8944610778443114
I trained the model while varying the size of the hidden state. Within the range tried, a larger hidden state gave better results: the accuracy on the evaluation data rose steadily from about 0.869 at size 32 to about 0.894 at size 512.
Build a model that classifies news article headlines into categories, starting from a pre-trained language model (e.g. BERT).
Use huggingface / transformers.
code
from transformers import BertConfig, BertForSequenceClassification, BertTokenizer
BERT takes WordPiece tokens as input, so the text has to go through its tokenizer. Prepare a tokenizer.
code
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
Tokenize.
code
def read_for_bert(filename):
    with open(filename) as f:
        dataset = f.read().splitlines()
    dataset = [line.split('\t') for line in dataset]
    dataset_t = [categories.index(line[0]) for line in dataset]
    # tokenizer.encode() returns WordPiece IDs including the special tokens
    dataset_x = [torch.tensor(tokenizer.encode(line[1]), dtype=torch.long) for line in dataset]
    return dataset_x, torch.tensor(dataset_t, dtype=torch.long)
bert_train_x, bert_train_t = read_for_bert('data/train.txt')
bert_valid_x, bert_valid_t = read_for_bert('data/valid.txt')
bert_test_x, bert_test_t = read_for_bert('data/test.txt')
Prepare a dataset class for BERT. It takes care of padding and so on.
`mask` is the attention mask, which keeps attention away from the padding tokens.
code
class BertDataset(Dataset):
    def collate(self, xs):
        max_seq_len = max([x['lengths'] for x in xs])
        src = [torch.cat([x['src'], torch.zeros(max_seq_len - x['lengths'], dtype=torch.long)], dim=-1) for x in xs]
        src = torch.stack(src)
        # 1 for real tokens, 0 for padding
        mask = [[1] * x['lengths'] + [0] * (max_seq_len - x['lengths']) for x in xs]
        mask = torch.tensor(mask, dtype=torch.long)
        return {
            'src': src,
            'trg': torch.tensor([x['trg'] for x in xs]),
            'mask': mask,
        }
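What `collate` produces for BERT can be illustrated on two short ID sequences. In this pure-Python sketch, `pad_with_attention_mask` is a hypothetical helper and the ID values are placeholders chosen for the example, not real tokenizer output.

```python
def pad_with_attention_mask(seqs, pad_id=0):
    max_len = max(len(s) for s in seqs)
    ids = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding that attention should ignore
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return ids, mask

ids, mask = pad_with_attention_mask([[101, 7, 8, 102], [101, 9, 102]])
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The mask has the same shape as the padded IDs, which is what the `attention_mask` argument of the model expects.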
code
bert_train_dataset = BertDataset(bert_train_x, bert_train_t)
bert_valid_dataset = BertDataset(bert_valid_x, bert_valid_t)
bert_test_dataset = BertDataset(bert_test_x, bert_test_t)
Load the pre-trained model.
code
class BertClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        config = BertConfig.from_pretrained('bert-base-cased', num_labels=4)
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-cased', config=config)

    def forward(self, batch):
        x = self.bert(batch['src'], attention_mask=batch['mask'])
        # The first element of the output holds the classification logits
        return x[0]
code
model = BertClassifier()
loaders = (
gen_maxtokens_loader(bert_train_dataset, 1000),
gen_descending_loader(bert_valid_dataset, 32),
)
task = Task()
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
trainer = Trainer(model, loaders, task, optimizer, 5, device)
trainer.train()
code
predictor = Predictor(model, gen_loader(bert_train_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in training data:', accuracy(train_t, pred))
predictor = Predictor(model, gen_loader(bert_test_dataset, 1), device)
pred = predictor.predict()
print('Correct answer rate in evaluation data:', accuracy(test_t, pred))
output
Correct answer rate in training data: 0.9927929614376638
Correct answer rate in evaluation data: 0.9229041916167665
Sounds good.
Language processing 100 knocks 2020 Chapter 10: Machine translation (90-98)