This article was written as part of an internship at NS Solutions (NSSOL).
This article is organized as follows.
First, let me give a rough overview of PPLM, the model covered in this article.
The Plug and Play Language Model (PPLM) was proposed in "PLUG AND PLAY LANGUAGE MODELS: A SIMPLE APPROACH TO CONTROLLED TEXT GENERATION". The authors' implementation is also available on GitHub.
This work tackles the task of controlled text generation: having a language generation model produce sentences that match specified attributes (polarity such as positive/negative, or topics such as "politics" and "science").
The great thing about PPLM is that it achieves controlled text generation by training only a small additional model on top of an existing language generation model. The advantage becomes clear when compared with existing methods, whose approaches to controlled text generation can be roughly categorized as follows.
The former achieves sufficient performance in controlling attributes, but has a big problem in terms of training cost: whenever you want to add a new attribute, you have to retrain the large model. The latter has no problem with training cost, but its performance is significantly inferior to the former. The method proposed in PPLM keeps the training cost very low while achieving performance comparable to the former.
PPLM targets language generation models that use the Transformer as a decoder. Since the authors' implementation uses GPT-2, the rest of this article assumes GPT-2.
This figure is quoted from the original PPLM paper (https://openreview.net/pdf?id=H1edEyBKDS). The blue labels ($H_1, H_2, x_0, x_1, x_2$) are my additions.
If you look only at the black arrows in the figure and ignore [Attribute Model p(a|x)], what remains is the original Transformer-decoder model (hereinafter, the original model).
The LM (Language Model) in the figure is a stack of L Transformer Decoder blocks.
$H_t$ holds the self-attention Key and Value pairs produced in each of these L blocks. That is,
\begin{align}
H_t = \left[ (K_t^{(1)}, V_t^{(1)}), \ldots, (K_t^{(L)}, V_t^{(L)}) \right]
\end{align}
where $(K_t^{(i)}, V_t^{(i)})$ is the Key/Value pair from the $i$-th block.
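As a concrete, library-dependent illustration (my own, not part of the paper or the authors' code): with HuggingFace transformers, these per-layer Key/Value pairs are exposed as GPT-2's "past" cache. A minimal sketch, assuming a recent transformers version (return types differ across versions):

```python
# Minimal sketch of what H_t corresponds to in practice:
# the per-layer Key/Value cache of GPT-2.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

input_ids = tokenizer.encode("The potato", return_tensors="pt")
with torch.no_grad():
    out = model(input_ids, use_cache=True)

past = out.past_key_values                 # plays the role of H_t
print(len(past))                           # number of Transformer blocks (L)
print(past[0][0].shape, past[0][1].shape)  # Key / Value tensors of the first block
```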
PPLM proposes two ways to construct the Attribute Model $p(a|x)$: one using Bag-of-Words and one using a discriminator. Here, we explain both together with the implementation code.
In this approach, a set of keywords related to the attribute is prepared in advance. The attributes provided by the authors' implementation are "computers", "fantasy", "kitchen", "legal", "military", "politics", "positive_words", "religion", "science", and "space". For example, "science" consists of 48 words such as "astronomy", "atom", "biology", "cell", and "chemical".
Given the keyword set $\{w_1, \ldots, w_k\}$ for an attribute $a$, and the word distribution $p_{t+1}$ computed by the original model, $p(a \mid x_{t+1})$, which represents how plausible the output word $x_{t+1}$ is for the attribute $a$, is defined as
\begin{align}
\log p(a \mid x_{t+1}) = \log \left( \sum_{i}^{k} p_{t+1}[w_i] \right)
\end{align}
That is, the more likely the model is to output any of the attribute keywords at step $t+1$, the higher the attribute likelihood.
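As a minimal sketch of this definition (my own illustration, not the authors' code), assuming `probs` holds the distribution $p_{t+1}$ over the vocabulary and `bow_indices` holds the vocabulary ids of the keywords $w_1, \ldots, w_k$:

```python
import torch

def bow_log_likelihood(probs: torch.Tensor, bow_indices: list) -> torch.Tensor:
    """Sketch of log p(a | x_{t+1}) = log( sum_i p_{t+1}[w_i] ).

    probs       : tensor of shape (vocab_size,), the distribution p_{t+1}
    bow_indices : vocabulary ids of the attribute keywords
    """
    return torch.log(probs[bow_indices].sum())

# toy usage with a hypothetical 10-word vocabulary and keyword ids 2, 5, 7
probs = torch.softmax(torch.randn(10), dim=-1)
print(bow_log_likelihood(probs, [2, 5, 7]))
```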
The Bag-of-Words Attribute Model described above is simple, but it has a problem: some attributes are difficult to express with only a set of keywords. In such cases, the discriminator model described in this section is useful.
The attribute discriminator is a model that estimates the attribute from a representation of the generated words. Below is its implementation from run_pplm_discrim_train.py.
```python
# imports added here for completeness (the snippet follows the authors' code,
# which was written against an older transformers API); EPSILON is a small
# constant defined in the original file.
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

EPSILON = 1e-10


class Discriminator(torch.nn.Module):
    """Transformer encoder followed by a Classification Head"""

    def __init__(
            self,
            class_size,
            pretrained_model="gpt2-medium",
            cached_mode=False,
            device='cpu'
    ):
        super(Discriminator, self).__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model)
        self.encoder = GPT2LMHeadModel.from_pretrained(pretrained_model)
        self.embed_size = self.encoder.transformer.config.hidden_size
        self.classifier_head = ClassificationHead(
            class_size=class_size,
            embed_size=self.embed_size
        )
        self.cached_mode = cached_mode
        self.device = device

    def get_classifier(self):
        return self.classifier_head

    def train_custom(self):
        for param in self.encoder.parameters():
            param.requires_grad = False  # the pre-trained GPT-2 is frozen
        self.classifier_head.train()

    def avg_representation(self, x):
        # mask used to ignore the 0 tokens added by padding
        mask = x.ne(0).unsqueeze(2).repeat(
            1, 1, self.embed_size
        ).float().to(self.device).detach()
        hidden, _ = self.encoder.transformer(x)
        masked_hidden = hidden * mask
        avg_hidden = torch.sum(masked_hidden, dim=1) / (
            torch.sum(mask, dim=1).detach() + EPSILON
        )
        return avg_hidden

    def forward(self, x):
        if self.cached_mode:
            avg_hidden = x.to(self.device)
        else:
            avg_hidden = self.avg_representation(x.to(self.device))
        logits = self.classifier_head(avg_hidden)
        probs = F.log_softmax(logits, dim=-1)
        return probs


class ClassificationHead(torch.nn.Module):
    """Classification Head for transformer encoders"""

    def __init__(self, class_size, embed_size):
        super(ClassificationHead, self).__init__()
        self.class_size = class_size
        self.embed_size = embed_size
        self.mlp = torch.nn.Linear(embed_size, class_size)

    def forward(self, hidden_state):
        logits = self.mlp(hidden_state)
        return logits
```
The Discriminator class consists of two parts: a pre-trained model (GPT-2) and the ClassificationHead that is actually trained. The pre-trained model is never retrained, and when PPLM is executed only the trained ClassificationHead is used (as we will see in the next chapter). Let's look at the avg_representation method. First, the input x is passed through the pre-trained GPT-2:
hidden, _ = self.encoder.transformer(x)
This hidden is the distributed representation of each word of the input x. Because the input Tensor is batch-processed, it contains extra tokens added by padding, so a mask is applied to ignore them; the distributed representations at the padded positions become 0 (masked_hidden). Finally, avg_hidden is obtained by summing the word representations in each sentence and dividing by the number of non-padding words, i.e. averaging them. If you think of the words in a sentence as being combined this way, avg_hidden can be interpreted as a distributed representation of the sentence.

This avg_hidden is the input to the ClassificationHead. The ClassificationHead is a neural network with no hidden layer, only an input layer and an output layer. The number of input nodes equals the dimension of avg_hidden, and the number of output nodes equals the number of attribute classes. The output is the logits, to which a softmax is applied and then the logarithm taken (log_softmax):
logits = self.classifier_head(avg_hidden)
probs = F.log_softmax(logits, dim=-1)
The value of probs (= output_t) corresponding to the correct class (target_t), i.e. the negative log-likelihood, becomes the loss, and this loss is backpropagated for training.
```python
loss = F.nll_loss(output_t, target_t)
loss.backward(retain_graph=True)
optimizer.step()
```
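For reference, here is a simplified sketch of the training loop described above (my own summary, not the exact code from run_pplm_discrim_train.py; `train_loader` is a hypothetical DataLoader yielding batches of padded token ids and attribute labels):

```python
import torch
import torch.nn.functional as F

def train_epoch(discriminator, train_loader, device="cpu", lr=1e-4):
    # only the ClassificationHead is trained; the GPT-2 encoder stays frozen
    discriminator.train_custom()
    optimizer = torch.optim.Adam(discriminator.get_classifier().parameters(), lr=lr)

    for x, target_t in train_loader:           # x: padded token ids, target_t: class labels
        x, target_t = x.to(device), target_t.to(device)
        optimizer.zero_grad()
        output_t = discriminator(x)             # log-probabilities over attribute classes
        loss = F.nll_loss(output_t, target_t)   # negative log-likelihood of the correct class
        loss.backward()
        optimizer.step()
```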
After training the Attribute Model as in the previous chapter, the next step is to actually run PPLM. In fact, pre-trained Attribute Models are provided, so if you just want to check that it works you do not need to train one yourself; training is also unnecessary when using the Bag-of-Words Attribute Model. As mentioned in [Model Configuration](#Model Configuration), PPLM controls the output word by using $\tilde{H_t}$, an updated version of the original model's $H_t$. The update is performed as follows.
\begin{align}
\tilde{H_t} &= H_t + \Delta H_t \\
\Delta H_t & \leftarrow \Delta H_t + \alpha \frac{\nabla_{\Delta H_t} \log p(a | H_t + \Delta H_t)}{||\nabla_{\Delta H_t} \log p(a | H_t + \Delta H_t)||^{\gamma}}
\end{align}
$\alpha$ and $\gamma$ are hyperparameters. What is computed here is the update $\Delta H_t$ that increases the likelihood of the attribute $a$. $\Delta H_t$ itself is computed and updated several times before being added to $H_t$; this is repeated 3 to 10 times (the default in the implementation code is 3). The update is done in the perturb_past function of run_pplm.py. First, let's look at the implementation for the Attribute Model using Bag-of-Words.
```python
loss = 0.0
bow_logits = torch.mm(probs, torch.t(one_hot_bow))
bow_loss = -torch.log(torch.sum(bow_logits))
loss += bow_loss
```
Although some of the surrounding processing is omitted, probs corresponds to $p_{t+1}$, the likelihood distribution over all words in the language model's vocabulary. one_hot_bow represents each word in the attribute keyword set as a one-hot vector over that same vocabulary. bow_logits is the product of the two, i.e. the likelihood of each keyword, and its sum corresponds to the computation of $\sum_{i}^{k} p_{t+1}[w_i]$ described in [Attribute Model by Bag-of-Words](#Attribute Model by Bag-of-Words). The negative logarithm of this sum is bow_loss, which corresponds to $-\log p(a \mid H_t + \Delta H_t)$.
Next, the implementation for the Attribute Model using the discriminator is as follows.
```python
ce_loss = torch.nn.CrossEntropyLoss()
prediction = classifier(new_accumulated_hidden / (curr_length + 1 + horizon_length))
label = torch.tensor(prediction.shape[0] * [class_label],
                     device=device,
                     dtype=torch.long)
discrim_loss = ce_loss(prediction, label)
loss += discrim_loss
```
classifier is the ClassificationHead part of the discriminator trained in the previous chapter.

new_accumulated_hidden needs some explanation. The hidden states output from the final Transformer layer when the updated $\tilde{H_t}$ is given to GPT-2 are summed, in the same way as avg_hidden explained in [Attribute Model by discriminator](#Attribute Model by discriminator). The same computation is performed with the un-updated $H_t$, and the sum of the two is new_accumulated_hidden. This part is hard to follow without actually reading the code, but it is enough to understand that a hidden state is being fed into the classifier.

curr_length is the current number of input words to GPT-2, and the default value of horizon_length is 1 (the original implementation's explanation does not make clear what role horizon_length plays). The class_label used for label is given by the user when run_pplm.py is executed, for example a pre-assigned index for the positive class. The cross-entropy loss between this label and prediction is computed, and the resulting discrim_loss also corresponds to $-\log p(a \mid H_t + \Delta H_t)$.

The Bag-of-Words and discriminator Attribute Models can also be used together; in that case, bow_loss and discrim_loss are simply added:
```python
loss = 0.0
loss += bow_loss
loss += discrim_loss
```
The loss computed up to here only serves to bring the output closer to the specified attribute; updating $\Delta H_t$ with it alone can easily destroy the fluency of the generated text. PPLM therefore adds countermeasures to preserve fluency. The first is to add the KL-Divergence between the updated distribution $\tilde{p}_{t+1}$ and the original distribution $p_{t+1}$ to the loss:
```python
kl_loss = kl_scale * (
    (corrected_probs * (corrected_probs / unpert_probs).log()).sum()
)
loss += kl_loss
```
Here, corrected_probs is $\tilde{p}_{t+1}$ and unpert_probs is $p_{t+1}$. kl_scale is a hyperparameter and is basically set to 0.01. The computed kl_loss is added to bow_loss, discrim_loss, or their sum, and when $\Delta H_t$ is updated, all of these terms are moved together in the gradient direction.
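Putting the update rule and these loss terms together, the following is a heavily simplified sketch of the $\Delta H_t$ update (my own illustration; the actual perturb_past function in run_pplm.py additionally handles per-layer tensors, windowing, and the horizon term):

```python
import torch

def perturb_sketch(attribute_loss_fn, kl_loss_fn, delta_shape,
                   alpha=0.03, gamma=1.5, num_iterations=3):
    """Simplified Delta H_t update.

    attribute_loss_fn(delta) : returns -log p(a | H_t + delta)  (bow_loss and/or discrim_loss)
    kl_loss_fn(delta)        : returns the KL term between the updated and original p_{t+1}
    delta_shape              : shape of the tensor playing the role of Delta H_t
    """
    delta = torch.zeros(delta_shape)
    for _ in range(num_iterations):
        delta = delta.detach().requires_grad_(True)
        loss = attribute_loss_fn(delta) + kl_loss_fn(delta)
        loss.backward()
        # gradient ascent on log p(a | ...), i.e. descent on the loss, with a normalized step
        step = -alpha * delta.grad / (delta.grad.norm() ** gamma + 1e-10)
        delta = (delta + step).detach()
    return delta  # H~_t = H_t + delta
```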
The KL-Divergence term is the countermeasure built into the update step. The other countermeasure is applied when actually sampling a word: after computing $\tilde{p}_{t+1}$, words are sampled from a geometric mean of $\tilde{p}_{t+1}$ and the original $p_{t+1}$:
```python
pert_probs = ((pert_probs ** gm_scale) * (unpert_probs ** (1 - gm_scale)))
pert_probs = top_k_filter(pert_probs, k=top_k, probs=True)
if torch.sum(pert_probs) <= 1:
    pert_probs = pert_probs / torch.sum(pert_probs)

if sample:
    last = torch.multinomial(pert_probs, num_samples=1)
else:
    _, last = torch.topk(pert_probs, k=1, dim=-1)
```
The first line computes $\tilde{p}_{t+1}^{\,\gamma_{gm}} \, p_{t+1}^{\,1-\gamma_{gm}}$. The second line applies a filter that keeps only the $k$ most likely words, so that very unlikely words cannot be sampled. Lines 3-4 normalize the distribution (the division by the normalizing coefficient $\beta$ in the paper). Lines 6-7 sample a word according to the resulting likelihood distribution, while lines 8-9 greedily pick the word with maximum likelihood.
These two devices, updating $H_t$ and adjusting the sampling, are the core ideas of PPLM.
Here are some actual generation examples. The authors' implementation provides two example commands, so let's try them.
First, PPLM with Bag-of-Words. We use the following command provided by the authors.
python run_pplm.py -B military --cond_text "The potato" --length 50 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.03 --window_length 5 --kl_scale 0.01 --gm_scale 0.99 --colorama --sample
-B military specifies the Bag-of-Words model for the military attribute. Here are the results.
Unperturbed generated text <|endoftext|>The potato is probably the world's most widely eaten plant. But what if it's also the most dangerous?
In the last two decades, there's been a dramatic decrease in potato crop damage from crop rot and disease. The decline, which started in
Perturbed generated text 1 <|endoftext|>The potato-flour soup that is the best way to start a weekend! The following recipe is one of several that I have been working on over the past few months. I think it is the best of them. It uses all the elements of the
Perturbed generated text 2 <|endoftext|>The potato bomb and the anti-Semitic attack that killed four Jewish students at a Jewish school in France are the most recent examples of bomb threats targeting Israeli facilities. The latest bomb threat targeting a U.S. nuclear facility, the bomb was sent out over the
Perturbed generated text 3 <|endoftext|>The potato chip explosion has been a boon to the world's food industry since its release in late March. A handful of companies have already announced plans to produce chips using the chips, including Chipotle Mexican Grill Corp.'s parent company, Taco Bell Corp.'s
Perturbed generated text 4 <|endoftext|>The potato is a very popular and delicious vegetable in many countries, but it can also cause severe health problems for people. The health of your body depends on your diet. If your diet doesn't include enough protein to get through the meal, or if you are
Perturbed generated text 5 <|endoftext|>The potato plant, which is a member of the same family as wheat, can be found around the world. It's also used to make potato chips, bread, and other food products.
The Plant
The plant grows as a seed and produces
Perturbed generated text 6 <|endoftext|>The potato bomb has been a controversial weapon for years. The device is packed with bomb-like devices and packed on a bomb-filled potato bomb. It's a bomb that detonates in the bomb-packed potato bomb and explodes in the potato bomb. So
Perturbed generated text 7 <|endoftext|>The potato has a lot in common with the human earworm: The first, and only, time you hear it, you'll hear the sound of the potato in your ear as well.
It's the first sound you hear when your cat or dog
Perturbed generated text 8 <|endoftext|>The potato salad is coming to a restaurant near you!
The new restaurant, in the heart of downtown Chicago, will be named the Potato Salad.
A photo posted by @the_mike_barnes on Aug 7, 2016 at
Perturbed generated text 9 <|endoftext|>The potato is a staple in many people's diet, and it is an easy food to make in your home.
The best potato chips in the world are made by hand using only potatoes.
The potato is a staple in many people's diet
Perturbed generated text 10 <|endoftext|>The potato bomb is an improvised explosive device, typically containing one bomb and no more than 10 grams of explosive and containing no explosive material.
Bombardment of an aircraft aircraft, a tank truck or explosive device
Bombardment of an aircraft aircraft
Sentences are generated independently, as many as the num_samples specified at execution time (10 here). Words related to the attribute are highlighted in red; specifying --colorama at runtime colors them in the standard output. Let's look at the examples. First, the generation from the original, uncontrolled model contains no military elements. Among the controlled generations, the word "bomb" appears frequently in 2, 6, and 10, but the other examples do not give a particularly military impression. Generating military text starting from "The potato" may simply be a somewhat difficult combination.
Next, let's try the model that uses the attribute discriminator. We run the following command from the authors' implementation.
python run_pplm.py -D sentiment --class_label 2 --cond_text "My dog died" --length 50 --gamma 1.0 --num_iterations 10 --num_samples 10 --stepsize 0.04 --kl_scale 0.01 --gm_scale 0.95 --sample
Here are the results.
Unperturbed generated text <|endoftext|>My dog died in February, after suffering from severe arthritis. He had been suffering with a terrible cold that was causing his skin to break. I couldn't afford a replacement dog and couldn't afford to have him taken to the vet. I knew the vet would be
Perturbed generated text 1 <|endoftext|>My dog died of a heart attack at the age of 88, his son said, and her death has shocked and brought closure to the family. (Published Wednesday, March 12, 2017)
A mother who was found dead at home with a heart attack on
Perturbed generated text 2 <|endoftext|>My dog died from a rare and potentially deadly form of a rare form of sickle cell disease.
A rare form of sickle cell is called hemizygaly in the families.
The family is an important part of the game and it's
Perturbed generated text 3 <|endoftext|>My dog died after being shot.
A woman in the United States died after a man in his 20s opened fire at her home in North Carolina and injured several others.
On March 12 a neighbor heard a woman screaming. After she ran outside to
Perturbed generated text 4 <|endoftext|>My dog died of a heart attack, after suffering from a heart attack.
The title text of this page has a a a
of of the work and work in to be an in a way, that the idea of the idea to a
Perturbed generated text 5 <|endoftext|>My dog died from a rare form of cancer that was not known before.
The rare form of brain cancer called glioblastomatosis is more common in people of European descent. People of European descent are also at greater risk of glioma
Perturbed generated text 6 <|endoftext|>My dog died from anaphase and I don't know how to give birth to a child with a rare genetic condition, an important personal health gain, with health - "" " The " " " "'The'"'" The book " The word
Perturbed generated text 7 <|endoftext|>My dog died from a rare form of cancer, the Daily Mail reports. "I have a really strong desire to help others and so I am happy to have the chance to help others to be happy and to love their loved ones and that's something I love
Perturbed generated text 8 <|endoftext|>My dog died because I didn't let him go.
I have a 6-year-old, 3-year-old, 1-year-old, 2-year-old, and 2-year-old. I have a very active and
Perturbed generated text 9 <|endoftext|>My dog died of a heart attack while while while I was in the house. I had the old man's head and body, and a large one, I have my hands and feet with me. I have a good time, and the best, as I am
Perturbed generated text 10 <|endoftext|>My dog died from a rare form of cancer, scientists have found.... James M. He he is is is is a
A lot of a lot of a fun!! The Great Escape The Great Escape! The Great Escape! The Great Escape
Unlike Bag-of-Words, related words cannot be highlighted here. -D sentiment specifies the pre-trained "sentiment" discriminator, which discriminates between the two classes "very_positive" and "very_negative"; class_label=2, specified here, means "very_positive" (note that "very_negative" can be specified with class_label=3). In other words, this example tries to generate positive sentences from the prefix "My dog died", which would normally lead only to negative sentences. Looking at the examples, the generation from the original model without attribute control is pessimistic. Among the controlled generations, 2, 5, and 7 are relatively positive (or at least non-negative) sentences, with words such as "rare" and "love" standing out. In 1, 3, 4, 6, 8, 9, and 10, the negativity has not disappeared or the sentences are unnatural, which shows that generation becomes difficult when the beginning of the sentence and the attribute do not match.
The example above was a combination of sentence prefix and attribute that is hard to generate. Let's make the prefix and the attribute match a little better: the prefix "The potato" with the attribute positive_words.
python run_pplm.py -B positive_words --cond_text "The potato" --length 50 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.03 --window_length 5 --kl_scale 0.01 --gm_scale 0.99 --colorama --sample
Here are the results.
Unperturbed generated text <|endoftext|>The potato is probably the world's most widely eaten plant. But what if it's also the most dangerous?
In the last two decades, there's been a dramatic decrease in potato crop damage from crop rot and disease. The decline, which started in
Perturbed generated text 1 <|endoftext|>The potato-like, gluten-free, low-calorie, sweet, and nutritious sweet potato pie recipe. Easy to make, and perfect for those who love to eat sweet, healthy, and filling pie!
When my kids are home from school
Perturbed generated text 2 <|endoftext|>The potato has been a popular favorite since the 1980s. But with its recent popularity and rising popularity, is it time to eat your favorite potato again?
The potato is still a great food to enjoy and enjoy, with its healthy benefits and delicious flavor
Perturbed generated text 3 <|endoftext|>The potato chip craze is in full swing.
The popular snacks have been making the rounds in recent weeks as people seek out fresh and healthier alternatives to fried foods.
But there may have never been a better time to eat these crispy snacks than
Perturbed generated text 4 <|endoftext|>The potato is a very versatile, versatile vegetable and it is a great addition to many tasty salads, soups and stews.
The potato is the star of many salads and stirfries. I love the versatility of potatoes in many recipes.
Perturbed generated text 5 <|endoftext|>The potato is a common dish, so much so in fact that it is often served with pasta. It is often served with rice, or topped with a sweet and savoury sauce.
Serves 4
1 onion
2 cloves garlic
Perturbed generated text 6 <|endoftext|>The potato has become the new darling of American farmers in recent years. Its popularity is so great that it has even been featured in many successful television shows like "The Big Bang Theory".
But there has never been an easier way to prepare your favorite snack
Perturbed generated text 7 <|endoftext|>The potato is a favorite among the health-conscious, so what better time to try a new way to eat them? The recipe below is easy and healthy, and you can easily freeze it, freeze it for later, reheat it for breakfast or lunch,
Perturbed generated text 8 <|endoftext|>The potato salad that inspired the popular dish is one of a number of new varieties of the dish being sold at popular popular restaurants. (Photo: Thinkstock)
When it comes to classic American comfort food, a popular dish that's popular around the country
Perturbed generated text 9 <|endoftext|>The potato is a staple in many people's diet, and it is not only delicious in its own right, but is also a good protein source. It is easy to eat, nutritious, and healthy.
Potato, as we know it, originated
Perturbed generated text 10 <|endoftext|>The potato has been used as an ingredient in everything from salad dressing to soups for decades. However, it was once thought to be a poor performer in the kitchen. In recent years, scientists have shown potatoes to be a promising food source. The research shows
The first, uncontrolled generation states objective facts and does not give a particularly positive impression. The other sentences are mostly positive (although I am not sure what 2 is trying to say, and 3 reads a bit awkwardly with its "never ..." phrasing).
After all, it seems necessary to consider the combination of the sentence prefix and the attribute to some extent. The paper also notes that some attributes are more difficult to control than others.
In this experiment, we generated sentences of 50 words. The original GPT-2 takes only 2-3 seconds to generate one sentence, but the Bag-of-Words model takes about 22 seconds and the discriminator model about 95 seconds. Depending on the application, needing this long to produce a sentence of about 50 words is a bottleneck. Although there are very few parameters to train compared with the original model, there are several hyperparameters: $\alpha$ and $\gamma$ used when updating $\Delta H_t$, kl_scale for the KL-Divergence term, and $\gamma_{gm}$, which balances $p_{t+1}$ and $\tilde{p}_{t+1}$; because a single run takes so long, tuning them is costly. The long runtime is probably due to the repeated gradient computations performed when updating $H_t$; an ordinary network only propagates forward at generation time and has no such operation. This problem might be solved if one could build and train a network that outputs $\tilde{H_t}$ directly from $H_t$ without gradient computation.
The update of $H_t$, the main idea of this model, is limited to language models that use the Transformer as a decoder. If different architectures become mainstream in the future, this method may no longer be applicable.
In this article, we explained PPLM, a method for generating sentences that match specified attributes with a model that uses the Transformer as a decoder. The main idea of PPLM is to have an externally attached model recursively update the Keys and Values of the Transformer's self-attention in the direction that generates text with the specified attribute. This makes attribute control possible without retraining the original large model. The experiments show that while control generally works, generation becomes difficult when the sentence prefix and the attribute are incompatible, and the execution time is considerably longer than that of the original model.