This article was written as part of an internship at NS Solutions (NSSOL).
This article is organized as follows.
First, let me give a rough overview of PPLM, the model covered in this article.
The Plug and Play Language Model (PPLM) was proposed in "PLUG AND PLAY LANGUAGE MODELS: A SIMPLE APPROACH TO CONTROLLED TEXT GENERATION". The authors' implementation is also available on GitHub.
This work tackles the task of controlled text generation: having a language generation model produce sentences that match specified attributes (polarity such as positive/negative, or topics such as "politics" and "science").
The great thing about PPLM is that it achieves controlled text generation by training only a small additional model on top of an existing language generation model. The advantage becomes clear when compared with existing methods, whose approaches to controlled text generation can be roughly categorized as follows.
The former achieves sufficient performance in controlling attributes, but has a big problem in terms of training cost: whenever you want to add a new attribute, you have to retrain the large model. The latter has no problem with training cost, but its performance is significantly inferior to the former. The method proposed in PPLM keeps the training cost very low while achieving performance comparable to the former.
PPLM targets language generation models that use the Transformer as a decoder. Since the authors' implementation uses GPT-2, the rest of this article assumes GPT-2.
This figure is quoted from the original PPLM paper (https://openreview.net/pdf?id=H1edEyBKDS). The blue labels ($H_1, H_2, x_0, x_1, x_2$) are my additions.
If you look only at the black arrows in the figure and ignore [Attribute Model p(a|x)], what remains is the original Transformer-decoder model (hereinafter, the original model).
The LM (Language Model) in the figure is a stack of L Transformer Decoder blocks.
$H_t$ holds the self-attention Key and Value pairs produced in each of these L blocks. That is,
\begin{align}
H_t = \left[ (K_t^{(1)}, V_t^{(1)}), \ldots, (K_t^{(L)}, V_t^{(L)}) \right]
\end{align}
where $(K_t^{(i)}, V_t^{(i)})$ is the Key/Value pair from the $i$-th block.
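As a concrete, library-dependent illustration (my own, not part of the paper or the authors' code): with HuggingFace transformers, these per-layer Key/Value pairs are exposed as GPT-2's "past" cache. A minimal sketch, assuming a recent transformers version (return types differ across versions):

```python
# Minimal sketch of what H_t corresponds to in practice:
# the per-layer Key/Value cache of GPT-2.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

input_ids = tokenizer.encode("The potato", return_tensors="pt")
with torch.no_grad():
    out = model(input_ids, use_cache=True)

past = out.past_key_values                 # plays the role of H_t
print(len(past))                           # number of Transformer blocks (L)
print(past[0][0].shape, past[0][1].shape)  # Key / Value tensors of the first block
```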
PPLM proposes two ways to construct the Attribute Model $p(a|x)$: one using Bag-of-Words and one using a discriminator. Here, we explain both together with the implementation code.
In this approach, a set of keywords related to the attribute is prepared in advance. The attributes provided by the authors' implementation are "computers", "fantasy", "kitchen", "legal", "military", "politics", "positive_words", "religion", "science", and "space". For example, "science" consists of 48 words such as "astronomy", "atom", "biology", "cell", and "chemical".
Given the keyword set $\{w_1, \ldots, w_k\}$ for an attribute $a$, and the word distribution $p_{t+1}$ computed by the original model, $p(a \mid x_{t+1})$, which represents how plausible the output word $x_{t+1}$ is for the attribute $a$, is defined as
\begin{align}
\log p(a \mid x_{t+1}) = \log \left( \sum_{i}^{k} p_{t+1}[w_i] \right)
\end{align}
That is, the more likely the model is to output any of the attribute keywords at step $t+1$, the higher the attribute likelihood.
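As a minimal sketch of this definition (my own illustration, not the authors' code), assuming `probs` holds the distribution $p_{t+1}$ over the vocabulary and `bow_indices` holds the vocabulary ids of the keywords $w_1, \ldots, w_k$:

```python
import torch

def bow_log_likelihood(probs: torch.Tensor, bow_indices: list) -> torch.Tensor:
    """Sketch of log p(a | x_{t+1}) = log( sum_i p_{t+1}[w_i] ).

    probs       : tensor of shape (vocab_size,), the distribution p_{t+1}
    bow_indices : vocabulary ids of the attribute keywords
    """
    return torch.log(probs[bow_indices].sum())

# toy usage with a hypothetical 10-word vocabulary and keyword ids 2, 5, 7
probs = torch.softmax(torch.randn(10), dim=-1)
print(bow_log_likelihood(probs, [2, 5, 7]))
```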
The Bag-of-Words Attribute Model described above is simple, but it has a problem: some attributes are difficult to express with only a set of keywords. In such cases, the discriminator model described in this section is useful.
The attribute discriminator is a model that estimates the attribute from a representation of the generated words. Below is its implementation from run_pplm_discrim_train.py.
```python
# imports added here for completeness (the snippet follows the authors' code,
# which was written against an older transformers API); EPSILON is a small
# constant defined in the original file.
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

EPSILON = 1e-10


class Discriminator(torch.nn.Module):
    """Transformer encoder followed by a Classification Head"""

    def __init__(
            self,
            class_size,
            pretrained_model="gpt2-medium",
            cached_mode=False,
            device='cpu'
    ):
        super(Discriminator, self).__init__()
        self.tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model)
        self.encoder = GPT2LMHeadModel.from_pretrained(pretrained_model)
        self.embed_size = self.encoder.transformer.config.hidden_size
        self.classifier_head = ClassificationHead(
            class_size=class_size,
            embed_size=self.embed_size
        )
        self.cached_mode = cached_mode
        self.device = device

    def get_classifier(self):
        return self.classifier_head

    def train_custom(self):
        for param in self.encoder.parameters():
            param.requires_grad = False  # the pre-trained GPT-2 is frozen
        self.classifier_head.train()

    def avg_representation(self, x):
        # mask used to ignore the 0 tokens added by padding
        mask = x.ne(0).unsqueeze(2).repeat(
            1, 1, self.embed_size
        ).float().to(self.device).detach()
        hidden, _ = self.encoder.transformer(x)
        masked_hidden = hidden * mask
        avg_hidden = torch.sum(masked_hidden, dim=1) / (
            torch.sum(mask, dim=1).detach() + EPSILON
        )
        return avg_hidden

    def forward(self, x):
        if self.cached_mode:
            avg_hidden = x.to(self.device)
        else:
            avg_hidden = self.avg_representation(x.to(self.device))
        logits = self.classifier_head(avg_hidden)
        probs = F.log_softmax(logits, dim=-1)
        return probs


class ClassificationHead(torch.nn.Module):
    """Classification Head for transformer encoders"""

    def __init__(self, class_size, embed_size):
        super(ClassificationHead, self).__init__()
        self.class_size = class_size
        self.embed_size = embed_size
        self.mlp = torch.nn.Linear(embed_size, class_size)

    def forward(self, hidden_state):
        logits = self.mlp(hidden_state)
        return logits
```
The Discriminator class consists of two parts: a pre-trained model (GPT-2) and the ClassificationHead that is actually trained. The pre-trained model is never retrained, and when PPLM is executed only the trained ClassificationHead is used (as we will see in the next chapter). Let's look at the avg_representation method. First, the input x is passed through the pre-trained GPT-2:
hidden, _ = self.encoder.transformer(x)
This hidden is the distributed representation of each word of the input x. Because the input Tensor is batch-processed, it contains extra tokens added by padding, so a mask is applied to ignore them; the distributed representations at the padded positions become 0 (masked_hidden). Finally, avg_hidden is obtained by summing the word representations in each sentence and dividing by the number of non-padding words, i.e. averaging them. If you think of the words in a sentence as being combined this way, avg_hidden can be interpreted as a distributed representation of the sentence.

This avg_hidden is the input to the ClassificationHead. The ClassificationHead is a neural network with no hidden layer, only an input layer and an output layer. The number of input nodes equals the dimension of avg_hidden, and the number of output nodes equals the number of attribute classes. The output is the logits, to which a softmax is applied and then the logarithm taken (log_softmax):
logits = self.classifier_head(avg_hidden)
probs = F.log_softmax(logits, dim=-1)
The value of probs (= output_t) corresponding to the correct class (target_t), i.e. the negative log-likelihood, becomes the loss, and this loss is backpropagated for training.
```python
loss = F.nll_loss(output_t, target_t)
loss.backward(retain_graph=True)
optimizer.step()
```
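For reference, here is a simplified sketch of the training loop described above (my own summary, not the exact code from run_pplm_discrim_train.py; `train_loader` is a hypothetical DataLoader yielding batches of padded token ids and attribute labels):

```python
import torch
import torch.nn.functional as F

def train_epoch(discriminator, train_loader, device="cpu", lr=1e-4):
    # only the ClassificationHead is trained; the GPT-2 encoder stays frozen
    discriminator.train_custom()
    optimizer = torch.optim.Adam(discriminator.get_classifier().parameters(), lr=lr)

    for x, target_t in train_loader:           # x: padded token ids, target_t: class labels
        x, target_t = x.to(device), target_t.to(device)
        optimizer.zero_grad()
        output_t = discriminator(x)             # log-probabilities over attribute classes
        loss = F.nll_loss(output_t, target_t)   # negative log-likelihood of the correct class
        loss.backward()
        optimizer.step()
```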
After training the Attribute Model as in the previous chapter, the next step is to actually run PPLM. In fact, pre-trained Attribute Models are provided, so if you just want to check that it works you do not need to train one yourself; training is also unnecessary when using the Bag-of-Words Attribute Model. As mentioned in [Model Configuration](#Model Configuration), PPLM controls the output word by using $\tilde{H_t}$, an updated version of the original model's $H_t$. The update is performed as follows.
\begin{align}
\tilde{H_t} &= H_t + \Delta H_t \\
\Delta H_t & \leftarrow \Delta H_t + \alpha \frac{\nabla_{\Delta H_t} \log p(a | H_t + \Delta H_t)}{||\nabla_{\Delta H_t} \log p(a | H_t + \Delta H_t)||^{\gamma}}
\end{align}
$\alpha$ and $\gamma$ are hyperparameters. What is computed here is the update $\Delta H_t$ that increases the likelihood of the attribute $a$. $\Delta H_t$ itself is computed and updated several times before being added to $H_t$; this is repeated 3 to 10 times (the default in the implementation code is 3). The update is done in the perturb_past function of run_pplm.py. First, let's look at the implementation for the Attribute Model using Bag-of-Words.
```python
loss = 0.0
bow_logits = torch.mm(probs, torch.t(one_hot_bow))
bow_loss = -torch.log(torch.sum(bow_logits))
loss += bow_loss
```
Although some of the surrounding processing is omitted, probs corresponds to $p_{t+1}$, the likelihood distribution over all words in the language model's vocabulary. one_hot_bow represents each word in the attribute keyword set as a one-hot vector over that same vocabulary. bow_logits is the product of the two, i.e. the likelihood of each keyword, and its sum corresponds to the computation of $\sum_{i}^{k} p_{t+1}[w_i]$ described in [Attribute Model by Bag-of-Words](#Attribute Model by Bag-of-Words). The negative logarithm of this sum is bow_loss, which corresponds to $-\log p(a \mid H_t + \Delta H_t)$.
Next, the implementation for the Attribute Model using the discriminator is as follows.
```python
ce_loss = torch.nn.CrossEntropyLoss()
prediction = classifier(new_accumulated_hidden / (curr_length + 1 + horizon_length))
label = torch.tensor(prediction.shape[0] * [class_label],
                     device=device,
                     dtype=torch.long)
discrim_loss = ce_loss(prediction, label)
loss += discrim_loss
```
classifier is the ClassificationHead part of the discriminator trained in the previous chapter.

new_accumulated_hidden needs some explanation. The hidden states output from the final Transformer layer when the updated $\tilde{H_t}$ is given to GPT-2 are summed, in the same way as avg_hidden explained in [Attribute Model by discriminator](#Attribute Model by discriminator). The same computation is performed with the un-updated $H_t$, and the sum of the two is new_accumulated_hidden. This part is hard to follow without actually reading the code, but it is enough to understand that a hidden state is being fed into the classifier.

curr_length is the current number of input words to GPT-2, and the default value of horizon_length is 1 (the original implementation's explanation does not make clear what role horizon_length plays). The class_label used for label is given by the user when run_pplm.py is executed, for example a pre-assigned index for the positive class. The cross-entropy loss between this label and prediction is computed, and the resulting discrim_loss also corresponds to $-\log p(a \mid H_t + \Delta H_t)$.

The Bag-of-Words and discriminator Attribute Models can also be used together; in that case, bow_loss and discrim_loss are simply added:
```python
loss = 0.0
loss += bow_loss
loss += discrim_loss
```
The loss computed up to here only serves to bring the output closer to the specified attribute; updating $\Delta H_t$ with it alone can easily destroy the fluency of the generated text. PPLM therefore adds countermeasures to preserve fluency. The first is to add the KL-Divergence between the updated distribution $\tilde{p}_{t+1}$ and the original distribution $p_{t+1}$ to the loss:
```python
kl_loss = kl_scale * (
    (corrected_probs * (corrected_probs / unpert_probs).log()).sum()
)
loss += kl_loss
```
Here, corrected_probs is $\tilde{p}_{t+1}$ and unpert_probs is $p_{t+1}$. kl_scale is a hyperparameter and is basically set to 0.01. The computed kl_loss is added to bow_loss, discrim_loss, or their sum, and when $\Delta H_t$ is updated, all of these terms are moved together in the gradient direction.
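Putting the update rule and these loss terms together, the following is a heavily simplified sketch of the $\Delta H_t$ update (my own illustration; the actual perturb_past function in run_pplm.py additionally handles per-layer tensors, windowing, and the horizon term):

```python
import torch

def perturb_sketch(attribute_loss_fn, kl_loss_fn, delta_shape,
                   alpha=0.03, gamma=1.5, num_iterations=3):
    """Simplified Delta H_t update.

    attribute_loss_fn(delta) : returns -log p(a | H_t + delta)  (bow_loss and/or discrim_loss)
    kl_loss_fn(delta)        : returns the KL term between the updated and original p_{t+1}
    delta_shape              : shape of the tensor playing the role of Delta H_t
    """
    delta = torch.zeros(delta_shape)
    for _ in range(num_iterations):
        delta = delta.detach().requires_grad_(True)
        loss = attribute_loss_fn(delta) + kl_loss_fn(delta)
        loss.backward()
        # gradient ascent on log p(a | ...), i.e. descent on the loss, with a normalized step
        step = -alpha * delta.grad / (delta.grad.norm() ** gamma + 1e-10)
        delta = (delta + step).detach()
    return delta  # H~_t = H_t + delta
```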
The KL-Divergence term is the countermeasure built into the update step. The other countermeasure is applied when actually sampling a word: after computing $\tilde{p}_{t+1}$, words are sampled from a geometric mean of $\tilde{p}_{t+1}$ and the original $p_{t+1}$:
```python
pert_probs = ((pert_probs ** gm_scale) * (unpert_probs ** (1 - gm_scale)))
pert_probs = top_k_filter(pert_probs, k=top_k, probs=True)
if torch.sum(pert_probs) <= 1:
    pert_probs = pert_probs / torch.sum(pert_probs)

if sample:
    last = torch.multinomial(pert_probs, num_samples=1)
else:
    _, last = torch.topk(pert_probs, k=1, dim=-1)
```
The first line computes $\tilde{p}_{t+1}^{\,\gamma_{gm}} \, p_{t+1}^{\,1-\gamma_{gm}}$. The second line applies a filter that keeps only the $k$ most likely words, so that very unlikely words cannot be sampled. Lines 3-4 normalize the distribution (the division by the normalizing coefficient $\beta$ in the paper). Lines 6-7 sample a word according to the resulting likelihood distribution, while lines 8-9 greedily pick the word with maximum likelihood.
These two devices, updating $H_t$ and adjusting the sampling, are the core ideas of PPLM.
Here are some actual generation examples. The authors' implementation provides two example commands, so let's try them.
First, PPLM with Bag-of-Words. We use the following command provided by the authors.
python run_pplm.py -B military --cond_text "The potato" --length 50 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.03 --window_length 5 --kl_scale 0.01 --gm_scale 0.99 --colorama --sample
-B military specifies the Bag-of-Words model for the military attribute. Here are the results.
Unperturbed generated text <|endoftext|>The potato is probably the world's most widely eaten plant. But what if it's also the most dangerous?
In the last two decades, there's been a dramatic decrease in potato crop damage from crop rot and disease. The decline, which started in
Perturbed generated text 1 <|endoftext|>The potato-flour soup that is the best way to start a weekend! The following recipe is one of several that I have been working on over the past few months. I think it is the best of them. It uses all the elements of the
Perturbed generated text 2 <|endoftext|>The potato bomb and the anti-Semitic attack that killed four Jewish students at a Jewish school in France are the most recent examples of bomb threats targeting Israeli facilities. The latest bomb threat targeting a U.S. nuclear facility, the bomb was sent out over the
Perturbed generated text 3 <|endoftext|>The potato chip explosion has been a boon to the world's food industry since its release in late March. A handful of companies have already announced plans to produce chips using the chips, including Chipotle Mexican Grill Corp.'s parent company, Taco Bell Corp.'s
Perturbed generated text 4 <|endoftext|>The potato is a very popular and delicious vegetable in many countries, but it can also cause severe health problems for people. The health of your body depends on your diet. If your diet doesn't include enough protein to get through the meal, or if you are
Perturbed generated text 5 <|endoftext|>The potato plant, which is a member of the same family as wheat, can be found around the world. It's also used to make potato chips, bread, and other food products.
The Plant
The plant grows as a seed and produces
Perturbed generated text 6 <|endoftext|>The potato bomb has been a controversial weapon for years. The device is packed with bomb-like devices and packed on a bomb-filled potato bomb. It's a bomb that detonates in the bomb-packed potato bomb and explodes in the potato bomb. So
Perturbed generated text 7 <|endoftext|>The potato has a lot in common with the human earworm: The first, and only, time you hear it, you'll hear the sound of the potato in your ear as well.
It's the first sound you hear when your cat or dog
Perturbed generated text 8 <|endoftext|>The potato salad is coming to a restaurant near you!
The new restaurant, in the heart of downtown Chicago, will be named the Potato Salad.
A photo posted by @the_mike_barnes on Aug 7, 2016 at
Perturbed generated text 9 <|endoftext|>The potato is a staple in many people's diet, and it is an easy food to make in your home.
The best potato chips in the world are made by hand using only potatoes.
The potato is a staple in many people's diet
Perturbed generated text 10 <|endoftext|>The potato bomb is an improvised explosive device, typically containing one bomb and no more than 10 grams of explosive and containing no explosive material.
Bombardment of an aircraft aircraft, a tank truck or explosive device
Bombardment of an aircraft aircraft
Sentences are generated independently, as many as the num_samples specified at execution time (10 here). Words related to the attribute are highlighted in red; specifying --colorama at runtime colors them in the standard output. Let's look at the examples. First, the generation from the original, uncontrolled model contains no military elements. Among the controlled generations, the word "bomb" appears frequently in 2, 6, and 10, but the other examples do not give a particularly military impression. Generating military text starting from "The potato" may simply be a somewhat difficult combination.
Next, let's try the model that uses the attribute discriminator. We run the following command from the authors' implementation.
python run_pplm.py -D sentiment --class_label 2 --cond_text "My dog died" --length 50 --gamma 1.0 --num_iterations 10 --num_samples 10 --stepsize 0.04 --kl_scale 0.01 --gm_scale 0.95 --sample
Here are the results.
Unperturbed generated text <|endoftext|>My dog died in February, after suffering from severe arthritis. He had been suffering with a terrible cold that was causing his skin to break. I couldn't afford a replacement dog and couldn't afford to have him taken to the vet. I knew the vet would be
Perturbed generated text 1 <|endoftext|>My dog died of a heart attack at the age of 88, his son said, and her death has shocked and brought closure to the family. (Published Wednesday, March 12, 2017)
A mother who was found dead at home with a heart attack on
Perturbed generated text 2 <|endoftext|>My dog died from a rare and potentially deadly form of a rare form of sickle cell disease.
A rare form of sickle cell is called hemizygaly in the families.
The family is an important part of the game and it's
Perturbed generated text 3 <|endoftext|>My dog died after being shot.
A woman in the United States died after a man in his 20s opened fire at her home in North Carolina and injured several others.
On March 12 a neighbor heard a woman screaming. After she ran outside to
Perturbed generated text 4 <|endoftext|>My dog died of a heart attack, after suffering from a heart attack.
The title text of this page has a a a
of of the work and work in to be an in a way, that the idea of the idea to a
Perturbed generated text 5 <|endoftext|>My dog died from a rare form of cancer that was not known before.
The rare form of brain cancer called glioblastomatosis is more common in people of European descent. People of European descent are also at greater risk of glioma
Perturbed generated text 6 <|endoftext|>My dog died from anaphase and I don't know how to give birth to a child with a rare genetic condition, an important personal health gain, with health - "" " The " " " "'The'"'" The book " The word
Perturbed generated text 7 <|endoftext|>My dog died from a rare form of cancer, the Daily Mail reports. "I have a really strong desire to help others and so I am happy to have the chance to help others to be happy and to love their loved ones and that's something I love
Perturbed generated text 8 <|endoftext|>My dog died because I didn't let him go.
I have a 6-year-old, 3-year-old, 1-year-old, 2-year-old, and 2-year-old. I have a very active and
Perturbed generated text 9 <|endoftext|>My dog died of a heart attack while while while I was in the house. I had the old man's head and body, and a large one, I have my hands and feet with me. I have a good time, and the best, as I am
Perturbed generated text 10 <|endoftext|>My dog died from a rare form of cancer, scientists have found.... James M. He he is is is is a
A lot of a lot of a fun!! The Great Escape The Great Escape! The Great Escape! The Great Escape
Unlike Bag-of-Words, related words cannot be highlighted here. -D sentiment specifies the pre-trained "sentiment" discriminator, which discriminates between the two classes "very_positive" and "very_negative"; class_label=2, specified here, means "very_positive" (note that "very_negative" can be specified with class_label=3). In other words, this example tries to generate positive sentences from the prefix "My dog died", which would normally lead only to negative sentences. Looking at the examples, the generation from the original model without attribute control is pessimistic. Among the controlled generations, 2, 5, and 7 are relatively positive (or at least non-negative) sentences, with words such as "rare" and "love" standing out. In 1, 3, 4, 6, 8, 9, and 10, the negativity has not disappeared or the sentences are unnatural, which shows that generation becomes difficult when the beginning of the sentence and the attribute do not match.
The example above was a combination of sentence prefix and attribute that is hard to generate. Let's make the prefix and the attribute match a little better: the prefix "The potato" with the attribute positive_words.
python run_pplm.py -B positive_words --cond_text "The potato" --length 50 --gamma 1.5 --num_iterations 3 --num_samples 10 --stepsize 0.03 --window_length 5 --kl_scale 0.01 --gm_scale 0.99 --colorama --sample
Here are the results.
Unperturbed generated text <|endoftext|>The potato is probably the world's most widely eaten plant. But what if it's also the most dangerous?
In the last two decades, there's been a dramatic decrease in potato crop damage from crop rot and disease. The decline, which started in
Perturbed generated text 1 <|endoftext|>The potato-like, gluten-free, low-calorie, sweet, and nutritious sweet potato pie recipe. Easy to make, and perfect for those who love to eat sweet, healthy, and filling pie!
When my kids are home from school
Perturbed generated text 2 <|endoftext|>The potato has been a popular favorite since the 1980s. But with its recent popularity and rising popularity, is it time to eat your favorite potato again?
The potato is still a great food to enjoy and enjoy, with its healthy benefits and delicious flavor
Perturbed generated text 3 <|endoftext|>The potato chip craze is in full swing.
The popular snacks have been making the rounds in recent weeks as people seek out fresh and healthier alternatives to fried foods.
But there may have never been a better time to eat these crispy snacks than
Perturbed generated text 4 <|endoftext|>The potato is a very versatile, versatile vegetable and it is a great addition to many tasty salads, soups and stews.
The potato is the star of many salads and stirfries. I love the versatility of potatoes in many recipes.
Perturbed generated text 5 <|endoftext|>The potato is a common dish, so much so in fact that it is often served with pasta. It is often served with rice, or topped with a sweet and savoury sauce.
Serves 4
1 onion
2 cloves garlic
Perturbed generated text 6 <|endoftext|>The potato has become the new darling of American farmers in recent years. Its popularity is so great that it has even been featured in many successful television shows like "The Big Bang Theory".
But there has never been an easier way to prepare your favorite snack
Perturbed generated text 7 <|endoftext|>The potato is a favorite among the health-conscious, so what better time to try a new way to eat them? The recipe below is easy and healthy, and you can easily freeze it, freeze it for later, reheat it for breakfast or lunch,
Perturbed generated text 8 <|endoftext|>The potato salad that inspired the popular dish is one of a number of new varieties of the dish being sold at popular popular restaurants. (Photo: Thinkstock)
When it comes to classic American comfort food, a popular dish that's popular around the country
Perturbed generated text 9 <|endoftext|>The potato is a staple in many people's diet, and it is not only delicious in its own right, but is also a good protein source. It is easy to eat, nutritious, and healthy.
Potato, as we know it, originated
Perturbed generated text 10 <|endoftext|>The potato has been used as an ingredient in everything from salad dressing to soups for decades. However, it was once thought to be a poor performer in the kitchen. In recent years, scientists have shown potatoes to be a promising food source. The research shows
The first, uncontrolled generation states objective facts and does not give a particularly positive impression. The other sentences are mostly positive (although I am not sure what 2 is trying to say, and 3 reads a bit awkwardly with its "never ..." phrasing).
After all, it seems necessary to consider the combination of the sentence prefix and the attribute to some extent. The paper also notes that some attributes are more difficult to control than others.
In this experiment, we generated sentences of 50 words. The original GPT-2 takes only 2-3 seconds to generate one sentence, but the Bag-of-Words model takes about 22 seconds and the discriminator model about 95 seconds. Depending on the application, needing this long to produce a sentence of about 50 words is a bottleneck. Although there are very few parameters to train compared with the original model, there are several hyperparameters: $\alpha$ and $\gamma$ used when updating $\Delta H_t$, kl_scale for the KL-Divergence term, and $\gamma_{gm}$, which balances $p_{t+1}$ and $\tilde{p}_{t+1}$; because a single run takes so long, tuning them is costly. The long runtime is probably due to the repeated gradient computations performed when updating $H_t$; an ordinary network only propagates forward at generation time and has no such operation. This problem might be solved if one could build and train a network that outputs $\tilde{H_t}$ directly from $H_t$ without gradient computation.
The update of $H_t$, the main idea of this model, is limited to language models that use the Transformer as a decoder. If different architectures become mainstream in the future, this method may no longer be applicable.
In this article, we explained PPLM, a method for generating sentences that match specified attributes with a model that uses the Transformer as a decoder. The main idea of PPLM is to have an externally attached model recursively update the Keys and Values of the Transformer's self-attention in the direction that generates text with the specified attribute. This makes attribute control possible without retraining the original large model. The experiments show that while control generally works, generation becomes difficult when the sentence prefix and the attribute are incompatible, and the execution time is considerably longer than that of the original model.