Normal caption generation example

If you feed in an image of horses, you get captions like the following. The model appears to see two horse-like objects. The probability of a sentence is computed from the probabilities of its individual words, and the three most probable sentences are displayed. The smaller the number on the left, the better the sentence is judged to fit the image. (Specifically, the number is the negated sum of the log-softmax values of the words, divided by the number of words.)
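The score described above (negated sum of per-word log-softmax values, divided by the word count) can be sketched in a few lines. This is only an illustration: the function names `log_softmax` and `caption_score`, and the toy logits, are assumptions made for the example and are not taken from the original code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def caption_score(step_logits, word_ids):
    """Negated mean log-probability of the chosen words (lower = better fit)."""
    if len(step_logits) != len(word_ids) or not word_ids:
        raise ValueError("need one logit vector per word and at least one word")
    total = sum(log_softmax(logits)[w] for logits, w in zip(step_logits, word_ids))
    return -total / len(word_ids)

# Toy data (made up): 3 time steps over a 4-word vocabulary.
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
uncertain = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
words = [0, 1, 2]

print(caption_score(confident, words))  # close to 0: the model is sure of each word
print(caption_score(uncertain, words))  # log(4): every word is equally likely
```

Dividing by the number of words keeps long and short captions comparable, which is presumably why the score is normalized this way.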

If you instead feed in an image with random pixel values, the following sentences are generated. Sentences still come out, but the numbers are large, meaning the model cannot tell what is in the image.

Fooling image generation result

For the time being, the generation worked well.

Two images were generated. Neither is recognizable to a human, yet the model assigns a high probability to a sentence about horses (that is, the number is smaller than in the previous example).

Top: direct encoding, where each pixel of the image is itself a gene.
Bottom: indirect encoding, where the pixels are correlated with one another.

In the paper, the indirect encoding produced beautiful patterns that were even exhibited as art, but simply building an NN to introduce correlation between pixels did not reproduce that here. (Maybe it was too good)
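To make the difference between the two encodings concrete, here is a minimal sketch. The direct decoder just reshapes the genome into pixels. The indirect decoder is an assumption for illustration only: a tiny coordinate network (sine hidden units, `HIDDEN = 4`, grayscale output) that maps each pixel position to a value, so neighbouring pixels end up correlated. It is not the network used in the post or in the paper.

```python
import math
import random

def decode_direct(genome, width, height):
    """Direct encoding: each gene is one pixel value in [0, 1]."""
    if len(genome) != width * height:
        raise ValueError("genome length must equal width * height")
    return [genome[r * width:(r + 1) * width] for r in range(height)]

HIDDEN = 4  # hidden units in the toy coordinate network (arbitrary choice)

def decode_indirect(genome, width, height):
    """Indirect encoding: genes are weights of a tiny network (x, y) -> pixel."""
    if len(genome) != 4 * HIDDEN:  # per hidden unit: wx, wy, bias, output weight
        raise ValueError("genome length must equal 4 * HIDDEN")
    image = []
    for r in range(height):
        y = 2.0 * r / max(height - 1, 1) - 1.0  # row mapped to [-1, 1]
        row = []
        for c in range(width):
            x = 2.0 * c / max(width - 1, 1) - 1.0  # column mapped to [-1, 1]
            out = 0.0
            for h in range(HIDDEN):
                wx, wy, b, wo = genome[4 * h:4 * h + 4]
                out += wo * math.sin(wx * x + wy * y + b)
            row.append(0.5 * (math.tanh(out) + 1.0))  # squash to [0, 1]
        image.append(row)
    return image

rng = random.Random(0)
direct_img = decode_direct([rng.random() for _ in range(8 * 8)], 8, 8)
indirect_img = decode_indirect([rng.gauss(0.0, 1.0) for _ in range(4 * HIDDEN)], 8, 8)
```

With the indirect decoder, mutating a single gene changes the whole image smoothly, whereas with the direct decoder it changes exactly one pixel.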

How it was done

The image was evolved so that the probability of generating one specific sentence becomes high. The top sentence generated in the first example, "a couple of horses are standing in a field", was selected as the target, and the image was evolved to raise the probability of generating it. In each generation, eight new individuals were created and the eight best individuals were kept. With direct encoding, this produced the result above in about 300 generations.
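The loop described above can be sketched as follows. Several details are assumptions, since the post does not state them: survivors are chosen from parents and children together, mutation is Gaussian noise clipped to [0, 1], and `toy_score` is a hypothetical stand-in for the caption score of the target sentence (the real fitness would run the caption model on the decoded image).

```python
import random

def evolve(fitness, genome_len, n_parents=8, n_children=8, generations=300,
           sigma=0.1, seed=0):
    """Elitist evolution loop: lower fitness is better, genes stay in [0, 1]."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(genome_len)]
                  for _ in range(n_parents)]
    population.sort(key=fitness)
    for _ in range(generations):
        children = []
        for _ in range(n_children):
            parent = rng.choice(population)
            # Gaussian mutation, clipped to the valid pixel range (assumed operator).
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in parent]
            children.append(child)
        # Keep the best n_parents individuals among parents and children.
        population = sorted(population + children, key=fitness)[:n_parents]
    return population[0]

# Hypothetical stand-in for the caption score of the target sentence.
TARGET = [0.5] * 16

def toy_score(genome):
    return sum((g - t) ** 2 for g, t in zip(genome, TARGET)) / len(genome)

best = evolve(toy_score, genome_len=16)
print(toy_score(best))
```

This sketch re-evaluates the fitness of surviving parents every generation. With a real caption model as the fitness function, caching each individual's score would avoid repeated forward passes.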

About the generative model

This time, the fooling images were generated against the caption generation model Show, Attend and Tell. The model's BLEU scores on COCO were 0.689 / 0.503 / 0.359 / 0.255.

Summary

Using an evolutionary algorithm, it was possible to generate fooling images that raise the probability that the caption generation model produces a given sentence. Possible next steps, if the mood strikes, are to check whether these images also fool other models trained on the same CNN, and to evolve an image against multiple sentences.

[PYTHON] Created a fooling image for the caption generative model
