[PYTHON] Image caption generation with Chainer

Overview

I implemented image caption generation using Chainer. Given an input image, it generates a description of the image. The source code is available at https://github.com/dsanno/chainer-image-caption

I used the algorithm from the following paper: Show and Tell: A Neural Image Caption Generator

Caption generation has already been implemented in Chainer by others, so I also referred to the following post: Image caption generation by CNN and LSTM ~ Satoshi's Blog from Bloomington

Caption generation model

image_caption_model.png

The caption generation model used in the paper is roughly divided into three networks.

- ${\rm CNN}$, which converts an image to a vector. For ${\rm CNN}$, use an existing image classification model such as GoogleNet or VGG_ILSVRC_19_layers.
- Word embedding $W_e$.
- ${\rm LSTM}$, which takes a vector as input and outputs the probability of occurrence of the next word.

Model used for implementation

Implementing the model exactly as described in the paper ran out of GPU memory, so I modified it as follows.

- ${\rm CNN}$, which converts an image to a vector (input: 224 x 224 x 3 dimensions, output: 4096 dimensions)
- Matrix $W_I$, which converts the image feature vector to the input of ${\rm LSTM}$ (input: 4096 dimensions, output: 512 dimensions)
- Word embedding $W_e$, which converts a word to a vector (input: word ID, output: 512 dimensions)
- ${\rm LSTM}$ (input: 512 dimensions, output: 512 dimensions)
- Matrix $W_w$, which converts the output of ${\rm LSTM}$ to word occurrence probabilities (input: 512 dimensions, output: one dimension per vocabulary word)
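To make the dimensions listed above concrete, here is a minimal NumPy sketch that only traces the vector shapes through $W_I$, $W_e$, and $W_w$. It is not the Chainer implementation: the weights are random, the vocabulary size `VOCAB_SIZE` is a made-up value, and a `tanh` stands in for the real ${\rm LSTM}$.

```python
import numpy as np

rng = np.random.default_rng(0)

FEATURE_DIM = 4096   # CNN output size stated in the text
HIDDEN_DIM = 512     # LSTM input/output size stated in the text
VOCAB_SIZE = 1000    # hypothetical; the real value depends on the training data

# Randomly initialised stand-ins for the learned parameters.
W_I = rng.normal(scale=0.01, size=(HIDDEN_DIM, FEATURE_DIM))  # image feature -> LSTM input
W_e = rng.normal(scale=0.01, size=(VOCAB_SIZE, HIDDEN_DIM))   # word ID -> embedding
W_w = rng.normal(scale=0.01, size=(VOCAB_SIZE, HIDDEN_DIM))   # LSTM output -> word scores


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


image_feature = rng.normal(size=FEATURE_DIM)   # stand-in for the CNN output
lstm_input_from_image = W_I @ image_feature    # shape (512,)
lstm_input_from_word = W_e[3]                  # embedding of word ID 3, shape (512,)

# A real LSTM would go here; tanh is only a placeholder for its 512-d output.
lstm_output = np.tanh(lstm_input_from_word)
word_probs = softmax(W_w @ lstm_output)        # shape (VOCAB_SIZE,)

print(lstm_input_from_image.shape, lstm_input_from_word.shape, word_probs.shape)
```

Both the image feature and each word end up as 512-dimensional vectors, which is what lets the same ${\rm LSTM}$ consume the image first and the words afterwards.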

The explanation below follows the model in the paper, but it should not be difficult to map it onto the model actually used.

Model learning

The parameters to be learned are those of $W_e$ and ${\rm LSTM}$. ${\rm CNN}$ uses its pretrained parameters as they are.

The training data consist of an image $I$ and a word sequence $\{S_t\}$ $(t = 0 \dots N)$, where $S_0$ is the sentence start symbol \<S\> and $S_N$ is the end symbol \</S\>. Training proceeds as follows.

  1. Input the image $I$ into ${\rm CNN}$ and extract the output of a specific intermediate layer as a feature vector.
  2. Input the feature vector into ${\rm LSTM}$.
  3. Input $S_t$ in order from $t = 0$ to $N-1$ and obtain $p_{t+1}$ at each step.
  4. Minimize the cost computed from $p_{t+1}(S_{t+1})$, the probability of outputting $S_{t+1}$.
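Steps 3 and 4 amount to pairing each word with the word that follows it. The following sketch shows that pairing for one caption; the helper name `make_training_pairs` and the sample caption are illustrative and not taken from the original code.

```python
def make_training_pairs(words):
    """Return (input, target) pairs (S_t, S_{t+1}) for t = 0 .. N-1."""
    # S_0 is the start symbol and S_N is the end symbol.
    sequence = ["<S>"] + list(words) + ["</S>"]
    return list(zip(sequence[:-1], sequence[1:]))


pairs = make_training_pairs(["a", "clock", "on", "a", "building"])
for current_word, next_word in pairs:
    print(f"input {current_word!r:12} -> target {next_word!r}")
```

At each step the model sees the input word and is trained to assign a high probability to the target word.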

The paper uses the negative log-likelihood as the cost function:

$$ L(I,S) = -\sum_{t=1}^{N} \log p_t(S_t) $$

In my implementation I used softmax cross entropy. The paper updates the parameters with SGD without momentum, but I used Adam with the hyperparameters recommended in the Adam paper. I also tried the log-likelihood with SGD, but the only effect I observed was slower convergence, so I do not understand why the paper adopts it. I used dropout as in the paper. The paper also mentions "ensembling models", but I did not implement this because I did not know the specific method.
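For a single word, the softmax cross entropy between the output scores and the correct word is the negative log of the softmax probability of that word, which is the same per-word term that appears in $L(I,S)$. The pure-Python sketch below computes the summed cost for one caption from made-up scores. How a particular library sums or averages the per-word terms is not covered here, and the scores and targets are hypothetical.

```python
import math


def log_softmax(scores):
    """Log of the softmax probabilities, computed stably."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def caption_nll(step_scores, target_ids):
    """Sum of -log p_t(S_t) over the steps of one caption."""
    total = 0.0
    for scores, target in zip(step_scores, target_ids):
        total -= log_softmax(scores)[target]
    return total


# Hypothetical unnormalised scores over a 3-word vocabulary for two steps.
step_scores = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
target_ids = [0, 2]
loss = caption_nll(step_scores, target_ids)
print(loss)
```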

Caption generation

When generating a caption with a trained model, the word occurrence probabilities are computed in order from the first word as shown below, and the word sequence with the highest product of word occurrence probabilities is used as the caption.

  1. Input the image into ${\rm CNN}$ and extract the output of a specific intermediate layer as a feature vector.
  2. Input the feature vector into ${\rm LSTM}$.
  3. Convert the sentence start symbol \<S\> to a vector using $W_e$ and input it into ${\rm LSTM}$.
  4. The output of ${\rm LSTM}$ gives the word occurrence probabilities; select the top $M$ words.
  5. Convert each word output in the previous step to a vector using $W_e$ and input it into ${\rm LSTM}$.
  6. From the output of ${\rm LSTM}$, compute the product of the probabilities of the words output so far, and select the top $M$ word sequences.
  7. Repeat steps 5 and 6 until the output word is the end symbol \</S\>.

In this implementation, $M = 20$.
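The procedure above is a beam search with beam width $M$. Below is a minimal pure-Python sketch of it, not the code from the repository. The callable `next_word_probs` is a hypothetical stand-in for running ${\rm LSTM}$ on the words so far, the toy bigram table replaces a trained model, and the stopping rule (stop once every kept sequence has ended, or after `max_length` steps) is an assumption.

```python
import math


def beam_search(next_word_probs, beam_width, max_length=20):
    """Return finished captions as (log-probability, words), best first."""
    beams = [(0.0, ["<S>"])]   # (sum of log-probabilities, words so far)
    finished = []
    for _ in range(max_length):
        candidates = []
        for log_prob, words in beams:
            for word, prob in next_word_probs(words).items():
                if prob > 0.0:
                    candidates.append((log_prob + math.log(prob), words + [word]))
        # Keep the top `beam_width` sequences by product of probabilities
        # (equivalently, by sum of log-probabilities).
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for log_prob, words in candidates[:beam_width]:
            if words[-1] == "</S>":
                finished.append((log_prob, words))
            else:
                beams.append((log_prob, words))
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished


# Toy next-word probabilities that depend only on the previous word.
TOY_BIGRAMS = {
    "<S>": {"a": 1.0},
    "a": {"clock": 0.6, "cat": 0.4},
    "clock": {"</S>": 0.5, "tower": 0.5},
    "cat": {"</S>": 1.0},
    "tower": {"</S>": 1.0},
}


def toy_next_word_probs(words):
    return TOY_BIGRAMS[words[-1]]


results = beam_search(toy_next_word_probs, beam_width=2)
best_log_prob, best_words = results[0]
print(" ".join(best_words), math.exp(best_log_prob))
```

In this toy table the most probable first word ("clock", 0.6) leads to a full caption with probability 0.3, while "cat" leads to one with probability 0.4. Keeping more than one sequence at each step lets the search find the latter.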

Training data

For training data, I used the annotated image dataset of MSCOCO. However, instead of the data distributed by MSCOCO itself, I used the data distributed on the site below. That data consists of feature vectors extracted from the images with VGG_ILSVRC_19_layers and the annotation word sequences. Using it saved the effort of extracting feature vectors from the images and of preprocessing the annotations (splitting sentences into words).

Deep Visual-Semantic Alignments for Generating Image Descriptions

According to the following post, MSCOCO's annotation data seems difficult to handle because of inconsistent notation (sentences start with either uppercase or lowercase letters, and may or may not end with a period).

I summarized the Microsoft COCO (MS COCO) dataset-I can have 3 cups of rice on the topic of artificial intelligence

Of the words in the training data, only those that appear 5 times or more were used; all others were treated as unknown words during training.
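The frequency cutoff can be sketched as follows. This is an illustration rather than the preprocessing code actually used: the token name `<unk>`, the ID assignment order, and the sample captions are all assumptions.

```python
from collections import Counter


def build_vocabulary(captions, min_count=5):
    """Map words seen at least `min_count` times to integer IDs."""
    counts = Counter(word for caption in captions for word in caption)
    vocabulary = {"<S>": 0, "</S>": 1, "<unk>": 2}
    for word, count in sorted(counts.items()):
        if count >= min_count:
            vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(caption, vocabulary):
    """Replace each word by its ID, using the unknown-word ID for rare words."""
    unk = vocabulary["<unk>"]
    return [vocabulary.get(word, unk) for word in caption]


# "a" appears 6 times, "clock" 5 times, "zebra" only once.
captions = [["a", "clock"]] * 5 + [["a", "zebra"]]
vocabulary = build_vocabulary(captions)
encoded = encode(["a", "zebra"], vocabulary)
print(vocabulary, encoded)
```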

Evaluation of generated captions

There are metrics such as BLEU, METEOR, and CIDEr for evaluating generated captions, but I did not compute them this time.
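As a rough idea of what such a metric measures, the sketch below computes the clipped n-gram precision that BLEU is built on. It is only one ingredient: full BLEU also combines several n-gram orders and applies a brevity penalty, and in practice an existing evaluation tool would be used. The reference caption here is made up for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous n-grams of a word list, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams that also occur in a reference.

    Each n-gram is credited at most as many times as it appears in
    any single reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "a man riding a skateboard down a street".split()
references = ["a police officer standing in a street".split()]  # hypothetical reference
unigram_precision = clipped_precision(candidate, references, 1)
bigram_precision = clipped_precision(candidate, references, 2)
print(unigram_precision, bigram_precision)
```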

Caption generation example

Captions were generated for public domain images downloaded from PublicDomainPictures.net. The top 5 generated captions are shown for each image.

clock

```
a clock on the side of a building
a clock that is on the side of a building
a clock on the side of a brick building
a close up of a street sign on a pole
a clock that is on top of a building
```

Police officer (directing traffic?)

```
a man riding a skateboard down a street
a man riding a skateboard down a road
a man riding a skateboard down the street
a man riding a skateboard down a sidewalk
a man riding a skateboard down the side of a road
```

A skateboard...?

Woman with a tennis racket

```
a woman holding a tennis racquet on a tennis court
a man holding a tennis racquet on a tennis court
a woman holding a tennis racquet on a court
a woman holding a tennis racquet on top of a tennis court
a man holding a tennis racquet on a court
```

The second and fifth say "man", but confusing a man with a woman happens sometimes.

Living room sofa

```
a cat laying on top of a bed
a cat sitting on top of a bed
a cat sitting on top of a couch
a black and white cat laying on a bed
a cat laying on a bed in a room
```

It seems that the cushion is mistaken for a cat.

Some were generated correctly, while others were clearly wrong.
