The slide "I want to train an RNN that makes boke (joke captions)" (Chainer meetup 01) was very interesting, so I implemented it with Chainer myself. I also made an app that lets you try the trained model directly from the browser.
To be clear, what I did is basically a rehash of the original slide; the differences are that I experimented with a larger training set (it did not train well) and that I built a web application. That's about it.
The web application I made looks like this.
The code I wrote this time is publicly available on GitHub. Documentation may follow at some point.
There is research showing that you can generate a caption for an image by feeding image features extracted with a Convolutional Neural Network (CNN) into a Recurrent Neural Network (RNN). Well-known examples include [Karpathy+2015] and [Vinyals+2014].
(From [Vinyals+2014]) In this figure, an image is input, passed through a multi-layer CNN, and then fed into an LSTM (a type of RNN), which generates a description word by word. It is a simple model, but it is known to work surprisingly well; you can see actual generation examples here (this one is by Karpathy).
If you can **translate an input image into a "description"**, couldn't you also **translate an input image into a "boke"**? In other words, can a neural network make boke? Based on this intuition, I implemented these models with Chainer and built a "fully automatic boke-generating neural network".
For 1, it was easy to use a trained model provided for Caffe. This time I used CaffeNet, but other networks could probably be used as well. However, since I extract image features from the output of the fc7 layer (a fully connected layer), lighter models such as GoogLeNet, which lack such a layer, are (probably) not usable.
For 2, there is a wonderful web service called bokete, where users post funny captions (boke) for photos, so I crawled it and collected data with enthusiasm. As the author of the original slide mentioned, each boke page has the simple structure of "1 image + 1 text", so collecting the data is not that difficult.
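As a rough illustration of this collection step, here is a minimal sketch of such a crawler. The URL pattern and CSS selectors are placeholders I made up for illustration; the real bokete page structure almost certainly differs, and you should of course respect the site's terms and crawl politely.
crawler.py (sketch)
import time

import requests
from bs4 import BeautifulSoup

def fetch_boke(page_url):
    """Fetch one boke page and return (image_url, caption_text)."""
    res = requests.get(page_url)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "html.parser")
    img_tag = soup.select_one("img.photo")       # hypothetical selector
    text_tag = soup.select_one("div.boke-text")  # hypothetical selector
    return img_tag["src"], text_tag.get_text(strip=True)

if __name__ == "__main__":
    pairs = []
    for boke_id in range(1, 101):
        url = "https://bokete.jp/boke/{}".format(boke_id)  # hypothetical URL pattern
        try:
            pairs.append(fetch_boke(url))
        except Exception:
            continue      # skip deleted or missing pages
        time.sleep(1)     # be polite to the server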
Chainer is very convenient here because it can load Caffe's trained models. I defined the following method in the Model class and used its output.
Model.py
def encode_image(self, img_array):
    batchsize = img_array.shape[0]
    if self.config["use_caffenet"]:
        # run the image through CaffeNet and take the fc7 output as the feature
        img_x = chainer.Variable(img_array, volatile='on')
        y = self.enc_img(
            inputs={"data": img_x},
            outputs={"fc7"})[0]
    else:
        # debug path: use a random 4096-dim vector instead of CNN features
        x = self.xp.random.rand(batchsize, 4096).astype(np.float32)
        y = chainer.Variable(x)
    y.volatile = 'off'
    return self.img2x(y)
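For reference, the self.enc_img and self.img2x used above can be set up roughly as follows with Chainer's CaffeFunction. This is only a sketch; the caffemodel file name and the 100-dimensional projection size are assumptions on my part.
sketch
import chainer.links as L
from chainer.links.caffe import CaffeFunction

# load the pre-trained CaffeNet weights (parsing is slow the first time)
enc_img = CaffeFunction("bvlc_reference_caffenet.caffemodel")

# project the 4096-dim fc7 feature down to the decoder's input size
img2x = L.Linear(4096, 100)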
img_array holds the image file converted to an array with PIL, etc. For this part, I referred to "Reading caffemodel with Chainer and classifying images - Qiita".
The extracted image features are then fed into the LSTM. This is also defined in the Model class.
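For completeness, here is a minimal sketch of turning an image file into img_array. The 227x227 size, BGR channel order, and mean pixel values follow the usual CaffeNet convention; they are assumptions here, not code taken from the repository.
preprocess.py (sketch)
import numpy as np
from PIL import Image

# mean pixel values commonly used with CaffeNet, in BGR order (an assumption here)
MEAN_BGR = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def load_image(path):
    """Return a (1, 3, 227, 227) float32 array in the layout CaffeNet expects."""
    img = Image.open(path).convert("RGB").resize((227, 227))
    arr = np.asarray(img, dtype=np.float32)[:, :, ::-1]  # RGB -> BGR
    arr -= MEAN_BGR                                       # subtract mean pixel
    arr = arr.transpose(2, 0, 1)                          # HWC -> CHW
    return arr[np.newaxis]                                # add batch dimension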
Model.py
def __call__(self, x, img_vec=None):
    if img_vec is not None:
        # add the image feature to the word embedding (only done at t = 0)
        h0 = self.embed_mat(x) + img_vec
    else:
        h0 = self.embed_mat(x)
    h1 = self.dec_lstm(h0)
    y = self.l1(h1)
    return y
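The layer names used above (embed_mat, img2x, dec_lstm, l1) suggest a chain roughly like the following. This is only a sketch of my reading of the code, not the actual class definition; vocab_size and the 100-dimensional unit size (mentioned in the experiment section below) are filled in as assumptions.
sketch
import chainer
import chainer.links as L

class BokeModel(chainer.Chain):
    """Sketch only: the decoder side of the boke generator as I understand it."""
    def __init__(self, vocab_size, n_units=100):
        super(BokeModel, self).__init__(
            embed_mat=L.EmbedID(vocab_size, n_units),  # word id -> embedding
            img2x=L.Linear(4096, n_units),             # fc7 feature -> LSTM input
            dec_lstm=L.LSTM(n_units, n_units),         # decoder LSTM
            l1=L.Linear(n_units, vocab_size),          # hidden state -> word scores
        )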
Actually, we want to input the image feature only at time t = 0, so the image vector is passed in only when n == 0 in the loss calculation.
Trainer.py
def _calc_loss(self, batch):
    boke, img = batch
    boke = self.xp.asarray(boke, dtype=np.int32)
    img = self.xp.asarray(img, dtype=np.float32)

    # 1. Put the vectorized image into the CNN and turn it into a feature vector
    img_vec = self.model.predictor.encode_image(img)

    # 2. Learn to decode the boke
    accum_loss = 0
    n = 0
    for curr_words, next_words in zip(boke.T, boke[:, 1:].T):
        if n == 0:
            accum_loss += self.model(curr_words, img_vec, next_words)
        else:
            accum_loss += self.model(curr_words, next_words)
        n += 1
    return accum_loss
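Note that self.model here appears to take the target words as its last argument and everything before it as predictor inputs. That is my assumption about the surrounding code, not something shown in the post; a minimal sketch of such a wrapper would look like this:
sketch
import chainer
import chainer.functions as F

class LossWrapper(chainer.Chain):
    """Sketch: pass all but the last argument to the predictor, use the last as the target."""
    def __init__(self, predictor):
        super(LossWrapper, self).__init__(predictor=predictor)

    def __call__(self, *args):
        inputs, target = args[:-1], args[-1]
        y = self.predictor(*inputs)
        return F.softmax_cross_entropy(y, target)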
You may be wondering, "Why not feed the image features in at every time step?" However, [Karpathy+2015] notes:
Note that we provide the image context vector b_v to the RNN only at the first iteration, which we found to work better than at each time step.
So it seems you get better output by providing the image vector only at t = 0 (I have not actually tried the alternative myself).
The original slide "I want to train an RNN that makes boke" (Chainer meetup 01) trained on 500 samples and mentioned wanting to try about 20,000 samples next. I wanted to work at that scale too, so I tried training on roughly 30,000 samples. (Word embeddings and the LSTM hidden layer are both 100-dimensional, batch size 16, with dropout, etc.)
However, the loss did not decrease well.
average_loss.log
"average_loss":[
10.557335326276828,
9.724091438064605,
9.051927699901125,
8.728849313754363,
8.36422316245738,
8.1049892753394,
7.999240087562069,
7.78314874008182,
7.821357278519156,
7.629313596859783
]
(By the way, this is the loss on the training data.)
Even after about 30 epochs, the loss had not come down much. I also tried generating a boke for an image at hand, but could not get a decent output...
This is probably because there were only about 2,000 source images behind the 30,000 boke. (Because of how the data is collected from bokete, multiple boke are attached to a single image.) With ten or more "correct answers" (translation targets) per image, the model parameters probably cannot converge in any single direction.
Since I at least wanted to confirm that the loss can be made to decrease, I ran an experiment on small-scale data (about 300 boke), arranged so that each boke corresponds to exactly one image:
average_loss.log
"average_loss": [
6.765932078588577,
1.7259380289486477,
0.7160143222127642,
0.3597904167004994,
0.1992428061507997
]
The loss certainly decreased this time. (Trained for a total of 100 epochs.)
From this, we can expect that even with large-scale data, the loss will decrease (i.e., the model can be trained) if there is a one-to-one correspondence between images and boke.
When I tried it on an image at hand, I got the following output. **(Is this even a boke...?)**
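For reference, the one-boke-per-image arrangement used in the small-scale experiment can be obtained with something as simple as the following; the (boke_text, image_id) pair layout is an assumption about how the scraped data is stored.
sketch
def one_boke_per_image(pairs):
    """Keep only the first boke seen for each image; pairs is a list of (boke_text, image_id)."""
    seen = set()
    filtered = []
    for boke_text, image_id in pairs:
        if image_id not in seen:
            seen.add(image_id)
            filtered.append((boke_text, image_id))
    return filtered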
Since I went to the trouble of training the model, it would be a shame not to be able to try it from the browser, so I made a web application. (The code is public.)
It looks like this. You can see statistics about the training data, and pressing the Generate button shows boke generated for the training / development data. The trained Chainer model is loaded behind the web application, and pressing a button in the browser fires the boke generation method.
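The server side boils down to "load the trained model once at startup, then call the generation method on each request". Below is a minimal sketch of that pattern; Flask, the endpoint layout, and the generate() method are assumptions for illustration and not necessarily what the published app uses.
app.py (sketch)
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real app the trained Chainer model is loaded once here at startup
# (e.g. with chainer.serializers.load_npz); a dummy model stands in below.
class DummyBokeModel(object):
    def generate(self, image_id):
        return "boke for image {}".format(image_id)

model = DummyBokeModel()

@app.route("/generate", methods=["POST"])
def generate():
    image_id = request.json["image_id"]
    return jsonify({"boke": model.generate(image_id)})

if __name__ == "__main__":
    app.run(port=5000)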
This time I used the image caption generation models of [Karpathy+2015] and [Vinyals+2014] to learn and generate boke, but I don't think this kind of model is the best fit for boke. Since the model is designed and evaluated on the assumption that **an image has essentially one correct description**, I don't think it suits boke data, where any answer counts as "correct" as long as it is funny. In fact, when I tried training with multiple correct answers (boke) per image, I ran into the problem that the loss would not decrease.
Also, even if the loss on the training data goes down, the loss on development data probably will not. (It would surely be necessary to narrow the input / output domains further, e.g. restricting the input images to ossan (middle-aged men), excluding fill-in-the-blank boke, and so on.)
In the first place, is the approach of "generating a boke for the input image" even appropriate? Isn't there a more straightforward approach? For example, intentionally attaching the description of a **completely different image** to the input image might already make for an interesting boke.
For example, something like this (shown as an image).
Do you really need a neural network to make interesting boke? It is a genuinely tricky question.