The slide "I want to train an RNN that makes boke (joke captions)" (Chainer meetup 01) was very interesting, so I implemented it with Chainer myself. I also made an app that lets you try the trained model directly from the browser.
To be clear, what I did is basically a rehash of the original slide; the differences are that I experimented with a larger training set (it did not train well) and that I built a web application. That's about it.
The web application I made looks like this.
The code I wrote this time is publicly available on GitHub. Documentation may follow at some point.
There is research showing that you can generate a caption for an image by feeding image features extracted with a Convolutional Neural Network (CNN) into a Recurrent Neural Network (RNN). Well-known examples include [Karpathy+2015] and [Vinyals+2014].
(From [Vinyals+2014]) In this figure, an image is input, passed through a multi-layer CNN, and then fed into an LSTM (a type of RNN), which generates a description word by word. It is a simple model, but it is known to work surprisingly well; you can see actual generation examples here (this one is by Karpathy).
If you can **translate an input image into a "description"**, couldn't you also **translate an input image into a "boke"**? In other words, can a neural network make boke? Based on this intuition, I implemented these models with Chainer and built a "fully automatic boke-generating neural network".
For 1, it was easy to use a trained model provided for Caffe. This time I used CaffeNet, but other networks could probably be used as well. However, since I extract image features from the output of the fc7 layer (a fully connected layer), lighter models such as GoogLeNet, which lack such a layer, are (probably) not usable.
For 2, there is a wonderful web service called bokete, where users post funny captions (boke) for photos, so I crawled it and collected data with enthusiasm. As the author of the original slide mentioned, each boke page has the simple structure of "1 image + 1 text", so collecting the data is not that difficult.
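As a rough illustration of this collection step, here is a minimal sketch of such a crawler. The URL pattern and CSS selectors are placeholders I made up for illustration; the real bokete page structure almost certainly differs, and you should of course respect the site's terms and crawl politely.
crawler.py (sketch)
import time

import requests
from bs4 import BeautifulSoup

def fetch_boke(page_url):
    """Fetch one boke page and return (image_url, caption_text)."""
    res = requests.get(page_url)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "html.parser")
    img_tag = soup.select_one("img.photo")       # hypothetical selector
    text_tag = soup.select_one("div.boke-text")  # hypothetical selector
    return img_tag["src"], text_tag.get_text(strip=True)

if __name__ == "__main__":
    pairs = []
    for boke_id in range(1, 101):
        url = "https://bokete.jp/boke/{}".format(boke_id)  # hypothetical URL pattern
        try:
            pairs.append(fetch_boke(url))
        except Exception:
            continue      # skip deleted or missing pages
        time.sleep(1)     # be polite to the server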
Chainer is very convenient here because it can load Caffe's trained models. I defined the following method in the Model class and used its output.
Model.py
def encode_image(self, img_array):
    batchsize = img_array.shape[0]
    if self.config["use_caffenet"]:
        # run the image through CaffeNet and take the fc7 output as the feature
        img_x = chainer.Variable(img_array, volatile='on')
        y = self.enc_img(
            inputs={"data": img_x},
            outputs={"fc7"})[0]
    else:
        # debug path: use a random 4096-dim vector instead of CNN features
        x = self.xp.random.rand(batchsize, 4096).astype(np.float32)
        y = chainer.Variable(x)
    y.volatile = 'off'
    return self.img2x(y)
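For reference, the self.enc_img and self.img2x used above can be set up roughly as follows with Chainer's CaffeFunction. This is only a sketch; the caffemodel file name and the 100-dimensional projection size are assumptions on my part.
sketch
import chainer.links as L
from chainer.links.caffe import CaffeFunction

# load the pre-trained CaffeNet weights (parsing is slow the first time)
enc_img = CaffeFunction("bvlc_reference_caffenet.caffemodel")

# project the 4096-dim fc7 feature down to the decoder's input size
img2x = L.Linear(4096, 100)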
img_array holds the image file converted to an array with PIL, etc. For this part, I referred to "Reading caffemodel with Chainer and classifying images - Qiita".
The extracted image features are then fed into the LSTM. This is also defined in the Model class.
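For completeness, here is a minimal sketch of turning an image file into img_array. The 227x227 size, BGR channel order, and mean pixel values follow the usual CaffeNet convention; they are assumptions here, not code taken from the repository.
preprocess.py (sketch)
import numpy as np
from PIL import Image

# mean pixel values commonly used with CaffeNet, in BGR order (an assumption here)
MEAN_BGR = np.array([104.0, 117.0, 123.0], dtype=np.float32)

def load_image(path):
    """Return a (1, 3, 227, 227) float32 array in the layout CaffeNet expects."""
    img = Image.open(path).convert("RGB").resize((227, 227))
    arr = np.asarray(img, dtype=np.float32)[:, :, ::-1]  # RGB -> BGR
    arr -= MEAN_BGR                                       # subtract mean pixel
    arr = arr.transpose(2, 0, 1)                          # HWC -> CHW
    return arr[np.newaxis]                                # add batch dimension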
Model.py
def __call__(self, x, img_vec=None):
    if img_vec is not None:
        # add the image feature to the word embedding (only done at t = 0)
        h0 = self.embed_mat(x) + img_vec
    else:
        h0 = self.embed_mat(x)
    h1 = self.dec_lstm(h0)
    y = self.l1(h1)
    return y
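The layer names used above (embed_mat, img2x, dec_lstm, l1) suggest a chain roughly like the following. This is only a sketch of my reading of the code, not the actual class definition; vocab_size and the 100-dimensional unit size (mentioned in the experiment section below) are filled in as assumptions.
sketch
import chainer
import chainer.links as L

class BokeModel(chainer.Chain):
    """Sketch only: the decoder side of the boke generator as I understand it."""
    def __init__(self, vocab_size, n_units=100):
        super(BokeModel, self).__init__(
            embed_mat=L.EmbedID(vocab_size, n_units),  # word id -> embedding
            img2x=L.Linear(4096, n_units),             # fc7 feature -> LSTM input
            dec_lstm=L.LSTM(n_units, n_units),         # decoder LSTM
            l1=L.Linear(n_units, vocab_size),          # hidden state -> word scores
        )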
Actually, we want to input the image feature only at time t = 0, so the image vector is passed in only when n == 0 in the loss calculation.
Trainer.py
def _calc_loss(self, batch):
    boke, img = batch
    boke = self.xp.asarray(boke, dtype=np.int32)
    img = self.xp.asarray(img, dtype=np.float32)

    # 1. Put the vectorized image into the CNN and turn it into a feature vector
    img_vec = self.model.predictor.encode_image(img)

    # 2. Learn to decode the boke
    accum_loss = 0
    n = 0
    for curr_words, next_words in zip(boke.T, boke[:, 1:].T):
        if n == 0:
            accum_loss += self.model(curr_words, img_vec, next_words)
        else:
            accum_loss += self.model(curr_words, next_words)
        n += 1
    return accum_loss
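Note that self.model here appears to take the target words as its last argument and everything before it as predictor inputs. That is my assumption about the surrounding code, not something shown in the post; a minimal sketch of such a wrapper would look like this:
sketch
import chainer
import chainer.functions as F

class LossWrapper(chainer.Chain):
    """Sketch: pass all but the last argument to the predictor, use the last as the target."""
    def __init__(self, predictor):
        super(LossWrapper, self).__init__(predictor=predictor)

    def __call__(self, *args):
        inputs, target = args[:-1], args[-1]
        y = self.predictor(*inputs)
        return F.softmax_cross_entropy(y, target)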
You may be wondering, "Why not feed the image features in at every time step?" However, [Karpathy+2015] notes:
Note that we provide the image context vector b_v to the RNN only at the first iteration, which we found to work better than at each time step.
So it seems you get better output by providing the image vector only at t = 0 (I have not actually tried the alternative myself).
The original slide "I want to train an RNN that makes boke" (Chainer meetup 01) trained on 500 samples and mentioned wanting to try about 20,000 samples next. I wanted to work at that scale too, so I tried training on roughly 30,000 samples. (Word embeddings and the LSTM hidden layer are both 100-dimensional, batch size 16, with dropout, etc.)
However, the loss did not decrease well.
average_loss.log
"average_loss":[
10.557335326276828,
9.724091438064605,
9.051927699901125,
8.728849313754363,
8.36422316245738,
8.1049892753394,
7.999240087562069,
7.78314874008182,
7.821357278519156,
7.629313596859783
]
(By the way, this is the loss on the training data.)
Even after about 30 epochs, the loss had not come down much. I also tried generating a boke for an image at hand, but could not get a decent output...
This is probably because there were only about 2,000 source images behind the 30,000 boke. (Because of how the data is collected from bokete, multiple boke are attached to a single image.) With ten or more "correct answers" (translation targets) per image, the model parameters probably cannot converge in any single direction.
Since I at least wanted to confirm that the loss can be made to decrease, I ran an experiment on small-scale data (about 300 boke), arranged so that each boke corresponds to exactly one image:
average_loss.log
"average_loss": [
6.765932078588577,
1.7259380289486477,
0.7160143222127642,
0.3597904167004994,
0.1992428061507997
]
The loss certainly decreased this time. (Trained for a total of 100 epochs.)
From this, we can expect that even with large-scale data, the loss will decrease (i.e., the model can be trained) if there is a one-to-one correspondence between images and boke.
When I tried it on an image at hand, I got the following output. **(Is this even a boke...?)**
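For reference, the one-boke-per-image arrangement used in the small-scale experiment can be obtained with something as simple as the following; the (boke_text, image_id) pair layout is an assumption about how the scraped data is stored.
sketch
def one_boke_per_image(pairs):
    """Keep only the first boke seen for each image; pairs is a list of (boke_text, image_id)."""
    seen = set()
    filtered = []
    for boke_text, image_id in pairs:
        if image_id not in seen:
            seen.add(image_id)
            filtered.append((boke_text, image_id))
    return filtered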
Since I went to the trouble of training the model, it would be a shame not to be able to try it from the browser, so I made a web application. (The code is public.)
It looks like this. You can see statistics about the training data, and pressing the Generate button shows boke generated for the training / development data. The trained Chainer model is loaded behind the web application, and pressing a button in the browser fires the boke generation method.
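The server side boils down to "load the trained model once at startup, then call the generation method on each request". Below is a minimal sketch of that pattern; Flask, the endpoint layout, and the generate() method are assumptions for illustration and not necessarily what the published app uses.
app.py (sketch)
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real app the trained Chainer model is loaded once here at startup
# (e.g. with chainer.serializers.load_npz); a dummy model stands in below.
class DummyBokeModel(object):
    def generate(self, image_id):
        return "boke for image {}".format(image_id)

model = DummyBokeModel()

@app.route("/generate", methods=["POST"])
def generate():
    image_id = request.json["image_id"]
    return jsonify({"boke": model.generate(image_id)})

if __name__ == "__main__":
    app.run(port=5000)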
This time I used the image caption generation models of [Karpathy+2015] and [Vinyals+2014] to learn and generate boke, but I don't think this kind of model is the best fit for boke. Since the model is designed and evaluated on the assumption that **an image has essentially one correct description**, I don't think it suits boke data, where any answer counts as "correct" as long as it is funny. In fact, when I tried training with multiple correct answers (boke) per image, I ran into the problem that the loss would not decrease.
Also, even if the loss on the training data goes down, the loss on development data probably will not. (It would surely be necessary to narrow the input / output domains further, e.g. restricting the input images to ossan (middle-aged men), excluding fill-in-the-blank boke, and so on.)
In the first place, is the approach of "generating a boke for the input image" even appropriate? Isn't there a more straightforward approach? For example, intentionally attaching the description of a **completely different image** to the input image might already make for an interesting boke.
For example, something like this (shown as an image).
Do you really need a neural network to make interesting boke? It is a genuinely tricky question.