[PYTHON] Noise2Noise commentary

Explanation of Noise2Noise (how to train a denoising network using only noisy images)

This article explains and summarizes the denoising paper Noise2Noise: Learning Image Restoration without Clean Data. It is a 2018 paper that had a large impact, so an introductory article may seem overdue, but while Qiita already had an article about an implementation, there was none explaining the content itself, so I am leaving this as a memorandum. It is written with an emphasis on readability so that anyone can follow it.

Overview

In recent years, deep-learning approaches that restore a correct signal from a corrupted one (super-resolution, JPEG artifact removal, colorization, etc.) have been highly successful. However, these approaches require a dataset that pairs clean and corrupted data. If the corruption can be produced by a simple transformation, such as "color image to grayscale image" or "high-resolution image to low-resolution image", a dataset is easy to prepare, but for many tasks it is not. Even if you set out to collect clean images, every sample must satisfy conditions such as long exposure and a still subject, and even in datasets like ImageNet you can find images that contain noise from the capture process.

Formulation

To solve these dataset problems, the paper shows how to train a network that recovers the correct signal from a corrupted signal, given only a dataset of corrupted signals.

First, let us formulate the problem. In the following formula, $\hat{x}_i$ is a corrupted input signal (in the image case, a noisy image), and $y_i$ is the clean target (the denoised image).

\underset{\theta}{argmin} \displaystyle \sum_i L(f_\theta(\hat{x}_i),y_i)

In short, we want to learn a function $f_\theta$, adjusting its parameters $\theta$, that maps $\hat{x}_i$ to $y_i$: the standard denoising problem setting. Here, the corrupted input $\hat{x}$ must be a random variable generated conditioned on the clean target ($\hat{x} \thicksim p(\hat{x}|y_i)$).
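
As a concrete illustration of this setup (the Gaussian noise model and the image size below are my own placeholder choices for the sketch, not something fixed by the formulation), a corrupted input $\hat{x}_i$ can be drawn from a clean target $y_i$ like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(y, sigma=0.1):
    """Draw a corrupted input x_hat ~ p(x_hat | y); here, additive Gaussian noise."""
    return y + rng.normal(0.0, sigma, size=y.shape)

y = rng.random((64, 64))   # stand-in for a clean image y_i with values in [0, 1]
x_hat = corrupt(y)         # noisy input paired with the clean target y
```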

The point

The characteristic points of the paper are the following three:

- A restoration network can be trained using only corrupted images, with no clean targets at all.
- The resulting quality is on par with, and sometimes exceeds, training with clean targets.
- No explicit image prior or likelihood model of the corruption is required.

At first you may wonder whether this is really possible; the last point in particular sounds almost too good to be true. But, as the title suggests, "noise to noise", i.e. training noisy inputs against noisy targets, is the key idea of this paper.

Theoretical background

First, let us consider regression models (regressors). Note that in this theoretical background, all models are assumed to be regressors; classification models (classifiers) are not considered.

Therefore, let's start with a very simple example of regression.

First, assume an unreliable set of room-temperature measurements $\{y_1, y_2, y_3\}$. In other words, think of them as multiple temperature measurements taken at several points in the room. Whether the measuring procedure is sloppy or the thermometer itself is faulty, we assume there is some error between the true room temperature and each measured value.

In this situation, the most common strategy for estimating the unknown true room temperature is to find the value $z$ that minimizes the average deviation from the measurements under some loss function $L$:

\underset{z}{argmin} \mathbb{E}_y\{L(z, y)\}

For example, if you minimize the L2 loss $(z-y)^2$, $z$ turns out to be the simple arithmetic mean:

z = \mathbb{E}_y\{y\}.

I think this is intuitive and easy to understand.

Similarly, when the L1 loss $|z-y|$ is used, the median of the observed data is obtained as the optimal solution.
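
As a quick numerical sanity check, here is a minimal Python sketch (the measurement values are made up for illustration) that minimizes the average L2 and L1 losses over a grid of candidate $z$ and compares the minimizers with the mean and median:

```python
import numpy as np

# Hypothetical noisy room-temperature measurements (made-up values)
y = np.array([20.1, 21.4, 19.8, 22.0, 20.6])

# Candidate estimates z on a fine grid
z = np.linspace(15.0, 25.0, 10001)

# Average L2 and L1 loss for every candidate z
l2 = ((z[:, None] - y[None, :]) ** 2).mean(axis=1)
l1 = np.abs(z[:, None] - y[None, :]).mean(axis=1)

print("L2 minimizer:", z[l2.argmin()], "arithmetic mean:", y.mean())
print("L1 minimizer:", z[l1.argmin()], "median:", np.median(y))
```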

Training a neural regressor can be seen as a generalization of the above. The training task over input-target pairs $(x, y)$ is formalized as follows:

\underset{\theta}{argmin} \mathbb{E}_{(x,y)} \{ L(f_\theta(x),y) \}

This is the usual DNN training objective, but it can be rewritten in terms of a conditional expectation:

\underset{\theta}{argmin} \mathbb{E}_x \{\mathbb{E}_{y|x} \{ L(f_\theta(x),y) \}\}

This decomposition exposes an important point that is easy to overlook when training a neural regressor.

In other words, training a regressor looks like learning a 1:1 mapping from x to y, but in reality multiple y can correspond to the same x, so it is really a 1:n mapping.

A concrete example makes this easy to see: in super-resolution, for an input low-resolution image $x$ there are multiple plausible output high-resolution images $y$ (at least, images that humans would judge as higher-resolution versions of $x$).

Similarly, in automatic colorization, there are multiple valid output color images for a single input black-and-white image.

Therefore, although many tasks that use a neural regressor look like learning to map one point to one point, in reality the network is learning to map a point to a whole distribution of plausible points.

If such a network is trained with the L2 loss, then, as explained earlier with the room-temperature dataset, it eventually learns to output the average of all plausible explanations.

As a result, the network's inferred outputs contain spatial blur.

This blur has plagued researchers on many tasks. For example, super-resolution outputs and GAN-generated results often look as if a smoothing filter such as a Gaussian filter had been applied, and much research effort goes into mitigating this.

However, the averaging tendency that causes this blur turns out to produce an unexpected and useful by-product in our setting.

In other words, even if the targets used during training are contaminated with random noise whose expectation (for the L2 loss) or median (for the L1 loss) matches the clean target -- Gaussian noise or salt-and-pepper noise, for example -- this averaging ability causes the trained network to produce the same output as a network trained with clean targets. Therefore, the $f_\theta$ obtained by optimizing Equation 1 and the one obtained by optimizing the following expression are equivalent:

\underset{\theta}{argmin} \sum_i L(f_{\theta}(\hat{x}_i),\hat{y}_i)

In this expression, the clean target $y_i$ that we needed earlier no longer appears; both the input $\hat{x}_i$ and the target $\hat{y}_i$ are corrupted observations.
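
As a sketch of why this works under the L2 loss (a condensed version of the argument, not the paper's full derivation): for each input $x$, the minimizer of the expected L2 loss is the conditional mean of the targets,

\underset{f(x)}{argmin}\ \mathbb{E}_{y|x}\{(f(x)-y)^2\} = \mathbb{E}_{y|x}\{y\}, \qquad \underset{f(x)}{argmin}\ \mathbb{E}_{\hat{y}|x}\{(f(x)-\hat{y})^2\} = \mathbb{E}_{\hat{y}|x}\{\hat{y}\}

so as long as the corruption satisfies $\mathbb{E}_{\hat{y}|x}\{\hat{y}\} = \mathbb{E}_{y|x}\{y\}$, the clean-target objective and the noisy-target objective share the same minimizer, even though each individual $\hat{y}$ is noisy.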

This theory is the most important and fundamental theory in Noise2Noise.
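
To make the idea concrete, here is a minimal PyTorch-style sketch of a Noise2Noise training step. The tiny CNN, the Gaussian noise level, and the random stand-in images are my own placeholder choices (the paper itself uses a U-Net and several different corruptions); the essential point is that both the input and the target are independently corrupted versions of the same underlying image, and no clean image appears in the loss:

```python
import torch
import torch.nn as nn

# A deliberately tiny denoising CNN; the paper uses a U-Net, this is just a stand-in.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # L2 loss -> the network learns the conditional mean

sigma = 0.1  # assumed Gaussian noise level (placeholder)

def train_step(clean_batch):
    """clean_batch: (B, 1, H, W) tensor, used here only to synthesize two noisy copies."""
    noisy_input = clean_batch + sigma * torch.randn_like(clean_batch)
    noisy_target = clean_batch + sigma * torch.randn_like(clean_batch)  # independent noise

    optimizer.zero_grad()
    loss = loss_fn(model(noisy_input), noisy_target)  # no clean image in the loss
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage example with random tensors standing in for real images
dummy_images = torch.rand(8, 1, 64, 64)
print(train_step(dummy_images))
```

In a real application you would of course not have clean images to corrupt; each training pair would instead be two independent noisy captures of the same scene, which is exactly the situation the paper targets.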

That covers the theory; next, we move on to the experiments.

Experiment

coming soon...
