Introduction

While skimming papers on Papers with Code, I came across a technique for colorizing black-and-white images, something I had wanted to learn about for a while. I have translated and summarized the outline of the paper, so I hope you find it helpful.

Instance-aware Image Colorization https://paperswithcode.com/paper/instance-aware-image-colorization

This paper, recently posted on arXiv, proposes a colorization technique for black-and-white images that handles each detected object (instance) separately.

Summary: Abstract

Colorization is difficult because it involves multimodal [*1] uncertainty.
Existing models fail when an image contains multiple objects, because they learn and colorize the image as a whole.
The authors use an off-the-shelf object detector to crop out objects and extract object-level features, in addition to image-level features.
The method outperforms existing methods.

[*1] Multimodal: in this context, a single grayscale input has several plausible color outputs (the distribution of valid colors has multiple modes). It does not refer to combining several senses or input types.

1. Background: Introduction

Converting black-and-white images into plausible color images is an active research topic. However, predicting the two missing channels from a black-and-white image is an inherently difficult problem. In addition, because an object can plausibly take several colors, the colorization has multiple valid interpretations (e.g., a vehicle may be white, black, or red).

Previously reported methods do not colorize well when there are many objects on a cluttered background (see the figure below).

To solve this problem, the authors propose a new deep learning framework for instance-aware colorization. The key finding is that **clearly separating the object from the background** is effective in improving colorization performance.

The authors' framework consists of the following three components.

An off-the-shelf pre-trained model for detecting objects and generating cropped object images
Two backbone networks trained to colorize the cropped objects and the entire image, respectively
A fusion module that selectively blends the features extracted from the layers of the two colorization networks

2. Related technologies: Related works

Learning-based colorization

In recent years, automating colorization with machine learning has attracted attention. In existing research, deep convolutional neural networks have become the mainstream approach for learning color prediction from large datasets.

Instance-based image generation and manipulation: Instance-aware image synthesis and manipulation

Instance-aware processing makes the separation between figure and ground clear, which makes it easier to synthesize and manipulate visual appearance.

Compared to DC-GAN and FineGAN, which focus on a single object, this method can handle complex scenes with multiple instances.
Compared to InstaGAN, which processes non-overlapping instances one after another, this method considers all potentially overlapping instances at the same time.
Compared to Pix2PixHD, which uses instance boundaries to improve synthesis quality, this method uses learned weight maps to blend the features of multiple instances.

3. Overview

The system takes a black-and-white image $X \in \mathbb{R}^{H \times W \times 1}$ as input and predicts the two missing color channels $Y \in \mathbb{R}^{H \times W \times 2}$ end-to-end in the CIE $L^*a^*b^*$ color space.
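To make the input and output shapes concrete, here is a minimal NumPy sketch. The image size and the `predict_ab` placeholder are assumptions for illustration only; the real prediction comes from the colorization networks described below.

```python
import numpy as np

H, W = 4, 6  # hypothetical image size

# Grayscale input X: the lightness channel L*, shape (H, W, 1).
X = np.random.default_rng(0).uniform(0.0, 100.0, size=(H, W, 1))

def predict_ab(x):
    """Placeholder for the colorization network (hypothetical).

    Returns a* = b* = 0 everywhere, i.e. a neutral gray prediction.
    """
    return np.zeros(x.shape[:2] + (2,), dtype=x.dtype)

Y = predict_ab(X)                      # predicted a*, b* channels, shape (H, W, 2)
lab = np.concatenate([X, Y], axis=-1)  # full L*a*b* image, shape (H, W, 3)
```

The point of working in $L^*a^*b^*$ is that the input already is the $L^*$ channel, so the network only has to supply the two chroma channels.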

The figure below shows the network configuration. First, a pre-trained object detector is used to obtain multiple object bounding boxes $\{B_i\}_{i=1}^{N}$ ($N$ is the number of instances) from the black-and-white image.

Next, the regions given by the detected bounding boxes are cropped from the black-and-white image and resized to generate the instance images $\{X_i\}_{i=1}^{N}$.
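The crop-and-resize step can be sketched as follows. This is only an illustration: the function name `crop_and_resize`, the `(x0, y0, x1, y1)` box format, the output size, and the nearest-neighbour resampling are all my assumptions, not details taken from the paper.

```python
import numpy as np

def crop_and_resize(image, box, size):
    """Crop box = (x0, y0, x1, y1) from image (H, W, C) and resize the crop
    to (size, size, C) with nearest-neighbour sampling."""
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return crop[rows][:, cols]

# Toy grayscale image and two hypothetical bounding boxes.
X = np.arange(8 * 10, dtype=np.float32).reshape(8, 10, 1)
boxes = [(0, 0, 4, 4), (2, 1, 10, 7)]
instances = [crop_and_resize(X, b, size=6) for b in boxes]
```

Resizing every crop to the same size lets a single network process all instances, whatever the size of their boxes.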

Next, each instance image $X_i$ and the input grayscale image $X$ are fed to the instance colorization network and the full-image colorization network, respectively. The feature maps extracted from $X_i$ and $X$ at the $j$-th network layer are denoted $f_j^{X_i}$ and $f_j^{X}$.

Finally, a fusion module fuses the instance features $\{f_j^{X_i}\}_{i=1}^{N}$ with the full-image feature $f_j^{X}$ at each layer. The fused full-image feature is passed on to the $(j+1)$-th layer. This step is repeated up to the last layer to obtain the predicted color image $Y$.

Training is sequential: first the full-image network is trained, then the instance network, and finally the fusion module is trained with the two networks above frozen.

4. Method

4.1. Object detection

The image is colorized using the detected object instances. For this purpose, an off-the-shelf pre-trained Mask R-CNN is used as the object detector.

4.3. Fusion module

The fusion module takes two inputs: (1) the full-image feature $f_j^{X}$ and (2) a set of instance features with their corresponding object bounding boxes, $\{f_j^{X_i}, B_i\}_{i=1}^{N}$. For each of the two kinds of features, a small neural network with three convolutional layers predicts a weight map: the full-image weight map $W_F$ and the per-instance weight maps $W_I^i$.
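One plausible way to blend the features with such weight maps is sketched below in NumPy. Everything here is an assumption made for illustration: the function names `place_in_box` and `fuse`, the nearest-neighbour resizing, the per-pixel softmax over the weight maps, and the choice to give an instance zero weight outside its box. The weight maps are passed in as plain arrays, whereas in the paper they are predicted by the small convolutional networks.

```python
import numpy as np

def place_in_box(patch, box, height, width):
    """Resize patch (h, w, C) to the box size (nearest neighbour) and paste it
    onto a zero canvas of shape (height, width, C) at box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    bh, bw = y1 - y0, x1 - x0
    rows = np.arange(bh) * patch.shape[0] // bh
    cols = np.arange(bw) * patch.shape[1] // bw
    canvas = np.zeros((height, width, patch.shape[2]), dtype=patch.dtype)
    canvas[y0:y1, x0:x1] = patch[rows][:, cols]
    return canvas

def fuse(f_full, w_full, f_insts, w_insts, boxes):
    """Blend one full-image feature (H, W, C) with N instance features.

    w_full is (H, W, 1); w_insts[i] is (h, w, 1); all hold raw scores.
    """
    H, W, _ = f_full.shape
    feats = [f_full] + [place_in_box(f, b, H, W) for f, b in zip(f_insts, boxes)]
    scores = [w_full.astype(np.float64)]
    for w, (x0, y0, x1, y1) in zip(w_insts, boxes):
        # Assumption: outside its box an instance scores -inf (weight 0).
        s = np.full((H, W, 1), -np.inf)
        s[y0:y1, x0:x1] = place_in_box(w, (x0, y0, x1, y1), H, W)[y0:y1, x0:x1]
        scores.append(s)
    scores = np.stack(scores)                # (N + 1, H, W, 1)
    scores = scores - scores.max(axis=0)     # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=0)           # weights sum to 1 at every pixel
    return (weights * np.stack(feats)).sum(axis=0)  # fused feature, (H, W, C)

# Toy example: one instance covering the central 2x2 region.
f_full = np.ones((4, 4, 2))
w_full = np.zeros((4, 4, 1))
f_inst = 3.0 * np.ones((2, 2, 2))
w_inst = np.zeros((2, 2, 1))
fused = fuse(f_full, w_full, [f_inst], [w_inst], [(1, 1, 3, 3)])
```

With equal scores, pixels inside the box become the average of the two features, while pixels outside the box keep the full-image feature unchanged.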

4.4. Loss Function and Training

The entire network is trained in the following steps. First, the full-image colorization network is trained, and its learned weights are transferred to the instance colorization network as initialization. Next, the instance colorization network is trained. Finally, the weights of both the full-image model and the instance model are frozen, and the fusion module is trained.
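The order of the three stages can be written down as a small pure-Python sketch, with the two colorization networks frozen in the last stage as stated in the overview. The `Net` class and `run_schedule` function are hypothetical stand-ins; no real optimization happens here.

```python
from dataclasses import dataclass, field

@dataclass
class Net:
    """Hypothetical stand-in for a network: a weight dict and a frozen flag."""
    name: str
    weights: dict = field(default_factory=dict)
    frozen: bool = False

    def train(self, log):
        if self.frozen:
            raise RuntimeError(f"{self.name} is frozen and cannot be trained")
        self.weights["trained"] = True  # stand-in for real optimization
        log.append(self.name)

def run_schedule():
    log = []
    full, inst, fusion = Net("full_image"), Net("instance"), Net("fusion")
    full.train(log)                    # stage 1: full-image colorization network
    inst.weights = dict(full.weights)  # initialize instance net from stage 1
    inst.train(log)                    # stage 2: instance colorization network
    full.frozen = inst.frozen = True   # stage 3: freeze both backbones ...
    fusion.train(log)                  # ... and train only the fusion module
    return log, full, inst, fusion

log, full, inst, fusion = run_schedule()
```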

5. Experiments

5.1. Experimental setting

Dataset

Three datasets are used: ImageNet, COCO-Stuff, and Places205

Training method: Training details

The following three training stages were performed on the ImageNet dataset.

Full-image colorization network: initialized with the weights of an existing model (learning rate $10^{-5}$)
Instance colorization network: the model is fine-tuned on instances extracted from the dataset
Fusion module: fusion with a 13-layer neural network

Optimizer: Adam ($\beta_1 = 0.99$, $\beta_2 = 0.999$)
Training took 3 days on a single RTX 2080 Ti GPU (ImageNet)
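For reference, a single step of the standard Adam update with these hyperparameters looks like this in NumPy. The function `adam_step`, the toy parameter values, and `eps` are illustrative assumptions. The learning rate $10^{-5}$ is the one quoted for the full-image network, and the summary does not say whether it applies to every stage.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.99, beta2=0.999, eps=1e-8):
    """One standard Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])   # toy parameters
grad = np.array([0.5, -4.0])    # toy gradient
new_theta, m, v = adam_step(theta, grad, np.zeros(2), np.zeros(2), t=1)
```

On the very first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by roughly the learning rate in the direction opposite to the sign of its gradient.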

5.2. Quantitative comparisons

Comparison with the state of the art.

The table above shows a quantitative comparison on the three datasets. The proposed method scored better than previous methods on every metric.

※ Metrics
LPIPS: distance between the original image and the regenerated image after projecting both into a learned feature space (lower means closer and more similar)
SSIM: similarity computed from the means, variances, and covariances of neighboring pixels, reflecting luminance, contrast, and structure (higher means more similar)
PSNR: derived from the mean squared difference in pixel values at the same positions in the two images (higher means higher quality)
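Of the three metrics, PSNR is simple enough to write out directly: $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where $\mathrm{MSE}$ is the mean squared pixel difference and $\mathrm{MAX}$ is the largest possible pixel value. A minimal NumPy sketch, assuming 8-bit images ($\mathrm{MAX} = 255$); the function name `psnr` is mine:

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

black = np.zeros((4, 4), dtype=np.uint8)
white = np.full((4, 4), 255, dtype=np.uint8)
off_by_one = np.ones((4, 4), dtype=np.uint8)
```

Here `psnr(black, white)` is the worst case for 8-bit images, while `psnr(black, off_by_one)` shows how high the value is for a difference of a single gray level per pixel.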

User study: Participants were shown pairs of colorized results and asked which one they preferred (forced-choice comparison). On average, the authors' method was preferred over Zhang et al. (61% vs. 39%) and DeOldify (72% vs. 28%). Interestingly, although DeOldify does not produce accurate colorization as measured in the benchmark experiments, its saturated results are sometimes preferred by users.

5.7. Failure cases

The figure above shows two failure cases. The authors' approach can produce visible artifacts, such as washed-out colors or colors bleeding across object boundaries.

6. Conclusions

In this study, objects were cropped from the image using an off-the-shelf object detection model, and features were extracted from both the instance branch and the full-image branch. The authors then confirmed that a better feature map is obtained by fusing them with the newly proposed fusion module. The experiments showed that the method outperforms existing methods on three benchmark datasets.

Closing thoughts

I learned about a colorization technique that incorporates instance segmentation. I understood the technique itself, but I found it difficult to evaluate quantitatively whether a colorized image is plausible. When there are multiple valid choices, such as the color of a car or of vegetation, how do you decide which algorithm's output is the most plausible?

The authors also ran a user study to let people judge, but if an algorithm could make this kind of multimodal judgment by itself, the technology would feel even more like artificial intelligence.

[PYTHON] (Reading the paper) Instance-aware Image Colorization (Region division: Color imaging using instance segmentation)