I tried out graph-rcnn, a model that generates scene graphs used in tasks such as VQA, and summarized it here. The paper is here. The original implementation code is here. The code with the visualization results is here.
VQA stands for Visual Question Answering: given an image, a question sentence, and answer choices, the task is to select the correct answer. As explained in the article here, it looks like the image below. https://arxiv.org/pdf/1505.00468.pdf
Since it requires understanding both the content of the image and the content of the question text, it is a technology that combines CV and NLP.
Scene graph generation has been proposed as the CV-side processing for solving the VQA task above. A scene graph is a graph in which the objects detected in an image become nodes, and the positional and semantic relationships between them (use, wear, etc.) become edges, as shown in the figure. From the photograph of a man riding a skateboard, the various objects surrounding the man (skateboard, pants, shirt, etc.) are described as nodes, and their pairwise relationships as edges. Because it can describe the relationships between objects in an image, scene graph generation can be applied not only to VQA but also to various tasks that connect CV and NLP, such as captioning.
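To make the structure concrete, here is a toy representation of the scene graph for the skateboard photo in Python. The objects and relations are illustrative values taken from the description above, not model output:

```python
# A toy scene graph for the skateboard photo, as plain Python data.
# Nodes are detected objects; edges are (subject, predicate, object) triplets.
nodes = ["man", "skateboard", "pants", "shirt"]
edges = [
    ("man", "riding", "skateboard"),   # semantic relationship
    ("man", "wearing", "pants"),       # semantic relationship
    ("man", "wearing", "shirt"),
    ("skateboard", "under", "man"),    # positional relationship
]

for subj, pred, obj in edges:
    print(f"{subj} --{pred}--> {obj}")
```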
The flow of the proposed method is as follows.
Object detection is performed by Faster R-CNN (as in the faster_rcnn_res101.yaml config used below). From this stage, the size, location, and class information of each bounding box are estimated.
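As a rough illustration of what this first stage outputs (boxes, class labels, and scores), here is a minimal sketch using torchvision's off-the-shelf Faster R-CNN; this is not the repo's detector, which is trained separately below:

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN, used here only to illustrate the detector's
# outputs; the repo trains its own ResNet-101 detector instead.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 600, 800)  # a dummy RGB image tensor in [0, 1]
with torch.no_grad():
    outputs = model([image])

# Each output dict holds the bounding boxes, class indices, and confidences.
print(outputs[0]["boxes"].shape)   # (num_detections, 4) box coordinates
print(outputs[0]["labels"].shape)  # (num_detections,) class indices
print(outputs[0]["scores"].shape)  # (num_detections,) confidence scores
```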
This is the first contribution of the paper: a relation proposal network (RePN), analogous to the region proposal network (RPN) in Faster R-CNN. Since estimating a relationship for every combination of bounding boxes is prohibitively expensive, a relatedness score $f(p_i, p_j)$ is computed from the class logits $p_i$ and $p_j$ of two bounding boxes, and only the highest-scoring pairs are kept as relation proposals.
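A minimal sketch of this scoring idea in PyTorch; the hidden sizes and MLP depth are arbitrary choices, not the paper's hyperparameters:

```python
import torch
import torch.nn as nn

class RelatednessScore(nn.Module):
    """Sketch of the RePN score f(p_i, p_j): project the class logits of
    subject and object with two small MLPs, then take a dot product."""

    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.psi = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, logits):
        # logits: (N, num_classes) class logits for N detected boxes
        subj = self.phi(logits)       # (N, hidden) subject projection
        obj = self.psi(logits)        # (N, hidden) object projection
        scores = subj @ obj.t()       # (N, N) one score per ordered pair
        return torch.sigmoid(scores)  # relatedness in (0, 1)

# 151 = Visual Genome's 150 object classes + background (a common setup)
scores = RelatednessScore(num_classes=151)(torch.randn(10, 151))
print(scores.shape)  # (10, 10): a score for every ordered box pair
```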
This is the second contribution of the paper: an attention mechanism is added to the graph convolution (attentional GCN, aGCN).
In a normal graph convolution, the update of a node $z_i$ uses fixed coefficients $\alpha$ determined by the graph's connectivity: $z_i^{(l+1)} = \sigma(z_i^{(l)} + \sum_{j} \alpha_{ij} W z_j^{(l)})$. The aGCN instead predicts each $\alpha_{ij}$ with an attention mechanism over node pairs, so the influence of each neighbor is learned rather than fixed.
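Below is a minimal sketch of this idea in PyTorch. The dimensions, the exact attention form, and the masking details are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalGraphConv(nn.Module):
    """Sketch of an attentional graph convolution (aGCN-style): instead of
    fixing the coefficients alpha from the graph structure, predict them
    from each pair of node features."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)  # feature transform W
        self.att = nn.Linear(2 * dim, 1)          # attention over node pairs

    def forward(self, z, adj):
        # z: (N, dim) node features, adj: (N, N) 0/1 adjacency matrix
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        u = self.att(pairs).squeeze(-1)             # (N, N) raw attention
        u = u.masked_fill(adj == 0, float("-inf"))  # attend only to neighbors
        alpha = torch.softmax(u, dim=-1)            # learned alpha_ij
        alpha = torch.nan_to_num(alpha)             # isolated nodes -> zeros
        return F.relu(z + alpha @ self.w(z))        # residual update

z = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
print(AttentionalGraphConv(16)(z, adj).shape)  # (5, 16)
```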
In addition, the paper proposes a new evaluation metric (SGGen+) for the generated scene graphs, which gives credit not only for fully correct (subject, predicate, object) triplets but also for correctly recognized objects and predicates.
The implementation on GitHub comes with a detailed explanation, but I will outline the general flow here.
After cloning the repository and installing the requirements, download the dataset. The dataset is Visual Genome; jump to this page to download it.
Download items 1 to 5. Item 1 is not provided ready-made, so follow this procedure to create it: you need to download and run a script, but you do not have to clone that entire repository. Note that this script appears to target Python 2, so you need to switch your Python version just for this step.
With these steps, the following files should be prepared under data:
data/vg/imdb_1024.h5
data/vg/bbox_distribution.npy
data/vg/proposals.h5
data/vg/VG-SGG-dicts.json
data/vg/VG-SGG.h5
If these are in place, you are ready.
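As a quick sanity check (a convenience snippet of my own, not part of the repo), you can verify that everything is in place:

```python
from pathlib import Path

# Check that the Visual Genome files are where the repo expects them.
required = [
    "data/vg/imdb_1024.h5",
    "data/vg/bbox_distribution.npy",
    "data/vg/proposals.h5",
    "data/vg/VG-SGG-dicts.json",
    "data/vg/VG-SGG.h5",
]
for path in required:
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"{status:8s} {path}")
```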
Then run the code. First train the object detection model, then train the scene graph generation model. (The mode that trains both at once, "train scene graph generation model jointly", did not work for me due to an error.)
python main.py --config-file configs/faster_rcnn_res101.yaml
python main.py --config-file configs/sgg_res101_step.yaml --algorithm <name of the algorithm you want to try>
The following command evaluates the model. You can visualize the inference results with the --visualize option.
python main.py --config-file configs/sgg_res101_step.yaml --inference --visualize
If you run the evaluation command above, you will see results like this:
2020-03-02 05:05:03,016 scene_graph_generation.inference INFO: ===================sgdet(motif)=========================
2020-03-02 05:05:03,017 scene_graph_generation.inference INFO: sgdet-recall@20: 0.0300
2020-03-02 05:05:03,018 scene_graph_generation.inference INFO: sgdet-recall@50: 0.0563
2020-03-02 05:05:03,019 scene_graph_generation.inference INFO: sgdet-recall@100: 0.0699
2020-03-02 05:05:03,019 scene_graph_generation.inference INFO: =====================sgdet(IMP)=========================
2020-03-02 05:05:03,020 scene_graph_generation.inference INFO: sgdet-recall@20: 0.03372315977691639
2020-03-02 05:05:03,021 scene_graph_generation.inference INFO: sgdet-recall@50: 0.06264976651796783
2020-03-02 05:05:03,022 scene_graph_generation.inference INFO: sgdet-recall@100: 0.07724741486207399
Also, --visualize will generate images like the one below under ./visualize.
However, the figure above only visualizes the object detection results; the scene graph itself, which is the important part, is not shown. So I added a Python script that visualizes the scene graph (here). The scene graph visualization for the above image is as follows.
The relationships between nearby objects have been extracted successfully! I would also like to reflect the relationship labels on the edges; one way to do this is sketched below.
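For reference, the core of such a visualization, including predicate labels on the edges, can be sketched with networkx and matplotlib. This is a simplified illustration with dummy objects and relations, not the actual script, where the nodes and edges would come from the model's predictions:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Build a directed graph from (subject, predicate, object) triplets.
graph = nx.DiGraph()
triplets = [("man", "riding", "skateboard"),
            ("man", "wearing", "pants"),
            ("man", "wearing", "shirt")]
for subj, pred, obj in triplets:
    graph.add_edge(subj, obj, label=pred)

pos = nx.spring_layout(graph, seed=0)
nx.draw(graph, pos, with_labels=True, node_color="lightblue", node_size=1500)
# Draw the predicate on each edge -- the "relationship label" mentioned above.
nx.draw_networkx_edge_labels(graph, pos,
                             edge_labels=nx.get_edge_attributes(graph, "label"))
plt.savefig("scene_graph.png")
```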
That was an overview of Graph R-CNN for Scene Graph Generation, which generates a scene graph from an image. The task is interesting and the implementation instructions are friendly, so give it a try!