I tried out graph-rcnn, a model that generates scene graphs used in tasks such as VQA, and summarized it here. The paper is here. The original implementation code is here. The code with the visualization results is here.
VQA stands for Visual Question Answering: given an image, a question sentence, and answer choices, the task is to select the correct answer. As explained in the article here, it looks like the image below. https://arxiv.org/pdf/1505.00468.pdf
Since it requires understanding both the content of the image and the content of the question text, it is a technology that combines CV and NLP.
Scene graph generation has been proposed as the CV-side processing for solving the VQA task above. A scene graph is a graph in which the objects detected in an image become nodes, and the positional and semantic relationships between them (use, wear, etc.) become edges, as shown in the figure. From the photograph of a man riding a skateboard, the various objects surrounding the man (skateboard, pants, shirt, etc.) are described as nodes, and their pairwise relationships as edges. Because it can describe the relationships between objects in an image, scene graph generation can be applied not only to VQA but also to various tasks that connect CV and NLP, such as captioning.
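To make the structure concrete, here is a toy representation of the scene graph for the skateboard photo in Python. The objects and relations are illustrative values taken from the description above, not model output:

```python
# A toy scene graph for the skateboard photo, as plain Python data.
# Nodes are detected objects; edges are (subject, predicate, object) triplets.
nodes = ["man", "skateboard", "pants", "shirt"]
edges = [
    ("man", "riding", "skateboard"),   # semantic relationship
    ("man", "wearing", "pants"),       # semantic relationship
    ("man", "wearing", "shirt"),
    ("skateboard", "under", "man"),    # positional relationship
]

for subj, pred, obj in edges:
    print(f"{subj} --{pred}--> {obj}")
```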
The flow of the proposed method is as follows.
Object detection is performed by Faster R-CNN (as in the faster_rcnn_res101.yaml config used below). From this stage, the size, location, and class information of each bounding box are estimated.
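As a rough illustration of what this first stage outputs (boxes, class labels, and scores), here is a minimal sketch using torchvision's off-the-shelf Faster R-CNN; this is not the repo's detector, which is trained separately below:

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN, used here only to illustrate the detector's
# outputs; the repo trains its own ResNet-101 detector instead.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 600, 800)  # a dummy RGB image tensor in [0, 1]
with torch.no_grad():
    outputs = model([image])

# Each output dict holds the bounding boxes, class indices, and confidences.
print(outputs[0]["boxes"].shape)   # (num_detections, 4) box coordinates
print(outputs[0]["labels"].shape)  # (num_detections,) class indices
print(outputs[0]["scores"].shape)  # (num_detections,) confidence scores
```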
This is the first contribution of the paper: a relation proposal network (RePN), analogous to the region proposal network (RPN) in Faster R-CNN. Since estimating a relationship for every combination of bounding boxes is prohibitively expensive, a relatedness score $f(p_i, p_j)$ is computed from the class logits $p_i$ and $p_j$ of two bounding boxes, and only the highest-scoring pairs are kept as relation proposals.
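A minimal sketch of this scoring idea in PyTorch; the hidden sizes and MLP depth are arbitrary choices, not the paper's hyperparameters:

```python
import torch
import torch.nn as nn

class RelatednessScore(nn.Module):
    """Sketch of the RePN score f(p_i, p_j): project the class logits of
    subject and object with two small MLPs, then take a dot product."""

    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.psi = nn.Sequential(nn.Linear(num_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))

    def forward(self, logits):
        # logits: (N, num_classes) class logits for N detected boxes
        subj = self.phi(logits)       # (N, hidden) subject projection
        obj = self.psi(logits)        # (N, hidden) object projection
        scores = subj @ obj.t()       # (N, N) one score per ordered pair
        return torch.sigmoid(scores)  # relatedness in (0, 1)

# 151 = Visual Genome's 150 object classes + background (a common setup)
scores = RelatednessScore(num_classes=151)(torch.randn(10, 151))
print(scores.shape)  # (10, 10): a score for every ordered box pair
```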
This is the second contribution of the paper: an attention mechanism is added to the graph convolution (attentional GCN, aGCN).
In a normal graph convolution, the update of a node $z_i$ uses fixed coefficients $\alpha$ determined by the graph's connectivity: $z_i^{(l+1)} = \sigma(z_i^{(l)} + \sum_{j} \alpha_{ij} W z_j^{(l)})$. The aGCN instead predicts each $\alpha_{ij}$ with an attention mechanism over node pairs, so the influence of each neighbor is learned rather than fixed.
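Below is a minimal sketch of this idea in PyTorch. The dimensions, the exact attention form, and the masking details are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalGraphConv(nn.Module):
    """Sketch of an attentional graph convolution (aGCN-style): instead of
    fixing the coefficients alpha from the graph structure, predict them
    from each pair of node features."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)  # feature transform W
        self.att = nn.Linear(2 * dim, 1)          # attention over node pairs

    def forward(self, z, adj):
        # z: (N, dim) node features, adj: (N, N) 0/1 adjacency matrix
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        u = self.att(pairs).squeeze(-1)             # (N, N) raw attention
        u = u.masked_fill(adj == 0, float("-inf"))  # attend only to neighbors
        alpha = torch.softmax(u, dim=-1)            # learned alpha_ij
        alpha = torch.nan_to_num(alpha)             # isolated nodes -> zeros
        return F.relu(z + alpha @ self.w(z))        # residual update

z = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
print(AttentionalGraphConv(16)(z, adj).shape)  # (5, 16)
```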
In addition, the paper proposes a new evaluation metric (SGGen+) for the generated scene graphs, which gives credit not only for fully correct (subject, predicate, object) triplets but also for correctly recognized objects and predicates.
The implementation on GitHub comes with a detailed explanation, but I will outline the general flow here.
After cloning the repository and installing the requirements, download the dataset. The dataset is Visual Genome; jump to this page to download it.
Download items 1 to 5. Item 1 is not provided ready-made, so follow this procedure to create it: you need to download and run a script, but you do not have to clone that entire repository. Note that this script appears to target Python 2, so you need to switch your Python version just for this step.
With these steps, the following files should be prepared under data:
data/vg/imdb_1024.h5
data/vg/bbox_distribution.npy
data/vg/proposals.h5
data/vg/VG-SGG-dicts.json
data/vg/VG-SGG.h5
If these are in place, you are ready.
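As a quick sanity check (a convenience snippet of my own, not part of the repo), you can verify that everything is in place:

```python
from pathlib import Path

# Check that the Visual Genome files are where the repo expects them.
required = [
    "data/vg/imdb_1024.h5",
    "data/vg/bbox_distribution.npy",
    "data/vg/proposals.h5",
    "data/vg/VG-SGG-dicts.json",
    "data/vg/VG-SGG.h5",
]
for path in required:
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"{status:8s} {path}")
```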
Then run the code. First train the object detection model, then train the scene graph generation model. (The mode that trains both at once, "train scene graph generation model jointly", did not work for me due to an error.)
python main.py --config-file configs/faster_rcnn_res101.yaml
python main.py --config-file configs/sgg_res101_step.yaml --algorithm <name of the algorithm you want to try>
The following command evaluates the model. You can visualize the inference results with the --visualize option.
python main.py --config-file configs/sgg_res101_step.yaml --inference --visualize
If you run the evaluation command above, you will see results like this:
2020-03-02 05:05:03,016 scene_graph_generation.inference INFO: ===================sgdet(motif)=========================
2020-03-02 05:05:03,017 scene_graph_generation.inference INFO: sgdet-recall@20: 0.0300
2020-03-02 05:05:03,018 scene_graph_generation.inference INFO: sgdet-recall@50: 0.0563
2020-03-02 05:05:03,019 scene_graph_generation.inference INFO: sgdet-recall@100: 0.0699
2020-03-02 05:05:03,019 scene_graph_generation.inference INFO: =====================sgdet(IMP)=========================
2020-03-02 05:05:03,020 scene_graph_generation.inference INFO: sgdet-recall@20: 0.03372315977691639
2020-03-02 05:05:03,021 scene_graph_generation.inference INFO: sgdet-recall@50: 0.06264976651796783
2020-03-02 05:05:03,022 scene_graph_generation.inference INFO: sgdet-recall@100: 0.07724741486207399
Also, --visualize will generate images like the one below under ./visualize.
However, the figure above only visualizes the object detection results; the scene graph itself, which is the important part, is not shown. So I added a Python script that visualizes the scene graph (here). The scene graph visualization for the above image is as follows.
The relationships between nearby objects have been extracted successfully! I would also like to reflect the relationship labels on the edges; one way to do this is sketched below.
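For reference, the core of such a visualization, including predicate labels on the edges, can be sketched with networkx and matplotlib. This is a simplified illustration with dummy objects and relations, not the actual script, where the nodes and edges would come from the model's predictions:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Build a directed graph from (subject, predicate, object) triplets.
graph = nx.DiGraph()
triplets = [("man", "riding", "skateboard"),
            ("man", "wearing", "pants"),
            ("man", "wearing", "shirt")]
for subj, pred, obj in triplets:
    graph.add_edge(subj, obj, label=pred)

pos = nx.spring_layout(graph, seed=0)
nx.draw(graph, pos, with_labels=True, node_color="lightblue", node_size=1500)
# Draw the predicate on each edge -- the "relationship label" mentioned above.
nx.draw_networkx_edge_labels(graph, pos,
                             edge_labels=nx.get_edge_attributes(graph, "label"))
plt.savefig("scene_graph.png")
```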
That was an overview of Graph R-CNN for Scene Graph Generation, which generates a scene graph from an image. The task is interesting and the implementation instructions are friendly, so give it a try!