Grad-CAM is a good way to visualize where a model is looking when it makes a prediction. My one complaint is that the Grad-CAM map has a low resolution of (14,14) compared to the input image size of (224,224). The resolution is this low because the VGG16 model applies pooling four times before its final convolution layer, halving the feature map each time. Without the pooling layers, however, the small filters would only extract short-range features and could not capture long-range ones. I wondered whether I could get a high-resolution Grad-CAM by using dilated convolution instead, so I built an equivalent of VGG16 that uses dilated convolution and experimented. As a result, the nominal resolution of the Grad-CAM map increased, but the map did not become as detailed as the original image.
Left: Normal Grad-CAM, Right: Grad-CAM using dilated convolution
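The (14,14) figure can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `feature_map_size` is made up for illustration, and it assumes every pooling layer halves the width and height, as the 2x2 pooling in VGG16 does.

```python
def feature_map_size(input_size, num_poolings, pool=2):
    """Side length of the feature map after repeated pooling."""
    size = input_size
    for _ in range(num_poolings):
        size //= pool  # each pooling divides the side length by `pool`
    return size

# Four poolings sit in front of the last convolution layer of VGG16
print(feature_map_size(224, 4))
# The fifth pooling gives the size that is flattened for the dense layers
print(feature_map_size(224, 5))
```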
Dilated convolution is a method of convolving with a filter that has gaps between its elements, like a comb with missing teeth. If you increase dilation_rate, you can convolve over long distances with a small filter size without using pooling. Because no pooling is used, the feature map size is not reduced.
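To make the gaps concrete, here is a minimal 1-D sketch in plain Python. The function names are hypothetical, and unlike the Keras layers used later (which use padding='same'), this version does no padding, so the output gets shorter as the dilation grows.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1-D cross-correlation without padding; taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation  # distance between the first and last tap
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]

def effective_kernel_size(kernel_size, dilation):
    """Input width covered by one dilated filter."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

signal = list(range(10))
print(dilated_conv1d(signal, [1, 1, 1], dilation=1))  # sums of 3 adjacent samples
print(dilated_conv1d(signal, [1, 1, 1], dilation=2))  # sums of every other sample
print(effective_kernel_size(3, 16))                   # a 3-tap filter at rate 16
```

With dilation_rate=16, a 3x3 filter covers a 33x33 window while still having only 9 weights.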
I wrote the model below in Keras. For convenience, I call it the dilated_VGG16 model. By adjusting dilation_rate, it computes long-range convolutions while keeping the feature maps at (224,224). The output of the last convolution layer therefore has a resolution of (224,224) instead of (14,14). The final convolution layer is named 'block5_conv3' so that Grad-CAM can refer to it later. Note that the VGG16 model and the dilated_VGG16 model have the same number of parameters.
```python
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from keras.models import Model

def build_dilated_model():
    # Same layer layout as VGG16, but each pooling step is replaced
    # by doubling dilation_rate in the convolutions that follow it
    inputs = Input(shape=(224, 224, 3))
    x = Conv2D( 64, (3, 3), padding='same', activation='relu', dilation_rate=1)(inputs)
    x = Conv2D( 64, (3, 3), padding='same', activation='relu', dilation_rate=1)(x)
    x = Conv2D(128, (3, 3), padding='same', activation='relu', dilation_rate=2)(x)
    x = Conv2D(128, (3, 3), padding='same', activation='relu', dilation_rate=2)(x)
    x = Conv2D(256, (3, 3), padding='same', activation='relu', dilation_rate=4)(x)
    x = Conv2D(256, (3, 3), padding='same', activation='relu', dilation_rate=4)(x)
    x = Conv2D(256, (3, 3), padding='same', activation='relu', dilation_rate=4)(x)
    x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=8)(x)
    x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=8)(x)
    x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=8)(x)
    x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=16)(x)
    x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=16)(x)
    x = Conv2D(512, (3, 3), padding='same', activation='relu', dilation_rate=16, name='block5_conv3')(x)
    # One 32x32 pooling reduces (224,224) to (7,7), the size VGG16 flattens
    x = MaxPooling2D(pool_size=32)(x)
    x = Flatten()(x)
    x = Dense(4096, activation='relu')(x)
    x = Dense(4096, activation='relu')(x)
    y = Dense(1000, activation='softmax')(x)
    return Model(inputs=inputs, outputs=y)

model = build_dilated_model()
```
Summary of dilated_VGG16:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
conv2d_2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
conv2d_3 (Conv2D) (None, 224, 224, 128) 73856
_________________________________________________________________
conv2d_4 (Conv2D) (None, 224, 224, 128) 147584
_________________________________________________________________
conv2d_5 (Conv2D) (None, 224, 224, 256) 295168
_________________________________________________________________
conv2d_6 (Conv2D) (None, 224, 224, 256) 590080
_________________________________________________________________
conv2d_7 (Conv2D) (None, 224, 224, 256) 590080
_________________________________________________________________
conv2d_8 (Conv2D) (None, 224, 224, 512) 1180160
_________________________________________________________________
conv2d_9 (Conv2D) (None, 224, 224, 512) 2359808
_________________________________________________________________
conv2d_10 (Conv2D) (None, 224, 224, 512) 2359808
_________________________________________________________________
conv2d_11 (Conv2D) (None, 224, 224, 512) 2359808
_________________________________________________________________
conv2d_12 (Conv2D) (None, 224, 224, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 224, 224, 512) 2359808
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 512) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 25088) 0
_________________________________________________________________
dense_1 (Dense) (None, 4096) 102764544
_________________________________________________________________
dense_2 (Dense) (None, 4096) 16781312
_________________________________________________________________
dense_3 (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________
Summary of VGG16:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________
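The identical totals in the two summaries can be checked by hand, because the parameter count of a convolution layer depends on the filter size and channel counts but not on dilation_rate or on the feature map size. The sketch below recomputes the total from the layer shapes listed above; the helper names are made up for illustration.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution (independent of dilation_rate)."""
    return (k * k * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return (n_in + 1) * n_out

# (input channels, output channels) of the 13 convolution layers
conv_layers = [(3, 64), (64, 64), (64, 128), (128, 128),
               (128, 256), (256, 256), (256, 256),
               (256, 512), (512, 512), (512, 512),
               (512, 512), (512, 512), (512, 512)]
# (inputs, outputs) of the 3 dense layers; 7 * 7 * 512 = 25088
dense_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

total_params = (sum(conv_params(i, o) for i, o in conv_layers)
                + sum(dense_params(i, o) for i, o in dense_layers))
print(total_params)
```

The result matches the "Total params" line of both summaries, since the two models differ only in pooling and dilation, neither of which has weights.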
The problem with the dilated_VGG16 model is that, because it does not use pooling, the feature maps stay large and training takes a very long time. Judging from the ratio of feature map areas, the convolutions in the deepest layers cost **16 × 16 = 256 times** as much as in VGG16, so training this model is probably not realistic. Instead, I transferred the weights of VGG16 to dilated_VGG16 with the code below. This is possible because VGG16 and dilated_VGG16 have the same number of parameters, with the same weight shapes in every layer.
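```python
from keras.applications.vgg16 import VGG16

model1 = build_dilated_model()
model2 = VGG16(include_top=True, weights='imagenet')
# Works because both models hold the same weight tensors in the same order
model1.set_weights(model2.get_weights())
```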
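The 256x cost estimate mentioned earlier can be checked with a rough count of multiply-accumulate operations per convolution layer. This is a sketch under simplifying assumptions: only the convolutions are counted, the cost is taken as proportional to output area times filter size times channel counts, and the helper `conv_macs` is a hypothetical name.

```python
def conv_macs(size, c_in, c_out, k=3):
    """Multiply-accumulates of a k x k convolution on a size x size output."""
    return size * size * k * k * c_in * c_out

channels = [(3, 64), (64, 64), (64, 128), (128, 128),
            (128, 256), (256, 256), (256, 256),
            (256, 512), (512, 512), (512, 512),
            (512, 512), (512, 512), (512, 512)]
# Feature map side length at each convolution layer
vgg_sizes = [224, 224, 112, 112, 56, 56, 56, 28, 28, 28, 14, 14, 14]
dilated_sizes = [224] * 13  # no pooling, so the size never shrinks

vgg_total = sum(conv_macs(s, i, o) for s, (i, o) in zip(vgg_sizes, channels))
dilated_total = sum(conv_macs(s, i, o) for s, (i, o) in zip(dilated_sizes, channels))

last_layer_ratio = conv_macs(224, 512, 512) // conv_macs(14, 512, 512)
overall_ratio = dilated_total / vgg_total
print(last_layer_ratio)
print(round(overall_ratio, 1))
```

The deepest layers are 256 times more expensive, as estimated from the area ratio. Averaged over all 13 convolution layers the increase is smaller, because the early layers already run at full resolution in VGG16.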
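I ran the usual image classification with dilated_VGG16 using the VGG16 weights. The classification accuracy was heavily degraded with dilated_VGG16, but the transferred weights still seem to work to some extent. The first list below is the dilated_VGG16 output and the second is the original VGG16 output.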
Model prediction:
Saint_Bernard (247) with probability 0.029
boxer (242) with probability 0.026
whippet (172) with probability 0.020
tiger_cat (282) with probability 0.019
vacuum (882) with probability 0.017
Model prediction:
boxer (242) with probability 0.420
bull_mastiff (243) with probability 0.282
tiger_cat (282) with probability 0.053
tiger (292) with probability 0.050
Great_Dane (246) with probability 0.050
I generated the Grad-CAM map for the boxer prediction. The normal VGG16 Grad-CAM map has only (14,14) resolution, while the dilated_VGG16 map has (224,224) resolution. However, a grid pattern appeared in the dilated_VGG16 map, so its effective resolution was not high.
Left: Normal Grad-CAM, Right: Grad-CAM using dilated convolution
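For reference, the core Grad-CAM computation can be sketched in NumPy as follows. This is not the code used for the figures above. It assumes the layer activations and the gradients of the class score with respect to them have already been extracted from the model (for example from 'block5_conv3'), and the function name `grad_cam` is chosen here for illustration.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one image.

    activations: (H, W, K) output of the chosen convolution layer
    gradients:   (H, W, K) gradient of the class score w.r.t. that output
    """
    # One importance weight per channel: the spatial mean of its gradients
    weights = gradients.mean(axis=(0, 1))  # shape (K,)
    # Weighted sum over channels, then ReLU to keep positive evidence only
    cam = np.maximum(np.tensordot(activations, weights, axes=([2], [0])), 0.0)
    # Scale to [0, 1] for display (skipped if the map is all zeros)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

rng = np.random.default_rng(0)
# Random stand-ins for real layer outputs; 8 channels instead of 512 to stay small
for size in (14, 224):
    acts = rng.random((size, size, 8))
    grads = rng.standard_normal((size, size, 8))
    print(size, grad_cam(acts, grads).shape)
```

The heatmap always has the spatial size of the chosen layer, which is why the map is (14,14) for VGG16 and (224,224) for dilated_VGG16.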
I had hoped to get a high-resolution Grad-CAM by using dilated convolution, but it didn't work. When I searched, I found the following paper, which seems to offer a solution to the grid pattern of dilated convolution: https://www.cs.princeton.edu/~funk/drn.pdf (I haven't read the contents ...)
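One plausible reading of the grid pattern, offered as an assumption and not taken from the paper, is that stacked convolutions with the same large dilation rate only sample their input on a coarse lattice. The sketch below lists, in 1-D, which input offsets can influence one output position after a stack of 3-tap dilated convolutions; `reachable_offsets` is a hypothetical helper.

```python
def reachable_offsets(dilations, kernel_size=3):
    """Input offsets that can reach one output position through the stack."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        taps = [j * d for j in range(-half, half + 1)]  # e.g. -d, 0, +d
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)

# The three block-5 layers of dilated_VGG16 all use dilation_rate=16
print(reachable_offsets([16, 16, 16]))
# Three ordinary convolutions, for comparison
print(reachable_offsets([1, 1, 1]))
```

Within block 5, each output pixel only sees inputs whose offsets are multiples of 16, so neighbouring output pixels are computed from different sets of input pixels. That would be consistent with a grid appearing in the (224,224) map.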