While reading "Deep Learning from scratch" (written by Yasuki Saito, published by O'Reilly Japan), I will make a note of the sites I referred to. Part 17 ← → Part 19
I have confirmed that the scripts in the book can be replaced with Keras and run on Google Colaboratory.
So, this time I decided to have it distinguish Kaggle's cats-vs-dogs dataset. When I tried this in Self-study memo 6-2, the accuracy was 60%, only a little better than guessing. At that time, memory errors occurred frequently while creating the data, so I reduced the number of training samples, and since the only network I understood at that point was a two-layer fully connected net, I flattened the images to one dimension. In Self-study memo 12, I tried processing it with a convolutional neural network, but memory errors kept it from learning properly, and the accuracy was 50%, no better than chance.
This time, I will try a convolutional neural network again on Google Colaboratory, where I don't have to worry about memory errors.
The cat and dog data are about 400 MB each, and there are a few gigabytes of free space in My Drive, so capacity is no problem. However, when I tried to upload the data folder by folder, it took too long and timed out. Since there are 12,500 files in one folder, the number of files may be the problem, but the dog folder stopped partway through while the cat folder uploaded completely, so I don't really understand the cause. It probably came down to how busy the line and the server were at the time, so it can't be helped.
There was a story that if you leave a script running for a long time without operating the screen, the session times out, so I split the image-shaping process into dogs and cats, saved the results separately, and then combined them into a single file.
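Here is a minimal sketch of that last combining step, assuming the shaped images were saved as dog.pkl and cat.pkl with 'img' and 'label' keys (these file names, the keys, and the test split of 100 images are assumptions for illustration, not the actual script):

import pickle
import numpy as np

base = '/content/drive/My Drive/Colab Notebooks/deep_learning/dataset/'
with open(base + 'dog.pkl', 'rb') as f:  # hypothetical intermediate file
    dog = pickle.load(f)
with open(base + 'cat.pkl', 'rb') as f:  # hypothetical intermediate file
    cat = pickle.load(f)

# Concatenate the two classes and shuffle them together
img = np.concatenate([dog['img'], cat['img']])
label = np.concatenate([dog['label'], cat['label']])
idx = np.random.permutation(len(img))
img, label = img[idx], label[idx]

# Hold out the last 100 images for testing and save everything in one pickle
dataset = {'train_img': img[:-100], 'train_label': label[:-100],
           'test_img':  img[-100:], 'test_label':  label[-100:]}
with open(base + 'catdog.pkl', 'wb') as f:
    pickle.dump(dataset, f)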
I processed the images in the same way as in Self-study memo 12, but in memo 12 the data was for the book's DeepConvNet, so I used transpose to put it into the channels_first (batch, channels, height, width) format. Keras uses the channels_last (batch, height, width, channels) format, so this time the data is saved without transposing.
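For reference, converting between the two layouts is a single transpose (a generic example, not part of the saved script):

import numpy as np

x = np.zeros((1, 80, 80, 3))    # channels_last: (batch, height, width, channels)
x_cf = x.transpose(0, 3, 1, 2)  # channels_first: (batch, channels, height, width)
print(x_cf.shape)               # (1, 3, 80, 80)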
# Prepare the input data
from google.colab import drive
drive.mount('/content/drive')

import os
import pickle

mnist_file = '/content/drive/My Drive/Colab Notebooks/deep_learning/dataset/catdog.pkl'
with open(mnist_file, 'rb') as f:
    dataset = pickle.load(f)

x_train = dataset['train_img'] / 255.0  # normalized to the range 0.0-1.0
t_train = dataset['train_label']
x_test = dataset['test_img'] / 255.0
t_test = dataset['test_label']

print(x_train.shape)
(23411, 80, 80, 3)
The image data consists of integer values from 0 to 255; dividing by 255.0 converts it so that it falls within the range 0.0 to 1.0. Doing this sped up learning: without normalization the accuracy did not reach 60% even after 5 epochs, but with normalization it exceeded 60% by the 4th epoch.
# Import TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D, Dropout

# Import helper libraries
import numpy as np
import matplotlib.pyplot as plt
def create_model(input_shape, output_size, hidden_size):
    filter_num = 16
    filter_size = 3
    filter_stride = 1
    filter_num2 = 32
    filter_num3 = 64
    pool_size_h = 2
    pool_size_w = 2
    pool_stride = 2

    model = keras.Sequential(name="DeepConvNet")
    model.add(keras.Input(shape=input_shape))
    model.add(Conv2D(filter_num, filter_size, strides=filter_stride, padding="same", activation="relu", kernel_initializer='he_normal'))
    model.add(Conv2D(filter_num, filter_size, strides=filter_stride, padding="same", activation="relu", kernel_initializer='he_normal'))
    model.add(MaxPooling2D(pool_size=(pool_size_h, pool_size_w), strides=pool_stride))
    model.add(Conv2D(filter_num2, filter_size, strides=filter_stride, padding="same", activation="relu", kernel_initializer='he_normal'))
    model.add(Conv2D(filter_num2, filter_size, strides=filter_stride, padding="same", activation="relu", kernel_initializer='he_normal'))
    model.add(MaxPooling2D(pool_size=(pool_size_h, pool_size_w), strides=pool_stride))
    model.add(Conv2D(filter_num3, filter_size, strides=filter_stride, padding="same", activation="relu", kernel_initializer='he_normal'))
    model.add(Conv2D(filter_num3, filter_size, strides=filter_stride, padding="same", activation="relu", kernel_initializer='he_normal'))
    model.add(MaxPooling2D(pool_size=(pool_size_h, pool_size_w), strides=pool_stride))
    model.add(keras.layers.Flatten())
    model.add(Dense(hidden_size, activation="relu", kernel_initializer='he_normal'))
    model.add(Dropout(0.5))
    model.add(Dense(output_size))
    model.add(Dropout(0.5))
    model.add(Activation("softmax"))

    # Compile the model
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
input_shape=(80,80,3)
output_size=2
hidden_size=100
model = create_model(input_shape, output_size, hidden_size)
model.summary()
Model: "DeepConvNet" Layer (type) Output Shape Param #
conv2d (Conv2D) (None, 80, 80, 16) 448
conv2d_1 (Conv2D) (None, 80, 80, 16) 2320
max_pooling2d (MaxPooling2D) (None, 40, 40, 16) 0
conv2d_2 (Conv2D) (None, 40, 40, 32) 4640
conv2d_3 (Conv2D) (None, 40, 40, 32) 9248
max_pooling2d_1 (MaxPooling2 (None, 20, 20, 32) 0
conv2d_4 (Conv2D) (None, 20, 20, 64) 18496
conv2d_5 (Conv2D) (None, 20, 20, 64) 36928
max_pooling2d_2 (MaxPooling2 (None, 10, 10, 64) 0
flatten (Flatten) (None, 6400) 0
dense (Dense) (None, 100) 640100
dropout (Dropout) (None, 100) 0
dense_1 (Dense) (None, 2) 202
dropout_1 (Dropout) (None, 2) 0
activation (Activation) (None, 2) 0
Total params: 712,382 Trainable params: 712,382 Non-trainable params: 0
Except for input_shape and output_size, this is the same as the script created in Part 17.
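As a quick sanity check on the Param # column: a Conv2D layer has (kernel_h × kernel_w × in_channels + 1) × out_channels parameters (the +1 is the bias), and a Dense layer has (inputs + 1) × outputs:

# Conv2D params = (kernel_h * kernel_w * in_channels + 1) * out_channels
print((3*3*3  + 1) * 16)   # conv2d    -> 448
print((3*3*16 + 1) * 16)   # conv2d_1  -> 2320
print((3*3*16 + 1) * 32)   # conv2d_2  -> 4640
print((3*3*32 + 1) * 32)   # conv2d_3  -> 9248
print((3*3*32 + 1) * 64)   # conv2d_4  -> 18496
print((3*3*64 + 1) * 64)   # conv2d_5  -> 36928
# Dense params = (inputs + 1) * outputs
print((6400 + 1) * 100)    # dense     -> 640100
print((100  + 1) * 2)      # dense_1   -> 202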
model.fit(x_train, t_train, epochs=10, batch_size=128)
test_loss, test_acc = model.evaluate(x_test, t_test, verbose=2)
Epoch 1/10
195/195 [==============================] - 385s 2s/step - loss: 0.7018 - accuracy: 0.5456
Epoch 2/10
195/195 [==============================] - 385s 2s/step - loss: 0.6602 - accuracy: 0.5902
Epoch 3/10
195/195 [==============================] - 383s 2s/step - loss: 0.6178 - accuracy: 0.6464
Epoch 4/10
195/195 [==============================] - 383s 2s/step - loss: 0.5844 - accuracy: 0.6759
Epoch 5/10
195/195 [==============================] - 383s 2s/step - loss: 0.5399 - accuracy: 0.7090
Epoch 6/10
195/195 [==============================] - 383s 2s/step - loss: 0.5001 - accuracy: 0.7278
Epoch 7/10
195/195 [==============================] - 382s 2s/step - loss: 0.4676 - accuracy: 0.7513
Epoch 8/10
195/195 [==============================] - 382s 2s/step - loss: 0.4485 - accuracy: 0.7611
Epoch 9/10
195/195 [==============================] - 380s 2s/step - loss: 0.4295 - accuracy: 0.7713
Epoch 10/10
195/195 [==============================] - 382s 2s/step - loss: 0.4099 - accuracy: 0.7788
4/4 - 0s - loss: 0.3249 - accuracy: 0.8500
The accuracy is 85%. Let me display the judgment results.
# Predict
predictions = model.predict(x_test)

def plot_image(i, predictions_array, t_label, img):
    class_names = ['cat', 'dog']
    predictions_array = predictions_array[i]
    img = img[i].reshape((80, 80, 3))
    true_label = t_label[i]
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img, cmap=plt.cm.binary)
    predicted_label = np.argmax(predictions_array)
    if predicted_label == true_label:
        color = 'blue'
    else:
        color = 'red'
    plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                         100*np.max(predictions_array),
                                         class_names[true_label]),
               color=color)
# Show the first 100 test images with their predicted and correct labels.
# Correct predictions are shown in blue, wrong predictions in red.
num_rows = 10
num_cols = 10
num_images = num_rows*num_cols
plt.figure(figsize=(2*num_cols, 2.5*num_rows))
for i in range(num_images):
    plt.subplot(num_rows, num_cols, i+1)
    plot_image(i, predictions, t_test, x_test)
plt.show()
Nine cats were mistaken for dogs, and six dogs were mistaken for cats. When the face is large and facing straight ahead, there seem to be no mistakes; sideways poses and small faces seem to invite them. Perhaps the point is that all cats have triangular ears, while many dogs have floppy ears?
That said, I wanted to know what the model was paying attention to in the images, so I decided to look at it with Grad-CAM.
Grad-CAM stands for Gradient-weighted Class Activation Mapping, so it seems to be called the gradient-weighted class activation mapping method. Is just "Grad-CAM" fine?
The "Grad" gradient came up with the loss function; here, it seems the feature maps with the larger gradients have the greatest effect on the classification. I will give up pursuing it further, run the program, and just look at the results.
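For the record, the core of Grad-CAM in the original paper (Selvaraju et al.) fits in two formulas: the gradients of the class score y^c with respect to the feature maps A^k of a convolution layer are global-average-pooled into channel weights, and the weighted sum of the feature maps goes through a ReLU:

$$\alpha_k^c = \frac{1}{Z}\sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Bigl(\sum_k \alpha_k^c A^k\Bigr)$$

In the code below, np.mean(guided_grads, axis=(0, 1)) computes the weights, np.dot(output, weights) the weighted sum, and np.maximum(cam, 0) plays the role of the ReLU; the gate_f * gate_r masking, which keeps only positive activations and positive gradients, is an extra guided-backpropagation-style touch from the referenced article.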
For the Grad-CAM calculation program, I referred to this: → I implemented Grad-CAM with keras and tensorflow
import numpy as np
import cv2
# For the Grad-CAM calculation
from tensorflow.keras import models
import tensorflow as tf

def grad_cam(input_model, x, layer_name):
    """
    Args:
        input_model(object): model object
        x(ndarray): image
        layer_name(string): name of the convolution layer
    Returns:
        output_image(ndarray): original image colored with the heatmap
    """
    # Image preprocessing
    # Only one image is passed in, so a batch dimension must be added or model.predict won't work
    h, w, c = x.shape
    IMAGE_SIZE = (h, w)
    X = np.expand_dims(x, axis=0)
    preprocessed_input = X.astype('float32') / 255.0

    grad_model = models.Model([input_model.inputs], [input_model.get_layer(layer_name).output, input_model.output])
    with tf.GradientTape() as tape:
        conv_outputs, predictions = grad_model(preprocessed_input)
        class_idx = np.argmax(predictions[0])
        loss = predictions[:, class_idx]

    # Compute the gradients
    output = conv_outputs[0]
    grads = tape.gradient(loss, conv_outputs)[0]
    gate_f = tf.cast(output > 0, 'float32')
    gate_r = tf.cast(grads > 0, 'float32')
    guided_grads = gate_f * gate_r * grads

    # Average the gradients per channel and weight the layer output with them
    weights = np.mean(guided_grads, axis=(0, 1))
    cam = np.dot(output, weights)

    # Scale the map up to the same size as the original image
    cam = cv2.resize(cam, IMAGE_SIZE, interpolation=cv2.INTER_LINEAR)
    # Instead of ReLU
    cam = np.maximum(cam, 0)
    # Compute the heatmap
    heatmap = cam / cam.max()

    # Pseudo-color the monochrome image
    jet_cam = cv2.applyColorMap(np.uint8(255.0*heatmap), cv2.COLORMAP_JET)
    # Convert to RGB
    rgb_cam = cv2.cvtColor(jet_cam, cv2.COLOR_BGR2RGB)
    # Blend with the original image
    output_image = (np.float32(rgb_cam) / 2 + x / 2)

    return output_image, rgb_cam
from tensorflow.keras.preprocessing.image import array_to_img, img_to_array, load_img

predictions = model.predict(x_test)

def hantei_hyouji(i, x_test, t_test, predictions, model):
    class_names = ['cat', 'dog']
    x = x_test[i]
    true_label = t_test[i]
    predictions_array = predictions[i]
    predicted_label = np.argmax(predictions_array)
    target_layer = 'conv2d_5'
    cam, heatmap = grad_cam(model, x, target_layer)
    moto = array_to_img(x, scale=True)          # original image
    hantei = array_to_img(heatmap, scale=True)  # heatmap alone
    hyouji = array_to_img(cam, scale=True)      # heatmap blended with the original
    print("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                    100*np.max(predictions_array),
                                    class_names[true_label]))
    row = 1
    col = 3
    plt.figure(figsize=(15, 15))
    plt.subplot(row, col, 1)
    plt.imshow(moto)
    plt.axis('off')
    plt.subplot(row, col, 2)
    plt.imshow(hantei)
    plt.axis('off')
    plt.subplot(row, col, 3)
    plt.axis('off')
    plt.imshow(hyouji)
    plt.show()
    return

for i in range(100):
    hantei_hyouji(i, dataset['test_img'], t_test, predictions, model)
And here are the results.
Apparently, for cats the model looks at the pattern and shape of the body rather than the ears or eyes, while for dogs it seems to pay more attention to the face, especially the nose. In the first example image, the model attends to the body shape, but ends up attending more to the face and misjudges the cat as a dog by a small margin. In the last example, it attends to the nose, yet seems to have been led to "cat" by attending to the body shape, probably the rounded back. In the second example, it attends more to the surroundings than to the cat itself; probably because no nose or other dog feature is visible at all, it decided on a cat.
References:
I implemented Grad-CAM with keras and tensorflow
I will explain the source code of Grad-CAM and Guided Grad-CAM
Part 17 ← → Part 19 / Click here for the table of contents of this memo / Unreadable Glossary