What I want to do

I want to understand Keras Conv2D.
I want to understand the code below (that is, to be able to explain what each function does and what each argument means).

from keras import layers, models
model = models.Sequential()
model.add(layers.Conv2D(32,(3,3),activation="relu",input_shape=(150,150,3)))

I also want to implement an image classification model in Python (for example, one that can distinguish a dog photo from a cat photo).

What you will understand after reading this article

You will roughly understand what convolution is.
You will roughly understand how to decide the values of the arguments passed to the Keras Conv2D function.
You will understand the meaning of "kernel", "filter", and "stride".

What is Conv2D?

If you search for "keras Conv2D", you will find the phrase "2D convolutional layer". So what is a "two-dimensional convolutional layer"? There is also the term "one-dimensional convolutional neural network". Therefore, before asking "What is the difference between 1D and 2D?", it is necessary to understand "convolutional neural network" and "convolution".

What is CNN?

CNN stands for Convolutional Neural Network.

Convolutional: based on convolution
Neural Network: a neural network

So a CNN is a "convolutional neural network".

Reference information to deepen your understanding of CNN

According to https://www.atmarkit.co.jp/ait/articles/1804/23/news138.html:

When it comes to deep learning on images, the major method is the CNN. CNN is an abbreviation of Convolutional Neural Network, a neural network that incorporates an operation called "convolution".
Convolution is the process of computing the sum of the products of the corresponding elements of a grid of numerical data called the kernel (or filter) and the numerical data of a partial image (called the window) of the same size as the kernel, thereby converting the window into a single number. By repeating this conversion while shifting the window little by little, the image is converted into a smaller grid of numerical data (that is, a tensor).

Basic way of thinking

What is an "image" in the first place?

An image file such as a jpg has a fixed number of pixels in width and height. For example, suppose you have a photo with width: 300px and height: 200px. If one pixel is represented by ■ (a square), the photo is an array of 300 x 200 = 60,000 ■. Likewise, an image with width: 5px and height: 5px is a 5 x 5 grid of 25 ■ in total.

Furthermore, in the case of a black-and-white image:

Each ■ is either black or white.
Black is represented by ■ (filled square) and white by □ (empty square).

With this notation, an "x" drawn in black on a white background becomes a 5 x 5 grid of ■ and □. A plus sign (+), a minus sign (-), and an equal sign (=) can each be drawn on the same grid in the same way.
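As a rough illustration of the idea that a black-and-white image is just a grid of pixels, the sketch below stores four 5 x 5 images as nested lists and prints them with ■ and □. The exact pixel layouts and the names X_SIGN, PLUS_SIGN, MINUS_SIGN, EQUAL_SIGN and render are assumptions made for this example, not taken from the original figures.

```python
# Each image is a 5 x 5 grid: 1 = black pixel (■), 0 = white pixel (□).
# The pixel layouts below are illustrative assumptions.
X_SIGN = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
PLUS_SIGN = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
MINUS_SIGN = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
EQUAL_SIGN = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]


def render(image):
    """Return the image as text, one line of ■/□ per pixel row."""
    return "\n".join(
        "".join("■" if pixel else "□" for pixel in row) for row in image
    )


for name, image in [("x", X_SIGN), ("+", PLUS_SIGN),
                    ("-", MINUS_SIGN), ("=", EQUAL_SIGN)]:
    print(name)
    print(render(image))
    print()
```

A real photo works the same way, only with many more pixels and with a number per color channel instead of a single 0 or 1.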

The idea of "focusing on small divisions and examining their characteristics"

Consider the "x" drawn in black on a white background.

What happens if you examine the features of this image data by focusing on small sections? For example, look at the part surrounded by the red frame and the part surrounded by the blue frame.

Both sections contain exactly the same pattern of pixels. In other words, the red frame part and the blue frame part have the same feature. Data that represents such a feature (a "feature detector") is called a "kernel" (sometimes called a filter; the meaning is the same).

In other words, if you want to understand the features of the 5 x 5 original image, subdivide the original image and compare each section with the 2 x 2 kernel. This is the basic idea behind judging what an image shows, or identifying the features of an image and how it differs from other images.
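The idea of comparing every small section with a kernel can be sketched in a few lines. The example below cuts a 5 x 5 "x" image into 2 x 2 windows and lists the positions whose pixels are identical to a 2 x 2 diagonal kernel. The image layout, the kernel, and the helper names window and matching_positions are assumptions chosen for illustration. A real convolution multiplies and sums rather than testing for equality, as described later.

```python
# 1 = black pixel, 0 = white pixel (assumed layout of the "x" image).
X_IMAGE = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]

# Hypothetical 2 x 2 kernel: a short diagonal stroke from upper left to lower right.
KERNEL = [
    [1, 0],
    [0, 1],
]


def window(image, top, left, size):
    """Cut a size x size section out of the image, starting at (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def matching_positions(image, kernel):
    """Return the (row, column) of every window identical to the kernel."""
    size = len(kernel)
    last_row = len(image) - size
    last_col = len(image[0]) - size
    return [
        (top, left)
        for top in range(last_row + 1)
        for left in range(last_col + 1)
        if window(image, top, left, size) == kernel
    ]


matches = matching_positions(X_IMAGE, KERNEL)
print(matches)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

The four matches lie along one diagonal of the "x", which is the sense in which different sections of the image "have the same feature".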

What is "convolution"?

To understand Conv2D, you need to understand the "two-dimensional convolutional layer", and to do that you first need to understand the "convolutional layer". So what is "convolution"?

Roughly speaking, it is as follows.

"Convolution" is the process of comparing each part of the original image with the kernel (filter), performing a calculation, and arranging the calculation results in order.
The output of a convolution is sometimes called a "feature map".
The data output by a convolution is smaller than the original image data (when no padding is added around the image).

Convolving a 5 x 5 original image with a 3 x 3 kernel (filter) outputs a feature map of 9 squares (3 x 3).

If you convolve a 5x5 original image with a 3x3 kernel while shifting the kernel by 1 square at a time (this is called "a stride of 1", where the stride is the number of pixels to shift), a total of 9 calculations are performed. Arranging the 9 results in order gives a feature map of 9 squares.

The red frame is the area being compared with the kernel, that is, the area of interest (called the "window"). The calculation is repeated while shifting the window by 1 square (1 pixel) at a time, from the upper left to the lower right of the original image. Since the calculation is performed 9 times in this case, the feature map has 9 squares (3 x 3). Shifting by one pixel at a time is called "a stride of 1". If you shift by 2 pixels at a time, the stride is 2.
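The scan from upper left to lower right can be written out as a small sketch. The function name window_positions is hypothetical. It lists the top-left corner of every window for a square image, a square kernel, and a given stride, assuming no padding.

```python
def window_positions(image_size, kernel_size, stride):
    """Top-left (row, column) of every window, scanned left to right, top to bottom."""
    starts = range(0, image_size - kernel_size + 1, stride)
    return [(row, col) for row in starts for col in starts]


positions = window_positions(image_size=5, kernel_size=3, stride=1)
print(positions)
print(len(positions))  # 9 windows, so the feature map has 9 squares (3 x 3)
```

Each position corresponds to one square of the feature map, which is why 9 window positions give a 3 x 3 result.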

Specific calculation example

Let's actually try the first calculation. The calculation is performed between the red frame part of the original image (the window) and the kernel.

By the way, the kernel used here is just an example. In an actual convolution, the height and width of the kernel can be set to sizes other than 3x3. Also note that a convolution uses multiple kinds of kernels, not just one (details are described later).

The calculation goes as follows.

Compare a part (window) of the original image with the kernel and multiply the elements at the same positions.
Add up all the values obtained by the multiplications.

The sum is the output. Note that this is an element-wise multiplication followed by a sum, not a matrix product. To make this concrete, let's assign numbers: black is -1 and white is 1.

The multiplications are performed in order from the upper-left cell to the lower-right cell (9 in total), as shown below.

-1 x  1 = -1 (top-left cells)
 1 x  1 =  1 (top-center cells)
 1 x  1 =  1 (top-right cells)
 1 x -1 = -1 (middle-left cells)
-1 x -1 =  1 (center cells)
 1 x -1 = -1 (middle-right cells)
 1 x  1 =  1 (bottom-left cells)
 1 x  1 =  1 (bottom-center cells)
-1 x  1 = -1 (bottom-right cells)

In each line, the left operand is the value of one cell in the window of the original image, and the right operand is the value of the corresponding cell in the kernel. Adding up all the products gives

SUM(-1, 1, 1, -1, 1, -1, 1, 1, -1)

so the result is 1. This 1 goes into the upper-left square of the feature map.
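The nine multiplications and the final sum can be checked with a short sketch. The window and kernel values are read off the left and right operands of the nine products listed above (black = -1, white = 1). The function name multiply_and_sum is made up for this example.

```python
# Window of the original image (left operands of the nine products).
WINDOW = [
    [-1,  1,  1],
    [ 1, -1,  1],
    [ 1,  1, -1],
]

# Kernel (right operands of the nine products).
KERNEL = [
    [ 1,  1,  1],
    [-1, -1, -1],
    [ 1,  1,  1],
]


def multiply_and_sum(window, kernel):
    """Multiply cells at the same position, then add all the products."""
    products = [
        w * k
        for window_row, kernel_row in zip(window, kernel)
        for w, k in zip(window_row, kernel_row)
    ]
    return products, sum(products)


products, total = multiply_and_sum(WINDOW, KERNEL)
print(products)  # [-1, 1, 1, -1, 1, -1, 1, 1, -1]
print(total)     # 1
```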

If you continue the calculation in this way, values are entered in the remaining 8 squares of the feature map. Performing this series of calculations is "convolution". In other words, convolution is the work of calculating with the original image and the kernel and writing the results into the feature map.

However, performing such a convolution by hand is tedious, so in practice it is computed with a function such as Keras' Conv2D.
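To get a feel for what such a function automates, here is a minimal pure-Python sketch of the whole convolution with a stride of 1 and no padding. It is not how Keras implements Conv2D. The 5 x 5 image is an assumed "x" pattern (black = -1, white = 1), and the names IMAGE, KERNEL and convolve are chosen for this example.

```python
# Assumed 5 x 5 "x" image: black = -1, white = 1.
IMAGE = [
    [-1,  1,  1,  1, -1],
    [ 1, -1,  1, -1,  1],
    [ 1,  1, -1,  1,  1],
    [ 1, -1,  1, -1,  1],
    [-1,  1,  1,  1, -1],
]

# Example 3 x 3 kernel: a black horizontal line between two white lines.
KERNEL = [
    [ 1,  1,  1],
    [-1, -1, -1],
    [ 1,  1,  1],
]


def convolve(image, kernel, stride=1):
    """Slide the kernel over the image and collect one sum of products per window."""
    k = len(kernel)
    row_starts = range(0, len(image) - k + 1, stride)
    col_starts = range(0, len(image[0]) - k + 1, stride)
    feature_map = []
    for top in row_starts:
        out_row = []
        for left in col_starts:
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


feature_map = convolve(IMAGE, KERNEL)
for row in feature_map:
    print(row)
# [1, 5, 1]
# [1, -3, 1]
# [1, 5, 1]
```

The upper-left value is 1, the same result as the hand calculation of the first window.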

The meaning of the arguments passed to the Keras function Conv2D()

Let's return to the sample code at the beginning.

from keras import layers, models
model = models.Sequential()
model.add(layers.Conv2D(32,(3,3),activation="relu",input_shape=(150,150,3)))

Let's investigate what the arguments of the Conv2D() call used here mean:

Conv2D(32,(3,3),activation="relu",input_shape=(150,150,3))

Four arguments are passed.

Conv2D(
  32,
  (3,3),
  activation="relu",
  input_shape=(150,150,3)
)

The official Keras documentation (https://keras.io/ja/layers/convolutional/#conv2d) gives the following signature.

keras.layers.Conv2D(
  filters,
  kernel_size,
  strides=(1, 1),
  padding='valid',
  data_format=None,
  dilation_rate=(1, 1),
  activation=None,
  use_bias=True,
  kernel_initializer='glorot_uniform',
  bias_initializer='zeros',
  kernel_regularizer=None,
  bias_regularizer=None,
  activity_regularizer=None,
  kernel_constraint=None,
  bias_constraint=None
)

Let's start with the first argument. The official documentation describes it as follows.

filters: An integer, the dimensionality of the output space (that is, the number of output filters in the convolution).

This code passes 32. In other words, it specifies that "the number of output filters is 32". So what is an "output filter"?

What is a "filter" in the first place?

The kernel used in convolution was explained above. The important point here is that the "kernel" is sometimes called a "filter". In other words, the first argument, filters, is a setting related to the kernel.

At https://qastack.jp/stats/154798/difference-between-kernel-and-filter-in-cnn, the following question and answer appear.

Question: What is the difference between a "kernel" and a "filter" in a convolutional neural network?
Answer: They mean the same thing. The kernel is sometimes called a filter.

Therefore, the conclusion is:

"Kernel", "filter", and "feature detector" all mean the same thing.

If so, "the number of output filters is 32" means "the number of output kernels is 32".

Review of convolution

Consider convolving a 5x5 input image with a 3x3 filter (also called a kernel). If you shift the filter one square at a time, the calculation is performed 9 times in total, so the result (the feature map) has 9 squares (3x3).

(By the way, a convolution that slides one square at a time like this is described as having a stride of 1. The larger the stride value, the fewer calculations are performed.)

What is a stride?

The stride is the value that specifies how many squares the window is shifted for each calculation.

If the stride is 1, the window moves one square at a time.

If the stride is 2, the window moves two squares at a time.

Then, what are the height and width of the feature map when convolving under the following conditions?

The input image is 25 x 25.
The filter (kernel) is 5 x 5.
The stride is 2.

The answer is 11 x 11. You can confirm it by drawing a grid in a spreadsheet and counting while shifting the filter by hand. The 25 x 25 grid is the input image, and the pink frame (5x5) laid over it is the filter (kernel). Since the stride is 2, the filter is shifted by 2 squares at a time, and it reaches the right edge at the 11th calculation. This matches (25 - 5) / 2 + 1 = 11. The vertical direction is the same, so the feature map is 11 x 11.
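Instead of counting on a spreadsheet, the same answer can be computed. The sketch below assumes no padding and uses two hypothetical helpers: window_starts lists where the filter's left edge lands, and feature_map_size applies the formula (image size - kernel size) / stride + 1, rounded down.

```python
def window_starts(image_size, kernel_size, stride):
    """Left-edge positions of the filter as it slides along one axis."""
    return list(range(0, image_size - kernel_size + 1, stride))


def feature_map_size(image_size, kernel_size, stride):
    """Number of filter positions along one axis (no padding)."""
    return (image_size - kernel_size) // stride + 1


starts = window_starts(image_size=25, kernel_size=5, stride=2)
print(starts)       # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(len(starts))  # 11

print(feature_map_size(25, 5, 2))  # 11, so the feature map is 11 x 11
print(feature_map_size(5, 3, 1))   # 3, the 5x5 image / 3x3 kernel example
```

The last window starts at square 20 and covers squares 20 to 24, which is exactly the right edge of the 25-square row.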

How to decide the arguments to pass to the Conv2D function

Based on the above knowledge, consider the parameters required to perform a convolution. Specifically, the following questions need to be answered.

Question (1): What is the number of vertical and horizontal pixels of the kernel (filter) you want to use for convolution?
Question (2): What is the number of vertical and horizontal pixels of the image you want to identify by convolution (that is, the input image)?
Question (3): What is the stride value? (How many pixels?)

There may be other questions, but answering these amounts to deciding the values of the arguments passed to the function.

How to determine the vertical and horizontal size of the filter (kernel)

The following is an excerpt from https://child-programmer.com/ai/keras/conv2d/.

Explanation of Conv2D(16, (3, 3)
: It means that 16 filters of size "3x3" are used (16 kinds of "3x3" filters).
Odd sizes, for which a center can be determined, such as "5x5" and "7x7", seem to be easy to use.
The number of filters tends to be a value such as 16, 32, 64, 128, 256, or 512.
It seems best to try a large number of filters for problems that look complicated and a small number for problems that look easy.

Here, be careful not to confuse the two values related to the filter:

The height and width of one filter, in pixels (the size)
The number of filters of that size that are used (the count)

The height and width are as explained so far. In the 25 x 25 example above, the height and width of the filter are 5 x 5 (the pink area is a square of 5x5 = 25 pixels).

So what does "the number of filters" mean? A convolution uses more than one kind of filter, because one kind of filter represents only one feature. For example, many different 3x3 filters are possible, each with a different pattern. How many of these different kinds of filters are used is the "number of filters".

In summary,

Conv2D(16, (3, 3)

is the instruction "convolve using 16 filters (16 kinds), each 3x3 pixels in height and width".

Supplement on "number of filters"

If you want to know more about what it means to convolve with multiple filters, for example 16 kinds, see "Convolutional layer" at https://products.sint.co.jp/aisia/blog/vol1-16. The following is an excerpt.

Filters are created automatically and change through learning (error backpropagation).
As many feature maps are output as there are filters.

"As many feature maps are output as there are filters" means that convolving with 16 kinds of filters outputs 16 feature maps.

For simplicity, consider the case where the convolution is performed with three filters.

For example, suppose the filter (pink area) is 2x2 and the feature map (green area) is 3x3.

If there is only one kind of filter (pink area), only one feature map (green area) is output.

However, if you prepare three kinds of filters, the calculation is performed for each kind and each produces a different result, so three feature maps are output.
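The "one feature map per filter" behavior can be sketched with three hand-written 2 x 2 filters applied to a 4 x 4 image, which gives three 3 x 3 feature maps. The image, the filter values, and the names IMAGE, FILTERS and convolve are all assumptions for illustration. In Keras the filter values are not written by hand but learned during training, as the excerpt above says.

```python
# Assumed 4 x 4 image containing one diagonal stroke (1 = black, 0 = white).
IMAGE = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Three hypothetical 2 x 2 filters, each looking for a different feature.
FILTERS = [
    [[1, 0], [0, 1]],  # diagonal from upper left to lower right
    [[0, 1], [1, 0]],  # diagonal from upper right to lower left
    [[1, 1], [0, 0]],  # horizontal stroke
]


def convolve(image, kernel):
    """Stride-1 convolution without padding: one sum of products per window."""
    k = len(kernel)
    size = len(image) - k + 1
    return [
        [
            sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for left in range(size)
        ]
        for top in range(size)
    ]


feature_maps = [convolve(IMAGE, kernel) for kernel in FILTERS]

print(len(feature_maps))  # 3 filters give 3 feature maps
for feature_map in feature_maps:
    print(feature_map)
```

The first filter responds strongly along the diagonal of the image while the others respond differently, so each feature map records where a different feature appears.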

Take a look at the sample code at the beginning

Here is the sample code from the beginning again.

from keras import layers, models
model = models.Sequential()
model.add(layers.Conv2D(32,(3,3),activation="relu",input_shape=(150,150,3)))

In this code,

Conv2D(32,(3,3)

is the instruction "convolve using 32 filters (kernels), each 3x3".

This covers how to decide the answer to

Question (1): What is the number of vertical and horizontal pixels of the kernel (filter) you want to use for convolution?

(that is, how to pass the corresponding arguments).

Next, consider

Question (2): What is the number of vertical and horizontal pixels of the image you want to identify by convolution (that is, the input image)?

What is input_shape?

The following is an excerpt from https://child-programmer.com/ai/keras/conv2d/.

Explanation of input_shape=(28, 28, 1)
: A grayscale (black-and-white) image of 28 pixels vertically and 28 pixels horizontally is input.

In other words,

input_shape=(150,150,3)

in the sample code at the beginning means that the input image is 150 x 150 pixels in height and width. So what does the 3 mean?

The official documentation (https://keras.io/ja/layers/convolutional/#conv2d) says:

For RGB images, input_shape=(128, 128, 3).

1 for black-and-white images, 3 for RGB

Therefore, the third value is the number of color channels (for RGB, the 3 colors red, green, and blue). An ordinary color photo (.jpg) is usually RGB, so specifying 3 is fine.
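Putting the pieces together, the sketch below estimates what the sample layer Conv2D(32, (3, 3), input_shape=(150, 150, 3)) produces. The two helper functions are hypothetical and are not part of Keras. They assume the defaults shown in the signature earlier (strides=(1, 1), padding='valid', use_bias=True), a (height, width, channels) input, and that each filter covers all input channels.

```python
def conv2d_output_shape(input_shape, filters, kernel_size, strides=(1, 1)):
    """Output (height, width, channels) of one image, assuming padding='valid'."""
    height, width, _channels = input_shape
    out_height = (height - kernel_size[0]) // strides[0] + 1
    out_width = (width - kernel_size[1]) // strides[1] + 1
    # One feature map per filter, so the channel count equals `filters`.
    return (out_height, out_width, filters)


def conv2d_param_count(input_shape, filters, kernel_size, use_bias=True):
    """Number of trainable values: filter weights plus one bias per filter."""
    channels = input_shape[2]
    weights = kernel_size[0] * kernel_size[1] * channels * filters
    biases = filters if use_bias else 0
    return weights + biases


input_shape = (150, 150, 3)
print(conv2d_output_shape(input_shape, filters=32, kernel_size=(3, 3)))
# (148, 148, 32)
print(conv2d_param_count(input_shape, filters=32, kernel_size=(3, 3)))
# 896
```

Keras itself reports one more leading dimension for the batch, so the layer's output shape appears as (None, 148, 148, 32).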

What is activation?

In the sample code

model.add(layers.Conv2D(32,(3,3),activation="relu",input_shape=(150,150,3)))

what does

activation="relu"

mean?

The explanation at https://child-programmer.com/ai/keras/conv2d/ is as follows.

Explanation of activation="relu"
: The activation function "ReLU (Rectified Linear Unit)", also called the ramp function.
It is applied to the filtered image. The output is 0 when the input is 0 or less; when the input is greater than 0, the input is output as it is.

The explanation at https://keras.io/ja/layers/convolutional/#conv2d is as follows.

activation: Name of the activation function to use (see activations).
If nothing is specified, no activation is applied.

In other words, activation="relu" is the instruction "use ReLU as the activation function".
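The ReLU rule quoted above (0 for inputs of 0 or less, the input itself otherwise) is short enough to write out. This is a minimal sketch for intuition, not the Keras implementation, and the feature-map values are made up for the example.

```python
def relu(value):
    """ReLU: 0 when the input is 0 or less, otherwise the input unchanged."""
    return max(0, value)


# Illustrative feature map containing one negative value.
feature_map = [
    [1, 5, 1],
    [1, -3, 1],
    [1, 5, 1],
]

activated = [[relu(value) for value in row] for row in feature_map]
for row in activated:
    print(row)
# [1, 5, 1]
# [1, 0, 1]
# [1, 5, 1]
```

Only the negative value changes (-3 becomes 0); every positive value passes through as it is.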

What is an "activation function"?

An "activation function" is a function that performs activation. So what is "activation"? Below are some statements collected to help understand it.

The activation function is indispensable for neural networks. https://qiita.com/omiita/items/bfbba775597624056987
The de facto standard activation function is "ReLU". https://qiita.com/omiita/items/bfbba775597624056987
The activation function is used to increase the expressiveness of the model. https://ai-trend.jp/basic-study/neural-network/activation_function/
Typical activation functions include the "step function", the "sigmoid function", and the "ReLU function". https://ai-trend.jp/basic-study/neural-network/activation_function/

In summary: "Specifying an activation function increases the expressiveness of the model (you can build a smarter AI), so specify one", and "ReLU is the standard choice".

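Specifying the stride

Question (3): What is the stride value? (How many pixels?)

This is specified with the strides argument, for example:

strides = 1

The sample code at the beginning does not pass strides, so the default shown in the signature above, strides=(1, 1), is used. For details, see https://keras.io/ja/layers/convolutional/#conv2d.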

Summary

As described above, I now roughly understand what

model.add(layers.Conv2D(32,(3,3),activation="relu",input_shape=(150,150,3)))

is doing and what each argument means. Since the purpose of this chapter is "understanding Keras Conv2D (2D convolutional layer)", I will stop here. Sequential() and MaxPooling2D() will be investigated in a separate chapter.

[PYTHON] I investigated Keras's Conv2D (2D convolutional layer)
