On a whim, I started studying "Deep Learning from scratch - The theory and implementation of deep learning learned with Python". This is a memo of the points where I stumbled in Chapter 3.

The execution environment is macOS Mojave + Anaconda 2019.10. For details, refer to Chapter 1 of this memo.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)

This chapter describes how neural networks work.

This section explains how layers are counted and how a perceptron differs from a neural network. It is a little inconvenient that the way of counting layers differs from person to person.

This section introduces the types of activation functions. I tried graphing the three functions that appear.

```
# coding: utf-8
import numpy as np
import matplotlib.pylab as plt


def step_function(x):
    """Step function that returns 1 if the input exceeds 0

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    # np.int was removed in NumPy 1.24; the built-in int works as the dtype
    return np.array(x > 0, dtype=int)


def sigmoid(x):
    """Sigmoid function

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    return 1 / (1 + np.exp(-x))


def relu(x):
    """ReLU function

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    return np.maximum(0, x)


# Calculation
x = np.arange(-5.0, 5.0, 0.01)  # small step so that the step function does not look diagonal
y_step = step_function(x)
y_sigmoid = sigmoid(x)
y_relu = relu(x)

# Graph drawing
plt.plot(x, y_step, label="step")
plt.plot(x, y_sigmoid, linestyle="--", label="sigmoid")
plt.plot(x, y_relu, linestyle=":", label="ReLU")
plt.ylim(-0.1, 5.1)
plt.legend()
plt.show()
```

The only thing I did not fully understand was why the activation function must not be a linear function. From the book's explanation I understood that, if each layer has only one neuron, a multi-layer network with a linear activation can be expressed as a single layer. But can it still be expressed as a single layer when each layer has multiple neurons? That part I could not quite follow.
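One way to check the multi-neuron case numerically: with a linear activation h(a) = c * a, two layers fold into one because the product of two weight matrices is itself a single matrix. The following is only a sketch under assumptions. The layer sizes (3 → 4 → 2), the random weights, and the names `W1`, `b1`, `W2`, `b2`, `c` are all made up for illustration and do not come from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 inputs -> 4 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
c = 0.5  # slope of the linear "activation" h(a) = c * a


def two_linear_layers(x):
    """Two layers whose activation is the linear function h(a) = c * a."""
    z1 = c * (np.dot(x, W1) + b1)
    return c * (np.dot(z1, W2) + b2)


# The same mapping folded into a single weight matrix and bias
W = c * c * np.dot(W1, W2)
b = c * c * np.dot(b1, W2) + c * b2


def one_layer(x):
    """Single layer that is equivalent to the two linear layers above."""
    return np.dot(x, W) + b


x = rng.normal(size=(5, 3))  # batch of 5 inputs
print(np.allclose(two_linear_layers(x), one_layer(x)))  # True
```

With a non-linear activation such as the sigmoid, no such single `W` and `b` exist in general, which is why the extra layers add expressive power.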

This section explains how to replace calculations on multidimensional arrays with matrix calculations to improve efficiency. I studied this replacement when I took an online machine learning course[^1] about three years ago, and used it afterwards in the 100 language processing knocks[^2], so this was a review for me.
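As a small illustration of that replacement, the weighted sums of one layer can be written either with explicit loops or as a single matrix product. This is a sketch with made-up sizes (4 samples, 3 inputs, 2 neurons), and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 samples, 3 inputs (made-up sizes)
W = rng.normal(size=(3, 2))  # 3 inputs -> 2 neurons


def weighted_sums_loop(X, W):
    """Weighted sums computed element by element with Python loops."""
    out = np.zeros((X.shape[0], W.shape[1]))
    for i in range(X.shape[0]):          # each sample
        for j in range(W.shape[1]):      # each neuron
            for k in range(X.shape[1]):  # each input
                out[i, j] += X[i, k] * W[k, j]
    return out


def weighted_sums_matrix(X, W):
    """The same weighted sums as a single matrix product."""
    return np.dot(X, W)


print(np.allclose(weighted_sums_loop(X, W), weighted_sums_matrix(X, W)))  # True
```

The matrix version is shorter and leaves the looping to NumPy's compiled code.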

This section implements a 3-layer neural network using the matrix calculation from the previous section. There were no particular stumbling blocks, since the network has no learning functionality yet.
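For reference, a forward pass through three layers might look like the sketch below. The layer sizes (2 → 3 → 2 → 2), the random placeholder weights, the identity output, and the helper names `init_network` and `forward` are assumptions for illustration, not the book's listing.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def init_network(rng):
    """Random placeholder parameters; sizes 2 -> 3 -> 2 -> 2 are assumed."""
    sizes = [2, 3, 2, 2]
    network = {}
    for n, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
        network[f"W{n}"] = rng.normal(size=(n_in, n_out))
        network[f"b{n}"] = rng.normal(size=n_out)
    return network


def forward(network, x):
    """Forward pass: two sigmoid layers followed by an identity output."""
    z1 = sigmoid(np.dot(x, network["W1"]) + network["b1"])
    z2 = sigmoid(np.dot(z1, network["W2"]) + network["b2"])
    return np.dot(z2, network["W3"]) + network["b3"]


network = init_network(np.random.default_rng(0))
y = forward(network, np.array([1.0, 0.5]))
print(y.shape)  # (2,)
```

Because every step is a matrix product plus a broadcast bias, the same `forward` also accepts a batch of shape `(N, 2)`.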

This section explains the softmax function. ~~There was no particular stumbling block here either.~~ I did not think I had stumbled here, but I later noticed a mistake. Without batch processing the implementation in the book works fine, but it needs to be modified when implementing "3.6.3 Batch processing".

Below is the softmax code that I modified to support batch processing.

```
def softmax(x):
    """Softmax function

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    # For batch processing, x is a two-dimensional array of shape (batch size, 10).
    # In this case the calculation must be done per image, using broadcasting.
    if x.ndim == 2:

        # Take the maximum for each image (axis=1) and reshape it for broadcasting
        c = np.max(x, axis=1).reshape(x.shape[0], 1)

        # Compute the numerator, subtracting the maximum as an overflow countermeasure
        exp_a = np.exp(x - c)

        # The denominator is also summed per image (axis=1) and reshaped for broadcasting
        sum_exp_a = np.sum(exp_a, axis=1).reshape(x.shape[0], 1)

        # Calculated for each image
        y = exp_a / sum_exp_a

    else:

        # Without batch processing, implement it as in the book
        c = np.max(x)
        exp_a = np.exp(x - c)  # overflow countermeasure
        sum_exp_a = np.sum(exp_a)
        y = exp_a / sum_exp_a

    return y
```
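To see what goes wrong with a batch, the sketch below compares a softmax that takes the maximum and the sum over the whole array with a row-wise one. `softmax_single` is my stand-in for a single-sample implementation and `softmax_rows` is a hypothetical alternative that uses `keepdims=True` instead of `reshape`. Neither is the book's code, and the input values are made up.

```python
import numpy as np


def softmax_single(x):
    """Softmax written for one sample: max and sum over the whole array."""
    exp_a = np.exp(x - np.max(x))
    return exp_a / np.sum(exp_a)


def softmax_rows(x):
    """Row-wise softmax; keepdims=True keeps a trailing axis for broadcasting."""
    exp_a = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_a / np.sum(exp_a, axis=-1, keepdims=True)


batch = np.array([[1.0, 2.0, 3.0],
                  [3.0, 2.0, 1.0]])

print(softmax_single(batch).sum(axis=1))  # about 0.5 per row, not 1
print(softmax_rows(batch).sum(axis=1))    # 1 per row
```

The single-sample version divides every row by the same positive constant, so the `argmax` of each row is unchanged. Only the probability values themselves come out wrong.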

In addition, one of the sources in this book's GitHub repository https://github.com/oreilly-japan/deep-learning-from-scratch transposes the array so that broadcasting works. That may be faster, but at first glance I could not tell what it was doing, so I wrote code that uses `reshape` instead.

This section actually implements the inference process of a neural network using trained parameters. It needs `sample_weight.pkl`, which stores the trained parameters, so copy the files in the `ch03` folder of this book's GitHub repository [https://github.com/oreilly-japan/deep-learning-from-scratch](https://github.com/oreilly-japan/deep-learning-from-scratch) to the current directory.

As I proceeded with the implementation according to the book, I ran into an overflow warning.

```
/Users/segavvy/Documents/deep-learning-from-scratch/ch03/3.6_mnist.py:19: RuntimeWarning: overflow encountered in exp
return 1 / (1 + np.exp(-x))
```

To deal with this, I referred to the explanation at Meeting Machine Learning with Python >> Logistic Regression >> Sigmoid Function and clipped the value of x so that it does not overflow.
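The effect of the clipping can be checked in isolation. In the sketch below, `sigmoid_naive` and `sigmoid_clipped` are illustrative names, the input values are made up, and the constant is the same one used in the full listing further down (it is about ln(1e15)). `np.errstate(over="raise")` turns the overflow warning into an exception so that it is easy to detect.

```python
import numpy as np

SIGMOID_RANGE = 34.538776394910684  # about ln(1e15)


def sigmoid_naive(x):
    """Sigmoid without any protection against overflow in np.exp."""
    return 1 / (1 + np.exp(-x))


def sigmoid_clipped(x):
    """Sigmoid with the input clipped so that np.exp cannot overflow."""
    x = np.clip(x, -SIGMOID_RANGE, SIGMOID_RANGE)
    return 1 / (1 + np.exp(-x))


x = np.array([-1000.0, 0.0, 1000.0])

# Raise an exception on overflow instead of only printing a warning
with np.errstate(over="raise"):
    try:
        sigmoid_naive(x)
    except FloatingPointError as e:
        print("naive:", e)
    print("clipped:", sigmoid_clipped(x))
```

`np.clip(x, lo, hi)` does the same job as the nested `np.minimum` / `np.maximum` calls. The clipped result is never exactly 0 or 1, but it is within about 1e-15 of them.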

Also, when calculating the final recognition accuracy, the book converts `accuracy_cnt` to `float`, but in Python 3 division between integers returns a floating-point number, so this conversion seems unnecessary.

Also, while implementing this, I was curious about what kind of images the network fails to recognize, so I displayed them.
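A quick check of the division behaviour, using made-up numbers:

```python
import numpy as np

# Python 3: "/" is true division, even between two ints
ratio = 9352 / 10000
print(type(ratio))

# np.sum over a boolean array returns a NumPy integer; dividing it
# by a Python int also gives a floating-point result
correct = np.sum(np.array([1, 2, 3]) == np.array([1, 0, 3]))
print(correct / 3)
```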

Below is the code I wrote.

```
# coding: utf-8
import numpy as np
import os
import pickle
import sys
sys.path.append(os.pardir)  # Add parent directory to path
from dataset.mnist import load_mnist
from PIL import Image


def sigmoid(x):
    """Sigmoid function
    The implementation in the book overflows, so it is corrected
    by referring to the following site.
    http://www.kamishima.net/mlmpyja/lr/sigmoid.html

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    # Clip x to a range that does not overflow
    sigmoid_range = 34.538776394910684
    x2 = np.maximum(np.minimum(x, sigmoid_range), -sigmoid_range)

    # Sigmoid function
    return 1 / (1 + np.exp(-x2))


def softmax(x):
    """Softmax function

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    # For batch processing, x is a two-dimensional array of shape (batch size, 10).
    # In this case the calculation must be done per image, using broadcasting.
    if x.ndim == 2:

        # Take the maximum for each image (axis=1) and reshape it for broadcasting
        c = np.max(x, axis=1).reshape(x.shape[0], 1)

        # Compute the numerator, subtracting the maximum as an overflow countermeasure
        exp_a = np.exp(x - c)

        # The denominator is also summed per image (axis=1) and reshaped for broadcasting
        sum_exp_a = np.sum(exp_a, axis=1).reshape(x.shape[0], 1)

        # Calculated for each image
        y = exp_a / sum_exp_a

    else:

        # Without batch processing, implement it as in the book
        c = np.max(x)
        exp_a = np.exp(x - c)  # overflow countermeasure
        sum_exp_a = np.sum(exp_a)
        y = exp_a / sum_exp_a

    return y


def load_test_data():
    """Get the MNIST test images and test labels
    Image values are normalized to 0.0-1.0.

    Returns:
        numpy.ndarray, numpy.ndarray: test images, test labels
    """
    (x_train, t_train), (x_test, t_test) \
        = load_mnist(flatten=True, normalize=True)
    return x_test, t_test


def load_sample_network():
    """Get the sample trained weight parameters

    Returns:
        dict: weight and bias parameters
    """
    with open("sample_weight.pkl", "rb") as f:
        network = pickle.load(f)
    return network


def predict(network, x):
    """Inference by the neural network

    Args:
        network (dict): weight and bias parameters
        x (numpy.ndarray): input to the neural network

    Returns:
        numpy.ndarray: output of the neural network
    """
    # Parameter retrieval
    W1, W2, W3 = network['W1'], network['W2'], network['W3']
    b1, b2, b3 = network['b1'], network['b2'], network['b3']

    # Neural network calculation (forward)
    a1 = np.dot(x, W1) + b1
    z1 = sigmoid(a1)
    a2 = np.dot(z1, W2) + b2
    z2 = sigmoid(a2)
    a3 = np.dot(z2, W3) + b3
    y = softmax(a3)

    return y


def show_image(img):
    """Image display

    Args:
        img (numpy.ndarray): image bitmap
    """
    pil_img = Image.fromarray(np.uint8(img))
    pil_img.show()


# Read MNIST test data
x, t = load_test_data()

# Read sample weight parameters
network = load_sample_network()

# Inference and recognition accuracy calculation
batch_size = 100    # batch processing unit
accuracy_cnt = 0    # number of correct answers
error_image = None  # misrecognized images, concatenated
for i in range(0, len(x), batch_size):

    # Batch data preparation
    x_batch = x[i:i + batch_size]

    # Inference
    y_batch = predict(network, x_batch)
    p = np.argmax(y_batch, axis=1)

    # Correct answer count
    accuracy_cnt += np.sum(p == t[i:i + batch_size])

    # Append the misrecognized images to error_image
    for j in range(len(x_batch)):  # also safe if the last batch is shorter
        if p[j] != t[i + j]:
            if error_image is None:
                error_image = x_batch[j]
            else:
                error_image = np.concatenate((error_image, x_batch[j]), axis=0)

print("Recognition accuracy:" + str(accuracy_cnt / len(x)))

# Display the misrecognized images
error_image *= 255  # values are normalized to 0.0-1.0, so scale back to 0-255 for display
show_image(error_image.reshape(28 * (len(x) - accuracy_cnt), 28))
```

And here is the execution result.

```
Recognition accuracy:0.9352
```

Since the ~~793~~ 648 misrecognized images are simply concatenated vertically, a ridiculously long image is displayed, and many of the characters are certainly hard to read. However, some of them look like they should be recognizable.

~~Also, the book says the recognition accuracy will be `0.9352`, but for some reason it came out as `0.9207`. Even when I reverted the sigmoid function to the version that produced the warning, the result did not change, so there may be some other mistake...~~

~~Chapter 3 was also largely a review for me, so there were no major stumbling blocks, but the difference in recognition accuracy at the end bothers me.~~ I did not think I had stumbled in Chapter 3, but I later noticed the following two points.

As @tunnel pointed out, I found out why the recognition accuracy differed from the book! The image data should have been normalized to 0.0 to 1.0, but I was using the raw 0 to 255 values. Thank you, @tunnel! Even so, with input values that far off I would have expected the recognition accuracy to fall apart, so it is interesting that it did not get that much worse.

In Chapter 4, for some reason the loss function did not decrease no matter how much training I ran, and while investigating the cause I noticed that the softmax function did not support batch processing. The code above has been fixed. (I would have been happy if Chapter 3 had explained this a little more...)


[^1]: This is the Machine Learning course provided by Stanford University on the online course service Coursera. Volunteers added Japanese subtitles, so it was manageable even though I am not good at English. The technique of replacing array calculations with matrix calculations is described there under the name Vectorization.

[^2]: I used it when I solved problem 73 in Chapter 8 of 100 Language Processing Knock 2015. My learning notes from that time are posted as [100 amateur language processing knocks: 73](https://qiita.com/segavvy/items/5ad0d5742a674bdf56cc#%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E5%8C%96).
