[PYTHON] Try to build a deep learning / neural network from scratch

1. Introduction

This time, in order to understand forward propagation and backpropagation, I will implement a working neural network from scratch. The dataset is MNIST.

2. Neural network specifications

The neural network has 28 x 28 = 784 input nodes, 10 intermediate (hidden) nodes, and 2 output nodes. The activation function is the sigmoid, the error function is the squared error, and the optimization method is gradient descent.

The dataset is obtained by extracting the digits "1" and "7" from MNIST, and the network performs binary classification on them.

3. Forward propagation

First, when calculating $a^0_0$, there are 784 inputs $x^0_0$ through $x^0_{783}$, each multiplied by one of the weights $w^0_{00}$ through $w^0_{783,0}$, so

$$a^0_0 = w^0_{00}x^0_0 + w^0_{10}x^0_1 + \cdots + w^0_{783,0}x^0_{783} + b^0_0$$

Expressed as a matrix, all the calculations from $a^0_0$ to $a^0_9$ can be represented at once:

$$
\begin{pmatrix} a^0_0 \\ \vdots \\ a^0_9 \end{pmatrix}
=
\begin{pmatrix} w^0_{00} & \cdots & w^0_{783,0} \\ \vdots & \ddots & \vdots \\ w^0_{09} & \cdots & w^0_{783,9} \end{pmatrix}
\begin{pmatrix} x^0_0 \\ \vdots \\ x^0_{783} \end{pmatrix}
+
\begin{pmatrix} b^0_0 \\ \vdots \\ b^0_9 \end{pmatrix}
$$

Since $x^1_0$ through $x^1_9$ are the result of passing $a^0_0$ through $a^0_9$ through the sigmoid activation function,

$$x^1_k = \mathrm{sigmoid}(a^0_k) = \frac{1}{1 + e^{-a^0_k}} \qquad (k = 0, \dots, 9)$$

Next, when calculating $a^1_0$, there are 10 inputs $x^1_0$ through $x^1_9$, each multiplied by one of the weights $w^1_{00}$ through $w^1_{90}$, so

$$a^1_0 = w^1_{00}x^1_0 + w^1_{10}x^1_1 + \cdots + w^1_{90}x^1_9 + b^1_0$$

As before, representing $a^1_0$ and $a^1_1$ as a matrix,

$$
\begin{pmatrix} a^1_0 \\ a^1_1 \end{pmatrix}
=
\begin{pmatrix} w^1_{00} & \cdots & w^1_{90} \\ w^1_{01} & \cdots & w^1_{91} \end{pmatrix}
\begin{pmatrix} x^1_0 \\ \vdots \\ x^1_9 \end{pmatrix}
+
\begin{pmatrix} b^1_0 \\ b^1_1 \end{pmatrix}
$$

Finally, $y^0$ and $y^1$ are

$$y^j = \mathrm{sigmoid}(a^1_j) \qquad (j = 0, 1)$$

In this way, forward propagation can be performed simply with matrix products and additions.
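
As a quick check, the following is a minimal NumPy sketch of this forward pass, using dummy random parameters whose shapes match the network above (the variable names here are illustrative, not the ones used in the final implementation).

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Minimal forward-pass sketch for the 784 -> 10 -> 2 network,
# with dummy random weights and biases just for illustration
rng = np.random.default_rng(0)
x0 = rng.random((784, 1))                          # flattened 28 x 28 input
W0, b0 = rng.normal(size=(10, 784)), np.ones((10, 1))
W1, b1 = rng.normal(size=(2, 10)), np.ones((2, 1))

a0 = W0.dot(x0) + b0     # weighted sums of the intermediate layer
x1 = sigmoid(a0)         # intermediate-layer outputs
a1 = W1.dot(x1) + b1     # weighted sums of the output layer
y = sigmoid(a1)          # network outputs y^0, y^1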

4. Error back propagation (intermediate layer to output layer)

First, we update the weights and biases from the intermediate layer to the output layer.

The update rule for a weight $w$ can be written as $w = w - \eta \frac{\partial E}{\partial w}$, where $\eta$ is the learning rate and $\frac{\partial E}{\partial w}$ is the derivative of the error $E$ with respect to the weight $w$.

To implement this, let's work out $\frac{\partial E}{\partial w}$ for a concrete example and then express it as a general formula. First, the weights from the intermediate layer to the output layer.

To update the weight $w^1_{00}$, we find $\frac{\partial E^0}{\partial w^1_{00}}$, where the error for output $j$ is taken as $E^j = \frac{1}{2}(y^j - t^j)^2$. From the chain rule of differentiation,

$$\frac{\partial E^0}{\partial w^1_{00}} = \frac{\partial E^0}{\partial y^0}\,\frac{\partial y^0}{\partial a^1_0}\,\frac{\partial a^1_0}{\partial w^1_{00}} = (y^0 - t^0)\,\mathrm{sigmoid}'(a^1_0)\,x^1_0$$

Expressed as a general formula with $k = 0$ to $9$ and $j = 0$ to $1$,

$$\frac{\partial E^j}{\partial w^1_{kj}} = (y^j - t^j)\,\mathrm{sigmoid}'(a^1_j)\,x^1_k$$

This allows the weight $w^1_{kj}$ to be updated. As for the bias, its input is fixed at 1, so $x^1_k$ in the above formula is simply replaced by 1:

$$\frac{\partial E^j}{\partial b^1_j} = (y^j - t^j)\,\mathrm{sigmoid}'(a^1_j)$$

This allows the bias $b^1_j$ to be updated.
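
As a sketch of what these formulas look like in code (a hedged illustration, assuming 1-D arrays y, t, a1, x1 from a forward pass as in section 3 and the sigmoid_d helper defined in section 6; the function name is mine):

# Gradients of the error with respect to an intermediate-to-output weight and bias,
# following the general formulas above (j: output unit, k: intermediate unit)
def output_layer_grads(y, t, a1, x1, j, k):
    delta_j = (y[j] - t[j]) * sigmoid_d(a1[j])   # dE^j/da^1_j
    return delta_j * x1[k], delta_j              # (dE^j/dw^1_kj, dE^j/db^1_j)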

5. Error back propagation (from input layer to intermediate layer)

Next, we update the weights and biases from the input layer to the intermediate layer.


To update the weight $w^0_{00}$, we need both $\frac{\partial E^0}{\partial w^0_{00}}$ and $\frac{\partial E^1}{\partial w^0_{00}}$, because $w^0_{00}$ affects both outputs through the intermediate layer. From the chain rule of differentiation,

$$\frac{\partial E^0}{\partial w^0_{00}} = (y^0 - t^0)\,\mathrm{sigmoid}'(a^1_0)\,w^1_{00}\,\mathrm{sigmoid}'(a^0_0)\,x^0_0$$

$$\frac{\partial E^1}{\partial w^0_{00}} = (y^1 - t^1)\,\mathrm{sigmoid}'(a^1_1)\,w^1_{01}\,\mathrm{sigmoid}'(a^0_0)\,x^0_0$$

Expressed as a general formula with $k = 0$ to $783$ and $j = 0$ to $9$,

$$\frac{\partial E}{\partial w^0_{kj}} = \sum_{i=0}^{1}(y^i - t^i)\,\mathrm{sigmoid}'(a^1_i)\,w^1_{ji}\,\mathrm{sigmoid}'(a^0_j)\,x^0_k$$

This allows the weight $w^0_{kj}$ to be updated. As for the bias, its input is fixed at 1, so $x^0_k$ in the above formula is simply replaced by 1:

$$\frac{\partial E}{\partial b^0_j} = \sum_{i=0}^{1}(y^i - t^i)\,\mathrm{sigmoid}'(a^1_i)\,w^1_{ji}\,\mathrm{sigmoid}'(a^0_j)$$

This allows the bias $b^0_j$ to be updated.
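
Similarly, a sketch of the input-to-intermediate gradients (again a hedged illustration, assuming the forward-pass variables, the sigmoid_d helper from section 6, and W1 as the intermediate-to-output weight matrix of shape (2, 10); the function name is mine):

# Gradients of the error with respect to an input-to-intermediate weight and bias,
# summing the error contributions coming back from both output units
def hidden_layer_grads(y, t, a0, a1, x0, W1, j, k):
    delta_j = sum((y[i] - t[i]) * sigmoid_d(a1[i]) * W1[i, j] for i in range(len(y)))
    delta_j = delta_j * sigmoid_d(a0[j])     # dE/da^0_j
    return delta_j * x0[k], delta_j          # (dE/dw^0_kj, dE/db^0_j)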

6. Implementation of the forward propagation and backpropagation parts

Based on the general formulas obtained above, we implement the forward propagation and backpropagation parts.

import numpy as np

# Sigmoid function
def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Derivative of the sigmoid function
def sigmoid_d(a):
    return (1 - sigmoid(a)) * sigmoid(a)

# Error backpropagation: returns dE/da for unit j of layer l
# (uses the globals y, t, A, W and max_layer defined below)
def back(l, j):
    if l == max_layer - 1:
        # Output layer: (y - t) * sigmoid'(a)
        return (y[j] - t[j]) * sigmoid_d(A[l][j])
    else:
        # Intermediate layer: propagate the next layer's errors back through its weights
        output = 0
        m = A[l+1].shape[0]
        for i in range(m):
            output += back(l+1, i) * W[l+1][i, j] * sigmoid_d(A[l][j])
        return output

The concrete behavior of def back(l, j): is as follows.

When l = 1, (y[j] - t[j]) * sigmoid_d(A[1][j]) is returned.

When l = 0, (y[0] - t[0]) * sigmoid_d(A[1][0]) * W[1][0, j] * sigmoid_d(A[0][j]) + (y[1] - t[1]) * sigmoid_d(A[1][1]) * W[1][1, j] * sigmoid_d(A[0][j]) is returned.

#Weight W setting
np.random.seed(seed=7)
w0 = np.random.normal(0.0, 1.0, (10, 784))
w1 = np.random.normal(0.0, 1.0, (2, 10))
W = [w0, w1]

#Bias b setting
b0 = np.ones((10, 1))
b1 = np.ones((2, 1))
B = [b0, b1]

#Other settings
max_layer = 2 #Setting the number of layers
n = 0.5  #Learning rate setting

Set the weight W, bias b, and other settings.

Each element of the weight matrices w0 and w1 is a random number drawn from a normal distribution with mean 0 and standard deviation 1, so that learning can start smoothly. Incidentally, if you change the number in np.random.seed(seed=7), the starting condition of learning (whether it starts smoothly or is a little sluggish) will change. Each element of the bias matrices b0 and b1 is 1.
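
As a quick sanity check, the drawn weights should have a mean of roughly 0 and a standard deviation of roughly 1; for example:

import numpy as np

np.random.seed(seed=7)
w0 = np.random.normal(0.0, 1.0, (10, 784))
print(w0.mean(), w0.std())   # approximately 0 and 1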

# Learning loop
count = 0
acc = []

for x, t in zip(xs, ts):

    # Forward propagation
    x0 = x.flatten().reshape(784, 1)
    a0 = W[0].dot(x0) + B[0]
    x1 = sigmoid(a0)
    a1 = W[1].dot(x1) + B[1]
    y = sigmoid(a1)

    # Lists of x and a used by the parameter update
    X = [x0, x1]
    A = [a0, a1]

    # Parameter update
    for l in range(len(X)):
        for j in range(W[l].shape[0]):
            for k in range(W[l].shape[1]):
                W[l][j, k] = W[l][j, k] - n * back(l, j) * X[l][k]
            B[l][j] = B[l][j] - n * back(l, j)

This is the learning loop. Forward propagation is carried out simply with matrix products and additions. In the parameter update,

When l = 0, the range is j = 0 to 9 and k = 0 to 783, and the updates W[0][j, k] = W[0][j, k] - n * back(0, j) * X[0][k] and B[0][j] = B[0][j] - n * back(0, j) are performed.

When l = 1, the range is j = 0 to 1 and k = 0 to 9, and the updates W[1][j, k] = W[1][j, k] - n * back(1, j) * X[1][k] and B[1][j] = B[1][j] - n * back(1, j) are performed.
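
For reference, the same updates can also be written without the triple loop by using matrix operations. This is a hedged sketch rather than the code used in this article; it assumes the forward-pass variables x0, x1, a0, a1, y and the target t from the loop above, and should produce the same updates up to floating-point rounding.

# Vectorized equivalent of the element-wise update above
delta1 = (y - t.reshape(2, 1)) * sigmoid_d(a1)   # output-layer error, shape (2, 1)
delta0 = W[1].T.dot(delta1) * sigmoid_d(a0)      # intermediate-layer error, shape (10, 1)
W[1] -= n * delta1.dot(x1.T)                     # update w1, shape (2, 10)
B[1] -= n * delta1                               # update b1
W[0] -= n * delta0.dot(x0.T)                     # update w0, shape (10, 784)
B[0] -= n * delta0                               # update b0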

7. Dataset preparation

Read the MNIST dataset with Keras and extract only "1" and "7".

import numpy as np
from keras.datasets import mnist
from keras.utils import np_utils
import matplotlib.pyplot as plt

# Display the digit images
def show_mnist(x):
    fig = plt.figure(figsize=(7, 7))   
    for i in range(100):
        ax = fig.add_subplot(10, 10, i+1, xticks=[], yticks=[])
        ax.imshow(x[i].reshape((28, 28)), cmap='gray')
    plt.show()

#Data set reading
(x_train, y_train), (x_test, y_test) = mnist.load_data()
show_mnist(x_train)

# Extract 1s and 7s
x_data, y_data = [], []
for i in range(len(x_train)):  
    if y_train[i] == 1 or y_train[i] == 7:
       x_data.append(x_train[i])
       if y_train[i] == 1:
          y_data.append(0)
       if y_train[i] == 7:
          y_data.append(1)

show_mnist(x_data)

#Convert from list format to numpy format
x_data = np.array(x_data)
y_data = np.array(y_data)

# Normalize x_data, one-hot encode y_data
x_data = x_data.astype('float32')/255
y_data = np_utils.to_categorical(y_data)

# Get training and test data
xs = x_data[0:200]
ts = y_data[0:200]  
xt = x_data[2000:3000]  
tt = y_data[2000:3000] 

The two image grids (output of show_mnist) show the beginning of the original data, with digits 0 to 9, and the beginning of the data after only 1 and 7 have been extracted.

We prepare 200 training samples xs, ts and 1,000 test samples xt, tt.
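
As a rough check of what was prepared, the array shapes should come out as follows (the images are still 28 x 28 here; they are flattened to 784 x 1 inside the learning loop):

print(xs.shape, ts.shape)   # (200, 28, 28) (200, 2)
print(xt.shape, tt.shape)   # (1000, 28, 28) (1000, 2)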

8. Complete implementation

Here is the complete implementation, which adds an accuracy check on the test data and an accuracy transition graph, evaluated at every learning step.

import numpy as np
from keras.datasets import mnist
from keras.utils import np_utils

#Data set reading
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Extract only the digits 1 and 7
x_data, y_data = [], []
for i in range(len(x_train)):  
    if y_train[i] == 1 or y_train[i] == 7:
       x_data.append(x_train[i])
       if y_train[i] == 1:
          y_data.append(0)
       if y_train[i] == 7:
          y_data.append(1)

#Convert from list format to numpy format
x_data = np.array(x_data)
y_data = np.array(y_data)

# Normalize x_data, one-hot encode y_data
x_data = x_data.astype('float32')/255
y_data = np_utils.to_categorical(y_data)

#Acquisition of training data and test data
xs = x_data[0:200]  
ts = y_data[0:200]  
xt = x_data[2000:3000]  
tt = y_data[2000:3000]  


#Sigmoid function
def sigmoid(a):
    return 1 / (1 + np.exp(-a))

#Differentiation of sigmoid function
def sigmoid_d(a):
    return (1 - sigmoid(a)) * sigmoid(a)

#Backpropagation of error
def back(l, j):
    if l == max_layer - 1:
        return (y[j] - t[j]) * sigmoid_d(A[l][j])
    else:
        output = 0
        m = A[l+1].shape[0]   
        for i in range(m):
            output += back(l + 1, i) * W[l + 1][i, j] * sigmoid_d(A[l][j])
        return output

#Weight W setting
np.random.seed(seed=7)
w0 = np.random.normal(0.0, 1.0, (10, 784))
w1 = np.random.normal(0.0, 1.0, (2, 10))
W = [w0, w1]

#Bias b setting
b0 = np.ones((10, 1))
b1 = np.ones((2, 1))
B = [b0, b1]

#Other settings
max_layer = 2 #Setting the number of layers
n = 0.5  #Learning rate setting

#Learning loop
count = 0 
acc = []

for x, t in zip(xs, ts):
    
    #Forward propagation
    x0 = x.flatten().reshape(784, 1)
    a0 = W[0].dot(x0) + B[0]
    x1 = sigmoid(a0)
    a1 = W[1].dot(x1) + B[1]
    y = sigmoid(a1)

    # Lists of x and a used by the parameter update
    X = [x0, x1]
    A = [a0, a1]

    #Parameter update
    for l in range(len(X)):
        for j in range(W[l].shape[0]):
            for k in range(W[l].shape[1]):
                W[l][j, k] = W[l][j, k] - n * back(l, j) * X[l][k]  
            B[l][j] = B[l][j] - n * back(l, j) 
            
    #Accuracy check by test data
    correct, error = 0, 0

    for i in range(1000):

        #Inference with learned parameters
        x0 = xt[i].flatten().reshape(784, 1)
        a0 = W[0].dot(x0) + B[0]
        x1 = sigmoid(a0)
        a1 = W[1].dot(x1) + B[1]
        y = sigmoid(a1)
    
        if np.argmax(y) == np.argmax(tt[i]):
           correct += 1
        else:
           error += 1
    calc = correct/(correct+error)
    acc.append(calc)
    count +=1
    print("\r[%s] acc: %s"%(count, calc))
   
#Accuracy transition graph display
import matplotlib.pyplot as plt
plt.plot(acc, label='acc')
plt.legend()
plt.show()   

After 200 learning steps, the classification accuracy reached 97.8%, as shown in the accuracy transition graph. The neural network implemented from scratch appears to work properly.
