I needed to explain some neural network (NN) code, so I took the opportunity to summarize it briefly. In what follows, I assume basic knowledge of neural networks.
[Learning and Neural Networks (Electronic Information and Communication Engineering Series)](http://www.amazon.co.jp/gp/product/4627702914/ref=as_li_tf_tl?ie=UTF8&camp=247&creative=1211&creativeASIN=4627702914&linkCode=as2&tag=shimashimao06-22)
Input to the network: $(x_1, x_2, \ldots, x_N)$
Output from the network: $(y_1, y_2, \ldots, y_M)$
Input/output relationship of the network: $(y_1, y_2, \ldots, y_M) = F(x_1, x_2, \ldots, x_N)$
Connection weights: $\{w_{ij}\}$
The reference book above gives the following reasons why BP became so widely used.
- In a general nonlinear system, computing the partial derivative of the evaluation function with respect to each parameter tends to be complicated, but BP provides a systematic and well-organized way of calculating it. Moreover, the information needed to determine the correction amount for each parameter can be delivered to where it is needed over the network's own connection structure. This means no special communication lines have to be provided for learning, which is convenient for simplifying both algorithms and hardware.
- The function approximation capability of the network is theoretically guaranteed: with a sufficient number of intermediate-layer units, any function can be approximated to high accuracy by choosing the parameter values (connection weights and thresholds) appropriately. BP is not guaranteed to find the parameter values that give the optimal approximation, but this kind of existence theorem is reassuring.
Perform the following operations.
Since we derive a sequential (per-sample) update learning rule, the gradient method is applied with the goal of reducing the error evaluation measure in the following equation for each training example $(x^{(l)}, y^{(l)})$.
Differentiating the error evaluation measure $E(w_0, w_1, \ldots, w_N)$ with respect to $w_n$ gives the following equation.
When this is differentiated,
Rearranging this further,
Using the value of $\partial E(w_0, w_1, \ldots, w_N)/\partial w_n$ obtained above, the connection weights are corrected repeatedly by the following equation.
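As a concrete illustration, here is a minimal sketch of this sequential update for a single sigmoid unit with squared error (the function name and the specific activation are my assumptions, not the article's code):

```python
import numpy as np

def sequential_update(w, x, y_target, mu=0.01):
    """One per-sample gradient step for a single sigmoid unit.

    Reduces E = (y_target - y)^2 for this one example, where y = sigmoid(w . x).
    """
    s = np.dot(w, x)                 # weighted input
    y = 1.0 / (1.0 + np.exp(-s))     # sigmoid output
    # dE/dw_n = 2 * (y - y_target) * y * (1 - y) * x_n  (chain rule)
    grad = 2.0 * (y - y_target) * y * (1.0 - y) * x
    return w - mu * grad             # gradient-method correction
```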
A small change in $x$ propagates, in a chain, to changes in the other variables according to the dependencies between the variables in the figure above. By the chain rule, the following relations hold between the small changes $\Delta x, \Delta z_1, \Delta z_2, \Delta y$ of these variables.
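A sketch of those relations, assuming the dependency structure is $x \to (z_1, z_2) \to y$:

```math
\Delta z_1 \approx \frac{\partial z_1}{\partial x}\Delta x,\qquad
\Delta z_2 \approx \frac{\partial z_2}{\partial x}\Delta x,\qquad
\Delta y \approx \frac{\partial y}{\partial z_1}\Delta z_1 + \frac{\partial y}{\partial z_2}\Delta z_2
```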
Summarizing the above,
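the chain rule gives the overall sensitivity of $y$ to $x$ (again a sketch under the same assumed dependency structure):

```math
\frac{\partial y}{\partial x}
= \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x}
+ \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x}
```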
Based on the chain rule above, BP is derived for an NN consisting of an input layer, an intermediate layer, and an output layer. Each of the three layers has 3 units (4 for the input and intermediate layers when the bias term is counted). Number the units starting from the output layer: output layer {1, 2, 3}, intermediate layer {4, 5, 6}, input layer {7, 8, 9}.
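For the derivation below, assume each unit $i$ forms a weighted input $s_i$ from the outputs $y_j$ of the units feeding into it and emits its output $y_i$ through the activation function $f$ (here the sigmoid), with $w_{ij}$ denoting the weight on the connection from unit $j$ to unit $i$:

```math
s_i = \sum_{j} w_{ij}\, y_j, \qquad y_i = f(s_i)
```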
BP is derived using two cases as examples: correcting a weight into output unit 2 ($w_2$) and correcting a weight into intermediate unit 4 ($w_4$). An image of each propagation is shown below.
The relationship between the change $\Delta w_2$ in $w_2$ and the resulting change $\Delta s_2$ in $s_2$ is given by the following equation.
This change in $y_2$ changes the value of the error evaluation measure, and the following relation holds.
From the above equation, we have obtained the partial derivative $\partial E/\partial w_2$ needed to correct $w_2$ by the gradient method.
Also, from $\Delta s_2 = y_4\,\Delta w_2$,
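the gradient with respect to the output-layer weight can be written as follows (a sketch, assuming $y_2 = f(s_2)$ and the squared-error measure $E$):

```math
\frac{\partial E}{\partial w_2}
= \frac{\partial E}{\partial y_2}\,\frac{\partial y_2}{\partial s_2}\,\frac{\partial s_2}{\partial w_2}
= \frac{\partial E}{\partial y_2}\, f'(s_2)\, y_4
= \frac{\partial E}{\partial s_2}\, y_4
```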
The relationship between the amounts of change when $w_4$ is changed is given by the following formula.
The resulting change in $y_4$ affects $s_1, s_2, s_3$ at the connection destinations, so the following relations hold between the amounts of change.
These changes also change the error evaluation measure as in the following equation.
Rearranging this and dividing both sides by $\Delta s_4$,
is obtained. The above formula is a kind of recurrence in which $\partial E/\partial s_i$ for the intermediate layer is obtained from $\partial E/\partial s_j$ of the output layer. Similarly, for an NN with any number of layers, once $\partial E/\partial s$ of the final layer is found, the error can be propagated down to the lower layers.
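Written out in the notation above, the recurrence is (a sketch; $j$ runs over the output-layer units that unit 4 connects to):

```math
\frac{\partial E}{\partial s_4}
= f'(s_4) \sum_{j=1}^{3} w_{j4}\, \frac{\partial E}{\partial s_j}
```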
Finally,
Sorry for the rather messy code, but I'll paste it below. For simplicity, the threshold (bias) is fixed at the same value for all neurons. When writing real code, prepend an element with value 1 to the input vector, add a bias element to the weight vector, and adjust the bias as part of the weights.
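A minimal sketch of that bias-absorption trick (the variable names are illustrative, not part of the code below):

```python
import numpy as np

x = np.array([0.2, 0.7, 0.5])    # original input vector
w = np.array([0.1, -0.3, 0.8])   # original weight vector
b = -0.5                         # bias (threshold)

x_aug = np.hstack(([1.0], x))    # prepend a constant 1 to the input
w_aug = np.hstack(([b], w))      # prepend the bias to the weights

# The weighted input is unchanged; the bias is now just another weight to learn.
assert np.isclose(np.dot(w_aug, x_aug), np.dot(w, x) + b)
```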
```python
# coding: utf-8
import numpy as np
from numpy.random import randint
import sys


class NN:
    def __init__(self):
        self.alph = 0.04       # gain of the sigmoid
        self.mu = 0.01         # learning rate
        self.theta = 0.1       # threshold (bias), fixed for all neurons
        self.w = []            # connection weights per layer
        self.output = []       # weighted inputs per layer
        self.output_sigm = []  # sigmoid outputs per layer
        self.T = 0             # iteration counter

    def create_data(self, input_n_row, input_n_col, layer_sizes):
        # random binary training data and randomly initialized weights
        self.x = randint(2, size=(input_n_row, input_n_col))
        self.y = randint(2, size=(input_n_row, layer_sizes[-1]))
        for i_layer, size in enumerate(layer_sizes):
            if i_layer == 0:
                self.w.append(np.random.randn(input_n_col, size))
                self.output.append(np.zeros((input_n_row, size)))
                self.output_sigm.append(np.zeros((input_n_row, size)))
            else:
                self.w.append(np.random.randn(layer_sizes[i_layer-1], size))
                self.output.append(np.zeros((input_n_row, size)))
                self.output_sigm.append(np.zeros((input_n_row, size)))

    def fit(self, eps=10e-6):
        error = sys.maxint
        self.forward()
        while error > eps:
            self.update(self.backword())
            self.forward()
            error = self.calculate_error()
            self.T += 1
            print "T=", self.T
            print "error", error

    def calculate_error(self):
        return np.sum(np.power(self.y - self.output_sigm[-1], 2))

    def forward(self):
        for i_layer in xrange(len(self.output)):
            if i_layer == 0:
                self.output[i_layer] = self.x.dot(self.w[i_layer])
                self.output_sigm[i_layer] = self.sigmoid(self.output[i_layer])
            else:
                self.output[i_layer] = self.output_sigm[i_layer-1].dot(self.w[i_layer])
                self.output_sigm[i_layer] = self.sigmoid(self.output[i_layer])

    def backword(self):
        # propagate the error from the output layer back toward the lower layers
        result = []
        for i_layer in range(len(self.w))[::-1]:
            if i_layer == len(self.w)-1:
                result.insert(0, self.diff(self.output_sigm[i_layer], self.y))
            else:
                result.insert(0, self.diff_mult(self.output_sigm[i_layer],
                                                result[0].dot(self.w[i_layer+1].T)))
        return result

    def update(self, diff):
        # correct the weights sample by sample (sequential update)
        for i_layer in range(len(self.w))[::-1]:
            if i_layer == 0:
                for i_row in xrange(len(diff[i_layer])):
                    self.w[i_layer] -= self.get_incremental_update_value(
                        self.x[i_row].reshape(len(self.w[i_layer]), 1),
                        diff[i_layer][i_row, :].reshape(1, self.w[i_layer].shape[1])
                    )
            else:
                for i_row in xrange(len(diff[i_layer])):
                    self.w[i_layer] -= self.get_incremental_update_value(
                        self.output_sigm[i_layer-1][i_row, :].reshape(len(self.w[i_layer]), 1),
                        diff[i_layer][i_row, :].reshape(1, self.w[i_layer].shape[1])
                    )

    def get_incremental_update_value(self, input_data, diff):
        return np.kron(input_data, self.mu*diff)

    def diff(self, y, t):
        return self.alph * 2*(y - t) * self.dsigmoid(y)

    def diff_mult(self, y, prp_value):
        return self.alph * self.dsigmoid(y) * prp_value

    def sigmoid(self, s, alph=0.01):
        return 1/(1+np.exp(-self.alph*(s-self.theta)))

    def dsigmoid(self, y):
        return y * (1 - y)


if __name__ == '__main__':
    layer_sizes = (4, 3)
    input_layer_size = 3
    input_data_size = 1000
    nn = NN()
    nn.create_data(input_data_size, input_layer_size, layer_sizes)
    nn.fit()
```
If you notice any mistakes, I would appreciate it if you could point them out.