I needed to explain some neural network (NN) code, so I took the opportunity to summarize it briefly. In what follows, I assume basic knowledge of neural networks.
[Learning and Neural Networks (Electronic Information and Communication Engineering Series)](http://www.amazon.co.jp/gp/product/4627702914/ref=as_li_tf_tl?ie=UTF8&camp=247&creative=1211&creativeASIN=4627702914&linkCode=as2&tag=shimashimao06-22)
Input to the network: $(x_1, x_2, \ldots, x_N)$
Output from the network: $(y_1, y_2, \ldots, y_M)$
Input/output relationship of the network: $(y_1, y_2, \ldots, y_M) = F(x_1, x_2, \ldots, x_N)$
Connection weights: $\{w_{ij}\}$
The reference book above gives the following reasons why BP became so widely used.
- In a general nonlinear system, computing the partial derivative of the evaluation function with respect to each parameter tends to be complicated, but BP provides a systematic and well-organized way of calculating it. Moreover, the information needed to determine the correction amount for each parameter can be delivered to where it is needed over the network's own connection structure. This means no special communication lines have to be provided for learning, which is convenient for simplifying both algorithms and hardware.
- The function approximation capability of the network is theoretically guaranteed: with a sufficient number of intermediate-layer units, any function can be approximated to high accuracy by choosing the parameter values (connection weights and thresholds) appropriately. BP is not guaranteed to find the parameter values that give the optimal approximation, but this kind of existence theorem is reassuring.
Perform the following operations.
Since we derive a sequential (per-sample) update learning rule, the gradient method is applied with the goal of reducing the error evaluation measure in the following equation for each training example $(x^{(l)}, y^{(l)})$.
Differentiating the error evaluation measure $E(w_0, w_1, \ldots, w_N)$ with respect to $w_n$ gives the following equation.
When this is differentiated,
Rearranging this further,
Using the value of $\partial E(w_0, w_1, \ldots, w_N)/\partial w_n$ obtained above, the connection weights are corrected repeatedly by the following equation.
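As a concrete illustration, here is a minimal sketch of this sequential update for a single sigmoid unit with squared error (the function name and the specific activation are my assumptions, not the article's code):

```python
import numpy as np

def sequential_update(w, x, y_target, mu=0.01):
    """One per-sample gradient step for a single sigmoid unit.

    Reduces E = (y_target - y)^2 for this one example, where y = sigmoid(w . x).
    """
    s = np.dot(w, x)                 # weighted input
    y = 1.0 / (1.0 + np.exp(-s))     # sigmoid output
    # dE/dw_n = 2 * (y - y_target) * y * (1 - y) * x_n  (chain rule)
    grad = 2.0 * (y - y_target) * y * (1.0 - y) * x
    return w - mu * grad             # gradient-method correction
```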
A small change in $x$ propagates, in a chain, to changes in the other variables according to the dependencies between the variables in the figure above. By the chain rule, the following relations hold between the small changes $\Delta x, \Delta z_1, \Delta z_2, \Delta y$ of these variables.
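A sketch of those relations, assuming the dependency structure is $x \to (z_1, z_2) \to y$:

```math
\Delta z_1 \approx \frac{\partial z_1}{\partial x}\Delta x,\qquad
\Delta z_2 \approx \frac{\partial z_2}{\partial x}\Delta x,\qquad
\Delta y \approx \frac{\partial y}{\partial z_1}\Delta z_1 + \frac{\partial y}{\partial z_2}\Delta z_2
```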
Summarizing the above,
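the chain rule gives the overall sensitivity of $y$ to $x$ (again a sketch under the same assumed dependency structure):

```math
\frac{\partial y}{\partial x}
= \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x}
+ \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x}
```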
Based on the chain rule above, BP is derived for an NN consisting of an input layer, an intermediate layer, and an output layer. Each of the three layers has 3 units (4 for the input and intermediate layers when the bias term is counted). Number the units starting from the output layer: output layer {1, 2, 3}, intermediate layer {4, 5, 6}, input layer {7, 8, 9}.
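For the derivation below, assume each unit $i$ forms a weighted input $s_i$ from the outputs $y_j$ of the units feeding into it and emits its output $y_i$ through the activation function $f$ (here the sigmoid), with $w_{ij}$ denoting the weight on the connection from unit $j$ to unit $i$:

```math
s_i = \sum_{j} w_{ij}\, y_j, \qquad y_i = f(s_i)
```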
BP is derived using two cases as examples: correcting a weight into output unit 2 ($w_2$) and correcting a weight into intermediate unit 4 ($w_4$). An image of each propagation is shown below.
The relationship between the change $\Delta w_2$ in $w_2$ and the resulting change $\Delta s_2$ in $s_2$ is given by the following equation.
This change in $y_2$ changes the value of the error evaluation measure, and the following relation holds.
From the above equation, we have obtained the partial derivative $\partial E/\partial w_2$ needed to correct $w_2$ by the gradient method.
Also, from $\Delta s_2 = y_4\,\Delta w_2$,
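the gradient with respect to the output-layer weight can be written as follows (a sketch, assuming $y_2 = f(s_2)$ and the squared-error measure $E$):

```math
\frac{\partial E}{\partial w_2}
= \frac{\partial E}{\partial y_2}\,\frac{\partial y_2}{\partial s_2}\,\frac{\partial s_2}{\partial w_2}
= \frac{\partial E}{\partial y_2}\, f'(s_2)\, y_4
= \frac{\partial E}{\partial s_2}\, y_4
```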
The relationship between the amounts of change when $w_4$ is changed is given by the following formula.
The resulting change in $y_4$ affects $s_1, s_2, s_3$ at the connection destinations, so the following relations hold between the amounts of change.
These changes also change the error evaluation measure as in the following equation.
Rearranging this and dividing both sides by $\Delta s_4$,
is obtained. The above formula is a kind of recurrence in which $\partial E/\partial s_i$ for the intermediate layer is obtained from $\partial E/\partial s_j$ of the output layer. Similarly, for an NN with any number of layers, once $\partial E/\partial s$ of the final layer is found, the error can be propagated down to the lower layers.
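Written out in the notation above, the recurrence is (a sketch; $j$ runs over the output-layer units that unit 4 connects to):

```math
\frac{\partial E}{\partial s_4}
= f'(s_4) \sum_{j=1}^{3} w_{j4}\, \frac{\partial E}{\partial s_j}
```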
Finally,
Sorry for the rather messy code, but I'll paste it below. For simplicity, the threshold (bias) is fixed at the same value for all neurons. When writing real code, prepend an element with value 1 to the input vector, add a bias element to the weight vector, and adjust the bias as part of the weights.
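A minimal sketch of that bias-absorption trick (the variable names are illustrative, not part of the code below):

```python
import numpy as np

x = np.array([0.2, 0.7, 0.5])    # original input vector
w = np.array([0.1, -0.3, 0.8])   # original weight vector
b = -0.5                         # bias (threshold)

x_aug = np.hstack(([1.0], x))    # prepend a constant 1 to the input
w_aug = np.hstack(([b], w))      # prepend the bias to the weights

# The weighted input is unchanged; the bias is now just another weight to learn.
assert np.isclose(np.dot(w_aug, x_aug), np.dot(w, x) + b)
```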
```python
# coding: utf-8
import numpy as np
from numpy.random import randint
import sys


class NN:
    def __init__(self):
        self.alph = 0.04       # gain of the sigmoid
        self.mu = 0.01         # learning rate
        self.theta = 0.1       # threshold (bias), fixed for all neurons
        self.w = []            # connection weights per layer
        self.output = []       # weighted inputs per layer
        self.output_sigm = []  # sigmoid outputs per layer
        self.T = 0             # iteration counter

    def create_data(self, input_n_row, input_n_col, layer_sizes):
        # random binary training data and randomly initialized weights
        self.x = randint(2, size=(input_n_row, input_n_col))
        self.y = randint(2, size=(input_n_row, layer_sizes[-1]))
        for i_layer, size in enumerate(layer_sizes):
            if i_layer == 0:
                self.w.append(np.random.randn(input_n_col, size))
                self.output.append(np.zeros((input_n_row, size)))
                self.output_sigm.append(np.zeros((input_n_row, size)))
            else:
                self.w.append(np.random.randn(layer_sizes[i_layer-1], size))
                self.output.append(np.zeros((input_n_row, size)))
                self.output_sigm.append(np.zeros((input_n_row, size)))

    def fit(self, eps=10e-6):
        error = sys.maxint
        self.forward()
        while error > eps:
            self.update(self.backword())
            self.forward()
            error = self.calculate_error()
            self.T += 1
            print "T=", self.T
            print "error", error

    def calculate_error(self):
        return np.sum(np.power(self.y - self.output_sigm[-1], 2))

    def forward(self):
        for i_layer in xrange(len(self.output)):
            if i_layer == 0:
                self.output[i_layer] = self.x.dot(self.w[i_layer])
                self.output_sigm[i_layer] = self.sigmoid(self.output[i_layer])
            else:
                self.output[i_layer] = self.output_sigm[i_layer-1].dot(self.w[i_layer])
                self.output_sigm[i_layer] = self.sigmoid(self.output[i_layer])

    def backword(self):
        # propagate the error from the output layer back toward the lower layers
        result = []
        for i_layer in range(len(self.w))[::-1]:
            if i_layer == len(self.w)-1:
                result.insert(0, self.diff(self.output_sigm[i_layer], self.y))
            else:
                result.insert(0, self.diff_mult(self.output_sigm[i_layer],
                                                result[0].dot(self.w[i_layer+1].T)))
        return result

    def update(self, diff):
        # correct the weights sample by sample (sequential update)
        for i_layer in range(len(self.w))[::-1]:
            if i_layer == 0:
                for i_row in xrange(len(diff[i_layer])):
                    self.w[i_layer] -= self.get_incremental_update_value(
                        self.x[i_row].reshape(len(self.w[i_layer]), 1),
                        diff[i_layer][i_row, :].reshape(1, self.w[i_layer].shape[1])
                    )
            else:
                for i_row in xrange(len(diff[i_layer])):
                    self.w[i_layer] -= self.get_incremental_update_value(
                        self.output_sigm[i_layer-1][i_row, :].reshape(len(self.w[i_layer]), 1),
                        diff[i_layer][i_row, :].reshape(1, self.w[i_layer].shape[1])
                    )

    def get_incremental_update_value(self, input_data, diff):
        return np.kron(input_data, self.mu*diff)

    def diff(self, y, t):
        return self.alph * 2*(y - t) * self.dsigmoid(y)

    def diff_mult(self, y, prp_value):
        return self.alph * self.dsigmoid(y) * prp_value

    def sigmoid(self, s, alph=0.01):
        return 1/(1+np.exp(-self.alph*(s-self.theta)))

    def dsigmoid(self, y):
        return y * (1 - y)


if __name__ == '__main__':
    layer_sizes = (4, 3)
    input_layer_size = 3
    input_data_size = 1000
    nn = NN()
    nn.create_data(input_data_size, input_layer_size, layer_sizes)
    nn.fit()
```
If you notice any mistakes, I would appreciate it if you could point them out.