I implemented a 3-layer neural network in Python and tried to identify XNOR. I have also included the mathematical formulas, so please read on if you are interested. I used ["Deep Learning" (Takayuki Okatani, Machine Learning Professional Series)](http://www.amazon.co.jp/dp/4061529021) as a textbook.
A neural network is a model that imitates the neural circuits of the human brain, and it is used for tasks such as image recognition and speech recognition. The network implemented here has a three-layer structure consisting of an input layer, one intermediate layer, and an output layer.
I will explain with the figure below.
Let $w_{ji}^{(l)}$ be the weight from the $i$th unit in layer $l-1$ to the $j$th unit in layer $l$, and let $u_{i}^{(l-1)}$ be the value held by the $i$th unit in layer $l-1$ (its total input before the activation function is applied). The sigmoid function $g(x)$ is used as the activation function. It is convenient because its derivative can be written in terms of the function value itself.
g(x) = \cfrac{1}{1 + \exp(-x)} \\
g'(x) = g(x)(1 - g(x))
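As a quick numerical check of this identity (a minimal sketch that is not part of the article's code and assumes only numpy), a central finite difference agrees with $g(x)(1 - g(x))$:

```python
import numpy

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + numpy.exp(-x))

x = numpy.linspace(-5.0, 5.0, 11)
h = 1e-6
numerical = (g(x + h) - g(x - h)) / (2.0 * h)  # central finite difference of g
analytical = g(x) * (1.0 - g(x))               # g'(x) = g(x)(1 - g(x))
print(numpy.allclose(numerical, analytical, atol=1e-6))  # True
```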
The squared error $E_n$ is used as the objective function, and the weights are updated by the formula below to minimize it. $E_n$ is the error from a single sample, and the method of updating the weights using this $E_n$ is called stochastic gradient descent. Because the objective function changes from sample to sample, stochastic gradient descent is less likely to get trapped in a local minimum. $\epsilon$ is called the learning rate and is a parameter that determines the learning speed. The key point is how to compute the partial-derivative term.
{{w}_{ji}^{(l)}}_{new} = {w_{ji}^{(l)}}_{old} - \epsilon\cfrac{\partial E_n}{\partial {w_{ji}^{(l)}}_{old}}
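In code, one update step is simply the following (a schematic sketch with a hypothetical helper name; how the gradient itself is computed is derived below):

```python
def sgd_step(w_old, grad, epsilon):
    # one stochastic gradient descent step: w_new = w_old - epsilon * dEn/dw
    # grad is the partial derivative of En with respect to w for the current sample
    return w_old - epsilon * grad
```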
Now, I will explain how to find $\cfrac{\partial E_n}{\partial w_{ji}^{(l)}}$. Since the calculation is slightly different for the output layer and the intermediate layer, they are explained separately.
**Output layer**
\begin{align}
\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} &= \cfrac{\partial}{\partial w_{ji}^{(l)}}\cfrac{1}{2}(t_j - g(u_{j}^{(l)}))^{2} \\
&= (t_j - g(u_{j}^{(l)}))\cdot\cfrac{\partial}{\partial w_{ji}^{(l)}}(t_j - g(u_{j}^{(l)})) \\
&= (t_j - g(u_{j}^{(l)}))\cdot\cfrac{\partial}{\partial u_{j}^{(l)}}(t_j - g(u_{j}^{(l)}))\cdot\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} \\
\\
&= (g(u_{j}^{(l)}) - t_j)\cdot g'(u_{j}^{(l)})\cdot g(u_{i}^{(l-1)}) \\
\\
&= (g(u_{j}^{(l)}) - t_j)g(u_{j}^{(l)})(1 - g(u_{j}^{(l)}))g(u_{i}^{(l-1)})
\end{align}
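To make this output-layer formula concrete, here is a small numerical example (the values of $u_{j}^{(l)}$, $u_{i}^{(l-1)}$, and $t_j$ are hypothetical and not from the article):

```python
import numpy

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + numpy.exp(-x))

u_j = 0.8   # total input of output unit j (hypothetical)
u_i = -0.3  # total input of hidden unit i in the previous layer (hypothetical)
t_j = 1.0   # teacher signal for output unit j (hypothetical)

# dEn/dw_ji = (g(u_j) - t_j) * g(u_j) * (1 - g(u_j)) * g(u_i)
grad = (g(u_j) - t_j) * g(u_j) * (1.0 - g(u_j)) * g(u_i)
print(grad)  # negative here, so subtracting epsilon * grad increases w_ji and pushes g(u_j) toward t_j
```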
**Intermediate layer**
\begin{align}
\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} &= \cfrac{\partial E_n}{\partial u_{j}^{(l)}}\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} \\
\\
&= \delta_{j}^{(l)}\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}}
\end{align}
The first term on the right side is
\begin{align}
\delta_{j}^{(l)} &= \cfrac{\partial E_n}{\partial u_{j}^{(l)}} \\
&= \sum_{k}\cfrac{\partial E_n}{\partial u_{k}^{(l+1)}}\cfrac{\partial u_{k}^{(l+1)}}{\partial u_{j}^{(l)}} \\
\\
&= \sum_{k}\delta_{k}^{(l+1)}w_{kj}^{(l+1)}g'(u_{j}^{(l)}) \\
\\
&= \Bigl(\sum_{k}\delta_{k}^{(l+1)}w_{kj}^{(l+1)}\Bigr)g(u_{j}^{(l)})(1 - g(u_{j}^{(l)}))
\end{align}
The second term on the right side is
\begin{align}
\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} &= g(u_{i}^{(l-1)})
\end{align}
From the above,
\begin{align}
\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} &= \delta_{j}^{(l)}\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} \\
\\
&= \Bigl(\sum_{k}\delta_{k}^{(l+1)}w_{kj}^{(l+1)}\Bigr)g(u_{j}^{(l)})(1 - g(u_{j}^{(l)}))g(u_{i}^{(l-1)})
\end{align}
To update the weights of an intermediate layer $l$, the $\delta_{k}^{(l+1)}$ of the next layer $l+1$ is required. The $\delta$ of the output layer can be obtained from the difference between the output value and the teacher signal, and by propagating it in order from the output layer toward the input layer, the weights of the intermediate layers are updated. This is why the method is called error backpropagation.
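A vectorized sketch of this backward pass with numpy (a minimal sketch; the variable names and numerical values are mine, but the formula is the one derived above):

```python
import numpy

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + numpy.exp(-x))

# hypothetical state of a small network: 2 hidden units, 2 output units
u_hidden = numpy.array([0.4, -0.2])   # total inputs u_j^(l) of the hidden layer
z = g(u_hidden)                       # hidden outputs g(u_j^(l))
W_out = numpy.array([[0.5, -0.3],     # w_kj^(l+1): row k = output unit, column j = hidden unit
                     [0.2,  0.7]])
delta_out = numpy.array([0.1, -0.05]) # delta_k^(l+1) already computed in the output layer

# delta_j^(l) = (sum_k delta_k^(l+1) * w_kj^(l+1)) * g(u_j^(l)) * (1 - g(u_j^(l)))
delta_hidden = W_out.T.dot(delta_out) * z * (1.0 - z)

# dEn/dw_ji^(l) = delta_j^(l) * g(u_i^(l-1)), with a hypothetical previous-layer output
z_prev = numpy.array([0.6, 0.9])
grad = delta_hidden.reshape((-1, 1)) * z_prev  # grad[j, i] = delta_j * z_prev[i]
print(grad)
```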
XNOR
XNOR outputs $1$ when the two input values are the same and $0$ when they are different. It is not linearly separable.
| $x_1$ | $x_2$ | XNOR |
|:---:|:---:|:---:|
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
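As a function, XNOR is simply equality of the two inputs. The snippet below (not part of the article's code) also shows the one-hot teacher signal used as `T` in main.py, where index 0 corresponds to the output $1$:

```python
def xnor(x1, x2):
    # 1 when the two inputs are equal, 0 otherwise
    return int(x1 == x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        t = [1, 0] if xnor(x1, x2) == 1 else [0, 1]  # one-hot teacher signal as in main.py's T
        print(x1, x2, xnor(x1, x2), t)
```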
I implemented a neural network that identifies XNOR. There is one intermediate layer, and it has two units. The learning rate is $\epsilon = 0.1$ and the momentum coefficient is $\mu = 0.9$. (Momentum is one of the methods for improving convergence: the previous weight update, multiplied by a coefficient, is added to the current update.)
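Written as a formula (my reading of the description above), the update with momentum becomes

\Delta w_{ji}^{(l)}(t) = -\epsilon\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} + \mu\,\Delta w_{ji}^{(l)}(t-1) \\
{w_{ji}^{(l)}}_{new} = {w_{ji}^{(l)}}_{old} + \Delta w_{ji}^{(l)}(t)

where $\Delta w_{ji}^{(l)}(t)$ is the weight correction applied at step $t$.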
neuralnetwork.py
```python
import numpy
import math
import random
from matplotlib import pyplot


class Neural:
    # constructor
    def __init__(self, n_input, n_hidden, n_output):
        # the +1 in each weight matrix is for the bias unit
        self.hidden_weight = numpy.random.random_sample((n_hidden, n_input + 1))
        self.output_weight = numpy.random.random_sample((n_output, n_hidden + 1))
        self.hidden_momentum = numpy.zeros((n_hidden, n_input + 1))
        self.output_momentum = numpy.zeros((n_output, n_hidden + 1))

    # public methods
    def train(self, X, T, epsilon, mu, epoch):
        self.error = numpy.zeros(epoch)
        N = X.shape[0]
        for epo in range(epoch):
            for i in range(N):
                x = X[i, :]
                t = T[i, :]
                self.__update_weight(x, t, epsilon, mu)
            self.error[epo] = self.__calc_error(X, T)

    def predict(self, X):
        N = X.shape[0]
        C = numpy.zeros(N).astype('int')
        Y = numpy.zeros((N, self.output_weight.shape[0]))  # one row per sample, one column per output unit
        for i in range(N):
            x = X[i, :]
            z, y = self.__forward(x)
            Y[i] = y
            C[i] = y.argmax()
        return (C, Y)

    def error_graph(self):
        pyplot.ylim(0.0, 2.0)
        pyplot.plot(numpy.arange(0, self.error.shape[0]), self.error)
        pyplot.show()

    # private methods
    def __sigmoid(self, arr):
        return numpy.vectorize(lambda x: 1.0 / (1.0 + math.exp(-x)))(arr)

    def __forward(self, x):
        # z: output of the hidden layer, y: output of the output layer
        z = self.__sigmoid(self.hidden_weight.dot(numpy.r_[numpy.array([1]), x]))
        y = self.__sigmoid(self.output_weight.dot(numpy.r_[numpy.array([1]), z]))
        return (z, y)

    def __update_weight(self, x, t, epsilon, mu):
        z, y = self.__forward(x)

        # update output_weight
        output_delta = (y - t) * y * (1.0 - y)
        _output_weight = self.output_weight.copy()  # copy, because "-=" modifies the array in place
        self.output_weight -= epsilon * output_delta.reshape((-1, 1)) * numpy.r_[numpy.array([1]), z] - mu * self.output_momentum
        self.output_momentum = self.output_weight - _output_weight

        # update hidden_weight (delta uses the pre-update output weights, as in the derivation)
        hidden_delta = _output_weight[:, 1:].T.dot(output_delta) * z * (1.0 - z)
        _hidden_weight = self.hidden_weight.copy()
        self.hidden_weight -= epsilon * hidden_delta.reshape((-1, 1)) * numpy.r_[numpy.array([1]), x] - mu * self.hidden_momentum
        self.hidden_momentum = self.hidden_weight - _hidden_weight

    def __calc_error(self, X, T):
        # squared error summed over all samples
        N = X.shape[0]
        err = 0.0
        for i in range(N):
            x = X[i, :]
            t = T[i, :]
            z, y = self.__forward(x)
            err += ((y - t) ** 2).sum() / 2.0
        return err
```
main.py
```python
from neuralnetwork import *


if __name__ == '__main__':
    X = numpy.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # input patterns
    T = numpy.array([[1, 0], [0, 1], [0, 1], [1, 0]])  # one-hot teacher signals (index 0 = XNOR is 1)
    N = X.shape[0]  # number of data
    input_size = X.shape[1]
    hidden_size = 2
    output_size = 2
    epsilon = 0.1
    mu = 0.9
    epoch = 10000

    nn = Neural(input_size, hidden_size, output_size)
    nn.train(X, T, epsilon, mu, epoch)
    nn.error_graph()

    C, Y = nn.predict(X)

    for i in range(N):
        x = X[i, :]
        y = Y[i, :]
        c = C[i]
        print(x)
        print(y)
        print(c)
        print("")
```
The network identifies XNOR correctly.
| $x_1$ | $x_2$ | teacher $t$ | output $y$ |
|:---:|:---:|:---:|:---:|
| 0 | 0 | [1, 0] | [0.92598739, 0.07297757] |
| 0 | 1 | [0, 1] | [0.06824915, 0.93312514] |
| 1 | 0 | [0, 1] | [0.06828438, 0.93309010] |
| 1 | 1 | [1, 0] | [0.92610205, 0.07220633] |
I made a graph showing how the error is decreasing. The horizontal axis is the number of epochs and the vertical axis is the error.
I implemented a 3-layer neural network and was able to identify XNOR. There is still more to study, such as training tricks (how to choose the initial weights, how to choose the learning rate, AdaGrad, momentum), but I will stop here.