I implemented a 3-layer neural network in Python and tried to identify XNOR. I have also included the mathematical formulas, so please read on if you are interested. I used ["Deep Learning" (Takayuki Okatani, Machine Learning Professional Series)](http://www.amazon.co.jp/dp/4061529021) as a textbook.
A neural network is a model that imitates the neural circuits of the human brain, and it is used for tasks such as image recognition and speech recognition. The network implemented here has a three-layer structure consisting of an input layer, one intermediate layer, and an output layer.
I will explain with the figure below.
Let $w_{ji}^{(l)}$ be the weight from the $i$th unit in layer $l-1$ to the $j$th unit in layer $l$, and let $u_{i}^{(l-1)}$ be the value held by the $i$th unit in layer $l-1$ (its total input before the activation function is applied). The sigmoid function $g(x)$ is used as the activation function. It is convenient because its derivative can be written in terms of the function value itself.
g(x) = \cfrac{1}{1 + \exp(-x)} \\
g'(x) = g(x)(1 - g(x))
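As a quick numerical check of this identity (a minimal sketch that is not part of the article's code and assumes only numpy), a central finite difference agrees with $g(x)(1 - g(x))$:

```python
import numpy

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + numpy.exp(-x))

x = numpy.linspace(-5.0, 5.0, 11)
h = 1e-6
numerical = (g(x + h) - g(x - h)) / (2.0 * h)  # central finite difference of g
analytical = g(x) * (1.0 - g(x))               # g'(x) = g(x)(1 - g(x))
print(numpy.allclose(numerical, analytical, atol=1e-6))  # True
```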
The squared error $E_n$ is used as the objective function, and the weights are updated by the formula below to minimize it. $E_n$ is the error from a single sample, and the method of updating the weights using this $E_n$ is called stochastic gradient descent. Because the objective function changes from sample to sample, stochastic gradient descent is less likely to get trapped in a local minimum. $\epsilon$ is called the learning rate and is a parameter that determines the learning speed. The key point is how to compute the partial-derivative term.
{{w}_{ji}^{(l)}}_{new} = {w_{ji}^{(l)}}_{old} - \epsilon\cfrac{\partial E_n}{\partial {w_{ji}^{(l)}}_{old}}
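In code, one update step is simply the following (a schematic sketch with a hypothetical helper name; how the gradient itself is computed is derived below):

```python
def sgd_step(w_old, grad, epsilon):
    # one stochastic gradient descent step: w_new = w_old - epsilon * dEn/dw
    # grad is the partial derivative of En with respect to w for the current sample
    return w_old - epsilon * grad
```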
Now, I will explain how to find $\cfrac{\partial E_n}{\partial w_{ji}^{(l)}}$. Since the calculation is slightly different for the output layer and the intermediate layer, they are explained separately.
**Output layer**
\begin{align}
\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} &= \cfrac{\partial}{\partial w_{ji}^{(l)}}\cfrac{1}{2}(t_j - g(u_{j}^{(l)}))^{2} \\
&= (t_j - g(u_{j}^{(l)}))\cdot\cfrac{\partial}{\partial w_{ji}^{(l)}}(t_j - g(u_{j}^{(l)})) \\
&= (t_j - g(u_{j}^{(l)}))\cdot\cfrac{\partial}{\partial u_{j}^{(l)}}(t_j - g(u_{j}^{(l)}))\cdot\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} \\
\\
&= (g(u_{j}^{(l)}) - t_j)\cdot g'(u_{j}^{(l)})\cdot g(u_{i}^{(l-1)}) \\
\\
&= (g(u_{j}^{(l)}) - t_j)g(u_{j}^{(l)})(1 - g(u_{j}^{(l)}))g(u_{i}^{(l-1)})
\end{align}
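To make this output-layer formula concrete, here is a small numerical example (the values of $u_{j}^{(l)}$, $u_{i}^{(l-1)}$, and $t_j$ are hypothetical and not from the article):

```python
import numpy

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + numpy.exp(-x))

u_j = 0.8   # total input of output unit j (hypothetical)
u_i = -0.3  # total input of hidden unit i in the previous layer (hypothetical)
t_j = 1.0   # teacher signal for output unit j (hypothetical)

# dEn/dw_ji = (g(u_j) - t_j) * g(u_j) * (1 - g(u_j)) * g(u_i)
grad = (g(u_j) - t_j) * g(u_j) * (1.0 - g(u_j)) * g(u_i)
print(grad)  # negative here, so subtracting epsilon * grad increases w_ji and pushes g(u_j) toward t_j
```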
**Intermediate layer**
\begin{align}
\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} &= \cfrac{\partial E_n}{\partial u_{j}^{(l)}}\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} \\
\\
&= \delta_{j}^{(l)}\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}}
\end{align}
The first term on the right side is
\begin{align}
\delta_{j}^{(l)} &= \cfrac{\partial E_n}{\partial u_{j}^{(l)}} \\
&= \sum_{k}\cfrac{\partial E_n}{\partial u_{k}^{(l+1)}}\cfrac{\partial u_{k}^{(l+1)}}{\partial u_{j}^{(l)}} \\
\\
&= \sum_{k}\delta_{k}^{(l+1)}w_{kj}^{(l+1)}g'(u_{j}^{(l)}) \\
\\
&= \Bigl(\sum_{k}\delta_{k}^{(l+1)}w_{kj}^{(l+1)}\Bigr)g(u_{j}^{(l)})(1 - g(u_{j}^{(l)}))
\end{align}
The second term on the right side is
\begin{align}
\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} &= g(u_{i}^{(l-1)})
\end{align}
From the above,
\begin{align}
\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} &= \delta_{j}^{(l)}\cfrac{\partial u_{j}^{(l)}}{\partial w_{ji}^{(l)}} \\
\\
&= \Bigl(\sum_{k}\delta_{k}^{(l+1)}w_{kj}^{(l+1)}\Bigr)g(u_{j}^{(l)})(1 - g(u_{j}^{(l)}))g(u_{i}^{(l-1)})
\end{align}
To update the weights of an intermediate layer $l$, the $\delta_{k}^{(l+1)}$ of the next layer $l+1$ is required. The $\delta$ of the output layer can be obtained from the difference between the output value and the teacher signal, and by propagating it in order from the output layer toward the input layer, the weights of the intermediate layers are updated. This is why the method is called error backpropagation.
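A vectorized sketch of this backward pass with numpy (a minimal sketch; the variable names and numerical values are mine, but the formula is the one derived above):

```python
import numpy

def g(x):
    # sigmoid activation
    return 1.0 / (1.0 + numpy.exp(-x))

# hypothetical state of a small network: 2 hidden units, 2 output units
u_hidden = numpy.array([0.4, -0.2])   # total inputs u_j^(l) of the hidden layer
z = g(u_hidden)                       # hidden outputs g(u_j^(l))
W_out = numpy.array([[0.5, -0.3],     # w_kj^(l+1): row k = output unit, column j = hidden unit
                     [0.2,  0.7]])
delta_out = numpy.array([0.1, -0.05]) # delta_k^(l+1) already computed in the output layer

# delta_j^(l) = (sum_k delta_k^(l+1) * w_kj^(l+1)) * g(u_j^(l)) * (1 - g(u_j^(l)))
delta_hidden = W_out.T.dot(delta_out) * z * (1.0 - z)

# dEn/dw_ji^(l) = delta_j^(l) * g(u_i^(l-1)), with a hypothetical previous-layer output
z_prev = numpy.array([0.6, 0.9])
grad = delta_hidden.reshape((-1, 1)) * z_prev  # grad[j, i] = delta_j * z_prev[i]
print(grad)
```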
XNOR
XNOR outputs $1$ when the two input values are the same and $0$ when they are different. It is not linearly separable.
| $x_1$ | $x_2$ | XNOR |
|:---:|:---:|:---:|
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
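As a function, XNOR is simply equality of the two inputs. The snippet below (not part of the article's code) also shows the one-hot teacher signal used as `T` in main.py, where index 0 corresponds to the output $1$:

```python
def xnor(x1, x2):
    # 1 when the two inputs are equal, 0 otherwise
    return int(x1 == x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        t = [1, 0] if xnor(x1, x2) == 1 else [0, 1]  # one-hot teacher signal as in main.py's T
        print(x1, x2, xnor(x1, x2), t)
```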
I implemented a neural network that identifies XNOR. There is one intermediate layer, and it has two units. The learning rate is $\epsilon = 0.1$ and the momentum coefficient is $\mu = 0.9$. (Momentum is one of the methods for improving convergence: the previous weight update, multiplied by a coefficient, is added to the current update.)
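Written as a formula (my reading of the description above), the update with momentum becomes

\Delta w_{ji}^{(l)}(t) = -\epsilon\cfrac{\partial E_n}{\partial w_{ji}^{(l)}} + \mu\,\Delta w_{ji}^{(l)}(t-1) \\
{w_{ji}^{(l)}}_{new} = {w_{ji}^{(l)}}_{old} + \Delta w_{ji}^{(l)}(t)

where $\Delta w_{ji}^{(l)}(t)$ is the weight correction applied at step $t$.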
neuralnetwork.py
```python
import numpy
import math
import random
from matplotlib import pyplot


class Neural:
    # constructor
    def __init__(self, n_input, n_hidden, n_output):
        # the +1 in each weight matrix is for the bias unit
        self.hidden_weight = numpy.random.random_sample((n_hidden, n_input + 1))
        self.output_weight = numpy.random.random_sample((n_output, n_hidden + 1))
        self.hidden_momentum = numpy.zeros((n_hidden, n_input + 1))
        self.output_momentum = numpy.zeros((n_output, n_hidden + 1))

    # public methods
    def train(self, X, T, epsilon, mu, epoch):
        self.error = numpy.zeros(epoch)
        N = X.shape[0]
        for epo in range(epoch):
            for i in range(N):
                x = X[i, :]
                t = T[i, :]
                self.__update_weight(x, t, epsilon, mu)
            self.error[epo] = self.__calc_error(X, T)

    def predict(self, X):
        N = X.shape[0]
        C = numpy.zeros(N).astype('int')
        Y = numpy.zeros((N, self.output_weight.shape[0]))  # one row per sample, one column per output unit
        for i in range(N):
            x = X[i, :]
            z, y = self.__forward(x)
            Y[i] = y
            C[i] = y.argmax()
        return (C, Y)

    def error_graph(self):
        pyplot.ylim(0.0, 2.0)
        pyplot.plot(numpy.arange(0, self.error.shape[0]), self.error)
        pyplot.show()

    # private methods
    def __sigmoid(self, arr):
        return numpy.vectorize(lambda x: 1.0 / (1.0 + math.exp(-x)))(arr)

    def __forward(self, x):
        # z: output of the hidden layer, y: output of the output layer
        z = self.__sigmoid(self.hidden_weight.dot(numpy.r_[numpy.array([1]), x]))
        y = self.__sigmoid(self.output_weight.dot(numpy.r_[numpy.array([1]), z]))
        return (z, y)

    def __update_weight(self, x, t, epsilon, mu):
        z, y = self.__forward(x)

        # update output_weight
        output_delta = (y - t) * y * (1.0 - y)
        _output_weight = self.output_weight.copy()  # copy, because "-=" modifies the array in place
        self.output_weight -= epsilon * output_delta.reshape((-1, 1)) * numpy.r_[numpy.array([1]), z] - mu * self.output_momentum
        self.output_momentum = self.output_weight - _output_weight

        # update hidden_weight (delta uses the pre-update output weights, as in the derivation)
        hidden_delta = _output_weight[:, 1:].T.dot(output_delta) * z * (1.0 - z)
        _hidden_weight = self.hidden_weight.copy()
        self.hidden_weight -= epsilon * hidden_delta.reshape((-1, 1)) * numpy.r_[numpy.array([1]), x] - mu * self.hidden_momentum
        self.hidden_momentum = self.hidden_weight - _hidden_weight

    def __calc_error(self, X, T):
        # squared error summed over all samples
        N = X.shape[0]
        err = 0.0
        for i in range(N):
            x = X[i, :]
            t = T[i, :]
            z, y = self.__forward(x)
            err += ((y - t) ** 2).sum() / 2.0
        return err
```
main.py
```python
from neuralnetwork import *


if __name__ == '__main__':
    X = numpy.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # input patterns
    T = numpy.array([[1, 0], [0, 1], [0, 1], [1, 0]])  # one-hot teacher signals (index 0 = XNOR is 1)
    N = X.shape[0]  # number of data
    input_size = X.shape[1]
    hidden_size = 2
    output_size = 2
    epsilon = 0.1
    mu = 0.9
    epoch = 10000

    nn = Neural(input_size, hidden_size, output_size)
    nn.train(X, T, epsilon, mu, epoch)
    nn.error_graph()

    C, Y = nn.predict(X)

    for i in range(N):
        x = X[i, :]
        y = Y[i, :]
        c = C[i]
        print(x)
        print(y)
        print(c)
        print("")
```
The network identifies XNOR correctly.
| $x_1$ | $x_2$ | teacher $t$ | output $y$ |
|:---:|:---:|:---:|:---:|
| 0 | 0 | [1, 0] | [0.92598739, 0.07297757] |
| 0 | 1 | [0, 1] | [0.06824915, 0.93312514] |
| 1 | 0 | [0, 1] | [0.06828438, 0.93309010] |
| 1 | 1 | [1, 0] | [0.92610205, 0.07220633] |
I made a graph showing how the error is decreasing. The horizontal axis is the number of epochs and the vertical axis is the error.
I implemented a 3-layer neural network and was able to identify XNOR. There is still more to study, such as training tricks (how to choose the initial weights, how to choose the learning rate, AdaGrad, momentum), but I will stop here.