In this article, we create an MLP (Multi Layer Perceptron) in Python 3 using only NumPy and SciPy. As a concrete example, we build an MLP that classifies handwritten digit images from the MNIST handwritten digit database.
software | version |
---|---|
Python | 3.5 |
Numpy | 1.10 |
Scipy | 0.16 |
Matplotlib | 1.5 |
tqdm | 4.7 |
python-mnist | 0.3 |
This article was written for Deep Learning (with Python) Vol. 10, an MPS (Morning Project Samurai) Yokohama event held on July 23, 2016 (https://mpsamurai.doorkeeper.jp/events/49018). Questions and suggestions are welcome.
MLP (Multi Layer Perceptron)
First, consider a simple MLP as shown in the figure below. In reality, all nodes of layer $i$ and layer $i+1$ are fully connected, but for clarity only some of the connections are drawn.
Here, the symbols are defined as follows.
FP (Forward Propagation)
The FP from layer $ i-1 $ to layer $ i $ looks like this:
y_{i} = f_{i}(W_{i}y_{i-1} + b_{i})
Here, let $y_{i} = (y_{ij})$, $W_{i} = (w_{ijk})$, and $b_{i} = (b_{ij})$.
The Python code for the FP of a two-layer MLP, with an input layer of three nodes and an output layer of one node, looks like this:
```python
import numpy as np


# Activation function
def sigmoid(s):
    return 1/(1 + np.exp(-s))


# x: input vector, W: weight matrix, b: bias, y: output vector
x = np.random.randn(3, 1)
W = np.random.randn(1, 3)
b = np.random.randn(1, 1)
y = sigmoid(W @ x + b)
```
If you define a class Layer that has the above operation as a method, you can easily create an MLP. The following is an example of a network that has one intermediate layer in addition to the input layer and output layer.
```python
import numpy as np


class Layer:
    def __init__(self, W, b, f):
        self._W = W
        self._b = b
        self._f = f

    def propagate_forward(self, x):
        return self._f(self._W @ x + self._b)


def sigmoid(s):
    return 1/(1 + np.exp(-s))


if __name__ == '__main__':
    # Input vector (Layer 0)
    x = np.random.randn(3, 1)

    # Middle layer (Layer 1)
    W1 = np.random.randn(3, 3)
    b1 = np.random.randn(3, 1)
    layer1 = Layer(W1, b1, sigmoid)

    # Output layer (Layer 2)
    W2 = np.random.randn(3, 3)
    b2 = np.random.randn(3, 1)
    layer2 = Layer(W2, b2, sigmoid)

    # FP
    y1 = layer1.propagate_forward(x)
    y2 = layer2.propagate_forward(y1)
    print(y2)
```
At this point, we have the structure of the MLP and can compute its output for any input. The following example feeds it the handwritten digit 9 shown in the figure below.
```python
from scipy import misc
import numpy as np
from matplotlib import pyplot as plt


class Layer:
    def __init__(self, W, b, f):
        self._W = W
        self._b = b
        self._f = f

    def propagate_forward(self, x):
        return self._f(self._W @ x + self._b)


def sigmoid(s):
    return 1/(1 + np.exp(-s))


if __name__ == '__main__':
    # Input vector (Layer 0)
    x = misc.imread('img/9.png', flatten=True).flatten()
    x = x.reshape((len(x), 1))

    # Middle layer (Layer 1)
    n_output_1 = len(x)
    W1 = np.random.randn(n_output_1, len(x))
    b1 = np.random.randn(n_output_1, 1)
    layer1 = Layer(W1, b1, sigmoid)

    # Output layer (Layer 2)
    n_output_2 = 10
    W2 = np.random.randn(n_output_2, n_output_1)
    b2 = np.random.randn(n_output_2, 1)
    layer2 = Layer(W2, b2, sigmoid)

    # FP
    y1 = layer1.propagate_forward(x)
    y2 = layer2.propagate_forward(y1)

    # Weight histograms (computed here but not plotted)
    hist_W1, bins_W1 = np.histogram(W1.flatten())
    hist_W2, bins_W2 = np.histogram(W2.flatten())

    # Draw bar chart
    index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    plt.title('Prediction')
    plt.bar(index, y2.flatten(), align="center")
    plt.xticks(index, index)
    plt.show()
```
Since we use the sigmoid function as the activation function, the outputs of the 10 nodes in the output layer fall between 0 and 1. If we interpret the output of the $i$-th node in the output layer as the probability that the input image is the digit $i$, the result above predicts that the image is most likely a 2, 6, or 7, which is useless as a classifier (the weights are still random).
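For reference, turning the output vector into a concrete prediction is just a matter of taking the index of the largest output; the MNIST code later in this article does exactly this with np.argmax. A minimal sketch (using a random stand-in for the $y_{2}$ produced above):

```python
import numpy as np

# Stand-in for the (10, 1) output vector y2 produced by the script above
y2 = np.random.rand(10, 1)

predicted_digit = np.argmax(y2)  # index of the node with the largest output
print('predicted digit:', predicted_digit)
print('output value of that node:', y2[predicted_digit, 0])
```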
EF (Error Function)
An EF is a measure of how far the current state of the MLP is from the ideal state. For example, in the handwritten image recognition above, the correct answer is 9, so ideally we would obtain the following output.
y_{2} = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0)^{T}
SE (Squared Error) is one of the commonly used error functions and is expressed as follows.
E = \frac{1}{2}\sum_{j} (t_{j} - y_{ij})^{2} = \frac{1}{2}(t-y_{i})^{T}(t-y_{i})
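As a quick sanity check of this formula with made-up numbers (not data from the experiment): for a two-dimensional target $t = (0, 1)^{T}$ and output $y = (0.2, 0.7)^{T}$, $E = \frac{1}{2}(0.2^{2} + 0.3^{2}) = 0.065$, and the vector form gives the same value:

```python
import numpy as np

t = np.array([[0.0], [1.0]])
y = np.array([[0.2], [0.7]])

# E = (t - y)^T (t - y) / 2
E = ((t - y).T @ (t - y)).flatten()[0] / 2
print(E)  # 0.065
```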
The code for finding the SE when a handwritten 9 is input is shown below.
```python
from scipy import misc
import numpy as np


class Layer:
    def __init__(self, W, b, f):
        self._W = W
        self._b = b
        self._f = f

    def propagate_forward(self, x):
        return self._f(self._W @ x + self._b)


def sigmoid(s):
    return 1/(1 + np.exp(-s))


def se(t, y):
    return ((t - y).T @ (t - y)).flatten()[0]/2


if __name__ == '__main__':
    # Input vector (Layer 0)
    x = misc.imread('img/9.png', flatten=True).flatten()
    x = x.reshape((len(x), 1))

    # Middle layer (Layer 1)
    n_output_1 = len(x)
    W1 = np.random.randn(n_output_1, len(x))
    b1 = np.random.randn(n_output_1, 1)
    layer1 = Layer(W1, b1, sigmoid)

    # Output layer (Layer 2)
    n_output_2 = 10
    W2 = np.random.randn(n_output_2, n_output_1)
    b2 = np.random.randn(n_output_2, 1)
    layer2 = Layer(W2, b2, sigmoid)

    # FP
    y1 = layer1.propagate_forward(x)
    y2 = layer2.propagate_forward(y1)

    # Calculate SE against the one-hot target for the digit 9
    t = np.zeros(shape=(10, 1))
    t[9, 0] = 1.0
    e = se(t, y2)
    print(e)
```
BP (Back Propagation)
BP propagates information backward through the network. Learning with BP consists of two steps: the information required for learning is propagated backward through the network, and the parameters are then updated using that information.
MLP training relies on the fact that, in the space whose coordinates are the parameters to be trained, moving the parameters in the direction opposite to the gradient of the EF decreases the EF value the most in the neighborhood of the current parameters. BP is used to propagate the information needed to compute this gradient efficiently.
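To illustrate the idea in isolation (this is not part of the MLP code), here is a minimal gradient descent on the toy one-parameter error function $E(w) = (w - 3)^{2}$, whose gradient is $2(w - 3)$; the MLP updates below apply exactly the same rule to the weights and biases:

```python
# Minimal gradient descent on the toy error function E(w) = (w - 3)**2
epsilon = 0.1  # learning rate
w = 0.0        # initial parameter value
for _ in range(50):
    grad = 2 * (w - 3)   # dE/dw at the current w
    w -= epsilon * grad  # step against the gradient
print(w)  # converges to 3, the minimizer of E
```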
First, consider learning the output layer of a three-layer MLP.
The parameter we want to adjust is the weight $w_{2ji}$, so we take the partial derivative of the EF $E$ with respect to it and obtain the following equation.
\frac{\partial{E}}{\partial{w_{2ji}}} =
\frac{\partial{E}}{\partial{y_{2j}}}
\frac{\partial{y_{2j}}}{\partial{s_{2j}}}
\frac{\partial{s_{2j}}}{\partial{w_{2ji}}}
This expresses that, to see how $E$ changes when $w_{2ji}$ is changed, we examine how $s_{2j}$ changes when $w_{2ji}$ is changed, then how $y_{2j}$ changes when $s_{2j}$ changes, and finally how $E$ changes when $y_{2j}$ changes. This chain of relationships is intuitive when compared with the MLP diagram.
Here, we define $\delta_{2j}$ as follows.
\frac{\partial{E}}{\partial{y_{2j}}}\frac{\partial{y_{2j}}}{\partial{s_{2j}}} = \delta_{2j}
Also, from $s_{2j} = \sum_{i} w_{2ji} y_{1i} + b_{2j}$, the following equation is obtained.
\frac{\partial{s_{2j}}}{\partial{w_{2ji}}} = y_{1i}
The following equations are obtained by combining the previous equations.
\frac{\partial{E}}{\partial{w_{2ji}}} = \delta_{2j}y_{1i}
Learning can then be done with the following update formula. The arrow means that the variable on the left side is replaced by the value on the right side. $\epsilon$ is a hyperparameter called the learning rate; adjusting it speeds up or slows down learning.
w_{2ji} \leftarrow w_{2ji} - \epsilon \delta_{2j}y_{1i}
Now, collect the components into a vector,
\delta_{2} = (\delta_{2j})
then the discussion so far can be summarized in matrix form as follows.
\Delta W_{2} = (\frac{\partial{E}}{\partial{w_{2ji}}})
= \delta_{2}y_{1}^{T}
W_{2} \leftarrow W_{2} - \epsilon \Delta W_{2}
Considering the Bias term in the same way, the following equation can be obtained.
\frac{\partial{E}}{\partial{b_{2j}}} = \delta_{2j}
Therefore, the update formula for the bias term is as follows.
b_{2} \leftarrow b_{2} - \epsilon \delta_{2}
As a concrete example, consider using the sigmoid function as the AF and SE as the EF. Then:
\frac{\partial{E}}{\partial{y_{2j}}} = -(t_{j}-y_{2j})
\frac{\partial{y_{2j}}}{\partial{s_{2j}}} = y_{2j}(1-y_{2j})
Therefore, the following equation is obtained.
\delta_{2j} = -(t_{j}-y_{2j})y_{2j}(1-y_{2j})
This $\delta_{2} = (\delta_{2j})$ can be used to train both the weight and bias parameters.
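To make the connection to the code that follows explicit, the output-layer update for the sigmoid/SE case boils down to a few lines of NumPy. This is only a sketch with randomly initialized stand-ins for $y_{1}$, $W_{2}$, $b_{2}$ and the target $t$; the full script below performs the same steps inside its training loop:

```python
import numpy as np

epsilon = 0.1
y1 = np.random.rand(3, 1)               # stand-in for the middle-layer output
W2 = np.random.randn(10, 3)
b2 = np.random.randn(10, 1)
t = np.zeros((10, 1))
t[9, 0] = 1.0                           # one-hot target for the digit 9

y2 = 1 / (1 + np.exp(-(W2 @ y1 + b2)))  # FP through the output layer
delta2 = -(t - y2) * y2 * (1 - y2)      # delta_2j = -(t_j - y_2j) y_2j (1 - y_2j)
W2 -= epsilon * (delta2 @ y1.T)         # W2 <- W2 - epsilon * delta2 y1^T
b2 -= epsilon * delta2                  # b2 <- b2 - epsilon * delta2
```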
Below is code that trains only the output layer of the three-layer MLP to recognize the handwritten digit 9.
```python
from scipy import misc
import numpy as np
from matplotlib import pyplot as plt


class Layer:
    def __init__(self, W, b, f):
        self._W = W
        self._b = b
        self._f = f

    def propagate_forward(self, x):
        return self._f(self._W @ x + self._b)


def sigmoid(s):
    return 1/(1 + np.exp(-s))


def d_sigmoid(y):
    return y * (1 - y)


def se(t, y):
    return ((t - y).T @ (t - y)).flatten()[0]/2


def d_se(t, y):
    return -(t - y)


if __name__ == '__main__':
    # Input vector (Layer 0)
    x = misc.imread('img/9.png', flatten=True).flatten()
    x = x.reshape((len(x), 1))

    # Middle layer (Layer 1)
    n_output_1 = len(x)
    W1 = np.random.randn(n_output_1, len(x))
    b1 = np.random.randn(n_output_1, 1)
    layer1 = Layer(W1, b1, sigmoid)

    # Output layer (Layer 2)
    n_output_2 = 10
    W2 = np.random.randn(n_output_2, n_output_1)
    b2 = np.random.randn(n_output_2, 1)
    layer2 = Layer(W2, b2, sigmoid)

    # Training datum
    t = np.zeros(shape=(10, 1))
    t[9, 0] = 1.0

    # FP, BP and learning
    epsilon = 0.1
    se_history = []
    delta2_history = []
    for i in range(0, 100):
        # FP
        y1 = layer1.propagate_forward(x)
        y2 = layer2.propagate_forward(y1)

        # Calculate and store SE
        se_history.append(se(t, y2))

        # BP and learning (output layer only)
        delta2 = d_se(t, y2) * d_sigmoid(y2)
        Delta_W2 = delta2 @ y1.T
        layer2._W -= epsilon * Delta_W2
        layer2._b -= epsilon * delta2

        # Store delta2 history
        delta2_history.append(np.linalg.norm(delta2))

    # FP with the trained parameters
    y1 = layer1.propagate_forward(x)
    y2 = layer2.propagate_forward(y1)

    # Draw SE history
    plt.figure()
    plt.title('SE History')
    plt.plot(range(len(se_history)), se_history)

    # Draw delta2 history
    plt.figure()
    plt.title('delta2 history')
    plt.plot(range(len(delta2_history)), delta2_history)

    # Draw bar chart
    index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    plt.figure()
    plt.title('Prediction')
    plt.bar(index, y2.flatten(), align="center")
    plt.xticks(index, index)
    plt.show()
```
The figure below shows how the SE changes over 100 training iterations. The SE drops sharply in the early stage of training, and after about the 5th iteration further training has almost no effect.
To confirm the cause, a graph of the magnitude of $\delta_{2}$ is shown below. From about the 5th iteration onward, its magnitude is nearly $0$.
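This behavior follows directly from the definition $\delta_{2j} = -(t_{j} - y_{2j})\,y_{2j}(1 - y_{2j})$: the first factor shrinks as the output approaches the target, and the factor $y_{2j}(1 - y_{2j})$ is at most $0.25$ and approaches $0$ as the sigmoid output saturates toward $0$ or $1$. A quick check of the second factor:

```python
# The sigmoid-derivative factor y * (1 - y) that appears in delta_2
for y in [0.5, 0.9, 0.99, 0.999]:
    print(y, y * (1 - y))
# approximately: 0.25, 0.09, 0.0099, 0.000999
```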
The MLP output at this time is as follows.
Next, consider the learning of the middle layer of the 3-layer MLP. Below is another diagram of the 3-layer MLP.
As with the training of the output layer, we first take the partial derivative of the EF with respect to the weight $w_{1ih}$ we want to train. For example, changing $w_{100}$ affects $s_{10}$ and then $y_{10}$, which in turn affects the inputs $s_{2j}$ of all nodes in the output layer and their outputs $y_{2j}$, and ultimately, through them, the value of the EF. Taking this into account, we expand and organize the partial derivative.
\begin{eqnarray}
\frac{\partial{E}}{\partial{w_{1ih}}}
&=& \sum_{j}\frac{\partial{E}}{\partial{y_{2j}}}
\frac{\partial{y_{2j}}}{\partial{s_{2j}}}
\frac{\partial{s_{2j}}}{\partial{y_{1i}}}
\frac{\partial{y_{1i}}}{\partial{s_{1i}}}
\frac{\partial{s_{1i}}}{\partial{w_{1ih}}}\\
&=& (\sum_{j}\frac{\partial{E}}{\partial{y_{2j}}}
\frac{\partial{y_{2j}}}{\partial{s_{2j}}}
\frac{\partial{s_{2j}}}{\partial{y_{1i}}})
\frac{\partial{y_{1i}}}{\partial{s_{1i}}}
\frac{\partial{s_{1i}}}{\partial{w_{1ih}}}\\
&=&(\sum_{j}\delta_{2j} w_{2ji})
\frac{\partial{y_{1i}}}{\partial{s_{1i}}}y_{0h}\\
&=&\delta_{1i}y_{0h}
\end{eqnarray}
Here, $\delta_{1i}$ is defined as follows.
\delta_{1i} = (\sum_{j}\delta_{2j} w_{2ji})
\frac{\partial{y_{1i}}}{\partial{s_{1i}}}
Note that the expression obtained at the end of the transformation above has exactly the same form as the one derived for learning the weights of the output layer. Both formulas are reproduced below.
\begin{eqnarray}
\frac{\partial{E}}{\partial{w_{2ji}}} &=& \delta_{2j}y_{1i}\\
\frac{\partial{E}}{\partial{w_{1ih}}} &=& \delta_{1i}y_{0h}
\end{eqnarray}
The first is used for learning the output layer, and the second for learning the middle layer.
Also note that $\delta_{1i}$ is computed from the $\delta_{2j}$ obtained when training the output layer. In other words, if the $\delta_{2j}$ of the output layer are known (i.e., propagated back from the output layer), the $\delta_{1i}$ of the middle layer can be obtained without complicated calculations, and the middle layer can be trained with them. In MLP learning, this $\delta$ is called the error, and propagating it backward is what BP refers to.
In matrix form, the learning of the middle-layer weights is organized as follows.
\delta_{1} = (W_{2}^{T} \delta_{2}) \circ \frac{\partial{y_{1}}}{\partial{s_{1}}}
where $\circ$ denotes the Hadamard product, i.e., element-wise multiplication.
\Delta W_{1} = \delta_{1} y_{0}^T
where $y_{0}$ is the vector of outputs of the nodes in the input layer. The formula for updating the weights is as follows.
W_{1} \leftarrow W_{1} - \epsilon \Delta W_{1}
Please try to derive the Bias term by yourself.
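As a bridge to the full MNIST script, here is the middle-layer BP step written out in NumPy, again as a sketch with random stand-ins for the input, weights, biases and target (the bias update you are asked to derive is included as the last line; note that `*` is the element-wise Hadamard product, while `@` is matrix multiplication):

```python
import numpy as np

epsilon = 0.1
y0 = np.random.rand(5, 1)                 # stand-in for the input vector
W1 = np.random.randn(3, 5)
b1 = np.random.randn(3, 1)
W2 = np.random.randn(10, 3)
b2 = np.random.randn(10, 1)
t = np.zeros((10, 1))
t[9, 0] = 1.0

# FP
y1 = 1 / (1 + np.exp(-(W1 @ y0 + b1)))
y2 = 1 / (1 + np.exp(-(W2 @ y1 + b2)))

# BP: delta2 at the output layer, then delta1 propagated back through W2
delta2 = -(t - y2) * y2 * (1 - y2)
delta1 = (W2.T @ delta2) * y1 * (1 - y1)  # delta_1 = (W2^T delta2) o dy1/ds1

# Learning for the middle layer
W1 -= epsilon * (delta1 @ y0.T)           # W1 <- W1 - epsilon * delta1 y0^T
b1 -= epsilon * delta1                    # bias update (same pattern as the output layer)
```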
Based on the discussion so far, code that trains the 3-layer MLP to classify handwritten digits using the MNIST handwritten digit database is shown below. As before, the output layer has 10 nodes, and the $i$-th node represents the probability that the digit written in the given image is $i$. The index of the node with the highest probability is adopted as the final recognition result.
MNIST provides 60,000 labeled handwritten digit images as training data. Here we use the first 1,000 of them for training and the rest as test data. The training data is passed over 400 times, for a total of 400,000 training iterations.
```python
import numpy as np
from matplotlib import pyplot as plt
from tqdm import tqdm


class Layer:
    def __init__(self, W, b, f):
        self._W = W
        self._b = b
        self._f = f

    def propagate_forward(self, x):
        return self._f(self._W @ x + self._b)


def sigmoid(s):
    return 1/(1 + np.exp(-s))


def d_sigmoid(y):
    return y * (1 - y)


def se(t, y):
    return ((t - y).T @ (t - y)).flatten()[0] / 2.0


def d_se(t, y):
    return -(t - y)


def ma(history, n):
    # Moving average over a window of n, padded with zeros for the first n-1 points
    return np.array([0, ] * (n - 1) + [np.average(history[i - n: i]) for i in range(n, len(history) + 1)])


if __name__ == '__main__':
    from mnist import MNIST

    # Load MNIST dataset
    mndata = MNIST('./mnist')
    train_img, train_label = mndata.load_training()
    train_img = np.array(train_img, dtype=float)/255.0
    train_label = np.array(train_label, dtype=int)  # integer labels so they can be used as indices

    # Input vector (Layer 0)
    n_output_0 = len(train_img[0])

    # Middle layer (Layer 1)
    n_output_1 = 200
    W1 = np.random.randn(n_output_1, n_output_0)
    b1 = np.random.randn(n_output_1, 1)
    layer1 = Layer(W1, b1, sigmoid)

    # Output layer (Layer 2)
    n_output_2 = 10
    W2 = np.random.randn(n_output_2, n_output_1)
    b2 = np.random.randn(n_output_2, 1)
    layer2 = Layer(W2, b2, sigmoid)

    # FP, BP and learning
    epsilon = 0.15
    n_training_data = 1000
    se_history = []
    y1_history = []
    y2_history = []
    W1_history = []
    W2_history = []
    cpr_history = []
    for loop in range(400):
        for i in tqdm(range(n_training_data)):
            # Store W1 and W2 history
            W1_history.append(np.linalg.norm(layer1._W))
            W2_history.append(np.linalg.norm(layer2._W))

            # FP
            x = train_img[i].reshape(len(train_img[i]), 1)
            y1 = layer1.propagate_forward(x)
            y2 = layer2.propagate_forward(y1)

            # Store y1 and y2
            y1_history.append(y1)
            y2_history.append(y2)

            # Training datum (one-hot target)
            t = np.zeros(shape=(10, 1))
            t[train_label[i], 0] = 1.0

            # Calculate and store SE
            se_history.append(se(t, y2))

            # BP
            delta2 = d_se(t, y2) * d_sigmoid(y2)
            delta1 = layer2._W.T @ delta2 * d_sigmoid(y1)

            # Learning
            Delta_W2 = delta2 @ y1.T
            layer2._W -= epsilon * Delta_W2
            layer2._b -= epsilon * delta2

            Delta_W1 = delta1 @ x.T
            layer1._W -= epsilon * Delta_W1
            layer1._b -= epsilon * delta1

        # FP to evaluate correct prediction rate on 100 held-out samples
        n_correct_prediction = 0
        n_prediction = 0
        for _i in np.random.choice(np.arange(n_training_data, train_img.shape[0]), 100):
            _x = train_img[_i].reshape(len(train_img[_i]), 1)
            _y1 = layer1.propagate_forward(_x)
            _y2 = layer2.propagate_forward(_y1)
            n_prediction += 1
            if train_label[_i] == np.argmax(_y2):
                n_correct_prediction += 1
        cpr_history.append(n_correct_prediction/n_prediction)

    # Draw W1
    plt.figure()
    plt.title('W1 history')
    plt.plot(range(len(W1_history)), W1_history)

    # Draw W2
    plt.figure()
    plt.title('W2 history')
    plt.plot(range(len(W2_history)), W2_history)

    # Draw SE history and its moving average
    plt.figure()
    plt.title('SE History')
    plt.plot(range(len(se_history)), se_history, color='green')
    plt.plot(range(len(se_history)), ma(se_history, 100), color='red')

    # Draw CPR history
    plt.figure()
    plt.title('CPR')
    plt.plot(range(len(cpr_history)), cpr_history)
    plt.show()
```
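Once training has finished, classifying a single held-out image is one forward pass followed by np.argmax. For example, the following lines could be appended at the end of the `__main__` block above (index 5000 is arbitrary; any index of at least n_training_data lies outside the training set):

```python
# Classify one held-out MNIST image with the trained network (sketch)
_x = train_img[5000].reshape(-1, 1)
_y = layer2.propagate_forward(layer1.propagate_forward(_x))
print('predicted:', np.argmax(_y), 'true label:', train_label[5000])
```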
The figure below shows how the SE of the MLP changes, with the number of training iterations on the horizontal axis. Green shows the SE obtained for the sample presented at each iteration, and red is the moving average of the green series over a window of 100 samples, i.e., the average SE over 100 consecutive iterations. The SE of individual iterations oscillates strongly, but on average it becomes smaller.
The figure below shows how the recognition rate for handwritten digits changes. Since the prediction accuracy is measured once every 1,000 training iterations (once per pass over the training data), the actual number of iterations is the value on the horizontal axis multiplied by 1,000. Here, too, the accuracy improves on average as training proceeds, although individual measurements fluctuate; this suggests that combining several MLPs may give better predictions than relying on a single one.
What we implemented in this article is a basic MLP and its training. In the next session of MPS Yokohama Deep Learning from Mathematical Basics (with Python), we will build on and refactor this implementation.
Volunteers have created wonderful materials for this project, so please refer to them as well (in no particular order).
Related Documents | Author |
---|---|
Summary up to MPS Yokohama 8th | Saito |
Organize Back Propagation | n-ken |
Derivative code in Python | takaneh |
MPS Yokohama 9th Supplementary Course Note | Terasaki |
Chain rule for composite functions | Terasaki |
Chain rule for one-variable functions | Terasaki |
Summary and questions (in progress) | Wataru |
We have also uploaded the slides and event videos used in previous events on SlideShare and YouTube, so if you are interested, please take a look.
Session | Slides | Study session video |
---|---|---|
1 | Vol. 1 | YouTube |
2 | Vol. 2 | YouTube |
3 | Vol. 3 | YouTube |
4 | Vol. 4 | YouTube |
5 | Vol. 5 | YouTube |
6 | Vol. 6 | YouTube |
7 | Vol. 7 | YouTube |
8 | Vol. 8 | YouTube |
9 | Vol. 9 | YouTube |
This project is made possible by the great cooperation of many people: the Information Science College and Professor Muto, who lend us the venue, projector, and WiFi every time and record and upload the videos; Sakura Internet, which provides the cloud environment; Mr. Saito, who writes the summary of each session and gives accurate advice on running the study group; Mr. Ueno, who leads this study group as the leader of MPS Yokohama; Mr. Wataru, who helps run the study group every morning; and Mr. Terasaki, Mr. Hirara, and Mr. n-ken, who prepare materials and contribute questions and comments every time. I would also like to thank Mr. Masaki, who has supported MPS activities for the two years since I met him. I would like to take this opportunity to thank you all.
2016/07/22 Junya Kaneko