Chainer is a library for implementing neural networks, developed by Preferred Networks. Its features are as follows (from the homepage).
Personally, I would like to add one more feature: it is easy to install. Many deep learning frameworks are troublesome to install, but Chainer has few dependencies and used to install easily... although it has depended on Cython since 1.5.0, which made installation a bit more of a hassle. Please refer to the following for the installation method.
In addition, since Chainer's notation is intuitive and simple as described above, it can cover everything from simple networks to more complex, so-called deep learning territory. Other deep learning libraries tend to be complete overkill when the network is not deep, while simple libraries (such as PyBrain) struggle to go deep, so I think this is also a big advantage.
This time I will explain how to use this attractive Chainer, but to handle Chainer, knowledge of (relatively deep) neural networks is indispensable. As a result, it often happens that the knowledge on the neural network side is what falls short (it happened to me).
Therefore, I would like to briefly explain the mechanism of the neural network first, and then explain how to implement it with Chainer in a later stage.
The structure of a neural network is as follows (as an aside, drawing the lines between the nodes is a hassle every time).
Let's take a closer look at how the input travels from the input layer to the output. The figure below shows how the inputs reach the first node of the hidden layer.
You can see that four inputs are transmitted. The inputs are not passed on directly as they are; they are weighted first. Neural networks mimic the structure of neurons in the brain: think of the inputs (stimuli) as being weakened or strengthened as they propagate. Expressed mathematically, if the input is $x$ and its weight is $a$, it is transmitted as $ax$.
Now the node has received the input $ax$, but it does not simply pass this value on to the next layer. In the brain, an input does not seem to propagate to the next layer unless it exceeds a certain threshold, and here too this is imitated: the received input is converted into the output to the next layer. Expressed mathematically, if the function that converts the input into the output to the next layer is $h$, the output value can be written as $h(ax)$. This function $h$ is called the activation function.
In summary, there are two important factors for value propagation in neural networks:
In short, a neural network simply weights the inputs it receives and outputs the result. Therefore, a single-layer neural network is almost synonymous with linear regression or logistic regression.
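To make the two factors concrete, here is a minimal sketch in plain NumPy of what a single node does (the input values and weights are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # activation function h: squashes the weighted input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -0.3, 2.0])  # four inputs
a = np.array([0.2, -0.4, 0.1, 0.7])  # one weight per input
z = np.dot(a, x)                     # weighted sum, i.e. "ax"
output = sigmoid(z)                  # h(ax): the value passed to the next layer
print(output)
```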
With that in mind, it becomes clear what the manipulation of the number of nodes and the number of layers means.
When dealing with neural networks, people tend to fiddle with the number of nodes and the number of layers more or less at random, but it is also important to plot the data carefully and look for an appropriate number of nodes and layers.
To train a neural network, we use a technique called backpropagation. The error is the difference between the value output by the neural network and the actual value. Backpropagation, as the name implies, is a method of propagating this error from the back (the output layer) and adjusting the weights of each layer.
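To make the "adjusting the weight" part concrete: in the simplest form (plain gradient descent, which is not specific to Chainer), each weight $w$ is nudged in the direction that reduces the error $E$, as in $w \leftarrow w - \eta \frac{\partial E}{\partial w}$, where $\eta$ is the learning rate.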
I will not go into the details of backpropagation here, since there are plenty of explanations elsewhere, but the following two points are important.
In addition, there are several methods for how to use the training data when performing the above operation of "calculating the error and updating the weights" (using all of the data at once, splitting it into mini-batches, or feeding it one sample at a time).
One epoch is the cycle of going through all of the training data once and finishing the weight updates with it. Usually you run this epoch several times. However, simply repeating it as-is does not work very well, so the training data is shuffled at each epoch, and in the mini-batch case the position from which the mini-batch is taken is shifted or sampled randomly.
This epoch is an important unit in neural network training, used for things like checking the progress of learning and readjusting parameters (see the sketch below for the shuffling and mini-batch slicing just described).
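A minimal sketch of that per-epoch shuffling and mini-batch slicing in plain NumPy (`x_train` and `y_train` are placeholders for the training data); the Chainer MNIST example shown later does essentially the same thing:

```python
import numpy as np

n_epoch, batchsize = 20, 100
N = len(x_train)                     # number of training samples (x_train / y_train are placeholders)

for epoch in range(n_epoch):
    perm = np.random.permutation(N)  # reshuffle the data every epoch
    for i in range(0, N, batchsize):
        x_batch = x_train[perm[i:i + batchsize]]
        y_batch = y_train[perm[i:i + batchsize]]
        # ... compute the error on this mini-batch and update the weights ...
```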
Here is a summary of the contents of the explanation of neural networks.
Now, let's look at the implementation in Chainer and the above points.
In Chainer, a neural network is built as a `Chain` (`FunctionSet` up to 1.4).
The following is a definition of the 4-3-2 type neural network used in the explanation so far.
```python
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L


class MyChain(Chain):

    def __init__(self):
        super(MyChain, self).__init__(
            l1=L.Linear(4, 3),
            l2=L.Linear(3, 2)
        )

    def __call__(self, x):
        h = F.sigmoid(self.l1(x))
        o = self.l2(h)
        return o
```
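For illustration, here is what running a dummy batch through this model would look like (the input values are made up; the input must be a float32 array with one row per sample):

```python
import numpy as np
import chainer

model = MyChain()
x = chainer.Variable(np.array([[1.0, 0.5, -0.3, 2.0]], dtype=np.float32))  # one sample, four features
y = model(x)          # forward pass: 4 -> 3 -> 2
print(y.data.shape)   # -> (1, 2)
```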
Note
It seems you do not strictly have to inherit from `Chain`, but you cannot do CPU/GPU migration or model saving without inheriting from `Chain`, so I think you should just inherit it obediently. If it is a simple fully connected network, there is no need to create a class; `Chain(l1=..., l2=...)` seems to be fine.
From 1.5, functions with parameters (= the targets of optimization) are clearly separated into `Link`, and pure functions (sigmoid, etc.) into `Function`.
Some of you may be wondering whether the hidden layer was not supposed to be a single layer. For the `l1` and `l2` above, please refer to the figure below.
If you think of them as the propagation between layers, it makes sense that there are two of them. In fact, `L.Linear` holds the weights for that propagation and is responsible for applying them to the input. The propagation process itself is implemented in `__call__` of the `Chain` class, as shown above.
```python
def __call__(self, x):
    h = F.sigmoid(self.l1(x))
    o = self.l2(h)
    return o
```
Note
The process that was written in `forward` up to 1.4 is now written in `__call__` (in Python, if you define `__call__`, you can invoke it directly from an instance; for example, writing `model()` on an instance named `model` calls the process written in `__call__`).
Here, the input `x` is weighted (`self.l1(x)`), and the value passed through the sigmoid function, which is often used as an activation function, is handed to the next layer (`h = F.sigmoid(self.l1(x))`). The final output needs no processing for passing it on to a next layer, so no activation function is applied (`o = self.l2(h)`).
When training, you first need to calculate the error between the predicted value and the actual value. You can implement this simply as a function (commonly named `lossfun` in Chainer), but for classification problems it is easier to use `Classifier`.
```python
from chainer.functions.loss.mean_squared_error import mean_squared_error

model = L.Classifier(MyChain(), lossfun=mean_squared_error)
```
Actually, `Classifier` is itself a `Link`, that is, a function with parameters: in its `__call__` it calculates the error between the value output by `MyChain` and the teacher data (the `Function` used for that calculation can of course be specified; `mean_squared_error` in the example above). In 1.5, the fact that `Link`s can be connected like this is a very big point, and it makes models far more reusable. Even in the example above, you can see that the main model and the process of calculating the error with it can be written separately.
After calculating the error, we optimize the model so as to minimize it (the backpropagation described above). It is the `optimizer` that plays this role; the learning part of the MNIST example looks like the following.
```python
# Setup optimizer
optimizer = optimizers.Adam()
optimizer.setup(model)

# ...(Omission)...

# Learning loop
for epoch in six.moves.range(1, n_epoch + 1):
    print('epoch', epoch)

    # training
    perm = np.random.permutation(N)
    sum_accuracy = 0
    sum_loss = 0
    for i in six.moves.range(0, N, batchsize):
        # xp is numpy (CPU) or cuda.cupy (GPU) in the MNIST example
        x = chainer.Variable(xp.asarray(x_train[perm[i:i + batchsize]]))
        t = chainer.Variable(xp.asarray(y_train[perm[i:i + batchsize]]))

        # Pass the loss function (Classifier defines it) and its arguments
        optimizer.update(model, x, t)
```
There are three basic steps: create an `optimizer`, attach the model to it with `setup`, and call `update` in the training loop.
At the core is `optimizer.update`. From 1.5, by passing a `lossfun` as an argument, the error calculation and propagation with that `lossfun` are performed automatically. Of course, as before, it is also possible to initialize the gradients with `model.zerograds()`, calculate and propagate the error yourself (`loss.backward()`), and then call `optimizer.update()` with no arguments.
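For reference, a sketch of that manual version of one update step, using the same variable names as the MNIST snippet above:

```python
# manual version of one update step (equivalent to optimizer.update(model, x, t))
model.zerograds()    # reset the accumulated gradients
loss = model(x, t)   # forward pass + loss calculation (Classifier)
loss.backward()      # backpropagate the error
optimizer.update()   # update the parameters with the computed gradients
```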
As you can see, Chainer is designed so that once you have defined your model, you can easily optimize it (Define-by-Run).
And the trained model can easily be saved and restored by using `Serializer` (the `optimizer` can also be saved).
```python
serializers.save_hdf5('my.model', model)
serializers.load_hdf5('my.model', model)
```
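Since the `optimizer` can be saved as well, that would look like this (the file name is arbitrary):

```python
serializers.save_hdf5('my.state', optimizer)
serializers.load_hdf5('my.state', optimizer)
```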
After that, here are some tips for actually implementing it.
- NumPy arrays are float64 by default, but Chainer expects float32 input, so be careful to convert them.
- `softmax_cross_entropy`, which is often used in classification problems, assumes that the teacher data is of type int32 (representing a label). Note that passing it as a float will cause an error.

Perhaps the first thing you will get stuck on is type errors. I do not know whether Chainer "begins with types and ends with types", but it certainly begins with types, so please keep this point in mind when using it.
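As an illustration of those two type tips (the data here is randomly generated):

```python
import numpy as np
import chainer

# NumPy creates float64 arrays by default; convert explicitly for Chainer
x_data = np.random.rand(10, 4).astype(np.float32)      # features: float32
t_data = np.random.randint(0, 2, 10).astype(np.int32)  # labels for softmax_cross_entropy: int32

x = chainer.Variable(x_data)
t = chainer.Variable(t_data)
```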