Chainer is a library for implementing neural networks, developed by Preferred Networks. Its features are as follows (from the homepage).
Personally, I would like to add one more feature: it is easy to install. Many deep learning frameworks are troublesome to install, but Chainer has few dependencies and used to install easily... although it has depended on Cython since 1.5.0, which made installation a bit more of a hassle. Please refer to the following for the installation method.
In addition, since Chainer's notation is intuitive and simple as described above, it can cover everything from simple networks to more complex, so-called deep learning territory. Other deep learning libraries tend to be complete overkill when the network is not deep, while simple libraries (such as PyBrain) struggle to go deep, so I think this is also a big advantage.
This time I will explain how to use this attractive Chainer, but to handle Chainer, knowledge of (relatively deep) neural networks is indispensable. As a result, it often happens that the knowledge on the neural network side is what falls short (it happened to me).
Therefore, I would like to briefly explain the mechanism of the neural network first, and then explain how to implement it with Chainer in a later stage.
The structure of a neural network is as follows (as an aside, drawing the lines between the nodes is a hassle every time).
Let's take a closer look at how the input travels from the input layer to the output. The figure below shows how the inputs reach the first node of the hidden layer.
You can see that four inputs are transmitted. The inputs are not passed on directly as they are; they are weighted first. Neural networks mimic the structure of neurons in the brain: think of the inputs (stimuli) as being weakened or strengthened as they propagate. Expressed mathematically, if the input is $x$ and its weight is $a$, it is transmitted as $ax$.
Now the node has received the input $ax$, but it does not simply pass this value on to the next layer. In the brain, an input does not seem to propagate to the next layer unless it exceeds a certain threshold, and here too this is imitated: the received input is converted into the output to the next layer. Expressed mathematically, if the function that converts the input into the output to the next layer is $h$, the output value can be written as $h(ax)$. This function $h$ is called the activation function.
In summary, there are two important factors for value propagation in neural networks:
In short, a neural network simply weights the inputs it receives and outputs the result. Therefore, a single-layer neural network is almost synonymous with linear regression or logistic regression.
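To make the two factors concrete, here is a minimal sketch in plain NumPy of what a single node does (the input values and weights are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # activation function h: squashes the weighted input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5, -0.3, 2.0])  # four inputs
a = np.array([0.2, -0.4, 0.1, 0.7])  # one weight per input
z = np.dot(a, x)                     # weighted sum, i.e. "ax"
output = sigmoid(z)                  # h(ax): the value passed to the next layer
print(output)
```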
With that in mind, it becomes clear what the manipulation of the number of nodes and the number of layers means.
When dealing with neural networks, people tend to fiddle with the number of nodes and the number of layers more or less at random, but it is also important to plot the data carefully and look for an appropriate number of nodes and layers.
To train a neural network, we use a technique called backpropagation. The error is the difference between the value output by the neural network and the actual value. Backpropagation, as the name implies, is a method of propagating this error from the back (the output layer) and adjusting the weights of each layer.
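To make the "adjusting the weight" part concrete: in the simplest form (plain gradient descent, which is not specific to Chainer), each weight $w$ is nudged in the direction that reduces the error $E$, as in $w \leftarrow w - \eta \frac{\partial E}{\partial w}$, where $\eta$ is the learning rate.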
I will not go into the details of backpropagation here, since there are plenty of explanations elsewhere, but the following two points are important.
In addition, there are several methods for how to use the training data when performing the above operation of "calculating the error and updating the weights" (using all of the data at once, splitting it into mini-batches, or feeding it one sample at a time).
One epoch is the cycle of going through all of the training data once and finishing the weight updates with it. Usually you run this epoch several times. However, simply repeating it as-is does not work very well, so the training data is shuffled at each epoch, and in the mini-batch case the position from which the mini-batch is taken is shifted or sampled randomly.
This epoch is an important unit in neural network training, used for things like checking the progress of learning and readjusting parameters (see the sketch below for the shuffling and mini-batch slicing just described).
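A minimal sketch of that per-epoch shuffling and mini-batch slicing in plain NumPy (`x_train` and `y_train` are placeholders for the training data); the Chainer MNIST example shown later does essentially the same thing:

```python
import numpy as np

n_epoch, batchsize = 20, 100
N = len(x_train)                     # number of training samples (x_train / y_train are placeholders)

for epoch in range(n_epoch):
    perm = np.random.permutation(N)  # reshuffle the data every epoch
    for i in range(0, N, batchsize):
        x_batch = x_train[perm[i:i + batchsize]]
        y_batch = y_train[perm[i:i + batchsize]]
        # ... compute the error on this mini-batch and update the weights ...
```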
Here is a summary of the contents of the explanation of neural networks.
Now, let's look at the implementation in Chainer and the above points.
In Chainer, a neural network is built as a `Chain` (`FunctionSet` up to 1.4).
The following is a definition of the 4-3-2 type neural network used in the explanation so far.
```python
from chainer import Link, Chain, ChainList
import chainer.functions as F
import chainer.links as L


class MyChain(Chain):

    def __init__(self):
        super(MyChain, self).__init__(
            l1=L.Linear(4, 3),
            l2=L.Linear(3, 2)
        )

    def __call__(self, x):
        h = F.sigmoid(self.l1(x))
        o = self.l2(h)
        return o
```
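For illustration, here is what running a dummy batch through this model would look like (the input values are made up; the input must be a float32 array with one row per sample):

```python
import numpy as np
import chainer

model = MyChain()
x = chainer.Variable(np.array([[1.0, 0.5, -0.3, 2.0]], dtype=np.float32))  # one sample, four features
y = model(x)          # forward pass: 4 -> 3 -> 2
print(y.data.shape)   # -> (1, 2)
```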
Note
It seems you do not strictly have to inherit from `Chain`, but you cannot do CPU/GPU migration or model saving without inheriting from `Chain`, so I think you should just inherit it obediently. If it is a simple fully connected network, there is no need to create a class; `Chain(l1=..., l2=...)` seems to be fine.
From 1.5, functions with parameters (= the targets of optimization) are clearly separated into `Link`, and pure functions (sigmoid, etc.) into `Function`.
Some of you may be wondering whether the hidden layer was not supposed to be a single layer. For the `l1` and `l2` above, please refer to the figure below.
If you think of them as the propagation between layers, it makes sense that there are two of them. In fact, `L.Linear` holds the weights for that propagation and is responsible for applying them to the input. The propagation process itself is implemented in `__call__` of the `Chain` class, as shown above.
```python
def __call__(self, x):
    h = F.sigmoid(self.l1(x))
    o = self.l2(h)
    return o
```
Note
The process that was written in `forward` up to 1.4 is now written in `__call__` (in Python, if you define `__call__`, you can invoke it directly from an instance; for example, writing `model()` on an instance named `model` calls the process written in `__call__`).
Here, the input `x` is weighted (`self.l1(x)`), and the value passed through the sigmoid function, which is often used as an activation function, is handed to the next layer (`h = F.sigmoid(self.l1(x))`). The final output needs no processing for passing it on to a next layer, so no activation function is applied (`o = self.l2(h)`).
When training, you first need to calculate the error between the predicted value and the actual value. You can implement this simply as a function (commonly named `lossfun` in Chainer), but for classification problems it is easier to use `Classifier`.
```python
from chainer.functions.loss.mean_squared_error import mean_squared_error

model = L.Classifier(MyChain(), lossfun=mean_squared_error)
```
Actually, `Classifier` is itself a `Link`, that is, a function with parameters: in its `__call__` it calculates the error between the value output by `MyChain` and the teacher data (the `Function` used for that calculation can of course be specified; `mean_squared_error` in the example above). In 1.5, the fact that `Link`s can be connected like this is a very big point, and it makes models far more reusable. Even in the example above, you can see that the main model and the process of calculating the error with it can be written separately.
After calculating the error, we optimize the model so as to minimize it (the backpropagation described above). It is the `optimizer` that plays this role; the learning part of the MNIST example looks like the following.
```python
# Setup optimizer
optimizer = optimizers.Adam()
optimizer.setup(model)

# ...(Omission)...

# Learning loop
for epoch in six.moves.range(1, n_epoch + 1):
    print('epoch', epoch)

    # training
    perm = np.random.permutation(N)
    sum_accuracy = 0
    sum_loss = 0
    for i in six.moves.range(0, N, batchsize):
        # xp is numpy (CPU) or cuda.cupy (GPU) in the MNIST example
        x = chainer.Variable(xp.asarray(x_train[perm[i:i + batchsize]]))
        t = chainer.Variable(xp.asarray(y_train[perm[i:i + batchsize]]))

        # Pass the loss function (Classifier defines it) and its arguments
        optimizer.update(model, x, t)
```
There are three basic steps: create an `optimizer`, attach the model to it with `setup`, and call `update` in the training loop.
At the core is `optimizer.update`. From 1.5, by passing a `lossfun` as an argument, the error calculation and propagation with that `lossfun` are performed automatically. Of course, as before, it is also possible to initialize the gradients with `model.zerograds()`, calculate and propagate the error yourself (`loss.backward()`), and then call `optimizer.update()` with no arguments.
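For reference, a sketch of that manual version of one update step, using the same variable names as the MNIST snippet above:

```python
# manual version of one update step (equivalent to optimizer.update(model, x, t))
model.zerograds()    # reset the accumulated gradients
loss = model(x, t)   # forward pass + loss calculation (Classifier)
loss.backward()      # backpropagate the error
optimizer.update()   # update the parameters with the computed gradients
```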
As you can see, Chainer is designed so that once you have defined your model, you can easily optimize it (Define-by-Run).
And the trained model can easily be saved and restored by using `Serializer` (the `optimizer` can also be saved).
```python
serializers.save_hdf5('my.model', model)
serializers.load_hdf5('my.model', model)
```
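Since the `optimizer` can be saved as well, that would look like this (the file name is arbitrary):

```python
serializers.save_hdf5('my.state', optimizer)
serializers.load_hdf5('my.state', optimizer)
```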
After that, here are some tips for actually implementing it.
- NumPy arrays are float64 by default, but Chainer expects float32 input, so be careful to convert them.
- `softmax_cross_entropy`, which is often used in classification problems, assumes that the teacher data is of type int32 (representing a label). Note that passing it as a float will cause an error.

Perhaps the first thing you will get stuck on is type errors. I do not know whether Chainer "begins with types and ends with types", but it certainly begins with types, so please keep this point in mind when using it.
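As an illustration of those two type tips (the data here is randomly generated):

```python
import numpy as np
import chainer

# NumPy creates float64 arrays by default; convert explicitly for Chainer
x_data = np.random.rand(10, 4).astype(np.float32)      # features: float32
t_data = np.random.randint(0, 2, 10).astype(np.int32)  # labels for softmax_cross_entropy: int32

x = chainer.Variable(x_data)
t = chainer.Variable(t_data)
```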