[PYTHON] Linear multiple regression, logistic regression, multi-layer perceptron, autoencoder, Chainer yo!

Use Chainer to learn the basic ideas of deep learning.

Basic components of Chainer

chainer.Variable

In Chainer, variables are represented by a class called Variable.

import numpy as np
from chainer import Variable

x1 = Variable(np.array([0.12]).astype(np.float32))
x2 = Variable(np.array([0.34]).astype(np.float32))
x3 = Variable(np.array([0.56]).astype(np.float32))

The "forward calculation" is performed as follows. In deep learning, coefficients like these (0.5, 0.3, 0.2 below) are called "weights" (parameters), and the goal of training is to find their optimal values.

z = 0.5 * x1 + 0.3 * x2 + 0.2 * x3

The result of the operation is also an object of the Variable class.

z
variable([0.274])

You can use the .data attribute to reference the numpy.ndarray object that is the content of the data.

z.data
array([0.274], dtype=float32)

Perform the "reverse calculation" (backward pass) as follows. Its purpose is to "backpropagate" the error of the forward calculation so that the weights mentioned above can be fine-tuned.

z.backward()

The reverse calculation gives, for each variable, the derivative (gradient) value needed to fine-tune the weights.

x1.grad, x2.grad, x3.grad
(array([0.5], dtype=float32),
 array([0.3], dtype=float32),
 array([0.2], dtype=float32))

So far each variable held only a single element, but you can also use arrays with multiple elements, as shown below.

x1 = Variable(np.array([0.12, 0.21]).astype(np.float32))
x2 = Variable(np.array([0.34, 0.43]).astype(np.float32))
x3 = Variable(np.array([0.56, 0.65]).astype(np.float32))
z = 0.5 * x1 + 0.3 * x2 + 0.2 * x3
z
variable([0.274, 0.364])

When the output z has more than one element, you need to set the shape of its gradient (z.grad) before doing the "reverse calculation".

z.grad = np.ones(2, dtype=np.float32)
z.backward()
x1.grad, x2.grad, x3.grad
(array([0.5, 0.5], dtype=float32),
 array([0.3, 0.3], dtype=float32),
 array([0.2, 0.2], dtype=float32))
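
The gradients above are constant because z is a linear combination of the inputs. As a small additional sketch (not in the original article), multiplying two Variables gives gradients that depend on the other input:

x1 = Variable(np.array([2.0, 3.0]).astype(np.float32))
x2 = Variable(np.array([4.0, 5.0]).astype(np.float32))
z = x1 * x2 # element-wise product of two Variables
z.grad = np.ones(2, dtype=np.float32) # shape of the output gradient, as above
z.backward()
x1.grad, x2.grad # expected: dz/dx1 = x2 = [4, 5], dz/dx2 = x1 = [2, 3]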

chainer.links.Linear

Chainer provides a linear transformation as chainer.links.Linear, which is used when passing data from one layer to the next in a neural network (NN). A linear transformation is expressed as

y = Wx + b

Before using chainer.links.Linear, let's express a linear transformation with numpy.

import numpy as np
W = np.array([[ 5,  1, -2 ],
              [ 3, -5, -1 ]], dtype=np.float32)

b = np.array([2, -3], dtype=np.float32)

If there is only one input sample (a single data point consisting of 3 variables), it can be calculated as follows.

x = np.array([0, 1, 2])
y = x.dot(W.T) + b
y
array([ -1., -10.])

If there are 5 input samples (5 data points, each consisting of 3 variables), it can be calculated as follows. This is the strength of matrix calculation: even a large amount of data can be processed in a single batch.

x = np.array(range(15)).astype(np.float32).reshape(5, 3)
x
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.],
       [12., 13., 14.]], dtype=float32)
y = x.dot(W.T) + b
y
array([[ -1., -10.],
       [ 11., -19.],
       [ 23., -28.],
       [ 35., -37.],
       [ 47., -46.]], dtype=float32)

A similar calculation can be achieved with chainer.links.Linear.

import chainer.links as L
h = L.Linear(3, 2) # A linear layer y = Wx + b that takes a 3-dimensional input and returns a 2-dimensional output
h.W.data # W is initialized with random values by default
array([[ 0.5469049 , -0.35929427, -0.9921321 ],
       [ 1.4973897 ,  0.620568  ,  0.78245926]], dtype=float32)
h.b.data # b is initialized to a zero vector by default
array([0., 0.], dtype=float32)
x = Variable(np.array(range(15)).astype(np.float32).reshape(5, 3))
x.data # Variable.data is a Numpy array object
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.],
       [12., 13., 14.]], dtype=float32)
y = h(x)
y.data
array([[ -2.3435585,   2.1854866],
       [ -4.757123 ,  10.886737 ],
       [ -7.170687 ,  19.587988 ],
       [ -9.584251 ,  28.289238 ],
       [-11.997816 ,  36.990486 ]], dtype=float32)
x.data.dot(h.W.data.T) + h.b.data # verify the result by computing it manually
array([[ -2.3435585,   2.1854866],
       [ -4.757123 ,  10.886737 ],
       [ -7.170687 ,  19.587988 ],
       [ -9.584251 ,  28.289238 ],
       [-11.997816 ,  36.990486 ]], dtype=float32)
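
The parameters of a link can also be set by hand. As a quick sketch (not in the original article), copying the W and b from the numpy example above into the link reproduces the earlier result:

h.W.data[...] = W # copy the numpy weight matrix defined above into the link
h.b.data[...] = b # copy the numpy bias vector as well
y = h(x)
y.data # should now match x.dot(W.T) + b from the numpy example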

chainer.functions

Chainer provides various functions in chainer.functions that take an object of Variable class as an argument. A typical example is the mean squared error chainer.functions.mean_squared_error.

import chainer.functions as F

y_pred = Variable(np.array([0.1, 0.2, 0.3]).astype(np.float32))
y_real = Variable(np.array([0.2, 0.1, 0.3]).astype(np.float32))
loss = F.mean_squared_error(y_pred, y_real)
loss
variable(0.00666667)
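
For classification problems, another commonly used loss in chainer.functions is softmax_cross_entropy, which takes raw scores and integer class labels. A minimal sketch (not in the original article; the names scores and labels are just for illustration):

scores = Variable(np.array([[2.0, 1.0, 0.1],
                            [0.3, 2.5, 0.2]]).astype(np.float32)) # raw (pre-softmax) scores
labels = np.array([0, 1]).astype(np.int32) # true class indices
loss = F.softmax_cross_entropy(scores, labels)
loss # a scalar Variable (the mean cross-entropy over the two samples)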

Multiple Linear Regression (MLR)

Now, as practice with Chainer, let's do multiple linear regression. We will use the Iris dataset, which is often used in the field of machine learning.

Iris data

import numpy as np
from sklearn import datasets
iris = datasets.load_iris() # Load the iris data
data = iris.data.astype(np.float32)
X = data[:, :3] # Use the first three iris measurements as explanatory variables
Y = data[:, 3].reshape(len(data), 1) # Use the last measurement as the objective variable
# Odd-numbered rows are the training data, even-numbered rows are the test data
index = np.arange(Y.size)
X_train = X[index[index % 2 != 0], :] # Explanatory variables (training data)
X_test = X[index[index % 2 == 0], :] # Explanatory variables (test data)
Y_train = Y[index[index % 2 != 0], :] # Objective variable (training data)
Y_test = Y[index[index % 2 == 0], :] # Objective variable (test data)
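
As an aside, the odd/even split above is equivalent to plain slicing; a small sketch (not in the original article):

# Odd-numbered rows (1, 3, 5, ...) for training, even-numbered rows (0, 2, 4, ...) for testing
assert (X_train == X[1::2]).all() and (X_test == X[0::2]).all()
assert (Y_train == Y[1::2]).all() and (Y_test == Y[0::2]).all()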

chainer.Sequential

Use chainer.Sequential to define the structure of the neural network.

import chainer.links as L
from chainer import Sequential
n_input = 3 #Input data is 3 variables
n_output = 1 #Output data is one variable
mlr = Sequential( #Define neural network
    L.Linear(n_input, n_output) #Neural network consisting of only one layer
)

When first defined, the weights are initialized with random values.

mlr[0].W.data, mlr[0].b.data # You can inspect the weights like this
(array([[ 0.8395325 ,  0.26789278, -0.4547218 ]], dtype=float32),
 array([0.], dtype=float32))

You can perform the "forward calculation" as follows. However, since the weights are still at their random initial values, the resulting predictions are also essentially random.

Y_pred = mlr(X_test) #Forward calculation
Y_pred
variable([[4.5826297],
          [4.211921 ],
          [4.525466 ],
          [4.136074 ],
          [3.8342214],
          [4.842596 ],
          [4.196824 ],
          [5.3951936],
          [4.9871187],
          [5.0303006],
          [4.6712837],
          [4.3715415],
          [4.07662  ],
          [4.380943 ],
          [4.6397934],
          [4.132669 ],
          [4.7818465],
          [4.2620945],
          [4.9639153],
          [3.9064832],
          [4.544149 ],
          [3.9600616],
          [4.435637 ],
          [4.5720534],
          [4.758643 ],
          [4.596792 ],
          [4.395105 ],
          [4.11534  ],
          [4.0359087],
          [4.226083 ],
          [3.1419218],
          [3.807672 ],
          [3.841272 ],
          [3.458812 ],
          [3.7482173],
          [3.6278338],
          [3.73065  ],
          [4.1945934],
          [4.276256 ],
          [3.7678359],
          [3.5324283],
          [3.8191838],
          [3.2909057],
          [4.318143 ],
          [3.6407008],
          [3.313174 ],
          [3.7469225],
          [3.5148606],
          [3.6523929],
          [3.5871825],
          [3.44477  ],
          [4.0815   ],
          [3.623253 ],
          [2.7371933],
          [3.657213 ],
          [3.995137 ],
          [4.01153  ],
          [3.300307 ],
          [3.7596698],
          [4.02334  ],
          [4.058117 ],
          [4.1678634],
          [3.9169993],
          [3.7725363],
          [3.5766659],
          [4.188837 ],
          [3.5766659],
          [3.2712274],
          [3.653448 ],
          [3.6582088],
          [3.908893 ],
          [3.2735178],
          [3.9169993],
          [3.6851778],
          [3.660439 ]])

chainer.optimizers

chainer.optimizers offers a variety of optimization techniques. One of them is stochastic gradient descent (SGD).

from chainer import optimizers
optimizer = optimizers.SGD(lr=0.01) #Select SGD as the optimization method
optimizer.setup(mlr) #Set up the defined network
<chainer.optimizers.sgd.SGD at 0x7f2e090bb7f0>

Compare the predicted values Y_pred obtained above with the actual observed values Y_train to obtain the mean squared error (MSE). (Strictly speaking, Y_pred above was computed from X_test while Y_train holds the training targets; this only runs because both splits contain 75 samples. The consolidated code further down uses X_train consistently.) There are several other ways to define the error. In deep learning, this error is called the "loss", and the function that computes it is called the "loss function". Training aims to minimize this loss.

import chainer.functions as F
loss = F.mean_squared_error(Y_pred, Y_train)
loss
variable(9.235707)

Do the reverse calculation and update the weights to reduce the error as follows:

mlr.cleargrads() # Clear the gradients
loss.backward() # Reverse calculation (backpropagation)
optimizer.update() # Update the weights

Let's confirm that the weight values have changed.

mlr[0].W.data, mlr[0].b.data
(array([[ 0.51864076,  0.08801924, -0.63399327]], dtype=float32),
 array([-0.05653327], dtype=float32))

Repeat the above calculation until the loss converges. (Note that a bare %time on its own line, as in the cells below, only times the empty statement following it, not the loop; to time the whole cell you would use %%time instead.)

%time
for i in range(50):
    Y_pred = mlr(X_test) # Forward calculation (uses X_test, as above; see the note following the loss definition)
    loss = F.mean_squared_error(Y_pred, Y_train) # Loss calculation
    mlr.cleargrads() # Clear the gradients
    loss.backward() # Reverse calculation
    optimizer.update() # Update the weights
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs

Make sure the loss is reduced.

loss
variable(0.15386367)

From beginning to end

The above flow can be summarized as follows.

%time
import numpy as np
import chainer.links as L
import chainer.functions as F
from chainer import Sequential
from chainer import optimizers
from sklearn import datasets

iris = datasets.load_iris()
data = iris.data.astype(np.float32)
X = data[:, :3]
Y = data[:, 3].reshape(len(data), 1)

index = np.arange(Y.size)
X_train = X[index[index % 2 != 0], :]
X_test = X[index[index % 2 == 0], :]
Y_train = Y[index[index % 2 != 0], :]
Y_test = Y[index[index % 2 == 0], :]

n_input = 3
n_output = 1
mlr = Sequential(
    L.Linear(n_input, n_output)
)

optimizer = optimizers.SGD(lr=0.01)
optimizer.setup(mlr)

loss_history = []
for i in range(100):
    Y_pred = mlr(X_train)
    loss = F.mean_squared_error(Y_pred, Y_train)
    loss_history.append(np.mean(loss.data))
    mlr.cleargrads()
    loss.backward()
    optimizer.update()
CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 4.77 µs
loss
variable(0.05272739)
mlr[0].W.data, mlr[0].b.data
(array([[ 0.11841334, -0.2642416 ,  0.324636  ]], dtype=float32),
 array([0.08475045], dtype=float32))
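
To get a rough idea of generalization, we can also compute the loss on the test split; a small sketch (not in the original article):

Y_pred_test = mlr(X_test) # forward calculation on the test data
test_loss = F.mean_squared_error(Y_pred_test, Y_test)
test_loss # should be of a similar order to the training loss if the model generalizes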

Since we recorded the loss at each iteration, let's plot it and check whether it has converged.

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(loss_history)
[<matplotlib.lines.Line2D at 0x7f2e08bfcbe0>]

(Figure: plot of loss_history over the iterations)

It seems to have converged. Now let's look at the y-y plot comparing the predicted and observed values. The closer the points are to the diagonal, the better the prediction.

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.scatter(Y_train.flatten(), mlr(X_train).data.flatten(), alpha=0.5, label='train')
plt.plot([min(Y), max(Y)], [min(Y), max(Y)])
plt.grid()
plt.legend()
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.show()

(Figure: y-y plot of observed vs. predicted values on the training data)

Logistic Regression (LR)

Next, let's perform logistic regression, a method that fits data to the sigmoid (logistic) function and is also widely used for classification.
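
The sigmoid function squashes any real number into the interval (0, 1), which is what makes its output usable as a probability. A quick sketch of its values (not in the original article):

import numpy as np
import chainer.functions as F
from chainer import Variable
x = Variable(np.array([-2.0, 0.0, 2.0]).astype(np.float32))
F.sigmoid(x) # approximately [0.119, 0.5, 0.881]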

Iris data

This time, the four iris measurements are used as explanatory variables, and the iris species (three classes) is used as the objective variable.

import numpy as np
from sklearn import datasets
iris = datasets.load_iris() # Load the iris data
X = iris.data.astype(np.float32) # Use the 4 measurements as explanatory variables
Y = iris.target # Use the iris species (3 classes) as the objective variable
# Convert the iris species to one-hot vectors
Y_ohv = np.zeros(3 * Y.size).reshape(Y.size, 3).astype(np.float32)
for i in range(Y.size):
    Y_ohv[i, Y[i]] = 1.0 # one-hot vector
# Odd-numbered rows are the training data, even-numbered rows are the test data
index = np.arange(Y.size)
X_train = X[index[index % 2 != 0], :] # Explanatory variables (training data)
X_test = X[index[index % 2 == 0], :] # Explanatory variables (test data)
Y_train = Y_ohv[index[index % 2 != 0], :] # Objective variable one-hot vectors (training data)
Y_test = Y_ohv[index[index % 2 == 0], :] # Objective variable one-hot vectors (test data)
Y_ans_train = Y[index[index % 2 != 0]] # Objective variable class labels (training data)
Y_ans_test = Y[index[index % 2 == 0]] # Objective variable class labels (test data)
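
As an aside, the one-hot conversion loop above can also be written in one line with np.eye; a small equivalent sketch (not in the original article):

Y_ohv_alt = np.eye(3, dtype=np.float32)[Y] # row i is the one-hot vector for class Y[i]
assert (Y_ohv_alt == Y_ohv).all()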

Defining the neural network

from chainer import Sequential
import chainer.links as L
import chainer.functions as F
n_input = 4 #Input is 4 variables
n_output = 3 #Output is 3 variables
lr = Sequential(
    L.Linear(n_input, n_output), #Linear transformation
    F.sigmoid, #Sigmoid function
    F.softmax # Softmax function, which converts the outputs to positive real numbers that sum to 1
)

Optimization

from chainer import optimizers
optimizer = optimizers.SGD(lr=0.01)
optimizer.setup(lr)
<chainer.optimizers.sgd.SGD at 0x7f2e0878b7b8>
Y_pred = lr(X_train)
loss = F.mean_squared_error(Y_pred, Y_train)
loss
variable(0.21429048)
lr.cleargrads()
loss.backward()
optimizer.update()
Y_pred = lr(X_train)
loss = F.mean_squared_error(Y_pred, Y_train)
loss
variable(0.21422786)
%time
for i in range(50):
    Y_pred = lr(X_train)
    loss = F.mean_squared_error(Y_pred, Y_train)
    lr.cleargrads()
    loss.backward()
    optimizer.update()
CPU times: user 7 µs, sys: 1 µs, total: 8 µs
Wall time: 28.1 µs
loss
variable(0.21126282)

From beginning to end

The above flow can be summarized as follows.

%time
import numpy as np
import chainer.links as L
import chainer.functions as F
from chainer import Sequential
from chainer import optimizers
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data.astype(np.float32)
Y = iris.target

Y_ohv = np.zeros(3 * Y.size).reshape(Y.size, 3).astype(np.float32)
for i in range(Y.size):
    Y_ohv[i, Y[i]] = 1.0 # one-hot vector

index = np.arange(Y.size)
X_train = X[index[index % 2 != 0], :]
X_test = X[index[index % 2 == 0], :]
Y_train = Y_ohv[index[index % 2 != 0], :]
Y_test = Y_ohv[index[index % 2 == 0], :]
Y_ans_train = Y[index[index % 2 != 0]]
Y_ans_test = Y[index[index % 2 == 0]]

n_input = 4
n_output = 3
lr = Sequential(
    L.Linear(n_input, n_output),
    F.sigmoid,
    F.softmax
)

optimizer = optimizers.SGD(lr=0.01)
optimizer.setup(lr)

loss_history = []
for i in range(100000):
    Y_pred = lr(X_train)
    loss = F.mean_squared_error(Y_pred, Y_train)
    loss_history.append(np.mean(loss.data))
    lr.cleargrads()
    loss.backward()
    optimizer.update()
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs
loss
variable(0.14579579)

Since we recorded the loss at each iteration, let's plot it and check whether it has converged.

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(loss_history)
[<matplotlib.lines.Line2D at 0x7f2e06445080>]

(Figure: plot of loss_history over the iterations)

lr[0].W.data, lr[0].b.data
(array([[ 0.7060194 ,  1.5789627 , -2.7305322 , -1.5006719 ],
        [-0.6907905 , -0.43505952, -0.78199637, -0.06515903],
        [-1.7571408 , -2.1365883 ,  2.8683107 ,  2.772224  ]],
       dtype=float32),
 array([ 0.2556363 , -0.15823539, -0.9368208 ], dtype=float32))

Since the last layer is a softmax, which converts the outputs to positive real numbers summing to 1, each output can be interpreted as the probability of the corresponding species. Let's compute the accuracy when the species with the highest probability is taken as the prediction.

Y_pred = lr(X_train)
nrow, ncol = Y_pred.data.shape

count = 0
for i in range(nrow):
    cls = np.argmax(Y_pred.data[i, :])
    if cls == Y_ans_train[i]:
        count += 1

print(count, " / ", nrow, " = ", count / nrow)
50  /  75  =  0.6666666666666666
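
The same accuracy calculation can be applied to the test split; a small sketch (not in the original article; Y_pred_test and pred_cls are just illustrative names):

Y_pred_test = lr(X_test)
pred_cls = np.argmax(Y_pred_test.data, axis=1) # predicted class for each test sample
print(np.mean(pred_cls == Y_ans_test)) # fraction of correct predictions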

Multi-Layer Perceptron (MLP)

So far, we have created multiple linear regression and logistic regression models using Chainer. In the same way, stacking more layers gives you "deep learning". The simplest deep learning model is the multi-layer perceptron.

Classification by MLP

%time
import numpy as np
from chainer import Sequential
from chainer import optimizers
import chainer.links as L
import chainer.functions as F
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data.astype(np.float32)
Y = iris.target

Y_ohv = np.zeros(3 * Y.size).reshape(Y.size, 3).astype(np.float32)
for i in range(Y.size):
    Y_ohv[i, Y[i]] = 1.0 # one-hot vector

index = np.arange(Y.size)
X_train = X[index[index % 2 != 0], :]
X_test = X[index[index % 2 == 0], :]
Y_train = Y_ohv[index[index % 2 != 0], :]
Y_test = Y_ohv[index[index % 2 == 0], :]
Y_ans_train = Y[index[index % 2 != 0]]
Y_ans_test = Y[index[index % 2 == 0]]

n_input = 4
n_hidden = 6
n_output = 3
mlp = Sequential(
    L.Linear(n_input, n_hidden),
    F.sigmoid,
    L.Linear(n_hidden, n_output),
    F.softmax
)

optimizer = optimizers.SGD(lr=0.01)
optimizer.setup(mlp)

loss_history = []
for i in range(100000):
    Y_pred = mlp(X_train)
    loss = F.mean_squared_error(Y_pred, Y_train)
    loss_history.append(np.mean(loss.data))
    mlp.cleargrads()
    loss.backward()
    optimizer.update()
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.48 µs
loss
variable(0.01375966)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(loss_history)
[<matplotlib.lines.Line2D at 0x7f2e08746fd0>]

(Figure: plot of loss_history over the iterations)

Y_pred = mlp(X_train)
nrow, ncol = Y_pred.data.shape

count = 0
for i in range(nrow):
    cls = np.argmax(Y_pred.data[i, :])
    if cls == Y_ans_train[i]:
        count += 1

print(count, " / ", nrow, " = ", count / nrow)
74  /  75  =  0.9866666666666667
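
As with logistic regression, we can also check the accuracy on the test split; a small sketch (not in the original article):

Y_pred_test = mlp(X_test)
pred_cls = np.argmax(Y_pred_test.data, axis=1) # predicted class for each test sample
print(np.mean(pred_cls == Y_ans_test)) # fraction of correct predictions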

Regression by MLP

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
data = iris.data.astype(np.float32)
X = data[:, :3]
Y = data[:, 3].reshape(len(data), 1)
index = np.arange(Y.size)
X_train = X[index[index % 2 != 0], :]
X_test = X[index[index % 2 == 0], :]
Y_train = Y[index[index % 2 != 0], :]
Y_test = Y[index[index % 2 == 0], :]
%time
import numpy as np
from chainer import Sequential
from chainer import optimizers
import chainer.links as L
import chainer.functions as F
n_input = 3
n_hidden = 6
n_output = 1
mlpr = Sequential(
    L.Linear(n_input, n_hidden),
    F.sigmoid,
    L.Linear(n_hidden, n_output)
)

optimizer = optimizers.SGD(lr=0.01)
optimizer.setup(mlpr)

loss_history = []
for i in range(10000):
    Y_pred = mlpr(X_train)
    loss = F.mean_squared_error(Y_pred, Y_train)
    loss_history.append(np.mean(loss.data))
    mlpr.cleargrads()
    loss.backward()
    optimizer.update()
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.82 µs
loss
variable(0.04921096)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(loss_history)
[<matplotlib.lines.Line2D at 0x7f2e064200f0>]

(Figure: plot of loss_history over the iterations)

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(6,6))
plt.scatter(Y_train.flatten(), mlpr(X_train).data.flatten(), alpha=0.5, label='train')
plt.plot([min(Y), max(Y)], [min(Y), max(Y)])
plt.grid()
plt.legend()
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.show()

(Figure: y-y plot of observed vs. predicted values on the training data)
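
The same y-y comparison can be made for the test split; a small sketch (not in the original article):

%matplotlib inline
import matplotlib.pyplot as plt
Y_pred_test = mlpr(X_test) # forward calculation on the test data
print(F.mean_squared_error(Y_pred_test, Y_test)) # test loss
plt.figure(figsize=(6,6))
plt.scatter(Y_test.flatten(), Y_pred_test.data.flatten(), alpha=0.5, label='test')
plt.plot([min(Y), max(Y)], [min(Y), max(Y)])
plt.grid()
plt.legend()
plt.xlabel('Observed')
plt.ylabel('Predicted')
plt.show()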

Autoencoder (AE)

An autoencoder is a neural network trained to reproduce its own input. The transformation from the input layer to the hidden layer is called the encoder, and the transformation from the hidden layer to the output layer is called the decoder. Dimensionality reduction can be achieved by making the number of neurons in the hidden layer smaller than the number of input variables.

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data.astype(np.float32)
%time
import numpy as np
from chainer import Sequential
from chainer import optimizers
import chainer.links as L
import chainer.functions as F
n_input = 4
n_hidden = 2
n_output = 4
ae = Sequential(
    L.Linear(n_input, n_hidden),
    F.sigmoid,
    L.Linear(n_hidden, n_output),
)

optimizer = optimizers.SGD(lr=0.01)
optimizer.setup(ae)

loss_history = []
for i in range(10000):
    X_pred = ae(X)
    loss = F.mean_squared_error(X_pred, X)
    loss_history.append(np.mean(loss.data))
    ae.cleargrads()
    loss.backward()
    optimizer.update()
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 9.54 µs
loss
variable(0.09372737)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(loss_history)
[<matplotlib.lines.Line2D at 0x7f2e06699a58>]

(Figure: plot of loss_history over the iterations)

Visualizing the data projected onto the hidden layer

latent = F.sigmoid(ae[0](X))
%matplotlib inline
import matplotlib.pyplot as plt
plt.scatter(latent.data[0:50, 0], latent.data[0:50, 1], alpha=0.5)
plt.scatter(latent.data[50:100, 0], latent.data[50:100, 1], alpha=0.5)
plt.scatter(latent.data[100:150, 0], latent.data[100:150, 1], alpha=0.5)
plt.grid()

(Figure: scatter plot of the two-dimensional hidden-layer representation, one color per iris species)
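
Since ae[0] is the encoder's linear layer and ae[2] is the decoder, applying them in sequence reproduces the output of the full autoencoder; a small sketch (not in the original article):

latent = F.sigmoid(ae[0](X)) # encoder: 4 input variables -> 2-dimensional latent space
decoded = ae[2](latent) # decoder: 2-dimensional latent space -> 4 reconstructed variables
print(F.mean_squared_error(decoded, X)) # same reconstruction loss as F.mean_squared_error(ae(X), X)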
