[PYTHON] Try implementing RBM with chainer.

0. Introduction

RBM (Restricted Boltzmann Machine) is a method used for pre-training when stacking multiple layers in deep learning. Reference source code in Theano is published at http://deeplearning.net/tutorial/rbm.html. This time I ported that code to Chainer.

In the case of an RBM, the error function is based on the free energy and looks complicated (?), but when you differentiate it with respect to the weight matrix and the bias variables for backpropagation, you get a clean formula that can be computed from the values produced by the Contrastive Divergence (CD) processing. When implementing this with Chainer, rather than bothering to define an error function (loss) and run backpropagation (loss.backward), it would actually be simpler to implement everything directly with cupy (numpy + GPU), but I deliberately implemented it through the error function. I may publish a cupy-only (numpy + GPU) version if the opportunity arises. Note that the source code below still uses cupy to speed up the CD processing.
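
For reference, with a {0,1} visible layer the free energy used below (equation (5) of the tutorial, implemented in the free_energy method) can be written in the same notation as the code comments as

    F(v) = -SUM_i( a(i) * v(i) ) - SUM_j( log(1 + exp(b(j) + SUM_i( W(j,i) * v(i) ))) )

and the loss that is minimized is simply the mean free-energy difference

    loss = ( F(v_data) - F(v_cd-k) ) / batch_size

where v_cd-k is the sample obtained from v_data by k steps of Gibbs sampling.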

# -*- coding: utf-8 -*-

import os,sys
import argparse
import time

import numpy as np

import chainer
from chainer import computational_graph
from chainer import cuda, Variable

import chainer.functions as F
import chainer.links as L
import chainer.optimizers as O


import pickle
import gzip

parser = argparse.ArgumentParser()
parser.add_argument('--gpu', '-g', default=-1, type=int,
                    help='GPU ID (negative value indicates CPU)')
parser.add_argument('--pcd', '-p', default=1, type=int,
                    help='pcd_flag')
parser.add_argument('--kcd', '-k', default=1, type=int,
                    help='cd-k')


'''When computing on the GPU, cupy (= numpy + GPU) is used for the processing.'''
args = parser.parse_args()
if args.gpu >= 0:
    cuda.check_cuda_available()
xp = cuda.cupy if args.gpu >= 0 else np

def sigmoid(x):
    return 1. / (1 + xp.exp(-x))

class RBM(chainer.Chain):
    '''
    n_visible: dimension of the visible layer
    n_hidden:  dimension of the hidden layer
    l.W: weight matrix
    l.b: hbias
    l.a: vbias
    real: switches between a {0,1} visible layer (real=0) and a real-valued one.
          The input layer may be real-valued, but intermediate layers are assumed to be {0,1} only.
    Because Chainer computes h = l(v), the weight matrix that appears in typical RBM
    references is used in transposed form.
    '''
    def __init__(self,n_visible,n_hidden,real=0,k=1,pcd_flag=0):
        super(RBM,self).__init__(
            l=L.Linear(n_visible,n_hidden),
            # This usage was deprecated in the Chainer 1.5 manual, so add_param is used instead.
#            a=L.Parameter(xp.zeros(n_visible,dtype=xp.float32)),
        )
        self.l.add_param("a",(n_visible),dtype=xp.float32)

        ''' Unless l.a is initialized like this, it does not seem to be trained by Chainer. '''
        self.l.a.data.fill(0)

        self.n_visible = n_visible
        self.n_hidden = n_hidden
        self.real = real
        self.k = k
        self.pcd_flag = pcd_flag

    def __call__(self,v_data,v_prev_data=None):

        batch_size = v_data.shape[0]

        if self.pcd_flag == 0:
            v_prev_data = v_data
        elif self.pcd_flag ==1 and v_prev_data is None:
            v_prev_data = v_data
        vh_data = self.contrastive_divergence(v_prev_data,self.k)
        v = Variable(v_data)
        vh = Variable(vh_data.astype(np.float32))

        '''
        From expression (5) of http://deeplearning.net/tutorial/rbm.html,
        the error function (loss) is the difference between the free energy of the
        visible-layer data and the free energy of that data after CD-k:
        loss = (self.free_energy(v) - self.free_energy(vh)) / batch_size
        The mapping v -> vh is not part of the learning, so vh is turned into a
        Variable only after the CD-k processing has finished.
        '''

        loss = (self.free_energy(v) - self.free_energy(vh)) / batch_size
        return loss

    def free_energy(self,v):
        '''
        Function to compute the free energy.
        The input v must be a Variable.
        Strictly speaking, the sum should be taken within each row (over the units)
        first and then over the batch (the columns), but since everything is summed
        in the end, the sum is taken all at once.
        '''
        batch_size = v.data.shape[0]
        n_visible = self.n_visible
        real = self.real
        if real == 0:
            '''
            When the visible layer is {0,1}:
            vbias_term = -1 * SUM(a(i) * v(i))
            (the minus sign is applied at the return statement below)
            '''
            vbias_term = F.sum(F.matmul(v,self.l.a))
        else:
            '''
            When the visible layer is real-valued:
            vbias_term = -0.5 * SUM((v(i)-a(i)) * (v(i)-a(i)))
            Chainer handles the input as (batch size * n_visible), so to subtract the
            bias from every row the matrix product m * n is used, somewhat forcibly.
            '''
            m = Variable(xp.ones((batch_size,1),dtype=xp.float32))
            n = F.reshape(self.l.a,(1,n_visible))
            v_ = v - F.matmul(m,n)
            vbias_term = -F.sum(0.5 * v_ * v_)

        wx_b = self.l(v)
        hidden_term = F.sum(F.log(1+F.exp(wx_b)))
        return -vbias_term-hidden_term

    def propup(self,vis):
        '''
        Compute the probability distribution of the hidden layer from visible-layer data.
        The hidden layer is {0,1}, so this returns the probability of each unit being 1.
        The input vis is always an xp (numpy/cupy) array.
        '''
        pre_sigmoid_activation = xp.dot(vis,self.l.W.data.T) + self.l.b.data
        return sigmoid(pre_sigmoid_activation)

    def propdown(self,hid):
        '''
        Compute the probability distribution of the visible layer from hidden-layer data.
        When the visible layer is {0,1}, this returns the probability of each unit being 1.
        When the visible layer is real-valued, it returns the mean of a normal
        distribution whose variance is fixed at 1.
        The input hid is always an xp (numpy/cupy) array.
        '''
        real = self.real
        if real == 0:
            pre_sigmoid_activation = xp.dot(hid,self.l.W.data) + self.l.a.data
            v_mean = sigmoid(pre_sigmoid_activation)
        else:
            v_mean = xp.dot(hid,self.l.W.data) + self.l.a.data
        return v_mean

    def sample_h_given_v(self,v0_sample):
        '''
        Gibbs sampling of h1_sample from v0_sample.
        Draw a uniform random value p[i] in [0,1] and compare it with the probability
        of 1 computed by propup:
        h1_mean[i] >  p[i]  ->  h1_sample[i] = 1
        h1_mean[i] <= p[i]  ->  h1_sample[i] = 0
        '''
        h1_mean = self.propup(v0_sample)
        h1_sample = xp.random.binomial(size=h1_mean.shape,n=1,p=h1_mean)
        return h1_mean,h1_sample

    def sample_v_given_h(self,h0_sample):
        '''
        Gibbs sampling of v1_sample from h0_sample.
        When the visible layer is {0,1}:
        draw a uniform random value p[i] in [0,1] and compare it with the probability
        of 1 computed by propdown:
        v1_mean[i] >  p[i]  ->  v1_sample[i] = 1
        v1_mean[i] <= p[i]  ->  v1_sample[i] = 0
        When the visible layer is real-valued:
        v1_sample[i] follows a normal distribution with mean v1_mean[i] and variance 1,
        so a random value u[i] drawn from a normal distribution with mean 0 and
        variance 1 is added to the value computed by propdown:
        v1_sample[i] = v1_mean[i] + u[i]
        '''
        v1_mean = self.propdown(h0_sample)
        if self.real == 0:
            v1_sample = xp.random.binomial(size=v1_mean.shape,n=1,p=v1_mean)
        else:
            batch_number = h0_sample.shape[0]
            v1_sample = v1_mean + xp.random.randn(batch_number,self.n_visible)
        return v1_mean,v1_sample

    def gibbs_hvh(self,h0_sample):
        ''' h->v->gibbs sampling h'''
        v1_mean,v1_sample = self.sample_v_given_h(h0_sample)
        h1_mean,h1_sample = self.sample_h_given_v(v1_sample)
        return v1_mean,v1_sample,h1_mean,h1_sample

    def gibbs_vhv(self,v0_sample):
        ''' v->h->v gibbs sampling'''
        h1_mean,h1_sample = self.sample_h_given_v(v0_sample)
        v1_mean,v1_sample = self.sample_v_given_h(h1_sample)
        return h1_mean,h1_sample,v1_mean,v1_sample

    def contrastive_divergence(self,v0_sample,k=1):
        ''' CD-k processing; cupy is needed to run this on the GPU '''
        vh_sample = v0_sample
        for step in range(k):
            ph_mean,ph_sample,vh_mean,vh_sample = self.gibbs_vhv(vh_sample)
        return vh_sample

    def reconstruct(self, v):
        h = sigmoid(xp.dot(v,self.l.W.data.T) + self.l.b.data)
        reconstructed_v = sigmoid(xp.dot(h,self.l.W.data) + self.l.a.data)
        return reconstructed_v

In Chainer, chainer.links.Linear defines the weight matrix (W) and bias (b) of the mapping between two layers. An RBM needs one additional bias, so I added a parameter with l.add_param("a",(n_visible),dtype=xp.float32). The parameters are therefore l.W: weight matrix, l.b: hbias, l.a: vbias. Because the mapping from the visible layer to the hidden layer is managed by chainer.links.Linear, the weight matrix found in a typical RBM textbook is used in transposed form (Wij → Wji).
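
As a quick check of the parameter layout described above, here is a minimal sketch (not part of the original script; the sizes 784 and 500 are simply the MNIST settings used in the next section):

rbm = RBM(n_visible=784, n_hidden=500)
print(rbm.l.W.data.shape)   # (500, 784) -- transposed compared with many RBM texts
print(rbm.l.b.data.shape)   # (500,)     hbias
print(rbm.l.a.data.shape)   # (784,)     vbias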

1. Experiment with MNIST

I trained the RBM above on the MNIST data. Since I had no GPU server, training was done on the CPU, so the number of training epochs is only 50. The hidden layer has 500 dimensions, and the results shown below were obtained with a learning rate of 0.01, a momentum of 0.5, and PCD-10.
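
The training script itself is not shown above, so here is a minimal sketch of how it might look. The gzip/pickle imports suggest the mnist.pkl.gz file from the Theano tutorial, but the file name, the mnist_x variable, and the batch size are my assumptions; for simplicity the sketch uses plain CD-k (pcd_flag=0), whereas the experiment above used PCD, which would require keeping the persistent chain and passing it as the second argument of rbm().

# Minimal training sketch (assumptions: mnist.pkl.gz from the Theano tutorial,
# batch size 100, plain CD instead of PCD).
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)   # Python 3: pickle.load(f, encoding='latin1')
mnist_x = train_set[0].astype(np.float32)             # images scaled to [0, 1], shape (N, 784)

rbm = RBM(n_visible=784, n_hidden=500, real=0, k=args.kcd, pcd_flag=0)
if args.gpu >= 0:
    rbm.to_gpu()

optimizer = O.MomentumSGD(lr=0.01, momentum=0.5)
optimizer.setup(rbm)

batch_size = 100
for epoch in range(50):
    perm = np.random.permutation(len(mnist_x))
    for i in range(0, len(mnist_x), batch_size):
        x_batch = xp.asarray(mnist_x[perm[i:i + batch_size]])
        rbm.zerograds()        # cleargrads() in newer Chainer versions
        loss = rbm(x_batch)    # free-energy difference after CD-k
        loss.backward()
        optimizer.update()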

The following is the original data. There are actually 60,000 samples, but only 150 are displayed. mnist_original.png

The following shows the above images reconstructed by the RBM; they are reproduced almost perfectly. Again, only 150 items are displayed. mnist_reconstruct.png

Next is an image that visualizes the weight matrix. The hidden layer has 500 dimensions, and each basis vector is converted to a 28 * 28 image. mnist_weight.png
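
For reference, images like the reconstruction and weight figures above can be produced along these lines (a sketch only; matplotlib is assumed and not part of the original script, and the grid size and file names are illustrative):

import matplotlib.pyplot as plt

# Reconstruction of the first 150 training images.
v = xp.asarray(mnist_x[0:150])
recon = cuda.to_cpu(rbm.reconstruct(v))
fig, axes = plt.subplots(10, 15, figsize=(15, 10))
for ax, img in zip(axes.ravel(), recon):
    ax.imshow(img.reshape(28, 28), cmap='gray')
    ax.axis('off')
fig.savefig('mnist_reconstruct.png')

# Visualize rows of the (500, 784) weight matrix as 28 * 28 images.
W = cuda.to_cpu(rbm.l.W.data)
fig, axes = plt.subplots(10, 15, figsize=(15, 10))
for ax, w in zip(axes.ravel(), W[:150]):
    ax.imshow(w.reshape(28, 28), cmap='gray')
    ax.axis('off')
fig.savefig('mnist_weight.png')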
