[With simple explanation] Scratch implementation of deep Boltzmann machine with Python ②

This article is the last day of Wacul Advent Calendar 2016.

Last time

Last time, as preparation, we implemented a restricted Boltzmann machine. This time we will "stack" restricted Boltzmann machines to build a deep Boltzmann machine.

[With simple explanation] Scratch implementation of deep Boltzmann machine with Python ① http://qiita.com/yutaitatsu/items/a9478841357b10789514

What is a deep Boltzmann machine?

A deep Boltzmann machine is a kind of Boltzmann machine with hidden layers, of the sort explained last time. As a whole it represents a single generative model: a visible layer sits at the bottom, and hidden layers are stacked on top of it.   Since a deep Boltzmann machine is just a general Boltzmann machine with hidden layers, it can in principle be trained with the gradient method as is. However, the deeper the network, the more parameters there are to estimate and the higher the risk of falling into a local optimum.
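
For reference, with visible layer v, hidden layers h^(1), ..., h^(L), and link weights W^(l) between adjacent layers (biases omitted here, matching the implementation later in this article), the energy function and the generative model can be written as:

E(v, h^{(1)}, \dots, h^{(L)}) = -\sum_{i,j} W^{(1)}_{ij}\, v_i\, h^{(1)}_j - \sum_{l=1}^{L-1} \sum_{j,k} W^{(l+1)}_{jk}\, h^{(l)}_j\, h^{(l+1)}_k, \qquad p(v, h^{(1)}, \dots, h^{(L)}) = \frac{1}{Z} \exp\{-E(v, h^{(1)}, \dots, h^{(L)})\}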

Tricks for training

This is where the biggest trick in training a deep Boltzmann machine comes in: **each layer is conditionally independent given its neighbours (there are no links between nodes within the same layer)**. Because of this, **each pair of adjacent layers, taken two at a time starting from the bottom, can be regarded as a restricted Boltzmann machine**. We treat each such pair as a restricted Boltzmann machine and pre-train the pairs one by one, from the bottom up, in order to obtain good initial values for the parameters. Specifically, we take the following steps.

[Pre-training steps]

① First, train the restricted Boltzmann machine at the bottom, the one that includes the visible layer.
② Using the restricted Boltzmann machine trained in ①, sample values of its hidden layer from the conditional distribution of the hidden layer given the visible layer.
③ Treat the samples obtained in ② as visible-layer input for the next restricted Boltzmann machine (second from the bottom) and train it.
(Repeat in the same way for the pairs above; a minimal sketch follows below.)
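
As a minimal sketch of these steps (assuming the RBM class from part ①, whose train and getHiddenValue methods are also the ones used in the full implementation later in this article; the helper name pretrain_sketch is made up for illustration):

# A rough sketch only: each adjacent pair of layers is treated as an RBM,
# trained on the current data, and the sampled hidden values become the
# training data for the next pair up.
def pretrain_sketch(rbms, data, epochs=100):
    current_data = data
    for rbm in rbms:
        rbm.train(current_data, epochs)   # steps ① and ③: train this pair as an RBM
        current_data = [rbm.getHiddenValue(x) for x in current_data]   # step ②: sample the hidden layer
    return rbms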

After obtaining good initial parameter values in this way, we re-train with the orthodox gradient method. However, the deeper the network, the more explosively the cost of computing the required expected values grows, so approximation methods are needed to obtain a solution in realistic time.
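
For reference, the quantity being approximated is the usual Boltzmann machine gradient: the derivative of the log-likelihood with respect to a link weight is the difference between a data-dependent expectation and a model expectation (writing h^(0) = v for the bottom pair). In the implementation below, the first term is approximated by the mean-field method and the second by sampling:

\frac{\partial \log L}{\partial W^{(l)}_{jk}} = \mathbb{E}_{\mathrm{data}}\!\left[ h^{(l-1)}_j\, h^{(l)}_k \right] - \mathbb{E}_{\mathrm{model}}\!\left[ h^{(l-1)}_j\, h^{(l)}_k \right]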


Approximation method

Approximation comes in when calculating the gradient, namely to evaluate expected values under the current parameters. This part is the same as for the restricted Boltzmann machine last time.   There are two main families of methods for approximating an expected value over a distribution. One is **sampling methods**, such as the Gibbs sampling used when training the restricted Boltzmann machine last time; the other is **variational methods**, such as the mean-field approximation used this time. Here we will use both.

The difference between the sampling approach and the variational approach is that the former "actually generates a large number of samples from the target distribution itself and approximates the expected value by averaging them", whereas the latter "instead of the target distribution, uses the member of a family of analytically (easily) computable distributions that is closest to it, and takes the expected value under that". Note that the former is guaranteed to converge to the true expected value as the number of samples goes to infinity, while the latter carries no such guarantee.

Concretely, the sampling approach uses Markov chain Monte Carlo (MCMC), while the variational approach chooses a family of distributions (such as a mixture of Gaussians), estimates the parameters that minimize the KL divergence to the target distribution, and then uses the resulting distribution in its place.
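
To make the difference concrete, here is a toy sketch (the weight matrix W, its size, and the random data are all made up purely for illustration) of estimating the expected value E[v hᵀ] of a small bias-free RBM in both ways: by averaging Gibbs samples, and by a mean-field fixed-point iteration.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))    # toy link weights, biases omitted

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Sampling approach: run block Gibbs sampling and average the samples.
v = (rng.random(5) > 0.5).astype(float)
acc = np.zeros_like(W)
n_samples = 10000
for _ in range(n_samples):
    h = (sigmoid(v @ W) > rng.random(3)).astype(float)   # sample h given v
    v = (sigmoid(W @ h) > rng.random(5)).astype(float)   # sample v given h
    acc += np.outer(v, h)
expectation_by_sampling = acc / n_samples

# Variational approach: iterate the mean-field fixed-point equations, no sampling.
mu_v, mu_h = np.full(5, 0.5), np.full(3, 0.5)
for _ in range(100):
    mu_h = sigmoid(mu_v @ W)    # expected value of h given the current mean of v
    mu_v = sigmoid(W @ mu_h)    # expected value of v given the current mean of h
expectation_by_mean_field = np.outer(mu_v, mu_h)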

How to use the deep Boltzmann machine

Besides being used directly as a generative model, a deep Boltzmann machine can also be used as an "autoencoder". What this means is that each hidden layer can be regarded as a **dimensionality reduction of the visible layer**. In other words, the values of a hidden layer can be seen as lower-dimensional features that preserve the essence of the information held by the nodes of the visible layer. Deep Boltzmann machines can therefore be applied as a preprocessing step before feeding data into deterministic neural networks or other orthodox machine learning techniques.
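
As a minimal sketch (assuming a DeepBolzmannMachine instance dbm that has already gone through preTrain and train as in the execution example below, and the getHiddenValue method of the RBM class from part ①; the helper name extract_features is made up):

# Hypothetical helper (not part of the class below): propagate one visible-layer
# value upward through the stacked RBMs and return the top hidden layer's values
# as a reduced-dimension feature vector.
def extract_features(dbm, visible_value):
    value = visible_value
    for rbm in dbm.rbms:
        value = rbm.getHiddenValue(value)
    return value

With the [5, 4, 3, 2] layer structure of the execution example below, extract_features(dbm, data_for_learning[0]) would compress a 5-dimensional input down to 2 dimensions.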

Implementation example

  • All parameters of the deep Boltzmann machine are held as the parameters of the restricted Boltzmann machines that compose it.
  • Expected values are computed with Gibbs sampling and the mean-field approximation.
  • In the energy function used during post-training, only the link weights are used; the biases are omitted.

Implementation of deep Boltzmann machine

import numpy as np
from typing import List
from itertools import chain

# The RBM class implemented in part ① is assumed to be available here,
# e.g. `from rbm import RBM` (the module name is hypothetical).

Layer = List[int]
Input = List[int]


def sigmoid(z):
    return 1/(1+np.exp(-z))

def get_array_from_structure(base_array, structure):
    # Split a flat array into consecutive sub-arrays whose lengths follow `structure`
    result = []
    current_index = 0
    for size in structure:
        result.append(base_array[current_index: current_index + size])
        current_index += size

    return result

class DeepBolzmannMachine:
    learningRate = 0.05
    sample_num = 100
    
    def __init__(self,layer_structure:Layer):
        #Generate a list of RBM by pairing from the Visible layer in order
        
        ##Layer structure
        self.layer_structure = layer_structure
        self.layer_num = len(layer_structure)
        
        ##Make a pair
        self.pairs = []
        for i in range(len(layer_structure) -1):
            self.pairs.append((layer_structure[i],layer_structure[i+1]))
        
        ##Generate RBM for each pair
        self.rbms = []
        for pair in self.pairs:
            rbm = RBM(pair[0],pair[1])
            self.rbms.append(rbm)
            
        self.link_structure = list(map(lambda x :x[0] *x[1], self.pairs))
 
    def preTrain(self,initaial_data_for_learning:Input, sample_num : int):

        current_learning_data = []

        ##Calculate the initial value of each RBM
        for rbm_index in range(len(self.rbms)):            
            #When the layer is the bottom layer (Visible vs. Hidden)
            if rbm_index == 0:
                self.rbms[rbm_index].train(initaial_data_for_learning,100)
                
                for i in range(sample_num):
                    hidden_value = self.rbms[rbm_index].getHiddenValue(initaial_data_for_learning[i])
                    current_learning_data.append(hidden_value)
            
            #When the layer is not the bottom layer (Hidden vs. Hidden)
            else:
                #Learn with input data
                self.rbms[rbm_index].train(current_learning_data,100)
                
                #Create data for the next RBM
                data_for_learning = []
                for i in range(sample_num):
                    hidden_value = self.rbms[rbm_index].getHiddenValue(current_learning_data[i])
                    data_for_learning.append(hidden_value)
                    
                current_learning_data = data_for_learning
               

    #Starting from the parameters obtained by pre-training, update them by gradient ascent.
    #The first (data-dependent) term of the log-likelihood derivative w.r.t. each link is approximated by the mean-field method, the second (model) term by sampling (Contrastive Divergence style).
    def train(self, data_for_learning:Input, iteration_times:int):
        samples = []
        for iteration in range(iteration_times):
            #For each input sample, compute the expected value of every hidden-layer node by mean-field approximation [note: only the links are updated, not the biases]
            ##The intermediate mean-field values are kept in a three-dimensional array (sample s / layer r / expected value of node j)
            sample_length = len(data_for_learning)
            max_node_num = max(self.layer_structure)
            
            mean_field = np.zeros((sample_length, self.layer_num, max_node_num))
            for s in range(sample_length):
                for r in range(self.layer_num):
                    if r == 0:
                        pass
                    elif r == 1:
                        for j in range(self.rbms[0].Hn):
                            # message from the visible layer below (clamped to the data)
                            first_paragraph = sum(map(lambda i : self.rbms[0].link[i][j] * data_for_learning[s][i] , range(self.rbms[0].Vn)))
                            # message from the mean-field values of the layer above (layer 2)
                            second_paragraph = sum(map(lambda k : self.rbms[1].link[j][k] * mean_field[s][2][k] , range(self.rbms[1].Hn)))
                            
                            mean_field[s][1][j]  =  sigmoid(first_paragraph + second_paragraph)
                            
                    elif r == self.layer_num - 1:
                        for j in range(self.rbms[r -1].Hn):
                            mean_field[s][r][j]  =  sigmoid(sum(map(lambda k : self.rbms[r-1].link[k][j] * mean_field[s][r-1][k]  , range(self.rbms[r-2].Hn))))
                
                    else :
                        for j in range(self.rbms[r -1].Hn):
                            # messages from the mean-field values of the layers below (r-1) and above (r+1)
                            mean_field[s][r][j]  =  sigmoid(sum(map(lambda k : self.rbms[r -1].link[k][j] * mean_field[s][r-1][k]  , range(self.rbms[r - 2].Hn))) + sum(map(lambda l : self.rbms[r].link[j][l] * mean_field[s][r+1][l] , range(self.rbms[r].Hn))))

            #Calculate the first term of the slope of the link
            gradient_1 = []
            
            for r in range(len(self.rbms)):
                gradient_1_element = []
                if r == 0:
                    for i in range(self.rbms[0].Vn):
                        for j in range(self.rbms[0].Hn):
                            gradient_1_element.append((sum(map(lambda s : data_for_learning[s][i]  * mean_field[s][1][j]  , range(sample_length)))) / sample_length)
                        
                else:
                    for i in range(self.rbms[r].Vn):
                        for j in range(self.rbms[r].Hn):
                            gradient_1_element.append((sum(map(lambda s : mean_field[s][r][i]  * mean_field[s][r+1][j]  , range(sample_length)))) / sample_length)
                
                gradient_1.append(gradient_1_element)
            
            #Calculate the second term of the slope of the link
            
            ##To sample
            new_samples = []    
            for s in range(DeepBolzmannMachine.sample_num):
                sample = []
                ##Initialize the value of sample
                for layer in self.layer_structure:
                    accumlator = []
                    for i in range(layer):
                        accumlator.append(0)   # placeholder; every entry is overwritten below
                    sample.append(accumlator)
            
                ##Set the initial state for sampling (from the second gradient-ascent iteration onward, the previous samples are reused)
                initial_randoms = []
                if iteration == 0:
                    for layer in self.layer_structure:
                        accumlator = []
                        for i in range(layer):
                            accumlator.append(np.random.rand(1)[0])
                        initial_randoms.append(accumlator)
                else:
                    initial_randoms = samples[s]

                ##Estimate one sample (value of each node)
                sigmoid_belief = []
                for layer_index in range(len(self.layer_structure)):
                    for node_index in range(self.layer_structure[layer_index]):
                        ####Calculate sigmoid belief using messages from all nodes attached to that node
                        if layer_index == 0:
                            sigmoid_belief = sigmoid(sum(map(lambda j : initial_randoms[1][j] * self.rbms[0].link[node_index][j]  , range(self.layer_structure[1]))))
            
                        elif layer_index == self.layer_num -1:
                            sigmoid_belief = sigmoid(sum(map(lambda j : initial_randoms[self.layer_num - 2][j] * self.rbms[self.layer_num -2].link[j][node_index]  , range(self.layer_structure[self.layer_num -2]))))
            
                        else :
                            belief_downstair = sum(map(lambda j : initial_randoms[layer_index - 1][j] * self.rbms[layer_index -1].link[j][node_index]  , range(self.layer_structure[layer_index -1])))
                            belief_upstair = sum(map(lambda j : initial_randoms[layer_index + 1][j] * self.rbms[layer_index].link[node_index][j]  , range(self.layer_structure[layer_index + 1])))
                            sigmoid_belief = sigmoid(belief_upstair  + belief_downstair)
            
                        ####Generate random numbers
                        r = np.random.rand(1)[0]
                    
                        ####Set 1 if sigmoid belief is larger than random number, 0 if smaller than random number
                        if sigmoid_belief > r :
                            sample[layer_index][node_index] = 1
                        else: 
                            sample[layer_index][node_index] = 0
            
                #Add sample
                new_samples.append(sample)
                
            #Discard the sample from the previous gradient update and replace it with this sample
            samples = new_samples
        
            #Approximate the second gradient term from the sample
            gradient_2 = []
            for r in range(len(self.rbms)):
                gradient_2_element = []
                if r == 0:
                    for i in range(self.rbms[0].Vn):
                        for j in range(self.rbms[0].Hn):
                            gradient_2_element.append(sum(map(lambda m : samples[m][0][i] * samples[m][1][j], range(DeepBolzmannMachine.sample_num))) / DeepBolzmannMachine.sample_num)
                else :
                    for i in range(self.rbms[r].Vn):
                        for j in range(self.rbms[r].Hn):
                            gradient_2_element.append(sum(map(lambda m : samples[m][r][i] * samples[m][r+1][j], range(DeepBolzmannMachine.sample_num))) / DeepBolzmannMachine.sample_num)
                
                gradient_2.append(gradient_2_element)
            
            
            #Calculate the gradient
            gradient_1_flatten = np.array(list(chain.from_iterable(gradient_1)))
            gradient_2_flatten = np.array(list(chain.from_iterable(gradient_2)))
            
            # gradient of the log-likelihood: data-dependent term minus model (sample) term
            gradient_flatten = gradient_1_flatten - gradient_2_flatten
            gradient = get_array_from_structure(gradient_flatten, self.link_structure)
            
            #Update parameters with gradient
            for rbm_index in range(len(self.rbms)):
                self.rbms[rbm_index].link +=  DeepBolzmannMachine.learningRate *  gradient[rbm_index].reshape((self.rbms[rbm_index].Vn,self.rbms[rbm_index].Hn))

Execution example

#A 4-layer network with 5, 4, 3, and 2 nodes in order from the visible layer
layer_structure = [5,4,3,2]

#Number of samples to be sampled when calculating the expected value
sample_num = 100

#Boltzmann machine instantiation
dbm = DeepBolzmannMachine(layer_structure)

#Create appropriate learning data
data_for_learning = np.random.rand(100,layer_structure[0])

#Pre-training
dbm.preTrain(data_for_learning,sample_num)

#Post-training (fine-tuning; the second argument is the number of gradient-ascent iterations)
dbm.train(data_for_learning,sample_num)

In closing

So that's the implementation.

Stochastic deep learning does not yet seem to be as widespread as deep learning with deterministic neural networks, but personally I think the Bayesian way of thinking, **understanding events together with their uncertainty**, has great merit, and I hope it becomes the de facto standard.

The year is almost over. Have a good end of the year, everyone...
