[With simple explanations] Scratch implementation of a deep Boltzmann machine in Python ①

This article is the 18th day of the Wacul Advent Calendar.

Self-introduction

I have been working on the analysis team at WACUL Co., Ltd. for about a year. My Python experience: roughly three weeks.

What I will do

As Python practice, and mainly to study probabilistic deep learning, I will try to implement a deep Boltzmann machine from scratch. All of the theoretical background can be found in the following books.

" Deep Learning (Machine Learning Professional Series) "Takayuki Okatani (Author) " Deep Learning "(supervised by the Japanese Society for Artificial Intelligence)

In this article, we will first implement a restricted Boltzmann machine as preparation. Next time, we will build a deep Boltzmann machine by stacking restricted Boltzmann machines.

A brief description of the Boltzmann machine

A probability distribution is simply a function that maps a set of values to the probability density (or probability mass) with which that set of values is obtained, so taken as a whole it can be regarded as the mechanism behind how those values are generated. Such a mechanism is called a ***generative model***. Once a generative model has been built, the relationships between variables can be examined and expected values can be calculated, which widens the range of applications. The Boltzmann machine discussed here is a generative model that computes the probability density using the Boltzmann distribution. Concretely, the probability density is obtained by multiplying the energy function Φ defined below by minus one, applying the exponential function, and then normalizing the result (multiplying by a constant so that it behaves as a proper probability within the closed interval [0, 1]). The energy function is a weighted linear sum of the biases of the nodes and the links between them.
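Concretely, a sketch of these definitions (following the sign convention also used by the energy() method implemented later, with x_i the value of node i, b_i its bias, and w_{ij} the weight of the link between nodes i and j):

\Phi(x) = -\Bigl(\sum_i b_i x_i + \sum_{i<j} w_{ij} x_i x_j\Bigr)

p(x) = \frac{\exp(-\Phi(x))}{Z}, \qquad Z = \sum_{x} \exp(-\Phi(x))

The normalizing constant Z (the partition function) is what makes the values behave as probabilities.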

(Figure: スライド1.png)

Visible and hidden variables

Visible variables are variables that can be observed directly; hidden variables are variables that cannot. Introducing hidden variables improves the expressive power of the Boltzmann machine. For a model that includes hidden variables, the distribution over the visible variables alone, obtained by marginalizing out the hidden variables, is used as the objective function for maximum likelihood estimation.
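In formulas (a sketch; v denotes the visible variables, h the hidden variables, θ the parameters, and v^{(n)} the n-th training example):

p(v \mid \theta) = \sum_{h} p(v, h \mid \theta)

L(\theta) = \sum_{n} \log p(v^{(n)} \mid \theta)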

Restricted Boltzmann machine

The restricted Boltzmann machine is a model that simplifies parameter estimation by introducing hidden variables into a general Boltzmann machine and then imposing certain constraints. There are two restrictions: ① there are no links between visible nodes, and ② there are no links between hidden nodes. Under these two conditions, the restricted Boltzmann machine has two layers, a "visible layer" consisting only of visible nodes and a "hidden layer" consisting only of hidden nodes, with links running only between the two layers.

This means that each layer is ***conditionally independent*** given the other: conditioned on the visible layer, the hidden nodes are independent of one another, and vice versa. Concretely, you can repeat the following cycle (see the formulas below):

- Obtain a sample of all nodes in the hidden layer from the current expected values of all nodes in the visible layer
- Approximate the expected value of each hidden node from that sample of the hidden layer
- Obtain a sample of all nodes in the visible layer from the expected values of the hidden nodes
- Approximate the expected value of each visible node from that sample of the visible layer

This calculation is relatively easy, and an entire layer can be sampled at once. The restricted Boltzmann machine can be said to gain this benefit in exchange for some of the expressive power of the model.
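Written as formulas, a sketch matching the sampling code below (b_i are the visible biases, c_j the hidden biases, w_{ij} the links, and \sigma(z) = 1/(1 + e^{-z}) is the sigmoid function):

p(h_j = 1 \mid v) = \sigma\Bigl(c_j + \sum_i w_{ij} v_i\Bigr)

p(v_i = 1 \mid h) = \sigma\Bigl(b_i + \sum_j w_{ij} h_j\Bigr)

Given v, every h_j can be sampled independently (and vice versa), which is exactly the layer-at-a-time sampling described above.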

(Figure: スライド2.png)

Learning a restricted Boltzmann machine

Once the shape of the model has been decided, the next step is to learn its parameters. Parameter learning for the restricted Boltzmann machine uses the orthodox gradient (ascent) method to maximize the log-likelihood function. Gradient ascent repeatedly updates the parameters a little at a time in the direction that increases the objective function, which here is the log-likelihood of the training data under the marginal distribution of the visible variables. To maximize it, the objective function is differentiated with respect to each parameter to obtain the gradient. Computing the gradient requires expected values over the visible and hidden layers, and since these are difficult to compute directly, they are approximated by actually generating samples with Gibbs sampling and averaging over them. Thanks to the "conditional independence between layers" of the restricted Boltzmann machine described above, this sampling is easy to carry out.
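For the log-likelihood averaged over the training data, the gradients come out as the difference between a data term and a model term (a sketch in the same notation as above; the model terms are the expectations that Gibbs sampling approximates):

\frac{\partial L}{\partial b_i} = \langle v_i \rangle_{\mathrm{data}} - \langle v_i \rangle_{\mathrm{model}}

\frac{\partial L}{\partial c_j} = \langle p(h_j = 1 \mid v) \rangle_{\mathrm{data}} - \langle h_j \rangle_{\mathrm{model}}

\frac{\partial L}{\partial w_{ij}} = \langle v_i \, p(h_j = 1 \mid v) \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}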

Learning goes through the following process when viewed broadly.

⚫️ Randomly initialize parameters


⚫️ Repeat the following:

① Alternately sample the visible layer and the hidden layer using the current parameter values

② Using ①, calculate the expected value of each node in the visible layer and the hidden layer

③ Using ②, calculate the gradients (derivatives) of the visible-layer biases, the hidden-layer biases, and each link

④ Update the parameters using ③

Python implementation of restricted Boltzmann machine

Persistent contrastive divergence is used for sampling. Processing efficiency is not taken into account, and the convergence check is omitted.
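To recall what "persistent" means here: the Gibbs chain used to approximate the model expectations is not restarted from the data at every parameter update; its state is carried over from the previous update. A minimal, self-contained sketch of just that idea (the toy dimensions, parameter values, and variable names are illustrative and are not the class implemented below):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy dimensions and randomly initialized parameters (illustrative only)
Vn, Hn = 3, 4
rng = np.random.default_rng(0)
bias_v, bias_h = rng.random(Vn), rng.random(Hn)
link = rng.random((Vn, Hn))

# The chain state is created once and then carried over between updates
persistent_v = rng.integers(0, 2, size=Vn).astype(float)

for step in range(10):
    # One Gibbs sweep starting from the previous chain state, not from the data
    h = (sigmoid(bias_h + persistent_v @ link) > rng.random(Hn)).astype(float)
    persistent_v = (sigmoid(bias_v + link @ h) > rng.random(Vn)).astype(float)
    # (persistent_v, h) would be used for the model-expectation term of the
    # gradient, while the data term is computed from the training data as usual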

RBM class implementation

import itertools

import numpy as np
from typing import List

Input = List[int]

def sigmoid(z):
    return 1/(1+np.exp(-z))


class RBM:
    learningRate = 0.005
    sample_num = 100
    
    #Vn is the number of visible nodes, Hn is the number of hidden nodes
    def __init__(self , Vn : int , Hn : int):
        self.Vn = Vn
        self.Hn = Hn
        self.bias_v = np.array(np.random.rand(Vn)) #Randomly initialized
        self.bias_h = np.array(np.random.rand(Hn)) #Randomly initialized
        self.link  = np.array(np.random.rand(Vn,Hn)) #Randomly initialized
        
        self.hidden_value = []
        
    def train(self, data_for_learning , iteration_times):
        for _ in range(iteration_times):
            
            # samples_vh is v*h sample
            samples_h =[]
            samples_v = []
            samples_vh = np.zeros((self.Vn,self.Hn))
            
            messege_in_sampling = 0
            sigmoid_belief = 0
        
            #① Alternately sampling the visible layer and the hidden layer using the current parameter values
            for index in range(RBM.sample_num):
                
                current_v = np.array(data_for_learning[0], dtype=float) #Copy so that sampling below does not overwrite the training data
                current_h = np.zeros(self.Hn)
                                
                ##Sampling hidden layers
                for j in  range(self.Hn) :  
                    messege_in_sampling = self.bias_h[j] + sum(list(map(lambda i: current_v[i] * self.link[i][j]  , range(self.Vn))))                
                    
                    sigmoid_belief = 1/(1+ np.exp(-1 * messege_in_sampling))
                    
                    r = np.random.rand(1)
                    
                    if sigmoid_belief > r[0]  :
                        current_h[j] = 1
                    else : 
                        current_h[j] = 0
                
                samples_h.append(current_h)
                  
                ##Sampling the visible layer
                for i in range(self.Vn):
                    messege_in_sampling = self.bias_v[i] + sum(list(map(lambda j: current_h[j] * self.link[i][j]  , range(self.Hn))))                
                    
                    sigmoid_belief = 1/(1+ np.exp(-1 * messege_in_sampling))
                    
                    r = np.random.rand(1)
                    if sigmoid_belief > r[0]  :
                        current_v[i] = 1
                    else : 
                        current_v[i] = 0
                                        
                samples_v.append(current_v)  
                
                ##Visible layer ✖️ Hidden layer sampling (outer product of the two sample vectors)
                samples_vh += np.outer(current_v, current_h)
            
            #② Calculate the expected value of each node in the visible layer and hidden layer
            E_V  = np.sum(np.array(samples_v),axis=0) / RBM.sample_num
            E_H = np.sum(np.array(samples_h),axis=0) / RBM.sample_num
            E_VH = samples_vh / RBM.sample_num

            #③ Calculate the bias of the visible layer, the bias of the hidden layer, and the gradient (derivative) of each link.
            ##Derivative of visible layer bias
            gradient_v = sum(np.array(data_for_learning)) / len(data_for_learning) - E_V
                
            ##Derivative of hidden layer bias
            gradient_h = []
            for j in range(len(self.bias_h)):
                gradient_h_1 = []
                sigmoid_beliefs_h_1 = []

                for n in range(len(data_for_learning)):
                    messege_h_1 = self.bias_h[j] + sum(list(map(lambda i: data_for_learning[n][i] * self.link[i][j]  , range(self.Vn))))
                    sigmoid_belief_h_1 = 1/(1+ np.exp(-1 * messege_h_1))
                    sigmoid_beliefs_h_1.append(sigmoid_belief_h_1)
                
                gradient_h_1 = sum(sigmoid_beliefs_h_1) / len(data_for_learning)      
                gradient_h_2 = E_H[j]

                gradient_h_j = gradient_h_1 - gradient_h_2 #Data term minus model term
                gradient_h.append(gradient_h_j)

            ##Derivative of the link
            gradient_vh = []
            for i in range(len(self.bias_v)):
                for j in range(len(self.bias_h)):
                    gradient_vh_1 = []
                    sigmoid_beliefs_vh_1 = []

                    for n in range(len(data_for_learning)):
                        messege_vh = self.bias_h[j] + sum(list(map(lambda k: data_for_learning[n][k] * self.link[k][j], range(self.Vn)))) #Use k so the outer loop variable i is not shadowed
                        sigmoid_belief_vh_1 = 1/(1+ np.exp(-1 * messege_vh))
                        sigmoid_beliefs_vh_1.append(sigmoid_belief_vh_1 * data_for_learning[n][i])

                    gradient_vh_1 = sum(sigmoid_beliefs_vh_1) / len(data_for_learning)      
                    gradient_vh_2 = E_VH[i][j]

                    gradient_vh_ij = gradient_vh_1 - gradient_vh_2 #Data term minus model term

                    gradient_vh.append(gradient_vh_ij)

            #④ Update parameters using ③
            self.bias_v += RBM.learningRate *  np.array(gradient_v)
            self.bias_h += RBM.learningRate *  np.array(gradient_h)
            self.link +=  RBM.learningRate *  np.array(gradient_vh).reshape(self.Vn,self.Hn)            

    def energy(self,input : Input) -> float:
        #Visible layer energy
        energy_v = sum(list(map(lambda i : self.bias_v[i] * input[i]  , range(self.Vn))))
        
        #Hidden layer energy
        ##Estimated value of hidden layer
        hidden_layer = self.getHiddenValue(input)
                
        energy_h = sum(list(map(lambda j : self.bias_h[j] * hidden_layer[j]  , range(self.Hn))))
        
        #link energy
        accumlator = []        
        for i in range(self.Vn):
            for j in range(self.Hn):
                accumlator.append(input[i]*hidden_layer[j] * self.link[i][j])
                
        energy_vh = sum(accumlator)       
                
        energy = -1 * (energy_v + energy_h + energy_vh)
        
        return energy
    
    def estimate(self,input:Input) -> float:            
    
        #Calculate energy function
        energy = self.energy(input)
        
        #Calculate partition function
        all_patterns = list(itertools.product(range(0,2), repeat=self.Vn))
        partition = sum(list(map(lambda x: np.exp(-1 *self.energy(list(x))) , all_patterns)))
            
        return  np.exp(-1*  energy ) / partition
            
    def getHiddenValue(self,input:Input) :
        
        hidden_layer = np.zeros(self.Hn)
            
        for j in  range(self.Hn) :  
            messege_in_sampling = self.bias_h[j] + sum(list(map(lambda i: input[i] * self.link[i][j]  , range(self.Vn))))                
            sigmoid_belief = 1/(1+ np.exp(-1 * messege_in_sampling))
            
            r = np.random.rand(1)
            if sigmoid_belief > r[0]:
                hidden_layer[j] = 1
            else : 
                hidden_layer[j] = 0
                
        self.hidden_value = hidden_layer
        
        return hidden_layer

Execution example

#Build
visible_node = 3
hidden_node = 4
rbm = RBM(visible_node, hidden_node)

#Learning data creation
training_data = np.array(list(map(lambda x : np.random.randint(0,2) , range(0,300)))).reshape(100,3)

#Parameter learning
rbm.train(training_data,10)
print(rbm.link)

input  = [1,0,1]

#Get the value of the hidden layer
rbm.getHiddenValue(input)

#Energy function output
rbm.energy(input)

#Probability density output
rbm.estimate(input)
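As a rough sanity check (my addition, not part of the original flow): since estimate() is meant to return a probability, summing it over all 2^3 visible patterns should give a value in the neighborhood of 1. Because getHiddenValue() samples the hidden layer stochastically, the total will fluctuate from run to run.

import itertools

#Sum of the estimated probabilities over every visible pattern; should be roughly 1
all_patterns = list(itertools.product(range(2), repeat=rbm.Vn))
total = sum(rbm.estimate(list(pattern)) for pattern in all_patterns)
print(total)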

It seems to run, but whether it has actually converged is unclear... I will move on for now.

next time

Next time, we will stack these restricted Boltzmann machines to build a deep Boltzmann machine. The deeper the model, the harder it is to train, so the idea is to first obtain good initial parameter values by pre-training with restricted Boltzmann machines, and then use them to fine-tune the model as a whole.
