I was reading "Pattern Recognition and Machine Learning" and decided to implement the sum-product algorithm in Chapter 8. However, I couldn't immediately think of a concrete Bayesian network example, so I decided to rely on statistical mechanics. It seems that it is very closely related to the fact that the partition function, which is familiar in statistical mechanics, can be calculated strictly, and the ground state (the global optimum solution as a combination optimization problem) can be obtained. I think it would be very interesting to dig deeper here, but due to lack of competence, I will only show a simple example and implementation. The notation is adapted to "Pattern recognition and machine learning".

Configuration

This time, consider a system in which spins $ x_i = \ pm 1 $ placed in a tree-like network interact with each other: $ H = -\sum_{(i,j)} J_{ij} x_i x_j $ However, $ J_ {ij} $ is determined stochastically, and in this sense, it can be said that we are dealing with spin glass systems. At this time, assuming that the system follows a canonical distribution, the probability distribution is $ p({\bf x}) = \dfrac{1}{Z(\beta)} \exp [-\beta H] $ Given by. To replace this probability distribution with a graphical model, define $ \ psi_ {ij} (x_i, x_j) = \ exp [\ beta J_ {ij} x_i x_j] $ $ p({\bf x}) = \dfrac{1}{Z(\beta)} \prod_{(ij)} \psi_{ij}(x_i,x_j) $ Since it can be transformed as follows, an undirected graph can be obtained by replacing the spin network with a graphical model as it is. So the factor node $ f_{ij}(x_i,x_j) = \psi_{ij}(x_i,x_j) = \exp [\beta J_{ij} x_i x_j ] $ Therefore, we can see that the factor node should be inserted between the branched $ (i, j) $. The factor graph obtained in this way is trivially tree-shaped, so you can execute sum-product and max-sum algorithms. In the following, the variable node is represented by 〇 and the factor node is represented by □. The figure below shows the relationship between each network.

First, the code for creating such a network is shown. I will omit the details.

import numpy as np

class tree: #Generate a suitable tree
    def __init__(self, N):
        self.edges = []
        self.children_of = [[] for _ in range(N)] #Multiple children
        self.parent_of = [[] for _ in range(N)] #parent is unique
        self.leaves = [0] #0 is root
        
        for n in range(1,N):
            if n == 1: #Determine node i to attach node n
                i = 0
            else:
                i = np.random.choice(n-1)
            
            self.edges.append([i,n])
            self.leaves.append(n)
            if i in self.leaves:
                self.leaves.remove(i)
            
            self.children_of[i].append(n)
            self.parent_of[n] = i
        
N = 10 # N=It seems to be okay because it was done well when I checked it at about 10
network = tree(N)
print(network.edges)
print(network.leaves)

#Generation of J and definition of f
J = {}
for edge in network.edges:
    i, j = edge
    J[str(i)+","+str(j)] = np.random.normal()

def f(beta, J_ij, x_i, x_j):
    return np.exp(beta*J_ij*x_i*x_j)
print(J)

Calculation of partition function by sum-product algorithm

First, we briefly touch on the sum-product algorithm and describe the calculation of the partition function using it. Finally, the implementation is shown.

sum-product algorithm

Using $ f_ {ij} (x_i, x_j) $ above, the probability distribution is $ p({\bf x}) = \dfrac{1}{Z(\beta)} \prod_{(ij)} f_{ij} (x_i,x_j) $ Note that it can be expressed by. Also, the normalized constant (partition function) is excluded. $ \tilde{p}({\bf x}) = \prod_{(ij)} f_{ij} (x_i,x_j) $ Is defined.

The sum-product algorithm finds the marginal distribution $ p (x_n) $ for a node $ x_n $. Specifically, the procedure is as follows:

(1) Select a node $ x_n $ and use it as the root. (2) Find all the leaf nodes for the root, and if it is a variable node (〇), propagate the message $ \ mu_ {x \ to f} (x) = 1 $ to the factor node (□). On the contrary, when it is a factor node (□), the message $ \ mu_ {f \ to x} (x) = f (x) $ is propagated to the variable node (〇).

(3) Propagate messages from the leaf node toward the root based on the following formula. However, $ N (x) $ and $ N (f) $ represent adjacent nodes of the variable node $ x $ and the factor node $ f $, respectively.

\mu_{x \to f}(x) = \prod_{l \in N(x) \setminus f} \mu_{f_l \to x}(x)\\
\mu_{f \to x}(x) = \sum_{x_1, \cdots, x_M} f(x,x_1,\cdots,x_M) \prod_{m \in N(f) \setminus x} \mu_{x_m \to f}(x_m)

(4) If you calculate $ \ prod_ {f \ in N (x_n)} \ mu_ {f \ to x_n} (x_n) $ on the root $ x_n $, this is $ \ tilde {p} ({\ bf x) }) It becomes $ \ tilde {p} (x_n) $ which is the marginalization of . In other words $ \tilde{p}(x_n) = \prod_{f \in N(x_n)} \mu_{f \to x_n}(x_n) $$ Is. This can be understood from the fact that (3) essentially only multiplies $ f $ for each factor node and $ \ sum_x $ for each variable node, but we need to do our best to show it properly.

(5) Obviously $ \ sum_ {x_n} \ tilde {p} (x_n) = Z (\ beta) \ sum_ {x_n} p (x_n) = Z (\ beta) $, so the normalization constant can be calculated. Therefore, $ p (x_n) = \ dfrac {1} {Z (\ beta)} \ tilde {p} (x_n) $ could be calculated.

The above is the sum-product algorithm.

Calculation of partition function

From the sum-product algorithm described above, the normalized constant (partition function) $ Z (\ beta) $ is $ Z(\beta) = \sum_{x_n} \tilde{p}(x_n) $ You can see that it is required by. Therefore, the partition function can be calculated by using the sum-product algorithm for the spin system arranged in the tree-like network. Recalling that the calculation of the partition function generally required an exponential order complexity, we can see that the complexity has been significantly reduced. Note that such calculations are possible because the network is clearly tree-shaped. Generally, this method cannot be applied when there is a loop. On the other hand, depending on the structure of the network, it seems that some good results can be obtained by forcibly applying sum-product. For details, please check with loopy belief propagation.

Implementation by Python

For this problem, I implemented the sum-product algorithm in Python. As you can see from the algorithm above, it is not easy to implement for a general network (even if it is a tree). However, with this setting, the factor node is adjacent to only two spin variables, so it can be considerably simplified. How to manage the message $ \ mu $ is the key to implementation, but I decided to force it with a dictionary type. The flow of the algorithm is to start with the leaf node discovered when creating the network and replace the leaf node while approaching the root.

import copy

spin = [1.0, -1.0]

beta = 1.0
beta_start = 0.1
beta_end = 4.0
beta_list = np.linspace(beta_start, beta_end, 100) #Calculate partition function for some beta values
Z_list = np.ones(100)

for s,beta in enumerate(beta_list):
    mu_x_to_f = {} #Dictionary-type management for readability
    mu_f_to_x = {}
    leaves = copy.deepcopy(network.leaves) #Otherwise network.leaves are replaced
    children_of = copy.deepcopy(network.children_of)
    
    while leaves != [0]: #Follow from leaves to root.
        i = leaves.pop(0) #Take out one of the leaves
        j = network.parent_of[i] #That parent(unique)Find out
        #print(i,j)
        direction = str(i)+"->"+str(j) #Dictionary key
        interaction = str(j)+","+str(i)
        mu_x_to_f[direction] = np.ones(2) #Initialization conditions
        for k in network.children_of[i]:
            #print("yey")
            direction_previous = str(k)+"->"+str(i)
            mu_x_to_f[direction] *= mu_f_to_x[direction_previous]
        
        #The theory that it became difficult to understand if I wrote it round and round for readability
        mu_f_to_x[direction] = np.zeros(2)
        mu_f_to_x[direction][0] = f(beta,J[interaction],spin[0],spin[0])*mu_x_to_f[direction][0] + f(beta,J[interaction],spin[0],spin[1])*mu_x_to_f[direction][1]
        mu_f_to_x[direction][1] = f(beta,J[interaction],spin[1],spin[0])*mu_x_to_f[direction][0] + f(beta,J[interaction],spin[1],spin[1])*mu_x_to_f[direction][1]
        #print(mu_f_to_x[direction])

        children_of[j].remove(i) #Remove i from the child nodes to determine if j can be left
        if len(children_of[j]) == 0: #Add to leaves if there are no child nodes
            leaves.append(j)
        #print(leaves)
    
    p0 = np.ones(2) # p(x_0)As a vector
    for i in network.children_of[0]:
        direction = str(i)+"->"+str(0)
        p0 *= mu_f_to_x[direction]
    Z_list[s] = sum(p0)

Since the partition function $ Z (\ beta) $ is obtained, the free energy $ F (\ beta) =-\ dfrac {1} {\ beta} \ log Z (\ beta) $ can be calculated. When I actually tried it, it became as follows.

Calculation of ground state by max-sum algorithm

First, the max-sum algorithm is briefly described, and the calculation of the ground state using it is explained. Finally, I will show the implementation and confirm that the state with the minimum energy is certainly obtained.

max-sum algorithm

The max-sum algorithm is an algorithm obtained by replacing $ \ sum $ with $ \ max $ and $ \ prod $ with $ \ sum $ in the sum-prodoct algorithm, and is $ p ({\ bf x}) $. $ {\ Bf x} ^ {} $, which gives the maximum and maximum values of, is calculated. The procedure is almost the same as the sum-product algorithm, but backpropagation from the root is required to calculate $ {\ bf x} ^ {} $.

(1) Select a node $ x_n $ and use it as the root. (2) Find all the leaf nodes for the root, and if it is a variable node (〇), propagate the message $ \ mu_ {x \ to f} (x) = 0 $ to the factor node (□). On the contrary, when it is a factor node (□), the message $ \ mu_ {f \ to x} (x) = \ log f (x) $ is propagated to the variable node (〇).

\mu_{x \to f}(x) = \sum_{l \in N(x) \setminus f} \mu_{f_l \to x}(x)\\
\mu_{f \to x}(x) = \max_{x_1, \cdots, x_M} \left[ \log f(x,x_1,\cdots,x_M) +\sum_{m \in N(f) \setminus x} \mu_{x_m \to f}(x_m) \right]

At that time, for each $ x $ $ {\bf \phi} _ f (x) = \arg \max_{x_1,\cdots,x_M} \left[ \log f(x,x_1,\cdots,x_M) +\sum_{m \in N(f) \setminus x} \mu_{x_m \to f}(x_m) \right] $ Is saved. This is necessary for the back propagation in (5).

(4) At root $ x_n $ $ p_{\rm max} = \max_{x_n} \left[ \sum_{f \in N(x_n)} \mu_{f \to x_n} (x_n) \right] $ Is calculated, and this is the maximum value of $ p ({\ bf x}) $. Also, $ x ^ {*} _ n $ is $ x ^ { * } _ n = \arg \max_{x_n} \left[ \sum_{f \in N(x_n)} \mu_{f \to x_n} (x_n) \right] $ Is obtained by.

(5) Backpropagate from root $ x_n $ $ (x ^ * _1, \cdots, x ^ * _M) = {\bf \phi} _ f (x) $ Find $ {\ bf x} ^ {*} $ as follows.

The above is the procedure of the max-sum algorithm.

Ground state calculation

Canonical distribution $ p({\bf x}) = \dfrac{1}{Z(\beta)} \exp [-\beta H] $ As is clear from, $ {\ bf x} ^ {} $, which gives the maximum value of $ p ({\ bf x}) $, gives the minimum value of Hamiltonian. In other words, the ground state of the spin system in the network can be found by finding $ {\ bf x} ^ {} $ using the max-sum algorithm. If you read Hamiltonian as a loss function, this leads to solving a combinatorial optimization problem. As you can see from a closer look at the procedure of the max-sum algorithm, this is dynamic programming itself. From a statistical mechanics point of view, the fact that the max-sum algorithm cannot be applied well in the presence of loops can be interpreted as the existence of frustration. I think it is a very interesting example that you can understand the close connection between graph theory, optimization problems, and statistical mechanics.

Implementation by Python

For this problem, I implemented the max-sum algorithm in Python. It is almost the same as the sum-product algorithm, but $ \ beta = 1 $. It is self-evident that this does not lose generality, considering that $ \ log f_ {ij} (x_i, x_j) = \ beta J_ {ij} x_i x_j $.

mu_x_to_f = {} #Dictionary-type management for readability
mu_f_to_x = {}
phi = {}
leaves = copy.deepcopy(network.leaves) #Otherwise network.leaves are replaced
children_of = copy.deepcopy(network.children_of)
    
spin = [1.0, -1.0]

while leaves != [0]: #Follow from leaves to root.
    i = leaves.pop(0) #Take out one of the leaves
    j = network.parent_of[i] #That parent(unique)Find out
    #print(i,j)
    direction = str(i)+"->"+str(j) #Dictionary key
    interaction = str(j)+","+str(i)
    mu_x_to_f[direction] = np.zeros(2) #Initialization conditions
    for k in network.children_of[i]:
        #print("yey")
        direction_previous = str(k)+"->"+str(i)
        mu_x_to_f[direction] += mu_f_to_x[direction_previous]
        
    #The theory that it became difficult to understand if I wrote it round and round for readability
    mu_f_to_x[direction] = np.zeros(2)
    mu_f_to_x[direction][0] = max(J[interaction]*spin[0]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[0]*spin[1]+mu_x_to_f[direction][1])
    mu_f_to_x[direction][1] = max(J[interaction]*spin[1]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[1]*spin[1]+mu_x_to_f[direction][1])
    phi[direction] = np.array([1,1])
    phi[direction][0] = np.argmax([J[interaction]*spin[0]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[0]*spin[1]+mu_x_to_f[direction][1]])
    phi[direction][1] = np.argmax([J[interaction]*spin[1]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[1]*spin[1]+mu_x_to_f[direction][1]])

    #print(mu_f_to_x[direction])

    children_of[j].remove(i) #Remove i from the child nodes to determine if j can be left
    if len(children_of[j]) == 0: #Add to leaves if there are no child nodes
        leaves.append(j)

nodes = copy.deepcopy(network.children_of[0]) #Otherwise network.leaves are replaced
children_of = copy.deepcopy(network.children_of)

x_optimal = [[] for _ in range(N)]

p_max = np.zeros(2)
for i in network.children_of[0]:
    direction = str(i)+"->"+str(0)
    p_max += mu_f_to_x[direction]   
x_optimal[0] = np.argmax(p_max)

while len(nodes) != 0:
    i = nodes.pop(0)
    j = network.parent_of[i]
    direction = str(i)+"->"+str(j)
    x_optimal[i] = phi[direction][x_optimal[j]]
    nodes.extend(network.children_of[i])
    
print(x_optimal)
def energy(x):
    E = 0.0
    for edge in network.edges:
        i,j = edge
        E += -J[str(i)+","+str(j)]*spin[x[i]]*spin[x[j]]
    return E
print(energy(x_optimal))
energy_optimal = energy(x_optimal)

For the check, I generated about $ 10000 $ samples and compared the energies, but the $ {\ bf x} ^ * $ obtained by this algorithm certainly gave the minimum energy value.

Summary

In the first half, the partition function of the spin system on the tree-like network was calculated by using the sum-product algorithm. In the latter half, the ground state of this system was calculated by the max-sum algorithm. Through this example, I was able to see one aspect of the relationship between graph theory and optimization problems and statistical mechanics. Since this example is too simple, it may be interesting to make various extensions such as increasing the spin state and including three bodies as interactions.

[PYTHON] Compute the partition function with the sum-product algorithm

Configuration

Calculation of partition function by sum-product algorithm

sum-product algorithm

Calculation of partition function

Implementation by Python

Calculation of ground state by max-sum algorithm

max-sum algorithm

Ground state calculation

Implementation by Python

Summary