I was reading "Pattern Recognition and Machine Learning" and decided to implement the sum-product algorithm in Chapter 8. However, I couldn't immediately think of a concrete Bayesian network example, so I decided to rely on statistical mechanics. It seems that it is very closely related to the fact that the partition function, which is familiar in statistical mechanics, can be calculated strictly, and the ground state (the global optimum solution as a combination optimization problem) can be obtained. I think it would be very interesting to dig deeper here, but due to lack of competence, I will only show a simple example and implementation. The notation is adapted to "Pattern recognition and machine learning".

This time, consider a system in which spins $ x_i = \ pm 1 $ placed in a tree-like network interact with each other:

First, the code for creating such a network is shown. I will omit the details.

```
import numpy as np
class tree: #Generate a suitable tree
def __init__(self, N):
self.edges = []
self.children_of = [[] for _ in range(N)] #Multiple children
self.parent_of = [[] for _ in range(N)] #parent is unique
self.leaves = [0] #0 is root
for n in range(1,N):
if n == 1: #Determine node i to attach node n
i = 0
else:
i = np.random.choice(n-1)
self.edges.append([i,n])
self.leaves.append(n)
if i in self.leaves:
self.leaves.remove(i)
self.children_of[i].append(n)
self.parent_of[n] = i
N = 10 # N=It seems to be okay because it was done well when I checked it at about 10
network = tree(N)
print(network.edges)
print(network.leaves)
#Generation of J and definition of f
J = {}
for edge in network.edges:
i, j = edge
J[str(i)+","+str(j)] = np.random.normal()
def f(beta, J_ij, x_i, x_j):
return np.exp(beta*J_ij*x_i*x_j)
print(J)
```

First, we briefly touch on the sum-product algorithm and describe the calculation of the partition function using it. Finally, the implementation is shown.

Using $ f_ {ij} (x_i, x_j) $ above, the probability distribution is

The sum-product algorithm finds the marginal distribution $ p (x_n) $ for a node $ x_n $. Specifically, the procedure is as follows:

(1) Select a node $ x_n $ and use it as the root. (2) Find all the leaf nodes for the root, and if it is a variable node (〇), propagate the message $ \ mu_ {x \ to f} (x) = 1 $ to the factor node (□). On the contrary, when it is a factor node (□), the message $ \ mu_ {f \ to x} (x) = f (x) $ is propagated to the variable node (〇).

(3) Propagate messages from the leaf node toward the root based on the following formula. However, $ N (x) $ and $ N (f) $ represent adjacent nodes of the variable node $ x $ and the factor node $ f $, respectively.

```
\mu_{x \to f}(x) = \prod_{l \in N(x) \setminus f} \mu_{f_l \to x}(x)\\
\mu_{f \to x}(x) = \sum_{x_1, \cdots, x_M} f(x,x_1,\cdots,x_M) \prod_{m \in N(f) \setminus x} \mu_{x_m \to f}(x_m)
```

(4) If you calculate $ \ prod_ {f \ in N (x_n)} \ mu_ {f \ to x_n} (x_n) $ on the root $ x_n $, this is $ \ tilde {p} ({\ bf x) }) It becomes $ \ tilde {p} (x_n) $ which is the marginalization of

(5) Obviously $ \ sum_ {x_n} \ tilde {p} (x_n) = Z (\ beta) \ sum_ {x_n} p (x_n) = Z (\ beta) $, so the normalization constant can be calculated. Therefore, $ p (x_n) = \ dfrac {1} {Z (\ beta)} \ tilde {p} (x_n) $ could be calculated.

The above is the sum-product algorithm.

From the sum-product algorithm described above, the normalized constant (partition function) $ Z (\ beta) $ is

For this problem, I implemented the sum-product algorithm in Python. As you can see from the algorithm above, it is not easy to implement for a general network (even if it is a tree). However, with this setting, the factor node is adjacent to only two spin variables, so it can be considerably simplified. How to manage the message $ \ mu $ is the key to implementation, but I decided to force it with a dictionary type. The flow of the algorithm is to start with the leaf node discovered when creating the network and replace the leaf node while approaching the root.

```
import copy
spin = [1.0, -1.0]
beta = 1.0
beta_start = 0.1
beta_end = 4.0
beta_list = np.linspace(beta_start, beta_end, 100) #Calculate partition function for some beta values
Z_list = np.ones(100)
for s,beta in enumerate(beta_list):
mu_x_to_f = {} #Dictionary-type management for readability
mu_f_to_x = {}
leaves = copy.deepcopy(network.leaves) #Otherwise network.leaves are replaced
children_of = copy.deepcopy(network.children_of)
while leaves != [0]: #Follow from leaves to root.
i = leaves.pop(0) #Take out one of the leaves
j = network.parent_of[i] #That parent(unique)Find out
#print(i,j)
direction = str(i)+"->"+str(j) #Dictionary key
interaction = str(j)+","+str(i)
mu_x_to_f[direction] = np.ones(2) #Initialization conditions
for k in network.children_of[i]:
#print("yey")
direction_previous = str(k)+"->"+str(i)
mu_x_to_f[direction] *= mu_f_to_x[direction_previous]
#The theory that it became difficult to understand if I wrote it round and round for readability
mu_f_to_x[direction] = np.zeros(2)
mu_f_to_x[direction][0] = f(beta,J[interaction],spin[0],spin[0])*mu_x_to_f[direction][0] + f(beta,J[interaction],spin[0],spin[1])*mu_x_to_f[direction][1]
mu_f_to_x[direction][1] = f(beta,J[interaction],spin[1],spin[0])*mu_x_to_f[direction][0] + f(beta,J[interaction],spin[1],spin[1])*mu_x_to_f[direction][1]
#print(mu_f_to_x[direction])
children_of[j].remove(i) #Remove i from the child nodes to determine if j can be left
if len(children_of[j]) == 0: #Add to leaves if there are no child nodes
leaves.append(j)
#print(leaves)
p0 = np.ones(2) # p(x_0)As a vector
for i in network.children_of[0]:
direction = str(i)+"->"+str(0)
p0 *= mu_f_to_x[direction]
Z_list[s] = sum(p0)
```

Since the partition function $ Z (\ beta) $ is obtained, the free energy $ F (\ beta) =-\ dfrac {1} {\ beta} \ log Z (\ beta) $ can be calculated. When I actually tried it, it became as follows.

First, the max-sum algorithm is briefly described, and the calculation of the ground state using it is explained. Finally, I will show the implementation and confirm that the state with the minimum energy is certainly obtained.

The max-sum algorithm is an algorithm obtained by replacing $ \ sum $ with $ \ max $ and $ \ prod $ with $ \ sum $ in the sum-prodoct algorithm, and is $ p ({\ bf x}) $. $ {\ Bf x} ^ {*} $, which gives the maximum and maximum values of, is calculated. The procedure is almost the same as the sum-product algorithm, but backpropagation from the root is required to calculate $ {\ bf x} ^ {*} $.

(1) Select a node $ x_n $ and use it as the root. (2) Find all the leaf nodes for the root, and if it is a variable node (〇), propagate the message $ \ mu_ {x \ to f} (x) = 0 $ to the factor node (□). On the contrary, when it is a factor node (□), the message $ \ mu_ {f \ to x} (x) = \ log f (x) $ is propagated to the variable node (〇).

(3) Propagate messages from the leaf node toward the root based on the following formula. However, $ N (x) $ and $ N (f) $ represent adjacent nodes of the variable node $ x $ and the factor node $ f $, respectively.

```
\mu_{x \to f}(x) = \sum_{l \in N(x) \setminus f} \mu_{f_l \to x}(x)\\
\mu_{f \to x}(x) = \max_{x_1, \cdots, x_M} \left[ \log f(x,x_1,\cdots,x_M) +\sum_{m \in N(f) \setminus x} \mu_{x_m \to f}(x_m) \right]
```

At that time, for each $ x $

(4) At root $ x_n $

(5) Backpropagate from root $ x_n $

The above is the procedure of the max-sum algorithm.

Canonical distribution
*} $, which gives the maximum value of $ p ({\ bf x}) $, gives the minimum value of Hamiltonian. In other words, the ground state of the spin system in the network can be found by finding $ {\ bf x} ^ {*} $ using the max-sum algorithm. If you read Hamiltonian as a loss function, this leads to solving a combinatorial optimization problem. As you can see from a closer look at the procedure of the max-sum algorithm, this is dynamic programming itself.
From a statistical mechanics point of view, the fact that the max-sum algorithm cannot be applied well in the presence of loops can be interpreted as the existence of frustration. I think it is a very interesting example that you can understand the close connection between graph theory, optimization problems, and statistical mechanics.

For this problem, I implemented the max-sum algorithm in Python. It is almost the same as the sum-product algorithm, but $ \ beta = 1 $. It is self-evident that this does not lose generality, considering that $ \ log f_ {ij} (x_i, x_j) = \ beta J_ {ij} x_i x_j $.

```
mu_x_to_f = {} #Dictionary-type management for readability
mu_f_to_x = {}
phi = {}
leaves = copy.deepcopy(network.leaves) #Otherwise network.leaves are replaced
children_of = copy.deepcopy(network.children_of)
spin = [1.0, -1.0]
while leaves != [0]: #Follow from leaves to root.
i = leaves.pop(0) #Take out one of the leaves
j = network.parent_of[i] #That parent(unique)Find out
#print(i,j)
direction = str(i)+"->"+str(j) #Dictionary key
interaction = str(j)+","+str(i)
mu_x_to_f[direction] = np.zeros(2) #Initialization conditions
for k in network.children_of[i]:
#print("yey")
direction_previous = str(k)+"->"+str(i)
mu_x_to_f[direction] += mu_f_to_x[direction_previous]
#The theory that it became difficult to understand if I wrote it round and round for readability
mu_f_to_x[direction] = np.zeros(2)
mu_f_to_x[direction][0] = max(J[interaction]*spin[0]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[0]*spin[1]+mu_x_to_f[direction][1])
mu_f_to_x[direction][1] = max(J[interaction]*spin[1]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[1]*spin[1]+mu_x_to_f[direction][1])
phi[direction] = np.array([1,1])
phi[direction][0] = np.argmax([J[interaction]*spin[0]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[0]*spin[1]+mu_x_to_f[direction][1]])
phi[direction][1] = np.argmax([J[interaction]*spin[1]*spin[0]+mu_x_to_f[direction][0],J[interaction]*spin[1]*spin[1]+mu_x_to_f[direction][1]])
#print(mu_f_to_x[direction])
children_of[j].remove(i) #Remove i from the child nodes to determine if j can be left
if len(children_of[j]) == 0: #Add to leaves if there are no child nodes
leaves.append(j)
nodes = copy.deepcopy(network.children_of[0]) #Otherwise network.leaves are replaced
children_of = copy.deepcopy(network.children_of)
x_optimal = [[] for _ in range(N)]
p_max = np.zeros(2)
for i in network.children_of[0]:
direction = str(i)+"->"+str(0)
p_max += mu_f_to_x[direction]
x_optimal[0] = np.argmax(p_max)
while len(nodes) != 0:
i = nodes.pop(0)
j = network.parent_of[i]
direction = str(i)+"->"+str(j)
x_optimal[i] = phi[direction][x_optimal[j]]
nodes.extend(network.children_of[i])
print(x_optimal)
def energy(x):
E = 0.0
for edge in network.edges:
i,j = edge
E += -J[str(i)+","+str(j)]*spin[x[i]]*spin[x[j]]
return E
print(energy(x_optimal))
energy_optimal = energy(x_optimal)
```

For the check, I generated about $ 10000 $ samples and compared the energies, but the $ {\ bf x} ^ * $ obtained by this algorithm certainly gave the minimum energy value.

In the first half, the partition function of the spin system on the tree-like network was calculated by using the sum-product algorithm. In the latter half, the ground state of this system was calculated by the max-sum algorithm. Through this example, I was able to see one aspect of the relationship between graph theory and optimization problems and statistical mechanics. Since this example is too simple, it may be interesting to make various extensions such as increasing the spin state and including three bodies as interactions.

Recommended Posts