[PYTHON] PRML: Chapter 8 Graphical Model Image Noise Removal

Overview

PRML Chapter 8 Graphical Model, 8.3.3 Example: Implemented image denoising in Python

Jupyter notebook summarizing the code and experimental results.

Graphical model

__ Probabilistic graphical model __ (probabilistic graohical model) is a schematic representation of the probability distribution. By using a graphical model for analysis, the dependency between random variables can be read intuitively.

The figure above shows the probability distributionp(x_{1}, x_{2}, x_{3}) = p(x_{1})p(x_{2}|x_{1})p(x_{3}|x_{2})To__Directed graph (Bayesian network)で表したものである．頂点To__node，辺To__Link__Is called.

The figure above shows the same probability distribution as __ undirected graph (Markov random field) . In the undirected graphical model, the set of linked nodes is called clique. In the figure above, $ \ {x {1}, x {2} \}, \ {x {2}, x {3} \} $ are the creeks. The undirected graph expresses the probability distribution in the form of the product of _potential function __ $ \ psi {C} (\ mathbf {x} _ {C}) $ on creek C.

p(\mathbf{X}) = \frac{1}{Z}\prod_{C}{\psi_{C}(\mathbf{x}_{C})}

Here, $ Z $ is a normalization constant. Since the potential function is limited to positive in a narrow sense, it is convenient to express it as an exponential function.

\psi_{C}(\mathbf{x_{\rm{C}}}) = \exp\\{-E(\mathbf{x_{\rm{C}}})\\}

Here, $ E (\ mathbf {x_ {\ rm {C}}}) $ is called __energy function __ (energy function), and this exponential function expression is __ Boltzman distribution __ (Boltzman distribution). Is called.

One of the important and beautiful features of the graphical model is that the __conditional independence __ of the joint distribution between multiple random variables can be read directly from the graph without analytical procedures. The general framework for achieving this is called __ directed separation __ (d-separation). Also known are algorithms that calculate the sum of joint distributions at high speed to infer peripheral probabilities, and algorithms that infer combinations of random variable values that maximize joint probabilities at high speed (sum of products by message passing, max- sum algorithm). See PRML for these important features.

Example: Image denoising

We explain how to use an undirected graph using an example of noise reduction in a binary image. It is assumed that the observed image including noise is described as a two-dimensional array of binary pixel values $ y_ {i} \ in \ {-1, +1 \} $. Here, $ i = 1, \ dots, D $ are the serial numbers of the pixels. This observation image has a small probability (10% here) from an unknown (noise-free) binary image described by $ x_ {i} \ in \ {-1, +1 \} $. It is assumed that it is obtained by randomly inverting the sign of the pixel.

Since the noise level is low enough, it is expected that a strong correlation remains between $ x_ {i} $ and $ y_ {i} $. We also know that there is a strong correlation between adjacent pixels $ x_ {i} $ and $ x_ {j} $ in the image. These prior knowledges are expressed by the Markov random field model corresponding to the undirected graph shown in the figure below.

This graph has two types of creeks consisting of two variables. Creeks in the form $ \ {x_ {i}, y_ {i} \} $ are associated with energy functions that represent the correlation between these variables. Here we use a very simple function of the form $-\ eta x_ {i} y_ {i} $ for these creeks. However, $ \ eta $ is a positive constant. This energy function has the desired effect of having low energy (high probability) when $ x_ {i} $ and $ y_ {i} $ have the same sign and high energy (low probability) when they have different signs. Have.

The other type of creek consists of a pair of adjacent variables $ \ {x_ {i}, x_ {j} \} $. However, $ i $ and $ j $ are the numbers of adjacent pixels. Regarding these creeks, we also want the energy to be lower when the two pixel values have the same sign than when they have different signs. Therefore, we use the energy function given by $-\ beta x_ {i} x_ {j} $. However, $ \ beta $ is also a positive constant.

Since it only needs to be any non-negative function on the maximal creek to be a potential function, it can be multiplied by any non-negative function on the subset of the creek. Here, the term $ hx_ {i} $ may be added as a function of the $ i $ th pixel value $ x_ {i} $ of the noise-free image. This term has the effect of biasing the pixel values so that they tend to have a particular sign.

To summarize the above, the total energy function of this model is

E(\mathbf{x}, \mathbf{y}) = h \sum_{i} x_{i} - \beta \sum_{\\{i, j\\}} x_{i}x_{j} - \eta \sum_{i} x_{i}y_{i}

Can be written. Also, using this, the joint distribution of $ \ mathbf {x} $ and $ \ mathbf {y} $ is

p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \exp\\{-E(\mathbf{x}, \mathbf{y})\\}

Given by. If $ \ mathbf {y} $ is fixed to the observed image, the conditional distribution $ p (\ mathbf {x} | \ mathbf {y}) $ is determined. This problem is an example of the __ Ising model, which has been widely studied in statistical physics. To restore the highly probable image $ \ mathbf {x} $ from the observed image $ \ mathbf {y} $, use a simple iteration method called __ iterated conditional modes (ICM). You can use it. In this method, the gradient ascending method is simply applied for each axis. First, initialize the variable $ \ {x_ {i} \} $. For example, $ x_ {i} = y_ {i} $ for all $ i $. Then select the nodes $ x_ {j} $ one by one and keep the values of the other node variables fixed in two possible states of $ x_ {j} $ ($ x_ {j} = + 1 $ and $ Evaluate the total energy at x_ {j} =-1 $). Finally, set the value of $ x_ {j} $ to the one with the smaller energy. Such updates are repeated in different locations and terminate when certain appropriate stop conditions are met. Nodes may be updated regularly by raster scan, or nodes may be randomly selected.

The figure below shows the results when the parameters are $ h = 0, \ beta = 1, \ eta = 1 $.

Consideration

I want to implement it even if I replace ICM with max-sum
I lost my name to Bayesian network, Markov random field, and Boltzmann distribution, but d-separation and message passing are more important (I think).