Variational Bayesian inference implementation of topic model in Python

Introduction

I implemented variational Bayesian estimation of a topic model in Python, using ["Topic Model" (Machine Learning Professional Series, Iwata)](https://www.amazon.co.jp/dp/4061529048) as a textbook.


Topic model

The explanation of the topic model itself is omitted here, since it is covered in [Maximum likelihood estimation implementation of topic model in python](http://qiita.com/ta-ka/items/18248abf0135cca02b93).

Preparation

The necessary formulas are derived first.

Beta function

The beta function is expressed as follows (the factorial manipulation assumes integer $\alpha$ and $\beta$, but the resulting expression in terms of gamma functions holds in general).

\begin{align}
\int_0^1 \phi^{\alpha - 1} (1 - \phi)^{\beta - 1} d\phi &= \left[ \cfrac{\phi^{\alpha}}{\alpha} (1 - \phi)^{\beta - 1} \right]_0^1 + \int_0^1 \cfrac{\phi^{\alpha}}{\alpha} (\beta - 1) (1 - \phi)^{\beta - 2} d\phi \\
&= \cfrac{\beta - 1}{\alpha} \int_0^1 \phi^{\alpha} (1 - \phi)^{\beta - 2} d\phi \\
&= \cfrac{\beta - 1}{\alpha} \cdots \cfrac{1}{\alpha + \beta - 2} \int_0^1 \phi^{\alpha + \beta - 2} d\phi \\
&= \cfrac{(\beta - 1) \cdots 1}{\alpha \cdots (\alpha + \beta - 1)} \\
&= \cfrac{(\beta - 1)!}{\cfrac{(\alpha + \beta - 1)!}{(\alpha - 1)!}} \\
&= \cfrac{(\alpha - 1)!(\beta - 1)!}{(\alpha + \beta - 1)!} \\
&= \cfrac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)} \equiv B(\alpha, \beta)
\end{align}

From this, the following equation holds.

\begin{align}
\int_0^{1 - q} x^{\alpha - 1} (1 - q - x)^{\beta - 1} dx
&= \int_0^1 \bigl( (1 - q) y \bigr)^{\alpha - 1} \bigl( (1 - q) (1 - y) \bigr)^{\beta - 1} (1 - q) dy \ \ \ \ \because x = (1 - q)y \\
&= (1 - q)^{\alpha + \beta - 1} \int_0^1 y^{\alpha - 1} (1 - y)^{\beta - 1} dy \\
&= (1 - q)^{\alpha + \beta - 1} B(\alpha, \beta) \\
\end{align}
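
As a quick sanity check (not part of the derivation), this identity can be verified numerically with scipy; a minimal sketch with arbitrary values of $\alpha$, $\beta$ and $q$ (with $q = 0$ it reduces to the beta function itself):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

a, b, q = 3.0, 4.0, 0.2  # arbitrary values

# left-hand side: direct numerical integration
lhs, _ = quad(lambda x: x ** (a - 1) * (1 - q - x) ** (b - 1), 0, 1 - q)

# right-hand side: (1 - q)^(a + b - 1) * B(a, b)
rhs = (1 - q) ** (a + b - 1) * gamma(a) * gamma(b) / gamma(a + b)

print(lhs, rhs)  # the two values should agree
```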

Dirichlet distribution

The Dirichlet distribution is expressed as follows.

\begin{align}
{\rm Dirichlet}(\boldsymbol \phi \mid \boldsymbol \beta)
= \cfrac{\displaystyle \prod_{v = 1}^V \phi_v^{\beta_v - 1}}{\displaystyle \int \prod \limits_{v = 1}^V \phi_v^{\beta_v - 1} d \boldsymbol \phi} \\
\end{align}

Expand the normalization term.

\begin{align}
\int \prod_{v = 1}^V \phi_v^{\beta_v - 1} d\boldsymbol \phi
&= \int_0^1 \phi_1^{\beta_1 - 1} \int_0^{1 - \phi_1} \phi_2^{\beta_2 - 1} \cdots \int_0^{1 - \sum\limits_{v = 1}^{V - 2} \phi_v} \phi_{V - 1}^{\beta_{V - 1} - 1} \left( 1 - \sum_{v = 1}^{V - 1}\phi_v \right)^{\beta_V - 1} d\phi_{V - 1} \cdots d\phi_2 d\phi_1 \\
&= \int_0^1 \phi_1^{\beta_1 - 1} \int_0^{1 - \phi_1} \phi_2^{\beta_2 - 1} \cdots \int_0^{1 - \sum\limits_{v = 1}^{V - 3} \phi_v} \phi_{V - 2}^{\beta_{V - 2} - 1} \left( 1 - \sum_{v = 1}^{V - 2}\phi_v \right)^{\beta_{V - 1} + \beta_V - 1} B(\beta_{V - 1}, \beta_V) d\phi_{V - 2} \cdots d\phi_2 d\phi_1 \\
&= \int_0^1 \phi_1^{\beta_1 - 1} \int_0^{1 - \phi_1} \phi_2^{\beta_2 - 1} \cdots \int_0^{1 - \sum\limits_{v = 1}^{V - 4} \phi_v} \phi_{V - 3}^{\beta_{V - 3} - 1} \left( 1 - \sum_{v = 1}^{V - 3}\phi_v \right)^{\beta_{V - 2} + \beta_{V - 1} + \beta_V - 1} B(\beta_{V - 2}, \beta_{V - 1} + \beta_V) B(\beta_{V - 1}, \beta_V) d\phi_{V - 3} \cdots d\phi_2 d\phi_1 \\
&= B \left( \beta_1, \sum_{v = 2}^V \beta_v \right) \cdots B(\beta_{V - 2}, \beta_{V - 1} + \beta_V) B(\beta_{V - 1}, \beta_V) \\
&= \cfrac{\Gamma(\beta_1) \Gamma \left( \sum\limits_{v = 2}^V \beta_v \right)}{\Gamma \left( \sum\limits_{v = 1}^V \beta_v \right)} \cdots \cfrac{\Gamma(\beta_{V - 2}) \Gamma(\beta_{V - 1} + \beta_V)}{\Gamma(\beta_{V - 2} + \beta_{V - 1} + \beta_V)} \cfrac{\Gamma(\beta_{V - 1}) \Gamma(\beta_V)}{\Gamma(\beta_{V - 1} + \beta_V)} \\
&= \cfrac{\prod\limits_{v = 1}^V \Gamma(\beta_v)}{\Gamma \left( \sum\limits_{v = 1}^V \beta_v \right)} \\
\end{align}

Substituting this back into the normalization term gives the usual form.

\begin{align}
{\rm Dirichlet}(\boldsymbol \phi \mid \boldsymbol \beta)
= \cfrac{\displaystyle \Gamma \left( \sum_{v = 1}^V \beta_v \right)}{\displaystyle \prod_{v = 1}^V \Gamma(\beta_v)} \prod_{v = 1}^V \phi_v^{\beta_v - 1} \\
\end{align}
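
The normalization constant derived above can also be checked numerically for a small case, e.g. $V = 3$; a sketch with arbitrary parameters (my own check, not part of the derivation):

```python
import numpy as np
from scipy.integrate import dblquad
from scipy.special import gamma

beta_vec = np.array([2.0, 3.0, 1.5])  # arbitrary parameters beta_1, beta_2, beta_3

# integrate phi_1^(b1-1) * phi_2^(b2-1) * (1 - phi_1 - phi_2)^(b3-1) over the simplex;
# max(..., 0.0) guards against tiny negative rounding at the boundary
integral, _ = dblquad(
    lambda p2, p1: p1 ** (beta_vec[0] - 1) * p2 ** (beta_vec[1] - 1)
                   * max(1.0 - p1 - p2, 0.0) ** (beta_vec[2] - 1),
    0, 1,                 # range of phi_1
    lambda p1: 0,         # lower limit of phi_2
    lambda p1: 1 - p1,    # upper limit of phi_2
)

closed_form = gamma(beta_vec).prod() / gamma(beta_vec.sum())
print(integral, closed_form)  # the two values should agree
```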

Random variables that follow the Dirichlet distribution

The expected value of the logarithm of the $v$-th element $\phi_v$ of a random variable $\boldsymbol \phi$ that follows the Dirichlet distribution can be expressed as follows.

\begin{align}
\int p(\boldsymbol \phi \mid \boldsymbol \beta) \log \phi_v d \boldsymbol \phi
&= \int \cfrac{\Gamma(\hat \beta)}{\prod\limits_{v' = 1}^V \Gamma(\beta_{v'})} \prod_{v' = 1}^V \phi_{v'}^{\beta_{v'} - 1} \log \phi_v d \boldsymbol \phi \\
&= \cfrac{\Gamma(\hat \beta)}{\prod\limits_{v' = 1}^V \Gamma(\beta_{v'})} \int \cfrac{\partial \prod\limits_{v' = 1}^V \phi_{v'}^{\beta_{v'} - 1}}{\partial \beta_v} d \boldsymbol \phi \\
&= \cfrac{\Gamma(\hat \beta)}{\prod\limits_{v' = 1}^V \Gamma(\beta_{v'})} \cfrac{\partial}{\partial \beta_v} \int \prod_{v' = 1}^V \phi_{v'}^{\beta_{v'} - 1} d \boldsymbol \phi \\
&= \cfrac{\Gamma(\hat \beta)}{\prod\limits_{v' = 1}^V \Gamma(\beta_{v'})} \cfrac{\partial}{\partial \beta_v} \cfrac{\prod\limits_{v' = 1}^V \Gamma(\beta_{v'})}{\Gamma(\hat \beta)} \\
&= \cfrac{\partial}{\partial \beta_v} \log \cfrac{\prod\limits_{v' = 1}^V \Gamma(\beta_{v'})}{\Gamma(\hat \beta)} \\
&= \cfrac{\partial}{\partial \beta_v} \left( \log \prod_{v' = 1}^V \Gamma(\beta_{v'}) - \log \Gamma(\hat \beta) \right) \\
&= \cfrac{\partial \log \Gamma(\beta_v)}{\partial \beta_v} - \cfrac{\partial \log \Gamma(\hat \beta)}{\partial \hat \beta} = \Psi(\beta_v) - \Psi(\hat \beta) \\
\end{align}

Here $\hat \beta$ is the sum of the parameters and $\Psi(x)$ is the digamma function.

\begin{align}
& \hat \beta = \sum_{v = 1}^V \beta_v \\
& \Psi(x) = \cfrac{d}{dx} \log \Gamma(x) \\
\end{align}
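
This relation is easy to check by Monte Carlo simulation with `np.random.dirichlet`; a small sketch with arbitrary parameters:

```python
import numpy as np
from scipy.special import digamma

beta_vec = np.array([2.0, 1.5, 3.0, 1.0])  # arbitrary Dirichlet parameters

# Monte Carlo estimate of E[log phi_v]
samples = np.random.dirichlet(beta_vec, size=100000)
mc_estimate = np.log(samples).mean(axis=0)

# closed form: Psi(beta_v) - Psi(sum of beta)
closed_form = digamma(beta_vec) - digamma(beta_vec.sum())

print(mc_estimate)
print(closed_form)  # should be close to the Monte Carlo estimate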

Variational Bayesian inference

Variational Bayesian estimation infers the posterior distributions of the unknown variables. The topic model has three types of unknown variables:

| Unknown variable | Notation |
| --- | --- |
| Topic set | $\boldsymbol Z = (z_{11}, \cdots, z_{1N_1}, z_{21}, \cdots, z_{DN_D}), \ \ z_{dn} \in \{1, \cdots, K\}$ |
| Topic distribution set | $\boldsymbol \Theta = (\boldsymbol \theta_1, \cdots, \boldsymbol \theta_D), \ \ \boldsymbol \theta_d = (\theta_{d1}, \cdots, \theta_{dK})$ |
| Word distribution set | $\boldsymbol \Phi = (\boldsymbol \phi_1, \cdots, \boldsymbol \phi_K), \ \ \boldsymbol \phi_k = (\phi_{k1}, \cdots, \phi_{kV})$ |

$D$ is the number of documents, $K$ is the number of topics, $V$ is the vocabulary size, and $N_d$ is the length of document $d$. $z_{dn}$ is the topic of the $n$-th word in document $d$, $\theta_{dk}$ is the probability that topic $k$ is assigned in document $d$, and $\phi_{kv}$ is the probability that topic $k$ generates vocabulary item $v$. The topic distribution set $\boldsymbol \Theta$ is generated from a Dirichlet distribution with parameter $\alpha$, and the word distribution set $\boldsymbol \Phi$ is generated from a Dirichlet distribution with parameter $\beta$.
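
As a rough sketch of how these quantities are represented in the code later on (toy values, my own illustration):

```python
import numpy as np

D, K, V = 3, 2, 6   # number of documents, topics, and vocabulary items

Theta = np.full((D, K), 1.0 / K)   # Theta[d, k] = theta_{dk}
Phi = np.full((K, V), 1.0 / V)     # Phi[k, v]   = phi_{kv}

# each document d is a variable-length array of word ids w_{dn} in {0, ..., V-1},
# with a topic id z_{dn} in {0, ..., K-1} attached to each word
W = [np.array([0, 2, 2, 5]), np.array([1, 3]), np.array([4, 4, 0])]
Z = [np.array([0, 1, 1, 0]), np.array([1, 0]), np.array([0, 0, 1])]
```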

Variational lower bound

Use Jensen's inequality to find the variational lower bound $F$ of $\log p(\boldsymbol W \mid \alpha, \beta)$.

\begin{align}
\log p(\boldsymbol W \mid \alpha, \beta)
&= \log \int \int \sum_{\boldsymbol Z} p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta) d\boldsymbol \Theta d \boldsymbol \Phi \\
&= \log \int \int \sum_{\boldsymbol Z} q(\boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi) \cfrac{p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta)}{q(\boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi)} d\boldsymbol \Theta d \boldsymbol \Phi \\
&\geq \int \int \sum_{\boldsymbol Z} q(\boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi) \log \cfrac{p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta)}{q(\boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi)} d\boldsymbol \Theta d \boldsymbol \Phi \\
&= \int \int \sum_{\boldsymbol Z} q(\boldsymbol Z) q(\boldsymbol \Theta, \boldsymbol \Phi) \log \cfrac{p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta)}{q(\boldsymbol Z) q(\boldsymbol \Theta, \boldsymbol \Phi)} d\boldsymbol \Theta d \boldsymbol \Phi \equiv F \\
\end{align}

Variational posterior distribution of topics

Define $F(q(\boldsymbol Z))$ as follows and take its variation with respect to $q(\boldsymbol Z)$.

\begin{align}
F(q(\boldsymbol Z))
&\equiv \int \int \sum_{\boldsymbol Z} q(\boldsymbol Z) q(\boldsymbol \Theta, \boldsymbol \Phi) \log \cfrac{p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta)}{q(\boldsymbol Z) q(\boldsymbol \Theta, \boldsymbol \Phi)} d \boldsymbol \Theta d \boldsymbol \Phi + \lambda \left( \sum_{\boldsymbol Z} q(\boldsymbol Z) - 1 \right) \\
\cfrac{\partial F(q(\boldsymbol Z))}{\partial q(\boldsymbol Z)}
&= \int \int q(\boldsymbol \Theta, \boldsymbol \Phi) \bigl( \log p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta) - \log q(\boldsymbol Z) - \log q(\boldsymbol \Theta, \boldsymbol \Phi) - 1 \bigr) d\boldsymbol \Theta d \boldsymbol \Phi + \lambda \\
&= \mathbb{E}_{q(\boldsymbol \Theta, \boldsymbol \Phi)} \bigl[ \log p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta) \bigr] - \log q(\boldsymbol Z) + \lambda - C \\
\end{align}

Solve $\displaystyle \cfrac{\partial F(q(\boldsymbol Z))}{\partial q(\boldsymbol Z)} = 0$.

\begin{align}
q(\boldsymbol Z)
&\propto \exp \Bigl( \mathbb{E}_{q(\boldsymbol \Theta, \boldsymbol \Phi)} \bigl[ \log p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta) \bigr] \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol \Theta, \boldsymbol \Phi)} \bigl[ \log p(\boldsymbol \Theta \mid \alpha) + \log p(\boldsymbol \Phi \mid \beta) + \log p(\boldsymbol Z \mid \boldsymbol \Theta) + \log p(\boldsymbol W \mid \boldsymbol Z, \boldsymbol \Phi) \bigr] \Bigr) \\
&\propto \exp \Bigl( \mathbb{E}_{q(\boldsymbol \Theta, \boldsymbol \Phi)} \bigl[ \log p(\boldsymbol Z \mid \boldsymbol \Theta) + \log p(\boldsymbol W \mid \boldsymbol Z, \boldsymbol \Phi) \bigr] \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol \Theta)} \bigl[ \log p(\boldsymbol Z \mid \boldsymbol \Theta) \bigr] + \mathbb{E}_{q(\boldsymbol \Phi)} \bigl[ \log p(\boldsymbol W \mid \boldsymbol Z, \boldsymbol \Phi) \bigr] \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol \Theta)} \bigl[ \log \prod_{d = 1}^D \prod_{n = 1}^{N_d} \theta_{dz_{dn}} \bigr] + \mathbb{E}_{q(\boldsymbol \Phi)} \bigl[ \log \prod_{d = 1}^{D} \prod_{n = 1}^{N_d} \phi_{z_{dn}w_{dn}} \bigr] \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol \Theta)} \bigl[ \sum_{d = 1}^D \sum_{n = 1}^{N_d} \log \theta_{dz_{dn}} \bigr] + \mathbb{E}_{q(\boldsymbol \Phi)} \bigl[ \sum_{d = 1}^D \sum_{n = 1}^{N_d} \log \phi_{z_{dn}w_{dn}} \bigr] \Bigr) \\
&= \prod_{d = 1}^D \prod_{n = 1}^{N_d} \exp \Bigl( \mathbb{E}_{q(\boldsymbol \Theta)} \bigl[ \log \theta_{dz_{dn}} \bigr] + \mathbb{E}_{q(\boldsymbol \Phi)} \bigl[ \log \phi_{z_{dn}w_{dn}} \bigr] \Bigr) \\
&= \prod_{d = 1}^D \prod_{n = 1}^{N_d} \exp \Bigl( \Psi \bigl( \alpha_{dz_{dn}} \bigr) - \Psi \bigl( \sum_{k = 1}^K \alpha_{dk} \bigr) + \Psi \bigl( \beta_{z_{dn}w_{dn}} \bigr) - \Psi \bigl( \sum_{v = 1}^V \beta_{z_{dn}v} \bigr) \Bigr) \\
\end{align}

Since $\displaystyle q(\boldsymbol Z) = \prod_{d = 1}^D \prod_{n = 1}^{N_d} q_{dnz_{dn}}$, $q_{dnk}$ becomes as follows.

\begin{align}
q_{dnk}
&\propto \exp \Bigl( \Psi \bigl( \alpha_{dk} \bigr) - \Psi \bigl( \sum_{k' = 1}^K \alpha_{dk'} \bigr) + \Psi \bigl( \beta_{kw_{dn}} \bigr) - \Psi \bigl( \sum_{v = 1}^V \beta_{kv} \bigr) \Bigr) \\
\end{align}
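
In code this update amounts to a few digamma calls per word. A minimal sketch (my own helper, not part of the implementation below; `alpha_d` and `beta_mat` stand for the current variational parameters $\alpha_{dk}$ and $\beta_{kv}$):

```python
import numpy as np
from scipy.special import digamma

def q_dn(alpha_d, beta_mat, w):
    """Normalized q_{dnk} for a single word w_{dn} = w of document d.

    alpha_d  : (K,)   variational parameters of the topic distribution of document d
    beta_mat : (K, V) variational parameters of the word distributions
    w        : int    vocabulary id of the word
    """
    log_q = (digamma(alpha_d) - digamma(alpha_d.sum())
             + digamma(beta_mat[:, w]) - digamma(beta_mat.sum(axis=1)))
    q = np.exp(log_q)
    return q / q.sum()  # normalize over the K topics
```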

Decomposition of topic distribution and word distribution

Define $F(q(\boldsymbol \Theta, \boldsymbol \Phi))$ as follows and take its variation with respect to $q(\boldsymbol \Theta, \boldsymbol \Phi)$.

\begin{align}
F(q(\boldsymbol \Theta, \boldsymbol \Phi))
&\equiv \int \int \sum_{\boldsymbol Z} q(\boldsymbol Z) q(\boldsymbol \Theta, \boldsymbol \Phi) \log \cfrac{p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta)}{q(\boldsymbol Z) q(\boldsymbol \Theta, \boldsymbol \Phi)} d \boldsymbol \Theta d \boldsymbol \Phi + \lambda \left( \int \int q(\boldsymbol \Theta, \boldsymbol \Phi) d \boldsymbol \Theta d \boldsymbol \Phi - 1 \right) \\
\cfrac{\partial F(q(\boldsymbol \Theta, \boldsymbol \Phi))}{\partial q(\boldsymbol \Theta, \boldsymbol \Phi)}
&= \sum_{\boldsymbol Z} q(\boldsymbol Z) \bigl( \log p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta) - \log q(\boldsymbol Z) - \log q(\boldsymbol \Theta, \boldsymbol \Phi) - 1 \bigr) + \lambda \\
&= \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta) \bigr] - \log q(\boldsymbol \Theta, \boldsymbol \Phi) + \lambda - C \\
\end{align}

Solve $\displaystyle \cfrac{\partial F(q(\boldsymbol \Theta, \boldsymbol \Phi))}{\partial q(\boldsymbol \Theta, \boldsymbol \Phi)} = 0$.

\begin{align}
q(\boldsymbol \Theta, \boldsymbol \Phi)
&\propto \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol W, \boldsymbol Z, \boldsymbol \Theta, \boldsymbol \Phi \mid \alpha, \beta) \bigr] \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol \Theta \mid \alpha) + \log p(\boldsymbol \Phi \mid \beta) + \log p(\boldsymbol Z \mid \boldsymbol \Theta) + \log p(\boldsymbol W \mid \boldsymbol Z, \boldsymbol \Phi) \bigr] \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol Z \mid \boldsymbol \Theta) \bigr] + \log p(\boldsymbol \Theta \mid \alpha) \Bigr) \times \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol W \mid \boldsymbol Z, \boldsymbol \Phi) \bigr] + \log p(\boldsymbol \Phi \mid \beta) \Bigr) \\
\end{align}

From the above equation, the distribution factorizes as $q(\boldsymbol \Theta, \boldsymbol \Phi) = q(\boldsymbol \Theta) q(\boldsymbol \Phi)$, and each factor can be expressed as follows.

\begin{align}
& q(\boldsymbol \Theta) \propto \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol Z \mid \boldsymbol \Theta) \bigr] + \log p(\boldsymbol \Theta \mid \alpha) \Bigr) \\
& q(\boldsymbol \Phi) \propto \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol W \mid \boldsymbol Z, \boldsymbol \Phi) \bigr] + \log p(\boldsymbol \Phi \mid \beta) \Bigr) \\
\end{align}

Variational posterior distribution of topic distribution

$q(\boldsymbol \Theta)$ can be calculated as follows.

\begin{align}
q(\boldsymbol \Theta) &\propto \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol Z \mid \boldsymbol \Theta) \bigr] + \log p(\boldsymbol \Theta \mid \alpha) \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \sum_{d = 1}^D \sum_{n = 1}^{N_d} \log \theta_{dz_{dn}} \bigr] + \log \prod_{d = 1}^D p(\boldsymbol \theta_d \mid \alpha) \Bigr) \\
&= \exp \Bigl( \sum_{d = 1}^D \sum_{n = 1}^{N_d} \sum_{k = 1}^{K} q_{dnk} \log \theta_{dk} + \sum_{d = 1}^D \log \cfrac{\Gamma(\alpha K)}{\Gamma(\alpha)^K} \prod_{k = 1}^K \theta_{dk}^{\alpha - 1} \Bigr) \\
&\propto \exp \Bigl( \sum_{d = 1}^D \sum_{n = 1}^{N_d} \sum_{k = 1}^K q_{dnk} \log \theta_{dk} + \sum_{d = 1}^D \sum_{k = 1}^K \log \theta_{dk}^{\alpha - 1} \Bigr) \\
&= \exp \Bigl( \sum_{d = 1}^D \sum_{k = 1}^K \log \theta_{dk}^{\sum \limits_{n = 1}^{N_d} q_{dnk}} + \sum_{d = 1}^D \sum_{k = 1}^K \log \theta_{dk}^{\alpha - 1} \Bigr) \\
&= \prod_{d = 1}^D \prod_{k = 1}^K \theta_{dk}^{\alpha + \sum \limits_{n = 1}^{N_d} q_{dnk} - 1} \\
q(\boldsymbol \Theta) &= \prod_{d = 1}^D {\rm Dirichlet}(\boldsymbol \theta_d \mid \alpha_{d1}, \cdots, \alpha_{dK}) \\
\end{align}

Here, $\alpha_{dk}$ is defined as follows.

\begin{align}
\alpha_{dk} &= \alpha + \sum_{n = 1}^{N_d} q_{dnk} \\
\end{align}

Variational posterior distribution of word distribution

$q(\boldsymbol \Phi)$ can be calculated as follows.

\begin{align}
q(\boldsymbol \Phi) &\propto \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \log p(\boldsymbol W \mid \boldsymbol Z, \boldsymbol \Phi) \bigr] + \log p(\boldsymbol \Phi \mid \beta) \Bigr) \\
&= \exp \Bigl( \mathbb{E}_{q(\boldsymbol Z)} \bigl[ \sum_{d = 1}^D \sum_{n = 1}^{N_d} \log \phi_{z_{dn}w_{dn}} \bigr] + \log \prod_{k = 1}^K p(\boldsymbol \phi_k \mid \beta) \Bigr) \\
&= \exp \Bigl( \sum_{d = 1}^D \sum_{n = 1}^{N_d} \sum_{k = 1}^K q_{dnk} \log \phi_{kw_{dn}} + \sum_{k = 1}^K \log \cfrac{\Gamma(\beta V)}{\Gamma(\beta)^V} \prod_{v = 1}^V \phi_{kv}^{\beta - 1} \Bigr) \\
&\propto \exp \Bigl( \sum_{d = 1}^D \sum_{n = 1}^{N_d} \sum_{k = 1}^K q_{dnk} \log \phi_{kw_{dn}} + \sum_{k = 1}^K \sum_{v = 1}^V \log \phi_{kv}^{\beta - 1} \Bigr) \\
&= \exp \Bigl( \sum_{k = 1}^K \sum_{v = 1}^V \log \phi_{kv}^{\sum \limits_{d = 1}^D \sum \limits_{n: w_{dn}= v} q_{dnk}} + \sum_{k = 1}^K \sum_{v = 1}^V \log \phi_{kv}^{\beta - 1} \Bigr) \\
&= \prod_{k = 1}^K \prod_{v = 1}^V \phi_{kv}^{\beta + \sum \limits_{d = 1}^D \sum \limits_{n: w_{dn}= v} q_{dnk} - 1} \\
q(\boldsymbol \Phi) &= \prod_{k = 1}^K {\rm Dirichlet}(\boldsymbol \phi_k \mid \beta_{k1}, \cdots, \beta_{kV}) \\
\end{align}

Here, $\beta_{kv}$ is defined as follows.

\begin{align}
\beta_{kv} &= \beta + \sum_{d = 1}^D \sum_{n: w_{dn}= v} q_{dnk} \\
\end{align}

Parameter estimation

Estimate the parameters by iterating the following steps, using the derived results (a simple per-word sketch of this loop is shown after the list).

  1. Initialize $\alpha_{dk}$ and $\beta_{kv}$ with random numbers
  2. Compute the variational posterior distribution of the topics: $q_{dnk} \propto \exp \Bigl( \Psi \bigl( \alpha_{dk} \bigr) - \Psi \bigl( \sum_{k' = 1}^K \alpha_{dk'} \bigr) + \Psi \bigl( \beta_{kw_{dn}} \bigr) - \Psi \bigl( \sum_{v = 1}^V \beta_{kv} \bigr) \Bigr)$
  3. Normalize $q_{dnk}$
  4. Update the parameters: $\alpha_{dk} \leftarrow \alpha_{dk} + q_{dnk}$, $\beta_{kw_{dn}} \leftarrow \beta_{kw_{dn}} + q_{dnk}$
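
A straightforward per-word version of this loop might look like the sketch below (my own non-vectorized rewrite for readability; the vectorized code actually used follows in the Implementation section). `W` is assumed to be a list of word-id arrays, one per document.

```python
import numpy as np
from scipy.special import digamma

def estimate(W, D, K, V, alpha0=1.0, beta0=1.0, T=30):
    # 1. initialize the variational parameters with random numbers
    alpha = alpha0 + np.random.rand(D, K)
    beta = beta0 + np.random.rand(K, V)

    for _ in range(T):
        alpha_new = np.full((D, K), alpha0)
        beta_new = np.full((K, V), beta0)
        for d in range(D):
            for w in W[d]:
                # 2. variational posterior of the topic of this word
                q = np.exp(digamma(alpha[d]) - digamma(alpha[d].sum())
                           + digamma(beta[:, w]) - digamma(beta.sum(axis=1)))
                # 3. normalize q_{dnk} over topics
                q /= q.sum()
                # 4. accumulate the updates on top of the priors
                alpha_new[d] += q
                beta_new[:, w] += q
        alpha, beta = alpha_new, beta_new
    return alpha, beta
```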

Implementation

I implemented a toy program for variational Bayesian estimation of the topic model in Python. Understanding and expanding the formulas took some effort, but the result is clean and the code ends up short.

```python
import numpy as np
from scipy.special import digamma

def normalize(ndarray, axis):
    return ndarray / ndarray.sum(axis = axis, keepdims = True)

def normalized_random_array(d0, d1):
    ndarray = np.random.rand(d0, d1)
    return normalize(ndarray, axis = 1)

if __name__ == "__main__":

    # initialize parameters
    D, K, V = 1000, 2, 6
    alpha0, beta0 = 1.0, 1.0
    alpha = alpha0 + np.random.rand(D, K)
    beta = beta0 + np.random.rand(K, V)
    theta = normalized_random_array(D, K)
    phi = normalized_random_array(K, V)

    # cumulative distributions used to generate documents
    _theta = np.array([theta[:, :k+1].sum(axis = 1) for k in range(K)]).T
    _phi = np.array([phi[:, :v+1].sum(axis = 1) for v in range(V)]).T

    # generate documents
    W, Z = [], []
    N = np.random.randint(100, 300, D)
    for (d, N_d) in enumerate(N):
        Z.append((np.random.rand(N_d, 1) < _theta[d, :]).argmax(axis = 1))
        W.append((np.random.rand(N_d, 1) < _phi[Z[-1], :]).argmax(axis = 1))

    # estimate parameters
    T = 30
    for t in range(T):
        dig_alpha = digamma(alpha) - digamma(alpha.sum(axis = 1, keepdims = True))
        dig_beta = digamma(beta) - digamma(beta.sum(axis = 1, keepdims = True))

        alpha_new = np.ones((D, K)) * alpha0
        beta_new = np.ones((K, V)) * beta0
        for (d, N_d) in enumerate(N):
            # q[v, k]: variational posterior of the topic for vocabulary item v in document d
            q = np.zeros((V, K))
            v, count = np.unique(W[d], return_counts = True)
            q[v, :] = (np.exp(dig_alpha[d, :].reshape(-1, 1) + dig_beta[:, v]) * count).T
            q[v, :] /= q[v, :].sum(axis = 1, keepdims = True)

            # update alpha, beta (accumulate expected counts on top of the priors)
            alpha_new[d, :] += count.dot(q[v])
            beta_new[:, v] += count * q[v].T

        alpha = alpha_new.copy()
        beta = beta_new.copy()

    theta_est = np.array([np.random.dirichlet(a) for a in alpha])
    phi_est = np.array([np.random.dirichlet(b) for b in beta])
```
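
Instead of drawing a single sample with `np.random.dirichlet`, the posterior means of the variational Dirichlet distributions could also be used as point estimates; a minimal alternative (my own addition, assuming `alpha` and `beta` from the script above):

```python
# posterior means of the variational Dirichlet distributions
theta_mean = alpha / alpha.sum(axis = 1, keepdims = True)
phi_mean = beta / beta.sum(axis = 1, keepdims = True)
```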

Result

The $\boldsymbol \Phi$ generated from the $\boldsymbol \beta$ obtained by the toy program above is shown below, together with the true $\boldsymbol \Phi$. Properly, perplexity or log likelihood should be used as an evaluation metric, but I skip that here because it is a hassle.

```
phi
[[ 0.26631554  0.04657097  0.29425041  0.1746378   0.03077238  0.1874529 ]
 [ 0.2109456   0.01832505  0.30360253  0.09073456  0.14039401  0.23599826]]
phi_est
[[ 0.27591967  0.0424522   0.26088712  0.18220604  0.02874477  0.2097902 ]
 [ 0.19959096  0.02327517  0.34107528  0.08462041  0.14291539  0.20852279]]
```
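
One caveat when comparing `phi` with `phi_est`: the topic labels are not identifiable, so the estimated rows can come out in a different order (in the run above they happen to line up). A small helper of my own to align them could look like this:

```python
import numpy as np
from itertools import permutations

def align_topics(phi_true, phi_est):
    """Reorder the rows of phi_est to best match phi_true (feasible for small K)."""
    K = phi_true.shape[0]
    best = min(permutations(range(K)),
               key=lambda p: np.abs(phi_true - phi_est[list(p)]).sum())
    return phi_est[list(best)]
```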

In conclusion

The variational Bayesian estimation of the topic model has been derived and implemented. Since the results in this article are not very interesting, I also referred to LDA for Pokemon analysis and tried classifying the Pokemon of the Gold and Silver versions. I will summarize that on Qiita at a later date.
