PRML Chapter 13 Maximum Likelihood Estimating Python Implementation of Hidden Markov Models

In Chapter 13, the hidden Markov model is introduced as a model for handling series data. This time, we will implement maximum likelihood estimation with the Hidden Markov Model (HMM). I've done several times to fix the parameters and estimate the hidden state using the forward-backward algorithm, but I've never estimated the parameters, so it's a good opportunity to do it. I decided to see it. By the way, HMM is the basis of Kalman filters and particle filters.

Hidden Markov model

In the hidden Markov model, we consider series data (for example, the front and back of a coin thrown 100 times). And, it is said that there is a latent variable (for example, a coin whose back (front) is easy to come out was used) behind those observed series data. The graphical model of the hidden Markov model is shown in the figure below (from PRML Figure 13.5). graphical_model.png $ \ {x_i \} $ is the observation variable (both front and back of the coin), and $ \ {z_i \} $ is the latent variable (coin bias). Latent variables depend only on the previous latent variable (eg, after a coin that is easy to get out of the table is used, coins that are biased to the same table are used again), and the observed variable depends only on the latent variable at that time. (Example: If coins that are biased to the front are used, the front is easy to come out). When a latent variable depends only on the previous latent variable in this way, it is called a first-order Markov chain.

Let's take a closer look using mathematical formulas. The latent variable $ z_n $ depends only on $ z_ {n-1} $, and the relationship is represented by the transition probability matrix $ A $. Each component of the matrix $ A $ represents the transition probability of a latent variable. The component $ A_ {ij} $ is $ j $ ($ z_) when $ z_ {n-1} $ is in the state $ i $ ($ z_ {n-1, i} = 1 $) and $ z_n $ is in the state $ j $ ($ z_ It is the probability that {n, j} = 1 $). However, it is assumed that the one-to-K coding method is used for latent variables. If you write these as mathematical formulas,

p(z_n|z_{n-1},A) = \prod_{i=1}^K\prod_{j=1}^KA_{ij}^{z_{n-1,i},z_{nj}}

It will be. However, since the first latent variable $ z_1 $ does not have a latent variable before it, we use the K-dimensional probability vector $ \ pi $.

p(z_1|\pi) = \prod_{i=1}^K\pi_i^{z_{1i}}

will do.

If the observed variable $ x $ is a discrete variable that takes D discrete states (eg, a discrete variable that has two states on the front and back of a coin), the conditional distribution of the observed variables is

p(x_n|z_n,\mu) = \prod_{i=1}^D\prod_{k=1}^K\mu_{ik}^{x_{ni}z_{nk}}

It will be. However, the pair K coding method is also used for $ x $.

Maximum likelihood estimation of HMM

In the hidden Markov model introduced above, three parameters $ \ pi, A, \ mu $ have appeared. Here, the maximum likelihood estimation of these three parameters is made from the observed series data $ X = \ {x_n \} _ {n = 1} ^ N $. If these three parameters are collectively expressed as $ \ theta $, the likelihood function to be maximized is

p(X|\theta) = \sum_Zp(X,Z|\theta)

It will be. However, $ Z = \ {z_n \} _ {n = 1} ^ N $. The number in the case of Z is the number of states K of $ z $ to the Nth power (for example: if there are two coin biases and the coin is thrown 100 times, $ 2 ^ {100} \ approx10 ^ {30} ($ Street), it cannot be calculated directly. Therefore, we will use a method to reduce the amount of calculation by taking advantage of the fact that it is a first-order hidden Markov model in the framework of the EM algorithm.

The EM algorithm iterates the following formula calculation and maximization for $ \ theta $.

Q(\theta,\theta^{old}) = \sum_Zp(Z|X,\theta^{old})\ln p(X,Z|\theta)

M step

Let $ \ gamma (z_n) $ be the peripheral posterior distribution of the latent variable $ z_n $, and let $ \ xi (z_ {n-1}, z_n) $ be the simultaneous posterior distribution of consecutive latent variables.

\begin{align}
\gamma(z_n) &= p(z_n|X,\theta^{old})\\
\xi(z_{n-1},z_n) &= p(z_{n-1},z_n|\theta^{old})
\end{align}

Using these, if you rewrite the formula to be maximized,

Q(\theta,\theta^{old}) = \sum_{k=1}^K\gamma(z_{1k})\ln\pi_k + \sum_{n=2}^N\sum_{j=1}^K\sum_{k=1}^K\xi(z_{n-1,j},z_{n,k})\ln A_{jk} + \sum_{n=1}^N\sum_{k=1}^K \gamma(z_{nk})\ln p(x_n|z_n,\mu)

When this is maximized for the parameters $ \ pi, A, \ mu $, the parameters at that time use the Lagrange multiplier method.

\begin{align}
\pi_k &= {\gamma(z_{1k})\over\sum_{k=1}^K\gamma(z_{1j})}\\
A_{jk} &= {\sum_{n=2}^N\xi(z_{n-1,j},z_{nk})\over\sum_{l=1}^K\sum_{n=2}^N\xi(z_{n-1,j},z_{nl})}\\
\mu_{ik} &= {\sum_{n=1}^N\gamma(z_{nk})x_{ni}\over\sum_{n=1}^N\gamma(z_{nk})}
\end{align}

I can find it.

E step

In this step, we will calculate the two posterior probability distributions $ \ gamma (z_n), \ xi (z_ {n-1}, z_n) $ for the latent variables used in step M. The method is called the forward-backward algorithm, and as the name suggests, the posterior probability distribution can be calculated efficiently by performing two calculations from the front and the back for the series data.

Forward

Recursively $ \ alpha (z_n) = p (x_1, \ dots, x_n, z_n) $ with $ \ alpha (z_1) = p (x_1, z_1) = p (x_1 | z_1) p (z_1) $ Calculate.

\begin{align}
\alpha(z_n) &= p(x_1,\dots,x_n,z_n)\\
&= \sum_{z_{n-1}}p(x_1,\dots,x_n,z_n,z_{n-1})\\
&= \sum_{z_{n-1}}p(x_1,\dots,x_n|z_n,z_{n-1})p(z_n|z_{n-1})p(z_{n-1})\\
&= \sum_{z_{n-1}}p(x_1,\dots,x_{n-1}|z_{n-1})p(x_n|z_n)p(z_n|z_{n-1})p(z_{n-1})\\
&= p(x_n|z_n)\sum_{z_{n-1}}\alpha(z_{n-1})p(z_n|z_{n-1})
\end{align}

Backward

\beta(z_N)=p(x_N|z_N)As\beta(z_n)=p(x_{n+1},\dots,x_N|z_n)Is calculated recursively.

\begin{align}
\beta(z_n)&=p(x_{n+1},\dots,x_N|z_n)\\
 &= \sum_{z_{n+1}}p(z_{n+1},x_{n+1},\dots,x_N|z_n)\\
&= \sum_{z_{n+1}}p(x_{n+1},\dots,x_N|z_n,z_{n+1})p(z_{n+1}|z_n)\\
&= \sum_{z_{n+1}}p(x_{n+2},\dots,x_N|z_{n+1})p(x_{n+1}|z_{n+1})p(z_{n+1}|z_n)\\
&= \sum_{z_{n+1}}\beta(z_{n+1})p(x_{n+1}|z_{n+1})p(z_{n+1}|z_n)
\end{align}

Posterior probability distribution

From forward and backward above

\begin{align}
\gamma(z_n) &= {\alpha(z_n)\beta(z_n)\over p(X)}\\
\xi(z_{n-1},z_n) &= {\alpha(z_{n-1})p(x_n|z_n)p(z_n|z_{n-1})\beta(z_n)\over p(X)}
\end{align}

The posterior probability distribution can be obtained as shown below.

code

import Use numpy and matplotlib as usual.

import matplotlib.pyplot as plt
import numpy as np

Hidden Markov model

class HiddenMarkovModel(object):

    def __init__(self, n_states_hidden, n_states_observe):
        #Number of states of latent variable
        self.n_states_hidden = n_states_hidden

        #Number of states of observed variables
        self.n_states_observe = n_states_observe

        #Initialization of parameter pi of distribution of initial latent variable
        self.initial = np.ones(n_states_hidden) / n_states_hidden

        #Initialization of transition probability matrix A
        self.transition = np.ones((n_states_hidden, n_states_hidden)) / (2 * n_states_hidden)
        self.transition += np.eye(n_states_hidden) * 0.5

        #Initialization of parameter mu of distribution of observed variables
        self.observation = np.random.rand(n_states_observe, n_states_hidden)
        self.observation /= np.sum(self.observation, axis=0, keepdims=True)

    # pi,A,Maximum likelihood estimation of mu
    def fit(self, sequence, iter_max=100):
        #Repeat EM step
        for i in xrange(iter_max):
            params = np.hstack((self.transition.ravel(), self.observation.ravel()))

            #E step
            p_hidden, p_transition = self.expectation(sequence)

            #M step
            self.maximization(sequence, p_hidden, p_transition)

            #Check if it has converged
            if np.allclose(params, np.hstack((self.transition.ravel(), self.observation.ravel()))):
                break

    #E step
    def expectation(self, sequence):
        N = len(sequence)
        forward = np.zeros(shape=(N, self.n_states_hidden))
        # alpha(z_1)=p(x_1,z_1)Calculate
        forward[0] = self.initial * self.observation[sequence[0]]
        backward = np.zeros_like(forward)
        # beta(z_N)=p(x_N|z_N)Calculate
        backward[-1] = self.observation[sequence[-1]]

        #Forward
        for i in xrange(1, len(sequence)):
            #PRML formula(13.36)
            forward[i] = self.transition.dot(forward[i - 1]) * self.observation[sequence[i]]

        #Backward
        for j in xrange(N - 2, -1, -1):
            #PRML formula(13.38)
            backward[j] = (self.observation[sequence[j + 1]] * backward[j + 1]).dot(self.transition)

        #PRML formula(13.33)Latent variable z_posterior probability distribution of n gamma(z_n)Calculate
        p_hidden = forward * backward
        p_hidden /= np.sum(p_hidden, axis=-1, keepdims=True)

        #PRML formula(13.43)Simultaneous posterior probability distribution of consecutive latent variables xi(z_{n-1},z_n)Calculate
        p_transition = self.transition * (self.observation[sequence[1:]] * backward[1:])[:, :, None] * forward[:-1, None, :]
        p_transition /= np.sum(p_transition, axis=(1, 2), keepdims=True)

        return p_hidden, p_transition

    #M step
    def maximization(self, sequence, p_hidden, p_transition):
        #PRML formula(13.18)Update parameters for distribution of initial latent variables
        self.initial = p_hidden[0] / np.sum(p_hidden[0])

        #PRML formula(13.19)Update the transition probability matrix
        self.transition = np.sum(p_transition, axis=0) / np.sum(p_transition, axis=(0, 2))
        self.transition /= np.sum(self.transition, axis=0, keepdims=True)

        #PRML formula(13.23)Observation model parameter update
        x = p_hidden[:, None, :] * (np.eye(self.n_states_observe)[sequence])[:, :, None]
        self.observation = np.sum(x, axis=0) / np.sum(p_hidden, axis=0)

data

#Create coin throwing data
def create_toy_data(sample_size=100):

    #Throw coins according to the bias
    def throw_coin(bias):
        if bias == 1:
            #Table is 0.Throw coins with a probability of 8
            return np.random.choice(range(2), p=[0.2, 0.8])
        else:
            #The back is 0.Throw coins with a probability of 8
            return np.random.choice(range(2), p=[0.8, 0.2])

    #Bias of the first coin
    bias = np.random.uniform() > 0.5
    coin = []
    cheats = []
    for i in xrange(sample_size):
        coin.append(throw_coin(bias))
        cheats.append(bias)

        # 0.Change the coin bias with a probability of 01
        bias = bias + np.random.choice(range(2), p=[0.99, 0.01])
        bias = bias % 2
    coin = np.asarray(coin)

    return coin, cheats

Whole code

import matplotlib.pyplot as plt
import numpy as np


class HiddenMarkovModel(object):

    def __init__(self, n_states_hidden, n_states_observe):
        self.n_states_hidden = n_states_hidden
        self.n_states_observe = n_states_observe
        self.initial = np.ones(n_states_hidden) / n_states_hidden
        self.transition = np.ones((n_states_hidden, n_states_hidden)) / (2 * n_states_hidden)
        self.transition += np.eye(n_states_hidden) * 0.5
        self.observation = np.random.rand(n_states_observe, n_states_hidden)
        self.observation /= np.sum(self.observation, axis=0, keepdims=True)

    def fit(self, sequence, iter_max=100):
        for i in xrange(iter_max):
            params = np.hstack((self.transition.ravel(), self.observation.ravel()))
            p_hidden, p_transition = self.expectation(sequence)
            self.maximization(sequence, p_hidden, p_transition)
            if np.allclose(params, np.hstack((self.transition.ravel(), self.observation.ravel()))):
                break

    def expectation(self, sequence):
        N = len(sequence)
        forward = np.zeros(shape=(N, self.n_states_hidden))
        forward[0] = self.initial * self.observation[sequence[0]]
        backward = np.zeros_like(forward)
        backward[-1] = self.observation[sequence[-1]]
        for i in xrange(1, len(sequence)):
            forward[i] = self.transition.dot(forward[i - 1]) * self.observation[sequence[i]]
        for j in xrange(N - 2, -1, -1):
            backward[j] = (self.observation[sequence[j + 1]] * backward[j + 1]).dot(self.transition)
        p_hidden = forward * backward
        p_hidden /= np.sum(p_hidden, axis=-1, keepdims=True)
        p_transition = self.transition * (self.observation[sequence[1:]] * backward[1:])[:, :, None] * forward[:-1, None, :]
        p_transition /= np.sum(p_transition, axis=(1, 2), keepdims=True)

        return p_hidden, p_transition

    def maximization(self, sequence, p_hidden, p_transition):
        self.initial = p_hidden[0] / np.sum(p_hidden[0])
        self.transition = np.sum(p_transition, axis=0) / np.sum(p_transition, axis=(0, 2))
        self.transition /= np.sum(self.transition, axis=0, keepdims=True)
        x = p_hidden[:, None, :] * (np.eye(self.n_states_observe)[sequence])[:, :, None]
        self.observation = np.sum(x, axis=0) / np.sum(p_hidden, axis=0)


def create_toy_data(sample_size=100):

    def throw_coin(bias):
        if bias == 1:
            return np.random.choice(range(2), p=[0.2, 0.8])
        else:
            return np.random.choice(range(2), p=[0.8, 0.2])

    bias = np.random.uniform() > 0.5
    coin = []
    cheats = []
    for i in xrange(sample_size):
        coin.append(throw_coin(bias))
        cheats.append(bias)
        bias = bias + np.random.choice(range(2), p=[0.99, 0.01])
        bias = bias % 2
    coin = np.asarray(coin)

    return coin, cheats


def main():
    coin, cheats = create_toy_data(200)

    hmm = HiddenMarkovModel(2, 2)
    hmm.fit(coin, 100)
    p_hidden, _ = hmm.expectation(coin)

    plt.plot(cheats)
    plt.plot(p_hidden[:, 1])
    for i in xrange(0, len(coin), 2):
        plt.annotate(str(coin[i]), (i - .75, coin[i] / 2. + 0.2))
    plt.ylim(-0.1, 1.1)
    plt.show()


if __name__ == '__main__':
    main()

result

If you run the code above, you may get the result shown in the figure below. result.png The horizontal axis shows the number of coin throws, and the vertical axis shows the probability. The blue line shows the bias of the coins actually used (two types of coins, one that is easy to come out and the other that is easy to come out). Is used. The green line shows the probability that the coin thrown at that time is in state 1 (I don't know which coin is in state 1). The 0s and 1s in the graph represent the front and back of the actually observed coins. However, it is difficult to read if all are displayed, so it is displayed only once every two times.

Depending on the initial value, the EM algorithm may get stuck in the local optimum solution, so the above result may not always be obtained.

At the end

If parameters are also made into variables, they will fit into bad local optimal solutions quite often, so it may be good to introduce prior knowledge instead of maximum likelihood estimation to perform maximum posteriori probability estimation.

Recommended Posts

PRML Chapter 13 Maximum Likelihood Estimating Python Implementation of Hidden Markov Models
Python implementation of continuous hidden Markov model
PRML Chapter 11 Markov Chain Monte Carlo Python Implementation
Maximum likelihood estimation implementation of topic model in python
Explanation and implementation of PRML Chapter 4
PRML Chapter 5 Neural Network Python Implementation
PRML Chapter 3 Evidence Approximation Python Implementation
PRML Chapter 8 Product Sum Algorithm Python Implementation
PRML Chapter 4 Bayesian Logistic Regression Python Implementation
PRML Chapter 5 Mixed Density Network Python Implementation
PRML Chapter 9 Mixed Gaussian Distribution Python Implementation
PRML Chapter 14 Conditional Mixed Model Python Implementation
PRML Chapter 10 Variational Gaussian Distribution Python Implementation
PRML Chapter 6 Gaussian Process Regression Python Implementation
PRML Chapter 2 Student's t Distribution Python Implementation
PRML Chapter 1 Bayesian Curve Fitting Python Implementation
PRML Chapter 7 Related Vector Machine Python Implementation for Regression Problems
Python implementation of particle filters
Implementation of quicksort in Python
Maximum likelihood estimation implementation of topic model in python
EM of mixed Gaussian distribution
PRML Chapter 13 Maximum Likelihood Estimating Python Implementation of Hidden Markov Models
PRML ยง3.3.1 Reproduce the convergence diagram of the parameter distribution by Bayesian linear regression
Example of python code for exponential distribution and maximum likelihood estimation (MLE)
[Python of Hikari-] Chapter 09-03 Class (inheritance)
Python implementation of self-organizing particle filters
Implementation of life game in Python
Implementation of desktop notifications using Python
Python implementation of non-recursive Segment Tree
Implementation of Light CNN (Python Keras)
Implementation of original sorting in Python
Implementation of Dijkstra's algorithm with python
How to use hmmlearn, a Python library that realizes hidden Markov models
Implementation of particle filters in Python and application to state space models
Example of python code for exponential distribution and maximum likelihood estimation (MLE)
Python vs Ruby "Deep Learning from scratch" Chapter 4 Implementation of loss function