Python implementation of continuous hidden Markov model

This is my first Qiita post, so I will write about the implementation of a continuous hidden Markov model in Python, which I struggled with a few months ago. As a memorandum, I will go through the material from the basics, in the order I studied it. I believe the result is roughly correct, but there are some suspicious parts, so if you notice anything, I would appreciate it if you could let me know.

What is a Markov model?

A Markov model is a model built on a Markov process, and a Markov process is a stochastic process with the Markov property. The word "Markov" appears so many times that it took me a while to understand, so let me explain it in my own words. First, consider a graph-like diagram such as the following.

(Figure: state transition diagram; arrows are labeled with transition probabilities)

In this figure, assume that after a certain amount of time the state changes stochastically in the direction an arrow points, with the probability written on that arrow. If the current state is A, it transitions to state B with probability 0.3 and to state C with probability 0.7. Whether state A was reached from X or from Y, the transition probabilities from A to B or C do not change. In other words, the Markov property means that the transition probability to the next state (B or C) depends only on the current state (A). A stochastic process is described as "a collection of random variables that change with time" (Wikipedia). Here the probability of being in each state changes over time, so this is indeed a stochastic process. ~~I'm sorry, I'll leave the strict definitions to other references~~ A model built in this way is called a Markov model.
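
As a small side illustration (my own sketch, not part of the original article), the chain in the figure can be simulated in a few lines; the transition probabilities out of B and C are made-up assumptions, since the text only gives those out of A.

```python
import random

# Transition probabilities read off the figure; the rows for B and C
# are hypothetical values added just to make the chain runnable.
transitions = {
    'A': [('B', 0.3), ('C', 0.7)],
    'B': [('A', 1.0)],
    'C': [('A', 1.0)],
}

def step(state):
    """Choose the next state using only the current state (the Markov property)."""
    next_states, probs = zip(*transitions[state])
    return random.choices(next_states, weights=probs)[0]

state = 'A'
for _ in range(5):
    next_state = step(state)
    print(state, '->', next_state)
    state = next_state
```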

Hidden Markov Model

A hidden Markov model is like the Markov model above, except that the states themselves cannot be observed; instead, each state emits some value (an output) that we can observe. Let's build a concrete example from natural language processing. Consider the sentence (output) "I book a film" (I reserve a movie). Here "I" and "film" are nouns, "book" is a verb, and "a" is an article. These parts of speech cannot be read directly off the sentence, so they are treated as hidden. Suppose we want a machine to decide the part of speech of each word. "I" and "a" are determined almost uniquely, but "book" can also be a noun (a book) and "film" can also be used as a verb (to film something). So let's see whether the hidden Markov model can classify the parts of speech well.

In English, the order of parts of speech is fixed to some extent (the five basic sentence patterns, for example). So the ease of moving from one part of speech to another might look like the table below. (The values are set by my own intuition.)

| current \ next | Beginning of sentence | noun | verb | article | End of sentence |
|---|---|---|---|---|---|
| Beginning of sentence | 0 | 0.6 | 0.2 | 0.2 | 0 |
| noun | 0 | 0.3 | 0.5 | 0.1 | 0.1 |
| verb | 0 | 0.6 | 0 | 0.3 | 0.1 |
| article | 0 | 1 | 0 | 0 | 0 |
| End of sentence | 0 | 0 | 0 | 0 | 0 |

Read the left column as the current state and the top row as the next state. For example, after a noun, the probability that another noun comes is 0.3, that a verb comes is 0.5, that an article comes is 0.1, and that the end of the sentence comes is 0.1. ~~Using 0 for a probability is not great because of the zero-frequency problem~~ Next, assume that each state emits the words with the following probabilities.

(Figure: emission probabilities of each word for each part-of-speech state)

That is, in the noun state, "I" is emitted with probability 0.4, "a" with 0.0, "book" with 0.3, and "film" with 0.2. The possible state transitions then look like the figure below. (States with transition probability 0 are removed in advance.)

(Figure: trellis of possible state transitions for the example sentence)

Now let the number of words be n, the word sequence (observable) be $W = w_1, w_2, ..., w_n$, and the part-of-speech sequence (unobservable: the states) be $T = t_1, t_2, ..., t_n$. The probability the model assigns to one particular route is

$$
P(W) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
$$

In other words, it is obtained by multiplying, for each word, the transition probability from the previous state to the current state, $P(t_i \mid t_{i-1})$, by the probability of emitting that word in the current part-of-speech state, $P(w_i \mid t_i)$. In this model the initial state was the beginning of the sentence. Here it is obvious that the first state is the beginning of the sentence, but when adapting to other data, which state the sequence is likely to start from (the initial probability) also strongly affects the overall probability. So, writing $\pi_i$ for the initial probability, $a_{i,j}$ for the transition probability, and $b_j(w_t)$ for the output function (where i, j denote states and t is the position in the output), the probability when the state sequence is $I = i_1, i_2, ..., i_n$ becomes

$$
P(W) = \pi_{i_0} \prod_{t=1}^{n} a_{i_{t-1}, i_t}\, b_{i_t}(w_t)
$$

(In the example sentence used here, $w_0$ is blank, $w_1$ is 'I', ..., and $w_n$ is blank.) The parts of speech of the sentence can then be predicted by finding the transition route with the highest probability. In this example, admittedly, the probabilities are biased in a rather obvious way.
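
To make the formula concrete, here is a hedged sketch of my own (not the article's code) that evaluates $P(W)$ for candidate tag sequences of "I book a film" using the transition table above; the emission probabilities for the verb and article states are made-up placeholders, since the text only gives the noun row.

```python
import numpy as np

states = ['BOS', 'noun', 'verb', 'article', 'EOS']

# Transition probabilities a[i][j] = P(next state j | current state i), from the table above
a = np.array([
    [0.0, 0.6, 0.2, 0.2, 0.0],   # from beginning of sentence
    [0.0, 0.3, 0.5, 0.1, 0.1],   # from noun
    [0.0, 0.6, 0.0, 0.3, 0.1],   # from verb
    [0.0, 1.0, 0.0, 0.0, 0.0],   # from article
    [0.0, 0.0, 0.0, 0.0, 0.0],   # from end of sentence
])

# Emission probabilities b[state][word]; the noun row follows the text,
# the verb and article rows are hypothetical values for illustration only.
b = {
    'noun':    {'I': 0.4, 'a': 0.0, 'book': 0.3, 'film': 0.2},
    'verb':    {'I': 0.0, 'a': 0.0, 'book': 0.5, 'film': 0.3},
    'article': {'I': 0.0, 'a': 0.9, 'book': 0.0, 'film': 0.0},
}

def path_probability(words, tags):
    """P(W) = prod_t a[t_{t-1}, t_t] * b[t_t](w_t), starting from the beginning of the sentence."""
    prob = 1.0
    prev = states.index('BOS')
    for w, t in zip(words, tags):
        cur = states.index(t)
        prob *= a[prev, cur] * b[t][w]
        prev = cur
    return prob * a[prev, states.index('EOS')]  # transition to the end of the sentence

words = ['I', 'book', 'a', 'film']
print(path_probability(words, ['noun', 'verb', 'article', 'noun']))  # the intended tagging
print(path_probability(words, ['noun', 'noun', 'article', 'noun']))  # "book" tagged as a noun
```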

Aside: from here on, the initial probability is written $\pi$, the transition probability $a_{i,j}$, and the emission probability $b_j(w)$.

Continuous hidden Markov model

The hidden Markov model above dealt with four words, so its output takes one of four values; the output is discrete, and the model is a discrete hidden Markov model. A hidden Markov model whose output is a continuous value, such as a temperature, is called a continuous hidden Markov model. In the discrete model the emission probability of each output in each state is specified individually, but for continuous values it is not realistic to assign a probability to every possible value, so we define the emission probability with a (mixture of) Gaussian distribution(s). Before getting into the implementation of the hidden Markov model, let me first write about the Gaussian mixture distribution.

Gaussian / mixed Gaussian distribution

A Gaussian mixture distribution represents a probability distribution as a sum of several Gaussian distributions. The Gaussian distribution, also called the normal distribution, is given by

$$
N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
$$

(Figure: the normal distribution, image borrowed from Wikipedia)

As you can see from the graph, the closer x is to the mean $\mu$ (0 in this figure), the larger the value, and the larger the variance, the flatter the curve. If we read this value as a probability, then values of x close to the mean $\mu$ are more likely, and when the variance is large, values far from the mean can still occur.

A Gaussian mixture distribution adds several normal distributions together. If K Gaussian distributions are combined, the mixture is

$$
P(x \mid \pi, \mu, \sigma) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \sigma_k)
$$

with the constraint $\sum_{k=1}^{K} \pi_k = 1$.

Illustrated, it looks like this:

(Figure: a Gaussian mixture drawn as the weighted sum of several Gaussian components)
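
As a quick sketch of this formula (my own, with made-up parameters), a one-dimensional mixture of three Gaussians can be evaluated and plotted like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian(x, mu, sigma):
    """1-D Gaussian density N(x | mu, sigma), with sigma as the standard deviation."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Hypothetical mixture parameters; the weights pi must sum to 1
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-3.0, 0.0, 4.0])
sigma = np.array([1.0, 0.5, 1.5])

x = np.linspace(-8, 8, 400)
density = sum(p * gaussian(x, m, s) for p, m, s in zip(pi, mu, sigma))

plt.plot(x, density)
plt.show()
```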

Clustering with mixed Gaussian distribution

In the upcoming implementation of the continuous hidden Markov model, we will need to estimate Gaussian distributions from data points. So let's first get a feel for the procedure by implementing soft clustering with a Gaussian mixture.

First, let's plot points drawn from three Gaussian distributions.

(Figure: scatter plot of points sampled from three 2-D Gaussian distributions)

For these points we know which Gaussian distribution generated them (complete data). In real clustering, however, we do not know which distribution each point came from (incomplete data). So from here we adjust the parameters of a mixture of three Gaussians so that the points are divided into three groups as well as possible.

What does it mean to divide them into three groups well? It means that the joint probability of all points under the desired Gaussian mixture should be as high as possible. In other words, we want to maximize

$$
\ln p(X \mid \pi, \mu, \sigma) = \ln \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \sigma_k) = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \sigma_k)
$$

where $X = [x_1, x_2, ..., x_n]$ are the data points and $K$ is the number of models (the number of clusters).

We adjust the parameters so that this quantity, called the log-likelihood function, becomes as large as possible. We would like to maximize it directly, but a sum appears inside the logarithm, which makes it hard to solve analytically, so we use the EM algorithm instead.
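
For reference, here is a small sketch of my own (not the article's code) showing how this log-likelihood could be computed with numpy for one-dimensional data; all parameter values are hypothetical.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def log_likelihood(X, pi, mu, sigma):
    """sum_i log( sum_k pi_k N(x_i | mu_k, sigma_k) ) for 1-D data X."""
    comp = gaussian(X[:, None], mu[None, :], sigma[None, :])  # shape (n, K)
    return np.sum(np.log(comp @ pi))

# Toy data and hypothetical parameters, just to show the call
X = np.random.normal(0.0, 1.0, size=200)
pi = np.array([0.6, 0.4])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
print(log_likelihood(X, pi, mu, sigma))
```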

As a method, the EM algorithm repeats an E step (computing the expected values) and an M step (adjusting the parameters so as to maximize that expectation) to find a (locally) optimal solution.

The flow this time is:

  1. Choose initial values for $\pi, \mu, \sigma$
  2. Compute the responsibilities (E step)
  3. Update the parameters $\pi, \mu, \sigma$ (M step)
  4. If the likelihood no longer changes, finish; if it is still changing, go back to 2.

That is the procedure we will follow.

Initial values

The initial values are chosen more or less arbitrarily. (Of course, if you already have some idea of the true values, using them should make convergence faster.)

E step

Next is the E step. Here the term responsibility (burden rate) appears: the probability that point x was generated by model k. Looking at the plot above, we did not know which Gaussian each point came from, so we introduce a latent variable z indicating which model a point was born from. If $z_{i,k} = 1$, the point $x_i$ was generated by model k.

Written as a formula,

$$
p(z_{i,k}=1 \mid x_i)=\frac{p(z_{i,k}=1,\,x_i)}{p(x_i)}
$$

and also

$$
p(x_i \mid z_{i,k}=1)=\frac{p(z_{i,k}=1,\,x_i)}{p(z_{i,k}=1)},
$$

so by Bayes' theorem

$$
p(z_{i,k}=1 \mid x_i)=\frac{p(x_i \mid z_{i,k}=1)\,p(z_{i,k}=1)}{p(x_i)}.
$$

Since $p(x_i)$ is the value obtained by marginalizing over z, it is given by the Gaussian mixture formula, so

$$
p(z_{i,k}=1 \mid x_i)=\frac{p(x_i \mid z_{i,k}=1)\,p(z_{i,k}=1)}{\sum_{k=1}^{K}\pi_k N(x_i \mid \mu_k,\sigma_k)}=\gamma(z_{i,k})
$$

Here $p(x_i \mid z_{i,k}=1)$ is the probability that $x_i$ is generated by model k, i.e. the Gaussian of model k, $N(x_i \mid \mu_k, \sigma_k)$. And since $p(z_{i,k}=1)=\pi_k$,

$$
\gamma(z_{i,k})=\frac{\pi_k N(x_i \mid \mu_k,\sigma_k)}{\sum_{k=1}^{K}\pi_k N(x_i \mid \mu_k,\sigma_k)}
$$

(I have not fully convinced myself why $p(z_{i,k}=1)=\pi_k$, but it is a quantity independent of x, and if you think of it as measuring how likely a point is to belong to model k in the first place, it does seem reasonable.)

The reason this is called the responsibility is that it expresses how large a share component k takes in the mixture. Looking back at the mixture figure shown earlier, you can get an intuitive picture of it.
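
Written as code, the E step comes down to a few lines of numpy. This is my own hedged sketch for one-dimensional data with hypothetical parameters, not the implementation used later (which handles two-dimensional data):

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def e_step(X, pi, mu, sigma):
    """Responsibilities gamma[i, k] = pi_k N(x_i) / sum_k pi_k N(x_i)."""
    weighted = pi[None, :] * gaussian(X[:, None], mu[None, :], sigma[None, :])  # (n, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([-1.2, 0.1, 3.5, 4.0])
pi = np.array([0.5, 0.5])
mu = np.array([0.0, 4.0])
sigma = np.array([1.0, 1.0])
print(e_step(X, pi, mu, sigma))  # each row sums to 1
```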

M step

Next is the parameter update. For $\mu$ and $\sigma$, we want the likelihood function to be maximal, so the policy is to differentiate and look for the point where the derivative becomes 0. For $\pi$, which has the constraint $\sum_{k=1}^{K}\pi_k = 1$, we use the method of Lagrange multipliers.

The derivation of these derivatives is explained very clearly elsewhere (see "Thorough explanation of the EM algorithm" in the references), so I will only summarize the results. ~~I derived them by hand, and it was very hard~~

$$
\mu_k^* = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{i,k})\, x_i
$$

$$
\sigma_k^* = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(z_{i,k})\, (x_i - \mu_k)(x_i - \mu_k)^T
$$

$$
\pi_k^* = \frac{N_k}{N} = \frac{\sum_{i=1}^{N} \gamma(z_{i,k})}{N}
$$

These are the update formulas.
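
A matching sketch of the M step for one-dimensional data (again my own illustration, continuing the hypothetical `e_step` above; it returns standard deviations so that it pairs with the `gaussian` function used there):

```python
import numpy as np

def m_step(X, gamma):
    """Update pi, mu, sigma from the responsibilities gamma (shape (n, K)) for 1-D data X."""
    N_k = gamma.sum(axis=0)                                  # effective number of points per component
    pi = N_k / len(X)
    mu = (gamma * X[:, None]).sum(axis=0) / N_k
    var = (gamma * (X[:, None] - mu[None, :]) ** 2).sum(axis=0) / N_k
    return pi, mu, np.sqrt(var)                              # standard deviations, not variances
```

Iterating the E step and M step until the log-likelihood stops improving is exactly the loop used in the implementation below.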

Implementation of mixed Gaussian distribution

Now that the derivation is in order, here is an actual implementation of clustering with a Gaussian mixture.

ex_gmm.py


import numpy as np
import matplotlib.pyplot as plt



#define
data_num = 100 
model_num = 3
dim = 2
color = ['#ff0000','#00ff00','#0000ff']
epsilon = 1.0e-3
check = False # set True to plot the clusters at every iteration

#gaussian calculator (multivariate gaussian density)
def cal_gauss(x,mu,sigma):
	return np.exp(-np.dot((x-mu),np.dot(np.linalg.inv(sigma),(x-mu).T))/2)/(((2*np.pi)**(len(x)/2))*np.sqrt(abs(np.linalg.det(sigma))))



#init create param
mu =[]
sigma = []
for i in range(model_num):
	mu.append([np.random.rand()*10 ,np.random.rand()*10])
	temp = np.random.rand()-np.random.rand()
	sigma.append([[np.random.rand()+1,temp],[temp,np.random.rand()+1]])
	


#create data
values = []

for i in range(model_num):
	values.append(np.random.multivariate_normal(mu[i],sigma[i],data_num))
for i in range(model_num):
	plt.scatter(values[i][:,0],values[i][:,1],c=color[i],marker='+')

plt.show()



#create training data
train_values =values[0]
for i in range(model_num-1):
	train_values = np.concatenate([train_values,values[i+1]])
plt.scatter(train_values[:,0],train_values[:,1],marker='+')




#init model param
model_mu =[]
model_sigma =[]
 
for i in range(model_num):
	model_mu.append([np.random.rand()*10 ,np.random.rand()*10])
	temp = np.random.rand()-np.random.rand()
	model_sigma.append([[np.random.rand()+1,temp],[temp,np.random.rand()+1]])
	
model_mu = np.array(model_mu)
model_sigma = np.array(model_sigma)
for i in range(model_num):
	plt.scatter(model_mu[i,0],model_mu[i,1],c=color[i],marker='o') #plot pre train mu
temp = []
for i in range(model_num):
	temp.append(np.random.rand())
model_pi = np.zeros_like(temp)
for i in range(model_num):
	model_pi[i] = temp[i]/np.sum(temp) 

plt.show()



#training
old_L = 1
new_L = 0
gamma = np.zeros((data_num*model_num,model_num))
while(epsilon<abs(new_L-old_L)):

	#gamma calculation and old log-likelihood
	old_L = 0
	for i in range(data_num*model_num):
		result_gauss = []
		for j in range(model_num):
			result_gauss.append(cal_gauss(train_values[i,:],model_mu[j],model_sigma[j]))
		result_gauss = np.array(result_gauss)
		old_L += np.log(np.sum(model_pi*result_gauss)) #accumulate the log-likelihood over all points
		for j in range(model_num):
			gamma[i,j] = (model_pi[j]*result_gauss[j])/np.sum(model_pi*result_gauss)

	N_k =np.sum(gamma,axis=0)
	
	#calculation sigma
	for i in range(model_num):
		temp = np.zeros((dim,dim))
		for j in range(data_num*model_num):	
			temp+=gamma[j,i]*np.dot((train_values[j,:]-model_mu[i]).reshape(-1,1),(train_values[j,:]-model_mu[i]).reshape(1,-1))
	
		model_sigma[i] = temp/N_k[i]


	#calculation mu
	for i in range(model_num):
		model_mu[i] = np.sum(gamma[:,i].reshape(-1,1)*train_values,axis=0)/N_k[i]


	#calculation pi
	for i in range(model_num):
		model_pi[i]=N_k[i]/(data_num*model_num)

	#check plot
	if(check == True ):
		gamma_crasta = np.argmax(gamma,axis=1)	
		for i in range(data_num*model_num):
			color_temp = color[gamma_crasta[i]]
			plt.scatter(train_values[i,0],train_values[i,1],c=color_temp,marker='+')
		
		for i in range(model_num):
			plt.scatter(model_mu[i,0],model_mu[i,1],c=color[i],marker='o')

		plt.show()
	##new_L calculation (log-likelihood after the update)
	new_L = 0
	for i in range(data_num*model_num):
		result_gauss = []
		for j in range(model_num):
			result_gauss.append(cal_gauss(train_values[i,:],model_mu[j],model_sigma[j]))
		result_gauss = np.array(result_gauss)
		new_L += np.log(np.sum(model_pi*result_gauss))
	
gamma_crasta = np.argmax(gamma,axis=1)	
for i in range(data_num*model_num):
	color_temp = color[gamma_crasta[i]]
	plt.scatter(train_values[i,0],train_values[i,1],c=color_temp,marker='+')

for i in range(model_num):
	plt.scatter(model_mu[i,0],model_mu[i,1],c=color[i],marker='o')

plt.show()


It should work something like this. In this program all the parameters are initialized randomly, so the clustering may occasionally fail. ~~Actually, when unlucky data comes in, it cannot cluster at all...~~

Implementation of continuous hidden Markov model

It has been a long detour, but from here let's look at the implementation of the continuous hidden Markov model. When an observation sequence y = {$y_1, y_2, ..., y_T$} is obtained, the probability $p(y \mid M)$ that this model produces it can be computed with the trellis (forward) algorithm:

$$
p(y \mid M) = \sum_{i} \alpha(i, T)
$$

(where the sum runs over the final states i).

Here $\alpha(i,t)$ is defined by

$$
\alpha(i,t)=\left\{ \begin{array}{ll}
\pi_i\, b_i(y_0) & (t=0) \\
\sum_j \alpha(j,t-1)\, a_{j,i}\, b_i(y_t) & (t=1,\dots,T) \\
\end{array} \right.
$$

Here $a_{j,i}$ and $b_i(y_t)$ are, as before, the transition probability and the emission probability that also appeared in the hidden Markov model. (The learning story earlier had nothing to do with this part...) $\alpha(i,t)$ accumulates the probability from the front (the forward function); we also define a function that accumulates it from the back (the backward function):

$$
\beta(i,T)=\left\{ \begin{array}{ll}
1 & (\text{i is a final state}) \\
0 & (\text{otherwise}) \\
\end{array} \right.
$$

$$
\beta(i,t)=\sum_{j} a_{i,j}\, b_j(y_{t+1})\, \beta(j,t+1) \qquad (t=T-1,\,T-2,\,\dots,\,0)
$$
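
To make the recursions concrete, here is a minimal sketch of my own of the forward and backward passes for an HMM with a single Gaussian output per state; the two-state parameters are made up, and this is not the article's implementation (that follows later).

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def forward(y, pi, a, mu, sigma):
    """alpha[i, t]: probability of y_0..y_t with the chain ending in state i."""
    S, T = len(pi), len(y)
    alpha = np.zeros((S, T))
    alpha[:, 0] = pi * gaussian(y[0], mu, sigma)
    for t in range(1, T):
        alpha[:, t] = (alpha[:, t - 1] @ a) * gaussian(y[t], mu, sigma)
    return alpha

def backward(y, a, mu, sigma):
    """beta[i, t]: probability of y_{t+1}..y_{T-1} given the chain is in state i at time t."""
    S, T = a.shape[0], len(y)
    beta = np.zeros((S, T))
    beta[:, -1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[:, t] = a @ (gaussian(y[t + 1], mu, sigma) * beta[:, t + 1])
    return beta

# Hypothetical 2-state model, one Gaussian per state
pi = np.array([0.6, 0.4])
a = np.array([[0.7, 0.3],
              [0.2, 0.8]])
mu = np.array([0.0, 5.0])
sigma = np.array([1.0, 1.0])
y = np.array([0.1, -0.3, 4.8, 5.2])

alpha = forward(y, pi, a, mu, sigma)
beta = backward(y, a, mu, sigma)
print(alpha[:, -1].sum())                                   # p(y | M) from the forward pass
print((pi * gaussian(y[0], mu, sigma) * beta[:, 0]).sum())  # the same value from the backward pass
```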

The model parameters of the hidden Markov model can be trained with the Baum-Welch algorithm, a kind of EM algorithm. In its E step, the Baum-Welch algorithm computes

$$
\gamma(i,j,t) = \frac{\alpha(i,t-1)\, a_{i,j}\, b_j(y_t)\, \beta(j,t)}{p(y \mid M)}
$$

This value can be read as the probability of transitioning from state i to state j at time t while emitting $y_t$, divided by the probability of y under the current model.

The parameter estimation (update) formulas are then

$$
\pi_i^* = \frac{\sum_j \gamma(i,j,1)}{\sum_i \sum_j \gamma(i,j,1)}
$$

$$
a_{i,j}^* = \frac{\sum_t \gamma(i,j,t)}{\sum_t \sum_j \gamma(i,j,t)}
$$

and, because b is a Gaussian distribution,

$$
\mu_{i,j}^* = \frac{\sum_{t=1}^{T} \gamma(i,j,t)\, y_t}{\sum_{t=1}^{T} \gamma(i,j,t)}
$$

$$
\sigma_{i,j}^* = \frac{\sum_{t=1}^{T} \gamma(i,j,t)\, (y_t-\mu_j)(y_t-\mu_j)^T}{\sum_{t=1}^{T} \gamma(i,j,t)}
$$
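
Continuing the earlier sketch (it reuses the hypothetical `gaussian`, `forward`, and `backward` functions defined there), one Baum-Welch update could be written roughly as follows; the article's actual implementation, built around the GMM output class, follows below.

```python
import numpy as np

def baum_welch_step(y, pi, a, mu, sigma):
    """One Baum-Welch (EM) update for an HMM with a single Gaussian per state.
    Assumes the gaussian(), forward() and backward() functions from the earlier sketch."""
    S, T = len(pi), len(y)
    alpha = forward(y, pi, a, mu, sigma)
    beta = backward(y, a, mu, sigma)
    p_y = alpha[:, -1].sum()

    # xi[i, j, t]: probability of moving from state i to state j between t and t+1
    xi = np.zeros((S, S, T - 1))
    for t in range(T - 1):
        xi[:, :, t] = alpha[:, t, None] * a * (gaussian(y[t + 1], mu, sigma) * beta[:, t + 1])[None, :] / p_y
    occ = alpha * beta / p_y             # state occupancy probability at each time step

    new_pi = occ[:, 0]
    new_a = xi.sum(axis=2) / occ[:, :-1].sum(axis=1)[:, None]
    new_mu = (occ * y).sum(axis=1) / occ.sum(axis=1)
    new_sigma = np.sqrt((occ * (y - new_mu[:, None]) ** 2).sum(axis=1) / occ.sum(axis=1))
    return new_pi, new_a, new_mu, new_sigma
```

Repeating this step until $p(y \mid M)$ stops improving is the training loop that the code below implements.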

Now for the code. Because the Gaussian-distribution part is updated step by step, the implementation is split into three files: the Gaussian (output) distribution file, the hidden Markov model file, and the execution script. (I originally planned to build this as a GMM-HMM, so there are several parts that are unnecessary here. Please bear with me; I will update the article once the GMM-HMM works.)

class_b.py


import numpy as np
import matplotlib.pyplot as plt
import math
import sys

#construct
#c	=	mixing coefficient	
#mu	=	mean
#sigma	=	dispersion
#m	=	model num
#size	=	datasize
#x	=	data

##method
#gaussian(x)	:	calculation gaussian:x=data
#update(x)	:	update gmm :x=data
#output(x)	:	gmm output:x=data (size=1)
class function_b:
	upsilon = 1.0e-50
	def __init__(self,c,mu,sigma,m,size,x):

		self.c = np.array(c)
		self.c = self.c.reshape((-1,1))

		self.mu = np.array(mu)
		self.mu = self.mu.reshape((-1,1))

		self.sigma = np.array(sigma)
		self.sigma = self.sigma.reshape((-1,1))

		self.size = size
		self.model_num = m
		self.x = np.array(x).reshape(1,-1)
		self.ganma = np.zeros((self.model_num,self.size))
		self.sum_gaus = 0	
		self.N_k=np.zeros(self.model_num)
		

	def gaussian(self,x):
		return np.exp(-((x-self.mu)**2)/(2*self.sigma))/np.sqrt(2*math.pi*self.sigma)
		
		
	def update(self,pre_ganma):
		#x=[x1~xt]
		self.ganma = np.zeros((self.model_num,self.size))
	
		#calculate ganma
		for m in range(self.model_num):
			self.ganma[m] = pre_ganma*((self.c[m]*self.gaussian(self.x)[m]+function_b.upsilon)/np.sum(self.c*self.gaussian(self.x)+function_b.upsilon,axis=0))+function_b.upsilon
		
		for m in range(self.model_num):
			self.N_k[m] = np.sum(self.ganma[m])
		#calculate c
		for m in range(self.model_num):
			self.c[m]=self.N_k[m]/np.sum(self.N_k)
		#calculate sigma
		for m in range(self.model_num):
			self.sigma[m]=np.sum(self.ganma[m]*((self.x-self.mu[m])**2))/self.N_k[m]

		#calculate mu
		for m in range(self.model_num):
			self.mu[m]=np.sum(self.ganma[m]*self.x)/self.N_k[m]
		
	def output(self,x):
		return np.sum(self.c.reshape((1,-1))*self.gaussian(x).reshape((1,-1)))	
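
As a hedged usage sketch of my own (not part of the article), the class above can be exercised on its own like this; the parameter values are made up, and in the real model the class is driven by class_hmm_gmm.py below.

```python
import numpy as np
from class_b import function_b

# One state whose output is a 2-component GMM (hypothetical parameters)
x = np.random.normal(0.0, 1.0, 50).tolist()
b = function_b(c=[0.5, 0.5], mu=[-1.0, 1.0], sigma=[1.0, 1.0], m=2, size=len(x), x=x)

pre_ganma = np.ones(len(x))   # weight of this state for every sample (all 1 here)
b.update(pre_ganma)           # one EM-style update of c, mu and sigma
print(b.output(0.0))          # GMM density at a single point
```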


class_hmm_gmm.py


import numpy as np
import sys
import matplotlib.pyplot as plt
from class_b import function_b 
import copy
#construct
#s	=	state num
#x	=	data
#pi	=	init prob
#a 	=	transition prob
#c	=	gmm mixing coefficient
#mu	=	gmm mean
#sigma	=	gmm dispersion
#m	=	gmm model num

#method
#cal_alpha()	:	calculation alpha
#cal_beta()	:	calculation beta
#cal_exp()	:	calculation expected value(?)
#update()	:	update prams
#makestatelist	:	make statelist 
#viterbi	:	viterbi algorithm


class Hmm_Gmm():
	upsilon=1.0e-300
	def __init__(self,s,x,pi,a,c,mu,sigma,m):
		self.s = s
		self.x = x
		self.a = np.array(a)
		self.size=len(x)
		self.b=[]	
	
		for i in range(s):	
			self.b.append(function_b(c[i],mu[i],sigma[i],m,self.size,x))
		self.pi = np.array(pi)	
	
		self.alpha = np.zeros((self.s,self.size))
		self.beta = np.zeros((self.s,self.size))		
		self.prob = None
		self.kusai = np.zeros((self.s,self.s,self.size-1))
		self.ganma = np.zeros((self.s,self.size))
		self.preganma = None
		self.delta = None
		self.phi = None
		self.F = None
		self.P = None
		self.max_statelist=None

	def cal_alpha(self):
		
	
		for t in range(self.size):
			for j in range(self.s):
				if t==0:
					self.alpha[j,t]=self.pi[j]*self.b[j].output(self.x[t])
				else:
					self.alpha[j,t]=0
					for i in range(self.s): 
						self.alpha[j,t]+=self.alpha[i,t-1]*self.a[i,j]*self.b[j].output(self.x[t])
		self.prob =0
		for i in range(self.s):
			self.prob+=self.alpha[i,self.size-1]+Hmm_Gmm.upsilon
		
 
	def cal_beta(self):
		for t in reversed(range(self.size)):
			for i in range(self.s):
				
				if t ==self.size-1:
					self.beta[i,t]=1	
				else:
					self.beta[i,t]=0	
					for j in range(self.s):
						self.beta[i,t]+=self.beta[j,t+1]*self.a[i,j]*self.b[j].output(self.x[t+1])
	
	def cal_exp(self):
		for t in range(self.size-1):
			for i in range(self.s):
				for j in range(self.s):
								
					self.kusai[i,j,t] = ((self.alpha[i,t]*self.a[i,j]*self.b[j].output(self.x[t+1])*self.beta[j,t+1])+(Hmm_Gmm.upsilon/(self.s)))/self.prob
			
		for t in range(self.size):
			for i in range(self.s):
				#self.ganma[i,t] = (self.alpha[i,t]*self.beta[i,t])/self.prob
				#self.ganma[i,t] = (self.alpha[i,t]*self.beta[i,t])/self.prob + Hmm_Gmm.uplilon
				self.ganma[i,t] = ((self.alpha[i,t]*self.beta[i,t])+(Hmm_Gmm.upsilon))/self.prob
	def update(self):
		#cal exp

		self.cal_alpha()
		self.cal_beta()
		self.cal_exp() 
		#cal pi
		self.pi = self.ganma[:,0]
			
		#cal a
		for i in range(self.s):
			for j in range(self.s):
				
				#self.a[i,j]=np.sum(self.ganma[i,:])

				self.a[i,j]=np.sum(self.kusai[i,j,:])/np.sum(self.ganma[i,:])
				#self.a[i,j]=np.sum(self.kusai[i,j,:])

	
		#cal b
		#cal preganma
		self.preganma = np.zeros((self.s,self.size))	
		for t in range(self.size):	
			temp = self.alpha[:,t]*self.beta[:,t]+self.upsilon
			self.preganma[:,t] = temp/np.sum(temp)
		for i in range(self.s):
			self.b[i].update(np.array(self.preganma[i]))	
	
	def makestatelist(self,f,size):
		#trace back the most likely state sequence using the phi table
		statelist = np.zeros(size,dtype=int)
		for t in reversed(range(size)):
			if t == size-1:
				statelist[t] = int(f)
			else:	
				statelist[t] = int(self.phi[statelist[t+1],t+1])
		return statelist

	def viterbi(self,x):
		data=np.array(x)
		self.delta = np.zeros((self.s,data.shape[0]))
		self.phi = np.zeros((self.s,data.shape[0]))
		
		for t in range(data.shape[0]):
			for i in range(self.s):	
				if t == 0:
					self.phi[i,t] = np.nan
					self.delta[i,t] = np.log(self.pi[i]*self.b[i].output(data[t])+self.upsilon)
				else: 
					self.phi[i,t] = np.argmax(self.delta[:,t-1]+np.log(self.a[:,i]))					
					self.delta[i,t] = np.max(self.delta[:,t-1]+np.log(self.a[:,i]*self.b[i].output(data[t])+self.upsilon))	
		self.F = np.argmax(self.delta[:,data.shape[0]-1])
		self.P = np.max(self.delta[:,data.shape[0]-1])	
		#print(self.makestatelist(self.F,data.shape[0]))
		self.max_statelist=self.makestatelist(self.F,data.shape[0])
		

ex_hmm_demo.py



import numpy as np
import matplotlib.pyplot as plt
import sys 
from class_hmm_gmm import Hmm_Gmm 

model_num = 1
state_num = 4
color=['#ff0000','#00ff00','#0000ff','#888888']
value_size = int(10*model_num*state_num)


create_mu = []
create_sigma= []
create_pi =[]
for i in range(state_num):
	temp = []
	temp1 = []
	temp2=[]
	for j in range(model_num):
		temp.extend([(np.random.rand()*30-np.random.rand()*30)])
		temp1.extend([np.random.rand()+1])
		temp2.extend([np.random.rand()])
		
	create_mu.append(temp)
	create_sigma.append(temp1)	
	create_pi.append((temp2/np.sum(temp2)).tolist())

#create data
values = []
for i in range(state_num):
	temp = []
	for it in range(int(value_size/state_num/model_num)):
		for j in range(model_num):
			temp.extend(np.random.normal(create_mu[i][j],create_sigma[i][j],1).tolist())
	values.append(temp)

x_plot = np.linspace(0,value_size,value_size)
for i in range(state_num):
	plt.scatter(x_plot[i*len(values[0]):i*len(values[0])+len(values[0])],values[i],c=color[i],marker='+')

for i in range(state_num):
	for j in range(model_num):
		plt.plot(x_plot[i*len(values[0]):i*len(values[0])+len(values[0])],np.full(len(values[0]),create_mu[i][j]),c=color[i])
	

plt.show()


train_values=[]
for i in range(state_num):
	train_values.extend(values[i])

train_mu = []
train_sigma= []
train_pi =[]
for i in range(state_num):
	temp = []
	temp1 = []
	temp2=[]
	for j in range(model_num):
		temp.extend([np.random.rand()*30-np.random.rand()*30])
		temp1.extend([np.random.rand()+1])
		temp2.extend([np.random.rand()])
		
	train_mu.append(temp)
	train_sigma.append(temp1)	
	train_pi.append((temp2/np.sum(temp2)).tolist())


##backup train
b_train_mu = train_mu
b_train_sigma = train_sigma

plt.scatter(x_plot,train_values,c="#000000",marker='+')

for i in range(state_num):
	for j in range(model_num):
		plt.plot(x_plot[i*len(values[0]):i*len(values[0])+len(values[0])],np.full(len(values[0]),train_mu[i][j]),c=color[i])
	
plt.show()
#init a
a = []

for j in range(state_num):
	temp = []
	for i in range(state_num):	
		temp.append(np.random.rand())	
		
	a.append(temp)

#normalize each row of a so that the outgoing transition probabilities sum to 1
a = (np.array(a)/np.sum(np.array(a),axis=1,keepdims=True)).tolist()

#init init_prob (one entry per state, normalized to sum to 1)
temp2 = []
for j in range(state_num):
	temp2.append(np.random.rand())
init_prob = (np.array(temp2)/np.sum(temp2)).tolist()
#init Hmm_Gmm model
hmm_gmm = Hmm_Gmm(state_num,train_values,init_prob,a,train_pi,train_mu,train_sigma,model_num)

#train roop
print(create_mu)
print(a)
count = 0
while(count != 100):	
	hmm_gmm.update()	
	count +=1
	
#viterbi
hmm_gmm.viterbi(train_values)
print(hmm_gmm.max_statelist)
#*************************************
plt.scatter(x_plot,train_values,c="#000000",marker='+')

##plot of init train mu 	

for i in range(state_num):
	for j in range(model_num):
		plt.plot(x_plot[i*len(values[0]):i*len(values[0])+len(values[0])],np.full(len(values[0]),b_train_mu[i][j]),c="#dddddd")

for i in range(state_num):
	for j in range(model_num):
		plt.plot(x_plot[i*len(values[0]):i*len(values[0])+len(values[0])],np.full(len(values[0]),hmm_gmm.b[i].mu[j]),c=color[i])
	
plt.show()


Results:

(Figure: points generated from the true $\mu$ values and their Gaussian distributions) This time four states are assumed, and points are generated from each of the four Gaussian distributions. In order from the start of the sequence, the states are state 1 (red), state 2 (green), state 3 (blue), and state 4 (gray).

(Figure: the same data with the model means $\mu$ re-initialized to random values)

The mean values $\mu$ were reset with random numbers, and training starts from there. Below is the result after 100 updates.

(Figure: the learned means after 100 updates)

That is the learning result. From the start of the sequence the states come out in the order state 2, state 1, state 4, state 3, but overall I feel it is working reasonably well.

Finally

This time I tried to summarize the hidden Markov model in my own way, but the volume turned out larger than expected, and there were many places where I had forgotten my own reasoning even though I implemented this only a few months ago, so I really felt the importance of writing things up early. Partly because of that, I could not explain the Viterbi algorithm or the state-list function; I will summarize them at a later date. ~~(My limited writing ability is also to blame...)~~ Also, the Gaussian distributions sometimes do not fit properly in this program, so I would like to look into whether that is due to a bug in my program or to falling into a local optimum. I also tried to implement a GMM-HMM (changing model_num turns this into a GMM), but it did not work well, so I want to review where it went wrong (or whether it, too, fell into a local optimum). I am sorry it is not properly finished. I also found a paper that seems to describe a way to deal with getting stuck in local optima, so I will study that as well.

References

- https://mieruca-ai.com/ai/markov_model_hmm/ — [Technical explanation] Markov model and hidden Markov model
- https://qiita.com/kenmatsu4/items/59ea3e5dfa3d4c161efb — Thorough explanation of the EM algorithm
