Implementation of Interactive Topic Model and its results
In natural language processing, a topic model is a method for extracting the content of a collection of documents.
Among topic models, the Interactive Topic Model is a method that lets you intentionally manipulate which words appear in a topic.
In this article, we implement the Interactive Topic Model and verify its effect.
A topic model estimates, from a collection of documents, the topic distribution $\theta$ (the probability that each topic appears in a document; a newspaper article, for example, may contain topics such as politics or sports) and the word distribution $\phi$ (how likely each word is to appear within a given topic).
For an accessible explanation of topic models, see http://qiita.com/GushiSnow/items/8156d440540b0a11dfe6 and http://statmodeling.hatenablog.com/entry/topic-model-4.
Latent Dirichlet Allocation (LDA)
Of the various topic models, Latent Dirichlet Allocation (LDA) is the most famous.
LDA assumes that a single document (e.g. a newspaper article) contains multiple topics (politics, news, etc.), and that each topic has its own word distribution.
The graphical model is as shown in the figure below.
Here, $\theta$ is the topic distribution, $\phi$ is the word distribution, $z$ is the topic assigned to each word in a document, $v$ is a word in a document, $N$ is the number of words in a document, $D$ is the number of documents, $K$ is the number of topics, and $\alpha$ and $\beta$ are hyperparameters.
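For reference, the generative process that this graphical model encodes is the standard formulation of LDA; written with the same notation, each word is generated as follows.

\phi_k \sim \mathrm{Dirichlet}(\beta) \;\;\;\;(k = 1, \dots, K)\\
\theta_d \sim \mathrm{Dirichlet}(\alpha) \;\;\;\;(d = 1, \dots, D)\\
z_{dn} \sim \mathrm{Categorical}(\theta_d) \;\;\;\;(n = 1, \dots, N_d)\\
w_{dn} \sim \mathrm{Categorical}(\phi_{z_{dn}})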
LDA can be estimated with Gibbs sampling or variational Bayes; collapsed Gibbs sampling (CGS) is probably the best-known approach.
The pseudocode for estimating LDA with collapsed Gibbs sampling is as follows.
N_dk = 0   # number of words in document d assigned to topic k
N_kv = 0   # number of times word v is assigned to topic k
N_k = 0    # total number of words assigned to topic k
d = 1, …, D   # document index
k = 1, …, K   # topic index
v = 1, …, V   # vocabulary index
initialize(z_dn)   # randomly initialize the topic of the n-th word in document d
increment N_dk, N_kv, N_k according to the initial assignments
repeat
    for d = 1, …, D do
        for n = 1, …, N_d do   # N_d is the number of words in document d
            N_dk[d][z_dn] -= 1   # remove the current assignment from the counts
            N_kv[z_dn][w_dn] -= 1
            N_k[z_dn] -= 1
            for k = 1, …, K do
                compute p(z_dn = k)   # probability that topic k is assigned to the n-th word of document d
            endfor
            z_dn ~ Categorical(p(z_dn))   # sample a new topic for z_dn
            N_dk[d][z_dn] += 1   # add the new assignment back to the counts
            N_kv[z_dn][w_dn] += 1
            N_k[z_dn] += 1
        endfor
    endfor
until the stopping condition is met
The probability $p(z_{dn} = k)$ in the pseudocode is computed as follows.

p(z_{dn}=k) \propto (N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V}
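As a concrete illustration, here is a minimal sketch of one collapsed Gibbs sampling sweep in Python with NumPy. The array names mirror the counts in the pseudocode; the function name and the document representation (lists of vocabulary ids) are assumptions made for this sketch, not taken from any existing library or from the implementation linked below.

```python
import numpy as np

def gibbs_sweep(docs, z, N_dk, N_kv, N_k, alpha, beta, rng):
    """One collapsed Gibbs sampling sweep over every word of every document.

    docs : list of documents, each a list of vocabulary ids
    z    : current topic assignments, same shape as docs
    """
    K = N_k.shape[0]
    V = N_kv.shape[1]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[d][n]
            # remove the current assignment from the counts
            N_dk[d, k_old] -= 1
            N_kv[k_old, w] -= 1
            N_k[k_old] -= 1
            # p(z_dn = k) ∝ (N_dk + alpha) * (N_kw + beta) / (N_k + beta * V)
            p = (N_dk[d] + alpha) * (N_kv[:, w] + beta) / (N_k + beta * V)
            p /= p.sum()
            k_new = int(rng.choice(K, p=p))
            # add the new assignment back to the counts
            N_dk[d, k_new] += 1
            N_kv[k_new, w] += 1
            N_k[k_new] += 1
            z[d][n] = k_new
```

Repeating this sweep corresponds to the `repeat ... until` loop in the pseudocode.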
Interactive Topic Model (ITM)
When computing topics with LDA, you sometimes want two particular words to end up in the same topic.
You can address this by adding a constraint that word A and word B should be generated from the same topic.
That is the Interactive Topic Model (ITM): http://dl.acm.org/citation.cfm?id=2002505
Intuitively, ITM treats the constrained words as a single word and distributes their occurrence probability evenly among them, which makes the constrained words more likely to appear in the same topic.
The calculation is simple: the formula for $p(z_{dn} = k)$ in LDA is rewritten as follows.
p(z_{dn} = k) \propto \begin{cases}
(N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V} \;\;\;\;(w_{dn} \notin \Omega)\\
(N_{dk}+\alpha)\frac{N_{k\Omega}+|\Omega|\beta}{N_k+\beta V}\frac{W_{k\Omega w_{dn}}+\eta}{W_{k \Omega} + |\Omega|\eta} \;\;\;\;(w_{dn} \in \Omega)
\end{cases}
Here, $\Omega$ is a constraint, i.e. a set of words that should share a topic, $N_{k\Omega}$ is the number of words in topic $k$ that belong to $\Omega$, $W_{k\Omega w_{dn}}$ is the number of times the word $w_{dn}$ is assigned to topic $k$ within the constraint, $W_{k\Omega}$ is the total number of constrained words assigned to topic $k$, and $\eta$ is a hyperparameter.
In other words, if the word $w_{dn}$ is not included in the constraint $\Omega$, the same formula as LDA is used; if it is included, $p(z_{dn} = k)$ is computed with the second case.
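To make the case split concrete, here is a minimal sketch of the modified sampling probability. It is based on my reading of the formula above rather than on the linked implementation; in particular, the within-constraint counts $W_{k\Omega w}$ and $W_{k\Omega}$ are taken to be the ordinary word-topic counts restricted to $\Omega$, which is a simplifying assumption.

```python
import numpy as np

def itm_topic_probs(d, w, N_dk, N_kv, N_k, alpha, beta, eta, constraint):
    """Normalized p(z_dn = k) for word w in document d under one constraint set Omega.

    constraint : set of vocabulary ids that should share a topic (Omega)
    """
    V = N_kv.shape[1]
    if w not in constraint:
        # same formula as plain LDA
        p = (N_dk[d] + alpha) * (N_kv[:, w] + beta) / (N_k + beta * V)
    else:
        omega = np.fromiter(constraint, dtype=int)
        # words of Omega assigned to each topic (used for both N_kOmega and W_kOmega here)
        N_k_omega = N_kv[:, omega].sum(axis=1)
        # treat Omega as one "word", then share its mass among the constrained words
        p = ((N_dk[d] + alpha)
             * (N_k_omega + len(omega) * beta) / (N_k + beta * V)
             * (N_kv[:, w] + eta) / (N_k_omega + len(omega) * eta))
    return p / p.sum()
```

Note that when $\eta$ is large (as in the experiment below, where $\eta = 100$), the last factor approaches $1/|\Omega|$, which matches the intuition above of distributing the probability evenly among the constrained words.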
In the experiment, we verify the effect of ITM.
The livedoor corpus was used as the data set: http://bookmarks2022.blogspot.jp/2015/06/livedoor.html
The ITM code is here: https://github.com/kenchin110100/machine_learning/blob/master/sampleITM.py
The number of topics was fixed at $K = 10$, with $\alpha = 0.1$, $\beta = 0.01$, and $\eta = 100$.
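For reference, the overall flow of such a run could look like the sketch below, which reuses the hypothetical `gibbs_sweep` and `itm_topic_probs` functions from above; the toy documents and the constrained word ids are placeholders, not the livedoor data or the actual constraint used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha, beta, eta = 10, 0.1, 0.01, 100.0

# each document as a list of vocabulary ids (the real corpus is preprocessed elsewhere)
docs = [[0, 1, 2, 3], [2, 3, 4, 1]]          # toy placeholder documents
V = max(w for doc in docs for w in doc) + 1

# random initial topic assignments and the matching counts
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
N_dk = np.zeros((len(docs), K))
N_kv = np.zeros((K, V))
N_k = np.zeros(K)
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        N_dk[d, k] += 1
        N_kv[k, w] += 1
        N_k[k] += 1

# phase 1: 50 iterations without constraints (plain LDA)
for _ in range(50):
    gibbs_sweep(docs, z, N_dk, N_kv, N_k, alpha, beta, rng)

# phase 2: 50 more iterations with a constraint Omega (hypothetical word ids)
constraint = {1, 3}
for _ in range(50):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[d][n]
            N_dk[d, k_old] -= 1; N_kv[k_old, w] -= 1; N_k[k_old] -= 1
            p = itm_topic_probs(d, w, N_dk, N_kv, N_k, alpha, beta, eta, constraint)
            k_new = int(rng.choice(K, p=p))
            N_dk[d, k_new] += 1; N_kv[k_new, w] += 1; N_k[k_new] += 1
            z[d][n] = k_new
```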
First, we ran 50 iterations without any constraints (i.e. the same as ordinary LDA).
The table below shows the top words that appear in each topic.
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
function | App | Female | the work | Japan |
Release | powered by | golf | movies | update |
update | screen | myself | 153 | Relation |
article | smartphone | male | 181 | world |
Use | Presentation | marriage | directed by | Popular |
Digi | Correspondence | Many | Release | movies |
Relation | Max Co., Ltd. | Opponent | 3 | http:// |
smartphone | Android | jobs | 96 | myself |
software | display | Christmas | 13 | topic |
user | year 2012 | Girls | Book | Wow |
Even with this alone, you can see that some topics have formed.
Next, we constrained the words shown in blue in the table to be in the same topic and ran another 50 iterations.
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
Release | App | Female | 153 | movies |
Use | powered by | myself | 181 | Japan |
function | smartphone | golf | 3 | the work |
update | Presentation | male | 96 | Release |
article | Correspondence | marriage | 13 | world |
service | Ma | Many | 552 | directed by |
Relation | Max Co., Ltd. | Opponent | 144 | Relation |
Digi | Android | jobs | 310 | Special feature |
software | display | Christmas | 98 | http:// |
information | year 2012 | Good | Hero | Wow |
The constrained words (shown in blue) now appear together in Topic 5.
With the Interactive Topic Model (ITM), we estimated topic distributions while constraining certain words to appear in the same topic.
At first glance the results shown here look good, but in reality they are the product of a lot of trial and error...
There is also a journal version of the ITM paper, so implementing its contents may give better accuracy.