Implementation of Interactive Topic Model and its results
In natural language processing, a topic model is a method for extracting the content of a collection of documents.
Among topic models, the Interactive Topic Model is a method that lets you intentionally manipulate which words appear in a topic.
In this article, we implement the Interactive Topic Model and verify its effect.
A topic model estimates, from a collection of documents, the topic distribution $\theta$ (the probability that each topic appears in a document; a newspaper article, for example, may contain topics such as politics or sports) and the word distribution $\phi$ (how likely each word is to appear within a given topic).
For an accessible explanation of topic models, see http://qiita.com/GushiSnow/items/8156d440540b0a11dfe6 and http://statmodeling.hatenablog.com/entry/topic-model-4.
Latent Dirichlet Allocation (LDA)
Of the various topic models, Latent Dirichlet Allocation (LDA) is the most famous.
LDA assumes that a single document (e.g. a newspaper article) contains multiple topics (politics, news, etc.), and that each topic has its own word distribution.
The graphical model is as shown in the figure below.
Here, $\theta$ is the topic distribution, $\phi$ is the word distribution, $z$ is the topic assigned to each word in a document, $v$ is a word in a document, $N$ is the number of words in a document, $D$ is the number of documents, $K$ is the number of topics, and $\alpha$ and $\beta$ are hyperparameters.
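For reference, the generative process that this graphical model encodes is the standard formulation of LDA; written with the same notation, each word is generated as follows.

\phi_k \sim \mathrm{Dirichlet}(\beta) \;\;\;\;(k = 1, \dots, K)\\
\theta_d \sim \mathrm{Dirichlet}(\alpha) \;\;\;\;(d = 1, \dots, D)\\
z_{dn} \sim \mathrm{Categorical}(\theta_d) \;\;\;\;(n = 1, \dots, N_d)\\
w_{dn} \sim \mathrm{Categorical}(\phi_{z_{dn}})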
LDA can be estimated with Gibbs sampling or variational Bayes; collapsed Gibbs sampling (CGS) is probably the best-known approach.
The pseudocode for estimating LDA with collapsed Gibbs sampling is as follows.
N_dk = 0   # number of words in document d assigned to topic k
N_kv = 0   # number of times word v is assigned to topic k
N_k = 0    # total number of words assigned to topic k
d = 1, …, D   # document index
k = 1, …, K   # topic index
v = 1, …, V   # vocabulary index
initialize(z_dn)   # randomly initialize the topic of the n-th word in document d
increment N_dk, N_kv, N_k according to the initial assignments
repeat
    for d = 1, …, D do
        for n = 1, …, N_d do   # N_d is the number of words in document d
            N_dk[d][z_dn] -= 1   # remove the current assignment from the counts
            N_kv[z_dn][w_dn] -= 1
            N_k[z_dn] -= 1
            for k = 1, …, K do
                compute p(z_dn = k)   # probability that topic k is assigned to the n-th word of document d
            endfor
            z_dn ~ Categorical(p(z_dn))   # sample a new topic for z_dn
            N_dk[d][z_dn] += 1   # add the new assignment back to the counts
            N_kv[z_dn][w_dn] += 1
            N_k[z_dn] += 1
        endfor
    endfor
until the stopping condition is met
The probability $p(z_{dn} = k)$ in the pseudocode is computed as follows.

p(z_{dn}=k) \propto (N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V}
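As a concrete illustration, here is a minimal sketch of one collapsed Gibbs sampling sweep in Python with NumPy. The array names mirror the counts in the pseudocode; the function name and the document representation (lists of vocabulary ids) are assumptions made for this sketch, not taken from any existing library or from the implementation linked below.

```python
import numpy as np

def gibbs_sweep(docs, z, N_dk, N_kv, N_k, alpha, beta, rng):
    """One collapsed Gibbs sampling sweep over every word of every document.

    docs : list of documents, each a list of vocabulary ids
    z    : current topic assignments, same shape as docs
    """
    K = N_k.shape[0]
    V = N_kv.shape[1]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[d][n]
            # remove the current assignment from the counts
            N_dk[d, k_old] -= 1
            N_kv[k_old, w] -= 1
            N_k[k_old] -= 1
            # p(z_dn = k) ∝ (N_dk + alpha) * (N_kw + beta) / (N_k + beta * V)
            p = (N_dk[d] + alpha) * (N_kv[:, w] + beta) / (N_k + beta * V)
            p /= p.sum()
            k_new = int(rng.choice(K, p=p))
            # add the new assignment back to the counts
            N_dk[d, k_new] += 1
            N_kv[k_new, w] += 1
            N_k[k_new] += 1
            z[d][n] = k_new
```

Repeating this sweep corresponds to the `repeat ... until` loop in the pseudocode.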
Interactive Topic Model (ITM)
When computing topics with LDA, you sometimes want two particular words to end up in the same topic.
You can address this by adding a constraint that word A and word B should be generated from the same topic.
That is the Interactive Topic Model (ITM): http://dl.acm.org/citation.cfm?id=2002505
Intuitively, ITM treats the constrained words as a single word and distributes their occurrence probability evenly among them, which makes the constrained words more likely to appear in the same topic.
The calculation is simple: the formula for $p(z_{dn} = k)$ in LDA is rewritten as follows.
p(z_{dn} = k) \propto \begin{cases}
(N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V} \;\;\;\;(w_{dn} \notin \Omega)\\
(N_{dk}+\alpha)\frac{N_{k\Omega}+|\Omega|\beta}{N_k+\beta V}\frac{W_{k\Omega w_{dn}}+\eta}{W_{k \Omega} + |\Omega|\eta} \;\;\;\;(w_{dn} \in \Omega)
\end{cases}
Here, $\Omega$ is a constraint, i.e. a set of words that should share a topic, $N_{k\Omega}$ is the number of words in topic $k$ that belong to $\Omega$, $W_{k\Omega w_{dn}}$ is the number of times the word $w_{dn}$ is assigned to topic $k$ within the constraint, $W_{k\Omega}$ is the total number of constrained words assigned to topic $k$, and $\eta$ is a hyperparameter.
In other words, if the word $w_{dn}$ is not included in the constraint $\Omega$, the same formula as LDA is used; if it is included, $p(z_{dn} = k)$ is computed with the second case.
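To make the case split concrete, here is a minimal sketch of the modified sampling probability. It is based on my reading of the formula above rather than on the linked implementation; in particular, the within-constraint counts $W_{k\Omega w}$ and $W_{k\Omega}$ are taken to be the ordinary word-topic counts restricted to $\Omega$, which is a simplifying assumption.

```python
import numpy as np

def itm_topic_probs(d, w, N_dk, N_kv, N_k, alpha, beta, eta, constraint):
    """Normalized p(z_dn = k) for word w in document d under one constraint set Omega.

    constraint : set of vocabulary ids that should share a topic (Omega)
    """
    V = N_kv.shape[1]
    if w not in constraint:
        # same formula as plain LDA
        p = (N_dk[d] + alpha) * (N_kv[:, w] + beta) / (N_k + beta * V)
    else:
        omega = np.fromiter(constraint, dtype=int)
        # words of Omega assigned to each topic (used for both N_kOmega and W_kOmega here)
        N_k_omega = N_kv[:, omega].sum(axis=1)
        # treat Omega as one "word", then share its mass among the constrained words
        p = ((N_dk[d] + alpha)
             * (N_k_omega + len(omega) * beta) / (N_k + beta * V)
             * (N_kv[:, w] + eta) / (N_k_omega + len(omega) * eta))
    return p / p.sum()
```

Note that when $\eta$ is large (as in the experiment below, where $\eta = 100$), the last factor approaches $1/|\Omega|$, which matches the intuition above of distributing the probability evenly among the constrained words.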
In the experiment, we verify the effect of ITM.
The livedoor corpus was used as the data set: http://bookmarks2022.blogspot.jp/2015/06/livedoor.html
The ITM code is here: https://github.com/kenchin110100/machine_learning/blob/master/sampleITM.py
The number of topics was fixed at $K = 10$, with $\alpha = 0.1$, $\beta = 0.01$, and $\eta = 100$.
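For reference, the overall flow of such a run could look like the sketch below, which reuses the hypothetical `gibbs_sweep` and `itm_topic_probs` functions from above; the toy documents and the constrained word ids are placeholders, not the livedoor data or the actual constraint used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha, beta, eta = 10, 0.1, 0.01, 100.0

# each document as a list of vocabulary ids (the real corpus is preprocessed elsewhere)
docs = [[0, 1, 2, 3], [2, 3, 4, 1]]          # toy placeholder documents
V = max(w for doc in docs for w in doc) + 1

# random initial topic assignments and the matching counts
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
N_dk = np.zeros((len(docs), K))
N_kv = np.zeros((K, V))
N_k = np.zeros(K)
for d, doc in enumerate(docs):
    for w, k in zip(doc, z[d]):
        N_dk[d, k] += 1
        N_kv[k, w] += 1
        N_k[k] += 1

# phase 1: 50 iterations without constraints (plain LDA)
for _ in range(50):
    gibbs_sweep(docs, z, N_dk, N_kv, N_k, alpha, beta, rng)

# phase 2: 50 more iterations with a constraint Omega (hypothetical word ids)
constraint = {1, 3}
for _ in range(50):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k_old = z[d][n]
            N_dk[d, k_old] -= 1; N_kv[k_old, w] -= 1; N_k[k_old] -= 1
            p = itm_topic_probs(d, w, N_dk, N_kv, N_k, alpha, beta, eta, constraint)
            k_new = int(rng.choice(K, p=p))
            N_dk[d, k_new] += 1; N_kv[k_new, w] += 1; N_k[k_new] += 1
            z[d][n] = k_new
```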
First, we ran 50 iterations without any constraints (i.e. the same as ordinary LDA).
The table below shows the top words that appear in each topic.
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
function | App | Female | the work | Japan |
Release | powered by | golf | movies | update |
update | screen | myself | 153 | Relation |
article | smartphone | male | 181 | world |
Use | Presentation | marriage | directed by | Popular |
Digi | Correspondence | Many | Release | movies |
Relation | Max Co., Ltd. | Opponent | 3 | http:// |
smartphone | Android | jobs | 96 | myself |
software | display | Christmas | 13 | topic |
user | year 2012 | Girls | Book | Wow |
Even with this alone, you can see that some topics have formed.
Next, we constrained the words shown in blue in the table to be in the same topic and ran another 50 iterations.
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|
Release | App | Female | 153 | movies |
Use | powered by | myself | 181 | Japan |
function | smartphone | golf | 3 | the work |
update | Presentation | male | 96 | Release |
article | Correspondence | marriage | 13 | world |
service | Ma | Many | 552 | directed by |
Relation | Max Co., Ltd. | Opponent | 144 | Relation |
Digi | Android | jobs | 310 | Special feature |
software | display | Christmas | 98 | http:// |
information | year 2012 | Good | Hero | Wow |
The constrained words (shown in blue) now appear together in Topic 5.
With the Interactive Topic Model (ITM), we estimated topic distributions while constraining certain words to appear in the same topic.
At first glance the results shown here look good, but in reality they are the product of a lot of trial and error...
There is also a journal version of the ITM paper, so implementing its contents may give better accuracy.