[PYTHON] Manipulate topic models ~ Interactive Topic Model ~

Implementation of the Interactive Topic Model and its results

Introduction

In natural language processing, topic models are a family of methods for extracting the underlying topics from a collection of documents.

Among them, the Interactive Topic Model is a method that lets you intentionally manipulate which words appear in a topic.

In this article, we implement the Interactive Topic Model and verify its effect.

Method

Topic model

A topic model estimates, from a collection of documents, the topic distribution $\theta$ (the probability that each topic appears in a document; for example, a newspaper article may contain topics such as politics or sports) and the word distribution $\phi$ (how likely each word is to appear within a given topic).

For an accessible explanation of topic models, see materials such as http://qiita.com/GushiSnow/items/8156d440540b0a11dfe6 and http://statmodeling.hatenablog.com/entry/topic-model-4 .

Latent Dirichlet Allocation(LDA)

Of the various topic models, Latent Dirichlet Allocation (LDA) is the most famous.

LDA assumes that a single document (e.g., a newspaper article) contains multiple topics (politics, sports, etc.), and that each topic has its own word distribution.

The graphical model is as shown in the figure below.

(Figure: LDA.png, the graphical model of LDA)

Here, $\theta$ is the topic distribution, $\phi$ is the word distribution, $z$ is the topic assigned to each word in a document, $v$ is a word in a document, $N$ is the number of words in a document, $D$ is the number of documents, $K$ is the number of topics, and $\alpha$ and $\beta$ are hyperparameters.

LDA can be estimated with Gibbs sampling or variational Bayes, but collapsed Gibbs sampling (CGS) is probably the best-known approach.

The pseudocode for estimating LDA with collapsed Gibbs sampling is as follows:


N_dk = 0  # number of words in document d assigned to topic k
N_kv = 0  # number of times word v is assigned to topic k
N_k = 0   # number of words assigned to topic k
d = 1, …, D  # document index
k = 1, …, K  # topic index
v = 1, …, V  # vocabulary index

initialize(z_dn)  # randomly initialize the topic of the nth word in document d

repeat
  for d = 1, …, D do
    for n = 1, …, N_d do  # N_d is the number of words in document d

      N_dk[d][z_dn] -= 1        # remove the current assignment from the counts
      N_kv[z_dn][w_dn] -= 1
      N_k[z_dn] -= 1

      for k = 1, …, K do
        compute p(z_dn = k)     # probability that topic k is assigned to the nth word of document d
      endfor

      z_dn ~ Categorical(p(z_dn))  # sample a new topic for z_dn

      N_dk[d][z_dn] += 1        # add the new assignment to the counts
      N_kv[z_dn][w_dn] += 1
      N_k[z_dn] += 1

    endfor
  endfor
until the stopping condition is met

The probability $p(z_{dn} = k)$ in the pseudocode is calculated as

p(z_{dn}=k) \propto (N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V}

where $N_{kw_{dn}}$ is the number of times the word $w_{dn}$ is assigned to topic $k$.
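To make the procedure concrete, here is a minimal NumPy sketch of this collapsed Gibbs sampler. It is a toy implementation written for this article, not the code from the repository linked in the Experiment section; the function name lda_gibbs, the input format (each document as a list of word ids), and all variable names are assumptions.

import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    # Minimal collapsed Gibbs sampler for LDA.
    # docs: list of documents, each a list of word ids in [0, V).
    rng = np.random.default_rng(seed)
    D = len(docs)
    N_dk = np.zeros((D, K))   # number of words in document d assigned to topic k
    N_kv = np.zeros((K, V))   # number of times word v is assigned to topic k
    N_k = np.zeros(K)         # number of words assigned to topic k
    z = []                    # topic assignment of every word

    # random initialization of the topic assignments
    for d, doc in enumerate(docs):
        z_d = rng.integers(K, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            N_dk[d, k] += 1
            N_kv[k, w] += 1
            N_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k_old = z[d][n]
                # remove the current assignment from the counts
                N_dk[d, k_old] -= 1
                N_kv[k_old, w] -= 1
                N_k[k_old] -= 1
                # p(z_dn = k) is proportional to (N_dk + alpha) * (N_kw + beta) / (N_k + beta * V)
                p = (N_dk[d] + alpha) * (N_kv[:, w] + beta) / (N_k + beta * V)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][n] = k_new
                # add the new assignment to the counts
                N_dk[d, k_new] += 1
                N_kv[k_new, w] += 1
                N_k[k_new] += 1

    # recover the topic distribution theta and the word distribution phi from the final counts
    theta = (N_dk + alpha) / (N_dk.sum(axis=1, keepdims=True) + alpha * K)
    phi = (N_kv + beta) / (N_k[:, None] + beta * V)
    return theta, phi, z

The last two lines recover the topic distribution $\theta$ and the word distribution $\phi$ described in the Method section from the final counts.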

Interactive Topic Model(ITM)

When estimating topics with LDA, you sometimes want two particular words to end up in the same topic.

You can address this by adding a constraint that word A and word B should be generated from the same topic.

That is the Interactive Topic Model (ITM): http://dl.acm.org/citation.cfm?id=2002505

To give an intuitive explanation, ITM treats the constrained words as if they were a single word and spreads the probability of occurrence evenly among them, which makes the constrained words more likely to appear in the same topic.

The calculation is simple, and the formula for $p(z_{dn} = k)$ in LDA is rewritten as follows.

p(z_{dn} = k) \propto \begin{cases}
(N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V} & (w_{dn} \notin \Omega)\\
(N_{dk}+\alpha)\frac{N_{k\Omega}+|\Omega|\beta}{N_k+\beta V}\frac{W_{k\Omega w_{dn}}+\eta}{W_{k \Omega} + |\Omega|\eta} & (w_{dn} \in \Omega)
\end{cases}

Here, $\Omega$ is a constraint (a set of words), $|\Omega|$ is the number of words contained in the constraint, $N_{k\Omega}$ is the number of times words of constraint $\Omega$ appear in topic $k$, and $W_{k\Omega w_{dn}}$ is the number of times the word $w_{dn}$ belonging to constraint $\Omega$ is assigned to topic $k$.

In other words, if the word $w_{dn}$ is not included in a constraint $\Omega$, the same formula as LDA is used; if it is included, $p(z_{dn} = k)$ is calculated with the second case above.
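As a sketch of this case split, the following function shows how the probability computation in the earlier lda_gibbs sketch could be modified for a single constraint. The function and variable names are my own; for simplicity the within-constraint counts $W_{k\Omega w_{dn}}$ are taken from the same per-word topic counts N_kv, whereas the original ITM maintains them as separate tallies.

import numpy as np

def itm_topic_probs(w, d, omega, N_dk, N_kv, N_k, alpha=0.1, beta=0.01, eta=100.0):
    # Unnormalized p(z_dn = k) for ITM with a single constraint omega (a set of word ids).
    V = N_kv.shape[1]
    if w not in omega:
        # unconstrained word: same formula as plain LDA
        return (N_dk[d] + alpha) * (N_kv[:, w] + beta) / (N_k + beta * V)
    omega_ids = list(omega)
    # N_{kOmega}: number of times any word of the constraint is assigned to topic k
    N_komega = N_kv[:, omega_ids].sum(axis=1)
    # treat the constrained words as one merged word ...
    first = (N_dk[d] + alpha) * (N_komega + len(omega) * beta) / (N_k + beta * V)
    # ... then redistribute probability among the words inside the constraint
    second = (N_kv[:, w] + eta) / (N_komega + len(omega) * eta)
    return first * second

Inside the sampling loop of the earlier sketch, the line that computes p would simply be replaced by a call to itm_topic_probs(w, d, omega, N_dk, N_kv, N_k).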

Experiment

In the experiment, we verify the effect of ITM.

Dataset

The livedoor news corpus was used as the dataset. http://bookmarks2022.blogspot.jp/2015/06/livedoor.html

Code

The ITM code is available here: https://github.com/kenchin110100/machine_learning/blob/master/sampleITM.py

Experimental result

The number of topics was fixed at $K = 10$, with $\alpha = 0.1$, $\beta = 0.01$, and $\eta = 100$.

First, 50 iterations were run without constraints (that is, equivalent to ordinary LDA).
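For reference, a toy driver reproducing these settings with the sketch functions from earlier (not the linked repository code) might look like the following; the tiny corpus and vocabulary size are placeholders, not the livedoor corpus:

import numpy as np

# placeholder corpus: each document is a list of word ids (the real experiment used the livedoor corpus)
docs = [[0, 1, 2, 1], [2, 3, 4, 0], [0, 4, 5, 5, 1]]
V = 6

# 50 unconstrained iterations, i.e. plain LDA with the settings above
theta, phi, z = lda_gibbs(docs, V=V, K=10, alpha=0.1, beta=0.01, n_iter=50)

# show the top-3 word ids of each topic (with a real corpus these would be mapped back to words)
for k, row in enumerate(phi):
    print("topic", k, ":", np.argsort(row)[::-1][:3])

For the constrained run, the probability computation inside lda_gibbs would be swapped for itm_topic_probs with the chosen set of word ids.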

The table below shows the top words that appear in each topic.

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
| --- | --- | --- | --- | --- |
| function | App | Female | the work | Japan |
| Release | powered by | golf | movies | update |
| update | screen | myself | 153 | Relation |
| article | smartphone | male | 181 | world |
| Use | Presentation | marriage | directed by | Popular |
| Digi | Correspondence | Many | Release | movies |
| Relation | Max Co., Ltd. | Opponent | 3 | http:// |
| smartphone | Android | jobs | 96 | myself |
| software | display | Christmas | 13 | topic |
| user | year 2012 | Girls | Book | Wow |

Even with this alone, you can see some coherent topics emerging.

Next, the words shown in blue in the table were constrained to belong to the same topic, and another 50 iterations were run.

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
| --- | --- | --- | --- | --- |
| Release | App | Female | 153 | movies |
| Use | powered by | myself | 181 | Japan |
| function | smartphone | golf | 3 | the work |
| update | Presentation | male | 96 | Release |
| article | Correspondence | marriage | 13 | world |
| service | Ma | Many | 552 | directed by |
| Relation | Max Co., Ltd. | Opponent | 144 | Relation |
| Digi | Android | jobs | 310 | Special feature |
| software | display | Christmas | 98 | http:// |
| information | year 2012 | Good | Hero | Wow |

The constrained words (shown in blue) now appear together in Topic 5.

Summary

In this article, we used the Interactive Topic Model (ITM) to estimate topic distributions under word constraints.

At first glance the results look good, but in reality they came out of quite a bit of trial and error...

There is also a journal version of the ITM paper, so implementing its contents may give better accuracy.
