The following material was used as a reference.
NLP Programming Tutorial 7 - Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf (accessed 2015-06-25)
For a rough overview of topic models, see:
http://qiita.com/GushiSnow/items/8156d440540b0a11dfe6
Below I describe the overall structure of a Python implementation.
The structure is not complicated, so it should be straightforward to implement.
Unlike typical supervised machine learning, a topic model is not given each document's topic as a label.
It is a method for estimating the topics in that situation.
Unsupervised learning techniques are generally used for this.
Quoted from page 11 of NLP Programming Tutorial 7 - Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf (accessed 2015-06-25)
Below is the implemented code.
import random

def sampleOne(probs):
    z = 0
    for k, v in probs.items():
        z = z + v
    remaining = random.uniform(0, z)
    for k, v in probs.items():
        remaining = remaining - v
        if remaining <= 0:
            return k
A concrete example follows.

String : A B C D
Topic  : 1 2 2 3

Suppose the probability distribution assigns each topic the following output probability:

1: 1/2  2: 1/3  3: 1/4

Then the word probabilities are 1/2, 1/3, 1/3, 1/4 and their sum is 1/2 + 1/3 + 1/3 + 1/4.
A value is drawn uniformly between 0 and this sum. If the drawn value happens to be

1/2 + 1/3

then, lining up the words, their topics, and their probabilities:

A    B    C    D
1    2    2    3
1/2  1/3  1/3  1/4

subtracting the probabilities of the words up through B exhausts the drawn value, so topic "2" (the topic of B) is the topic obtained by this sample.
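Tracing this example in code: the dict below encodes the word/topic/probability table, and fixing the draw to 1/2 + 1/3 reproduces the walk above (the `draw` parameter is added here purely for illustration).

```python
import random

def sample_one(probs, draw=None):
    """Cumulative-sum sampling; the draw can be fixed for illustration."""
    z = sum(probs.values())
    remaining = random.uniform(0, z) if draw is None else draw
    for k, v in probs.items():
        remaining -= v
        if remaining <= 0:
            return k

# word -> (topic, probability), from the table above
words = {"A": (1, 1/2), "B": (2, 1/3), "C": (2, 1/3), "D": (3, 1/4)}
probs = {w: p for w, (t, p) in words.items()}

# A draw of exactly 1/2 + 1/3 is used up while subtracting B's mass,
# so the sampled word is B, whose topic is 2.
picked = sample_one(probs, draw=1/2 + 1/3)
print(picked, words[picked][0])  # -> B 2
```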
The method used here is Gibbs sampling.
Gibbs sampling is a method for generating samples that follow a given distribution; which distribution to use depends on the problem you want to solve.
In this case the joint probability distribution $P(X, Y)$ is given, but sampling from it directly is hard because two variables are involved at once.
Therefore, sampling is performed from the conditional distributions instead. In short, you just repeat the following two steps:

Fix the string and sample the topic
Fix the topic and sample the string
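The alternating scheme can be seen on a toy problem before the topic-model version. The joint distribution below is made up purely for illustration; `sample_one` is the same cumulative-sum sampler shown earlier. Each sweep fixes one variable and samples the other from its conditional, and the empirical frequencies approach the joint.

```python
import random

# Toy joint P(X, Y) over two binary variables (numbers made up for illustration).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def sample_one(probs):
    """Cumulative-sum sampling from an unnormalized distribution."""
    z = sum(probs.values())
    r = random.uniform(0, z)
    for k, v in probs.items():
        r -= v
        if r <= 0:
            return k

def cond_x(y):
    # P(X | Y=y) is proportional to the joint with y fixed
    return {x: joint[(x, y)] for x in (0, 1)}

def cond_y(x):
    # P(Y | X=x) is proportional to the joint with x fixed
    return {y: joint[(x, y)] for y in (0, 1)}

random.seed(0)
x, y = 0, 0
counts = {key: 0 for key in joint}
for _ in range(20000):
    x = sample_one(cond_x(y))  # fix y, sample x
    y = sample_one(cond_y(x))  # fix x, sample y
    counts[(x, y)] += 1

# Empirical frequencies approach the joint, e.g. counts[(1, 1)]/20000 ≈ 0.4
for key in joint:
    print(key, counts[key] / 20000.0)
```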
Sampling for the topic model itself works as follows.
Quoted from pages 16-19 of NLP Programming Tutorial 7 - Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf (accessed 2015-06-25)

Remove the current string/topic pair from the counts and recompute the probabilities
Multiply the topic probability by the word probability to compute the joint probability of each topic
Sample one topic from the updated joint distribution and update the counts with the sampled word and topic

Since many counts fall to 0, smoothing is applied.
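The smoothing used here is additive: alpha is spread over a vocabulary of size N and beta over the K topics, so a count that has fallen to 0 still yields a small nonzero probability. A sketch of the two smoothed probabilities (the function and variable names below are mine, for illustration, not the tutorial's):

```python
# Additive smoothing: a count that has fallen to 0 still yields a small,
# nonzero probability. N = vocabulary size, K = number of topics.
def p_word_given_topic(xcounts, word, k, alpha, N):
    return (xcounts.get((word, k), 0) + alpha) / (xcounts.get(k, 0) + alpha * N)

def p_topic_given_doc(ycounts, k, docid, beta, K):
    return (ycounts.get((k, docid), 0) + beta) / (ycounts.get(docid, 0) + beta * K)

# Even a completely unseen (word, topic) pair keeps a small probability:
p = p_word_given_topic({}, "unseen", 1, alpha=0.01, N=1000)
print(p)  # about 0.001, never exactly 0
```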
Initialize and define the required values.

Definition of the __init__ part:

Words and topics in the document corpus: self.xcorpus = numpy.array([]), self.ycorpus = numpy.array([])
Counts of words and topics: self.xcounts = {}, self.ycounts = {}
Topic vector: self.topics_vector = numpy.array([])
Number of topics: self.TOPICS = 7
Document id: self.docid = 1
Distinct words: self.different_word (the vocabulary, filled in from the corpus)

In the initialization step, each word is assigned a random initial topic. This random assignment is one place that could be refined, for example by using a conjugate prior.
#-*- coding:utf-8 -*-
__author__ = 'ohgushimasaya'
from numpy import *
from numpy.random import *
import numpy
from Add_Count import add_count
import os.path

class initilaze_topic_model:
    def __init__(self):
        self.xcorpus = numpy.array([])
        self.ycorpus = numpy.array([])
        self.xcounts = {}
        self.ycounts = {}
        self.topics_vector = numpy.array([])
        self.TOPICS = 7
        self.docid = 1
        self.different_word = set()

    def initilize(self):
        first_time = 1
        adder = add_count(self.xcounts, self.ycounts)
        self.docid = os.path.getsize("07-train.txt")
        for line in open("07-train.txt", "r"):
            rline = line.rstrip("\n")          # "¥n" in the original is a mis-encoded "\n"
            words = numpy.array(rline.split(" "))
            topics_vector = []
            self.different_word |= set(words)  # accumulate the vocabulary over all lines
            for word in words:
                topic = randint(self.TOPICS) + 1   # random initial topic, 1..TOPICS
                topics_vector.append(topic)
                adder.add_counter(word, topic, self.docid, 1)
            array_topics_vector = numpy.array(topics_vector)
            if first_time == 1:
                self.xcorpus = numpy.hstack((self.xcorpus, words))
                self.ycorpus = numpy.hstack((self.ycorpus, array_topics_vector))
                first_time = first_time + 1
            else:
                self.xcorpus = numpy.vstack((self.xcorpus, words))
                self.ycorpus = numpy.vstack((self.ycorpus, array_topics_vector))
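One detail of the corpus construction above: the first line is stored with hstack as a 1-D array, and subsequent lines are stacked as rows with vstack, which only works when every line has the same number of words. A minimal illustration:

```python
import numpy

# The first line is added with hstack (giving a 1-D array); later lines are
# added with vstack (stacking rows). vstack requires every line to have the
# same number of words, a limitation of this representation.
xcorpus = numpy.array([])
line1 = numpy.array(["a", "b", "c"])
line2 = numpy.array(["d", "e", "f"])

xcorpus = numpy.hstack((xcorpus, line1))  # 1-D: shape (3,)
xcorpus = numpy.vstack((xcorpus, line2))  # 2-D: shape (2, 3)
print(xcorpus.shape)  # -> (2, 3)
```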
__author__ = 'ohgushimasaya'

class add_count:
    def __init__(self, xcounts, ycounts):
        self.xcounts = xcounts
        self.ycounts = ycounts

    def add_counter(self, word, topic, docid, amount):
        # Word counts: c(topic) and c(word, topic)
        self.xcounts = add_count.check_dict(topic, self.xcounts, amount)
        self.xcounts = add_count.check_dict((word, topic), self.xcounts, amount)
        # Topic counts: c(docid) and c(topic, docid)
        self.ycounts = add_count.check_dict(docid, self.ycounts, amount)
        self.ycounts = add_count.check_dict((topic, docid), self.ycounts, amount)

    @staticmethod
    def check_dict(key, w_t_count, amount):
        # dict.has_key was removed in Python 3; use "in" instead
        if key in w_t_count:
            w_t_count[key] += amount
        else:
            w_t_count[key] = amount   # was hard-coded to 1; must honor amount (can be -1)
        return w_t_count
This class counts words and topics: it maintains the count of each word given a topic, and the count of each topic given a document id.
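To see what add_counter maintains, here is a minimal stand-alone mirror of its behavior (a simplified illustration, not the class itself). Each update touches four entries: c(topic), c(word, topic), c(docid), and c(topic, docid).

```python
# Simplified stand-alone mirror of add_counter (illustration only).
def add_counter(xcounts, ycounts, word, topic, docid, amount):
    for counts, key in ((xcounts, topic), (xcounts, (word, topic)),
                        (ycounts, docid), (ycounts, (topic, docid))):
        counts[key] = counts.get(key, 0) + amount

xcounts, ycounts = {}, {}
add_counter(xcounts, ycounts, "cat", 2, 0, 1)
print(xcounts)  # {2: 1, ('cat', 2): 1}
print(ycounts)  # {0: 1, (2, 0): 1}

# amount=-1 removes the pair again, so every count returns to 0,
# which is exactly what the sampler does before re-sampling a topic.
add_counter(xcounts, ycounts, "cat", 2, 0, -1)
print(xcounts[2], ycounts[0])  # 0 0
```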
__author__ = 'ohgushimasaya'
from numpy import *
from numpy.random import *
import numpy
import random
from Add_Count import add_count
import os.path

class Sampling:
    def __init__(self, xcorpus, ycorpus):
        self.iteration = 1000
        self.xcorpus = xcorpus
        self.ycorpus = ycorpus
        self.alpha = 0.01
        self.beta = 0.03

    def sampling(self, TOPICS, xcounts, ycounts, docId, different_word):
        for it in range(0, self.iteration):
            self.sampler(TOPICS, xcounts, ycounts, docId, different_word)

    def sampler(self, TOPICS, xcounts, ycounts, docId, different_word):
        ll = 0
        adder = add_count(xcounts, ycounts)
        for i in range(0, len(self.xcorpus)):
            for j in range(0, len(self.xcorpus[i])):
                x = self.xcorpus[i][j]
                y = self.ycorpus[i][j]
                # remove the current word/topic pair; docId must be the same
                # id that was used when the counts were first added
                adder.add_counter(x, y, docId, -1)
                probs = {}  # reset for every word, not once per sweep
                # topics are numbered 1..TOPICS at initialization
                for k in range(1, TOPICS + 1):
                    # smoothed probabilities; the parentheses matter and were
                    # misplaced by operator precedence in the original
                    p_x_k = (xcounts.get((x, k), 0) + self.alpha) / \
                            (xcounts.get(k, 0) + self.alpha * len(different_word))
                    p_k_d = (ycounts.get((k, docId), 0) + self.beta) / \
                            (ycounts.get(docId, 0) + self.beta * TOPICS)
                    probs[k] = p_x_k * p_k_d
                new_y = Sampling.sampleOne(probs)
                ll = ll + log(probs[new_y])
                adder.add_counter(x, new_y, docId, 1)
                self.ycorpus[i][j] = new_y
        print(ll)

    @staticmethod
    def sampleOne(probs):
        z = 0
        for k, v in probs.items():
            z = z + v
        remaining = random.uniform(0, z)
        for k, v in probs.items():
            remaining = remaining - v
            if remaining <= 0:
                return k
The full code is available here:
https://github.com/SnowMasaya/TOPIC_MODEL