I tried to implement TOPIC MODEL in Python


The following materials were used as a reference.

NLP Programming Tutorial 7-Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf, (accessed 2015-06-25)

Please refer to the following for a rough explanation of the TOPIC model.

http://qiita.com/GushiSnow/items/8156d440540b0a11dfe6

The overall structure of the Python implementation is shown below.

TOPIC MODEL.jpg

The structure is not complicated, so I think it is easy to implement.

Unlike typical supervised machine learning, a topic model is not given the topic of each document, which would play the role of the label.

It is a method for estimating the topics in that situation.

For this, unsupervised learning techniques are generally used.

sampling

Quoted from page 11 of NLP Programming Tutorial 7-Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf, (accessed 2015-06-25)

TOPIC_MODEL_example.png

Below is the implemented code.

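 import random

 def sampleOne(probs):
     # Sum of the (unnormalized) probability values
     z = 0
     for k, v in probs.items():
         z = z + v
     # Uniform random value between 0 and the sum
     remaining = random.uniform(0, z)
     for k, v in probs.items():
         remaining = remaining - v
         if remaining <= 0:
             return k
     # Floating-point rounding can leave a tiny positive remainder
     return k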
  1. Pass the function a dictionary that maps each topic to its probability value, obtained from the distribution for the current sample.
  2. Calculate the sum of the probability values.
  3. Draw a random value from a uniform distribution over the range from 0 to that sum. (Change this distribution depending on the problem.)
  4. Subtract each probability value from the random value in turn; the topic at which the remainder becomes 0 or less is the topic of the word.
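As a quick sanity check of the sampling steps above, the following sketch draws many samples and compares the observed frequencies with the weights. The name `sample_one`, the weights, and the seed are assumptions made for illustration, not part of the original code.

```python
import random
from collections import Counter


def sample_one(probs):
    """Draw one key from probs with probability proportional to its value."""
    total = sum(probs.values())
    remaining = random.uniform(0, total)
    key = None
    for key, value in probs.items():
        remaining -= value
        if remaining <= 0:
            return key
    # Guard against floating-point rounding: fall back to the last key.
    return key


random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)
weights = {1: 0.5, 2: 0.3, 3: 0.2}  # hypothetical topic weights
n_draws = 20000
draws = Counter(sample_one(weights) for _ in range(n_draws))
frequencies = {k: draws[k] / float(n_draws) for k in weights}
print(frequencies)
```

With enough draws, each frequency should come out close to the corresponding weight divided by the sum of the weights.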

A specific example is as follows.

String  A    B    C    D
Topic   1    2    2    3

Suppose the probability distribution gives each topic the following probability:

1: 1/2   2: 1/3   3: 1/4

Summed over the four strings, the total is 1/2 + 1/3 + 1/3 + 1/4 = 17/12.

Suppose the value drawn from the range between 0 and this total is

1/2 + 1/3 = 5/6

Then, for

String       A    B    C    D
Topic        1    2    2    3
Probability  1/2  1/3  1/3  1/4

subtracting the probabilities of the strings in order, the remainder reaches 0 at B (5/6 - 1/2 - 1/3 = 0), so topic "2" is the topic obtained in this sample.
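The arithmetic of this example can be checked with exact fractions. This is only a sketch: `pick_topic` is a hypothetical helper, the random draw is replaced by the fixed value 1/2 + 1/3, and D is assumed to have topic 3 as in the first listing.

```python
from fractions import Fraction


def pick_topic(pairs, draw):
    """Subtract each probability from draw; return the topic where it reaches 0 or less."""
    remaining = draw
    for topic, prob in pairs:
        remaining -= prob
        if remaining <= 0:
            return topic
    return pairs[-1][0]


# (topic, probability) for the strings A, B, C, D
pairs = [(1, Fraction(1, 2)), (2, Fraction(1, 3)),
         (2, Fraction(1, 3)), (3, Fraction(1, 4))]
total = sum(prob for _, prob in pairs)
draw = Fraction(1, 2) + Fraction(1, 3)  # the value assumed to be drawn
print(total)                    # 17/12
print(pick_topic(pairs, draw))  # 2
```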

Gibbs sampling

The method used this time is Gibbs sampling.

It is a method for generating samples that follow a given distribution.

Which distribution to use matters, and the choice depends on the problem you want to solve.

This time the joint probability distribution $P(X, Y)$ is given, but it is difficult to sample both variables directly from the joint distribution.

Therefore, sampling is performed using the conditional probability distributions. In summary:

  1. Fix the string and sample the topic.
  2. Fix the topic and sample the string.

Just repeat the above.
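To make the alternation concrete, here is a minimal sketch of Gibbs sampling for two binary variables. The joint table, the function names, and the seed are all assumptions chosen for illustration; they are not taken from the tutorial.

```python
import random
from collections import Counter

# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def sample_conditional(fixed_index, fixed_value, rng):
    """Sample the free variable given the fixed one, i.e. from P(free | fixed)."""
    candidates = [(pair, p) for pair, p in joint.items()
                  if pair[fixed_index] == fixed_value]
    remaining = rng.uniform(0, sum(p for _, p in candidates))
    for pair, p in candidates:
        remaining -= p
        if remaining <= 0:
            break
    return pair[1 - fixed_index]


def gibbs(n_samples, rng):
    x, y = 0, 0  # arbitrary starting point
    counts = Counter()
    for _ in range(n_samples):
        y = sample_conditional(0, x, rng)  # fix X, sample Y from P(Y | X)
        x = sample_conditional(1, y, rng)  # fix Y, sample X from P(X | Y)
        counts[(x, y)] += 1
    return {pair: counts[pair] / float(n_samples) for pair in joint}


estimate = gibbs(50000, random.Random(0))
print(estimate)
```

Although only conditional distributions are ever sampled, the frequencies of the visited (x, y) pairs should approach the joint table.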

Sampling for the topic model specifically.

Quoted from pages 16-19 of NLP Programming Tutorial 7-Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf, (accessed 2015-06-25)

Delete a string/topic pair from the counts and recalculate the probabilities.

TOPIC_MODEL_example1.png

Multiply the topic probability by the word probability to calculate the joint probability.

TOPIC_MODEL_example2.png

Sample one topic from the updated joint probability distribution, and update the counts based on the word and the sampled topic.

TOPIC_MODEL_example3.png

Since many counts are 0, smoothing is applied.

TOPIC_MODEL_example4.png
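Written out, the smoothed probabilities take roughly the following form. The notation is an assumption for illustration: $c(\cdot)$ is a count, $N_x$ is the number of distinct words, $N_y$ is the number of topics, $Y_j$ is the document containing the word, and $\alpha$, $\beta$ are the smoothing constants.

```latex
P(x_i \mid y_i) = \frac{c(x_i, y_i) + \alpha}{c(y_i) + \alpha N_x}
\qquad
P(y_i \mid Y_j) = \frac{c(y_i, Y_j) + \beta}{c(Y_j) + \beta N_y}
```

The product of the two is the unnormalized weight of topic $y_i$ for word $x_i$. Because $\alpha$ and $\beta$ are positive, the weight never becomes exactly 0 even when a count is 0.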

Initialization

Initialize and define the required values.

Definition of the init part:

  Words and topics in the document corpus: self.xcorpus = numpy.array([]), self.ycorpus = numpy.array([])
  Counts of words and topics: self.xcounts = {}, self.ycounts = {}
  Topic vector: self.topics_vector = numpy.array([])
  Number of topics: self.TOPICS = 7
  Document id: self.docid = 1
  Number of distinct words: self.different_word = 0

In the initilize part, each word is given an initial topic at random. This random assignment is a part that can be improved, for example by using a conjugate prior.

# -*- coding: utf-8 -*-
__author__ = 'ohgushimasaya'
import numpy
from numpy.random import randint
from Add_Count import add_count


class initilaze_topic_model:
    def __init__(self):
        # Words (x) and their topics (y), one row per document
        self.xcorpus = numpy.array([])
        self.ycorpus = numpy.array([])
        # Word/topic counts and topic/document counts
        self.xcounts = {}
        self.ycounts = {}
        self.topics_vector = numpy.array([])
        self.TOPICS = 7
        self.docid = 1
        self.different_word = 0

    def initilize(self, path="07-train.txt"):
        adder = add_count(self.xcounts, self.ycounts)
        xcorpus, ycorpus, vocabulary = [], [], set()
        with open(path, "r") as f:
            # The line index is used as the document id
            for docid, line in enumerate(f):
                words = line.rstrip("\n").split(" ")
                vocabulary.update(words)
                topics_vector = []
                for word in words:
                    # Random initial topic in 0..TOPICS-1, matching range(TOPICS) in the sampler
                    topic = int(randint(self.TOPICS))
                    topics_vector.append(topic)
                    adder.add_counter(word, topic, docid, 1)
                xcorpus.append(words)
                ycorpus.append(topics_vector)
        # Documents differ in length, so the rows are stored as an object array
        self.xcorpus = numpy.array(xcorpus, dtype=object)
        self.ycorpus = numpy.array(ycorpus, dtype=object)
        # Distinct words over the whole corpus; len() gives their number
        self.different_word = vocabulary

counter

__author__ = 'ohgushimasaya'


class add_count:
    def __init__(self, xcounts, ycounts):
        self.xcounts = xcounts
        self.ycounts = ycounts

    def add_counter(self, word, topic, docid, amount):
        # Word counts: per topic, and per (word, topic) pair
        add_count.check_dict(topic, self.xcounts, amount)
        add_count.check_dict((word, topic), self.xcounts, amount)
        # Topic counts: per document, and per (topic, document) pair
        add_count.check_dict(docid, self.ycounts, amount)
        add_count.check_dict((topic, docid), self.ycounts, amount)

    @staticmethod
    def check_dict(key, w_t_count, amount):
        # Add amount to the count for key, starting from 0 for a new key
        w_t_count[key] = w_t_count.get(key, 0) + amount
        return w_t_count

This class counts topics and words. It also keeps the count of each word given a topic and the count of each topic given a document id.
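The four updates can also be written with `collections.defaultdict`. The sketch below is an alternative for illustration, not the author's class; the function name `add_counts` and the sample words are made up.

```python
from collections import defaultdict


def add_counts(xcounts, ycounts, word, topic, docid, amount):
    """Apply the same four count updates, using defaultdict(int) dictionaries."""
    xcounts[topic] += amount            # words assigned to this topic
    xcounts[(word, topic)] += amount    # this word assigned to this topic
    ycounts[docid] += amount            # words in this document
    ycounts[(topic, docid)] += amount   # this topic within this document


xcounts, ycounts = defaultdict(int), defaultdict(int)
add_counts(xcounts, ycounts, "apple", 2, 0, 1)
add_counts(xcounts, ycounts, "apple", 2, 0, 1)
add_counts(xcounts, ycounts, "pear", 1, 0, 1)
add_counts(xcounts, ycounts, "apple", 2, 0, -1)  # remove one pair again
print(xcounts[("apple", 2)], xcounts[2], ycounts[0], ycounts[(2, 0)])  # 1 1 2 1
```

Passing a negative amount removes a pair, which is what the sampler does before recomputing the probabilities for a word.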

sampling

__author__ = 'ohgushimasaya'
from math import log
import random
from Add_Count import add_count


class Sampling:
    def __init__(self, xcorpus, ycorpus):
        self.iteration = 1000
        self.xcorpus = xcorpus
        self.ycorpus = ycorpus
        self.alpha = 0.01
        self.beta = 0.03

    def sampling(self, TOPICS, xcounts, ycounts, docId, different_word):
        for i in range(self.iteration):
            self.sampler(i, TOPICS, xcounts, ycounts, docId, different_word)

    def sampler(self, i, TOPICS, xcounts, ycounts, docId, different_word):
        # docId is kept in the signature; the document index d is used as the id
        ll = 0
        adder = add_count(xcounts, ycounts)
        for d in range(len(self.xcorpus)):
            for j in range(len(self.xcorpus[d])):
                x = self.xcorpus[d][j]
                y = self.ycorpus[d][j]
                # Remove the current word/topic pair from the counts
                adder.add_counter(x, y, d, -1)
                probs = {}
                for k in range(TOPICS):
                    # Smoothed P(word | topic) and P(topic | document)
                    p_x_y = (xcounts.get((x, k), 0) + self.alpha) / \
                            (xcounts.get(k, 0) + self.alpha * len(different_word))
                    p_y_Y = (ycounts.get((k, d), 0) + self.beta) / \
                            (ycounts.get(d, 0) + self.beta * TOPICS)
                    probs[k] = p_x_y * p_y_Y
                # Sample a new topic for this word and put the pair back
                new_y = Sampling.sampleOne(probs)
                ll = ll + log(probs[new_y])
                adder.add_counter(x, new_y, d, 1)
                self.ycorpus[d][j] = new_y
        print(ll)

    @staticmethod
    def sampleOne(probs):
        z = 0
        for k, v in probs.items():
            z = z + v
        remaining = random.uniform(0, z)
        for k, v in probs.items():
            remaining = remaining - v
            if remaining <= 0:
                return k
        # Floating-point rounding can leave a tiny positive remainder
        return k
  1. Take a word and its current topic, and subtract the pair from the counts.
  2. For each topic, calculate the conditional probability of the word given the topic and the conditional probability of the topic given the document.
  3. Update the joint probability for each topic.
  4. Generate a topic using the joint probabilities.
  5. Add the log probability of the generated topic to the log likelihood.
  6. Add the word with its new topic back to the counts.
  7. Store the new topic in the topic corpus.
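The steps above can be sketched end to end on an in-memory corpus. This is a self-contained illustration under stated assumptions, not the repository code: the function names, the toy corpus, the seed, and the iteration count are made up, and the counts use `defaultdict` instead of the `add_count` class.

```python
import math
import random
from collections import defaultdict


def sample_one(probs, rng):
    remaining = rng.uniform(0, sum(probs.values()))
    key = None
    for key, value in probs.items():
        remaining -= value
        if remaining <= 0:
            return key
    return key  # fallback for floating-point rounding


def gibbs_topics(docs, n_topics, iterations, alpha=0.01, beta=0.03, seed=0):
    rng = random.Random(seed)
    xcounts, ycounts = defaultdict(int), defaultdict(int)
    vocab_size = len({w for doc in docs for w in doc})

    def add(word, topic, docid, amount):
        xcounts[topic] += amount
        xcounts[(word, topic)] += amount
        ycounts[docid] += amount
        ycounts[(topic, docid)] += amount

    # Random initial topics, one per word
    topics = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for word, topic in zip(doc, topics[d]):
            add(word, topic, d, 1)

    ll = 0.0
    for _ in range(iterations):
        ll = 0.0  # log likelihood of the topics chosen in this pass
        for d, doc in enumerate(docs):
            for j, word in enumerate(doc):
                add(word, topics[d][j], d, -1)              # step 1
                probs = {}
                for k in range(n_topics):                   # steps 2 and 3
                    p_word = (xcounts[(word, k)] + alpha) / (xcounts[k] + alpha * vocab_size)
                    p_topic = (ycounts[(k, d)] + beta) / (ycounts[d] + beta * n_topics)
                    probs[k] = p_word * p_topic
                new_topic = sample_one(probs, rng)          # step 4
                ll += math.log(probs[new_topic])            # step 5
                add(word, new_topic, d, 1)                  # step 6
                topics[d][j] = new_topic                    # step 7
    return topics, xcounts, ycounts, ll


docs = [["a", "b", "a"], ["c", "d", "c", "d"]]  # hypothetical toy corpus
topics, xcounts, ycounts, ll = gibbs_topics(docs, n_topics=2, iterations=50)
print(topics, ll)
```

Because every removal is matched by an addition, the total counts stay equal to the number of words in the corpus after each pass.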

The full code is available at the link below.

https://github.com/SnowMasaya/TOPIC_MODEL
