The following material was used as a reference.
NLP Programming Tutorial 7 - Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf (accessed 2015-06-25)
For a rough overview of topic models, see:
http://qiita.com/GushiSnow/items/8156d440540b0a11dfe6
Below I describe the overall structure of a Python implementation.
The structure is not complicated, so it should be straightforward to implement.
Unlike typical supervised machine learning, a topic model is not given each document's topic as a label.
It is a method for estimating the topics in that situation.
Unsupervised learning techniques are generally used for this.
Quoted from page 11 of NLP Programming Tutorial 7 - Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf (accessed 2015-06-25)
Below is the implemented code.
import random

def sampleOne(probs):
    z = 0
    for k, v in probs.items():
        z = z + v
    remaining = random.uniform(0, z)
    for k, v in probs.items():
        remaining = remaining - v
        if remaining <= 0:
            return k
A concrete example follows.

String : A B C D
Topic  : 1 2 2 3

Suppose the probability distribution assigns each topic the following output probability:

1: 1/2  2: 1/3  3: 1/4

Then the word probabilities are 1/2, 1/3, 1/3, 1/4 and their sum is 1/2 + 1/3 + 1/3 + 1/4.
A value is drawn uniformly between 0 and this sum. If the drawn value happens to be

1/2 + 1/3

then, lining up the words, their topics, and their probabilities:

A    B    C    D
1    2    2    3
1/2  1/3  1/3  1/4

subtracting the probabilities of the words up through B exhausts the drawn value, so topic "2" (the topic of B) is the topic obtained by this sample.
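Tracing this example in code: the dict below encodes the word/topic/probability table, and fixing the draw to 1/2 + 1/3 reproduces the walk above (the `draw` parameter is added here purely for illustration).

```python
import random

def sample_one(probs, draw=None):
    """Cumulative-sum sampling; the draw can be fixed for illustration."""
    z = sum(probs.values())
    remaining = random.uniform(0, z) if draw is None else draw
    for k, v in probs.items():
        remaining -= v
        if remaining <= 0:
            return k

# word -> (topic, probability), from the table above
words = {"A": (1, 1/2), "B": (2, 1/3), "C": (2, 1/3), "D": (3, 1/4)}
probs = {w: p for w, (t, p) in words.items()}

# A draw of exactly 1/2 + 1/3 is used up while subtracting B's mass,
# so the sampled word is B, whose topic is 2.
picked = sample_one(probs, draw=1/2 + 1/3)
print(picked, words[picked][0])  # -> B 2
```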
The method used here is Gibbs sampling.
Gibbs sampling is a method for generating samples that follow a given distribution; which distribution to use depends on the problem you want to solve.
In this case the joint probability distribution $P(X, Y)$ is given, but sampling from it directly is hard because two variables are involved at once.
Therefore, sampling is performed from the conditional distributions instead. In short, you just repeat the following two steps:

Fix the string and sample the topic
Fix the topic and sample the string
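The alternating scheme can be seen on a toy problem before the topic-model version. The joint distribution below is made up purely for illustration; `sample_one` is the same cumulative-sum sampler shown earlier. Each sweep fixes one variable and samples the other from its conditional, and the empirical frequencies approach the joint.

```python
import random

# Toy joint P(X, Y) over two binary variables (numbers made up for illustration).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def sample_one(probs):
    """Cumulative-sum sampling from an unnormalized distribution."""
    z = sum(probs.values())
    r = random.uniform(0, z)
    for k, v in probs.items():
        r -= v
        if r <= 0:
            return k

def cond_x(y):
    # P(X | Y=y) is proportional to the joint with y fixed
    return {x: joint[(x, y)] for x in (0, 1)}

def cond_y(x):
    # P(Y | X=x) is proportional to the joint with x fixed
    return {y: joint[(x, y)] for y in (0, 1)}

random.seed(0)
x, y = 0, 0
counts = {key: 0 for key in joint}
for _ in range(20000):
    x = sample_one(cond_x(y))  # fix y, sample x
    y = sample_one(cond_y(x))  # fix x, sample y
    counts[(x, y)] += 1

# Empirical frequencies approach the joint, e.g. counts[(1, 1)]/20000 ≈ 0.4
for key in joint:
    print(key, counts[key] / 20000.0)
```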
Sampling for the topic model itself works as follows.
Quoted from pages 16-19 of NLP Programming Tutorial 7 - Topic Model http://www.phontron.com/slides/nlp-programming-ja-07-topic.pdf (accessed 2015-06-25)

Remove the current string/topic pair from the counts and recompute the probabilities
Multiply the topic probability by the word probability to compute the joint probability of each topic
Sample one topic from the updated joint distribution and update the counts with the sampled word and topic

Since many counts fall to 0, smoothing is applied.
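The smoothing used here is additive: alpha is spread over a vocabulary of size N and beta over the K topics, so a count that has fallen to 0 still yields a small nonzero probability. A sketch of the two smoothed probabilities (the function and variable names below are mine, for illustration, not the tutorial's):

```python
# Additive smoothing: a count that has fallen to 0 still yields a small,
# nonzero probability. N = vocabulary size, K = number of topics.
def p_word_given_topic(xcounts, word, k, alpha, N):
    return (xcounts.get((word, k), 0) + alpha) / (xcounts.get(k, 0) + alpha * N)

def p_topic_given_doc(ycounts, k, docid, beta, K):
    return (ycounts.get((k, docid), 0) + beta) / (ycounts.get(docid, 0) + beta * K)

# Even a completely unseen (word, topic) pair keeps a small probability:
p = p_word_given_topic({}, "unseen", 1, alpha=0.01, N=1000)
print(p)  # about 0.001, never exactly 0
```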
Initialize and define the required values.

Definition of the __init__ part:

Words and topics in the document corpus: self.xcorpus = numpy.array([]), self.ycorpus = numpy.array([])
Counts of words and topics: self.xcounts = {}, self.ycounts = {}
Topic vector: self.topics_vector = numpy.array([])
Number of topics: self.TOPICS = 7
Document id: self.docid = 1
Distinct words: self.different_word (the vocabulary, filled in from the corpus)

In the initialization step, each word is assigned a random initial topic. This random assignment is one place that could be refined, for example by using a conjugate prior.
#-*- coding:utf-8 -*-
__author__ = 'ohgushimasaya'
from numpy import *
from numpy.random import *
import numpy
from Add_Count import add_count
import os.path

class initilaze_topic_model:
    def __init__(self):
        self.xcorpus = numpy.array([])
        self.ycorpus = numpy.array([])
        self.xcounts = {}
        self.ycounts = {}
        self.topics_vector = numpy.array([])
        self.TOPICS = 7
        self.docid = 1
        self.different_word = set()

    def initilize(self):
        first_time = 1
        adder = add_count(self.xcounts, self.ycounts)
        self.docid = os.path.getsize("07-train.txt")
        for line in open("07-train.txt", "r"):
            rline = line.rstrip("\n")          # "¥n" in the original is a mis-encoded "\n"
            words = numpy.array(rline.split(" "))
            topics_vector = []
            self.different_word |= set(words)  # accumulate the vocabulary over all lines
            for word in words:
                topic = randint(self.TOPICS) + 1   # random initial topic, 1..TOPICS
                topics_vector.append(topic)
                adder.add_counter(word, topic, self.docid, 1)
            array_topics_vector = numpy.array(topics_vector)
            if first_time == 1:
                self.xcorpus = numpy.hstack((self.xcorpus, words))
                self.ycorpus = numpy.hstack((self.ycorpus, array_topics_vector))
                first_time = first_time + 1
            else:
                self.xcorpus = numpy.vstack((self.xcorpus, words))
                self.ycorpus = numpy.vstack((self.ycorpus, array_topics_vector))
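One detail of the corpus construction above: the first line is stored with hstack as a 1-D array, and subsequent lines are stacked as rows with vstack, which only works when every line has the same number of words. A minimal illustration:

```python
import numpy

# The first line is added with hstack (giving a 1-D array); later lines are
# added with vstack (stacking rows). vstack requires every line to have the
# same number of words, a limitation of this representation.
xcorpus = numpy.array([])
line1 = numpy.array(["a", "b", "c"])
line2 = numpy.array(["d", "e", "f"])

xcorpus = numpy.hstack((xcorpus, line1))  # 1-D: shape (3,)
xcorpus = numpy.vstack((xcorpus, line2))  # 2-D: shape (2, 3)
print(xcorpus.shape)  # -> (2, 3)
```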
__author__ = 'ohgushimasaya'

class add_count:
    def __init__(self, xcounts, ycounts):
        self.xcounts = xcounts
        self.ycounts = ycounts

    def add_counter(self, word, topic, docid, amount):
        # Word counts: c(topic) and c(word, topic)
        self.xcounts = add_count.check_dict(topic, self.xcounts, amount)
        self.xcounts = add_count.check_dict((word, topic), self.xcounts, amount)
        # Topic counts: c(docid) and c(topic, docid)
        self.ycounts = add_count.check_dict(docid, self.ycounts, amount)
        self.ycounts = add_count.check_dict((topic, docid), self.ycounts, amount)

    @staticmethod
    def check_dict(key, w_t_count, amount):
        # dict.has_key was removed in Python 3; use "in" instead
        if key in w_t_count:
            w_t_count[key] += amount
        else:
            w_t_count[key] = amount   # was hard-coded to 1; must honor amount (can be -1)
        return w_t_count
This class counts words and topics: it maintains the count of each word given a topic, and the count of each topic given a document id.
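To see what add_counter maintains, here is a minimal stand-alone mirror of its behavior (a simplified illustration, not the class itself). Each update touches four entries: c(topic), c(word, topic), c(docid), and c(topic, docid).

```python
# Simplified stand-alone mirror of add_counter (illustration only).
def add_counter(xcounts, ycounts, word, topic, docid, amount):
    for counts, key in ((xcounts, topic), (xcounts, (word, topic)),
                        (ycounts, docid), (ycounts, (topic, docid))):
        counts[key] = counts.get(key, 0) + amount

xcounts, ycounts = {}, {}
add_counter(xcounts, ycounts, "cat", 2, 0, 1)
print(xcounts)  # {2: 1, ('cat', 2): 1}
print(ycounts)  # {0: 1, (2, 0): 1}

# amount=-1 removes the pair again, so every count returns to 0,
# which is exactly what the sampler does before re-sampling a topic.
add_counter(xcounts, ycounts, "cat", 2, 0, -1)
print(xcounts[2], ycounts[0])  # 0 0
```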
__author__ = 'ohgushimasaya'
from numpy import *
from numpy.random import *
import numpy
import random
from Add_Count import add_count
import os.path

class Sampling:
    def __init__(self, xcorpus, ycorpus):
        self.iteration = 1000
        self.xcorpus = xcorpus
        self.ycorpus = ycorpus
        self.alpha = 0.01
        self.beta = 0.03

    def sampling(self, TOPICS, xcounts, ycounts, docId, different_word):
        for it in range(0, self.iteration):
            self.sampler(TOPICS, xcounts, ycounts, docId, different_word)

    def sampler(self, TOPICS, xcounts, ycounts, docId, different_word):
        ll = 0
        adder = add_count(xcounts, ycounts)
        for i in range(0, len(self.xcorpus)):
            for j in range(0, len(self.xcorpus[i])):
                x = self.xcorpus[i][j]
                y = self.ycorpus[i][j]
                # remove the current word/topic pair; docId must be the same
                # id that was used when the counts were first added
                adder.add_counter(x, y, docId, -1)
                probs = {}  # reset for every word, not once per sweep
                # topics are numbered 1..TOPICS at initialization
                for k in range(1, TOPICS + 1):
                    # smoothed probabilities; the parentheses matter and were
                    # misplaced by operator precedence in the original
                    p_x_k = (xcounts.get((x, k), 0) + self.alpha) / \
                            (xcounts.get(k, 0) + self.alpha * len(different_word))
                    p_k_d = (ycounts.get((k, docId), 0) + self.beta) / \
                            (ycounts.get(docId, 0) + self.beta * TOPICS)
                    probs[k] = p_x_k * p_k_d
                new_y = Sampling.sampleOne(probs)
                ll = ll + log(probs[new_y])
                adder.add_counter(x, new_y, docId, 1)
                self.ycorpus[i][j] = new_y
        print(ll)

    @staticmethod
    def sampleOne(probs):
        z = 0
        for k, v in probs.items():
            z = z + v
        remaining = random.uniform(0, z)
        for k, v in probs.items():
            remaining = remaining - v
            if remaining <= 0:
                return k
The full code is available here:
https://github.com/SnowMasaya/TOPIC_MODEL