I tried to implement PLSA in Python

Overview

--I implemented PLSA (Probabilistic Latent Semantic Analysis) in Python.

--Speeding up and error handling (such as log (0) countermeasures) will be done at a later date.

What is PLSA

One of the clustering methods,

  1. Document d is selected by P (d)
  2. Topic z is selected by P (z | d)
  3. The word w is generated by P (w | z)

It is a model called. In the formula

P(d,w) = P(d)\sum_{z}P(z|d)P(w|z)

However, it is slightly deformed

P(d,w) = \sum_{z}P(z)P(d|z)P(w|z)

I often deal with.

Log-likelihood for this model

L = \sum_{d}\sum_{w}N(d,w)\log P(d,w)

Is the maximumP(z), P(d|z), P(w|z)Is calculated using the EM algorithm. N (d, w) is the number of times the word w appears in document d.

EM algorithm

Repeat the following E and M steps until the log-likelihood converges.

--E step

P(z|d,w) = \frac{P\left( z \right)P\left( d | z \right)P\left( w | z \right)}{\sum_{z} P\left( z \right)P\left( d | z \right)P\left( w | z \right)}

--M step

\begin{eqnarray}
P\left( z \right) & = & \frac{\sum_{d} \sum_{w} N_{d, w} P\left( z | d, w \right)}{\sum_{d} \sum_{w} N_{d, w}} \\
P\left( d | z \right) & = & \frac{\sum_{w} N_{d, w} P \left( z | d, w \right)}{\sum_{d} \sum_{w} N_{d, w} P \left( z | d, w \right)} \\
P\left( w | z \right) & = & \frac{\sum_{d} N_{d, w} P \left( z | d, w \right)}{\sum_{d} \sum_{w} N_{d, w} P \left( z | d, w \right)}
\end{eqnarray}

Implementation

With the idea that it can be used for other than documents and words,

P(x,y) = \sum_{z}P(z)P(x|z)P(y|z)

It is said.

plsa.py


import numpy as np


class PLSA(object):
    def __init__(self, N, Z):
        self.N = N
        self.X = N.shape[0]
        self.Y = N.shape[1]
        self.Z = Z

        # P(z)
        self.Pz = np.random.rand(self.Z)
        # P(x|z)
        self.Px_z = np.random.rand(self.Z, self.X)
        # P(y|z)
        self.Py_z = np.random.rand(self.Z, self.Y)

        #Normalization
        self.Pz /= np.sum(self.Pz)
        self.Px_z /= np.sum(self.Px_z, axis=1)[:, None]
        self.Py_z /= np.sum(self.Py_z, axis=1)[:, None]

    def train(self, k=200, t=1.0e-7):
        '''
Repeat steps E and M until the log-likelihood converges
        '''
        prev_llh = 100000
        for i in xrange(k):
            self.e_step()
            self.m_step()
            llh = self.llh()

            if abs((llh - prev_llh) / prev_llh) < t:
                break

            prev_llh = llh

    def e_step(self):
        '''
E step
        P(z|x,y)Update
        '''
        self.Pz_xy = self.Pz[None, None, :] \
            * self.Px_z.T[:, None, :] \
            * self.Py_z.T[None, :, :]
        self.Pz_xy /= np.sum(self.Pz_xy, axis=2)[:, :, None]

    def m_step(self):
        '''
M step
        P(z), P(x|z), P(y|z)Update
        '''
        NP = self.N[:, :, None] * self.Pz_xy

        self.Pz = np.sum(NP, axis=(0, 1))
        self.Px_z = np.sum(NP, axis=1).T
        self.Py_z = np.sum(NP, axis=0).T

        self.Pz /= np.sum(self.Pz)
        self.Px_z /= np.sum(self.Px_z, axis=1)[:, None]
        self.Py_z /= np.sum(self.Py_z, axis=1)[:, None]

    def llh(self):
        '''
Log likelihood
        '''
        Pxy = self.Pz[None, None, :] \
            * self.Px_z.T[:, None, :] \
            * self.Py_z.T[None, :, :]
        Pxy = np.sum(Pxy, axis=2)
        Pxy /= np.sum(Pxy)

        return np.sum(self.N * np.log(Pxy))

I use it like this. Examples are X = 5, Y = 4, Z = 2.

import numpy as np
from plsa import PLSA


N = np.array([
    [20, 23, 1, 4],
    [25, 19, 3, 0],
    [2, 1, 31, 28],
    [0, 1, 22, 17],
    [1, 0, 18, 24]
])

plsa = PLSA(N, 2)
plsa.train()

print 'P(z)'
print plsa.Pz
print 'P(x|z)'
print plsa.Px_z
print 'P(y|z)'
print plsa.Py_z
print 'P(z|x)'
Pz_x = plsa.Px_z.T * plsa.Pz[None, :]
print Pz_x / np.sum(Pz_x, axis=1)[:, None]
print 'P(z|y)'
Pz_y = plsa.Py_z.T * plsa.Pz[None, :]
print Pz_y / np.sum(Pz_y, axis=1)[:, None]

result


P(z)
[ 0.3904831  0.6095169]
P(x|z)
[[  4.62748437e-01   5.01515512e-01   2.44650515e-02   1.11544339e-02   1.16565226e-04]
 [  3.16718954e-02   1.99625810e-09   4.08159551e-01   2.66294583e-01   2.93873968e-01]]
P(y|z)
[[  4.92518417e-01   4.69494935e-01   3.79855484e-02   1.09954245e-06]
 [  1.25999484e-02   5.73485868e-06   4.88365926e-01   4.99028390e-01]]
P(z|x)
[[  9.03477222e-01   9.65227775e-02]
 [  9.99999994e-01   6.21320706e-09]
 [  3.69800871e-02   9.63019913e-01]
 [  2.61337075e-02   9.73866292e-01]
 [  2.54046982e-04   9.99745953e-01]]
P(z|y)
[[  9.61600593e-01   3.83994074e-02]
 [  9.99980934e-01   1.90663270e-05]
 [  4.74646871e-02   9.52535313e-01]
 [  1.41157067e-06   9.99998588e-01]]

Impressions

――Broadcasting of Numpy is convenient.

Recommended Posts

I tried to implement PLSA in Python
I tried to implement PLSA in Python 2
I tried to implement permutation in Python
I tried to implement ADALINE in Python
I tried to implement PPO in Python
I tried to implement TOPIC MODEL in Python
I tried to implement selection sort in python
I tried to implement a pseudo pachislot in Python
I tried to implement Dragon Quest poker in Python
I tried to implement GA (genetic algorithm) in Python
I tried to implement a one-dimensional cellular automaton in Python
I tried to implement the mail sending function in Python
I tried to implement blackjack of card game in Python
I tried to implement PCANet
I tried to implement a misunderstood prisoner's dilemma game in Python
I tried to implement StarGAN (1)
I tried to implement Bayesian linear regression by Gibbs sampling in python
I tried to implement a card game of playing cards in Python
I tried to graph the packages installed in Python
I want to easily implement a timeout in python
I tried to implement Minesweeper on terminal with python
I tried to implement an artificial perceptron with python
I tried to summarize how to use pandas in python
I tried to implement Deep VQE
I tried to touch Python (installation)
I tried to implement adversarial validation
I tried to implement hierarchical clustering
I tried to implement Realness GAN
I tried Line notification in Python
I tried to implement what seems to be a Windows snipping tool in Python
I tried to create API list.csv in Python from swagger.yaml
I tried "How to get a method decorated in Python"
I tried to make a stopwatch using tkinter in python
I tried to implement Autoencoder with TensorFlow
Python3 standard input I tried to summarize
I tried using Bayesian Optimization in Python
I wanted to solve ABC159 in Python
I tried to implement CVAE with PyTorch
[Python] I tried to calculate TF-IDF steadily
I tried to touch Python (basic syntax)
[Python] I tried to implement stable sorting, so make a note
I tried Python> autopep8
Implement recommendations in Python
Implement XENO in python
I tried to debug.
I tried to paste
Implement sum in Python
I tried Python> decorator
Implement Traceroute in Python 3
I tried to implement reading Dataset with PyTorch
I want to do Dunnett's test in Python
Try to implement Oni Maitsuji Miserable in python
How to implement Discord Slash Command in Python
I was able to recurse in Python: lambda
I want to create a window in Python
I tried playing a typing game in Python
How to implement shared memory in Python (mmap.mmap)
I tried to integrate with Keras in TFv1.1
I tried simulating the "birthday paradox" in Python
I tried the least squares method in Python
I wrote "Introduction to Effect Verification" in Python