[PYTHON] Comparison of LDA implementations

(This is an article I drafted about a year ago and forgot to publish; I am posting it now to clear out my drafts.)

I tried out several LDA implementations for comparison. The options seem to be those listed below; this post summarizes, in memo form, the procedure for running LDA with each of them.

| Implementation | Language | Usage pattern | Scalability | Algorithm |
|---|---|---|---|---|
| Spark MLlib 1.6.0 | Scala, Java, Python, R | Library | Parallel / Distributed | Variational Bayes, EM |
| gensim | Python | Library | Parallel | Variational Bayes |
| GibbsLDA++ | C++ | Shell command | - | Gibbs Sampling |
| R | R | Library | - | Gibbs Sampling |

Execution cost

I ran each implementation and measured the time, but as noted below the execution conditions differ considerably, so please don't read too much into the numbers.

The input is 13,033 documents containing 20,780 words (morphologically analyzed, with unnecessary morphemes removed), and each implementation is run with $k = 100$ and $iter = 50$.

Spark MLlib

| Algorithm | Nodes | Cores | Execution time |
|---|---|---|---|
| EM | 5 | 80 | 224.092 |
| EM | 1 | 8 | 81.854 |
| EM | 1 | 1 | 112.606 |
| Variational Bayes | 1 | 8 | 220.147 |
| Variational Bayes | 1 | 1 | 310.367 |

The execution times are worse than the others because they also include reading the local file into an RDD and building the bag-of-words. The distributed run being the slowest is probably because the cost of shuffling and transferring data between nodes outweighs the computation cost of LDA itself.
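For reference, a minimal PySpark sketch of what an MLlib 1.6 LDA run looks like (the input file docs.counts and its one-document-per-line, dense term-count format are illustrative assumptions, not the exact script used for the measurements above):

# -*- coding: utf-8 -*-
# Sketch of LDA with Spark MLlib 1.6 (PySpark)
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda-comparison")

# docs.counts (hypothetical file): one document per line, dense term counts over the vocabulary
data = sc.textFile("docs.counts")
vectors = data.map(lambda line: Vectors.dense([float(x) for x in line.split()]))
# LDA.train expects an RDD of [document id, term-count vector]
corpus = vectors.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# optimizer="em" for the EM optimizer, "online" for online variational Bayes
model = LDA.train(corpus, k=100, maxIterations=50, optimizer="em")

print("Learned topics over a vocabulary of %d words" % model.vocabSize())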

gensim

| Algorithm | Nodes | Cores | Execution time |
|---|---|---|---|
| Variational Bayes | 1 | 4 | 15.396 |
| Variational Bayes | 1 | 1 | 20.576 |

gensim appears to offer variational Bayes only, but with multiple workers and parallel processing it is fast and seems to strike a good balance of performance overall.

GibbsLDA++

| Algorithm | Nodes | Cores | Execution time |
|---|---|---|---|
| Gibbs Sampling | 1 | 1 | 58.993 |

Around $k = 10$ it was faster than gensim even though it is single-threaded and includes local file input/output, but it seems to get slower as $k$ grows. It probably does not scale well, but it is handy for small ad hoc runs.

R language

| Algorithm | Nodes | Cores | Execution time |
|---|---|---|---|
| Collapsed Gibbs Sampling | 1 | 1 | 24.247 |

Surprisingly, the computation itself in R's lda.collapsed.gibbs.sampler {lda} was faster than GibbsLDA++ (although reading the local file and converting it into a format that can be passed to the library was slow, so the total time is not great). Rather than raw practical speed, the appeal of R is that you can try many implementations and algorithms, such as lda {lda} and [lda.cvb0 {lda}](http://www.inside-r.org/packages/cran/lda/docs/lda.cvb0).

Try using each implementation

gensim

# -*- coding: utf-8 -*-
# LDA sample using Python gensim
from gensim import corpora, models
import time

# Read the document data (a text file with one document per line; each line is the
# nouns extracted by morphological analysis, separated by spaces)
texts = []
for line in open('docs.gibbs', 'r'):
    texts.append(line.split())

# Create the dictionary (id : word : number of appearances)
dictionary = corpora.Dictionary(texts)
dictionary.save_as_text('./docs.dic')
# dictionary = corpora.Dictionary.load_from_text("./docs.dic")

# Create the corpus (bag-of-words)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('corpus.mm', corpus)
# corpus = corpora.MmCorpus('corpus.mm')

# LDA computation (single process)
t0 = int(time.time() * 1000)
lda = models.ldamodel.LdaModel(corpus, num_topics=100, iterations=50)
t1 = int(time.time() * 1000)
print(t1 - t0)

# LDA computation (multi-core version)
t0 = int(time.time() * 1000)
lda = models.ldamulticore.LdaMulticore(corpus, num_topics=10, iterations=50, workers=4)
t1 = int(time.time() * 1000)
print(t1 - t0)
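To inspect the result, gensim's show_topics() can list the top words per topic (a quick check I am adding here, not part of the original measurement; note that words are printed as numeric IDs unless id2word=dictionary is passed to LdaModel):

# Display the top 5 words of 5 topics from the trained model
for topic in lda.show_topics(num_topics=5, num_words=5):
    print(topic)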

GibbsLDA++

GibbsLDA++ and its Java port JGibbLDA are command-line LDA implementations. It implements Gibbs sampling, but does not appear to support parallel (multi-core) or distributed processing.

Build

Download GibbsLDA++-0.2.tar.gz from the GibbsLDA++: A C/C++ Gibbs Sampling LDA project page, extract it, and build.

$ tar zxvf GibbsLDA++-0.2.tar.gz
$ cd GibbsLDA++-0.2/
$ make all

If g++ is available, make all completes the build and produces the src/lda command.

If `error: 'atof' was not declared in this scope` or `error: 'printf' was not declared in this scope` occurs, add `#include` lines to the two files as follows, referring to [this article](http://yuutookun.hatenablog.com/entry/20120831/1346394002).

$ vim src/util.cpp
...
#include <stdio.h>
#include <stdlib.h>  // added
#include <string>
...
$ vim src/lda.cpp
...
*/

#include <stdio.h>  // added
#include "model.h"
...
$ make clean; make all
make -C src/ -f Makefile clean
make[1]: Entering directory `/export/home/t_takami/git/gibbslda++/GibbsLDA++-0.2/src'
...
make[1]: Leaving directory `/export/home/t_takami/git/gibbslda++/GibbsLDA++-0.2/src'

$ src/lda --help
Please specify the task you would like to perform (-est/-estc/-inf)!
Command line usage:
        lda -est -alpha <double> -beta <double> -ntopics <int> -niters <int> -savestep <int> -twords <int> -dfile <string>
        lda -estc -dir <string> -model <string> -niters <int> -savestep <int> -twords <int>
        lda -inf -dir <string> -model <string> -niters <int> -twords <int> -dfile <string>

Data preparation

Create a file whose first line is the number of documents, followed by one document per line, with the words contained in each document separated by spaces. If there are blank lines (documents with no feature words), the error `Invalid (empty) document!` occurs.

$ head -n 5 docs.gibbs
13033
Storage Request Select Card Card Case Answer Method Method Possible Internal Model Capacity Recording Recording Recording Recording Set Time Save Save Save Save
The world, the world, the world, the important things, the common sense, the common sense, the common sense, the common sense, the common sense, the self, the self
Character clothes operation game Blonde character model Silver hair
Best Best Delusion Date Date
Phone Phone Phone Farewell Real Present Present Like Present Like Please Please Tears Tears Tears Tears Feeling Narita Conversation Negative Eyes Relationship Room Love Uruguay Uruguay Uruguay Uruguay Owl Final Feelings Behind the Scenes…
…

Execution result

Run it from the command line, specifying the number of topics $k$ and the number of iterations.

$ src/lda -est -niters 50 -ntopics 10 -twords 5 -dfile docs.gibbs
Sampling 50 iterations!
Iteration 1 ...
...
Iteration 50 ...
Gibbs sampling completed!
Saving the final model!

When execution finishes, several output files are created.

model-final.others contains the parameters used when this model was created, and wordmap.txt contains the ID assigned to each word.

$ cat model-final.others
alpha=5.000000
beta=0.100000
ntopics=10
ndocs=13033
nwords=20779
liter=50

$ cat wordmap.txt
20779
T-shirt 4601
Ai 1829
Aiko 19897
Greeting 2125
...

model-final.twords lists the most characteristic words of each topic, up to the number specified with the -twords option (the file is not created if -twords is omitted).

$ cat model-final.twords
Topic 0th:
Person 0.047465
Feeling 0.019363
Qi 0.018178
Other 0.016968
Normal 0.016004
Topic 1th:
Job 0.043820
Time 0.024824
Home 0.019440
Now 0.017962
Mother 0.016881
Topic 2th:
If 0.033522
Month 0.018820
Company 0.018083
Insurance 0.015252
Request 0.012468
…

model-final.phi holds the word distribution of each topic: one row per topic, with as many columns as there are words in the corpus, i.e. a matrix of $k$ rows by (total number of words) columns. This is the numerical data behind model-final.twords.

$ head -c 100 model-final.phi
0.000002 0.000002 0.000002 0.000002 0.000799 0.000002 0.000002 0.002415 0.000002 0.000002 0.000002 0

model-final.theta holds the topic distribution of each document: one row per document and one column per topic, i.e. a matrix of (number of documents) rows by $k$ columns.

$ head -n 5 model-final.theta
0.081081 0.216216 0.067568 0.067568 0.162162 0.081081 0.067568 0.108108 0.081081 0.067568
0.076923 0.076923 0.123077 0.092308 0.076923 0.076923 0.107692 0.076923 0.076923 0.215385
0.086207 0.103448 0.103448 0.086207 0.137931 0.086207 0.103448 0.086207 0.086207 0.120690
0.090909 0.090909 0.109091 0.127273 0.090909 0.090909 0.090909 0.090909 0.090909 0.127273
0.035971 0.028777 0.111511 0.323741 0.050360 0.086331 0.248201 0.039568 0.028777 0.046763

model-final.tassign is the original data with topic assignments, one document per line, in the form word_id:topic.

$ head -n 5 model-final.tassign
0:1 1:4 2:1 3:7 3:7 4:8 4:7 5:5 6:1 6:1 7:4 8:1 9:4 10:1 11:4 11:4 11:4 11:4 12:1 13:0 14:1 14:1 14:1 14:1
15:9 15:9 15:9 16:3 17:6 18:6 19:2 19:9 19:2 19:2 19:9 19:9 20:9 20:9 20:9
21:2 22:9 23:1 24:4 25:6 26:9 9:4 27:4
28:9 28:2 29:9 30:3 30:3
31:4 31:6 31:6 32:3 33:2 34:3 34:3 35:3 35:3 1:5 1:2 36:3 36:3 36:3 37:6 38:2 39:6 40:6 41:6 42:3 42:6 43:6 44:3 …

GibbsLDA++ only takes you as far as the numerical model, so apart from twords you have to map the results back to documents and words yourself.
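As an illustration, here is a minimal Python sketch (my own post-processing, not part of GibbsLDA++) that maps model-final.phi back to words via wordmap.txt, using the output files shown above:

# Map GibbsLDA++ output back to words: top 5 words per topic from model-final.phi
id2word = {}
with open('wordmap.txt') as f:
    next(f)  # the first line is the vocabulary size
    for line in f:
        word, wid = line.split()
        id2word[int(wid)] = word

with open('model-final.phi') as f:
    for topic, line in enumerate(f):
        probs = [float(x) for x in line.split()]
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:5]
        print('Topic %d: %s' % (topic, ' '.join(id2word[i] for i in top)))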

R language

Personally, I find R hard to get used to as a language and tend to shy away from it, but looking at [lda.collapsed.gibbs.sampler {lda}](http://www.inside-r.org/packages/cran/lda/docs/lda.collapsed.gibbs.sampler) you can see that several Collapsed Gibbs Sampling models are implemented.

It doesn't seem to support parallel processing (multi-core support) or distributed processing.

Installation

First, install R on CentOS 7.2 (if the rpm command below returns a 404, drop the file name from the URL and look for the appropriate version).

$ sudo rpm -ihv http://ftp.riken.jp/Linux/fedora/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
$ sudo yum install R

After R is installed, launch the interactive interface and install the lda package. It first tries to install into /usr/lib64/R/library, but since I am not the root user I specify a personal library instead. You will also be asked for a download source; select the Japan (Tokyo) CRAN mirror.

Below, reshape2 and ggplot2 are also installed in order to run the demo.

$ R
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
...
> install.packages("lda")
...
Would you like to use a personal library instead?  (y/n) y
...
Selection: 13
...
> install.packages("reshape2")
> install.packages("ggplot2")
> require("lda")
> demo(lda)

The demo does not keep up with a specification change in ggplot2, so the error `stat_count() must not be used with a y aesthetic.` occurs and the graph is not displayed. If you fix the qplot() part and run it manually it works, but since the LDA computation itself has finished, I ignored this and moved on.

> top.topic.words(result$topics, 5)
     [,1]         [,2]        [,3]        [,4]       [,5]       [,6]
[1,] "algorithm"  "learning"  "model"     "learning" "neural"   "learning"
[2,] "model"      "algorithm" "report"    "neural"   "networks" "network"
[3,] "algorithms" "paper"     "models"    "networks" "network"  "feature"
[4,] "data"       "examples"  "bayesian"  "paper"    "learning" "model"
[5,] "results"    "number"    "technical" "network"  "research" "features"
     [,7]        [,8]          [,9]        [,10]
[1,] "knowledge" "problem"     "paper"     "learning"
[2,] "system"    "genetic"     "decision"  "reinforcement"
[3,] "reasoning" "control"     "algorithm" "paper"
[4,] "design"    "performance" "results"   "problem"
[5,] "case"      "search"      "method"    "method"

The results come out as expected.

Data preparation

Prepare data files corresponding to the vocab and documents arguments of [lda.collapsed.gibbs.sampler {lda}](http://www.inside-r.org/packages/cran/lda/docs/lda.collapsed.gibbs.sampler).

Since vocab represents the vocabulary of the corpus, first build a list of the $N$ words (as strings) appearing across all documents. Each word is then represented by its index $i$ in vocab:

{\tt vocab} = \{w_0, w_1, ..., w_{N-1}\} \\
{\tt vocab[}i{\tt]} = w_i

As the data file, create corpus_r.txt with one word per line, collecting the words of all documents without duplication.

$ head -n 5 corpus_r.txt
Mediation
capital
person
Drip
Lanthanoid
$ wc -l corpus_r.txt
20778 corpus_r.txt

documents is a list of documents $d_k$. Each document $d_k$ is represented by a $2 \times n$ matrix whose first row contains the indices $i_{k,j}$ of the words appearing in the document and whose second row contains the number of occurrences $c_{k,j}$ of each of those words in the document.

{\tt documents} = \{ d_0, d_1, ..., d_k, ..., d_{m-1} \} \\
d_k = \begin{pmatrix}
i_{k,0} & i_{k,1} & ... & i_{k,n-1} \\
c_{k,0} & c_{k,1} & ... & c_{k,n-1}
\end{pmatrix}

As the data file, prepare bow_r.txt with one document per line, where each entry is the word's index (starting from 0) and its number of occurrences, separated by a colon.

$ head -n 5 bow_r.txt
74:1    1109:1  1788:1  7000:2  10308:2 10552:1 12332:2 13489:1 14996:1 15448:1 15947:1 16354:4 17577:1 18262:1 19831:4
3256:3  5278:1  9039:1  12247:1 14529:6 17026:3
2181:1  4062:1  6270:1  6508:1  7405:1  8662:1  15448:1 18905:1
8045:2  9323:1  5934:2
288:3   624:1   691:1   820:2   1078:2  1109:2  1148:3  1251:1  2025:1  2050:1  2072:1  2090:1  2543:2  2626:1  2759:1…
$ wc -l bow_r.txt
13017 bow_r.txt
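For reference, a minimal Python sketch that generates corpus_r.txt and bow_r.txt (my own preprocessing, assuming the first line of docs.gibbs is the document count as shown in the GibbsLDA++ section, and skipping empty documents):

# Build corpus_r.txt (one word per line) and bow_r.txt (tab-separated index:count) from docs.gibbs
from collections import Counter

lines = open('docs.gibbs').read().splitlines()
docs = [line.split() for line in lines[1:]]   # skip the leading document-count line
docs = [d for d in docs if d]                 # drop empty documents

vocab = {}                                    # word -> 0-based index, in order of first appearance
for d in docs:
    for w in d:
        if w not in vocab:
            vocab[w] = len(vocab)

with open('corpus_r.txt', 'w') as f:
    for w in sorted(vocab, key=vocab.get):
        f.write(w + '\n')

with open('bow_r.txt', 'w') as f:
    for d in docs:
        counts = Counter(d)
        f.write('\t'.join('%d:%d' % (vocab[w], c) for w, c in counts.items()) + '\n')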

Execution result

Run it with the following code. More of the work goes into parsing the data files and building the expected data structures than into the LDA processing itself (someone more comfortable with R could probably do it more concisely).

require("lda")

vocab <- as.vector(read.table("corpus_r.txt", header=FALSE)$V1)

file <- file("bow_r.txt", "r")
docs <- list()
repeat {
  line <- readLines(con=file, 1)
  if(length(line) == 0) break
  doc <- NULL
  for(tc in strsplit(line, "\t")[[1]]){
    col <- c()
    for(c in strsplit(tc, ":")[[1]]){
      col <- c(col, as.integer(c))
    }
    if(is.null(doc))  doc <- cbind(col)
    else              doc <- cbind(doc, col)
  }
  docs <- append(docs, list(doc))
}
close(file)

result = lda.collapsed.gibbs.sampler(docs, 100, vocab, 50, 0.1, 0.1)

top.topic.words(result$topics, 5)

Running top.topic.words() lists the words that characterize each of the $k = 100$ topics.

     [,1]   [,2]     [,3]     [,4]     [,5]     [,6]   [,7]       [,8]
[1,] "problem" "period"   "Ichiban"   "Eye"     "room"   "Issue" "Idol" "box"
[2,] "Congressman" "Please" "weather"   "Lock" "House"     "from now on" "fan"   "part"
[3,] "society" "trial"   "China"   "rotation"   "toilet" "Temple" "Weird"       "sticker"
[4,] "religion" "everyone" "mistake" "Unnecessary"   "water"     "Impossible" "Interest"     "varnish"
[5,] "Principle" "Full-time employee" "All"   "China"   "Washing"   "Contents" "Obedient"     "information"
     [,9]   [,10]      [,11]    [,12]      [,13]    [,14]      [,15]  [,16]
[1,] "Female" "Silver"       "image"   "site"   "Song"     "Please"   "circuit" "jobs"
[2,] "Man"   "Myopia"     "Attachment"   "Method"     "Please" "My"     "regiment" "Company"
[3,] "male" "grown up"     "Multiple"   "Registration"     "piano" "Reason"     "infantry" "boss"
[4,] "Man"   "snack" "Photo"   "domain" "musics"   "Sales"     "Current" "Man"
[5,] "woman"   "Picture book"     "Earth" "page"   "Feeling"   "number" "Photo" "workplace"
... (Continues up to 100 topics)
