(Publishing an article I wrote about a year ago and then forgot to post, to clear out my drafts.)
I tried out several LDA implementations for comparison. The options seem to be the ones below; this post summarizes, from my notes, the procedure for running LDA with each of them.
Implementation | Language | Usage pattern | Scalability | Algorithm |
---|---|---|---|---|
Spark MLlib 1.6.0 | Scala, Java, Python, R | Library | Parallel/Distributed | Variational Bayes, EM |
gensim | Python | Library | Parallel | Variational Bayes |
GibbsLDA++ | shell | Command | - | Gibbs Sampling |
R | R | Library | - | Gibbs Sampling |
I ran each of them and measured the execution time, but as noted below the execution conditions differ considerably, so please do not read too much into the comparison.
The input is $13,033$ documents containing $20,780$ words, morphologically analyzed with unneeded morphemes removed, run with $k = 100$ and $iter = 50$.
Spark MLlib
Algorithm | Nodes | Cores | Execution time (sec) |
---|---|---|---|
EM | 5 | 80 | 224.092 |
EM | 1 | 8 | 81.854 |
EM | 1 | 1 | 112.606 |
Variational Bayes | 1 | 8 | 220.147 |
Variational Bayes | 1 | 1 | 310.367 |
The execution times are slower than the others partly because they include reading the local file into an RDD and building the bag-of-words. The distributed run being the slowest is probably because the cost of shuffling and transferring data between nodes outweighs the computational cost of LDA itself.
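For reference, here is a minimal PySpark sketch of this kind of MLlib 1.6 run. The original Spark code is not shown in this article, so the corpus construction below, including the hypothetical input file docs.txt (one space-separated document per line), is my own assumption, not the setup that produced the numbers above.

```python
# -*- coding: utf-8 -*-
# Minimal MLlib LDA sketch (assumed setup; not the article's original Spark code)
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda-sample")

# Hypothetical input: one document per line, words separated by spaces
lines = sc.textFile("docs.txt").map(lambda l: l.split()).filter(lambda ws: len(ws) > 0)

# Word -> index mapping built on the driver (part of the BoW cost mentioned above)
vocab = lines.flatMap(lambda ws: ws).distinct().collect()
index = {w: i for i, w in enumerate(vocab)}

def to_bow(words):
    counts = {}
    for w in words:
        counts[index[w]] = counts.get(index[w], 0) + 1.0
    return Vectors.sparse(len(index), counts)

# MLlib expects an RDD of [document id, term-count vector]
corpus = lines.map(to_bow).zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# optimizer="em" or "online" (online variational Bayes), k=100, 50 iterations
model = LDA.train(corpus, k=100, maxIterations=50, optimizer="em")
```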
gensim
Algorithm | Nodes | Cores | Execution time (sec) |
---|---|---|---|
Variational Bayes | 1 | 4 | 15.396 |
Variational Bayes | 1 | 1 | 20.576 |
Python's gensim appears to offer variational Bayes only, but with multiple workers it parallelizes well and runs fast, giving a good overall balance of performance.
GibbsLDA++
Algorithm | Nodes | Cores | Execution time (sec) |
---|---|---|---|
Gibbs Sampling | 1 | 1 | 58.993 |
With single-threaded execution plus local file input/output it was faster than gensim around $k = 10$, but it seems to slow down as $k$ grows. Scaling it up may be tough, but it looks well suited to small ad hoc runs.
R (lda.collapsed.gibbs.sampler {lda})
Algorithm | Nodes | Cores | Execution time (sec) |
---|---|---|---|
Collapsed Gibbs Sampling | 1 | 1 | 24.247 |
The computation itself was surprisingly fast, faster than GibbsLDA++ (although reading the local file and converting it into the format the library expects was slow, so the total time ended up longer). Rather than raw speed, the appeal of R is that you can try many implementations and algorithms, such as lda {lda} and [lda.cvb0 {lda}](http://www.inside-r.org/packages/cran/lda/docs/lda.cvb0).
gensim
# -*- coding: utf-8 -*-
# LDA sample using Python gensim
from gensim import corpora, models
import time
# Read the document data (a text file with one document per line:
# nouns extracted by morphological analysis, words separated by spaces)
texts = []
for line in open('docs.gibbs', 'r'):
texts.append(line.split())
# Create the dictionary (id : word : number of occurrences)
dictionary = corpora.Dictionary(texts)
dictionary.save_as_text('./docs.dic')
# dictionary = corpora.Dictionary.load_from_text("./docs.dic")
# Create the corpus (bag-of-words)
corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('corpus.mm', corpus)
# corpus = corpora.MmCorpus('corpus.mm')
# LDA computation
t0 = int(time.time() * 1000)
lda = models.ldamodel.LdaModel(corpus, num_topics=100, iterations=50)
t1 = int(time.time() * 1000)
print t1 - t0
# LDA computation (multi-core version)
t0 = int(time.time() * 1000)
lda = models.ldamulticore.LdaMulticore(corpus, num_topics=10, iterations=50, workers=4)
t1 = int(time.time() * 1000)
print t1 - t0
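Incidentally (not part of the timing code above), the learned topics can be inspected with gensim's show_topics; the exact return format varies by gensim version, but something like the following should work:

```python
# Print the top words of a few topics from the trained model
# (num_topics / num_words values here are arbitrary examples)
for topic in lda.show_topics(num_topics=5, num_words=5):
    print topic
```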
GibbsLDA++
GibbsLDA++, and its Java port JGibbsLDA, is a command-line LDA implementation. It implements Gibbs sampling, but does not seem to support parallel (multi-core) or distributed processing.
Download GibbsLDA++-0.2.tar.gz from GibbsLDA++: A C/C++ Gibbs Sampling LDA, extract it, and build.
$ tar zxvf GibbsLDA++-0.2.tar.gz
$ cd GibbsLDA++-0.2/
$ make all
If `g++` is available, `make all` completes the build and produces the `src/lda` command.
If `error: 'atof' was not declared in this scope` or `error: 'printf' was not declared in this scope` occurs, add the missing `#include` lines to the two files as described [here](http://yuutookun.hatenablog.com/entry/20120831/1346394002).
$ vim src/util.cpp
...
#include <stdio.h>
#include <stdlib.h>  // added
#include <string>
...
$ vim src/lda.cpp
...
*/
#include <stdio.h>  // added
#include "model.h"
...
$ make clean; make all
make -C src/ -f Makefile clean
make[1]: Entering directory `/export/home/t_takami/git/gibbslda++/GibbsLDA++-0.2/src'
...
make[1]: Leaving directory `/export/home/t_takami/git/gibbslda++/GibbsLDA++-0.2/src'
$ src/lda --help
Please specify the task you would like to perform (-est/-estc/-inf)!
Command line usage:
lda -est -alpha <double> -beta <double> -ntopics <int> -niters <int> -savestep <int> -twords <int> -dfile <string>
lda -estc -dir <string> -model <string> -niters <int> -savestep <int> -twords <int>
lda -inf -dir <string> -model <string> -niters <int> -twords <int> -dfile <string>
Create an input file with the number of documents on the first line, followed by one document per line, each document being its words separated by spaces (a small conversion sketch follows the sample below). If there are blank lines (documents with no feature words), the error `Invalid (empty) document!` occurs.
$ head -n 5 docs.gibbs
13033
Storage Request Select Card Card Case Answer Method Method Possible Internal Model Capacity Recording Recording Recording Recording Set Time Save Save Save Save
The world, the world, the world, the important things, the common sense, the common sense, the common sense, the common sense, the common sense, the self, the self
Character clothes operation game Blonde character model Silver hair
Best Best Delusion Date Date
Phone Phone Phone Farewell Real Present Present Like Present Like Please Please Tears Tears Tears Tears Feeling Narita Conversation Negative Eyes Relationship Room Love Uruguay Uruguay Uruguay Uruguay Owl Final Feelings Behind the Scenes…
…
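For reference, a minimal Python sketch of producing this format. The starting file docs.txt (one space-separated document per line) is a hypothetical name, not something from the original article.

```python
# -*- coding: utf-8 -*-
# Sketch: convert a plain one-document-per-line token file (hypothetical docs.txt)
# into the GibbsLDA++ input format: document count first, then one document per line
docs = [line.split() for line in open('docs.txt', 'r') if line.strip()]

with open('docs.gibbs', 'w') as out:
    out.write('%d\n' % len(docs))          # first line: number of documents
    for words in docs:
        out.write(' '.join(words) + '\n')  # words separated by spaces
```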
Run estimation by specifying the number of topics $k$ and the number of iterations on the command line.
$ src/lda -est -niters 50 -ntopics 10 -twords 5 -dfile docs.gibbs
Sampling 50 iterations!
Iteration 1 ...
...
Iteration 50 ...
Gibbs sampling completed!
Saving the final model!
When execution finishes, several files are produced.
`model-final.others` holds the parameters used when this model was created, and `wordmap.txt` holds the ID assigned to each word.
$ cat model-final.others
alpha=5.000000
beta=0.100000
ntopics=10
ndocs=13033
nwords=20779
liter=50
$ cat wordmap.txt
20779
T-shirt 4601
Ai 1829
Aiko 19897
Greeting 2125
...
`model-final.twords` lists the most characteristic words of each topic, as many as specified by the `-twords` option (this file is not created if `-twords` is omitted).
$ cat model-final.twords
Topic 0th:
Person 0.047465
Feeling 0.019363
Qi 0.018178
Other 0.016968
Normal 0.016004
Topic 1th:
Job 0.043820
Time 0.024824
Home 0.019440
Now 0.017962
Mother 0.016881
Topic 2th:
If 0.033522
Month 0.018820
Company 0.018083
Insurance 0.015252
Request 0.012468
…
`model-final.phi` holds the word distribution of each topic: one row per topic, one column per word in the corpus, i.e. a matrix of $k$ rows by (total number of words) columns. It is the numerical data underlying `model-final.twords`.
$ head -c 100 model-final.phi
0.000002 0.000002 0.000002 0.000002 0.000799 0.000002 0.000002 0.002415 0.000002 0.000002 0.000002 0
`model-final.theta` holds the probability that each document belongs to each topic: one row per document, one column per topic, i.e. a matrix of (number of documents) rows by $k$ columns.
$ head -n 5 model-final.theta
0.081081 0.216216 0.067568 0.067568 0.162162 0.081081 0.067568 0.108108 0.081081 0.067568
0.076923 0.076923 0.123077 0.092308 0.076923 0.076923 0.107692 0.076923 0.076923 0.215385
0.086207 0.103448 0.103448 0.086207 0.137931 0.086207 0.103448 0.086207 0.086207 0.120690
0.090909 0.090909 0.109091 0.127273 0.090909 0.090909 0.090909 0.090909 0.090909 0.127273
0.035971 0.028777 0.111511 0.323741 0.050360 0.086331 0.248201 0.039568 0.028777 0.046763
`model-final.tassign` lists the original data as `word_id:topic` pairs, one document per line.
$ head -n 5 model-final.tassign
0:1 1:4 2:1 3:7 3:7 4:8 4:7 5:5 6:1 6:1 7:4 8:1 9:4 10:1 11:4 11:4 11:4 11:4 12:1 13:0 14:1 14:1 14:1 14:1
15:9 15:9 15:9 16:3 17:6 18:6 19:2 19:9 19:2 19:2 19:9 19:9 20:9 20:9 20:9
21:2 22:9 23:1 24:4 25:6 26:9 9:4 27:4
28:9 28:2 29:9 30:3 30:3
31:4 31:6 31:6 32:3 33:2 34:3 34:3 35:3 35:3 1:5 1:2 36:3 36:3 36:3 37:6 38:2 39:6 40:6 41:6 42:3 42:6 43:6 44:3 …
GibbsLDA++ only takes you as far as the numerical model, so apart from `twords` you need to map the results back onto documents and words yourself.
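As an illustration, a small Python sketch of that mapping, using the file formats described above. The top_n value and variable names are my own; this is not part of the original article.

```python
# -*- coding: utf-8 -*-
# Sketch: map GibbsLDA++ output back to words and documents
top_n = 5

# wordmap.txt: first line = vocabulary size, then "word id" per line
id2word = {}
with open('wordmap.txt', 'r') as f:
    f.readline()  # skip the vocabulary size
    for line in f:
        word, wid = line.rsplit(None, 1)
        id2word[int(wid)] = word

# model-final.phi: one row per topic, column index = word ID
with open('model-final.phi', 'r') as f:
    for t, line in enumerate(f):
        p = [float(x) for x in line.split()]
        top = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:top_n]
        print('Topic %d: %s' % (t, ' '.join(id2word[i] for i in top)))

# model-final.theta: one row per document, column index = topic
with open('model-final.theta', 'r') as f:
    for d, line in enumerate(f):
        p = [float(x) for x in line.split()]
        print('Document %d -> topic %d' % (d, p.index(max(p))))
```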
R
Personally I find R hard to get into as a language and tend to shy away from it, but looking at [lda.collapsed.gibbs.sampler {lda}](http://www.inside-r.org/packages/cran/lda/docs/lda.collapsed.gibbs.sampler) you can see that several Collapsed Gibbs Sampling based models are implemented.
It does not seem to support parallel (multi-core) or distributed processing.
First, install R on CentOS 7.2 (if the `rpm` command below returns a 404, remove the file name from the URL and browse for the appropriate version).
$ sudo rpm -ihv http://ftp.riken.jp/Linux/fedora/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
$ sudo yum install R
After the R installation completes, launch the interactive interface and install the lda package. R first tries to install the package into `/usr/lib64/R/library`, but since I am not the `root` user it falls back to a personal library. You will also be asked for a download source; select the Japan (Tokyo) CRAN mirror.
Below, `reshape2` and `ggplot2` are also installed in order to run the demo.
$ R
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
...
> install.packages("lda")
...
Would you like to use a personal library instead? (y/n) y
...
Selection: 13
...
> install.packages("reshape2")
> install.packages("ggplot2")
> require("lda")
> demo(lda)
The demo does not seem to follow a ggplot2 API change, so the error `stat_count() must not be used with a y aesthetic.` occurs and the graph is not displayed. If you fix the `qplot()` call and run it manually it works, but since the LDA computation itself has already finished, I ignored it and moved on.
> top.topic.words(result$topics, 5)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "algorithm" "learning" "model" "learning" "neural" "learning"
[2,] "model" "algorithm" "report" "neural" "networks" "network"
[3,] "algorithms" "paper" "models" "networks" "network" "feature"
[4,] "data" "examples" "bayesian" "paper" "learning" "model"
[5,] "results" "number" "technical" "network" "research" "features"
[,7] [,8] [,9] [,10]
[1,] "knowledge" "problem" "paper" "learning"
[2,] "system" "genetic" "decision" "reinforcement"
[3,] "reasoning" "control" "algorithm" "paper"
[4,] "design" "performance" "results" "problem"
[5,] "case" "search" "method" "method"
Results come out as expected.
Next, prepare the data files corresponding to the `vocab` and `documents` arguments of [lda.collapsed.gibbs.sampler {lda}](http://www.inside-r.org/packages/cran/lda/docs/lda.collapsed.gibbs.sampler).
Since `vocab` represents the corpus, first make a list of the $N$ words (as strings) appearing across all documents. Each word is then represented by its index $i$ into `vocab`.
{\tt vocab} = \{w_0, w_1, ..., w_{N-1}\} \\
{\tt vocab[}i{\tt]} = w_i
As the data file, create `corpus_r.txt` with one word per line, collecting the words of all documents without duplication.
$ head -n 5 corpus_r.txt
Mediation
capital
person
Drip
Lanthanoid
$ wc -l corpus_r.txt
20778 corpus_r.txt
`documents` is a list of documents $d_k$. Each document $d_k$ is represented by a $2 \times n_k$ matrix whose first row holds the indices $i_{k,j}$ of the words contained in the document and whose second row holds the number of occurrences $c_{k,j}$ of each of those words in the document.
{\tt documents} = \{ d_0, d_1, ..., d_k, ..., d_{m-1} \} \\
d_k = \begin{pmatrix}
i_{k,0} & i_{k,1} & ... & i_{k,n-1} \\
c_{k,0} & c_{k,1} & ... & c_{k,n-1}
\end{pmatrix}
As the data file, prepare `bow_r.txt` with one document per line, each entry being a word index (starting from 0) and its number of occurrences separated by a colon.
$ head -n 5 bow_r.txt
74:1 1109:1 1788:1 7000:2 10308:2 10552:1 12332:2 13489:1 14996:1 15448:1 15947:1 16354:4 17577:1 18262:1 19831:4
3256:3 5278:1 9039:1 12247:1 14529:6 17026:3
2181:1 4062:1 6270:1 6508:1 7405:1 8662:1 15448:1 18905:1
8045:2 9323:1 5934:2
288:3 624:1 691:1 820:2 1078:2 1109:2 1148:3 1251:1 2025:1 2050:1 2072:1 2090:1 2543:2 2626:1 2759:1…
$ wc -l bow_r.txt
13017 bow_r.txt
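The article does not show how `corpus_r.txt` and `bow_r.txt` were generated; below is a minimal Python sketch under the assumption of a plain one-document-per-line token file (hypothetical docs.txt) and tab-separated entries, matching the `strsplit(line, "\t")` in the R code that follows.

```python
# -*- coding: utf-8 -*-
# Sketch: build corpus_r.txt and bow_r.txt from a tokenized document file
# (hypothetical docs.txt: one space-separated document per line; empty documents skipped)
docs = [line.split() for line in open('docs.txt', 'r') if line.strip()]

# Vocabulary: every word exactly once; its line number (from 0) is its index
vocab = {}
for words in docs:
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab)

with open('corpus_r.txt', 'w') as out:
    for w in sorted(vocab, key=vocab.get):
        out.write(w + '\n')

# bow_r.txt: one document per line, tab-separated "index:count" entries
with open('bow_r.txt', 'w') as out:
    for words in docs:
        counts = {}
        for w in words:
            counts[vocab[w]] = counts.get(vocab[w], 0) + 1
        out.write('\t'.join('%d:%d' % (i, c) for i, c in sorted(counts.items())) + '\n')
```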
Run it with the following code. More of the work goes into parsing the data file and building the expected data structure than into the LDA processing itself (someone fluent in R could probably do this more neatly).
require("lda")
vocab <- as.vector(read.table("corpus_r.txt", header=FALSE)$V1)
file <- file("bow_r.txt", "r")
docs <- list()
repeat {
line <- readLines(con=file, 1)
if(length(line) == 0) break
doc <- NULL
for(tc in strsplit(line, "\t")[[1]]){
col <- c()
for(c in strsplit(tc, ":")[[1]]){
col <- c(col, as.integer(c))
}
if(is.null(doc)) doc <- cbind(col)
else doc <- cbind(doc, col)
}
docs <- append(docs, list(doc))
}
close(file)
result = lda.collapsed.gibbs.sampler(docs, 100, vocab, 50, 0.1, 0.1)
top.topic.words(result$topics, 5)
Running `top.topic.words()` lists the words that characterize each of the $k = 100$ topics.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "problem" "period" "Ichiban" "Eye" "room" "Issue" "Idol" "box"
[2,] "Congressman" "Please" "weather" "Lock" "House" "from now on" "fan" "part"
[3,] "society" "trial" "China" "rotation" "toilet" "Temple" "Weird" "sticker"
[4,] "religion" "everyone" "mistake" "Unnecessary" "water" "Impossible" "Interest" "varnish"
[5,] "Principle" "Full-time employee" "All" "China" "Washing" "Contents" "Obedient" "information"
[,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
[1,] "Female" "Silver" "image" "site" "Song" "Please" "circuit" "jobs"
[2,] "Man" "Myopia" "Attachment" "Method" "Please" "My" "regiment" "Company"
[3,] "male" "grown up" "Multiple" "Registration" "piano" "Reason" "infantry" "boss"
[4,] "Man" "snack" "Photo" "domain" "musics" "Sales" "Current" "Man"
[5,] "woman" "Picture book" "Earth" "page" "Feeling" "number" "Photo" "workplace"
... (Continues up to 100 topics)