I wanted to practice JavaFX and self-contained packages, and I wanted to build a tool I could actually use, so I decided to try LDA. This article summarizes running LDA in Java; see the links at the end for the JavaFX and self-contained-package parts.
LDA is a general-purpose machine learning method for estimating topics from a given set of documents in natural language processing. As of 2019 deep learning dominates, but before that, topic models like LDA were prominent for improving accuracy in the NLP area. For those who want to read about the theory, the following page is approachable and recommended.
Excerpt from "A Gentle Introduction to Latent Dirichlet Allocation (LDA)"
LDA is a type of language model that assumes each document is composed of multiple topics. In Japanese it is called the "latent Dirichlet allocation method". If words are what appears on the surface, topics are latent: unlike words, they do not show up on the surface. The name presumably comes from assuming a Dirichlet distribution as the prior over the distribution of these latent elements. (Omitted) Roughly speaking, the Dirichlet distribution is a probability distribution over probability distributions. For example, with three topics "sports", "economy", and "politics", it assigns a probability to each possible topic distribution: the probability that the topic proportions are (sports, economy, politics) = (0.3, 0.2, 0.5) might be 0.1, while the probability that they are (0.1, 0.2, 0.7) might be 0.2.
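To make "a probability distribution over probability distributions" concrete, here is a minimal Java sketch of my own (not part of any library mentioned here) that draws a random topic distribution from a symmetric Dirichlet with all concentration parameters equal to 1. In that special case each Gamma(1) component is just an exponential draw, so normalizing exponentials gives a valid Dirichlet sample.

```java
import java.util.Random;

public class DirichletDemo {
    // Draws one sample from a symmetric Dirichlet(1, ..., 1).
    // Gamma(1, 1) equals the exponential distribution, so each
    // component is -ln(U) for a uniform U, normalized to sum to 1.
    static double[] sampleTopicDistribution(int numTopics, Random rng) {
        double[] p = new double[numTopics];
        double sum = 0.0;
        for (int i = 0; i < numTopics; i++) {
            p[i] = -Math.log(rng.nextDouble());
            sum += p[i];
        }
        for (int i = 0; i < numTopics; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        // Each call yields a different (sports, economy, politics) mixture.
        double[] topics = sampleTopicDistribution(3, new Random());
        System.out.printf("(sports, economy, politics) = (%.2f, %.2f, %.2f)%n",
                topics[0], topics[1], topics[2]);
    }
}
```

Each run prints a different mixture, which is exactly what the Dirichlet prior expresses: uncertainty over the topic proportions themselves.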
As for implementations of LDA, the Python library gensim seems to be the best known. Reference: Introduction to gensim
To keep the later JavaFX app simple, I decided not to mix Python and Java. An implementation of LDA in Java was published on GitHub, so I decided to borrow it. Thanks.
Searching for "LDA4j" turned up two repositories, and this time I adopted hankcs's module, which I liked best. (The input formats differ: breakbee/LDA4J takes a single file with one document per line, while hankcs/LDA4j takes multiple files with one document per file. I personally preferred one document per file.)
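As a sketch of that one-document-per-file layout: the folder name data/mini matches the later example, but the file names and the assumption that tokens are whitespace-separated are mine — check the sample corpus bundled with hankcs/LDA4j for the exact format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CorpusSetup {
    public static void main(String[] args) throws IOException {
        // One file per document; each file holds the tokens of one document.
        Path dir = Paths.get("data/mini");
        Files.createDirectories(dir);
        Files.write(dir.resolve("doc1.txt"),
                "economy market company business".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("doc2.txt"),
                "military training equipment troop".getBytes(StandardCharsets.UTF_8));
        System.out.println("Corpus written to " + dir.toAbsolutePath());
    }
}
```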
| environment | service/version |
|---|---|
| Execution environment | Windows 10 |
| Development environment | Eclipse 4.1.0 |
| Development language | Java 8 |
Pull the module into Eclipse: fork & clone from GitHub (or download the Zip) and import the project. After that, I created my own execution class, MainRunner.java.
As the README says ...
MainRunner.java

```java
package com.ketman.app;

import java.io.IOException;
import java.util.Map;

import com.hankcs.lda.Corpus;
import com.hankcs.lda.LdaGibbsSampler;
import com.hankcs.lda.LdaUtil;

public class MainRunner {
    public static void main(String[] args) {
        // 1. Load the corpus from disk
        Corpus corpus;
        try {
            corpus = Corpus.load("data/mini");
            // 2. Create an LDA sampler
            LdaGibbsSampler ldaGibbsSampler =
                    new LdaGibbsSampler(corpus.getDocument(), corpus.getVocabularySize());
            // 3. Train it with 10 topics
            ldaGibbsSampler.gibbs(10);
            // 4. The phi matrix is the LDA model; LdaUtil can explain it
            double[][] phi = ldaGibbsSampler.getPhi();
            Map<String, Double>[] topicMap = LdaUtil.translate(phi, corpus.getVocabulary(), 10);
            LdaUtil.explain(topicMap);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
Run MainRunner via Run ⇒ Run Configurations ⇒ Java Application. You should see output like the following on the console: it estimates the specified number of topics (10) for the document set stored in data/mini.
```
Sampling 1000 iterations with burn-in of 100 (B/S=20).
BBBBB|S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||
topic 0 :
China=0.0097164123524064
Market=0.007268178259268298
Business=0.006646897977003122
Company=0.006420165848545306
Exhibition=0.005931172520179485
Tourism=0.005517115761050293
Imminent=0.004144655174798414
Reporter=0.003896963247878764
Products=0.0038405773231741857
Service=0.0036131627315211285
topic 1 :
Beautiful country=0.007753386939328633
Japan=0.004271883755069139
Training=0.0039382838929572965
Systematic=0.0038821627109404673
Airplane=0.0037908977218186262
Troop=0.003713327985408122
Military=0.003662570207063461
Advance=0.003548971364140448
Creation=0.003465095755923189
Equipment=0.0033491792847693187
~ Omitted ~
topic 9 :
Hirai=0.00887335526016362
Team member=0.003820752808354389
League=0.0034088636107220934
Ball=0.0030593385176732896
Club=0.002519739439727434
Crown=0.0025101075962186965
China=0.002314435002019442
Ball=0.0023066510579788685
Match=0.002282312176369107
Reporter=0.0022029528425211455
```
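Once training has produced phi, you can also ask which topic best explains a new document. The snippet below is my own naive illustration, not the library's API: it scores each topic independently under a uniform prior rather than performing proper LDA inference, and the tiny phi matrix and word ids are made up. Still, it shows how the topic-word matrix is read.

```java
public class TopicScoreDemo {
    // Naive per-topic score: log-likelihood of the document assuming it was
    // generated entirely by one topic, normalized with a uniform prior.
    static double[] posterior(double[][] phi, int[] doc) {
        double[] logp = new double[phi.length];
        for (int t = 0; t < phi.length; t++) {
            for (int w : doc) {
                logp[t] += Math.log(phi[t][w]);
            }
        }
        // Log-sum-exp normalization for numerical stability.
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logp) max = Math.max(max, v);
        double z = 0.0;
        double[] post = new double[logp.length];
        for (int t = 0; t < logp.length; t++) {
            post[t] = Math.exp(logp[t] - max);
            z += post[t];
        }
        for (int t = 0; t < post.length; t++) {
            post[t] /= z;
        }
        return post;
    }

    public static void main(String[] args) {
        // Hypothetical phi: 2 topics over a 4-word vocabulary (rows sum to 1).
        double[][] phi = {
                {0.5, 0.3, 0.1, 0.1},  // topic 0 favors words 0 and 1
                {0.1, 0.1, 0.4, 0.4}   // topic 1 favors words 2 and 3
        };
        int[] doc = {2, 3, 3, 2};      // a new document as word ids
        double[] p = posterior(phi, doc);
        System.out.printf("P(topic 0)=%.3f, P(topic 1)=%.3f%n", p[0], p[1]);
        // prints "P(topic 0)=0.004, P(topic 1)=0.996"
    }
}
```

Since the document uses only words that topic 1 favors, the posterior puts almost all its mass on topic 1, as expected.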