[PYTHON] Let's use tomotopy instead of gensim

What is tomotopy?

tomotopy is short for TOpic MOdeling TOol: a Python library that mainly handles LDA (Latent Dirichlet Allocation) and its derived algorithms.

It is easier to use than gensim, a library with similar functionality, and computation is faster because the core is implemented in C++.

Installation

Just install it with pip.

pip install tomotopy

How to use

As an example, use the following dataset from the gensim tutorial.

Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

Using LDA from tomotopy looks like this.

Use the dataset after preprocessing (the preprocessing is the same as in [this gensim tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py)).

import tomotopy as tp
from pprint import pprint

texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey']
]

# Model initialization
model = tp.LDAModel(k=2, seed=1)  # k is the number of topics

# Creating a corpus
for text in texts:
    model.add_doc(text)

# Training
model.train(iter=100)

# Extracting the word distribution of each topic
for k in range(model.k):
    print(f"Topic {k}")
    pprint(model.get_topic_words(k, top_n=5))

"""output
Topic 0
[('system', 0.20972803235054016),
 ('user', 0.15742677450180054),
 ('human', 0.10512551665306091),
 ('interface', 0.10512551665306091),
 ('computer', 0.10512551665306091)]
Topic 1
[('trees', 0.2974308431148529),
 ('graph', 0.2974308431148529),
 ('survey', 0.1986166089773178),
 ('minors', 0.1986166089773178),
 ('system', 0.0009881423320621252)]
"""

Features of tomotopy

Good points

--Most of what you want to do with LDA can be done simply by setting arguments in the model constructor and the training function.

(Parallelization, TF-IDF term weighting, upper and lower limits on word frequency and document frequency, etc.)

--The learning algorithm is sampling (collapsed Gibbs sampling).

gensim uses variational inference, but sampling is said to be more accurate. The usual drawback of sampling is that it is slow, but since tomotopy is implemented in C++ and parallelizes easily, it is much faster than MALLET.

--LDA derivatives are available.

Besides plain LDA, tomotopy provides, for example, HDP (tp.HDPModel), supervised LDA (tp.SLDAModel), labeled LDA (tp.LLDAModel, tp.PLDAModel), correlated topic models (tp.CTModel), and dynamic topic models (tp.DTModel).

Bad points

--It doesn't always scratch every itch.

Perhaps because tomotopy prioritizes ease of use, you occasionally run into moments of "wait, I can't do this?"

For example:

--~~A processed corpus cannot be reused (you have to rebuild the corpus every time you train).~~ On closer inspection, this is possible with the tomotopy.utils.Corpus class. However, when I tried it, it turned out to be disappointingly expensive in both time and RAM.

--There is no way to save RAM.

(Well, unless your dataset has tens of millions of records, neither of these bothers me much.)

Summary

With tomotopy, you can train sampling-based LDA models very easily.

To be honest, I can't go back to gensim anymore.
