These days, even in natural language processing, people reach for "deep learning" the way they order draft beer: as the default choice. Deep learning can indeed achieve high accuracy, but in many cases a basic model can reach comparable accuracy, and sometimes even beat it.
This actually happened to a paper published by DeepMind, the well-known research lab.
Recently there have been papers reporting state-of-the-art accuracy with models that are usually treated as mere baselines. Models dismissed as "just baselines" should not be underestimated.
This article takes up the topic model, a basic technique in natural language processing. It is a model used for document classification, but it is also very versatile: it can incorporate various assumptions, such as modeling the author of each document (the author topic model).
Here I would like to introduce how it works and explain how to implement it.
The topic model is a kind of probabilistic model: it estimates the "probability of appearance" of something. For the topic model, that something is words in documents. If this can be estimated well, documents in which similar words appear can be grouped together. In short, the topic model is a model that estimates the "probability of word appearance in a document".
Hearing this, you might think that sentences could be generated from a topic model (since it learns word appearance probabilities), but as noted above, the topic model estimates only the probability that words appear; it pays no attention to grammar. In other words, the topic model focuses on which words appear in a document and how often.
Now, how should we estimate this "probability of word appearance"? In short, by counting the words that actually appear in documents and estimating from those counts. The topic model assumes that which words appear, and how often, depends on the topic (= category). Intuitively this is a natural assumption: the words that appear in articles about politics differ from those in articles about entertainment.
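As a concrete sketch of this counting idea (the documents below are toy examples invented for illustration, not data from the article):

```python
from collections import Counter

# Toy documents (hypothetical), already tokenized by spaces and grouped by topic.
politics_docs = ["election vote policy", "policy debate vote"]
sports_docs = ["game score team", "team game win"]

def word_probabilities(docs):
    """Count words across documents and normalize the counts to probabilities."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

p_politics = word_probabilities(politics_docs)
p_sports = word_probabilities(sports_docs)

print(p_politics["vote"])         # 2 occurrences out of 6 words
print(p_sports.get("vote", 0.0))  # "vote" never appears in the sports topic
```

The word distributions differ between the two topics, which is exactly the assumption the topic model exploits.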
The model changes depending on the unit at which this "topic" is assumed to exist. In the figure below, each square box represents a document and the colors inside represent topics.
The topic model introduced here makes the most fine-grained assumption: each document is composed of multiple topics. This makes it possible to derive the structure of a document, such as "this article is 70% sports, 20% entertainment, ...", which captures the character of the text. Moreover, it lets you quantify the differences between documents (e.g., "this document is about 10% more political than that one").
Now let's see how to actually implement a topic model. The following assumes implementation in Python.
The easiest route is gensim. True to its tagline "topic modeling for humans", a topic model can be built in just a few lines.
Alternatively, it can be implemented with PyMC3 or PyStan. Both PyMC and PyStan are MCMC samplers: libraries for estimating the parameters of statistical models. This means, of course, that you have to build the statistical model yourself. PyStan is the Python interface to a library called Stan; Stan itself can also be used from C++ and R.
If ease of use matters most, choose gensim; if you want to customize things or build your own model, use PyMC/PyStan (incidentally, installation difficulty also increases in that order). This time I will use gensim, with an emphasis on ease of use. From here on, the explanation is based on actual code. The companion code is in the following repository, so please refer to it as needed.
icoxfog417/gensim_notebook (a Star would be much appreciated m(_ _)m)
Please refer to this article for setting up a machine-learning development environment in Python; it introduces how to use a package called Miniconda.
If you set up the environment as described in that article, gensim can be installed in one shot with conda install gensim
(the same goes for PyMC; PyStan is not so easy to install...).
The repository's explanatory document is written with iPython Notebook (now Jupyter Notebook), a tool for creating documents that contain executable Python code. The following explanation is based on it.
gensim_notebook/topic_model_evaluation.ipynb
First, get the target data.
The data for building the topic model was obtained from the Hot Pepper Beauty API provided by Recruit.
I chose it simply because the data was completely unfamiliar to me, and I was curious what it would look like.
Register on the site above and obtain an API key.
You can then download the hair-salon data by running scripts/download_data.py in the repository.
Next, preprocess the data. Specifically, convert each document into a set of words and their occurrence counts. This set is called a corpus, and within the corpus words are represented by IDs. The mapping between these IDs and the actual words is called the dictionary here. The process of creating the corpus and dictionary is as follows.
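In gensim these two artifacts correspond to `corpora.Dictionary` and its `doc2bow` method. The following is a minimal pure-Python sketch of the same idea, using toy tokenized documents invented for illustration:

```python
from collections import Counter

# Toy tokenized documents (real text would first go through morphological analysis).
docs = [["cut", "color", "perm"], ["cut", "treatment", "cut"]]

# Dictionary: map each word to an integer ID in order of first appearance
# (this is the role gensim's corpora.Dictionary plays).
dictionary = {}
for doc in docs:
    for word in doc:
        dictionary.setdefault(word, len(dictionary))

# Corpus: each document becomes a list of (word ID, count) pairs, like doc2bow.
corpus = [sorted(Counter(dictionary[w] for w in doc).items()) for doc in docs]

print(dictionary)  # {'cut': 0, 'color': 1, 'perm': 2, 'treatment': 3}
print(corpus)      # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]
```

Note that the corpus keeps only IDs and counts: word order is discarded, which matches the topic model's bag-of-words assumption described earlier.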
Breaking a sentence into words is simple in English, where words are separated by spaces, but not so easy in Japanese. It is therefore common to use a morphological-analysis library. MeCab is the best-known such library, and a dictionary called mecab-ipadic-neologd, which registers newly coined words, is also available for it.
However, MeCab is troublesome to install, especially on Windows, so janome, which is written in pure Python and easy to install, is a good alternative. To keep this sample easy to run, it does not perform morphological analysis with these libraries: the "special conditions" field in the data happened to be slash-separated, so slashes are used as the delimiter instead.
Eliminating unnecessary words means removing words you judge to be useless for classification. Specifically, the words are as follows.
Unifying words means merging forms that differ only in conjugation, such as "delicious" and "was delicious". This is called stemming.
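As a rough illustration of stemming (this toy suffix-stripper is invented for this article; real projects would use a proper stemmer or a morphological analyzer, not these hand-picked rules):

```python
# A naive stemmer sketch: strip a few common English suffixes so that
# inflected forms map to the same token. The suffix list and length guard
# are arbitrary choices for illustration only.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip when enough of a stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("cuts"))  # "cut" — "cuts" and "cut" now share one token
print(naive_stem("cut"))   # "cut" — unchanged
```

After stemming, the counts of "cuts" and "cut" are merged, so the corpus treats them as one word.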
It is no exaggeration to say that in natural language processing, accuracy is largely determined by how well these preprocessing steps are done.
The corpus-creation script scripts/make_corpus.py lets you perform all of the above with various options, so please try different settings.
Once you have a corpus and a dictionary, you can build a model from them. With gensim, building the model takes just one line.
m = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)
The number of topics to assume (num_topics) must be set by hand. In other words, you have to estimate in advance how many topics the whole document collection is likely to contain (there are also methods, such as the Dirichlet process, that estimate a good number of topics automatically).
To decide how many topics to set, you need to evaluate the model.
First, to evaluate the model, the corpus is split into a training set and an evaluation set, following standard machine-learning practice. The model learns from the training set, and we then check whether it can properly handle the evaluation set (= unseen documents).
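A minimal sketch of such a split (the document list and 80/20 ratio are placeholders, not values from the repository):

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Placeholder documents standing in for the corpus entries.
documents = [f"doc{i}" for i in range(10)]

# Shuffle, then hold out 20% of the documents for evaluation.
random.shuffle(documents)
split = int(len(documents) * 0.8)
training, evaluation = documents[:split], documents[split:]

print(len(training), len(evaluation))  # 8 2
```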
An index called perplexity is used for this evaluation. The reciprocal of perplexity is the degree to which the model can predict the appearance of words in a document, so the best value is 1, and it grows as the model becomes less accurate (as a rough guide: two digits is good, the low three digits is acceptable, beyond that is bad; and if you get a single digit, it is worth double-checking both the model and the perplexity calculation for errors).
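Concretely, perplexity over the words of a held-out document is the exponential of the negative mean log-probability. A minimal sketch of the computation (gensim computes a related bound via `LdaModel.log_perplexity`; this toy version just takes per-word probabilities directly):

```python
import math

def perplexity(word_probs):
    """exp of the negative mean log-probability over held-out words.
    1.0 means the model predicts every word perfectly."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns uniform probability 1/100 to every held-out word
# has perplexity 100: it is as confused as a 100-way uniform guess.
print(perplexity([0.01] * 5))  # 100.0
print(perplexity([1.0] * 5))   # 1.0 (perfect prediction)
```

This also explains the rule of thumb above: a perplexity of 100 means the model is, on average, choosing among about 100 equally likely words.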
In the "Make Topic Model" section of the explanatory notebook, the perplexity is calculated while varying the number of topics, so please try it out.
Visualizing the topics also makes the model easier to evaluate. The methods are explained below.
Since topics classify documents, the classification should ideally be exhaustive and non-overlapping, that is, clearly separated. In other words, a good model is one whose topics are well separated from one another.
KL divergence can be used as an index of the distance between classifications (= the word distributions of the topics). The figure below uses it to visualize the distances between topics (built with three topics; the axis values are topic numbers).
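A minimal sketch of KL divergence between two topics' word distributions (the distributions below are invented; note that KL divergence is asymmetric, so visualizations often use a symmetrized variant):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as aligned probability lists.
    It is 0 when the distributions are identical, and grows as they diverge."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Word distributions of two hypothetical topics over a 3-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]

print(kl_divergence(topic_a, topic_a))  # 0.0 — same topic, zero distance
print(kl_divergence(topic_a, topic_b))  # > 0 — the topics are far apart
```

Computing this for every pair of topics yields exactly the kind of distance matrix shown in the figure.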
The figure above is a bad example. The farther apart two topics are, the lighter the cell color, so a figure that is dark overall means similar topics (= categories) have emerged. In that case, consider reducing the number of topics.
When actually classifying documents, it is better to be able to state clearly that "the topic of this document is X". In other words, each document should have one clearly dominant topic.
The figure below therefore visualizes the topic composition of each document (after the verification above, the number of topics was reduced to two). 200 documents were picked at random, and the topic composition ratio of each is displayed.
Finally, let's look at which words are likely to appear in each topic. Ideally, the words do not overlap between topics, and you can guess what a topic is about from its set of words.
Looking at the result, Topic #1 at least can be inferred, from keywords like "small salon" and "one stylist per customer", to be small salons (given the keyword "fully reservation-based", they may even be rather upscale places). Topic #0 gives the impression of relatively large hair salons with many staff, open all year round. Actually looking at the salons' homepages, Topic #0 feels like major chains and Topic #1 like specialty shops.
A method like the topic model, in which the system discovers and classifies features of the data on its own without being taught by humans, is called unsupervised learning (conversely, a method trained on data paired with its correct classification is called supervised learning). Unsupervised learning has the advantage that it can be started as soon as you have data, but it also has the disadvantage that, since "how to classify" is left to the model as described above, the results can be difficult to interpret.
In this way, by using gensim you can easily build a topic model, grasp the characteristics of documents, and classify them. Applied to an application, I think functions such as the following could be implemented.
Also, as mentioned at the beginning, the topic model is very versatile, and there are many extensions of it.
Specifically, there are the correspondence topic model, which can take into account additional information such as ratings alongside text such as reviews, and the author topic model, which takes into account the author who wrote the text. There have also been attempts to apply it to image classification by treating image features like words; since this allows text and images to be handled together, it is applied to research such as automatic image annotation (captioning).
I hope this explanation helps you put your own ideas into shape.
Here are some useful books / articles for those who want to know more.
The following articles carefully explain basic ideas such as probability.
If you want to implement it with PyMC, this tutorial will be helpful.