These days, even in natural language processing, people reach for "deep learning" the way they order draft beer: as the default choice. Deep learning can indeed achieve high accuracy, but in many cases a basic model can reach comparable accuracy, and sometimes even beat it.
This actually happened to a paper published by DeepMind, the well-known research lab.
Recently there have been papers reporting state-of-the-art accuracy with models that are usually treated as mere baselines. Models dismissed as "just baselines" should not be underestimated.
This article takes up the topic model, a basic technique in natural language processing. It is a model used for document classification, but it is also very versatile: it can incorporate various assumptions, such as modeling the author of each document (the author topic model).
Here I would like to introduce how it works and explain how to implement it.
The topic model is a kind of probabilistic model: it estimates the "probability of appearance" of something. For the topic model, that something is words in documents. If this can be estimated well, documents in which similar words appear can be grouped together. In short, the topic model is a model that estimates the "probability of word appearance in a document".
Hearing this, you might think that sentences could be generated from a topic model (since it learns word appearance probabilities), but as noted above, the topic model estimates only the probability that words appear; it pays no attention to grammar. In other words, the topic model focuses on which words appear in a document and how often.
Now, how should we estimate this "probability of word appearance"? In short, by counting the words that actually appear in documents and estimating from those counts. The topic model assumes that which words appear, and how often, depends on the topic (= category). Intuitively this is a natural assumption: the words that appear in articles about politics differ from those in articles about entertainment.
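As a concrete sketch of this counting idea (the documents below are toy examples invented for illustration, not data from the article):

```python
from collections import Counter

# Toy documents (hypothetical), already tokenized by spaces and grouped by topic.
politics_docs = ["election vote policy", "policy debate vote"]
sports_docs = ["game score team", "team game win"]

def word_probabilities(docs):
    """Count words across documents and normalize the counts to probabilities."""
    counts = Counter(word for doc in docs for word in doc.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

p_politics = word_probabilities(politics_docs)
p_sports = word_probabilities(sports_docs)

print(p_politics["vote"])         # 2 occurrences out of 6 words
print(p_sports.get("vote", 0.0))  # "vote" never appears in the sports topic
```

The word distributions differ between the two topics, which is exactly the assumption the topic model exploits.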
The model changes depending on the unit at which this "topic" is assumed to exist. In the figure below, each square box represents a document and the colors inside represent topics.
The topic model introduced here makes the most fine-grained assumption: each document is composed of multiple topics. This makes it possible to derive the structure of a document, such as "this article is 70% sports, 20% entertainment, ...", which captures the character of the text. Moreover, it lets you quantify the differences between documents (e.g., "this document is about 10% more political than that one").
Now let's see how to actually implement a topic model. The following assumes implementation in Python.
The easiest route is gensim. True to its tagline "topic modeling for humans", a topic model can be built in just a few lines.
Alternatively, it can be implemented with PyMC3 or PyStan. Both PyMC and PyStan are MCMC samplers: libraries for estimating the parameters of statistical models. This means, of course, that you have to build the statistical model yourself. PyStan is the Python interface to a library called Stan; Stan itself can also be used from C++ and R.
If ease of use matters most, choose gensim; if you want to customize things or build your own model, use PyMC/PyStan (incidentally, installation difficulty also increases in that order). This time I will use gensim, with an emphasis on ease of use. From here on, the explanation is based on actual code. The companion code is in the following repository, so please refer to it as needed.
icoxfog417/gensim_notebook (a Star would be much appreciated m(_ _)m)
Please refer to this article for setting up a machine-learning development environment in Python; it introduces how to use a package called Miniconda.
If you set up the environment as described in that article, gensim can be installed in one shot with conda install gensim
(the same goes for PyMC; PyStan is not so easy to install...).
The repository's explanatory document is written with iPython Notebook (now Jupyter Notebook), a tool for creating documents that contain executable Python code. The following explanation is based on it.
gensim_notebook/topic_model_evaluation.ipynb
First, get the target data.
The data for building the topic model was obtained from the Hot Pepper Beauty API provided by Recruit.
I chose it simply because the data was completely unfamiliar to me, and I was curious what it would look like.
Register on the site above and obtain an API key.
You can then download the hair-salon data by running scripts/download_data.py in the repository.
Next, preprocess the data. Specifically, convert each document into a set of words and their occurrence counts. This set is called a corpus, and within the corpus words are represented by IDs. The mapping between these IDs and the actual words is called the dictionary here. The process of creating the corpus and dictionary is as follows.
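In gensim these two artifacts correspond to `corpora.Dictionary` and its `doc2bow` method. The following is a minimal pure-Python sketch of the same idea, using toy tokenized documents invented for illustration:

```python
from collections import Counter

# Toy tokenized documents (real text would first go through morphological analysis).
docs = [["cut", "color", "perm"], ["cut", "treatment", "cut"]]

# Dictionary: map each word to an integer ID in order of first appearance
# (this is the role gensim's corpora.Dictionary plays).
dictionary = {}
for doc in docs:
    for word in doc:
        dictionary.setdefault(word, len(dictionary))

# Corpus: each document becomes a list of (word ID, count) pairs, like doc2bow.
corpus = [sorted(Counter(dictionary[w] for w in doc).items()) for doc in docs]

print(dictionary)  # {'cut': 0, 'color': 1, 'perm': 2, 'treatment': 3}
print(corpus)      # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]
```

Note that the corpus keeps only IDs and counts: word order is discarded, which matches the topic model's bag-of-words assumption described earlier.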
Breaking a sentence into words is simple in English, where words are separated by spaces, but not so easy in Japanese. It is therefore common to use a morphological-analysis library. MeCab is the best-known such library, and a dictionary called mecab-ipadic-neologd, which registers newly coined words, is also available for it.
However, MeCab is troublesome to install, especially on Windows, so janome, which is written in pure Python and easy to install, is a good alternative. To keep this sample easy to run, it does not perform morphological analysis with these libraries: the "special conditions" field in the data happened to be slash-separated, so slashes are used as the delimiter instead.
Eliminating unnecessary words means removing words you judge to be useless for classification. Specifically, the words are as follows.
Unifying words means merging forms that differ only in conjugation, such as "delicious" and "was delicious". This is called stemming.
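As a rough illustration of stemming (this toy suffix-stripper is invented for this article; real projects would use a proper stemmer or a morphological analyzer, not these hand-picked rules):

```python
# A naive stemmer sketch: strip a few common English suffixes so that
# inflected forms map to the same token. The suffix list and length guard
# are arbitrary choices for illustration only.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Only strip when enough of a stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("cuts"))  # "cut" — "cuts" and "cut" now share one token
print(naive_stem("cut"))   # "cut" — unchanged
```

After stemming, the counts of "cuts" and "cut" are merged, so the corpus treats them as one word.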
It is no exaggeration to say that in natural language processing, accuracy is largely determined by how well these preprocessing steps are done.
The corpus-creation script scripts/make_corpus.py lets you perform all of the above with various options, so please try different settings.
Once you have a corpus and a dictionary, you can build a model from them. With gensim, building the model takes just one line.
m = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)
The number of topics to assume (num_topics) must be set by hand. In other words, you have to estimate in advance how many topics the whole document collection is likely to contain (there are also methods, such as the Dirichlet process, that estimate a good number of topics automatically).
To decide how many topics to set, you need to evaluate the model.
First, to evaluate the model, the corpus is split into a training set and an evaluation set, following standard machine-learning practice. The model learns from the training set, and we then check whether it can properly handle the evaluation set (= unseen documents).
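A minimal sketch of such a split (the document list and 80/20 ratio are placeholders, not values from the repository):

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Placeholder documents standing in for the corpus entries.
documents = [f"doc{i}" for i in range(10)]

# Shuffle, then hold out 20% of the documents for evaluation.
random.shuffle(documents)
split = int(len(documents) * 0.8)
training, evaluation = documents[:split], documents[split:]

print(len(training), len(evaluation))  # 8 2
```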
An index called perplexity is used for this evaluation. The reciprocal of perplexity is the degree to which the model can predict the appearance of words in a document, so the best value is 1, and it grows as the model becomes less accurate (as a rough guide: two digits is good, the low three digits is acceptable, beyond that is bad; and if you get a single digit, it is worth double-checking both the model and the perplexity calculation for errors).
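Concretely, perplexity over the words of a held-out document is the exponential of the negative mean log-probability. A minimal sketch of the computation (gensim computes a related bound via `LdaModel.log_perplexity`; this toy version just takes per-word probabilities directly):

```python
import math

def perplexity(word_probs):
    """exp of the negative mean log-probability over held-out words.
    1.0 means the model predicts every word perfectly."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns uniform probability 1/100 to every held-out word
# has perplexity 100: it is as confused as a 100-way uniform guess.
print(perplexity([0.01] * 5))  # 100.0
print(perplexity([1.0] * 5))   # 1.0 (perfect prediction)
```

This also explains the rule of thumb above: a perplexity of 100 means the model is, on average, choosing among about 100 equally likely words.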
In the "Make Topic Model" section of the explanatory notebook, the perplexity is calculated while varying the number of topics, so please try it out.
Visualizing the topics also makes the model easier to evaluate. The methods are explained below.
Since topics classify documents, the classification should ideally be exhaustive and non-overlapping, that is, clearly separated. In other words, a good model is one whose topics are well separated from one another.
KL divergence can be used as an index of the distance between classifications (= the word distributions of the topics). The figure below uses it to visualize the distances between topics (built with three topics; the axis values are topic numbers).
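A minimal sketch of KL divergence between two topics' word distributions (the distributions below are invented; note that KL divergence is asymmetric, so visualizations often use a symmetrized variant):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as aligned probability lists.
    It is 0 when the distributions are identical, and grows as they diverge."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Word distributions of two hypothetical topics over a 3-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]

print(kl_divergence(topic_a, topic_a))  # 0.0 — same topic, zero distance
print(kl_divergence(topic_a, topic_b))  # > 0 — the topics are far apart
```

Computing this for every pair of topics yields exactly the kind of distance matrix shown in the figure.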
The figure above is a bad example. The farther apart two topics are, the lighter the cell color, so a figure that is dark overall means similar topics (= categories) have emerged. In that case, consider reducing the number of topics.
When actually classifying documents, it is better to be able to state clearly that "the topic of this document is X". In other words, each document should have one clearly dominant topic.
The figure below therefore visualizes the topic composition of each document (after the verification above, the number of topics was reduced to two). 200 documents were picked at random, and the topic composition ratio of each is displayed.
Finally, let's look at which words are likely to appear in each topic. Ideally, the words do not overlap between topics, and you can guess what a topic is about from its set of words.
Looking at the result, Topic #1 at least can be inferred, from keywords like "small salon" and "one stylist per customer", to be small salons (given the keyword "fully reservation-based", they may even be rather upscale places). Topic #0 gives the impression of relatively large hair salons with many staff, open all year round. Actually looking at the salons' homepages, Topic #0 feels like major chains and Topic #1 like specialty shops.
A method like the topic model, in which the system discovers and classifies features of the data on its own without being taught by humans, is called unsupervised learning (conversely, a method trained on data paired with its correct classification is called supervised learning). Unsupervised learning has the advantage that it can be started as soon as you have data, but it also has the disadvantage that, since "how to classify" is left to the model as described above, the results can be difficult to interpret.
In this way, by using gensim you can easily build a topic model, grasp the characteristics of documents, and classify them. Applied to an application, I think functions such as the following could be implemented.
Also, as mentioned at the beginning, the topic model is very versatile, and there are many extensions of it.
Specifically, there are the correspondence topic model, which can take into account additional information such as ratings alongside text such as reviews, and the author topic model, which takes into account the author who wrote the text. There have also been attempts to apply it to image classification by treating image features like words; since this allows text and images to be handled together, it is applied to research such as automatic image annotation (captioning).
I hope this explanation helps you put your own ideas into shape.
Here are some useful books / articles for those who want to know more.
The following articles carefully explain basic ideas such as probability.
If you want to implement it with PyMC, this tutorial will be helpful.