Learn the basics of document classification in natural language processing: the topic model

In natural language processing these days, people reach for deep learning almost by reflex, the way you might order a draft beer without looking at the menu. Deep learning can indeed achieve high accuracy, but in many cases a basic model can reach comparable accuracy, and can sometimes even beat it.

This actually happened with a paper published by DeepMind, the well-known research institute.

(Figure: machine comprehension results from the paper)

Recently there have also been papers reporting the highest accuracy with models that are usually used as mere baselines. Baseline models, in other words, are not to be underestimated.

This article takes up the topic model, one of the basic methods in natural language processing. It is a model used for document classification, but it is also very versatile: it can incorporate various assumptions, such as taking the existence of an author into account when classifying documents (the author topic model).

Below, I introduce how the topic model works and explain how to implement it.

What is a topic model?

The topic model is a kind of probabilistic model; that is, it estimates the "probability of appearance" of something. In the topic model, that something is a word appearing in a document. If this can be estimated well, documents in which similar words appear can be grouped together. In short, the topic model is a model that estimates the probability of word appearance in documents.

Hearing that, you might expect that a topic model, having learned word appearance probabilities, could also generate sentences. But as stated above, the topic model only estimates the probability that words appear; it pays no attention to grammar. In other words, the topic model looks only at the words in a document and their frequency of occurrence.

Now, how should this probability of word appearance be estimated? In short, by counting the words that actually appear in the documents and estimating from those counts. The topic model assumes that which words appear, and how often, depends on the topic (= category). This is a natural assumption: the words that appear in articles about politics differ from those in articles about entertainment.
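As a concrete image of "counting the words that actually appear", here is a minimal sketch in plain Python; the two documents and their words are invented for illustration:

    # Count word occurrences per document; each topic/category tends to
    # produce a different distribution of words.
    from collections import Counter

    politics_doc = "election vote policy election party".split()
    sports_doc = "game score team game player".split()

    print(Counter(politics_doc))  # Counter({'election': 2, 'vote': 1, 'policy': 1, 'party': 1})
    print(Counter(sports_doc))    # Counter({'game': 2, 'score': 1, 'team': 1, 'player': 1})

The topic model builds on exactly these kinds of counts, estimating for each topic a probability distribution over words.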

Models differ in the unit within which this "topic" is assumed to exist. In the figure below, each square box represents a document, and the colors inside it represent topics.

(Figure: documents drawn as boxes whose inner colors represent topics; the assumed unit of a topic differs from model to model)

The topic model introduced this time makes the most fine-grained assumption of these: a single document is composed of multiple topics. Under this assumption we can derive the topic structure of a document, such as "this article is 70% sports, 20% entertainment, ...", which lets us grasp the characteristics of the text. It also lets us quantify differences between documents (e.g., "this document is about 10% more political than that one").

How to implement a topic model

Now let's see how to actually implement a topic model. The following assumes implementation in Python.

The easiest way is to use gensim. True to its tagline "topic modelling for humans", it lets you build a model in just a few lines.

Alternatively, you can implement one on top of PyMC3 or PyStan. PyMC and PyStan are both libraries known as MCMC samplers, used for estimating the parameters of statistical models, so with them you naturally have to build the statistical model itself. PyStan is the Python interface to a library called Stan; Stan itself can also be used from C++ and R.

In short, use gensim if you value ease of use, and PyMC/PyStan if you want to customize things, for example by building your own model (incidentally, installation gets more troublesome in that same order). This time I will use gensim, with the emphasis on ease of use. From here on, the explanation follows actual code. The code is in the repository below, so please refer to it as needed.

icoxfog417/gensim_notebook (a Star would be much appreciated m(_ _)m)

For setting up a machine-learning environment for Python, please refer to here; it introduces a package called Miniconda. If you build the environment as described there, gensim can be installed in one shot with conda install gensim (the same goes for PyMC; PyStan is not so easy to install...). In the repository, the explanatory document is written with iPython Notebook, a tool for creating documents that contain executable Python code. The explanation below follows it.

gensim_notebook/topic_model_evaluation.ipynb

Data acquisition

First, get the target data.

The data for building this topic model was obtained from the Hot Pepper Beauty API provided by Recruit. I chose it simply because the data was completely unfamiliar to me and I was curious what would come out of it. Register on the site above to get an API key; you can then download the hair-salon data by running scripts/download_data.py in the repository.

Data preprocessing

Next, preprocess the data. Specifically, convert each document into a set of words and their occurrence counts. Such a set is called a corpus, and within a corpus each word is represented by an ID. The mapping between these IDs and the actual words is called a dictionary here. The corpus and dictionary are created as follows (a minimal code sketch follows the list):

  1. Break the text into words (morphological analysis) and count them
  2. Remove unnecessary words (stop word removal, etc.)
  3. Unify word forms (stemming)
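Here is a minimal sketch of turning already-tokenized documents into a gensim dictionary and corpus; the two token lists are invented for illustration:

    # Build the word <-> ID mapping (dictionary) and the bag-of-words corpus.
    from gensim import corpora

    tokenized_docs = [
        ["cut", "color", "reservation"],
        ["color", "treatment", "reservation", "reservation"],
    ]

    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    print(corpus)  # each document becomes a list of (word ID, count) pairs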

Breaking text into words is trivial in English, where words are separated by spaces, but it is not so easy in Japanese, so a morphological analysis library is normally used. MeCab is the best-known such library, and a dictionary for it called mecab-ipadic-neologd, which registers newly coined words, is also available.

However, MeCab is troublesome to install, especially on Windows, so janome, which is written in pure Python and easy to install, is a good alternative. In this sample, to keep things easy to run, no morphological analysis is performed: the "special condition" field in the data is simply a list of items separated by slashes, so I split on the slash as the delimiter.
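For reference, here is a hedged sketch of both approaches: janome tokenization (not used in the sample itself) and the slash splitting that the sample actually performs. The Japanese strings are invented examples:

    # Morphological analysis with janome (pure Python, easy to install).
    from janome.tokenizer import Tokenizer

    t = Tokenizer()
    # base_form returns the dictionary form, which also helps with step 3 (stemming).
    words = [token.base_form for token in t.tokenize(u"髪を切りました")]

    # What the sample actually does: split the slash-separated field.
    features = u"完全予約制/駐車場あり/カード可".split("/")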

(Figure: example of the slash-separated "special condition" field in the data)

Eliminating unnecessary words means removing words that you judge to be useless for classification, such as stop words (extremely common words that appear in almost every document) and symbols.

Unifying words means merging forms that differ only in inflection, such as different conjugations of "delicious". This is called stemming.

In natural language processing, it is no exaggeration to say that accuracy is almost entirely determined by how well this preprocessing is done. The corpus-creation script scripts/make_corpus.py lets you switch the operations above on and off through various options, so please experiment with the settings.
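As one example of such an operation, gensim's dictionary can drop words that are too rare or too common to help distinguish topics. This is a sketch assuming `dictionary` was built as above; the thresholds are arbitrary examples:

    # Remove words appearing in fewer than 5 documents or in more than
    # half of all documents, then reassign IDs to close the gaps.
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    dictionary.compactify()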

Building a topic model

Once the corpus and dictionary are ready, you can build a model from them. With gensim, building the model takes just one line.

m = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)

"Assuming how many topics there are (num_topics)" needs to be set. This means that you have to estimate how many topics you are likely to have in the entire document (actually, there is also a Dirichlet process to estimate how many topics are good).

To decide how many topics to set, you need to evaluate the model.

Evaluation of topic model

First, to make evaluation possible, split the corpus into a training part and an evaluation part, just as in ordinary machine-learning practice. Train the model on the training part, then check whether the evaluation part (= unseen documents) is classified properly.

An index called perplexity is used for this evaluation. The reciprocal of perplexity indicates how well the appearance of words in a document can be predicted, so the best value is 1, and perplexity grows as the model gets less accurate (as a rough guide: two digits is good, the low hundreds is acceptable, anything beyond that is bad, and if you get a single digit you should suspect an error and re-check the model and the perplexity calculation).
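A minimal sketch of this procedure with gensim, assuming the corpus and dictionary from before (gensim's log_perplexity returns a per-word log-2 likelihood bound, so the perplexity itself is 2 to the minus that bound):

    import numpy as np
    from gensim import models

    # Hold out 20% of the corpus for evaluation.
    split = int(len(corpus) * 0.8)
    training, evaluation = corpus[:split], corpus[split:]

    m = models.LdaModel(corpus=training, id2word=dictionary, num_topics=3)
    bound = m.log_perplexity(evaluation)   # per-word likelihood bound (log2)
    print("perplexity:", np.exp2(-bound))  # 1 is best; larger is worse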

In the "Make Topic Model" section of the accompanying notebook, the perplexity is calculated while varying the number of topics; please try it out.

(Figure: perplexity for different numbers of topics)

Visualizing the topics also makes the model easier to evaluate. The methods are explained below.

Distance between topics

Since topics are classifications of documents, the classification should be without omission or duplication, in other words clearly separated. A good model is therefore one in which the topics are at a good distance from one another.

KL divergence can be used as an index of the distance between the classifications (= the word distribution of each topic). Using it, the figure below illustrates the distances between topics (the model was built with 3 topics; the axis values are topic numbers).

(Figure: heatmap of the distances between the 3 topics; axis values are topic numbers)

The figure above is a bad example. The farther apart two topics are, the lighter the cell, so a figure that is dark overall means that similar topics (= categories) have emerged. In such a case, consider reducing the number of topics.
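A sketch of how such a distance matrix can be computed, assuming `m` is the trained LdaModel (note that KL divergence is asymmetric, so the matrix need not be symmetric):

    import numpy as np
    from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

    topics = m.get_topics()  # word distribution of each topic: (num_topics, vocab size)
    n = topics.shape[0]
    distances = np.array([[entropy(topics[i], topics[j]) for j in range(n)]
                          for i in range(n)])
    print(distances)  # diagonal is 0; small off-diagonal values mean similar topics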

Topic structure of documents

When actually classifying documents, it is desirable to be able to say clearly that the topic of a given document is such-and-such. In other words, each document should have a clear main topic.

With that in mind, the figure below illustrates the topics that make up each document (after the verification above, the number of topics was reduced to 2). 200 documents were picked at random and the topic composition ratio of each is displayed.

(Figure: topic composition ratios of 200 randomly sampled documents)
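The per-document composition can be obtained as sketched below, assuming `m` and `corpus` from before (3 documents instead of 200, for brevity):

    import random

    # Each document's topic mixture as (topic ID, ratio) pairs.
    for bow in random.sample(corpus, 3):
        print(m.get_document_topics(bow))  # e.g. [(0, 0.93), (1, 0.07)]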

Inspecting the topics

Finally, let's look at which words are likely to appear in each topic. The less the words overlap between topics, and the more easily you can guess what a topic is about from its set of words, the better the model.

(Figure: frequently appearing words in each topic)

Looking at this, Topic #1 at least can be inferred, from keywords like "small salon" and "one stylist", to be a small salon (and given the keyword "complete reservation system", it may be a rather high-end place). Topic #0 gives the impression of a relatively large hair salon with many staff that is open all year round. Actually looking at the salons' homepages, Topic #0 does feel like the major chains and Topic #1 like the specialty shops.
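The word list of each topic can be pulled out as sketched here, assuming `m` is the trained 2-topic model:

    # The ten most probable words of each topic, as (word, probability) pairs.
    for topic_id in range(2):
        print(topic_id, m.show_topic(topic_id, topn=10))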

Incidentally, a method like the topic model, where the system itself discovers features in the data and classifies it without being taught by a person, is called unsupervised learning (conversely, a method that is trained on data paired with its classification is called supervised learning). Unsupervised learning has the advantage that you can start immediately as long as you have data, but as seen above it leaves "how to classify" up to the model, so it has the disadvantage that the results can be hard to interpret.

Applications

As shown here, gensim makes it easy to build a topic model, grasp the characteristics of documents, and classify them. Built into an application, this could power features such as automatic document classification or recommending documents with similar topic compositions.

Also, as mentioned at the beginning, the topic model is very versatile, and it has many extensions.

Specifically, there are the correspondence topic model, which can take into account extra information, such as ratings, attached to text like reviews, and the author topic model, which takes into account who wrote the text. There have also been attempts to apply topic models to image classification by treating image features like words; since text and images can then be handled in the same framework, this is applied to research such as automatic image annotation (captioning).

I hope this commentary will help you to put your ideas into shape.

References

Here are some useful books and articles for those who want to learn more.

The following articles carefully explain basic ideas such as probability.

If you want to implement it with PyMC, this tutorial will be helpful.
