In this article, I will give an overview of the kinds of work involved in natural language processing.
Morphological analysis is the task of splitting sentences into words, looking up each word's part of speech, and restoring inflected forms to their dictionary forms. For example, the phrase "an exam that fits your height" is analyzed as follows.
Height    Noun, General, *, *, *, *, Height, Minotake, Minotake
Ni        Particle, Case particle, General, *, *, *, ni, ni, ni
Matching  Verb, Independent, *, *, Godan wa-row sound change, Continuative ta-connection, matching, ah, ah
Ta        Auxiliary verb, *, *, *, Special ta, Base form, ta, ta, ta
Exam      Noun, Suru-verb connection, *, *, *, *, Exam, Juken, Juken
MeCab is used for morphological analysis. https://taku910.github.io/mecab/ It is widely used, but it is weak on new words and technical terms because it splits them more finely than necessary, for example breaking "registered dietitian exam" into "administration / nutrition / practitioner / exam". For new words, a frequently updated dictionary called mecab-ipadic-NEologd is published to make up for this weakness of MeCab. https://github.com/neologd/mecab-ipadic-neologd On top of that, I build my own dictionary of new words from the documents being analyzed and add it to the mix.
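As a minimal sketch of how the two dictionaries differ, assuming mecab-python3 is installed and NEologd was built at the path shown (the actual path varies by environment, and the Japanese rendering of the example term is my back-translation):

```python
import MeCab

# Assumed NEologd location; find yours with:
#   echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
NEOLOGD_DIR = "/usr/lib/mecab/dic/mecab-ipadic-neologd"

default_tagger = MeCab.Tagger()                     # bundled IPA dictionary
neologd_tagger = MeCab.Tagger(f"-d {NEOLOGD_DIR}")  # NEologd dictionary

text = "管理栄養士試験"               # "registered dietitian exam"
print(default_tagger.parse(text))   # IPA dictionary tends to over-split compounds
print(neologd_tagger.parse(text))   # NEologd keeps many known compounds whole
```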
For higher accuracy, I would like to use JUMAN++. http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++ When "foreigners' suffrage" is processed by MeCab it comes out as "foreign / ginseng / government", but JUMAN++ analyzes it as "foreigners / suffrage". Its distinguishing feature is that analysis accuracy is improved through machine learning. However, when building JUMAN++ with Boost I hit an error I still have not been able to resolve, so I may end up using MeCab as it is.
If the text contains HTML tags, you need to remove them. Beyond that, if you skip preprocessing before morphological analysis, characters such as ① and i show up as frequent keywords, and words with the same meaning get counted separately.
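A minimal tag-stripping sketch, assuming BeautifulSoup is available (a regex such as re.sub(r"<[^>]+>", "", raw) is a cruder alternative):

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def strip_html(raw: str) -> str:
    """Remove HTML tags and collapse the leftover whitespace."""
    text = BeautifulSoup(raw, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

print(strip_html("<p>An exam that <b>fits</b> your height</p>"))
# -> An exam that fits your height
```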
Emoticons are a real trouble spot. Emoticon research is, in fact, a deep field with decades of history. If you are interested, read the special feature in the Journal of the Japanese Society for Artificial Intelligence Vol. 32 No. 3 (2017/05). https://www.ai-gakkai.or.jp/vol32_no3/ I picked up an emoticon dictionary and added it to my own dictionary as a stopgap, but its coverage is not high. Depending on the text you analyze, you may have to tackle emoticons head-on.
This is a preprocessing step to keep variant notations of the same word from being counted as different words.
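One common normalization step is Unicode NFKC, which already handles cases like the ① above; a sketch using only the standard library:

```python
import unicodedata

def normalize(text: str) -> str:
    # NFKC folds full-width ASCII to half-width, half-width kana to
    # full-width, and compatibility characters such as ① to plain digits.
    return unicodedata.normalize("NFKC", text)

print(normalize("①"))      # -> 1
print(normalize("ＡＢＣ"))  # -> ABC (full-width to half-width)
print(normalize("ｶﾀｶﾅ"))    # -> カタカナ (half-width kana to full-width)
```

NFKC alone does not unify synonyms or spelling variants, so corpus-specific rules are still needed on top of it.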
How to handle numbers may depend on the type of document. In ordinary prose, you can usually just delete all numbers. On the other hand, for sports records and other documents where numbers are the keywords, it is better to keep them as numbers. The trouble in that case is that MeCab splits a decimal number into separate tokens: an integer part, a period, and a fractional part. So after running morphological analysis, you need a post-processing step to reassemble the numeric values.
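A sketch of that reassembly step, assuming a simplified (surface, part-of-speech) token representation rather than MeCab's actual output format:

```python
def rejoin_decimals(tokens):
    """Merge [integer, '.', fraction] runs that the analyzer split apart.

    `tokens` is a hypothetical list of (surface, pos) pairs produced by
    a prior morphological-analysis step.
    """
    merged = []
    i = 0
    while i < len(tokens):
        surface, pos = tokens[i]
        if (pos == "number" and i + 2 < len(tokens)
                and tokens[i + 1][0] == "."
                and tokens[i + 2][1] == "number"):
            # integer part + period + fractional part -> one token
            merged.append((surface + "." + tokens[i + 2][0], "number"))
            i += 3
        else:
            merged.append((surface, pos))
            i += 1
    return merged

print(rejoin_decimals([("3", "number"), (".", "symbol"), ("14", "number")]))
# -> [('3.14', 'number')]
```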
A stop word is a word that appears in almost any document, such as "I" or "is". I exclude non-independent words and words that appear too frequently in the target documents as stop words.
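A minimal filtering sketch; the stop-word list here is a hypothetical stand-in for one built as just described:

```python
# Hypothetical stop-word list; in practice, build it from non-independent
# words and from words that appear too often in the documents at hand.
STOP_WORDS = {"I", "is", "the", "a", "of"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "exam", "is", "hard"]))
# -> ['exam', 'hard']
```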
Chemical formulas, mathematical and physical formulas, URLs, product codes, and model numbers should be excluded from morphological analysis. For example, if you feed MeCab the chemical formula of phenol (C6H5OH) as-is, you get this:
C   Noun, General, *, *, *, *, *
6   Noun, Number, *, *, *, *, *
H   Noun, General, *, *, *, *, *
5   Noun, Number, *, *, *, *, *
OH  Noun, Proper noun, Organization, *, *, *, *
There is no way to recognize this as phenol. Likewise, it is usually meaningless to run English sentences or program code through the same pipeline as Japanese text.
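One way to keep such strings out of the analysis is to mask them beforehand. The patterns below are rough illustrative assumptions, not exhaustive rules:

```python
import re

PATTERNS = [
    re.compile(r"https?://\S+"),                         # URLs
    re.compile(r"\b[A-Z][A-Za-z0-9]*\d[A-Za-z0-9]*\b"),  # formulas, model numbers
]

def mask_non_words(text: str, placeholder: str = " ") -> str:
    for pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(mask_non_words("Phenol (C6H5OH) is described at https://example.com"))
# -> 'Phenol ( ) is described at  '
```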
Doc2Vec vectorizes a document using the words it contains as clues. Similar documents should end up with similar vectors. Doc2Vec is included in a library called gensim. https://radimrehurek.com/gensim/
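A minimal Doc2Vec sketch with gensim (4.x API assumed); the toy documents stand in for the token lists produced by morphological analysis:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy token lists standing in for real morphologically analyzed documents.
docs = [
    ["exam", "study", "student"],
    ["suffrage", "election", "vote"],
    ["exam", "test", "student"],
]
tagged = [TaggedDocument(words=d, tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=40)

print(model.dv.most_similar(0))              # documents most similar to doc 0
print(model.infer_vector(["exam", "test"]))  # vector for an unseen document
```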
Once you have vectors, you can group similar documents together. Topic analysis, described later, is another technique for grouping similar documents, and in some cases the two are used in combination. This is another area where I would like to keep experimenting.
Morphologically analyzed documents are automatically classified into a specified number of topics. This too can be done with gensim. gensim is formidable.
LDA (Latent Dirichlet Allocation): Given a collection of documents, LDA divides it into the specified number of topics. The hard part is knowing how many topics to divide into. There are indicators such as perplexity and coherence, but when I tried them they did not seem all that useful. I attempted to compute perplexity and coherence with R's topicmodels package. https://cran.r-project.org/web/packages/topicmodels/index.html Perhaps because the document collection was too large, even two straight weeks of computation was not enough to draw a single graph.
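A minimal LDA sketch with gensim, including the perplexity and coherence indicators mentioned above (toy documents for illustration):

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

texts = [["exam", "study"], ["election", "vote"],
         ["exam", "test"], ["suffrage", "election"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)

# The indicators mentioned above, for what they are worth:
print("perplexity:", lda.log_perplexity(corpus))
coherence = CoherenceModel(model=lda, texts=texts,
                           dictionary=dictionary, coherence="c_v")
print("coherence:", coherence.get_coherence())
```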
DTM (Dynamic Topic Model): An extension of LDA used when topics shift over time, for example when analyzing SNS posts over a period during which breaking news changes the topic mix.
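gensim ships a DTM implementation as LdaSeqModel; a sketch reusing `corpus` and `dictionary` from the LDA example, with an assumed split of two documents per time period (note that it is quite slow):

```python
from gensim.models import LdaSeqModel

# time_slice gives the number of documents in each period (assumed here).
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary,
                     time_slice=[2, 2], num_topics=2)
print(ldaseq.print_topics(time=0))  # topics as of the first period
```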
HDP (Hierarchical Dirichlet Process): This extends LDA and is supposed to tell you how many topics the documents should be divided into, but when I fed it my own dataset, more than half of the resulting topics looked alike, so it failed to produce a meaningful division.
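A minimal HDP sketch with gensim, again reusing `corpus` and `dictionary` from the LDA example; HDP infers the number of topics itself:

```python
from gensim.models import HdpModel

hdp = HdpModel(corpus=corpus, id2word=dictionary)
print(hdp.print_topics(num_topics=5, num_words=5))
```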
BERT: A natural language processing model announced by Google last year. I want to use it, but I have not investigated it at all yet. I will study it from now on.