[PYTHON] Preparing to start natural language processing

In this article, I outline the kinds of work involved in natural language processing.

Morphological analysis

Morphological analysis is the task of splitting sentences into words, identifying each word's part of speech, and returning inflected forms to their base form. For example, the phrase "taking an exam that fits your height" is analyzed as follows.

Height Noun, General, *, *, *, *, Height, Minotake, Minotake
Ni particle, case particle, general, *, *, *, ni, ni, ni
Matching verb, independence, *, *, 5th dan / wa line reminder, continuous use connection, matching, ah, ah
Ta auxiliary verb, *, *, *, special ta, uninflected word, ta, ta, ta
Exam Noun, Sahen Connection, *, *, *, *, Exam, Juken, Juken
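To work with output like this in a program, each line has to be split into its fields. The following is a minimal sketch, assuming MeCab's default text output with the IPADIC dictionary, where each line is the surface form, a tab, and comma-separated features, with the base form in the seventh feature field. The `Token` type, the function name, and the sample string are hypothetical; the sample was typed by hand in that layout rather than produced by running MeCab.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    surface: str    # the word as it appears in the text
    pos: str        # part of speech (first feature field)
    base_form: str  # dictionary form (seventh feature field, assumed IPADIC layout)


def parse_mecab_output(output: str) -> List[Token]:
    """Parse lines of the form 'surface<TAB>feature1,feature2,...'."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line.strip():
            continue  # "EOS" marks the end of a sentence
        surface, _, feature_str = line.partition("\t")
        features = feature_str.split(",")
        # Unknown words may have fewer fields or "*" as the base form,
        # so fall back to the surface form in that case.
        if len(features) > 6 and features[6] != "*":
            base = features[6]
        else:
            base = surface
        tokens.append(Token(surface, features[0], base))
    return tokens


# Hand-written sample in the assumed layout (not real MeCab output).
sample = (
    "合っ\t動詞,自立,*,*,五段・ワ行促音便,連用タ接続,合う,アッ,アッ\n"
    "C\t名詞,一般,*,*,*,*,*\n"
    "EOS"
)
tokens = parse_mecab_output(sample)
```

Collecting `base_form` instead of `surface` is what lets inflected forms be counted together with their dictionary form.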

I use MeCab for morphological analysis. https://taku910.github.io/mecab/ It is widely used, but it is weak on new words and technical terms because it splits words more finely than necessary, for example turning "registered dietitian exam" into "administrative / nutrition / specialist / exam". For new words, a frequently updated dictionary called mecab-ipadic-NEologd has been published to make up for this weakness. https://github.com/neologd/mecab-ipadic-neologd On top of that, I maintain my own local dictionary of new words for the documents I analyze and keep adding entries to it.

For higher accuracy, I would like to use JUMAN++. http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++ When "Foreigners to vote" is processed by MeCab, it becomes "Foreign / Ginseng / Government", but JUMAN++ analyzes it as "Foreigners / Suffrage". Its distinguishing feature is that analysis accuracy is improved by machine learning. However, when I build JUMAN++ with boost, I hit an error that I have not been able to resolve, and that is where I am stuck. I may end up continuing to use MeCab as it is.

Preprocessing

If the text contains HTML tags, they need to be removed. Beyond that, if you skip preprocessing before morphological analysis, characters such as ① and i show up as frequent keywords, and words with the same meaning are counted separately.
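Tag removal can be sketched with the standard library alone. The example below is one possible approach, not the method used in this article: it uses `html.parser.HTMLParser` to keep only text nodes and to drop the contents of `script` and `style` elements. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        return "".join(self._chunks)


def strip_html(html: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text().split())


plain = strip_html("<p>Hello <b>world</b> &amp; all</p><script>var x = 1;</script>")
```

For badly broken markup, a dedicated HTML library may be more robust than this sketch.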

Emoticons

Emoticons are a trouble spot. Emoticon research in fact runs deep, with decades of history. If you are interested, please read the special feature in the Journal of the Japanese Society for Artificial Intelligence, Vol. 32 No. 3 (2017/05). https://www.ai-gakkai.or.jp/vol32_no3/ As a stopgap, I found an existing emoticon dictionary and added it to my own dictionary, but its coverage is not high. Depending on the text being analyzed, you may have to tackle emoticons head-on.

Normalization

Normalization prevents variants of the same word from being counted separately. It covers the following.

Unify full-width and half-width characters, and uppercase and lowercase letters (e.g., Qiita, QIITA, Ｑｉｉｔａ, ＱＩＩＴＡ)

Align notation (e.g., high 3, high ３, high school third grade, high three, high third grade)

Align abbreviations (e.g., an abbreviated name and the full name Hiroshima University)
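The width and case part of normalization can be sketched with the standard library. This is a minimal example assuming that Unicode NFKC normalization followed by lowercasing is acceptable for the documents at hand; NFKC also folds other compatibility characters, such as turning ① into 1, which may or may not be wanted. The alias table is hypothetical and only illustrates how notation variants and abbreviations could be mapped to one form.

```python
import unicodedata

# Hypothetical alias table: maps notation variants to one canonical form.
ALIASES = {
    "高三": "高校3年",
    "高3": "高校3年",
}


def normalize_text(text: str) -> str:
    """Fold full-width alphanumerics to half-width and lowercase the result."""
    return unicodedata.normalize("NFKC", text).lower()


def canonical_token(token: str) -> str:
    """Normalize a single token, then apply the alias table if it matches."""
    normalized = normalize_text(token)
    return ALIASES.get(normalized, normalized)


variants = ["Qiita", "QIITA", "Ｑｉｉｔａ", "ＱＩＩＴＡ"]
unified = {normalize_text(v) for v in variants}
```

Applying `canonical_token` to each word after morphological analysis keeps the alias lookup at the word level, so it does not rewrite substrings of longer words.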

Numbers

How to handle numbers depends on the type of document. In ordinary text, you may simply delete all numbers. For sports records and other documents where numbers are keywords, however, it is better to keep them as numeric values. The trouble then is that MeCab splits a number with a decimal point into separate words: the integer part, the period, and the fractional part. You therefore need to run the morphological analysis once and then restore the numeric values afterwards.
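Restoring decimal numbers after analysis can be sketched as a pass over the token list. The function below is a minimal example that assumes the tokens arrive as plain strings in the order integer part, ".", fractional part; the function names are hypothetical.

```python
from typing import List


def _is_ascii_digits(token: str) -> bool:
    """True for tokens made only of the ASCII digits 0-9."""
    return token.isascii() and token.isdigit()


def merge_decimals(tokens: List[str]) -> List[str]:
    """Join 'integer', '.', 'fraction' token triples back into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        if (
            i + 2 < len(tokens)
            and _is_ascii_digits(tokens[i])
            and tokens[i + 1] == "."
            and _is_ascii_digits(tokens[i + 2])
        ):
            merged.append(tokens[i] + "." + tokens[i + 2])
            i += 3  # skip the three tokens that were just joined
        else:
            merged.append(tokens[i])
            i += 1
    return merged


restored = merge_decimals(["time", "9", ".", "58", "seconds"])
```

Full-width digits are not matched here, so width normalization should run before this step. Dotted strings such as version numbers need separate handling.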

Stop word removal

A stop word is a word that appears in virtually every document, such as "I" or "is". I treat non-independent words, and words that appear frequently across the documents being analyzed, as stop words and exclude them.
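The frequency-based half of this rule can be sketched as follows. This is a minimal example, assuming each document is already a list of words and that "frequent" means appearing in more than a given fraction of the documents; the 0.8 threshold, the function names, and the sample documents are hypothetical. Exclusion by part of speech is not covered here.

```python
from collections import Counter
from typing import List, Set


def frequent_words(docs: List[List[str]], max_doc_ratio: float = 0.8) -> Set[str]:
    """Return words that occur in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    limit = max_doc_ratio * len(docs)
    return {word for word, count in doc_freq.items() if count > limit}


def remove_stop_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Drop every token that is in the stop word set."""
    return [t for t in tokens if t not in stop_words]


docs = [
    ["I", "like", "ramen"],
    ["I", "like", "sushi"],
    ["I", "eat", "ramen"],
]
stop_words = frequent_words(docs)
filtered = [remove_stop_words(d, stop_words) for d in docs]
```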

Other

Chemical formulas, mathematical and physics formulas, URLs, product codes, and model numbers should be excluded from morphological analysis. For example, if you simply give MeCab the chemical formula of phenol (C6H5OH), the result looks like this.

C noun,General,*,*,*,*,*
6 noun,number,*,*,*,*,*
H noun,General,*,*,*,*,*
5 noun,number,*,*,*,*,*
OH noun,Proper noun,Organization,*,*,*,*

MeCab does not recognize this as phenol at all. Likewise, analyzing English sentences or program code with the same method as Japanese is often meaningless.
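One way to keep such strings away from the analyzer is to pull them out of the text beforehand. The sketch below does this for URLs only, assuming they start with http:// or https:// and consist of ASCII characters; the pattern and function name are hypothetical. The same idea could be applied to product codes or formulas with patterns suited to the documents at hand.

```python
import re
from typing import List, Tuple

# ASCII-only URL pattern, so that Japanese text directly after a URL is not swallowed.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def split_out_urls(text: str) -> Tuple[str, List[str]]:
    """Return the text with URLs replaced by a space, plus the list of URLs found."""
    urls = URL_PATTERN.findall(text)
    remaining = URL_PATTERN.sub(" ", text)
    return remaining, urls


remaining, urls = split_out_urls("MeCab https://taku910.github.io/mecab/ is widely used")
```

The extracted URLs can be kept as metadata or discarded, while only `remaining` is passed to morphological analysis. Punctuation that directly follows a URL may be captured as part of it, so the pattern is only a starting point.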

Document vectorization

Document vectorization turns each document into a vector, using the words it contains as clues (Doc2Vec). Similar documents should end up with similar vectors. Doc2Vec is included in the gensim library. https://radimrehurek.com/gensim/

Once you have vectors, you can group similar documents. Topic analysis, described below, is another technique for grouping similar documents, and in some cases topic analysis and vectorization are used together. I plan to experiment in this area as well.
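How "similar vectors" can be compared is sketched below with cosine similarity, a common choice for document vectors. This is only an illustration in plain Python: the three-dimensional vectors and document names are made up for the example and do not come from Doc2Vec, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target: str, vectors: Dict[str, Sequence[float]]) -> str:
    """Return the name of the document whose vector is closest to the target's."""
    others = [name for name in vectors if name != target]
    return max(others, key=lambda name: cosine_similarity(vectors[target], vectors[name]))


# Made-up vectors standing in for real document vectors.
vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [0.0, 0.1, 0.9],
}
nearest = most_similar("doc_a", vectors)
```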

Topic analysis

Topic analysis automatically classifies morphologically analyzed documents into a specified number of topics. This can also be done with gensim. gensim is a formidable library.

BERT

This is a natural language processing model announced by Google last year. I want to use it, but I have not looked into it at all yet. I will study it from here on.
