[PYTHON] RNN_LSTM2 Natural language processing

Aidemy 2020/11/10

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read the previous summary article. Thank you! This is the second post on RNN_LSTM. I hope you find it useful.

What to learn this time
・ Morphological analysis, etc.
・ Natural language processing by deep learning

Natural language processing with RNN / LSTM

Review of natural language processing

-In natural language processing, sentences first need to be split into __words__. Methods for this include __"morphological analysis"__ and the __"N-gram model"__; in practice, morphological analyzers such as MeCab are mainly used. This is described later.
-The __N-gram model__ splits text into units of __N consecutive characters__, or splits a sentence into units of N consecutive words, and counts how often each unit occurs. It is not limited to Japanese and __can be used with any language__.
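As a rough illustration of the N-gram idea, the sketch below builds character N-grams and word N-grams and counts how often each one occurs. The function names `char_ngrams` and `word_ngrams` are made up for this example and do not come from any library.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Character bigrams (N = 2) and how often each one appears
bigram_counts = Counter(char_ngrams("abcab", 2))

# Word bigrams from a sentence that is already split into words
word_bigrams = word_ngrams("I like natural language processing".split(), 2)
```

Character N-grams need no dictionary at all, which is why the approach works for any language; word N-grams assume the sentence has already been split into words.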

Japanese morphological analysis

-The first step in Japanese natural language processing is __"morphological analysis"__. This time, we perform morphological analysis using MeCab.
-To use it, first create an instance with __"MeCab.Tagger('')"__, and then call __"parse()"__ on the text to run the analysis.
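A minimal sketch of this flow might look like the following. It assumes MeCab's default output format, where each line holds a surface form, a tab, and comma-separated features, and the sentence ends with an `EOS` line. The helper names `parse_mecab_output` and `analyze` are hypothetical, and the exact features depend on the installed dictionary.

```python
def parse_mecab_output(raw):
    """Turn MeCab's default text output into (surface, features) pairs."""
    tokens = []
    for line in raw.splitlines():
        if not line or line == "EOS":  # "EOS" marks the end of a sentence
            continue
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens


def analyze(text):
    """Run MeCab on text; needs the MeCab bindings and a dictionary installed."""
    import MeCab  # imported here so the parser above works without MeCab

    tagger = MeCab.Tagger("")
    return parse_mecab_output(tagger.parse(text))
```

Calling `analyze(text)` on a Japanese sentence would then return one entry per morpheme, with the part of speech as the first feature.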

Parsing

-Syntactic analysis (parsing) is performed after morphological analysis. Parsing is a method that __decomposes each sentence into clauses__ and determines the __dependency relations between those clauses__.
-A program that performs parsing is called a parser; examples include __"CaboCha"__ and __"KNP"__.

Natural language processing by deep learning

Data preprocessing

-__Data preprocessing__ is a __very important step__ in natural language processing. The main flow is as follows. (See "Natural Language Processing" for details.)
① __Cleansing__: Removes HTML tags and other items unrelated to the text.
② __Morphological analysis__
③ __Word normalization__: Unifies notation variants, etc.
④ __Stop word removal__: Removes words (particles, etc.) that occur frequently but carry little meaning.
⑤ __Vectorization__: Converts words into vectors (numerical values).
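Steps ①, ③ and ④ can be sketched with the standard library alone. This is only an illustration under several assumptions: the input is already separated by spaces (standing in for step ②), normalization is taken to mean NFKC plus lowercasing, and `STOP_WORDS` is a tiny made-up set rather than a real stop word list.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def cleanse(text):
    """Step 1: strip HTML tags (a simple regex, not a full HTML parser)."""
    return TAG_RE.sub("", text)


def normalize(word):
    """Step 3: unify notation variants with NFKC and lowercasing."""
    return unicodedata.normalize("NFKC", word).lower()


def remove_stop_words(words, stop_words):
    """Step 4: drop frequent words that carry little meaning."""
    return [w for w in words if w not in stop_words]


STOP_WORDS = {"は", "の", "を"}  # illustrative only, not a real stop word list

html = "<p>私 は ＡＩ を 学ぶ</p>"
# split() stands in for morphological analysis on pre-segmented text
words = [normalize(w) for w in cleanse(html).split()]
result = remove_stop_words(words, STOP_WORDS)
```

Here the full-width "ＡＩ" is normalized to "ai", and the particles are removed before the remaining words go on to vectorization.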

Word-separation with MeCab

-If you want to use MeCab __only for word segmentation__, use __"MeCab.Tagger('-Owakati')"__. Run it with __"parse(text)"__ and split the space-separated result into words with __"split(' ')"__.
-Create a list of unique words in dictionary order with __"sorted(set(word_list))"__.
-From this list (vocab), create a __"{word: id}"__ dictionary with __"{c: i for i, c in enumerate(vocab)}"__.
-Apply this dictionary to each word in word_list to get a list of ids (__"[vocab_to_int[x] for x in word_list]"__).

・ Code ![Screenshot 2020-11-10 13.05.10.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/42e409e1-a194-f8c8-e2c5-f4c33c5bdff4.png)

・ Result ![Screenshot 2020-11-10 13.05.21.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/da7b1a42-7ab9-0a38-e237-bcaef1a1d274.png)
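Since the code above is only available as a screenshot, here is a text sketch of the same steps. It is not a transcription of the screenshot; the helper names `wakati_split` and `build_vocab` are hypothetical, and a short English word list stands in for MeCab's output so the id mapping can be followed without MeCab installed.

```python
def wakati_split(text):
    """Split Japanese text into words with MeCab (needs MeCab installed)."""
    import MeCab

    tagger = MeCab.Tagger("-Owakati")
    # split() with no argument also drops any trailing whitespace or newline
    return tagger.parse(text).split()


def build_vocab(word_list):
    """Map each unique word to an integer id, in sorted order."""
    vocab = sorted(set(word_list))
    return {c: i for i, c in enumerate(vocab)}


word_list = ["cat", "dog", "cat", "bird"]  # stand-in for wakati_split(text)
vocab_to_int = build_vocab(word_list)
int_list = [vocab_to_int[x] for x in word_list]
```

Because the vocabulary is sorted before ids are assigned, the same set of words always produces the same ids, regardless of the order in which the words appear in the text.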

Embedding

-To __vectorize words__, use the __Embedding layer__. The implementation uses the __Sequential model__.
-If the id list "int_list" created at the end of the previous section is converted into a form the model accepts (input_data) and passed to model.predict, a list of vectors is returned.

・ Code ![Screenshot 2020-11-10 13.06.59.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/2d92b0ed-b69b-5dba-ed7b-9f6c26fe0160.png)
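To see what the Embedding step produces, the NumPy sketch below imitates the lookup that an embedding layer performs. It is a conceptual illustration, not the Keras code from the screenshot: in Keras, `Embedding(input_dim=vocab_size, output_dim=embedding_dim)` inside a Sequential model plays this role, with weights that are learned during training. The sizes and the random weights here are arbitrary assumptions.

```python
import numpy as np

vocab_size = 3      # number of distinct word ids
embedding_dim = 4   # length of each word vector (arbitrary here)

rng = np.random.default_rng(0)
# One row per word id; a real Embedding layer learns these values in training
weights = rng.normal(size=(vocab_size, embedding_dim))

int_list = [1, 2, 1, 0]
input_data = np.array([int_list])  # shape (batch, sequence_length) = (1, 4)
vectors = weights[input_data]      # shape (1, 4, 4): one vector per word
```

The same id always maps to the same row, so the two occurrences of id 1 in `int_list` receive identical vectors.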

Implementation of natural language processing by LSTM

-Although the data handled this time is natural language data, the implementation itself is not much different from what has been done so far, so I will only post the code here.

・ Code ![Screenshot 2020-11-10 13.07.21.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/65b2e6b5-6bd8-7517-ecbc-2efd8b701f01.png)

Summary

-When performing natural language processing with deep learning, __data preprocessing__ is important. Specifically, because __words must be vectorized (converted to numbers)__, the sentences first have to be split into words by __morphological analysis__ and the __unnecessary parts removed__.
-Use Embedding when vectorizing words. The implementation uses the Sequential model.
-The model that performs natural language processing can itself be __implemented in the same way as other models__.

That is all for this time. Thank you for reading to the end.
