Aidemy 2020/11/10
Hello, this is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I am studying at the AI-specialized school "Aidemy". I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read my previous summary article. Thank you! This is the second post on RNN_LSTM. Nice to meet you.
What you will learn this time
・Morphological analysis and related preprocessing
・Natural language processing with deep learning
-In natural language processing, sentences must first be divided into individual __words__. Methods for this include __"morphological analysis"__ and the __"N-gram model"__; morphological analyzers such as MeCab are commonly used (described later).
-The __N-gram model__ is a method that divides text into sequences of N characters, or sentences into sequences of N words, and counts the frequency of each sequence. Because it does not rely on a dictionary, it can be used in __any language__, not just Japanese.
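The N-gram idea above can be sketched in plain Python. This is a minimal illustration; the sample string and `N=2` are arbitrary choices, not from the article.

```python
# Character N-grams: slide a window of length n over the text
# and count how often each sequence appears.
from collections import Counter

def char_ngrams(text, n):
    """Split text into overlapping sequences of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = char_ngrams("banana", 2)
print(bigrams)           # ['ba', 'an', 'na', 'an', 'na']
print(Counter(bigrams))  # frequency of each bigram
```

The same function works on a list of words instead of a string, which gives word N-grams.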
-The first step in Japanese natural language processing is __"morphological analysis"__. This time, we perform morphological analysis with MeCab.
-To use it, first create an instance with __"MeCab.Tagger('')"__, then call __"parse()"__ to perform the morphological analysis.
-Syntactic (dependency) parsing is performed after morphological analysis. It is a method that decomposes each sentence into __clauses__ and determines the __dependency relations__ between them.
-A program that performs this parsing is called a parser; examples include __"CaboCha"__ and __"KNP"__.
-__Data preprocessing__ is a very important step in natural language processing. The main flow is as follows (see "Natural Language Processing" for details):
①__Cleansing__: remove HTML tags and other elements unrelated to the text itself
②__Morphological analysis__: split sentences into words
③__Word normalization__: unify spelling variants, etc.
④__Stop-word removal__: remove frequent words that carry little meaning (particles, etc.)
⑤__Vectorization__: convert words into vectors (numbers)
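Steps ①, ②, and ④ can be sketched as follows. The sample text, tag pattern, and stop-word list are illustrative assumptions, and the morphological-analysis step is faked with a hand-made token list so the sketch runs without MeCab.

```python
import re

raw = "<p>今日は 良い天気です。</p>"

# ① Cleansing: strip HTML tags unrelated to the text itself
cleaned = re.sub(r"<[^>]+>", "", raw)

# ② Morphological analysis would go here (e.g. MeCab);
#    we substitute its expected output by hand
words = ["今日", "は", "良い", "天気", "です", "。"]

# ④ Stop-word removal: drop particles and symbols that carry little meaning
stop_words = {"は", "です", "。"}
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['今日', '良い', '天気']
```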
-If you only want MeCab to tokenize the text (__wakati-gaki__), use __"MeCab.Tagger('-Owakati')"__. Run it with __"parse(text)"__ and split the space-separated result into individual words with __"split(' ')"__.
-Create a deduplicated list of the words in dictionary order with __"sorted(set(word_list))"__.
-From this list (vocab), build a __"{word: id}"__ dictionary with __"{c: i for i, c in enumerate(vocab)}"__.
-Apply this dictionary to each element of word_list to get a list of ids: __"[vocab_to_int[x] for x in word_list]"__.
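The id-mapping steps above can be sketched in plain Python. Here a hand-made English word list stands in for MeCab's tokenized output, so the sketch runs without MeCab installed.

```python
# Stand-in for the tokenizer output: space-separated words
word_list = "the cat sat on the mat".split(' ')

vocab = sorted(set(word_list))                      # unique words in dictionary order
vocab_to_int = {c: i for i, c in enumerate(vocab)}  # {word: id}
int_list = [vocab_to_int[x] for x in word_list]     # each word replaced by its id

print(vocab)     # ['cat', 'mat', 'on', 'sat', 'the']
print(int_list)  # [4, 0, 3, 2, 4, 1]
```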
・ Code
・ Result
Embedding
-To __vectorize words__, use an __Embedding layer__, implemented in a __Sequential model__.
-If the id list "int_list" created at the end of the previous section is reshaped into a form (input_data) that the model accepts and passed to model.predict, a list of word vectors is returned.
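A minimal sketch of this, using Keras; the vocabulary size and vector dimension are arbitrary assumptions, and `int_list` is the example id list from the previous section.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

vocab_size = 10  # number of distinct word ids (assumed)
embed_dim = 4    # length of each word vector (assumed)

# A Sequential model whose only layer turns ids into vectors
model = Sequential([Embedding(input_dim=vocab_size, output_dim=embed_dim)])

int_list = [4, 0, 3, 2, 4, 1]      # id list from the previous section
input_data = np.array([int_list])  # shape (1, sequence_length)
vectors = model.predict(input_data)
print(vectors.shape)  # (1, 6, 4): one 4-dimensional vector per word
```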
・ Code
-Although the data handled this time is natural language data, the implementation itself is not much different from what we have done so far, so I will only post the code here.
・ Code
-When performing natural language processing with deep learning, __data preprocessing__ is important. Specifically, because __"words must be vectorized (turned into numbers)"__, the sentence is first split into words by __morphological analysis__ and the __unnecessary parts are removed__.
-Use an Embedding layer when vectorizing words. The implementation uses the Sequential model.
-The model that performs natural language processing is itself implemented __in the same way as other models__.
That's all for this time. Thank you for reading to the end.