Aidemy 2020/11/10
Hello, this is Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I am studying at the AI-specialized school "Aidemy". I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read my previous summary article. Thank you! This is the second post on RNN_LSTM. Nice to meet you.
What you will learn this time
・Morphological analysis and related preprocessing
・Natural language processing with deep learning
-In natural language processing, sentences must first be divided into individual __words__. Methods for this include __"morphological analysis"__ and the __"N-gram model"__; morphological analyzers such as MeCab are commonly used (described later).
-The __N-gram model__ is a method that divides text into sequences of N characters, or sentences into sequences of N words, and counts the frequency of each sequence. Because it does not rely on a dictionary, it can be used in __any language__, not just Japanese.
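The N-gram idea above can be sketched in plain Python. This is a minimal illustration; the sample string and `N=2` are arbitrary choices, not from the article.

```python
# Character N-grams: slide a window of length n over the text
# and count how often each sequence appears.
from collections import Counter

def char_ngrams(text, n):
    """Split text into overlapping sequences of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = char_ngrams("banana", 2)
print(bigrams)           # ['ba', 'an', 'na', 'an', 'na']
print(Counter(bigrams))  # frequency of each bigram
```

The same function works on a list of words instead of a string, which gives word N-grams.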
-The first step in Japanese natural language processing is __"morphological analysis"__. This time, we perform morphological analysis with MeCab.
-To use it, first create an instance with __"MeCab.Tagger('')"__, then call __"parse()"__ to perform the morphological analysis.
-Syntactic (dependency) parsing is performed after morphological analysis. It is a method that decomposes each sentence into __clauses__ and determines the __dependency relations__ between them.
-A program that performs this parsing is called a parser; examples include __"CaboCha"__ and __"KNP"__.
-__Data preprocessing__ is a very important step in natural language processing. The main flow is as follows (see "Natural Language Processing" for details):
①__Cleansing__: remove HTML tags and other elements unrelated to the text itself
②__Morphological analysis__: split sentences into words
③__Word normalization__: unify spelling variants, etc.
④__Stop-word removal__: remove frequent words that carry little meaning (particles, etc.)
⑤__Vectorization__: convert words into vectors (numbers)
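Steps ①, ②, and ④ can be sketched as follows. The sample text, tag pattern, and stop-word list are illustrative assumptions, and the morphological-analysis step is faked with a hand-made token list so the sketch runs without MeCab.

```python
import re

raw = "<p>今日は 良い天気です。</p>"

# ① Cleansing: strip HTML tags unrelated to the text itself
cleaned = re.sub(r"<[^>]+>", "", raw)

# ② Morphological analysis would go here (e.g. MeCab);
#    we substitute its expected output by hand
words = ["今日", "は", "良い", "天気", "です", "。"]

# ④ Stop-word removal: drop particles and symbols that carry little meaning
stop_words = {"は", "です", "。"}
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['今日', '良い', '天気']
```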
-If you only want MeCab to tokenize the text (__wakati-gaki__), use __"MeCab.Tagger('-Owakati')"__. Run it with __"parse(text)"__ and split the space-separated result into individual words with __"split(' ')"__.
-Create a deduplicated list of the words in dictionary order with __"sorted(set(word_list))"__.
-From this list (vocab), build a __"{word: id}"__ dictionary with __"{c: i for i, c in enumerate(vocab)}"__.
-Apply this dictionary to each element of word_list to get a list of ids: __"[vocab_to_int[x] for x in word_list]"__.
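The id-mapping steps above can be sketched in plain Python. Here a hand-made English word list stands in for MeCab's tokenized output, so the sketch runs without MeCab installed.

```python
# Stand-in for the tokenizer output: space-separated words
word_list = "the cat sat on the mat".split(' ')

vocab = sorted(set(word_list))                      # unique words in dictionary order
vocab_to_int = {c: i for i, c in enumerate(vocab)}  # {word: id}
int_list = [vocab_to_int[x] for x in word_list]     # each word replaced by its id

print(vocab)     # ['cat', 'mat', 'on', 'sat', 'the']
print(int_list)  # [4, 0, 3, 2, 4, 1]
```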
・ Code
・ Result
Embedding
-To __vectorize words__, use an __Embedding layer__, implemented in a __Sequential model__.
-If the id list "int_list" created at the end of the previous section is reshaped into a form (input_data) that the model accepts and passed to model.predict, a list of word vectors is returned.
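A minimal sketch of this, using Keras; the vocabulary size and vector dimension are arbitrary assumptions, and `int_list` is the example id list from the previous section.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

vocab_size = 10  # number of distinct word ids (assumed)
embed_dim = 4    # length of each word vector (assumed)

# A Sequential model whose only layer turns ids into vectors
model = Sequential([Embedding(input_dim=vocab_size, output_dim=embed_dim)])

int_list = [4, 0, 3, 2, 4, 1]      # id list from the previous section
input_data = np.array([int_list])  # shape (1, sequence_length)
vectors = model.predict(input_data)
print(vectors.shape)  # (1, 6, 4): one 4-dimensional vector per word
```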
・ Code
-Although the data handled this time is natural language data, the implementation itself is not much different from what we have done so far, so I will only post the code here.
・ Code
-When performing natural language processing with deep learning, __data preprocessing__ is important. Specifically, because __"words must be vectorized (turned into numbers)"__, the sentence is first split into words by __morphological analysis__ and the __unnecessary parts are removed__.
-Use an Embedding layer when vectorizing words. The implementation uses the Sequential model.
-The model that performs natural language processing is itself implemented __in the same way as other models__.
That's all for this time. Thank you for reading to the end.