Aidemy　2020/10/29

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I enrolled in the AI-specialized school "Aidemy" to study. I want to share what I learned there, so I am summarizing it on Qiita. I am very happy that many people read the previous summary articles. Thank you! This is my first post on natural language processing. Nice to meet you.

This article summarizes what I learned at "Aidemy" in my own words. It may contain mistakes and misunderstandings, so please keep that in mind.

What to learn this time
・ What natural language processing is
・ About the text corpus
・ About morphological analysis

About natural language processing

・ __"Natural language"__ is the spoken and written language that humans normally use. Having a computer process it is called __"natural language processing"__.
・ Natural language can contain ambiguous expressions, which computers __cannot interpret__ as they are.
・ For a computer to process natural language, the text must be converted into numerical values.
・ Natural language processing is used for machine translation, speech recognition, information retrieval, and so on.
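As a minimal sketch of what "converting to numerical values" can mean, the example below gives each distinct word an integer ID. The ID scheme and the helper names `build_vocabulary` and `encode` are assumptions made for illustration; the course material may well use a different representation.

```python
def build_vocabulary(words):
    """Assign an integer ID to each distinct word, in order of first appearance."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(words, vocab):
    """Replace every word with its ID."""
    return [vocab[word] for word in words]


# Hypothetical example sentence, already split into words
words = "the cat sat on the mat".split()
vocab = build_vocabulary(words)
ids = encode(words, vocab)
print(ids)  # [0, 1, 2, 3, 0, 4]
```

English can be split on spaces as done here, but Japanese text has no spaces between words, which is one reason a word-splitting step is needed before this kind of conversion.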

Corpus

・ A __corpus__ is __a collection of natural language documents gathered as data__. Corpora exist for many languages, including Japanese.
・ This time, we will use a "chat dialogue corpus".
・ The data is split into the __"init100"__ directory, which holds 100 sets of chat data, and the __"rest1046"__ directory, which holds 1046 sets. This time we use "init100".
・ The files are provided in __JSON format__. Each one contains "question data (human utterances)" and "answer data (system utterances)".
・ These data are stored under the __"turns"__ key of each file. Within it, __"utterance"__ is the utterance text and __"speaker"__ identifies who spoke: __"U"__ is the human and __"S"__ is the system.
・ In addition, each system utterance carries a __"breakdown"__ flag that indicates whether the system's utterance is natural. __"O" means natural, "T" means unnatural, and "X" means extremely unnatural (a breakdown)__. Multiple flags are given to one answer.
・ Inside the corpus [Screenshot: inside the corpus]
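To make the structure described above concrete, here is a hypothetical miniature of one corpus file. Only "turns", "speaker", "utterance" and "breakdown" are taken from the description; the keys "dialogue-id", "turn-index" and "annotations", the utterances themselves, and the helper `count_flags` are assumptions for illustration, so check them against the real files.

```python
import json
from collections import Counter

# Hypothetical sample data; the key names other than "turns", "speaker",
# "utterance" and "breakdown" are assumptions.
SAMPLE_JSON = """
{
  "dialogue-id": "example-0001",
  "turns": [
    {"turn-index": 0, "speaker": "S", "utterance": "Hello.",
     "annotations": [{"breakdown": "O"}, {"breakdown": "O"}]},
    {"turn-index": 1, "speaker": "U", "utterance": "It is hot today.",
     "annotations": []},
    {"turn-index": 2, "speaker": "S", "utterance": "I like cats.",
     "annotations": [{"breakdown": "T"}, {"breakdown": "X"}, {"breakdown": "T"}]}
  ]
}
"""


def count_flags(turn):
    """Count how many times each breakdown flag was given to one turn."""
    return Counter(a["breakdown"] for a in turn["annotations"])


sample = json.loads(SAMPLE_JSON)
for turn in sample["turns"]:
    if turn["speaker"] == "S":  # only system utterances carry flags
        print(turn["utterance"], dict(count_flags(turn)))
```

Counting the flags per turn shows why a single system answer can end up with several different labels.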

Corpus reading / data extraction

・ The corpus is read with __open()__, just like an ordinary file. Because the files are JSON, the opened file is then parsed with __json.load()__.
・ Data is extracted from the loaded object by specifying the key of the item you want.
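The reading step can be sketched as follows. So that it runs without the corpus at hand, the sketch first writes a tiny stand-in file to a temporary directory; the file name "example.log.json", the "dialogue-id" key and the sample content are assumptions, not the real corpus.

```python
import json
import os
import tempfile

# Hypothetical stand-in for one corpus file
sample = {
    "dialogue-id": "example-0001",
    "turns": [{"speaker": "U", "utterance": "Hello."}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.log.json")

    # Write the stand-in file so there is something to read
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)

    # Read it back: open() the file, then parse it with json.load()
    with open(path, "r", encoding="utf-8") as f:
        json_data = json.load(f)

# Extract a value by specifying its key
print(json_data["dialogue-id"])
```

With the real corpus, only the second `with open(...)` block is needed, pointed at a file inside the "init100" directory.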

・ Get the conversation ID [Screenshot: code]

# Extract and display the speaker and utterance of each turn
# (json_data is the object returned by json.load())
for turn in json_data['turns']:
    print("{}:{}".format(turn['speaker'], turn['utterance']))

Extraction of analytical data

・ From here, we analyze "natural conversation". Because the breakdown flag is used for this, first __obtain the "content of the human utterance" and the "flag of the system utterance"__.
・ Extracting the data this way produces duplicate rows, so use __drop_duplicates()__ to remove them. drop_duplicates() is a pandas DataFrame method, so the extracted data must first be converted to a DataFrame.

・ Code [Screenshot: code]

・ In the code above, the "utterance turn number", "speaker ID", and "utterance content" are first obtained from "turns", where the utterance data is stored, in the same way as in the previous section. From these, the "human utterance content" and the "system utterance flag" are collected into a list called label_list. Finally, the list is converted to a DataFrame and the duplicate rows are deleted.
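A rough sketch of this extraction is shown below. It is not the course's exact code: the "annotations" key, the column names, the sample data and the function names are assumptions, and the turn number is replaced by a simpler check that a human utterance has already been seen.

```python
def extract_label_pairs(json_data):
    """Pair each system-utterance flag with the preceding human utterance."""
    label_list = []
    human_text = None
    for turn in json_data["turns"]:
        if turn["speaker"] == "U":
            human_text = turn["utterance"]
        elif human_text is not None:
            # "annotations" is an assumed key holding the breakdown flags
            for annotation in turn["annotations"]:
                label_list.append([annotation["breakdown"], human_text])
    return label_list


def to_unique_frame(label_list):
    """Convert the list to a DataFrame and drop duplicate rows."""
    # pandas is third-party; it is imported here so that the extraction
    # step above can be tried without it.
    import pandas as pd
    frame = pd.DataFrame(label_list, columns=["flag", "utterance"])
    return frame.drop_duplicates()


# Hypothetical sample data
json_data = {
    "turns": [
        {"speaker": "S", "utterance": "Hello.",
         "annotations": [{"breakdown": "O"}]},
        {"speaker": "U", "utterance": "It is hot today.", "annotations": []},
        {"speaker": "S", "utterance": "I like cats.",
         "annotations": [{"breakdown": "T"}, {"breakdown": "X"}, {"breakdown": "T"}]},
    ]
}

label_list = extract_label_pairs(json_data)
print(label_list)
```

Here "T" appears twice for the same human utterance, so `to_unique_frame(label_list)` would leave two rows, one for "T" and one for "X".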

Morphological analysis

What is morphological analysis?

・ __Morphological analysis__ is one of the methods of natural language processing: it __divides a sentence into words (morphemes) and classifies each one by part of speech__.
・ For example, "Hello, this is Yope!" becomes "Hello / , / this / is / Yope / !".
・ Tools for performing morphological analysis include MeCab and Janome.

MeCab
・ To perform morphological analysis with MeCab, create a tagger with __k = MeCab.Tagger('output mode')__ and then run __k.parse('string to analyze')__ on it. Specifically, it looks like this.

[Screenshot: MeCab example]

・ If nothing is specified for the mode passed to Tagger(), the output looks like the above. If you specify __'-Owakati'__, each word (morpheme) is simply separated by a space, which is called __"separate writing" (wakati-gaki)__.
・ There are other modes as well, such as '-Oyomi', which outputs only the reading.
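The MeCab calls described here can be sketched as below. This assumes the MeCab Python binding and a dictionary are installed (for example the mecab-python3 package). The wrapper `run_mecab`, the parser `parse_default_output` and the shortened sample output are illustrative assumptions; the exact feature columns depend on the dictionary in use.

```python
def run_mecab(text, mode=""):
    """Run MeCab on text; mode is an output option such as "-Owakati"."""
    # MeCab is third-party; it is imported here so that the parser below
    # can be tried without it.
    import MeCab
    tagger = MeCab.Tagger(mode)
    return tagger.parse(text)


def parse_default_output(raw):
    """Split default-mode output into (surface, feature list) pairs."""
    pairs = []
    for line in raw.splitlines():
        if not line or line == "EOS":  # EOS marks the end of a sentence
            continue
        surface, _, features = line.partition("\t")
        pairs.append((surface, features.split(",")))
    return pairs


# Assumed shape of the default output: one "surface<TAB>features" line per
# morpheme, with the feature list shortened here for illustration.
example_output = "すもも\t名詞,一般\nも\t助詞,係助詞\nEOS\n"
print(parse_default_output(example_output))
```

With MeCab installed, `run_mecab(text)` would return the default form that `parse_default_output` expects, and `run_mecab(text, "-Owakati")` the space-separated form.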

Janome
・ To perform morphological analysis with Janome, create an object with __t = Tokenizer()__ and then run __t.tokenize('string to analyze')__.
・ For separate writing, pass "wakati=True" as the second argument of tokenize().

・ As another feature, tokens can be filtered by part of speech.
・ To keep only specific parts of speech: __POSKeepFilter(['part of speech'])__
・ To exclude specific parts of speech: __POSStopFilter(['part of speech'])__
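The basic Janome calls can be sketched like this, assuming the janome package is installed. The helper names are hypothetical. `wakati` is passed by keyword and the result is wrapped in `list()`, since depending on the Janome version tokenize() may return a generator rather than a list.

```python
def make_tokenizer():
    """Create a Janome Tokenizer."""
    # Janome is third-party; it is imported here so that the helpers below
    # can be read and tried with any object that offers tokenize().
    from janome.tokenizer import Tokenizer
    return Tokenizer()


def describe_tokens(tokenizer, text):
    """Return (surface, part of speech) pairs, one per morpheme."""
    return [(token.surface, token.part_of_speech)
            for token in tokenizer.tokenize(text)]


def split_words(tokenizer, text):
    """Return only the surface forms ("separate writing")."""
    return list(tokenizer.tokenize(text, wakati=True))
```

With Janome installed, `split_words(make_tokenizer(), "すもももももももものうち")` would return the sentence as a list of words, and `describe_tokens` would additionally give the part of speech of each one.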

・ With __Analyzer()__, the processing so far and the preprocessing of the text to be analyzed can be done in one go.
・ The arguments to pass are __(preprocessing, Tokenizer object (t), filters)__.
・ The preprocessing part can include __UnicodeNormalizeCharFilter()__, which normalizes notational variation in Unicode strings. With its default setting (NFKC), this converts full-width alphanumeric characters to half-width and half-width katakana to full-width.
・ The first argument sometimes cannot be omitted even when no preprocessing is performed; in that case, pass an empty list __"[]"__.
・ For the remaining two arguments, set the Tokenizer object and the filters described above.

・ Execute Analyzer() as follows. [Screenshot: code]
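A sketch of how the pieces might be combined, assuming the janome package is installed. The three arguments are passed by keyword (`char_filters`, `tokenizer`, `token_filters`), and the helper names are hypothetical. The last line previews NFKC normalization with the standard library; to my understanding this is the form UnicodeNormalizeCharFilter applies by default.

```python
import unicodedata


def build_analyzer(keep_pos):
    """Assemble preprocessing, a tokenizer and a part-of-speech filter."""
    # Janome is third-party; it is imported here so that the NFKC preview
    # at the bottom runs without it.
    from janome.analyzer import Analyzer
    from janome.charfilter import UnicodeNormalizeCharFilter
    from janome.tokenfilter import POSKeepFilter
    from janome.tokenizer import Tokenizer
    return Analyzer(
        char_filters=[UnicodeNormalizeCharFilter()],  # preprocessing
        tokenizer=Tokenizer(),
        token_filters=[POSKeepFilter(keep_pos)],  # e.g. ["名詞"] keeps nouns
    )


def surfaces(analyzer, text):
    """Return the surface form of every token the analyzer yields."""
    return [token.surface for token in analyzer.analyze(text)]


# NFKC normalization, previewed with the standard library
print(unicodedata.normalize("NFKC", "Ｐｙｔｈｏｎ ﾃｷｽﾄ"))  # Python テキスト
```

With Janome installed, `surfaces(build_analyzer(["名詞"]), text)` would return only the nouns in `text` after normalization.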

Text normalization

・ Morphological analysis depends on the dictionary used, so the analysis may become unnatural when words that are not in the dictionary appear.
・ There are two countermeasures for such cases. The first is to prepare a user dictionary (not explained here).
・ The other is __"text normalization"__: as preprocessing, unnecessary symbols are deleted from the text and the notation is unified.

・ For example, when two kinds of comma (such as "," and "、") are mixed in a sentence, they are unified to one of them, and different notations of the same word, such as "apple", are likewise unified to one form.
・ A __"regular expression"__ is used to specify the strings to be normalized.
・ Specifically, use __re.sub("pattern to replace", "replacement string", "target text")__, and write the first argument as a regular expression.
・ Regular expressions are not covered in detail here (Qiita has various articles on them).

・ Code (removing the alphanumeric characters from "I will buy 10 items A") [Screenshot: code]
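A small sketch of normalization with re.sub() follows. The Japanese sentences are assumed reconstructions chosen for illustration, and the patterns and function names are one possible choice rather than the course's exact code.

```python
import re


def remove_alphanumerics(text):
    """Delete ASCII letters and digits from the text."""
    return re.sub(r"[0-9a-zA-Z]+", "", text)


def unify_commas(text):
    """Replace ASCII and full-width commas with the Japanese comma."""
    return re.sub(r"[,，]", "、", text)


print(remove_alphanumerics("私は商品Aを10個買います。"))  # 私は商品を個買います。
print(unify_commas("りんご,みかん，ぶどう"))  # りんご、みかん、ぶどう
```

The first pattern only covers half-width characters, so full-width letters and digits would need either an extended pattern or Unicode normalization beforehand.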

Summary

・ Natural language processing becomes possible by having a computer handle natural language as numerical values.
・ A __corpus__ is a collection of natural language documents gathered as data.
・ __Morphological analysis__ is one of the methods of natural language processing: it divides a sentence into words (morphemes) and classifies each one by part of speech.
・ Morphological analysis can be performed with "MeCab" or "Janome".
・ Because morphological analysis depends on the dictionary used, it is important to preprocess the text with __regular expressions__ so that the dictionary can recognize the words.

That is all for this time. Thank you for reading to the end.

[PYTHON] Natural language processing 1 Morphological analysis