[PYTHON] Natural language processing 1 Morphological analysis

Aidemy 2020/10/29

Introduction

Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I am studying at the AI-specialized school "Aidemy". I want to share what I learn there, so I am summarizing it on Qiita. I am very happy that so many people read my previous summary articles. Thank you! This is my first post on natural language processing. I hope you find it useful.

What to learn this time

・ What natural language processing is
・ About the text corpus
・ About morphological analysis

About natural language processing

・ __"Natural language"__ is the spoken and written language that humans normally use. Having a computer process it is called __"natural language processing"__.
・ Natural language used by humans can contain ambiguous expressions, which __a computer cannot interpret as they are__.
・ For a computer to process natural language, the text must first be converted into numerical values.
・ Natural language processing is used for machine translation, speech recognition, information retrieval, and so on.

Corpus

・ A __corpus__ is __a collection of natural-language documents gathered as data__. Corpora exist for many languages, including Japanese.
・ This time we use a "chat dialogue corpus".
・ The data is split into the __"init100"__ directory, which holds 100 sets of chat data, and the __"rest1046"__ directory, which holds 1046 sets. This time we use "init100".
・ The files are provided in __JSON format__. Each one contains "question data (human utterances)" and "answer data (system utterances)".
・ These data are stored under the __"turns"__ key of the file. Within each turn, __"utterance"__ holds the utterance text and __"speaker"__ identifies the speaker: __"U"__ is the human and __"S"__ is the system.
・ In addition, a __"breakdown"__ flag is attached to each system utterance. It indicates whether the system's utterance is natural: __"O" means natural, "T" means unnatural, and "X" means extremely unnatural (broken)__. Several flags are attached to a single answer.
・ Inside the corpus
[Screenshot: contents of a corpus file]
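As a rough picture of the structure described above, the sketch below builds one hypothetical dialogue as a Python dictionary and tallies the breakdown flags given to a system utterance. The keys "turns", "speaker", "utterance" and "breakdown" come from the description above. The "annotations" list that holds the flags, the sample sentences and the flag values are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical dialogue record. "turns", "speaker", "utterance" and "breakdown"
# follow the description above; the "annotations" list is an assumed layout.
sample_dialogue = {
    "turns": [
        {"speaker": "U", "utterance": "こんにちは", "annotations": []},
        {
            "speaker": "S",
            "utterance": "こんにちは。今日は暑いですね",
            "annotations": [
                {"breakdown": "O"},
                {"breakdown": "O"},
                {"breakdown": "T"},
            ],
        },
    ]
}


def count_breakdown_flags(turn):
    """Count how often each flag (O / T / X) was given to one utterance."""
    return Counter(a["breakdown"] for a in turn["annotations"])


system_turn = sample_dialogue["turns"][1]
flag_counts = count_breakdown_flags(system_turn)
print(flag_counts["O"], flag_counts["T"], flag_counts["X"])  # 2 1 0
```

A Counter returns 0 for a flag that never appears, so all three flags can be looked up without checking whether they exist.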

Corpus reading / data extraction

・ The corpus is read with __open()__, just like an ordinary file. Because the files are JSON, the opened file is then parsed with __json.load()__.
・ Data is extracted from the loaded object by specifying the key of the value you want.

・ Get the conversation ID
[Screenshot: code that gets the conversation ID]

# Extract and display the speaker and the utterance content
# (json_data is the object returned by json.load())
for turn in json_data['turns']:
    print("{}:{}".format(turn['speaker'], turn['utterance']))
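The steps of this section can be sketched end to end as follows. So that it runs without the real corpus, the example first writes a tiny hypothetical file to a temporary directory and then reads it back. The "dialogue-id" key name, the file name and all the values are assumptions for illustration; only "turns", "speaker" and "utterance" are taken from the description above.

```python
import json
import os
import tempfile

# Hypothetical miniature corpus file. The "dialogue-id" key name and all the
# values are assumptions; "turns", "speaker" and "utterance" follow the text.
sample = {
    "dialogue-id": "0000000000",
    "turns": [
        {"speaker": "S", "utterance": "こんにちは"},
        {"speaker": "U", "utterance": "今日はいい天気ですね"},
    ],
}

# Write the sample to a temporary file so that the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.log.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Read it back: open() as for any file, then json.load() to parse the JSON.
with open(path, "r", encoding="utf-8") as f:
    json_data = json.load(f)

# Values are extracted by specifying the key.
print(json_data["dialogue-id"])
lines = ["{}:{}".format(t["speaker"], t["utterance"]) for t in json_data["turns"]]
print("\n".join(lines))
```

Opening the file with an explicit encoding="utf-8" avoids depending on the platform's default encoding when the file contains Japanese text.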

Extraction of analytical data

・ From here on we analyze "natural conversation". Because the analysis uses the breakdown flag, we first extract __the content of the human utterances and the flags of the system utterances__.
・ Extracting the data this way produces duplicate rows, so remove them with __drop_duplicates()__. This is a DataFrame method, so the extracted data must first be converted to a DataFrame.

・ Code
[Screenshot: code that extracts the data for analysis]

・ The code above first reads the "utterance turn number", "speaker ID" and "utterance content" from "turns", where the utterance data is stored, in the same way as in the previous section. From these it collects the "human utterance content" and the "system utterance flag" into a list called label_list. Finally, the list is converted to a DataFrame and the duplicate rows are removed.
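A minimal sketch of this extraction, under stated assumptions, might look like the following. It is not the original code from the screenshot. The "turn-index" and "annotations" key names, the column names and the sample data are hypothetical; label_list and drop_duplicates() are taken from the text.

```python
import pandas as pd

# Hypothetical turns. The "turn-index" and "annotations" key names and all
# values are assumptions; the other keys follow the description above.
turns = [
    {"turn-index": 0, "speaker": "S", "utterance": "こんにちは",
     "annotations": [{"breakdown": "O"}]},
    {"turn-index": 1, "speaker": "U", "utterance": "今日は暑いですね",
     "annotations": []},
    {"turn-index": 2, "speaker": "S", "utterance": "暑いのは好きですか",
     "annotations": [{"breakdown": "O"}, {"breakdown": "O"}, {"breakdown": "T"}]},
]

label_list = []
human_text = None
for turn in turns:
    if turn["speaker"] == "U":
        # Remember the latest human utterance.
        human_text = turn["utterance"]
    elif human_text is not None:
        # Pair every flag on the system reply with that human utterance.
        for annotation in turn["annotations"]:
            label_list.append([annotation["breakdown"], human_text])

# drop_duplicates() is a DataFrame method, so convert the list first.
df = pd.DataFrame(label_list, columns=["flag", "utterance"])
df_unique = df.drop_duplicates()
print(df_unique)
```

In this sample the same human utterance is paired with the flag "O" twice, so the three collected rows shrink to two after the duplicates are removed.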

Morphological analysis

What is morphological analysis?

・ __Morphological analysis__ is one of the methods of natural language processing: it __divides a sentence into words (morphemes) and classifies each one by part of speech__.
・ For example, "Hello, it is Yope!" becomes "Hello / , / it / is / Yope / !".
・ Tools for morphological analysis include MeCab and Janome.

MeCab

・ Perform morphological analysis with MeCab. First create a tagger with __k = MeCab.Tagger('output mode')__, then run __k.parse('string to analyze')__ on it. Specifically, it looks like this.

[Screenshot: MeCab code and its output]

・ If no mode is passed to Tagger(), the result is output as above. If __'-Owakati'__ is passed, the words (morphemes) are simply separated by spaces, which is called __"separate writing" (wakati-gaki)__.
・ There are other modes as well, such as '-Oyomi', which outputs only the reading.

Janome

・ To perform morphological analysis with Janome, create an object with __t = Tokenizer()__ and then run __t.tokenize('string to analyze')__.
・ For separate writing, pass "wakati=True" as the second argument of tokenize().

・ Janome can also filter tokens by part of speech.
・ To keep only specific parts of speech: __POSKeepFilter(['part of speech'])__
・ To exclude specific parts of speech: __POSStopFilter(['part of speech'])__

・ With __Analyzer()__, the processing so far and the preprocessing of the text can be done in one go.
・ The arguments are __(preprocessing, Tokenizer object (t), filters)__.
・ The preprocessing part can include __UnicodeNormalizeCharFilter()__, which normalizes notation variants in Unicode strings. For example, it converts full-width alphanumerics to half-width and half-width katakana to full-width.
・ In some cases the first argument cannot be omitted even when no preprocessing is done; in that case pass an empty list, __[]__.
・ For the remaining two arguments, pass the Tokenizer object and the filters described above.
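To see what this kind of normalization does on its own, here is a small sketch using only the standard library's unicodedata module. It assumes the "NFKC" normalization form, which as far as I know is what UnicodeNormalizeCharFilter() uses by default; the helper name normalize_nfkc and the sample strings are made up for illustration.

```python
import unicodedata


def normalize_nfkc(text):
    """Apply Unicode NFKC normalization to a string (assumed default form)."""
    return unicodedata.normalize("NFKC", text)


# Full-width alphanumerics become half-width.
print(normalize_nfkc("Ｐｙｔｈｏｎ３"))  # Python3

# Half-width katakana become full-width.
print(normalize_nfkc("ﾊﾝｶｸｶﾅ"))  # ハンカクカナ
```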

・ Analyzer() is run as follows.
[Screenshot: code that runs Analyzer()]

Text normalization

・ Morphological analysis depends on the dictionary being used, so the result can be unnatural when words that are not in the dictionary appear.
・ There are two ways to deal with this. The first is to prepare a user dictionary (not covered here).
・ The other is __"text normalization"__: as preprocessing, remove unnecessary symbols from the text and unify the notation.

・ For example, when different comma characters such as "," and "、" are mixed in a sentence, unify them to one of the two; likewise, unify different spellings of the same word, such as "りんご" and "リンゴ" (both mean "apple").
・ Use a __regular expression__ to specify the strings to normalize.
・ Specifically, call __re.sub("pattern to replace", "replacement string", "text to process")__, writing the first argument as a regular expression.
・ Regular expressions are not covered in detail here (there are many articles about them on Qiita).

・ Code (removing the alphanumeric characters from "I will buy 10 items A")
[Screenshot: re.sub() code]
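A minimal sketch of this kind of normalization with re.sub() is shown below. The Japanese sample sentences are assumed stand-ins, not the text from the original screenshot, and the character class [0-9a-zA-Z] covers only half-width (ASCII) letters and digits.

```python
import re

# Assumed sample sentence (roughly "I will buy 10 of item A").
text = "私は商品Aを10個買います"

# re.sub(pattern, replacement, text): remove every run of ASCII letters and digits.
without_alnum = re.sub(r"[0-9a-zA-Z]+", "", text)
print(without_alnum)  # 私は商品を個買います

# Unifying notation: replace the half-width comma with the Japanese comma.
mixed = "りんご,みかん、ぶどう"
unified = re.sub(r",", "、", mixed)
print(unified)  # りんご、みかん、ぶどう
```

Full-width letters and digits are not matched by this pattern, so normalizing them to half-width beforehand would be one way to catch them as well.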

Summary

・ Natural language processing becomes possible once a computer handles natural language as numerical values.
・ A __corpus__ is a collection of natural-language documents gathered as data.
・ __Morphological analysis__ is one of the methods of natural language processing: it divides a sentence into words (morphemes) and classifies each one by part of speech.
・ Morphological analysis can be performed with "MeCab" or "Janome".
・ Morphological analysis depends on the dictionary being used, so it is important to preprocess the text with __regular expressions__ so that it matches what the dictionary can handle.

That's all for this time. Thank you for reading to the end.
