[PYTHON] Natural language processing 3 Word continuity

Aidemy 2020/10/30

Introduction

Hello, this is Yope! I'm a liberal arts student, but I was interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I'd like to share the knowledge I gained there, so I'm summarizing it on Qiita. I'm very happy that so many people have read my previous summary articles. Thank you! This is my third post on natural language processing. Nice to meet you.

What to learn this time ・Preparation for word continuity ・Feature extraction from word continuity

Preparation for word continuity

Creating a word dictionary

・In order to perform "word continuity analysis", the word data is first converted to numbers as a preparation step.
・After splitting the dataset into words, create a __word dictionary (list)__ that assigns an ID to each word so the data can be quantified. Since we want to number words in descending order of frequency, we first __count how often each word appears and sort the counts in descending order__.

・Count word occurrences with __Counter()__ and __itertools.chain()__: __Counter(itertools.chain(*list of word data to count))__.

・__Counter()__ counts elements, but because the word data is a multidimensional (nested) list, individual words cannot be counted directly. The list is therefore flattened to one dimension with __itertools.chain()__. Prefix the multidimensional list passed to itertools.chain() with __"*"__ so that it is unpacked.

・Use __most_common(n)__ to sort in descending order of count. It returns (element, count) tuples sorted in descending order; if n is specified, only the n most frequent elements are returned.

・Once this is done, assign an ID to each word in the sorted frequency list and store the pairs in an empty dictionary to create the word dictionary.

・Code ![Screenshot 2020-10-19 12.50.50.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/27647f46-ce71-8d11-4745-85d91fbe0912.png)
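Since the code above is only available as a screenshot, here is a minimal sketch of the same steps. It assumes the tokenized utterances are stored in a nested list `wakatiO` (one word list per sentence) and that the word-to-ID dictionary is called `dic_inv`, as referenced later in this article; starting the IDs at 1 is an assumption.

```python
import itertools
from collections import Counter

# Count how often each word appears across the whole tokenized dataset.
# "*" unpacks the nested list so itertools.chain flattens it to one dimension.
word_freq = Counter(itertools.chain(*wakatiO))

# Sort words in descending order of frequency and assign IDs (here starting from 1).
dic = [word for word, _ in word_freq.most_common()]
dic_inv = {word: i + 1 for i, word in enumerate(dic)}
```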

Convert word data to numeric data

・Once the dictionary is created, carry out the actual goal: __quantifying the dataset__.
・By looking up the IDs in the word dictionary created in the previous section, convert the tokenized utterance dataset "wakatiO" into a new array "wakatiO_n" that contains only numerical data.

・The code is as follows (screenshot: 2020-10-19 13.02.59.png).

・This code is easier to understand if you read it from the back. First, the __"for waka in wakatiO"__ part stores each word list of the dataset "wakatiO" (one list per sentence) in waka. Next, __"for word in waka"__ takes each word of that word list and stores it in word. Finally, each word is looked up in the dictionary with __"dic_inv[word]"__ and its ID is stored in "wakatiO_n".
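A minimal sketch of this conversion, assuming `wakatiO` and `dic_inv` from the previous section:

```python
# Convert each tokenized utterance into a list of word IDs by looking
# each word up in dic_inv (the word-to-ID dictionary built above).
wakatiO_n = [[dic_inv[word] for word in waka] for waka in wakatiO]
```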

Feature extraction from word continuity

N-gram

・N-gram is a model used to extract topics from text; it works by splitting the __text into sequences of N consecutive characters (or words)__.
・For example, given the string "aiueo": with N=1 (1-gram) it is split into "a | i | u | e | o", and with N=2 (2-gram) into "ai | iu | ue | eo".
・In "Natural Language Processing 2", topics were extracted from a "word document matrix", but that captured __"word co-occurrence (whether words appear in the same sentence)"__. In contrast, N-gram captures __"word continuity (in what order words appear)"__.

・Since there is no ready-made method for creating N-grams, we build them by iterating over the word list and appending each group of N consecutive items to an empty list, as in the example below.

```python
ngram_list = []
word = ['Good', 'weather', 'is', 'Ne', '。']
# Creating a 3-gram model
for i in range(len(word) - 2):
    ngram_list.append([word[i], word[i+1], word[i+2]])
print(ngram_list)
# [['Good', 'weather', 'is'], ['weather', 'is', 'Ne'], ['is', 'Ne', '。']]
```

-For __ "len (word) -2" __, specify the same number so that "i + 2" below it does not exceed the length of the word.

Creating a 2-gram list

・Using the N-gram approach above, create a 2-gram list from "wakatiO_n" and __calculate the number of occurrences (weights) of each pair of nodes__.
・First, apply the 2-gram model to "wakatiO_n" to create the 2-gram array "bigramO". Convert it to a __DataFrame__ (df_bigramO), group it by "'node1' and 'node2'" (the two nodes whose continuity is being counted) with __groupby()__, and finally apply __sum()__ to total the number of appearances.

・Code (screenshot: 2020-10-19 16.15.39.png)

・Output result ![Screenshot 2020-10-19 16.16.20.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7464f40b-c219-42b5-5f38-523171df5761.png)
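Because the code is only shown as a screenshot, here is a minimal sketch of the same steps. It assumes `wakatiO_n` from the previous section; the column name `weight` is an assumption.

```python
import pandas as pd

# Build 2-grams from the numeric dataset and record every adjacent pair.
bigramO = []
for sentence in wakatiO_n:
    for i in range(len(sentence) - 1):
        bigramO.append([sentence[i], sentence[i + 1]])

df_bigramO = pd.DataFrame(bigramO, columns=['node1', 'node2'])
df_bigramO['weight'] = 1
# Group by the word pair and sum to get the total number of appearances.
df_bigramO = df_bigramO.groupby(['node1', 'node2'], as_index=False).sum()
```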

Creating a 2-gram network

・In Chapter 2, an undirected graph was created with word similarity as the edge weights. This time, __create a directed graph with the number of occurrences of word pairs as the edge weights__. The source data is the df_bigramO created in the previous section.
・A directed graph is a graph whose __edges have a notion of "direction"__. For word-pair counts, a directed graph also preserves the order of appearance, that is, the information "which word comes first".
・The creation method is exactly the same as for undirected graphs; simply pass __"nx.DiGraph"__ as an argument of __nx.from_pandas_edgelist()__.

・Code (screenshot: 2020-10-19 16.51.44.png)

・Result ![Screenshot 2020-10-19 16.52.04.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/698700/7770095a-ae2c-b57d-92b7-b7b396299952.png)
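A minimal sketch of this step, assuming `df_bigramO` from the previous section has the columns 'node1', 'node2', and 'weight'; the graph variable name `G_bigram` is an assumption.

```python
import networkx as nx

# Create a directed graph whose edge weights are the word-pair counts.
G_bigram = nx.from_pandas_edgelist(
    df_bigramO, source='node1', target='node2',
    edge_attr='weight', create_using=nx.DiGraph
)
```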

2-gram network features

・As in the previous chapter, it is hard to grasp the characteristics just by looking at the graph, so calculate the __"average clustering coefficient"__ and __"betweenness centrality"__ to grasp the characteristics quantitatively.
・(Review) The average clustering coefficient is computed with __nx.average_clustering()__, and betweenness centrality with __nx.betweenness_centrality()__.
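A minimal sketch of these calculations, assuming the directed graph `G_bigram` built above (whether the edge weights are passed to these functions is not shown in the article, so they are omitted here):

```python
import networkx as nx

# Grasp the network structure quantitatively.
avg_clustering = nx.average_clustering(G_bigram)
betweenness = nx.betweenness_centrality(G_bigram)

print(avg_clustering)
# The five most "bridging" words (highest betweenness centrality).
print(sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:5])
```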

Seeing the influence of each word

・To see how words influence each other, visualize the __"degree distribution"__.
・In a directed graph, the degree is split into the __in-degree__ (how much a word is influenced by, i.e. preceded by, other words) and the __out-degree__ (how much it influences, i.e. precedes, other words).
・Check the in-degree with the __in_degree(weight)__ method. The result is returned as pairs of (node number, degree).
・Similarly, check the out-degree with the __out_degree(weight)__ method.

・Code (screenshot: 2020-10-19 17.20.57.png)

・Result (screenshot: 2020-10-19 17.21.32.png)
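A minimal sketch of this step, assuming `G_bigram` from above; plotting the distributions as histograms with matplotlib is an assumption about how the article visualizes them.

```python
import matplotlib.pyplot as plt

# in-degree: how often a word is preceded by other words;
# out-degree: how often it precedes other words.
in_degrees = [deg for _, deg in G_bigram.in_degree(weight='weight')]
out_degrees = [deg for _, deg in G_bigram.out_degree(weight='weight')]

# Visualize the degree distributions.
plt.hist(in_degrees, bins=20, alpha=0.5, label='in-degree')
plt.hist(out_degrees, bins=20, alpha=0.5, label='out-degree')
plt.legend()
plt.show()
```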

Summary

・To understand features based on __word continuity__, first tokenize the __utterance text__, then create a __word dictionary__ and convert the __data into numerical values__.
・Convert the numerical data into an N-gram list, __calculate the number of occurrences of each word combination__, and __create a directed graph__ from it.
・Because the characteristics are difficult to read directly from the directed graph, they are grasped quantitatively by calculating the __"average clustering coefficient"__ and __"betweenness centrality"__.
・The features can also be __visualized__ by plotting the __degree distribution__.

That's all for this time. Thank you for reading to the end.
