[Python] Natural Language Processing 2: Word Similarity

Aidemy 2020/10/29

Introduction

Hello, I'm Yope! I am a liberal arts student, but I was interested in the possibilities of AI, so I went to the AI-specialized school "Aidemy" to study. I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people have read my previous summary article. Thank you! This is my next post. Nice to meet you.

What you will learn this time
・About word similarity
・Understanding the characteristics of utterances

Word similarity

Morphological analysis of the contents of utterance data

・This time, we will learn the "characteristics of utterances" from the "similarity of words" in the corpus utterance data. To do that, we must first obtain the word similarity.
・In this section, we perform the preprocessing needed to obtain word similarity. Specifically, of the utterance data, only the natural utterances, that is, those whose flag is "O", are morphologically analyzed.

・Procedure
(1) From the "df_label_text" created in "Natural language processing 1: Extraction of analysis data", extract only the rows whose flag is "O". (df_label_text_O)
(2) Convert the extracted df_label_text_O (a NumPy array) to a Python list with ".tolist()", and remove digits and alphabetic characters line by line with re.sub(). (reg_row)
(3) Morphologically analyze reg_row with Janome, and append only the surface forms (words) to a list called "morpO".

・Code (shown as a screenshot in the original post)
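Since the original code is only available as a screenshot, here is a minimal runnable sketch of steps (1) to (3). The layout of df_label_text (column 0 = flag, column 1 = utterance text) is an assumption carried over from the previous article, and the sample rows are invented stand-ins.

```python
import re
import pandas as pd
from janome.tokenizer import Tokenizer

# Toy stand-in for the "df_label_text" built in the previous article.
# Assumption: column 0 holds the flag, column 1 the utterance text.
df_label_text = pd.DataFrame([
    ["O", "今日はいい天気ですね123"],
    ["X", "system error 404"],
])

# (1) Keep only the rows whose flag is "O"
df_label_text_O = df_label_text[df_label_text[0] == "O"]

t = Tokenizer()
morpO = []
for row in df_label_text_O.values.tolist():
    # (2) Remove digits and ASCII letters from the utterance text
    reg_row = re.sub(r"[0-9a-zA-Z]+", "", row[1])
    # (3) Morphological analysis with Janome; keep only the surface forms
    for token in t.tokenize(reg_row):
        morpO.append(token.surface)

print(morpO)
```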

Word document matrix

・A word-document matrix is one of the methods for converting word data into numerical data.
・It expresses, as numbers, how frequently each word appears in each document. The documents were split into words by morphological analysis precisely so that this matrix could be built.
・To build it, use scikit-learn's CountVectorizer().
・By default, a token is only recognized as a word if it has two or more characters, but Japanese has words that are meaningful even with a single character. In such cases, pass token_pattern='(?u)\b\w+\b' as an argument.

・For a CountVectorizer() object (CV), CV.fit_transform('string to convert into a word-document matrix') produces an array of word occurrence counts.

・In the code below, the "morpO" from the previous section is converted to a word-document matrix, which is then converted to an ndarray and displayed as a DataFrame. (Code shown as a screenshot in the original post.)
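A minimal sketch of that conversion, assuming the tokens in morpO are joined into one whitespace-separated string before vectorization (the actual code is only visible in the screenshot):

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

morpO = ["今日", "は", "いい", "天気", "です", "ね"]  # toy stand-in

# Recognize single-character words as well
CV = CountVectorizer(token_pattern="(?u)\\b\\w+\\b")
word_doc_matrix = CV.fit_transform([" ".join(morpO)])

# Sparse matrix -> ndarray -> DataFrame (columns are the words)
# (on scikit-learn < 1.0, use get_feature_names() instead)
df_CV = pd.DataFrame(word_doc_matrix.toarray(),
                     columns=CV.get_feature_names_out())
print(df_CV)
```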

・Output result (only part shown; screenshot in the original post)

Weighted word document matrix

・In the word-document matrix above, the frequency of universal words such as "I" and "desu" inevitably becomes high. As it is, words that appear only in specific documents are not emphasized, and the "characteristics of utterances" cannot be extracted properly.
・In such cases, a technique is often used that reduces the weight of universal words appearing in any document and increases the weight of words appearing only in specific documents; the resulting matrix is called a "weighted word-document matrix".
・The value that determines the weight is "TF-IDF", obtained by multiplying the term frequency (TF) by the IDF value.
・IDF is calculated from the ratio of the total number of documents to the number of documents in which the word appears, so the fewer documents the word appears in, the larger the IDF value, and hence the larger the weight (TF-IDF).
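As a reference, a common textbook formulation of this weight (scikit-learn's default adds smoothing terms on top of it) is:

```math
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
```

where N is the total number of documents and df(t) is the number of documents containing word t; the rarer the word, the larger the logarithm, hence the larger the weight.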

・The weighted word-document matrix is built as follows:
TfidfVectorizer(use_idf=)
・Set "use_idf" to True or False to specify whether IDF is used for weighting.
・As with CountVectorizer(), specify "token_pattern" if single-character words should also be treated as words.

・After that, as with CountVectorizer(), fit_transform() produces the array of (weighted) word occurrence values.
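A minimal sketch, with invented toy documents (whitespace-separated tokens, as above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = ["今日 は いい 天気", "明日 も いい 天気 です"]  # toy stand-ins

TV = TfidfVectorizer(use_idf=True, token_pattern="(?u)\\b\\w+\\b")
tfidf_matrix = TV.fit_transform(docs)

# DataFrame form, needed later for DataFrame.corr()
df_TV = pd.DataFrame(tfidf_matrix.toarray(),
                     columns=TV.get_feature_names_out())
print(df_TV)
```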

Calculate word similarity

・This time, we will create an unsupervised model that uses the similarity of word occurrences as its features.
・There are various methods for calculating word similarity, such as the cosine similarity between vectors, but this time we measure similarity by calculating the correlation coefficient between each pair of columns.
・Use the DataFrame.corr() method to calculate the correlation coefficients; this is why the matrix was converted to a DataFrame in the code above.
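A one-line sketch, continuing from the df_TV DataFrame above:

```python
# corr() computes the pairwise correlation between columns, i.e. between words
corr_matrixO = df_TV.corr()
print(corr_matrixO)
```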

Analyze utterance characteristics

Creating a similarity list

・Since the correlation coefficients calculated in the previous section are the features, we will use them to analyze quantitatively.
・First, convert the correlation coefficients from matrix format to list format. To do this, use the DataFrame.stack() method. (stack() converts columns to rows; unstack() converts rows to columns.)

・Procedure of the code below
(1) Create the weighted word-document matrix, then compute the correlation-coefficient matrix from it (corr_matrixO) and convert it to list format. (corr_stackO)
(2) To extract the pairs with a positive correlation (i.e. the pairs that can be said to be similar), take the indices (word pairs) and values (correlation coefficients) whose correlation coefficient is between 0.5 and 1.0, concatenate them, and display the result.
・Code (screenshot in the original post)
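A sketch of step (2), assuming corr_matrixO from above; the column names "source", "target", and "weight" are my own choice (they make the NetworkX step below simpler), not necessarily the article's:

```python
import pandas as pd

corr_stackO = corr_matrixO.stack()

# Keep positively correlated pairs (0.5 <= r <= 1.0); note this also
# picks up each word paired with itself (r = 1.0)
mask = (corr_stackO >= 0.5) & (corr_stackO <= 1.0)
index = corr_stackO[mask].index
value = corr_stackO[mask]

# Concatenate word pairs and their correlation coefficients
df_corlistO = pd.concat(
    [pd.DataFrame(list(index), columns=["source", "target"]),
     pd.Series(value.values, name="weight")],
    axis=1,
)
print(df_corlistO.head())
```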

・Execution result (only part shown; screenshot in the original post)

Creating a similarity network

・A similarity network is, in a nutshell, a visualization of similarity relationships as a graph.
・To visualize the similarity list created in the previous section, use an "undirected graph (undirected network)".
・Use a library called NetworkX to create undirected graphs. Importing it as "nx", a graph can be created with:
nx.from_pandas_edgelist(df, source, target, edge_attr, create_using)
・About each argument:
df: the DataFrame that the graph is built from
source: column name for one side of each correlated pair (source node)
target: column name for the other side of the pair (target node)
edge_attr: the weight of each edge
create_using: the graph type (nx.Graph for undirected)

・The created graph can be displayed by computing an optimal layout with pos = nx.spring_layout(graph), drawing with nx.draw_networkx(graph, pos), and then calling plt.show().

・Graph the "df_corlistO" created in the previous section. (Code shown as a screenshot in the original post.)
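A sketch using the df_corlistO built above (with the column names I assumed there):

```python
import networkx as nx
import matplotlib.pyplot as plt

# Build an undirected graph from the similarity list
G = nx.from_pandas_edgelist(df_corlistO, source="source", target="target",
                            edge_attr="weight", create_using=nx.Graph())

pos = nx.spring_layout(G, seed=0)  # compute node positions
# Assumption: a Japanese-capable font such as IPAexGothic is installed
nx.draw_networkx(G, pos, font_family="IPAexGothic")
plt.axis("off")
plt.show()
```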

・Graph (screenshot in the original post)

Similarity network characteristics

・Although we managed to output a graph, it is difficult to grasp its features at a glance.
・Therefore, we introduce new indices and judge the graph quantitatively.
・There are various indices used for this, but they will not be explained in detail here. The ones used this time are the "average clustering coefficient" and "betweenness centrality".
・The clustering coefficient indicates the density of connections between words; if its average is high, the network as a whole can also be said to be dense.
・Betweenness centrality indicates how often a node (one end of a correlated pair) is included in the shortest paths between all other pairs of nodes; the larger the value, the more efficiently that node transmits information, i.e. the higher its betweenness centrality.

・The average clustering coefficient is calculated as follows:
nx.average_clustering(graph, weight=None)
・Specify the edge attribute to use as the weight; if None, every edge has weight 1.

・Betweenness centrality is calculated as follows:
nx.betweenness_centrality(graph, weight=None)

・Code (screenshot in the original post)
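A sketch of both metrics on the graph G built above; passing weight="weight" assumes the edge attribute was used as the weight:

```python
import networkx as nx

# Average clustering coefficient of the whole network
print(nx.average_clustering(G, weight="weight"))

# Betweenness centrality of each node (a dict: node -> centrality)
print(nx.betweenness_centrality(G, weight="weight"))
```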

Extract similarity network topics

・The similarity network as a whole is made up of multiple partial networks (communities).
・Since the nodes within a community are densely connected, extracting a community means extracting a subnetwork with a high degree of similarity.
・To divide the network into partial networks, use an index called "modularity". The detailed calculation formula is omitted here.
・Community extraction using modularity is performed as follows:
greedy_modularity_communities(graph, weight=None)
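A sketch on the graph G from above; greedy_modularity_communities lives in networkx.algorithms.community:

```python
from networkx.algorithms.community import greedy_modularity_communities

# Each community is returned as a set of nodes (words)
communities = greedy_modularity_communities(G, weight="weight")
for i, community in enumerate(communities):
    print(f"community {i}: {sorted(community)}")
```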

・Execution result (screenshot in the original post)

Summary

・By converting to a word-document matrix, word data can be turned into numerical data; in many cases this is done with weighting.
・Once converted to numerical data, the correlation coefficient between each pair of columns can be calculated, and from this the degree of similarity between words can be obtained.
・By turning the similarities into a list and building a network from it, they can be visualized; and by using an index called modularity, topics can be extracted, that is, subnetworks with high similarity.

That's all for this time. Thank you for reading to the end.
