Python: Japanese text: Characterizing utterances from word similarity

Understanding word similarity

Morphological analysis of utterance text

In Python: Japanese text: Morphological analysis, we covered how to read the data to be analyzed and the basics of preprocessing for natural language processing. In this post, you will use what you have learned so far to learn how to process the utterance dataset that is the subject of analysis. In particular, we will implement preprocessing focused on word similarity.

This dataset has three types of breakdown flags: O for utterances that are not a breakdown, T for utterances that cannot clearly be called a breakdown but feel somewhat odd, and X for utterances that clearly feel odd (a breakdown).

Here, we will process the data based on the O flag, i.e. utterances that are not a breakdown.

About the variables that appear in the example

Contents of the variable df_label_text_O, which holds only the non-breakdown utterances (line 49). It is displayed with its index and columns; the underlying values are a NumPy ndarray.

 0                                1
1 O Excuse me, who are you?
24 O Is that so? Do you like high school baseball?
48 O Koshien, right?
...  ..                              ...
2376 O Is that so?

Contents used when processing the non-breakdown utterance dataset (line 62): df_label_text_O.values converted to a Python list with tolist(); the loop variable row takes each inner list in turn.

[['O', 'Excuse me, who are you?'], ['O', 'Really. Do you like high school baseball?'], ['O', 'Koshien, right?'], ... ['O', 'Is that so.']]

Example:

import os
import json
import pandas as pd
import re
from janome.tokenizer import Tokenizer


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create an empty list to store flags and utterances
label_text = []

#Process 10 JSON files one by one
for file in file_dir[:10]:
    #Read in read-only mode
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)

    # Extract the utterance content and flags from the utterance array `turns`
    for turn in json_data['turns']:
        turn_index = turn['turn-index'] #Utterance turn No
        speaker = turn['speaker'] #Speaker ID
        utterance = turn['utterance'] #Utterance content
        #Exclude the first turn because it is a system utterance
        if turn_index != 0:
            #Extract the utterance content of a person
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    #Extract the breakdown flag
                    a = annotate['breakdown']
                    #Store flags and utterances of people in a list
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Convert the list `label_text` to a DataFrame
df_label_text = pd.DataFrame(label_text)

#Remove duplicate lines
df_label_text = df_label_text.drop_duplicates()

#Extract only non-breakdown utterances
df_label_text_O = df_label_text[df_label_text[0] == 'O']

t = Tokenizer()

#Build the word-separated (tokenized) non-breakdown utterance dataset
morpO = []  #Stores word-separated words
tmp1 = []
tmp2 = ''

#Read row by row
# .values: get the values without the index and column labels
# .tolist(): convert the NumPy ndarray to a Python list
for row in df_label_text_O.values.tolist():
    #Remove digits and alphabetic characters with a regular expression
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')

    #Perform morphological analysis with Janome
    for token in t.tokenize(reg_row):
        #Collect the surface form of each token (joined and appended to `morpO` below)
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Output morphologically analyzed words
pd.DataFrame(morpO)


What is a word document matrix?

As explained earlier, in order to analyze natural language data, word data (sentence data) must be converted into numerical data. One such conversion method is the word document matrix (term-document matrix).

The word document matrix is a tabular representation of the frequency with which words appear in each document.

The word data contained in each document is obtained by morphological analysis. The number of occurrences of each word is then counted and converted into numerical data.

The word document matrix is expressed in matrix format, with either words in the row direction and documents in the column direction, or the opposite: documents in the row direction and words in the column direction.

[Figure: an N-word x M-document word document matrix]

When there are N kinds of words and M documents in total, this is called an N-row x M-column word document matrix.

In the word document matrix in the figure, document 1 contains word 1 twice, word 2 once, word 3 three times, ..., and word N zero times.
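To make the layout concrete, here is a minimal sketch of such a matrix as a pandas DataFrame; the counts are hypothetical and simply mirror the description above:

import pandas as pd

# Hypothetical counts that mirror the figure (rows = words, columns = documents)
word_doc_matrix = pd.DataFrame({'doc1': [2, 1, 3, 0],
                                'doc2': [0, 2, 1, 1]},
                               index=['word1', 'word2', 'word3', 'wordN'])
print(word_doc_matrix)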

To count the number of times a word appears, there are several approaches, such as using collections.Counter() from the Python standard library. Here we use CountVectorizer() from scikit-learn, which splits text into words and counts how many times each word appears. Below is an example of creating a word document matrix with it.
from sklearn.feature_extraction.text import CountVectorizer

# Create a vectorizer with `CountVectorizer()`
CV = CountVectorizer()
corpus = ['This is a pen.',
          'That is a bot.',]

# Learn the vocabulary of `corpus` with `fit_transform()` and convert the word counts into a matrix
X = CV.fit_transform(corpus)
print(X)

>>>Output result
  (0, 2)    1
  (0, 1)    1
  (0, 4)    1
  (1, 0)    1
  (1, 3)    1
  (1, 1)    1

# Check the list of learned words with `get_feature_names()`
print(CV.get_feature_names())

>>>Output result
['bot', 'is', 'pen', 'that', 'this']

#Convert the counted occurrences to an array with `toarray()` and display it
print(X.toarray())

>>>Output result
#Rows: the order of the sentences given in `corpus`
#Columns: the order of the words returned by `get_feature_names()`
[[0 1 1 0 1]
 [1 1 0 1 0]]
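For comparison, the collections.Counter() approach mentioned above can also count word occurrences; here is a minimal sketch with a made-up, whitespace-separated sample text:

from collections import Counter

# Count word occurrences in a whitespace-separated text
tokens = 'this is a pen and this is a bot'.split()
counts = Counter(tokens)
print(counts.most_common(3))  # e.g. [('this', 2), ('is', 2), ('a', 2)]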

Usage example:

import os
import json
import pandas as pd
import numpy as np
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create an empty list to store flags and utterances
label_text = []

#Process 10 JSON files one by one
for file in file_dir[:10]:
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)

    # Extract the utterance content and flags from the utterance array `turns`
    for turn in json_data['turns']:
        turn_index = turn['turn-index']
        speaker = turn['speaker']
        utterance = turn['utterance']
        if turn_index != 0:
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Convert the list `label_text` to a DataFrame and remove duplicates
df_label_text = pd.DataFrame(label_text)
df_label_text = df_label_text.drop_duplicates()
#Extract only non-breakdown utterances
df_label_text_O = df_label_text[df_label_text[0] == 'O']

t = Tokenizer()

#Build the word-separated (tokenized) non-breakdown utterance dataset
morpO = []
tmp1 = []
tmp2 = ''

#Remove digits and alphabetic characters
for row in df_label_text_O.values.tolist():
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')
    #Morphological analysis with Janome
    for token in t.tokenize(reg_row):
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Convert from list format to NumPy array (because array is faster)
morpO_array = np.array(morpO)

#Count the number of times a word appears
cntvecO = CountVectorizer()

#Learn and convert the number of word occurrences to an array
morpO_cntvecs = cntvecO.fit_transform(morpO_array)

#Convert to ndarray array
morpO_cntarray = morpO_cntvecs.toarray()

#Display the number of occurrences of words in DataFrame format
#columns: split words
#index (row): Original utterance data
pd.DataFrame(morpO_cntarray, columns=cntvecO.get_feature_names(),
             index=morpO).head(20)


A note on single-character words

By default, single-character words are not counted. Japanese has words that carry meaning even as a single character, so be careful when handling Japanese. To count single-character words as well, specify token_pattern='(?u)\\b\\w+\\b' in CountVectorizer().

CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
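A minimal sketch of the difference, using a hypothetical word-separated Japanese sentence (and the same get_feature_names() API as the examples above):

from sklearn.feature_extraction.text import CountVectorizer

# A word-separated Japanese sentence (hypothetical example)
corpus = ['私 は 本 を 読む']

# The default pattern requires two or more characters, so single-character words are dropped
print(CountVectorizer().fit(corpus).get_feature_names())  # ['読む']

# With the token_pattern above, single-character words are also counted
cv = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
print(cv.fit(corpus).get_feature_names())  # now also includes 私, は, 本, を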

What is a weighted word document matrix?

In a word document matrix whose values are the occurrence counts (frequencies) of words, words that appear universally (such as "I" and "is") tend to appear frequently in any document.

On the other hand, words that appear only in specific documents appear less frequently, which makes it difficult to characterize each document from its words. Therefore, instead of raw counts, the word document matrix often uses TF-IDF values, obtained by multiplying TF (Term Frequency) by IDF (Inverse Document Frequency).

The IDF value of a word can be calculated as log(total number of documents / number of documents in which the word appears) + 1. For example, if a word is contained in 3 out of 4 documents, its IDF value is log(4/3) + 1 ≒ 1.1. Words that appear only in particular documents have higher IDF values.

A word with a large IDF value is of high importance and characterizes the documents in which it appears.

Multiplying the TF by the IDF gives the TF-IDF value.
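As a minimal sketch of the formula above (using the base-10 logarithm so that the example works out to roughly 1.1; note that scikit-learn's TfidfVectorizer uses the natural logarithm internally):

import numpy as np

n_documents = 4   # total number of documents
n_containing = 3  # documents that contain the word

idf = np.log10(n_documents / n_containing) + 1
print(round(idf, 2))  # 1.12, i.e. roughly 1.1

# A word appearing in only 1 of the 4 documents gets a higher IDF value
print(round(np.log10(4 / 1) + 1, 2))  # 1.6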

Below is an example of creating a weighted word document matrix of TF-IDF values using TfidfVectorizer().

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

#Display 2 digits after the decimal point
np.set_printoptions(precision=2)
docs = np.array([
    "White black red", "White white black", "Red black"
])

# Create a vectorizer with `TfidfVectorizer()`
vectorizer = TfidfVectorizer(use_idf=True, token_pattern="(?u)\\b\\w+\\b")

# Learn `docs` with `fit_transform()` and convert the weighted word occurrences into a matrix
vecs = vectorizer.fit_transform(docs)
print(vecs.toarray())
# >>Output result
[[ 0.62  0.62  0.48]
 [ 0.93  0.    0.36]
 [ 0.    0.79  0.61]]

① vectorizer = TfidfVectorizer() generates a converter that performs vector representation (quantifies the words).

② If use_idf=False is set, only the TF is used for weighting.

③ vectorizer.fit_transform() converts the documents into vectors. The argument is an array of texts whose words are separated by whitespace characters.

④ toarray() converts the output to a NumPy ndarray.

np.set_printoptions() is a function that defines the display format of NumPy arrays; it only affects how values are shown with print(), and the original values do not change. The argument precision= specifies the number of digits displayed after the decimal point.

Usage example:

import os
import json
import pandas as pd
import numpy as np
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create an empty list to store flags and utterances
label_text = []

#Process 10 JSON files one by one
for file in file_dir[:10]:
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)

    # Extract the utterance content and flags from the utterance array `turns`
    for turn in json_data['turns']:
        turn_index = turn['turn-index']
        speaker = turn['speaker']
        utterance = turn['utterance']
        if turn_index != 0:
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Convert the list `label_text` to a DataFrame and remove duplicates
df_label_text = pd.DataFrame(label_text)
df_label_text = df_label_text.drop_duplicates()
#Extract only non-breakdown utterances
df_label_text_O = df_label_text[df_label_text[0] == 'O']

t = Tokenizer()

#Build the word-separated (tokenized) non-breakdown utterance dataset
morpO = []
tmp1 = []
tmp2 = ''

#Remove digits and alphabetic characters
for row in df_label_text_O.values.tolist():
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')
    #Morphological analysis with Janome
    for token in t.tokenize(reg_row):
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Convert from list format to NumPy array (because array is faster)
morpO_array = np.array(morpO)

#(1) Generate a converter that performs vector representation
tfidf_vecO = TfidfVectorizer(use_idf=True)

#② Convert words to vector representation
morpO_tfidf_vecs = tfidf_vecO.fit_transform(morpO_array)

#③ Convert to ndarray array
morpO_tfidf_array = morpO_tfidf_vecs.toarray()

#Display words (vector representation) in DataFrame format
pd.DataFrame(morpO_tfidf_array, columns=tfidf_vecO.get_feature_names(), 
             index=morpO).head(20)


Calculate word similarity (correlation)

A feature (quantity) is a property of the data that distinguishes it from other data.

In the word document matrix created by CountVectorizer(), the number of times a word appears is used as the word's feature; in the word document matrix created by TfidfVectorizer(), the TF-IDF value of the word is used instead.

For example, when deciding whether the object in an image is a dog or a cat, you may first, without realizing it, pay attention to the shape of the ears. In this case, the ears (the region containing them) are the feature. In a document classification problem, each word is used as a feature to build a supervised learning model.

Here, unlike the above, we look at how similarly two words appear; in other words, we build an unsupervised learning model that uses similarity as the feature.

A familiar measure of similarity is the correlation coefficient. Cosine similarity, which measures the similarity between vectors, and the Jaccard coefficient, which measures the similarity between sets, are also well known.

Here, we use the corr() method of pandas.DataFrame to obtain the similarity by calculating the correlation coefficient between each pair of columns.

The corr() method considers only columns whose data type is numeric or Boolean; strings and missing values (NaN) are excluded.

corr = DataFrame.corr()

In the argument of corr(), specify how the correlation coefficient is calculated from the following (a small example follows the list).

'pearson': Pearson's product moment correlation coefficient (default)
'kendall': Kendall rank correlation coefficient
'spearman': Spearman's rank correlation coefficient
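A minimal sketch of corr() on a small numeric DataFrame (the values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 4, 6, 8],
                   'c': [4, 3, 2, 1]})

# Pearson correlation (default); 'kendall' or 'spearman' can also be specified
print(df.corr())
print(df.corr(method='spearman'))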

Usage example:

import os
import json
import pandas as pd
import numpy as np
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create an empty list to store flags and utterances
label_text = []

#Process 10 JSON files one by one
for file in file_dir[:10]:
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)

    # Extract the utterance content and flags from the utterance array `turns`
    for turn in json_data['turns']:
        turn_index = turn['turn-index']
        speaker = turn['speaker']
        utterance = turn['utterance']
        if turn_index != 0:
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Convert the list `label_text` to a DataFrame and remove duplicates
df_label_text = pd.DataFrame(label_text)
df_label_text = df_label_text.drop_duplicates()
#Extract only non-breakdown utterances
df_label_text_O = df_label_text[df_label_text[0] == 'O']

t = Tokenizer()

#Build the word-separated (tokenized) non-breakdown utterance dataset
morpO = []
tmp1 = []
tmp2 = ''

#Remove digits and alphabetic characters
for row in df_label_text_O.values.tolist():
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')
    #Morphological analysis with Janome
    for token in t.tokenize(reg_row):
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Create a TF-IDF weighted word document matrix
morpO_array = np.array(morpO)
tfidf_vecO = TfidfVectorizer(use_idf=True)
morpO_tfidf_vecs = tfidf_vecO.fit_transform(morpO_array)
morpO_tfidf_array = morpO_tfidf_vecs.toarray()

#Convert the TF-IDF values to DataFrame format
dtmO = pd.DataFrame(morpO_tfidf_array, columns=tfidf_vecO.get_feature_names(), 
             index=morpO).head(20)

#Create a correlation matrix
corr_matrixO = dtmO.corr().abs()
# `.abs()` is a method that takes the absolute value
#Display of correlation matrix
corr_matrixO


Understanding the characteristics of utterances from word similarity

Creating a similarity list

From here, we will carry out a quantitative analysis through network analysis, using the correlation coefficients between pairs of words created in the previous section as features.

Convert the correlation coefficient from matrix format to list format for network analysis.

To convert from matrix format to list format, use the stack() method of pandas.DataFrame.
from pandas import DataFrame

#Prepare DataFrame
df = DataFrame([[0.1, 0.2, 0.3], [0.4, 'NaN', 0.5]],
               columns=['test1', 'test2', 'test3'],
               index=['AA', 'BB'])
print(df)

# >>>Output result
      test1      test2      test3
AA       0.1       0.2       0.3
BB       0.4       NaN       0.5
# stack :Column-to-row conversion
print(df.stack())

# >>>Output result
AA   test1    0.1
     test2    0.2
     test3    0.3
BB   test1    0.4
     test2    NaN
     test3    0.5
#unstack: Row-to-column conversion
print(df.unstack())

# >>>Output result
test1  AA    0.1
       BB    0.4
test2  AA    0.2
       BB    NaN
test3  AA    0.3
       BB    0.5

Usage example:

import os
import json
import pandas as pd
import numpy as np
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create a list of flags and utterances
label_text = []
for file in file_dir[:10]:
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)
    for turn in json_data['turns']:
        turn_index = turn['turn-index']
        speaker = turn['speaker']
        utterance = turn['utterance']
        if turn_index != 0:
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Remove duplicates and extract only non-breakdown utterances
df_label_text = pd.DataFrame(label_text)
df_label_text = df_label_text.drop_duplicates()
df_label_text_O = df_label_text[df_label_text[0] == 'O']

#Morphological analysis by Janome
t = Tokenizer()

morpO = []
tmp1 = []
tmp2 = ''

for row in df_label_text_O.values.tolist():
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')
    for token in t.tokenize(reg_row):
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Create a TF-IDF weighted word document matrix
morpO_array = np.array(morpO)
tfidf_vecO = TfidfVectorizer(use_idf=True)
morpO_tfidf_vecs = tfidf_vecO.fit_transform(morpO_array)
morpO_tfidf_array = morpO_tfidf_vecs.toarray()

#Convert to DataFrame format and create correlation matrix
dtmO = pd.DataFrame(morpO_tfidf_array, columns=tfidf_vecO.get_feature_names(), 
             index=morpO)
corr_matrixO = dtmO.corr().abs()

#Convert the correlation matrix `corr_matrixO` from matrix format to stacked (list) format
corr_stackO = corr_matrixO.stack()
index = pd.Series(corr_stackO.index.values)
value = pd.Series(corr_stackO.values)

#Extract pairs with a correlation coefficient of 0.5 or more and less than 1.0
tmp3 = [] #List of index values (word pairs) whose correlation is 0.5 or more and less than 1.0
tmp4 = [] #List of correlation values of 0.5 or more and less than 1.0

for i in range(0, len(index)):
    if value[i] >= 0.5 and value[i] < 1.0:
        tmp1 = str(index[i][0]) + ' ' + str(index[i][1])
        tmp2 = [s for s in tmp1.split()]
        tmp3.append(tmp2)
        tmp4 = np.append(tmp4, value[i])

tmp3 = pd.DataFrame(tmp3)
tmp3 = tmp3.rename(columns={0: 'node1', 1: 'node2'})
tmp4 = pd.DataFrame(tmp4)
tmp4 = tmp4.rename(columns={0: 'weight'})

#Concatenate the DataFrames `tmp3` and `tmp4` horizontally
df_corlistO = pd.concat([tmp3, tmp4], axis=1)

#Display the created DataFrame
df_corlistO.head(20)


Creating a similarity network

A network is one way to express relationships between objects. A well-known example is the network of friendships on social media.

In a network structure, the objects are represented as nodes and the relationships between them as edges.

Edges can have a weight; in a friendship network the weight corresponds to intimacy: the closer two people are, the higher the weight.

Route maps, airline networks, and co-occurrence or similarity relationships between words can also be expressed as networks.

To visualize a group of words whose relationships have no concept of direction, such as the similarity list created in the previous section, we use an undirected graph (or undirected network). A weighted graph is also called a network.

An undirected graph is one in which the edges that make up the network have no direction. Conversely, if the edges have a direction, it is called a directed graph (or directed network).

Creating an undirected graph (undirected network)

Python has a library called NetworkX. In this section, we use this library to visualize the similarity list created in the previous section.

#Import the `NetworkX` library
import networkx as nx

#Creating undirected graphs
network = nx.from_pandas_edgelist(df, source='source', target='target', edge_attr=None, create_using=None)
① df: the pandas DataFrame that is the source of the graph

② source: column name of the source nodes; specified as str (string) or int (integer)

③ target: column name of the target nodes; specified as str or int

④ edge_attr: the edge attribute (weight) taken from each row; specified as str or int, an iterable, or True

⑤ create_using: graph type (optional)

Undirected graph: nx.Graph (default)
Directed graph: nx.DiGraph
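As a hedged illustration, here is a minimal sketch with a hypothetical edge-list DataFrame:

import pandas as pd
import networkx as nx

# Hypothetical similarity list: two nodes and a weight per row
edges = pd.DataFrame({'node1': ['word_a', 'word_b', 'word_a'],
                      'node2': ['word_b', 'word_c', 'word_c'],
                      'weight': [0.8, 0.6, 0.5]})

# Build an undirected graph; 'weight' is stored as an edge attribute
G = nx.from_pandas_edgelist(edges, source='node1', target='node2', edge_attr=['weight'])
print(G.number_of_nodes(), G.number_of_edges())  # 3 3
print(G['word_a']['word_b'])                     # {'weight': 0.8}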

Visualization of graph (network)

#Import `pyplot` from the `Matplotlib` library
import matplotlib.pyplot as plt

#Calculate the optimum display position for each node
pos = nx.spring_layout(graph)

#Draw graph
nx.draw_networkx(graph, pos)

#Display graphs using Matplotlib
plt.show()

Usage example:

import os
import json
import pandas as pd
import numpy as np
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
import networkx as nx
import matplotlib.pyplot as plt


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create a list of flags and utterances
label_text = []
for file in file_dir[:10]:
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)
    for turn in json_data['turns']:
        turn_index = turn['turn-index']
        speaker = turn['speaker']
        utterance = turn['utterance']
        if turn_index != 0:
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Remove duplicates and extract only non-breakdown utterances
df_label_text = pd.DataFrame(label_text)
df_label_text = df_label_text.drop_duplicates()
df_label_text_O = df_label_text[df_label_text[0] == 'O']

#Morphological analysis by Janome
t = Tokenizer()

morpO = []
tmp1 = []
tmp2 = ''

for row in df_label_text_O.values.tolist():
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')
    for token in t.tokenize(reg_row):
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Create a TF-IDF weighted word document matrix
morpO_array = np.array(morpO)
tfidf_vecO = TfidfVectorizer(use_idf=True)
morpO_tfidf_vecs = tfidf_vecO.fit_transform(morpO_array)
morpO_tfidf_array = morpO_tfidf_vecs.toarray()

#Convert to DataFrame format and create correlation matrix
dtmO = pd.DataFrame(morpO_tfidf_array)

corr_matrixO = dtmO.corr().abs()

#Create the similarity list from the non-breakdown utterances
corr_stackO = corr_matrixO.stack()
index = pd.Series(corr_stackO.index.values)
value = pd.Series(corr_stackO.values)

tmp3 = []
tmp4 = []
for i in range(0, len(index)):
    if value[i] >= 0.5 and value[i] < 1.0:
        tmp1 = str(index[i][0]) + ' ' + str(index[i][1])
        tmp2 = [int(s) for s in tmp1.split()]
        tmp3.append(tmp2)
        tmp4 = np.append(tmp4, value[i])

tmp3 = pd.DataFrame(tmp3)
tmp3 = tmp3.rename(columns={0: 'node1', 1: 'node2'})
tmp4 = pd.DataFrame(tmp4)
tmp4 = tmp4.rename(columns={0: 'weight'})
df_corlistO = pd.concat([tmp3, tmp4], axis=1)

#① Create an undirected graph
G_corlistO = nx.from_pandas_edgelist(df_corlistO, 'node1', 'node2', ['weight'])

#② Visualize the created graph
#Layout settings
pos = nx.spring_layout(G_corlistO)
nx.draw_networkx(G_corlistO, pos)
plt.show()


Similarity network characteristics

As the graph visualized in the previous section shows, a real network has a complicated structure, and it is difficult to grasp its characteristics at a glance.

In such cases, we grasp the characteristics quantitatively with indices. Some indices capture the network as a whole (global), while others focus on a particular node (local).

Here are some commonly used indices (a short sketch of computing them follows the list):

Degree: the number of edges attached to a node.
Degree distribution: a histogram of the number of nodes having a given degree.
Clustering coefficient: how densely the nodes adjacent to a node are connected to each other.
Path length: the distance from one node to another.
Centrality: the degree to which a node plays a central role in the network.
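As a hedged sketch, these indices can be computed with NetworkX on a tiny hypothetical graph as follows:

import networkx as nx

# A tiny hypothetical undirected graph
G = nx.Graph([('a', 'b'), ('b', 'c'), ('c', 'a'), ('c', 'd')])

print(dict(G.degree()))                      # degree of each node
print(nx.degree_histogram(G))                # degree distribution
print(nx.clustering(G))                      # clustering coefficient per node
print(nx.shortest_path_length(G, 'a', 'd'))  # path length between two nodes
print(nx.betweenness_centrality(G))          # one centrality measure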

Now, for the network created in the previous section, let us calculate the average clustering coefficient and the betweenness centrality and look at its features.

In this network, the clustering coefficient represents the density of connections between words, and the betweenness centrality represents the degree to which a word acts as a hub in the network.

Comparing the average clustering coefficients of the two networks, you can see that the words in the breakdown utterances are more closely connected than those in the non-breakdown utterances.

Also, comparing the top 5 words with the highest betweenness centrality, the non-breakdown utterances contain words about the Obon holidays, while it can be inferred that words about early-morning baseball play a central role in the breakdown utterances.

① Utterances that are not a breakdown
<Average clustering coefficient> 0.051924357
<Top 5 words with the highest betweenness centrality>
Holiday, Obon, here, few, dawn

② Utterances that are a breakdown
<Average clustering coefficient> 0.069563257
<Top 5 words with the highest betweenness centrality>
This time, let's play baseball, early morning, he

Usage example:

import os
import json
import pandas as pd
import numpy as np
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
import networkx as nx
import matplotlib.pyplot as plt


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create a list of flags and utterances
label_text = []
for file in file_dir[:10]:
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)
    for turn in json_data['turns']:
        turn_index = turn['turn-index']
        speaker = turn['speaker']
        utterance = turn['utterance']
        if turn_index != 0:
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Remove duplicates and extract only non-breakdown utterances
df_label_text = pd.DataFrame(label_text)
df_label_text = df_label_text.drop_duplicates()
df_label_text_O = df_label_text[df_label_text[0] == 'O']

#Morphological analysis by Janome
t = Tokenizer()

morpO = []
tmp1 = []
tmp2 = ''

for row in df_label_text_O.values.tolist():
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')
    for token in t.tokenize(reg_row):
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Create a TF-IDF weighted word document matrix
morpO_array = np.array(morpO)
tfidf_vecO = TfidfVectorizer(use_idf=True)
morpO_tfidf_vecs = tfidf_vecO.fit_transform(morpO_array)
morpO_tfidf_array = morpO_tfidf_vecs.toarray()

#Convert to DataFrame format and create correlation matrix
dtmO = pd.DataFrame(morpO_tfidf_array, columns=tfidf_vecO.get_feature_names(), 
             index=morpO)
corr_matrixO = dtmO.corr().abs()

#Create the similarity list from the non-breakdown utterances
corr_stackO = corr_matrixO.stack()
index = pd.Series(corr_stackO.index.values)
value = pd.Series(corr_stackO.values)

tmp3 = []
tmp4 = []
for i in range(0, len(index)):
    if value[i] >= 0.5 and value[i] < 1.0:
        tmp1 = str(index[i][0]) + ' ' + str(index[i][1])
        tmp2 = [s for s in tmp1.split()]
        tmp3.append(tmp2)
        tmp4 = np.append(tmp4, value[i])

tmp3 = pd.DataFrame(tmp3)
tmp3 = tmp3.rename(columns={0: 'node1', 1: 'node2'})
tmp4 = pd.DataFrame(tmp4)
tmp4 = tmp4.rename(columns={0: 'weight'})
df_corlistO = pd.concat([tmp3, tmp4], axis=1)

#Create undirected graph
G_corlistO = nx.from_pandas_edgelist(df_corlistO, 'node1', 'node2', ['weight'])

#For the non-breakdown utterance dataset
#① Calculate the average clustering coefficient
print('Average clustering coefficient')
print(nx.average_clustering(G_corlistO, weight='weight'))
print()

#② Calculate the betweenness centrality
bc = nx.betweenness_centrality(G_corlistO, weight='weight')
print('Betweenness centrality')
for k, v in sorted(bc.items(), key=lambda x: -x[1]):
    print(str(k) + ': ' + str(v))


Calculating the average clustering coefficient

The higher the average of the clustering coefficients of all nodes, the denser the network. The average clustering coefficient is calculated with nx.average_clustering().

nx.average_clustering(G, weight=None)

① G: the graph (here, the undirected graph G_corlistO created in the previous section).

② weight: the edge attribute to use as the weight. If None, every edge has weight 1.

Calculating betweenness centrality

Betweenness centrality is determined by how many of the shortest paths between all pairs of nodes pass through a given node. In other words, the nodes that are used most when conveying information efficiently have higher betweenness centrality.

nx.betweenness_centrality(G, weight=None)

① G: the graph (here, the undirected graph G_corlistO created in the previous section).

② weight: the edge attribute to use as the weight. If None, all edge weights are considered equal.

Topic extraction from the similarity network

A single network consists of multiple partial networks (communities). The nodes in a community are characterized by being densely connected to each other by edges.

By removing the sparse edges of a network, it can be divided into partial networks; in other words, communities, i.e. groups of words with high similarity, can be extracted.

To divide the network, we use an index called modularity.

Modularity quantifies the quality of a division: for each community, it takes the fraction of all edges that fall inside the community and subtracts the fraction expected from the node degrees, namely the square of (the sum of the degrees of the community's nodes divided by the sum of the degrees of all nodes in the network, which equals twice the number of edges); these differences are summed over all communities.

The higher the modularity value, the more tightly the nodes within each community are connected.
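As a hedged sketch of evaluating the modularity of a given division, NetworkX provides a modularity() function; the tiny graph and its two communities below are hypothetical:

import networkx as nx
from networkx.algorithms.community import modularity

# A tiny hypothetical graph: two dense triangles joined by a single edge
G = nx.Graph([('a', 'b'), ('b', 'c'), ('c', 'a'),
              ('d', 'e'), ('e', 'f'), ('f', 'd'),
              ('c', 'd')])

# Modularity of splitting the graph into the two obvious communities
print(modularity(G, [{'a', 'b', 'c'}, {'d', 'e', 'f'}]))  # roughly 0.36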

Now let us extract communities using modularity. Checking the words of the community with the largest number of nodes, in the non-breakdown network and in the breakdown network respectively, gave the following results.

① Utterances that are not a breakdown
Obon, end, slack, rest, cut, trouble, few, homecoming, forgetting, lazy, dawn, continuing, consecutive holidays, concentration

② Utterances that are a breakdown
Please, about, about, then take, which, addiction, of course, good, work, bias, split bill, danger, wait, mind, time, nutrition, carelessness, sleep, meal

From the non-breakdown utterances, you can guess there are topics such as "there are few Obon holidays and it is hard to go home" and "if consecutive holidays continue, I get lazy after the holidays".

From the breakdown utterances, there seem to be topics such as "don't let the nutrition of your diet become unbalanced" and "get enough sleep".

Similarly, you should be able to guess from the words what topics are included in other communities.

Extracting communities using modularity

greedy_modularity_communities(G, weight=None)

① G: the graph (here, the undirected graph G_corlistO created in the previous section).

② weight: the edge attribute to use as the weight. If None, all edge weights are considered equal.

Usage example:

import os
import json
import pandas as pd
import numpy as np
import re
from janome.tokenizer import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms.community import greedy_modularity_communities


#Specify init100 directory
file_path = './6110_nlp_preprocessing_data/init100/'
file_dir = os.listdir(file_path)

#Create a list of flags and utterances
label_text = []
for file in file_dir[:20]:
    r = open(file_path + file, 'r', encoding='utf-8')
    json_data = json.load(r)
    for turn in json_data['turns']:
        turn_index = turn['turn-index']
        speaker = turn['speaker']
        utterance = turn['utterance']
        if turn_index != 0:
            if speaker == 'U':
                u_text = ''
                u_text = utterance
            else:
                a = ''
                for annotate in turn['annotations']:
                    a = annotate['breakdown']
                    tmp1 = str(a) + '\t' + u_text
                    tmp2 = tmp1.split('\t')
                    label_text.append(tmp2)

#Remove duplicates and extract only non-breakdown utterances
df_label_text = pd.DataFrame(label_text)
df_label_text = df_label_text.drop_duplicates()
df_label_text_O = df_label_text[df_label_text[0] == 'O']

#Morphological analysis by Janome
t = Tokenizer()

morpO = []
tmp1 = []
tmp2 = ''

for row in df_label_text_O.values.tolist():
    reg_row = re.sub('[0-9a-zA-Z]+', '', row[1])
    reg_row = reg_row.replace('\n', '')
    for token in t.tokenize(reg_row):
        tmp1.append(token.surface)
        tmp2 = ' '.join(tmp1)
    morpO.append(tmp2)
    tmp1 = []

#Create a TF-IDF weighted word document matrix
morpO_array = np.array(morpO)
tfidf_vecO = TfidfVectorizer(use_idf=True)
morpO_tfidf_vecs = tfidf_vecO.fit_transform(morpO_array)
morpO_tfidf_array = morpO_tfidf_vecs.toarray()

#Convert to DataFrame format and create correlation matrix
dtmO = pd.DataFrame(morpO_tfidf_array, columns=tfidf_vecO.get_feature_names(), 
             index=morpO)
corr_matrixO = dtmO.corr().abs()

#Create the similarity list from the non-breakdown utterances
corr_stackO = corr_matrixO.stack()
index = pd.Series(corr_stackO.index.values)
value = pd.Series(corr_stackO.values)

tmp3 = []
tmp4 = []
for i in range(0, len(index)):
    if value[i] >= 0.5 and value[i] < 1.0:
        tmp1 = str(index[i][0]) + ' ' + str(index[i][1])
        tmp2 = [s for s in tmp1.split()]
        tmp3.append(tmp2)
        tmp4 = np.append(tmp4, value[i])

tmp3 = pd.DataFrame(tmp3)
tmp3 = tmp3.rename(columns={0: 'node1', 1: 'node2'})
tmp4 = pd.DataFrame(tmp4)
tmp4 = tmp4.rename(columns={0: 'weight'})
df_corlistO = pd.concat([tmp3, tmp4], axis=1)

#Create undirected graph
G_corlistO = nx.from_pandas_edgelist(df_corlistO, 'node1', 'node2', ['weight'])

#For the non-breakdown utterance dataset
#Community extraction
cm_corlistO = list(greedy_modularity_communities(G_corlistO, weight='weight'))

#View nodes belonging to each community
cm_corlistO

