3. Natural language processing with Python 2-1. Co-occurrence network

** 1. Preparation of text data **

⑴ Import of various modules

import re
import zipfile
import urllib.request
import os.path
import glob

⑵ Get file path

URL = 'https://www.aozora.gr.jp/cards/000148/files/772_ruby_33099.zip'

⑶ Acquisition of text file and extraction of text

def download(URL):

    #Download zip file
    zip_file = re.split(r'/', URL)[-1]
    urllib.request.urlretrieve(URL, zip_file)
    extract_dir = os.path.splitext(zip_file)[0]

    #Unzip the zip file and save its contents
    with zipfile.ZipFile(zip_file) as zip_object:
        zip_object.extractall(extract_dir)

    #Get the path of the text file
    path = os.path.join(extract_dir, '*.txt')
    txt_files = glob.glob(path)
    return txt_files[0]

def convert(download_text):

    #Read file (Aozora Bunko text files are Shift_JIS encoded)
    with open(download_text, 'rb') as f:
        data = f.read()
    text = data.decode('shift_jis')

    #Extraction of text: keep the part after the header rule,
    #before the colophon (底本：) and before the first page break
    text = re.split(r'\-{5,}', text)[2]
    text = re.split(r'底本：', text)[0]
    text = re.split(r'［＃改ページ］', text)[0]

    #Delete unnecessary parts (ruby readings, annotations, ruby markers,
    #line breaks, full-width spaces, quotation brackets)
    text = re.sub(r'《.+?》', '', text)
    text = re.sub(r'［＃.+?］', '', text)
    text = re.sub(r'｜', '', text)
    text = re.sub(r'\r\n', '', text)
    text = re.sub(r'\u3000', '', text)
    text = re.sub(r'「', '', text)
    text = re.sub(r'」', '', text)

    return text

#Get file path
download_file = download(URL)

#Extract only the text
text = convert(download_file)

#Split into a sentence-based list
sentences = text.split("。")
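To see what the cleanup step does without downloading anything, here is a minimal sketch that applies the same kind of substitutions to a short made-up string. The sample text and the names strip_aozora_markup and sentences_demo are my own assumptions for illustration; only the markers themselves (《…》 for ruby readings, ［＃…］ for annotations, ｜ for the start of a ruby span) come from the Aozora Bunko text format.

```python
import re

def strip_aozora_markup(text):
    """Remove ruby readings, annotations, ruby markers and full-width spaces."""
    text = re.sub(r'《.+?》', '', text)    # ruby readings such as 《わたくし》
    text = re.sub(r'［＃.+?］', '', text)  # editor annotations such as ［＃改ページ］
    text = re.sub(r'｜', '', text)         # marker for the start of a ruby span
    text = re.sub(r'\u3000', '', text)     # full-width space used for indentation
    return text

# Made-up sample, not taken from the downloaded file
sample = '\u3000私《わたくし》は｜個人主義《こじんしゅぎ》を説いた。話は終わった。［＃改ページ］'

cleaned = strip_aozora_markup(sample)

# Splitting on the full stop leaves an empty string at the end, so drop it
sentences_demo = [s for s in cleaned.split('。') if s]
print(sentences_demo)
```

Note that split("。") leaves an empty string after the last full stop. In the main script this is harmless, because an empty sentence simply yields no nouns.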


** 2. Creating co-occurrence data **

⑷ Installation of MeCab

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

⑸ Sentence-based noun list generation

import MeCab
mecab = MeCab.Tagger("-Ochasen")

#Generate a sentence-by-sentence noun list
#In the -Ochasen output, field [2] is the base form and field [3] the part of speech
noun_list = [
             [v.split()[2] for v in mecab.parse(sentence).splitlines()
             if (len(v.split())>=4 and v.split()[3][:2]=='名詞')]
             for sentence in sentences
             ]
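
The filtering logic can be tried without installing MeCab. The sketch below runs the same field test on a few hand-written lines laid out like the tab-separated -Ochasen format (surface form, reading, base form, part of speech). The sample lines are an assumption written for illustration, not real parser output, and extract_nouns is a hypothetical helper name.

```python
# Hand-written lines imitating the -Ochasen layout; NOT actual MeCab output
sample_output = (
    "猫\tネコ\t猫\t名詞-一般\n"
    "が\tガ\tが\t助詞-格助詞-一般\n"
    "鳴く\tナク\t鳴く\t動詞-自立\t五段・カ行イ音便\t基本形\n"
    "EOS"
)

def extract_nouns(chasen_text):
    """Return the base forms of all lines whose part of speech starts with 名詞."""
    nouns = []
    for line in chasen_text.splitlines():
        fields = line.split()
        # The EOS line has a single field, so the length check skips it
        if len(fields) >= 4 and fields[3].startswith('名詞'):
            nouns.append(fields[2])
    return nouns

demo_nouns = extract_nouns(sample_output)
print(demo_nouns)
```

Checking the field count before reading fields[3] is what keeps the EOS line from raising an IndexError.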

⑹ Generation of co-occurrence data

  • Co-occurrence data is a dictionary-type object that maps each co-occurring word pair to its frequency of occurrence.
import itertools
from collections import Counter
  • itertools: A module that collects iterator functions for efficient loop processing.
  • Counter: A class in the collections module for counting the number of occurrences of each element.
#Generate a sentence-based noun pair list
pair_list = [
             list(itertools.combinations(n, 2))
             for n in noun_list if len(n) >=2
             ]

#Flattening the noun pair list
all_pairs = []
for u in pair_list:
    all_pairs.extend(u)

#Count the frequency of noun pairs
cnt_pairs = Counter(all_pairs)
  • From the sentence-based noun list, take each sentence that contains two or more nouns, generate every two-word combination with itertools.combinations(), convert it to a list with list(), and store it in pair_list.
  • However, pair_list is still grouped by sentence, so it cannot be counted as it is. Flatten it by adding each sentence's pairs to the newly prepared variable all_pairs with extend().
  • Pass this to Counter() to generate the ** dictionary-type co-occurrence data ** cnt_pairs.
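
The three steps above can be traced on a toy input. This is a minimal sketch; toy_noun_list and the other toy_ names are made up for illustration.

```python
import itertools
from collections import Counter

# Three toy "sentences", already reduced to their nouns
toy_noun_list = [['猫', '名前'], ['猫', '人間', '名前'], ['人間']]

# Pairs per sentence; sentences with fewer than two nouns are skipped
toy_pair_list = [
    list(itertools.combinations(nouns, 2))
    for nouns in toy_noun_list if len(nouns) >= 2
]

# Flatten the per-sentence lists into one list of pairs
toy_all_pairs = []
for pairs in toy_pair_list:
    toy_all_pairs.extend(pairs)

# Count how often each pair occurs
toy_cnt_pairs = Counter(toy_all_pairs)
print(toy_cnt_pairs.most_common(1))
```

One thing to keep in mind: combinations() keeps the order in which the words appear, so ('猫', '名前') and ('名前', '猫') would be counted as different keys. Sorting each sentence's nouns before pairing is one possible way to merge them, although the script in this article does not do that.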


** 3. Creation of drawing data **

import pandas as pd
import numpy as np

⑺ Narrow down co-occurrence data

  • Narrow down the elements to simplify the appearance when drawing. Here, we will generate a list of the top 50 sets by appearance frequency.
tops = sorted(
    cnt_pairs.items(),
    key=lambda x: x[1], reverse=True
    )[:50]
  • The syntax combines sorted() with a lambda expression: the (pair, frequency) items of the dictionary-type object are sorted by the element specified under key = lambda.
  • The reference x[1] is the second element of each item, that is, the frequency. reverse = True sorts in descending order, and the top 50 pairs are then taken from the front of the sorted list.
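
As a small check of how this sort behaves, here is a sketch on a three-element dictionary that keeps the top 2 instead of the top 50. The dictionary toy_counts and its values are made up for illustration.

```python
# Toy co-occurrence counts (made-up values)
toy_counts = {('a', 'b'): 3, ('a', 'c'): 1, ('b', 'c'): 5}

# items() yields (pair, frequency) tuples, so x[1] is the frequency
toy_tops = sorted(
    toy_counts.items(),
    key=lambda x: x[1], reverse=True
)[:2]

print(toy_tops)
```

Since cnt_pairs is a Counter, cnt_pairs.most_common(50) should give an equivalent top-50 list. The sorted() form has the advantage of making the sort key explicit.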

⑻ Weighted data generation

noun_1 = []
noun_2 = []
frequency = []

#Creating a data frame
for n,f in tops:
    noun_1.append(n[0])
    noun_2.append(n[1])
    frequency.append(f)

df = pd.DataFrame({'First noun': noun_1, 'Second noun': noun_2, 'Frequency of appearance': frequency})

#Setting weighted data
weighted_edges = np.array(df)
  • The co-occurrence data of the top 50 pairs is converted to an array to make weighted_edges (weighted data).
  • The data frame df holds the same data before the conversion to an array, one row per pair.
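
What matters about the weighted data is that each row is a triple of (first noun, second noun, frequency). As a sketch under that reading, the same triples can also be built directly from the sorted list without a data frame. The names tops_demo and weighted_edges_demo and the values are made up for illustration.

```python
# Toy version of the sorted top list: ((noun, noun), frequency) items
tops_demo = [(('猫', '名前'), 2), (('猫', '人間'), 1)]

# Unpack each item into a flat (first noun, second noun, frequency) triple
weighted_edges_demo = [
    (first, second, freq) for (first, second), freq in tops_demo
]

print(weighted_edges_demo)
```

Going through a data frame, as the article does, makes it easy to inspect the data as a table before drawing.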


** 4. Drawing a network diagram **

⑼ Import of visualization library

import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline 
  • ** NetworkX ** is a package for creating and manipulating complex networks and graph structures in Python.
  • In a network diagram, the vertices are called ** nodes **, and the lines that connect the vertices are called ** edges **.
  • To display the node labels in Japanese, it is necessary to import japanize_matplotlib as follows and then specify a Japanese font.
#Module to make matplotlib support Japanese display
!pip install japanize-matplotlib
import japanize_matplotlib

⑽ Visualization by NetworkX

  • Drawing a network diagram with NetworkX takes three steps: ➀ create a graph object, ➁ load the data into it, and ➂ specify the appearance of nodes and edges and draw with matplotlib.
  • It is easy to overlook, but font_family = "IPAexGothic" is important: ** specifying a Japanese font for font_family ** makes the node labels display correctly in Japanese.
#Generating a graph object
G = nx.Graph()

#Reading weighted data
G.add_weighted_edges_from(weighted_edges)

#Drawing a network diagram
nx.draw_networkx(G,
                 node_shape = "s",
                 node_color = "c", 
                 node_size = 500,
                 edge_color = "gray", 
                 font_family = "IPAexGothic") #Font specification
plt.show()
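
Before drawing, it can help to confirm what the graph object actually contains. The sketch below loads two made-up weighted edges and reads the stored weights back. The names demo_edges, G_demo and widths are assumptions for illustration, and nothing is drawn here.

```python
import networkx as nx

# Toy weighted edges: (node, node, weight)
demo_edges = [('猫', '名前', 2), ('猫', '人間', 1)]

# Step 1: create the graph object. Step 2: load the weighted data.
G_demo = nx.Graph()
G_demo.add_weighted_edges_from(demo_edges)

# Each frequency is stored as the 'weight' attribute of its edge
widths = [G_demo[u][v]['weight'] for u, v in G_demo.edges()]

print(G_demo.number_of_nodes(), G_demo.number_of_edges())
print(widths)
```

A list like widths could be passed as the width argument of nx.draw_networkx() so that frequent pairs are drawn with thicker lines. The diagram in this article uses a uniform edge width.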



  • To present the mechanism of co-occurrence network analysis as one continuous flow, I have glossed over details such as setting stop words (words to be excluded) and handling compound words (for example, treating "individualism" as one word instead of splitting it into "individual" and "principle").
  • Also, for convenience, I divided the work into four stages: ➀ text data preparation, ➁ co-occurrence data creation, ➂ drawing data creation, and ➃ network diagram drawing. In general, however, I think it is better understood as three stages: ➊ preprocessing, ➋ analysis, and ➌ visualization.
  • I think ➊ preprocessing in particular is the heart of natural language processing. In a script it may actually be incorporated as part of ➋, but in short it is the question of "how to extract the necessary words from the raw data". From what analytical perspective and by what criteria should words be extracted? The answer shows up directly in the analysis results and affects their interpretation. It is the stage of work that requires the most consideration and takes the most time and energy.
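
As one example of the preprocessing that was skipped, here is a minimal sketch of stop-word filtering applied to a sentence-based noun list. The stop-word set and the toy data are hypothetical. Which words to exclude depends on the analysis, and this article does not define such a list.

```python
# Hypothetical stop words; choose these according to the purpose of the analysis
stop_words = {'こと', 'もの', 'よう'}

# Toy sentence-based noun list (made-up data)
toy_noun_list = [['個人', '主義', 'こと'], ['もの', '自由']]

# Drop stop words from each sentence before generating pairs
filtered_noun_list = [
    [noun for noun in nouns if noun not in stop_words]
    for nouns in toy_noun_list
]

print(filtered_noun_list)
```

Applied before the pair generation step, a filter like this keeps very common, low-information nouns from dominating the top pairs.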

