3. Natural language processing with Python 2-1. Co-occurrence network

**1. Preparation of text data**

⑴ Importing the required modules

import re
import zipfile
import urllib.request
import os.path
import glob

⑵ Specifying the URL of the target file

URL = 'https://www.aozora.gr.jp/cards/000148/files/772_ruby_33099.zip'

⑶ Downloading the file and extracting the text

def download(url):

    #Download the zip file
    zip_file = re.split(r'/', url)[-1]
    urllib.request.urlretrieve(url, zip_file)
    dir_name = os.path.splitext(zip_file)[0]

    #Unzip the archive, then delete it
    with zipfile.ZipFile(zip_file) as zip_object:
        zip_object.extractall(dir_name)
    os.remove(zip_file)

    #Return the path of the extracted text file
    path = os.path.join(dir_name, '*.txt')
    file_list = glob.glob(path)
    return file_list[0]
def convert(download_text):

    #Read the file (Aozora Bunko files are Shift_JIS encoded)
    with open(download_text, 'rb') as f:
        data = f.read()
    text = data.decode('shift_jis')

    #Extract the body text (delimited by runs of hyphens and the colophon)
    text = re.split(r'\-{5,}', text)[2]
    text = re.split(r'底本：', text)[0]
    text = re.split(r'［＃改ページ］', text)[0]

    #Delete unnecessary parts
    text = re.sub(r'《.+?》', '', text)    #ruby (reading) annotations
    text = re.sub(r'［＃.+?］', '', text)  #editorial annotations
    text = re.sub(r'｜', '', text)         #ruby start marker
    text = re.sub(r'\r\n', '', text)       #line breaks
    text = re.sub(r'\u3000', '', text)     #full-width spaces
    text = re.sub(r'「', '', text)         #opening quotation marks
    text = re.sub(r'」', '', text)         #closing quotation marks

    return text
#Get file path
download_file = download(URL)

#Extract only the text
text = convert(download_file)

#Split the text into a list of sentences at the Japanese full stop
sentences = text.split("。")
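As a quick sanity check, the cleaning patterns used in `convert()` can be tried on a short Aozora-style snippet; the sample string below is made up for illustration:

```python
import re

# Hypothetical Aozora-style fragment with ruby and an editorial annotation
sample = '吾輩《わがはい》は猫である。［＃ここから改行］名前はまだ無い。'

sample = re.sub(r'《.+?》', '', sample)   # strip ruby readings
sample = re.sub(r'［＃.+?］', '', sample)  # strip editorial annotations

print(sample)  # → 吾輩は猫である。名前はまだ無い。
```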


**2. Creating co-occurrence data**

⑷ Installation of MeCab

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

⑸ Generating a sentence-by-sentence noun list

import MeCab
mecab = MeCab.Tagger("-Ochasen")

#Generate a sentence-by-sentence noun list
noun_list = [
             [v.split()[2] for v in mecab.parse(sentence).splitlines()
              if (len(v.split()) >= 4 and v.split()[3][:2] == '名詞')]
             for sentence in sentences
             ]
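The split-and-filter logic in this comprehension can be checked without MeCab by applying it to a hypothetical snippet of ChaSen-format output (tab-separated fields: surface, reading, base form, part of speech, …):

```python
# Hypothetical ChaSen-style output for one sentence (fields are tab-separated)
parsed = (
    '吾輩\tワガハイ\t吾輩\t名詞-代名詞-一般\t\t\n'
    'は\tハ\tは\t助詞-係助詞\t\t\n'
    '猫\tネコ\t猫\t名詞-一般\t\t\n'
    'で\tデ\tだ\t助動詞\t特殊・ダ\t連用形\n'
    'ある\tアル\tある\t助動詞\t五段・ラ行アル\t基本形\n'
    'EOS'
)

# Keep the base form (index 2) of tokens whose POS (index 3) starts with 名詞
nouns = [v.split()[2] for v in parsed.splitlines()
         if len(v.split()) >= 4 and v.split()[3][:2] == '名詞']
print(nouns)  # → ['吾輩', '猫']
```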

⑹ Generation of co-occurrence data

  • Co-occurrence data is a dictionary-type object that maps each co-occurring word pair to its frequency of occurrence.

import itertools
from collections import Counter

  • `itertools`: a module that collects iterator functions for efficient loop processing.
  • `Counter`: a class in `collections` for counting the number of occurrences of each element.
#Generate a sentence-based noun pair list
pair_list = [
             list(itertools.combinations(n, 2))
             for n in noun_list if len(n) >= 2
             ]

#Flattening the noun pair list
all_pairs = []
for u in pair_list:
    all_pairs.extend(u)

#Count the frequency of noun pairs
cnt_pairs = Counter(all_pairs)
  • For each sentence-based noun list with two or more words, `itertools.combinations()` generates every two-word combination, which is turned into a list with `list()` and stored in `pair_list`.
  • However, `pair_list` is still grouped by sentence, so it cannot be counted as it is. It is therefore flattened by appending each sentence's pairs to the new variable `all_pairs` with `extend()`.
  • Passing this to `Counter()` produces the **dictionary-type co-occurrence data** `cnt_pairs`.
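The whole combinations → flatten → count pipeline can be traced on a toy noun list (the words below are made up for illustration):

```python
import itertools
from collections import Counter

# Toy per-sentence noun lists (illustrative only)
toy_nouns = [['cat', 'name'], ['cat', 'human', 'name']]

# Two-word combinations within each sentence
pairs = [list(itertools.combinations(n, 2)) for n in toy_nouns if len(n) >= 2]

# Flatten the per-sentence lists into one list of pairs
flat = []
for p in pairs:
    flat.extend(p)

# Count how often each pair co-occurs
counts = Counter(flat)
print(counts[('cat', 'name')])  # → 2
```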


**3. Creation of drawing data**

import pandas as pd
import numpy as np

⑺ Narrowing down the co-occurrence data

  • Narrow down the elements to keep the drawing legible. Here, we generate a list of the top 50 pairs by frequency of appearance.
tops = sorted(
    cnt_pairs.items(), 
    key=lambda x: x[1], reverse=True
    )[:50]
  • This combines `sorted()` with a lambda expression, sorting the dictionary items by the element specified in `key=lambda`.
  • Here `x[1]` refers to the second element of each item, i.e. the frequency; `reverse=True` sorts in descending order, and the slice `[:50]` then takes the top 50 pairs.
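A minimal example of this sorting idiom on a toy Counter (made-up pairs and frequencies):

```python
from collections import Counter

toy_counts = Counter({('a', 'b'): 3, ('a', 'c'): 5, ('b', 'c'): 1})

# Sort the (pair, frequency) items by frequency, descending, and keep the top 2
top2 = sorted(toy_counts.items(), key=lambda x: x[1], reverse=True)[:2]
print(top2)  # → [(('a', 'c'), 5), (('a', 'b'), 3)]
```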

⑻ Weighted data generation

noun_1 = []
noun_2 = []
frequency = []

#Split the pairs and frequencies into columns
for n,f in tops:
    noun_1.append(n[0])
    noun_2.append(n[1])
    frequency.append(f)

#Creating a data frame
df = pd.DataFrame({'First noun': noun_1, 'Second noun': noun_2, 'Frequency': frequency})

#Setting weighted data
weighted_edges = np.array(df)
  • The co-occurrence data for the top 50 pairs is converted to a NumPy array to produce `weighted_edges` (the weighted edge data).
  • The figure below shows the data frame before the conversion to an array.

[Figure: data frame of the top-50 noun pairs before conversion to an array]
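This conversion can be sketched on made-up data (the column names and values below are illustrative); each row of the resulting array is one `(node, node, weight)` edge:

```python
import pandas as pd
import numpy as np

# Toy version of the data frame (made-up pairs and frequencies)
toy_df = pd.DataFrame({'First noun': ['cat', 'cat'],
                       'Second noun': ['name', 'human'],
                       'Frequency': [3, 2]})

# Each row becomes one (node, node, weight) edge for add_weighted_edges_from()
toy_edges = np.array(toy_df)
print(toy_edges.shape)  # → (2, 3)
```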

**4. Drawing a network diagram**

⑼ Importing the visualization libraries

import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline 
  • **NetworkX** is a Python package for creating and manipulating complex networks and graph structures.
  • In a network diagram, the vertices are called **nodes**, and the lines connecting them are called **edges**.
  • To display node labels in Japanese, it is necessary to install and import `japanize_matplotlib` as below and then specify a Japanese font.
#Module to make matplotlib support Japanese display
!pip install japanize-matplotlib
import japanize_matplotlib

⑽ Visualization with NetworkX

  • Drawing a network diagram with NetworkX takes three steps: ➀ create a graph object, ➁ load the data into it, and ➂ specify the nodes, edges, and other settings in matplotlib and draw.
  • The key point is `font_family="IPAexGothic"`: specifying a **Japanese font via font_family** is what makes the node labels display correctly in Japanese.
#Generating a graph object
G = nx.Graph()

#Reading weighted data
G.add_weighted_edges_from(weighted_edges)

#Drawing a network diagram
plt.figure(figsize=(10,10))
nx.draw_networkx(G,
                 node_shape = "s",
                 node_color = "c", 
                 node_size = 500,
                 edge_color = "gray", 
                 font_family = "IPAexGothic") #Font specification

plt.show()

[Figure: co-occurrence network diagram]


  • To present the mechanism of co-occurrence network analysis as one continuous flow, I deliberately set aside details such as stop words (words to be excluded) and the handling of compound words (for example, treating "individualism" as one word rather than "individual" plus "principle").
  • Also, for convenience, I divided the work into four stages: ➀ preparing the text data, ➁ creating the co-occurrence data, ➂ creating the drawing data, and ➃ drawing the network diagram. In general, however, the process is understood as three stages: ➊ preprocessing, ➋ analysis, and ➌ visualization.
  • Preprocessing (➊) in particular is, I think, the heart of natural language processing. In a script it may be incorporated as part of ➋, but in essence it is the question of how to extract the necessary words from the raw data. The analytical perspective and the criteria used for extraction appear directly in the results and affect their interpretation, so this is the stage that demands the most consideration, time, and energy.
