3. Natural language processing with Python 4-1. Analysis for words with KWIC

⑴ Acquiring the corpus

❶ Import library

import re              # Regular expressions
import zipfile         # Reading zip archives
import urllib.request  # Downloading data from the web
import os.path         # Path manipulation
import glob            # Finding files by pattern

❷ Download the file and extract the text

def download(URL):
    # Download the zip file
    zip_file = re.split(r'/', URL)[-1]
    urllib.request.urlretrieve(URL, zip_file)
    extract_dir = os.path.splitext(zip_file)[0]

    # Unzip into a directory named after the archive
    with zipfile.ZipFile(zip_file) as zip_object:
        zip_object.extractall(extract_dir)

    os.remove(zip_file)

    # Return the path of the first extracted text file
    path = os.path.join(extract_dir, '*.txt')
    txt_files = glob.glob(path)
    return txt_files[0]


def convert(download_text):
    # Read the file (Aozora Bunko text files are Shift_JIS encoded)
    with open(download_text, 'rb') as f:
        data = f.read()
    text = data.decode('shift_jis')

    # Extract the body: drop the header, the colophon ("底本：") and everything after the first page break
    text = re.split(r'\-{5,}', text)[2]
    text = re.split(r'底本：', text)[0]
    text = re.split(r'［＃改ページ］', text)[0]

    # Remove noise
    text = re.sub(r'《.+?》', '', text)    # ruby readings
    text = re.sub(r'［＃.+?］', '', text)  # editor annotations
    text = re.sub(r'｜', '', text)         # ruby start markers
    text = re.sub(r'\r\n', '', text)       # line breaks
    text = re.sub(r'\u3000', '', text)     # full-width spaces
    text = re.sub(r'「', '', text)
    text = re.sub(r'」', '', text)
    text = re.sub(r'、', '', text)
    text = re.sub(r'。', '', text)

    return text


URL = 'https://www.aozora.gr.jp/cards/000081/files/43737_ruby_19028.zip'

download_file = download(URL)
text = convert(download_file)

print(text)
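
To check the noise removal without downloading anything, the same kind of substitutions can be tried on a short string. This is only a sketch: the sample sentence is made up to imitate Aozora Bunko markup (ruby readings in 《》, annotations in full-width ［＃…］, the ruby start marker ｜), and `strip_markup` is a hypothetical helper name, not part of the code above.

```python
import re

# Hypothetical sample imitating Aozora Bunko markup (not taken from the real file)
sample = '｜銀河《ぎんが》の\u3000お祭り［＃「お祭り」に傍点］です。\r\n「ジョバンニ、行こう。」'


def strip_markup(text):
    """Remove ruby readings, annotations, line breaks and punctuation."""
    text = re.sub(r'《.+?》', '', text)    # ruby readings
    text = re.sub(r'［＃.+?］', '', text)  # editor annotations (full-width brackets)
    # Ruby start markers, line breaks, full-width spaces, quotes and punctuation
    text = re.sub(r'[｜\r\n\u3000「」、。]', '', text)
    return text


print(strip_markup(sample))  # 銀河のお祭りですジョバンニ行こう
```

The brackets have to be the full-width ［ and ］. With ASCII brackets, a pattern such as `[#.+?]` is read as a character class and deletes every `#`, `.`, `+` and `?` instead of the annotation.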

⑵ Separation by morphological analysis

❶ Install MeCab

!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

❷ Split the text into words

import MeCab

mecab = MeCab.Tagger("-Owakati")
words = mecab.parse(text).split()

❸ Join the words with spaces

doc = ' '.join(words)
print(doc)

⑶ Running KWIC

❶ Tokenization with nltk

import nltk
nltk.download('punkt')

text_ = nltk.Text(nltk.word_tokenize(doc))

❷ KWIC format output

word = 'ジョバンニ'  # Giovanni; the keyword must be written as it appears in the Japanese text

#Create an instance and specify the input text
c = nltk.text.ConcordanceIndex(text_)

#Display KWIC format by keyword
c.print_concordance(word, width=40, lines=50)

print(c.offsets(word))
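
For intuition about what the concordance and the offsets represent, here is a minimal pure-Python sketch of the KWIC idea. It is not how nltk implements it; `kwic`, `window` and the token list are hypothetical names and data chosen for illustration. Each hit is reported with its token position and a few tokens of context on either side.

```python
def kwic(tokens, keyword, window=3):
    """Return (offset, left context, keyword, right context) for every hit."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            rows.append((i, left, token, right))
    return rows


# Hypothetical pre-tokenized sample (space-separated, like the wakati output)
tokens = 'ジョバンニ は 手 を あげ まし た それから ジョバンニ は 立ち まし た'.split()

for offset, left, key, right in kwic(tokens, 'ジョバンニ'):
    print(offset, '|', left, '[' + key + ']', right)
```

The first element of each row plays the same role as the positions returned by `c.offsets(word)`: it is the index of the keyword in the token sequence, which is useful for seeing where in the work the word clusters.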
