3. Natural language processing with Python 1-2. How to create a corpus: Aozora Bunko

** 1. Get the file and extract only the text **

⑴ Import the required modules

import re
import zipfile
import urllib.request
import os.path
import glob

⑵ Specify the URL of the file

Here, Kenji Miyazawa's "Night on the Galactic Railroad" is used as the material.

URL = 'https://www.aozora.gr.jp/cards/000081/files/43737_ruby_19028.zip'

⑶ Function to download and extract the zip file

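def download(URL):
    zip_file = re.split(r'/', URL)[-1] #➀ file name = last segment of the URL
    urllib.request.urlretrieve(URL, zip_file) #➁ download into the current directory
    out_dir = os.path.splitext(zip_file)[0] #➂ directory name = file name without ".zip"

    with zipfile.ZipFile(zip_file) as zip_object: #➃ open the archive
        zip_object.extractall(out_dir) #➄ extract everything into out_dir

    os.remove(zip_file) #➅ delete the zip file, which is no longer needed

    path = os.path.join(out_dir, '*.txt') #➆ pattern matching the extracted text files
    txt_files = glob.glob(path) #➇ list the matching paths
    return txt_files[0] #➈ return the first match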

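** 1) Download zip file **

➀ takes the last segment of the URL as the file name, ➁ saves the zip file under that name in the current directory, and ➂ strips the extension to get the name of the destination directory.

** 2) Unzip and save the zip file **

➃ opens the zip file and ➄ extracts its contents into that directory. ➅ then deletes the zip file, which is no longer needed.

** 3) Get the path of the saved file **

➆ builds a search pattern for the text files in the directory, ➇ collects the matching paths with glob, and ➈ returns the first one.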
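The three steps can be tried without network access. The following is a minimal sketch, not the article's own code: it derives the file and directory names from the URL exactly as ➀ and ➂ do, then runs the extract-and-glob logic on a stand-in zip file created in a temporary directory. The helper `extract_first_txt`, the file `sample.txt`, and its contents are hypothetical names introduced only for this illustration.

```python
import glob
import os
import re
import tempfile
import zipfile

URL = 'https://www.aozora.gr.jp/cards/000081/files/43737_ruby_19028.zip'

# Names derived from the URL, as in steps 1 and 3 of download()
zip_name = re.split(r'/', URL)[-1]
dir_name = os.path.splitext(zip_name)[0]


def extract_first_txt(zip_path, out_dir):
    """Hypothetical helper: extract zip_path into out_dir, return the first .txt path."""
    with zipfile.ZipFile(zip_path) as zip_object:
        zip_object.extractall(out_dir)
    return sorted(glob.glob(os.path.join(out_dir, '*.txt')))[0]


# Build a stand-in archive instead of downloading the real one
work = tempfile.mkdtemp()
zip_path = os.path.join(work, zip_name)
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('sample.txt', 'dummy body'.encode('shift_jis'))

txt_path = extract_first_txt(zip_path, os.path.join(work, dir_name))

print(zip_name)                    # 43737_ruby_19028.zip
print(dir_name)                    # 43737_ruby_19028
print(os.path.basename(txt_path))  # sample.txt
```

Sorting the glob result is a small addition made here so that the chosen file is deterministic if an archive happens to contain several text files.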

⑷ Function to read the file and extract the body text

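def convert(download_text):
    with open(download_text, 'rb') as f: #➀ read the file as raw bytes
        data = f.read()
    text = data.decode('shift_jis') #➁ Aozora Bunko text files are Shift_JIS encoded

    # Text extraction
    text = re.split(r'\-{5,}', text)[2] #➂ the body follows the second dashed rule
    text = re.split(r'底本：', text)[0] #➃ drop the colophon, which starts at "底本：" (source edition)
    text = re.split(r'［＃改ページ］', text)[0] #➄ cut at the first page-break annotation

    # Noise removal
    text = re.sub(r'《.+?》', '', text) #➅ ruby (reading) text
    text = re.sub(r'［＃.+?］', '', text) #➆ editorial annotations
    text = re.sub(r'｜', '', text) #➇ marker for the start of a ruby base
    text = re.sub(r'\r\n', '', text) #➈ line breaks
    text = re.sub(r'\u3000', '', text) #➉ full-width spaces

    return text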

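** 1) Read file **

➀ reads the downloaded file as bytes and ➁ decodes it as Shift_JIS.

** 2) Extracting the text with re.split() **

➂ splits the text at runs of five or more hyphens and keeps the third piece, which is everything after the header and the block of notation notes. ➃ discards everything from the colophon marker onward, and ➄ cuts the text at the first page-break annotation.

** 3) Noise removal (replacement) by re.sub() **

➅ removes ruby readings enclosed in 《 》, ➆ removes annotations of the form ［＃ ... ］, ➇ removes the vertical bar that marks where a ruby base starts, ➈ removes line breaks, and ➉ removes full-width spaces.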
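To see what each pattern does, it can help to run the same split and substitution steps on a tiny hand-made string. The sketch below assumes the usual Aozora Bunko notation with full-width characters (`底本：`, `［＃…］`, `｜`, `《…》`); the sample text and the function name `extract_body` are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical miniature of an Aozora Bunko file:
# header, notation notes between two dashed rules, body, colophon
sample = (
    'タイトル\r\n著者\r\n'
    '-------------------------------------------------------\r\n'
    '【テキスト中に現れる記号について】\r\n'
    '-------------------------------------------------------\r\n'
    '\u3000｜銀河《ぎんが》の話［＃「話」に傍点］です。\r\n'
    '底本：「サンプル」\r\n'
)


def extract_body(text):
    # Text extraction
    text = re.split(r'\-{5,}', text)[2]       # part after the second dashed rule
    text = re.split(r'底本：', text)[0]        # drop the colophon
    text = re.split(r'［＃改ページ］', text)[0]  # cut at the first page break, if any
    # Noise removal
    text = re.sub(r'《.+?》', '', text)        # ruby readings
    text = re.sub(r'［＃.+?］', '', text)      # annotations
    text = re.sub(r'｜', '', text)            # ruby base marker
    text = re.sub(r'\r\n', '', text)          # line breaks
    text = re.sub(r'\u3000', '', text)        # full-width spaces
    return text


body = extract_body(sample)
print(body)  # 銀河の話です。
```

Note that the brackets and the vertical bar must be the full-width characters. With ASCII `[` `]` the pattern becomes a character class, and an ASCII `|` on its own matches the empty string, so neither would remove the intended notation.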

⑸ Run the download and text extraction

download_file = download(URL)
text = convert(download_file)

print(text)


** 2. Word segmentation ("wakati-gaki") with MeCab **

⑹ Install MeCab and segment the text into words

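# Install MeCab and its Python binding (notebook shell commands, e.g. on Google Colab)
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

import MeCab

mecab = MeCab.Tagger("-Owakati")  # output the words separated by spaces
text = mecab.parse(text)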

print(text)


separated_text = text.split()
print(separated_text)
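The conversion from the space-separated string to a list can be checked without installing MeCab. In this minimal sketch the string `wakati` is a hand-written stand-in for MeCab's output, not actual output; the real segmentation depends on the dictionary in use.

```python
# Hand-written stand-in for wakati output (words separated by spaces,
# with a trailing space and newline)
wakati = '銀河 鉄道 の 夜 \n'

# split() with no argument splits on any whitespace and drops empty strings,
# so the trailing space and newline do not produce empty tokens
tokens = wakati.split()
print(tokens)  # ['銀河', '鉄道', 'の', '夜']
```

Calling `split(' ')` instead would leave the trailing newline in the list as a separate token, which is why the argument-free form is the safer choice here.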


** 3. Downloading the result to your local PC **

⑺ Write the text to a file and download it to the local PC

from google.colab import files

# Save the segmented text as UTF-8, then download it through the browser
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

files.download('output.txt')
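Outside Google Colab the `files.download` step is not needed, because the file is already written to the local disk. The following sketch, which uses a stand-in string and a temporary directory as assumptions, writes the text with an explicit UTF-8 encoding and reads it back to confirm that the Japanese text survives the round trip.

```python
import os
import tempfile

text = '銀河 鉄道 の 夜'  # stand-in for the segmented text

# Write to a temporary location so the example leaves the working directory untouched
out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with open(out_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the file back with the same encoding
with open(out_path, encoding='utf-8') as f:
    restored = f.read()

print(restored == text)  # True
```

Specifying the encoding explicitly avoids depending on the platform default, which is not UTF-8 on every system.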

