[PYTHON] Use CaboCha to automatically generate "IOB2 tag corpus" training data

https://gist.github.com/jpena930/0753edfd27e010503755ccfdaeb965bf

#coding: utf-8
from __future__ import print_function  # Only needed for Python 2
import MeCab
import CaboCha
import sys
import os


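# -f1: lattice output (one token per line); -n1: also output named-entity tags
cabocha = CaboCha.Parser("-f1 -n1")
# MeCab tagger; created here but not used in the rest of the script
m = MeCab.Tagger("-Ochasen")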

# Helper for reading the input file
class getWords():
    def readText(self, filename):
        # Read the whole file as UTF-8 text
        with open(filename, 'r', encoding='utf-8') as f:
            tText = f.read()
        return tText

#Usage: python training_generator <text file>


getText = getWords()
# Path given on the command line (despite the name, this is the input file;
# the output file name is derived from it below)
file_output = sys.argv[1]

text = getText.readText(file_output)

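# Parse the text; each token line of the lattice output ends with its NE tag
cabocha_text = cabocha.parseToString(text)
# Shorten the NE tag names to three-letter forms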
cabocha_text = cabocha_text.replace("B-ORGANIZATION", "B-ORG")
cabocha_text = cabocha_text.replace("I-ORGANIZATION", "I-ORG")
cabocha_text = cabocha_text.replace("B-ARTIFACT", "B-ART")
cabocha_text = cabocha_text.replace("I-ARTIFACT", "I-ART")
cabocha_text = cabocha_text.replace("B-LOCATION", "B-LOC")
cabocha_text = cabocha_text.replace("I-LOCATION", "I-LOC")
cabocha_text = cabocha_text.replace("B-DATE", "B-DAT")
cabocha_text = cabocha_text.replace("I-DATE", "I-DAT")
cabocha_text = cabocha_text.replace("B-TIME", "B-TIM")
cabocha_text = cabocha_text.replace("I-TIME", "I-TIM")
cabocha_text = cabocha_text.replace("B-PERSON", "B-PSN")
cabocha_text = cabocha_text.replace("I-PERSON", "I-PSN")
cabocha_text = cabocha_text.replace("B-MONEY", "B-MNY")
cabocha_text = cabocha_text.replace("I-MONEY", "I-MNY")
cabocha_text = cabocha_text.replace("B-PERCENT", "B-PNT")
cabocha_text = cabocha_text.replace("I-PERCENT", "I-PNT")


# Replace every comma with a tab so that the feature fields become separate columns
cabocha_text = cabocha_text.replace(",", "\t")

filename = file_output + '_generated.txt'

# Keep the token lines: skip chunk headers (lines starting with '*')
# and add a blank line after each sentence-final '。'
out_lines = []
for line in cabocha_text.splitlines():
    if not line.startswith('*'):
        out_lines.append(line)
    if line.startswith('。'):
        out_lines.append("")

# Drop the last two lines, normally the trailing "EOS" marker and the
# blank line added after the final '。'
out_lines = out_lines[:-2]

# Write everything in one pass; 'w' mode overwrites an existing file
with open(filename, 'w', encoding='utf-8') as f:
    for line in out_lines:
        print(line, file=f)
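
The filtering done by the script can also be sketched as a pure function that works on a string, which makes it possible to try the logic without CaboCha installed. In this sketch, `SHORT_TAGS`, `shorten_tag`, `lattice_to_iob2` and `SAMPLE` are names made up for illustration, and `SAMPLE` is hand-written text that imitates the lattice layout, not real CaboCha output. Unlike the script, it skips the `EOS` line by name, strips trailing blank lines instead of cutting a fixed number of lines, and shortens only the last column rather than replacing across the whole text.

```python
# Same tag names as the replace() calls in the script, as a lookup table.
SHORT_TAGS = {
    "ORGANIZATION": "ORG", "ARTIFACT": "ART", "LOCATION": "LOC",
    "DATE": "DAT", "TIME": "TIM", "PERSON": "PSN",
    "MONEY": "MNY", "PERCENT": "PNT",
}

# Hand-written imitation of lattice output (assumption, not real CaboCha output).
SAMPLE = (
    "* 0 1D 0/1 0.0\n"
    "太郎\t名詞,固有名詞,人名\tB-PERSON\n"
    "は\t助詞,係助詞,*\tO\n"
    "* 1 -1D 0/0 0.0\n"
    "走る\t動詞,自立,*\tO\n"
    "。\t記号,句点,*\tO\n"
    "* 0 -1D 0/0 0.0\n"
    "東京\t名詞,固有名詞,地域\tB-LOCATION\n"
    "。\t記号,句点,*\tO\n"
    "EOS\n"
)


def shorten_tag(tag):
    """Turn e.g. 'B-PERSON' into 'B-PSN'; leave 'O' and unknown tags unchanged."""
    prefix, sep, name = tag.partition("-")
    if sep and prefix in ("B", "I") and name in SHORT_TAGS:
        return prefix + "-" + SHORT_TAGS[name]
    return tag


def lattice_to_iob2(lattice):
    """Convert lattice-style text into tab-separated lines, one sentence per block."""
    out = []
    for line in lattice.splitlines():
        # Skip chunk headers and the end-of-sentence marker
        if line.startswith("*") or line == "EOS":
            continue
        cols = line.split("\t")
        cols[-1] = shorten_tag(cols[-1])
        # Split the comma-separated features into their own columns
        out.append("\t".join(cols).replace(",", "\t"))
        # A blank line after '。' separates sentences
        if line.startswith("。"):
            out.append("")
    # Remove blank lines left at the end
    while out and out[-1] == "":
        out.pop()
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(lattice_to_iob2(SAMPLE), end="")
```

Keeping the conversion separate from the file handling means the same function could be fed the result of `cabocha.parseToString(text)` in place of `SAMPLE`.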

Next Step: Fix tags to suit your needs
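
One possible way to do this adjustment is a small pass over the generated file that rewrites only the tag column. The sketch below assumes the tag is the last tab-separated field on each line and that the file is UTF-8; `retag_line`, `retag_file`, `rename` and `drop` are hypothetical names, and the mappings shown are only illustrations.

```python
def retag_line(line, rename=None, drop=()):
    """Return `line` with its last tab-separated field (the IOB2 tag) adjusted.

    rename: dict mapping a tag name to a new name, e.g. {"PSN": "PER"}
    drop:   tag names to turn into "O", e.g. ("MNY", "PNT")
    """
    rename = rename or {}
    if not line.strip():
        return line  # blank line marks a sentence boundary; keep it as is
    cols = line.split("\t")
    prefix, sep, name = cols[-1].partition("-")
    if sep and prefix in ("B", "I"):
        if name in drop:
            cols[-1] = "O"
        elif name in rename:
            # Keep the B-/I- prefix so the IOB2 structure is preserved
            cols[-1] = prefix + "-" + rename[name]
    return "\t".join(cols)


def retag_file(src, dst, rename=None, drop=()):
    """Apply retag_line to every line of `src` and write the result to `dst`."""
    # Assumes both files are UTF-8
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(retag_line(line.rstrip("\n"), rename, drop) + "\n")
```

For example, `retag_file("input.txt_generated.txt", "input.txt_retagged.txt", rename={"PSN": "PER"}, drop=("MNY", "PNT"))` would rename person tags and discard money and percent tags; both file names here are placeholders.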

Reference: http://qiita.com/Hironsan/items/326b66711eb4196aa9d4 https://github.com/Hironsan/IOB2Corpus
