[Python] A spell that eliminates non-Japanese characters and symbols to create a Japanese plain-text corpus

Many Japanese natural-language libraries crash with an error the moment you feed them Hebrew or Korean. Here is a spell that is useful in such cases.

For example, janome, introduced at PyCon 2015, will die with an error if Korean is mixed into the input.

janome is a wonderful morphological analyzer that saves you the trouble of installing MeCab, but even a single non-Japanese character is enough to crash it.

Example: text taken from the language-switching bar on the left of a Wikipedia page


from janome.tokenizer import Tokenizer

text = "他言語版Italiano한국어Polski Simple English"
t = Tokenizer()
for token in t.tokenize(text):
    print(token)

---------------
Traceback (most recent call last):
  File "tests.py", line 98, in <module>
    for token in t.tokenize(text):
  File "lib/python2.7/site-packages/janome/tokenizer.py", line 107, in tokenize
    pos += lattice.forward()
  File "lib/python2.7/site-packages/janome/lattice.py", line 124, in forward
    while not self.enodes[self.p]:
IndexError: list index out of range
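Before applying any filtering, it can help to see exactly which characters the tokenizer is choking on. The standard unicodedata module can name each one (a minimal sketch, using the Hangul portion of the example string):

```python
import unicodedata

text = "Italiano한국어Polski"
for ch in text:
    # Print each character's official Unicode name; Hangul syllables
    # stand out immediately as the non-Japanese culprits.
    print(ch, unicodedata.name(ch, "UNKNOWN"))
```

Characters like 한 report names starting with "HANGUL SYLLABLE", making them easy to spot in mixed-script input.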

In such cases, this spell comes in handy.



import re
import nltk

def filter(text):
    """
    :param text: str
    :rtype: str
    """
    # Eliminate alphabets, half-width alphanumerics, and common symbols
    text = re.sub(r'[a-zA-Z0-9"\.,@]+', '', text)
    text = re.sub(r'[!"“#$%&()\*\+\-\.,\/:;<=>?@\[\\\]^_`{|}~]', '', text)
    # Eliminate line breaks and tabs
    text = re.sub(r'[\n\r\t]', '', text)

    # Eliminate non-Japanese characters (Korean, Chinese, Hebrew, etc.)
    # by keeping only runs of hiragana, katakana, and kanji
    jp_chartype_tokenizer = nltk.RegexpTokenizer(u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[ぁ-んァ-ンー\u4e00-\u9FFF]+)')
    text = "".join(jp_chartype_tokenizer.tokenize(text))
    return text


text = "他言語版Italiano한국어Polski Simple English"
text = filter(text)
t = Tokenizer()
for token in t.tokenize(text):
    print(token)


------------------
他	接頭詞,名詞接続,*,*,*,*,他,タ,タ
言語	名詞,一般,*,*,*,*,言語,ゲンゴ,ゲンゴ
版	名詞,接尾,一般,*,*,*,版,バン,バン
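Incidentally, if you would rather not depend on nltk, the same keep-only-Japanese filtering can be approximated with the standard re module alone (a sketch; the character ranges are my choice, covering hiragana, katakana with the long-vowel mark, and the CJK unified ideographs):

```python
import re

# Match anything that is NOT hiragana (U+3041-U+309F),
# katakana incl. the long-vowel mark (U+30A0-U+30FF),
# or a CJK unified ideograph (U+4E00-U+9FFF).
_non_japanese = re.compile(r'[^\u3041-\u309F\u30A0-\u30FF\u4E00-\u9FFF]')

def keep_japanese(text):
    return _non_japanese.sub('', text)

print(keep_japanese("他言語版Italiano한국어Polski Simple English"))  # 他言語版
```

This drops punctuation and whitespace as well, so use the nltk version above if you need finer control over which symbol classes survive.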
