[Python] A spell that eliminates non-Japanese characters and symbols to create a Japanese plain-text corpus

Many Japanese natural-language libraries crash with an error the moment you feed them Hebrew or Korean. Here is a spell that is useful in such cases.

For example, janome, introduced at PyCon 2015, will die with an error if Korean is mixed into the input.

janome is a wonderful morphological analyzer that saves you the trouble of installing MeCab, but even a single non-Japanese character is enough to crash it.

Example: text taken from the language-switching bar on the left of a Wikipedia page


from janome.tokenizer import Tokenizer

text = "他言語版Italiano한국어Polski Simple English"
t = Tokenizer()
for token in t.tokenize(text):
    print(token)

---------------
Traceback (most recent call last):
  File "tests.py", line 98, in <module>
    for token in t.tokenize(text):
  File "lib/python2.7/site-packages/janome/tokenizer.py", line 107, in tokenize
    pos += lattice.forward()
  File "lib/python2.7/site-packages/janome/lattice.py", line 124, in forward
    while not self.enodes[self.p]:
IndexError: list index out of range
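Before applying any filtering, it can help to see exactly which characters the tokenizer is choking on. The standard unicodedata module can name each one (a minimal sketch, using the Hangul portion of the example string):

```python
import unicodedata

text = "Italiano한국어Polski"
for ch in text:
    # Print each character's official Unicode name; Hangul syllables
    # stand out immediately as the non-Japanese culprits.
    print(ch, unicodedata.name(ch, "UNKNOWN"))
```

Characters like 한 report names starting with "HANGUL SYLLABLE", making them easy to spot in mixed-script input.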

In such cases, this spell comes in handy.



import re
import nltk

def filter(text):
    """
    :param text: str
    :rtype: str
    """
    # Eliminate alphabets, half-width alphanumerics, and common symbols
    text = re.sub(r'[a-zA-Z0-9"\.,@]+', '', text)
    text = re.sub(r'[!"“#$%&()\*\+\-\.,\/:;<=>?@\[\\\]^_`{|}~]', '', text)
    # Eliminate line breaks and tabs
    text = re.sub(r'[\n\r\t]', '', text)

    # Eliminate non-Japanese characters (Korean, Chinese, Hebrew, etc.)
    # by keeping only runs of hiragana, katakana, and kanji
    jp_chartype_tokenizer = nltk.RegexpTokenizer(u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[ぁ-んァ-ンー\u4e00-\u9FFF]+)')
    text = "".join(jp_chartype_tokenizer.tokenize(text))
    return text


text = "他言語版Italiano한국어Polski Simple English"
text = filter(text)
t = Tokenizer()
for token in t.tokenize(text):
    print(token)


------------------
他	接頭詞,名詞接続,*,*,*,*,他,タ,タ
言語	名詞,一般,*,*,*,*,言語,ゲンゴ,ゲンゴ
版	名詞,接尾,一般,*,*,*,版,バン,バン
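Incidentally, if you would rather not depend on nltk, the same keep-only-Japanese filtering can be approximated with the standard re module alone (a sketch; the character ranges are my choice, covering hiragana, katakana with the long-vowel mark, and the CJK unified ideographs):

```python
import re

# Match anything that is NOT hiragana (U+3041-U+309F),
# katakana incl. the long-vowel mark (U+30A0-U+30FF),
# or a CJK unified ideograph (U+4E00-U+9FFF).
_non_japanese = re.compile(r'[^\u3041-\u309F\u30A0-\u30FF\u4E00-\u9FFF]')

def keep_japanese(text):
    return _non_japanese.sub('', text)

print(keep_japanese("他言語版Italiano한국어Polski Simple English"))  # 他言語版
```

This drops punctuation and whitespace as well, so use the nltk version above if you need finer control over which symbol classes survive.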
