Introducing mojimoji, a Python library for fast conversion of Japanese text between half-width and full-width characters
mojimoji performs full-width/half-width conversion in Python at high speed. It uses a previously introduced technique: the conversion is done internally with Cython and C++'s unordered_map, so it runs considerably faster than conventional implementations.
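The table-lookup idea can be illustrated in pure Python. The following is only a sketch, not mojimoji's actual code: it assumes Python 3, uses a plain dict in place of C++'s unordered_map, and includes just a handful of katakana entries. The names ZEN_TO_HAN and zen_to_han_sketch are made up for this example.

```python
# Pure-Python sketch of table-driven full-width -> half-width conversion.
# Illustration only: a dict stands in for the C++ unordered_map.

# Full-width ASCII variants occupy U+FF01..U+FF5E, exactly 0xFEE0 above
# their ASCII counterparts U+0021..U+007E.
ZEN_TO_HAN = {chr(cp + 0xFEE0): chr(cp) for cp in range(0x21, 0x7F)}

# A few katakana entries (deliberately incomplete). A voiced kana such as
# "ガ" expands to two half-width characters: the base kana plus a voicing mark.
ZEN_TO_HAN.update({"ア": "ｱ", "イ": "ｲ", "ウ": "ｳ", "ガ": "ｶﾞ"})


def zen_to_han_sketch(text):
    """Replace each character found in the table; pass others through."""
    return "".join(ZEN_TO_HAN.get(ch, ch) for ch in text)


print(zen_to_han_sketch("アイウａｂｃ０１２"))
```

Each character costs one hash lookup, which is why moving the same loop and table into compiled code can pay off.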
% pip install mojimoji
mojimoji provides two functions, han_to_zen and zen_to_han. Each accepts the keyword arguments kana, digit, and ascii; setting one of them to False disables conversion of katakana, digits, or alphabetic (ASCII) characters, respectively.
>>> import mojimoji
>>> print mojimoji.zen_to_han(u'Aiu abc012')
Aiuu abc012
>>> print mojimoji.zen_to_han(u'Aiu abc012', kana=False)
Aiu abc012
>>> print mojimoji.zen_to_han(u'Aiu abc012', digit=False)
Aiuu abc012
>>> print mojimoji.zen_to_han(u'Aiu abc012', ascii=False)
Iwabc012
>>> import mojimoji
>>> print mojimoji.han_to_zen(u'Aiuu abc012')
Aiu abc012
>>> print mojimoji.han_to_zen(u'Aiuu abc012', kana=False)
Iwabc012
>>> print mojimoji.han_to_zen(u'Aiuu abc012', digit=False)
Aiu abc012
>>> print mojimoji.han_to_zen(u'Aiuu abc012', ascii=False)
Aiu abc012
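To make the effect of the keyword arguments concrete, here is a rough pure-Python sketch of how digit and ascii flags could gate a half-width to full-width conversion. It is not mojimoji's implementation: it assumes Python 3, leaves out katakana entirely, and han_to_zen_sketch is a hypothetical name used only here.

```python
import string

# Distance from an ASCII character to its full-width form
# (U+0021..U+007E map to U+FF01..U+FF5E).
OFFSET = 0xFEE0


def han_to_zen_sketch(text, digit=True, ascii=True):
    """Widen ASCII characters; each flag switches one character class off.

    The parameter name `ascii` mirrors mojimoji's keyword and shadows the
    Python 3 builtin of the same name inside this function only.
    """
    out = []
    for ch in text:
        if ch in string.digits:
            convert = digit          # digits are controlled by `digit`
        elif "!" <= ch <= "~":
            convert = ascii          # other printable ASCII by `ascii`
        else:
            convert = False          # everything else passes through
        out.append(chr(ord(ch) + OFFSET) if convert else ch)
    return "".join(out)


print(han_to_zen_sketch("abc012"))
print(han_to_zen_sketch("abc012", digit=False))
print(han_to_zen_sketch("abc012", ascii=False))
```

In this sketch digits are checked before the general ASCII range, so the two flags never overlap.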
Next, let's compare its speed with zenhan and jctconv, two other Python libraries that convert between half-width and full-width characters.
% pip install zenhan
% pip install jctconv
% ipython
In [1]: import mojimoji
In [2]: import zenhan
In [3]: import jctconv
In [4]: s = u'Io Eo 012345' * 10
In [5]: %time for n in range(1000000): mojimoji.zen_to_han(s)
CPU times: user 3.90 s, sys: 0.03 s, total: 3.93 s
Wall time: 3.97 s
In [6]: %time for n in range(1000000): zenhan.z2h(s)
CPU times: user 71.05 s, sys: 0.16 s, total: 71.22 s
Wall time: 71.45 s
In [7]: %time for n in range(1000000): jctconv.z2h(s)
CPU times: user 19.75 s, sys: 0.06 s, total: 19.81 s
Wall time: 19.86 s
In this measurement, mojimoji is about 18 times faster than zenhan, which is implemented in pure Python, and about 5 times faster than jctconv.
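The same kind of comparison can be scripted outside IPython with the standard timeit module. The harness below is a minimal sketch assuming Python 3. The names time_converter and translate_baseline are invented for this example, and the baseline is a simple str.translate stand-in so that the script runs without any of the three libraries installed.

```python
import timeit


def time_converter(func, text, number=10000):
    """Return the total seconds taken by `number` calls of func(text)."""
    return timeit.timeit(lambda: func(text), number=number)


# Stand-in converter: full-width ASCII -> ASCII via a translation table.
_TABLE = str.maketrans({chr(cp + 0xFEE0): chr(cp) for cp in range(0x21, 0x7F)})


def translate_baseline(text):
    return text.translate(_TABLE)


sample = "ＡＢＣ０１２" * 10
elapsed = time_converter(translate_baseline, sample, number=1000)
print("{:.4f} s for 1000 calls".format(elapsed))
```

With the libraries installed, passing mojimoji.zen_to_han, zenhan.z2h, or jctconv.z2h as func should allow a like-for-like comparison on the same input string.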