Speed comparison of word segmentation in Python: janome, sudachi, ginza, mecab, fugashi, tinysegmenter

Speed comparison of morphological analysis in Python

- Only simple word segmentation is expected
- Only libraries that install quickly with pip
- Environment: docker pull python:3-slim

Comparison

Preparation

Reference: pip freeze

requirements.txt


blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.2
chardet==3.0.4
cymem==2.0.3
Cython==0.29.19
dartsclone==0.9.0
fugashi==0.2.3
ginza==3.1.2
idna==2.9
ja-ginza==3.1.0
ja-ginza-dict==3.1.0
Janome==0.3.10
mecab-python3==0.996.5
murmurhash==1.0.2
numpy==1.18.5
plac==1.1.3
preshed==3.0.2
requests==2.23.0
sortedcontainers==2.1.0
spacy==2.2.4
srsly==1.0.2
SudachiDict-core==20200330
SudachiPy==0.4.5
thinc==7.4.0
tinysegmenter==0.4
tqdm==4.46.1
unidic-lite==1.0.6
urllib3==1.25.9
wasabi==0.6.0
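
To compare a local environment against the pinned versions above, a small check along these lines may help. This is only a sketch: it assumes Python 3.8 or later (for importlib.metadata), the helper name installed_versions is made up for illustration, and the distribution names are copied from the list above.

```python
from importlib import metadata

# Distribution names as they appear in the pip freeze output above.
TOKENIZER_PACKAGES = [
    "Janome", "SudachiPy", "SudachiDict-core", "ginza",
    "mecab-python3", "fugashi", "unidic-lite", "tinysegmenter",
]


def installed_versions(packages=TOKENIZER_PACKAGES):
    """Return {package: version or None} (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Versions that differ from the list above are not necessarily a problem, but the timings may then differ as well.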

Code and execution results

ma.py


import time

from janome.tokenizer import Tokenizer
from sudachipy import dictionary
import spacy
from MeCab import Tagger as mecab
from fugashi import Tagger as fugashi
import tinysegmenter

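# Each use_* function tokenizes s once and returns the tokenizer,
# so that stopwatch() can pass it back in as `cache` on the next call.
def use_janome(s, cache=None):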
  t = Tokenizer() if cache is None else cache
  [token.surface for token in t.tokenize(s)]
  return t


def use_sudachi(s, cache=None):
  t = dictionary.Dictionary().create() if cache is None else cache
  [token.surface() for token in t.tokenize(s)]
  return t


def use_ginza(s, cache=None):
  t = spacy.load('ja_ginza') if cache is None else cache
  [token for token in t(s)]
  return t


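def use_mecab(s, cache=None):
  # `cache` is not used: a new Tagger is created on every call,
  # so the measured time includes initialization.
  t = mecab('-Owakati')
  [token for token in t.parse(s).split(' ')]
  return t


def use_fugashi(s, cache=None):
  # `cache` is not used here either.
  t = fugashi('-Owakati')
  t(s)
  return t


def use_tinysegmenter(s, cache=None):
  # `cache` is not used here either.
  t = tinysegmenter.TinySegmenter()
  t.tokenize(s)
  return t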


def stopwatch(func, times=100):
  # Opening passage of "Momotaro" by Ryunosuke Akutagawa (Aozora Bunko): https://www.aozora.gr.jp/cards/000879/card100.html
  s = 'Once upon a time, once upon a time, there was a large peach tree in the depths of a deep mountain. It may not be enough just to be big. The branches of this peach spread above the clouds, and the roots of this peach extended even to the land of Yomi at the bottom of the earth. Everything is heaven and earth | Around the time of Kaibyaku Hey, Izanagi's precious Mikoto is Huang Saitsu Hirasaka with eight thunders It is said that he struck the peach fruit on the gravel in order to dismiss it-the peach fruit of the gods, Kamiyo, was a branch of this tree.'
  time_s = time.perf_counter()
  cache = None
  for i in range(times):
    cache = func(s, cache)
  time_e = time.perf_counter()
  return time_e - time_s


def main():
  d = {
    'janome': stopwatch(use_janome), 
    'sudachi': stopwatch(use_sudachi), 
    'ginza': stopwatch(use_ginza), 
    'mecab': stopwatch(use_mecab), 
    'fugashi': stopwatch(use_fugashi), 
    'tinysegmenter': stopwatch(use_tinysegmenter)
    }
  d = sorted(d.items(), key=lambda x:x[1])
  print('\n'.join(map(str, d)))


if __name__ == '__main__':
  main()

Result


('mecab', 0.09558620000007068)
('fugashi', 0.1353556000003664)
('tinysegmenter', 0.655696199999511)
('janome', 3.070499899999959)
('sudachi', 5.18910169999981)
('ginza', 9.577376000000186)
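
To put these totals into perspective, the following sketch converts them into time per call and a ratio relative to the fastest entry. The numbers are copied from the result above; the function name summarize is made up for illustration. Note that each total includes whatever initialization the corresponding function performs inside the timed loop.

```python
# Timings copied from the result above (seconds for 100 calls).
results = {
    "mecab": 0.09558620000007068,
    "fugashi": 0.1353556000003664,
    "tinysegmenter": 0.655696199999511,
    "janome": 3.070499899999959,
    "sudachi": 5.18910169999981,
    "ginza": 9.577376000000186,
}
CALLS = 100  # default value of `times` in stopwatch()


def summarize(results, calls=CALLS):
    """Return (name, milliseconds per call, ratio to fastest), fastest first."""
    fastest = min(results.values())
    rows = []
    for name, total in sorted(results.items(), key=lambda kv: kv[1]):
        rows.append((name, total / calls * 1000, total / fastest))
    return rows


for name, ms_per_call, ratio in summarize(results):
    print(f"{name:14s} {ms_per_call:8.2f} ms/call  {ratio:6.1f}x")
```

Read this way, mecab takes a little under one millisecond per call, and ginza comes out at roughly 100 times that.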

Impressions

- MeCab is a native implementation rather than pure Python, so of course it is overwhelmingly fast. I knew that.
- I had expected libraries to get heavier in proportion to their convenience and accuracy.
- tinysegmenter is specialized for word segmentation, so it is fast even though it is implemented in Python.
- The speed of janome is as expected, and there is a clear gap between it and sudachi and ginza.
- This was my first time comparing speeds.
- At first I put the tokenizer initialization inside the loop, and the results were terrible.
- nagisa and UniDic2UD are difficult to install, so they were left out this time.
- (Thank you for the comment) I figured that installing MeCab would be troublesome if Windows is your main environment.
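
The remark about initialization inside the loop can be illustrated with a small harness that times construction and tokenization separately. This is only a sketch, not the code used for the measurements above: the names benchmark and WhitespaceTokenizer are made up, and the stand-in tokenizer exists only so that the example runs without any of the libraries installed.

```python
import time


def benchmark(make_tokenizer, tokenize, text, times=100):
    """Time tokenizer construction and tokenization separately."""
    t0 = time.perf_counter()
    tokenizer = make_tokenizer()  # built once, outside the loop
    t1 = time.perf_counter()
    for _ in range(times):
        tokenize(tokenizer, text)
    t2 = time.perf_counter()
    return {"init": t1 - t0, "tokenize": t2 - t1}


class WhitespaceTokenizer:
    """Stand-in tokenizer (hypothetical); it simply splits on whitespace."""

    def tokenize(self, s):
        return s.split()


if __name__ == "__main__":
    result = benchmark(WhitespaceTokenizer, lambda t, s: t.tokenize(s), "a b c")
    print(result)
    # With janome installed, the same harness could be called like this:
    # from janome.tokenizer import Tokenizer
    # benchmark(Tokenizer, lambda t, s: [tok.surface for tok in t.tokenize(s)], text)
```

Reporting the two numbers separately makes it easier to see whether a library is slow to load, slow to tokenize, or both.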

That's all.
