[PYTHON] Word segmentation with Hugging Face's BERT Japanese Tokenizer fails at MeCab initialization, and then again at encode

I'm writing this down as a note so that anyone who stumbles on the same errors can spend less time searching for a fix.

Everything below was run on Google Colab.

Install MeCab and Hugging Face's transformers on Colab by following the steps here.

!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3
!pip install transformers

Try to segment the sentence 自然言語処理はとても楽しい。 ("Natural language processing is a lot of fun.") with the tokenizer for Japanese BERT.

from transformers.tokenization_bert_japanese import BertJapaneseTokenizer

# Declare the tokenizer for Japanese BERT
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

text = "Natural language processing is a lot of fun."

wakati_ids = tokenizer.encode(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
print(wakati_ids)

I got the following error.

----------------------------------------------------------

Failed initializing MeCab. Please see the README for possible solutions:

    https://github.com/SamuraiT/mecab-python3#common-issues

If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:

    https://github.com/SamuraiT/mecab-python3/issues

You don't have to write the issue in English.

------------------- ERROR DETAILS ------------------------
arguments: 
error message: [ifs] no such file or directory: /usr/local/etc/mecabrc
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-f828f6470517> in <module>()
      2 
      3 # Declare the tokenizer for Japanese BERT
----> 4 tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
      5 
      6 text = "自然言語処理はとても楽しい。"

4 frames
/usr/local/lib/python3.6/dist-packages/MeCab/__init__.py in __init__(self, rawargs)
    122 
    123         try:
--> 124             super(Tagger, self).__init__(args)
    125         except RuntimeError:
    126             error_info(rawargs)

RuntimeError: 

The error output kindly tells us to look here, so I followed the instructions at that URL for installing mecab-python3. Running

pip install unidic-lite

as well makes MeCab initialization stop failing. But this time, `encode` complained with `ValueError: too many values to unpack (expected 2)`.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-f828f6470517> in <module>()
      6 text = "自然言語処理はとても楽しい。"
      7 
----> 8 wakati_ids = tokenizer.encode(text, return_tensors='pt')
      9 print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
     10 print(wakati_ids)

8 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_bert_japanese.py in tokenize(self, text, never_split, **kwargs)
    205                 break
    206 
--> 207             token, _ = line.split("\t")
    208             token_start = text.index(token, cursor)
    209             token_end = token_start + len(token)

ValueError: too many values to unpack (expected 2)
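
The traceback shows what goes wrong: transformers' tokenization_bert_japanese.py unpacks every line of MeCab output into exactly two tab-separated fields with token, _ = line.split("\t"), which matches ipadic's surface-plus-features format. The dictionary that ships with newer mecab-python3 apparently emits more tab-separated fields per line, so the two-element unpack fails. Here is a minimal sketch of the mechanics (the unidic field layout below is a hypothetical illustration, not the exact format):

# ipadic-style line: surface \t features -- exactly one tab, so the
# two-element unpack in tokenization_bert_japanese.py succeeds.
line_ipadic = "自然\t名詞,一般,*,*,*,*,自然,シゼン,シゼン"
token, _ = line_ipadic.split("\t")

# unidic-style line (hypothetical layout): several tab-separated fields,
# so the same unpack raises ValueError: too many values to unpack (expected 2).
line_unidic = "自然\tシゼン\tシゼン\t自然\t名詞-普通名詞-一般"
token, _ = line_unidic.split("\t")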

Regarding this error, as noted here by someone who appears to be the developer of mecab-python3, it is solved by pinning mecab-python3 to version 0.996.5.

In summary, if you do the pip installs as follows from the start, you should not hit either error.

!apt install aptitude swig
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.996.5
!pip install unidic-lite
!pip install transformers
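
Before touching transformers again, a quick sanity check (a minimal sketch; it assumes the pinned install above completed cleanly) confirms that MeCab itself now initializes:

import MeCab

# If constructing the Tagger and parsing succeed without raising,
# MeCab initialization is fixed and BertJapaneseTokenizer can build
# its own Tagger the same way.
tagger = MeCab.Tagger()
print(tagger.parse("自然言語処理はとても楽しい。"))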

If you had already installed the latest mecab-python3 with pip before running the commands above, don't forget to restart your Colab session once. You can terminate the session from the session manager, which opens when you click the ▼ next to the RAM/disk indicators at the top right of the Colab screen.
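
Alternatively, you can force the restart from a cell instead of the UI (a common Colab trick, not part of the steps above: killing the Python process makes Colab start a fresh runtime):

import os

# Kill the current Python process; Colab notices the crash and spins up
# a new runtime, so the freshly pinned packages are imported cleanly.
os.kill(os.getpid(), 9)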

from transformers.tokenization_bert_japanese import BertJapaneseTokenizer

# Declare the tokenizer for Japanese BERT
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

text = "Natural language processing is a lot of fun."

wakati_ids = tokenizer.encode(text, return_tensors='pt')
print(tokenizer.convert_ids_to_tokens(wakati_ids[0].tolist()))
print(wakati_ids)

#Downloading: 100%
#258k/258k [00:00<00:00, 1.58MB/s]
#
#['[CLS]', '自然', '言語', '処理', 'は', 'とても', '楽しい', '。', '[SEP]']
#tensor([[    2,  1757,  1882,  2762,     9,  8567, 19835,     8,     3]])

I was able to segment the text successfully with BertJapaneseTokenizer.
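
As a follow-up, if all you need is the plain segmentation, without special tokens or ID conversion, tokenizer.tokenize alone does the job (a small usage sketch under the same setup; the expected output is read off the encode result above):

# Surface segmentation only: no [CLS]/[SEP] and no vocabulary IDs.
print(tokenizer.tokenize("自然言語処理はとても楽しい。"))
# => ['自然', '言語', '処理', 'は', 'とても', '楽しい', '。']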

end
