In language processing, text is usually split into tokens with a tokenizer such as MeCab. In this article, I introduce a method for computing the correspondence between the outputs (tokenizations) of different tokenizers, together with its implementation (tokenizations). For example, consider computing the correspondence between a sentencepiece tokenization and a BERT tokenization, as shown below, with a method that does not depend on either tokenizer's implementation.
# Tokenization
(a) BERT          : ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']

# Correspondence
a2b: [[1], [2, 3], [3], [4], [5], [6], [7]]
b2a: [[], [0], [1], [1, 2], [3], [4], [5], [6]]
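To read these mappings: a2b[i] lists the indices of the tokens in (b) that overlap token i of (a), and b2a is the reverse direction. A quick illustration with the lists above (plain Python, nothing library-specific):

>>> a = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
>>> b = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
>>> a2b = [[1], [2, 3], [3], [4], [5], [6], [7]]
>>> # Token 1 of (a) overlaps tokens 2 and 3 of (b).
>>> [b[j] for j in a2b[1]]
['ベル', 'トゥス']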
Looking at the example above, there are the following kinds of differences between the two tokenizations:

1. The token boundaries differ (e.g. '##ベルト' / '##ゥス' vs. 'ベル' / 'トゥス').
2. Control characters such as '##' and '▁' are inserted.
3. The text may be normalized differently by each tokenizer (lowercasing, Unicode normalization, accent removal, and so on).
If only difference 1 is present, it seems easy to handle: you can simply compare the two tokenizations character by character from the top. In fact, spacy.gold.align, which used to be provided in spaCy, compares tokenizations in exactly this way.
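For illustration, here is a minimal sketch of that idea in terms of character offsets, assuming both tokenizations concatenate to exactly the same string; the function name is mine, and this is not spaCy's actual code.

def align_same_text(tokens_a, tokens_b):
    # Character span (start, end) of each token in the concatenated text.
    def spans(tokens):
        out, pos = [], 0
        for t in tokens:
            out.append((pos, pos + len(t)))
            pos += len(t)
        return out
    spans_b = spans(tokens_b)
    # A token of the first tokenization maps to every token of the second
    # whose character span overlaps its own.
    return [[j for j, (bs, be) in enumerate(spans_b) if bs < ae and as_ < be]
            for as_, ae in spans(tokens_a)]

print(align_same_text(["今日", "は", "いい", "天気", "だ"],
                      ["今日は", "い", "い", "天気", "だ"]))
# -> [[0], [0], [1, 2], [3], [4]]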
However, as soon as differences 2 and 3 enter the picture, things get messy. If you are willing to depend on the implementation of each tokenizer, you could compute the correspondence by, for example, stripping out the control characters, but implementing that for every combination of tokenizers sounds painful.
spacy-transformers deals with this problem by adopting an approach that simply ignores everything except ASCII characters (/blob/88814f5f4be7f0d4c784d8500c558d9ba06b9a56/spacy_transformers/_tokenizers.py#L539). It seems to work reasonably well for English, but it hardly works at all for Japanese.
So the problem to be solved this time is: compute the correspondence between tokenizations that differ in ways 1 to 3 above.
Various normalizations are used in language processing, for example:

- Unicode normalization: NFC, NFD, NFKC, NFKD
- Lowercasing
- Accent removal

These are often used in combination rather than on their own. For example, the BERT multilingual model uses lowercasing + NFKD + accent removal.
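As a rough illustration of such a combination, here is a sketch in the spirit of that lowercase + NFKD + accent-removal pipeline (this is not BERT's actual code, and the helper name is mine):

import unicodedata

def bert_like_normalize(text: str) -> str:
    # Lowercase, decompose with NFKD, then drop combining marks (accent removal).
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(bert_like_normalize("Águila"))  # -> "aguila"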
Let the two tokenizations be A and B. For example, A = ["今日", "は", "いい", "天気", "だ"]. The correspondence can be calculated as follows.

1. Normalize each token of A and B (NFKD normalization, lowercasing, and so on).
2. Concatenate the tokens of A and the tokens of B to make two strings Sa and Sb (example: Sa = "今日はいい天気だ").
3. Compute the shortest path on the edit graph of Sa and Sb (in other words, take a character-level diff of the two strings).
4. From the shortest path, recover the character correspondence between Sa and Sb.
5. From the character correspondence, compute the token correspondence between A and B.
In short: normalize appropriately, take a character-level diff, and use its matching parts (the complement of the edits) to get the character correspondence, from which the token correspondence follows. The key is step 3, which can be computed with the same dynamic programming as edit distance, and with Myers' algorithm, for example, it can be done at low cost. NFKD is adopted in step 1 because it yields the smallest character set after normalization among the Unicode normalization forms; in other words, it maximizes the chance that characters from the two tokenizations match. For example, 'ブ' and 'フ' can be partially matched under NFKD, but not under NFKC. (An end-to-end sketch of the whole procedure follows the next snippet.)
>>> a = unicodedata.normalize("NFKD", "Fu")
>>> b = unicodedata.normalize("NFKD", "Bu")
>>> print(a in b)
True
>>> a = unicodedata.normalize("NFKC", "Fu")
>>> b = unicodedata.normalize("NFKC", "Bu")
>>> print(a in b)
False
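To make steps 1 to 5 concrete, here is a minimal end-to-end sketch in Python. This is not the library's actual implementation: difflib.SequenceMatcher stands in for Myers' diff algorithm, and the function name is mine.

import unicodedata
from difflib import SequenceMatcher

def get_alignments_sketch(tokens_a, tokens_b):
    # 1. Normalize each token (NFKD + lowercasing).
    norm_a = [unicodedata.normalize("NFKD", t).lower() for t in tokens_a]
    norm_b = [unicodedata.normalize("NFKD", t).lower() for t in tokens_b]

    # 2. Concatenate into Sa and Sb, remembering which token each character came from.
    def owners(tokens):
        return [i for i, t in enumerate(tokens) for _ in t]
    sa, sb = "".join(norm_a), "".join(norm_b)
    own_a, own_b = owners(norm_a), owners(norm_b)

    # 3-4. Diff the two character sequences; every matched character pair is a
    #      correspondence between one character of Sa and one character of Sb.
    a2b = [set() for _ in tokens_a]
    b2a = [set() for _ in tokens_b]
    for i, j, size in SequenceMatcher(None, sa, sb, autojunk=False).get_matching_blocks():
        for k in range(size):
            # 5. The matched pair links token own_a[i+k] of A to token own_b[j+k] of B.
            a2b[own_a[i + k]].add(own_b[j + k])
            b2a[own_b[j + k]].add(own_a[i + k])
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]

tokens_a = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
tokens_b = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
print(get_alignments_sketch(tokens_a, tokens_b))
# -> ([[1], [2, 3], [3], [4], [5], [6], [7]], [[], [0], [1], [1, 2], [3], [4], [5], [6]])

Note that SequenceMatcher happens to reproduce the alignment from the beginning of the article on this example, but unlike Myers' algorithm it does not guarantee a shortest edit script in general.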
The actual implementation is available on GitHub: tamuhey/tokenizations. It is written in Rust, but Python bindings are also provided. The Python library can be used as follows.
$ pip install pytokenizations
>>> import tokenizations
>>> tokens_a = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
>>> tokens_b = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[1], [2, 3], [3], [4], [5], [6], [7]]
>>> print(b2a)
[[], [0], [1], [1, 2], [3], [4], [5], [6]]
The other day I released a language processing library called Camphr, and it uses pytokenizations heavily, namely to compute the correspondence between transformers and spaCy tokens. Thanks to this, the two libraries can be combined easily, and I don't have to write alignment code for each model. It's an unglamorous feature, but I think it is very useful in practice.
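As a tiny, hypothetical illustration of that kind of glue, here is how one might carry word-level labels onto BERT subwords via the alignment; the tokens and labels below are made up for this example.

import tokenizations

words    = ['フベルトゥスブルク', '条約', 'を', '締結']
labels   = ['B-EVENT', 'I-EVENT', 'O', 'O']   # one label per word
subwords = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']

# Map each word to the subwords it overlaps, then copy its label over.
w2s, _ = tokenizations.get_alignments(words, subwords)
subword_labels = ['O'] * len(subwords)
for i, label in enumerate(labels):
    for j in w2s[i]:
        subword_labels[j] = label
print(subword_labels)
# e.g. ['B-EVENT', 'B-EVENT', 'B-EVENT', 'B-EVENT', 'I-EVENT', 'O', 'O']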