In language processing, text is usually split into tokens with a tokenizer such as MeCab. In this article, I introduce a method for computing the correspondence between the outputs (tokenizations) of different tokenizers, together with its implementation (tokenizations). For example, consider computing the correspondence between a sentencepiece tokenization and a BERT tokenization, as shown below, with a method that does not depend on either tokenizer's implementation.
# Tokenization
(a) BERT          : ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']

# Correspondence
a2b: [[1], [2, 3], [3], [4], [5], [6], [7]]
b2a: [[], [0], [1], [1, 2], [3], [4], [5], [6]]
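To read these mappings: a2b[i] lists the indices of the tokens in (b) that overlap token i of (a), and b2a is the reverse direction. A quick illustration with the lists above (plain Python, nothing library-specific):

>>> a = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
>>> b = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
>>> a2b = [[1], [2, 3], [3], [4], [5], [6], [7]]
>>> # Token 1 of (a) overlaps tokens 2 and 3 of (b).
>>> [b[j] for j in a2b[1]]
['ベル', 'トゥス']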
Looking at the example above, there are the following kinds of differences between the two tokenizations:

1. The token boundaries differ (e.g. '##ベルト' / '##ゥス' vs. 'ベル' / 'トゥス').
2. Control characters such as '##' and '▁' are inserted.
3. The text may be normalized differently by each tokenizer (lowercasing, Unicode normalization, accent removal, and so on).
If only difference 1 is present, it seems easy to handle: you can simply compare the two tokenizations character by character from the top. In fact, spacy.gold.align, which used to be provided in spaCy, compares tokenizations in exactly this way.
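For illustration, here is a minimal sketch of that idea in terms of character offsets, assuming both tokenizations concatenate to exactly the same string; the function name is mine, and this is not spaCy's actual code.

def align_same_text(tokens_a, tokens_b):
    # Character span (start, end) of each token in the concatenated text.
    def spans(tokens):
        out, pos = [], 0
        for t in tokens:
            out.append((pos, pos + len(t)))
            pos += len(t)
        return out
    spans_b = spans(tokens_b)
    # A token of the first tokenization maps to every token of the second
    # whose character span overlaps its own.
    return [[j for j, (bs, be) in enumerate(spans_b) if bs < ae and as_ < be]
            for as_, ae in spans(tokens_a)]

print(align_same_text(["今日", "は", "いい", "天気", "だ"],
                      ["今日は", "い", "い", "天気", "だ"]))
# -> [[0], [0], [1, 2], [3], [4]]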
However, as soon as differences 2 and 3 enter the picture, things get messy. If you are willing to depend on the implementation of each tokenizer, you could compute the correspondence by, for example, stripping out the control characters, but implementing that for every combination of tokenizers sounds painful.
spacy-transformers deals with this problem by adopting an approach that simply ignores everything except ASCII characters (/blob/88814f5f4be7f0d4c784d8500c558d9ba06b9a56/spacy_transformers/_tokenizers.py#L539). It seems to work reasonably well for English, but it hardly works at all for Japanese.
So the problem to be solved this time is: compute the correspondence between tokenizations that differ in ways 1 to 3 above.
Various normalizations are used in language processing, for example:

- Unicode normalization: NFC, NFD, NFKC, NFKD
- Lowercasing
- Accent removal

These are often used in combination rather than on their own. For example, the BERT multilingual model uses lowercasing + NFKD + accent removal.
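As a rough illustration of such a combination, here is a sketch in the spirit of that lowercase + NFKD + accent-removal pipeline (this is not BERT's actual code, and the helper name is mine):

import unicodedata

def bert_like_normalize(text: str) -> str:
    # Lowercase, decompose with NFKD, then drop combining marks (accent removal).
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(bert_like_normalize("Águila"))  # -> "aguila"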
Let the two tokenizations be A and B. For example, A = ["今日", "は", "いい", "天気", "だ"]. The correspondence can be calculated as follows.

1. Normalize each token of A and B (NFKD normalization, lowercasing, and so on).
2. Concatenate the tokens of A and the tokens of B to make two strings Sa and Sb (example: Sa = "今日はいい天気だ").
3. Compute the shortest path on the edit graph of Sa and Sb (in other words, take a character-level diff of the two strings).
4. From the shortest path, recover the character correspondence between Sa and Sb.
5. From the character correspondence, compute the token correspondence between A and B.
In short: normalize appropriately, take a character-level diff, and use its matching parts (the complement of the edits) to get the character correspondence, from which the token correspondence follows. The key is step 3, which can be computed with the same dynamic programming as edit distance, and with Myers' algorithm, for example, it can be done at low cost. NFKD is adopted in step 1 because it yields the smallest character set after normalization among the Unicode normalization forms; in other words, it maximizes the chance that characters from the two tokenizations match. For example, 'ブ' and 'フ' can be partially matched under NFKD, but not under NFKC. (An end-to-end sketch of the whole procedure follows the next snippet.)
>>> a = unicodedata.normalize("NFKD", "Fu")
>>> b = unicodedata.normalize("NFKD", "Bu")
>>> print(a in b)
True
>>> a = unicodedata.normalize("NFKC", "Fu")
>>> b = unicodedata.normalize("NFKC", "Bu")
>>> print(a in b)
False
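To make steps 1 to 5 concrete, here is a minimal end-to-end sketch in Python. This is not the library's actual implementation: difflib.SequenceMatcher stands in for Myers' diff algorithm, and the function name is mine.

import unicodedata
from difflib import SequenceMatcher

def get_alignments_sketch(tokens_a, tokens_b):
    # 1. Normalize each token (NFKD + lowercasing).
    norm_a = [unicodedata.normalize("NFKD", t).lower() for t in tokens_a]
    norm_b = [unicodedata.normalize("NFKD", t).lower() for t in tokens_b]

    # 2. Concatenate into Sa and Sb, remembering which token each character came from.
    def owners(tokens):
        return [i for i, t in enumerate(tokens) for _ in t]
    sa, sb = "".join(norm_a), "".join(norm_b)
    own_a, own_b = owners(norm_a), owners(norm_b)

    # 3-4. Diff the two character sequences; every matched character pair is a
    #      correspondence between one character of Sa and one character of Sb.
    a2b = [set() for _ in tokens_a]
    b2a = [set() for _ in tokens_b]
    for i, j, size in SequenceMatcher(None, sa, sb, autojunk=False).get_matching_blocks():
        for k in range(size):
            # 5. The matched pair links token own_a[i+k] of A to token own_b[j+k] of B.
            a2b[own_a[i + k]].add(own_b[j + k])
            b2a[own_b[j + k]].add(own_a[i + k])
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]

tokens_a = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
tokens_b = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
print(get_alignments_sketch(tokens_a, tokens_b))
# -> ([[1], [2, 3], [3], [4], [5], [6], [7]], [[], [0], [1], [1, 2], [3], [4], [5], [6]])

Note that SequenceMatcher happens to reproduce the alignment from the beginning of the article on this example, but unlike Myers' algorithm it does not guarantee a shortest edit script in general.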
The actual implementation is available on GitHub: tamuhey/tokenizations. It is written in Rust, but Python bindings are also provided. The Python library can be used as follows.
$ pip install pytokenizations
>>> import tokenizations
>>> tokens_a = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']
>>> tokens_b = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[1], [2, 3], [3], [4], [5], [6], [7]]
>>> print(b2a)
[[], [0], [1], [1, 2], [3], [4], [5], [6]]
The other day I released a language processing library called Camphr, and it uses pytokenizations heavily, namely to compute the correspondence between transformers and spaCy tokens. Thanks to this, the two libraries can be combined easily, and I don't have to write alignment code for each model. It's an unglamorous feature, but I think it is very useful in practice.
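As a tiny, hypothetical illustration of that kind of glue, here is how one might carry word-level labels onto BERT subwords via the alignment; the tokens and labels below are made up for this example.

import tokenizations

words    = ['フベルトゥスブルク', '条約', 'を', '締結']
labels   = ['B-EVENT', 'I-EVENT', 'O', 'O']   # one label per word
subwords = ['フ', '##ベルト', '##ゥス', '##ブルク', '条約', 'を', '締結']

# Map each word to the subwords it overlaps, then copy its label over.
w2s, _ = tokenizations.get_alignments(words, subwords)
subword_labels = ['O'] * len(subwords)
for i, label in enumerate(labels):
    for j in w2s[i]:
        subword_labels[j] = label
print(subword_labels)
# e.g. ['B-EVENT', 'B-EVENT', 'B-EVENT', 'B-EVENT', 'I-EVENT', 'O', 'O']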