[Python] Calculating the correspondence between two tokenizations

In language processing, text is usually split into tokens with a tokenizer such as MeCab. This article introduces a method for computing the correspondence between the outputs of different tokenizers, together with its implementation (tokenizations). For example, the method below computes the correspondence between a SentencePiece tokenization and a BERT tokenization without depending on the implementation of either tokenizer.

# Tokenization
(a) BERT          : ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
(b) sentencepiece : ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']

# Correspondence
a2b: [[1], [2, 3], [3], [4], [5], [6], [7]]
b2a: [[], [0], [1], [1, 2], [3], [4], [5], [6]]

Problem

Looking at the example above, you can see the following differences between the two tokenizations.

  1. The token boundaries are cut differently
  2. The normalization differs (e.g. ブ → フ)
  3. Noise such as control characters may be inserted (e.g. '#', '▁')

If the only difference were 1, it would be easy to handle: you could simply compare the two tokenizations character by character from the top. In fact, spacy.gold.align, previously implemented in spaCy, compares tokenizations in exactly this way. However, as soon as 2 and 3 come into play, things get messy. If you are willing to depend on the implementation of each tokenizer, you could compute the correspondence after stripping out the control characters, but implementing that for every combination of tokenizers would be painful. spacy-transformers deals with this problem by simply ignoring everything except ASCII characters (spacy_transformers/_tokenizers.py#L539 at commit 88814f5). This works reasonably well for English, but hardly works at all for Japanese. The problem to be solved here, then, is to compute the correspondence between tokenizations that differ in all of the ways 1 to 3 above.
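To make point 1 concrete, below is a minimal sketch of that character-by-character comparison (my own illustration with a hypothetical naive_align helper, not spaCy's actual code). It assumes both tokenizations spell out exactly the same string, which is why it breaks down as soon as differences 2 or 3 appear.

def naive_align(tokens_a, tokens_b):
    # Map each character position of the concatenated string to its token index.
    def offsets(tokens):
        table = []
        for i, tok in enumerate(tokens):
            table.extend([i] * len(tok))
        return table

    pos_a, pos_b = offsets(tokens_a), offsets(tokens_b)
    if len(pos_a) != len(pos_b):
        raise ValueError("underlying strings differ; naive alignment breaks down")
    a2b = [[] for _ in tokens_a]
    for i, j in zip(pos_a, pos_b):
        if j not in a2b[i]:
            a2b[i].append(j)
    return a2b

# Works as long as only the token boundaries differ:
print(naive_align(["今日", "は", "いい天気"], ["今日は", "いい", "天気"]))
# [[0], [0], [1, 2]]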

Normalization

Various kinds of normalization are used in language processing, for example:

  - Unicode normalization: NFC, NFD, NFKC, NFKD
  - Lowercasing
  - Accent removal

These are often used not only on their own but also in combination. For example, the BERT multilingual model uses lowercasing + NFKD + accent removal.
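As a rough sketch of such a combination (the bert_like_normalize helper below is my own name, not an exact reproduction of BERT's tokenizer), lowercasing + NFKD + accent removal can be written with the standard unicodedata module:

import unicodedata

def bert_like_normalize(text: str) -> str:
    # Lowercase, decompose with NFKD, then drop combining marks
    # (accents, dakuten/handakuten, ...).
    text = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(bert_like_normalize("Hubertusburg"))  # hubertusburg
print(bert_like_normalize("ブルク"))         # フルク (the dakuten is stripped)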

How to compute the correspondence

Let the two tokenizations be `A` and `B`, for example `A = ["今日", "は", "いい", "天気", "だ"]`. The correspondence can be computed as follows.

  1. Normalize each token with NFKD and lowercase it
  2. Concatenate the tokens of `A` and `B` into two strings `Sa` and `Sb` (e.g. `Sa = "今日はいい天気だ"`)
  3. Compute the shortest path on the edit graph of `Sa` and `Sb`
  4. Trace the shortest path to obtain the character-level correspondence between `Sa` and `Sb`
  5. Compute the token-level correspondence from the character-level correspondence

In short, after normalizing appropriately, we take the opposite of a diff (the characters that match) to obtain the character-level correspondence, and from that compute the token-level correspondence. The key is step 3, which can be computed in the same way as the edit-distance DP and, with Myers' algorithm for example, at low cost. NFKD is used in step 1 because, among the Unicode normalization forms, it produces the smallest character set after normalization; in other words, it maximizes the chance that characters can be matched. For example, 'フ' and 'ブ' can be partially matched under NFKD, but not under NFKC:

>>> import unicodedata
>>> a = unicodedata.normalize("NFKD", "フ")
>>> b = unicodedata.normalize("NFKD", "ブ")
>>> print(a in b)
True
>>> a = unicodedata.normalize("NFKC", "フ")
>>> b = unicodedata.normalize("NFKC", "ブ")
>>> print(a in b)
False
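Putting the five steps together, the whole procedure can be sketched roughly as follows. This is only an illustration of the idea: the rough_align function is a hypothetical name, and it uses difflib.SequenceMatcher from the standard library (a different matching algorithm) in place of the shortest path on the edit graph and Myers' algorithm; it is not the implementation introduced below.

import unicodedata
from difflib import SequenceMatcher

def normalize(token):
    # Step 1: NFKD + lowercase
    return unicodedata.normalize("NFKD", token.lower())

def char_to_token(tokens):
    # Map each character of the concatenated, normalized string
    # back to the index of the token it came from.
    mapping = []
    for i, tok in enumerate(tokens):
        mapping.extend([i] * len(normalize(tok)))
    return mapping

def rough_align(tokens_a, tokens_b):
    # Step 2: concatenate the normalized tokens.
    sa = "".join(normalize(t) for t in tokens_a)
    sb = "".join(normalize(t) for t in tokens_b)
    map_a, map_b = char_to_token(tokens_a), char_to_token(tokens_b)
    a2b = [set() for _ in tokens_a]
    b2a = [set() for _ in tokens_b]
    # Steps 3-4: the matching blocks play the role of the diagonal
    # moves on the edit graph and give the character correspondence.
    for block in SequenceMatcher(None, sa, sb, autojunk=False).get_matching_blocks():
        for k in range(block.size):
            i, j = map_a[block.a + k], map_b[block.b + k]
            # Step 5: lift each character pair to a token pair.
            a2b[i].add(j)
            b2a[j].add(i)
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]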

Implementation

The implementation is available on GitHub: tamuhey/tokenizations

It is written in Rust, but Python bindings are also provided. The Python library can be used as follows:

$ pip install pytokenizations
>>> import tokenizations
>>> tokens_a = ['フ', '##ヘルト', '##ゥス', '##フルク', '条約', 'を', '締結']
>>> tokens_b = ['▁', 'フ', 'ベル', 'トゥス', 'ブルク', '条約', 'を', '締結']
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[1], [2, 3], [3], [4], [5], [6], [7]]
>>> print(b2a)
[[], [0], [1], [1, 2], [3], [4], [5], [6]]
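One typical way to use these tables (my own example; only get_alignments above is the library's API) is to project a token span from one tokenization onto the other:

>>> def project_span(span, a2b):
...     # Hypothetical helper: map a [start, end) token span of one
...     # tokenization onto the covering [start, end) span of the other.
...     mapped = [j for i in range(*span) for j in a2b[i]]
...     return (min(mapped), max(mapped) + 1) if mapped else None
...
>>> # BERT tokens 1-3 ('##ヘルト', '##ゥス', '##フルク') cover
>>> # sentencepiece tokens 2-4 ('ベル', 'トゥス', 'ブルク').
>>> project_span((1, 4), a2b)
(2, 5)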

In closing

The other day I released a language processing library called Camphr, and it uses pytokenizations heavily to compute the correspondence between transformers and spaCy tokenizations. Thanks to this, the two libraries can be combined easily without writing alignment code for each model. It is an unglamorous feature, but I think it is very useful in practice.
