I calculated "Levenshtein distance" using Python

What is the Levenshtein distance?

** Levenshtein distance ** indicates how different the two strings are. Also known as the edit distance. A single character ** Insert / Delete / Replace ** defines the minimum number of steps required to transform one string into the other. Reference: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%AC%E3%83%BC%E3%83%99%E3%83%B3%E3%82%B7%E3 % 83% A5% E3% 82% BF% E3% 82% A4% E3% 83% B3% E8% B7% 9D% E9% 9B% A2)

Preparation

The environment uses Google Colaboratory. The Python version is below.

import platform
print("python " + platform.python_version())
# python 3.6.9

In addition, the library for calculating the Levenshtein distance must be installed in advance.

pip install python-Levenshtein

Let's calculate the Levenshtein distance

Now let's write the code. First, import the library for calculating the Levenshtein distance.

import Levenshtein

As a trial, let's calculate the Levenshtein distance between "Rievenstein" and "Levenshtein".

#Calculate the Levenshtein distance
Levenshtein.distance('Rievenstein', 'Levenshtein')
# 3

The result was 3. this is,

-** Delete ** (** R ** ievenstein → ievenstein) -** Replace ** (** i ** evenstein → ** L ** evenstein) -** Insert ** (Levenstein → Levens h </ strong> tein)

This is because I performed the operation three times.

You can check what kind of operation you performed below.

Levenshtein.editops('Rievenstein', 'Levenshtein')
# [('delete', 0, 0), ('replace', 1, 0), ('insert', 7, 6)]

The content displayed here is (Operation performed, what character in the first string, what character in the second string) is. In other words

--Removed the 0th letter (R) of Rievenstein --Replaced the first letter (i) of Rievenstein with the 0th letter (L) of Levenshtein --Insert the 6th letter (h) of Levenshtein at the position of the 7th letter (t) of Rievenstein

It means that.

You can also calculate the similarity of how similar you are, not how many times you have performed the operation.

Levenshtein.ratio('Rievenstein', 'Levenshtein')
# 0.8181818181818182

If there are several strings, it is also possible to extract the string that is the median of them.

Levenshtein.median(['Rievenstein',
                    'Levenshtein',
                    'Revenshtein',
                    'Lievenstein',
                    'Levenshtain',
                    'Levennshtein'])
# 'Levenshtein'

Other edit distances

Hamming distance

The Hamming distance is the number of different characters in the corresponding positions in two strings of ** equal number of characters **. Reference: [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%9F%E3%83%B3%E3%82%B0%E8%B7%9D%E9 % 9B% A2)

This can also be calculated using the same library.

Levenshtein.hamming('Rievenstein', 'Levenshtein')
# 7

Hamming distance compares the strings of Rievenstein and Levenshtein only with each other in the same place. It will be as follows.

-** R ** ievenstein and ** L ** even shtein → different --R i </ strong> evenstein and L e </ strong> venshtein → different --Ri e </ strong> venstein and Le v </ strong> enshtein → different

  • … --Revenstei n </ strong> and Levenshtei n </ strong> → equal

Looking at it in this way, the Hamming distance is output as 7 because the 7 characters except the last "tein" are different.

The Hamming distance must be ** equal in number of characters. ** An error will occur if the number of characters is different.

Jaro Winkler Distance

There is also the Jaro-Winkler Distance, which measures the partial match between two strings. The closer it is to 1, the more similar it is, and the closer it is to 0, the more different it is. It seems to be used to judge spelling mistakes.

Levenshtein.jaro_winkler('Rievenstein', 'Levenshtein')
# 0.8787878787878789

Summary

I used Python to calculate the Levenshtein distance. I also calculated the Hamming distance and the Jaro Winkler distance as other editing distances. By all means, please calculate the editing distance of various similar character strings and calculate the similarity.

Please refer to the following for the details of the library used this time. https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html#Levenshtein-median

Recommended Posts