The edit distance (Levenshtein distance) is the number of steps it takes to change one string to another. That step is ・ Delete one character ・ Insert any single character -Replace one character with another Is one of
Let's first understand the Levenshtein distance algorithm
For example, in the case of the strings cafe
and coffee
cafe→caffe→coffe→coffee
Something like this
Let's make a table first
c | o | f | f | e | e | ||
---|---|---|---|---|---|---|---|
c | |||||||
a | |||||||
f | |||||||
e |
Let's enter the initial value When the target string is empty (the target string will be increased little by little), of course, the Levenshtein distance is equal to the length of the string. So
c | o | f | f | e | e | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
c | 1 | * | |||||
a | 2 | ||||||
f | 3 | ||||||
e | 4 |
The production algorithm starts from "*****" in (3,3) All remaining cells will be the ** minimum ** below this • If the top character of the column in this cell is the same as the leftmost character in the row in this cell, then the value in this cell is equal to ** the value in the upper left cell **, so Otherwise ** the value in the upper left cell + 1 ** ・ Value in the left cell +1 ・ Value of the upper cell +1
Here, in the case of (3,3) ・ Since the top character "c" in the column at (3,3) and the leftmost character "c" in the row are the same, it becomes 0. ・ Because it becomes (2,3), it becomes 1. ・ Because it becomes (3,2), it becomes 1. (3,3) becomes 0 because it takes the minimum of the above three values.
c | o | f | f | e | e | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
c | 1 | 0 | |||||
a | 2 | ||||||
f | 3 | ||||||
e | 4 |
Fill all cells like this
c | o | f | f | e | e | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
c | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
a | 2 | 1 | 1 | 2 | 3 | 4 | 5 |
f | 3 | 2 | 2 | 1 | 2 | 3 | 4 |
e | 4 | 3 | 3 | 2 | 2 | 2 | 3 |
Get this table. The value at the bottom right of this table is the Levenshtein distance.
If you put this algorithm into a program
levenshtein.py
class Levenshtein:
#Start the array here and enter the initial value
def initArray(self,str1,str2):
distance = []
for i in range(len(str1)+1):
distance.append([0]*(len(str2)+1))
distance[i][0] = i
for j in range(len(str2)+1):
distance[0][j] = j
return distance
#Put a value in a cell
def editDist(self,str1,str2,distance):
dist = [0]*3
for i in range(1,len(str1)+1):
for j in range(1,len(str2)+1):
dist[0] = distance[i-1][j-1] if str1[i-1]==str2[j-1] else distance[i-1][j-1]+1
dist[1] = distance[i][j-1]+1
dist[2] = distance[i-1][j]+1
distance[i][j]=min(dist)
return distance[i][j]
def __init__(self,str1,str2):
self.str1 = str1
self.str2 = str2
Levenshtein.distance = self.initArray(str1,str2)
Levenshtein.dist = self.editDist(str1,str2,Levenshtein.distance)
if __name__ == '__main__':
str1 = 'coffee'
str2 = 'cafe'
leven = Levenshtein(str1,str2)
print(leven.dist)
Recommended Posts