Calculate the editing distance as an index of pronunciation similarity [python]

Overview

We have summarized the points to note when using the editing distance as the pronunciation similarity between Japanese sentences. We also showed a code example to find the editing distance for kana, vowels, and consonants.

background

You may want to find the pronunciation similarity between sentences and words when automatically generating puns and mishearing. At this time, I can think of using the edit distance. The edit distance is an index used as an index of the similarity of character strings, and is defined as the minimum number of edits required to edit (insert, delete, replace) one character string one character at a time to make another character string. I will. In Japanese, there is a one-to-one correspondence between katakana notation and pronunciation, so you can find out how close the pronunciation is by finding the editing distance for katakana notation. However, you need to be careful when handling yoon and long vowels. Specifically, the following processing is required.

Also, when it comes to pronunciation similarity, you may want to find an index that focuses only on vowels and consonants.

Based on the above background, we aim to implement the following functions.

environment

macOS Catalina 10.15.7 python 3.8.0

Installation

Install the library that calculates the editing distance.

pip install editdistance

input

Input is a uniquely pronouncable katakana character string that meets the following conditions.

Implementation of preprocessing function

Divide the kana string into moura units

Create a function that divides the input character string that meets the above assumptions into Moura units.

Reference: Separate Japanese (Katakana) in mora units [Python]

import re

#Represent each condition with a regular expression
c1 = '[Ukusutsunufumyuruguzudubupuvu][ヮ yeo]' #Udan + "ヮ/A/I/E/Oh "
c2 = '[Ixishini Himirigi Jijibipi][Nyayo]' #I-dan (excluding "I") + "Ya"/Yu/E/Yo "
c3 = '[Tedde][Yayo]' #"Te/De "+"/I/Yu/Yo "
c4 = '[A-Vu]' #One katakana character (including long vowels)

cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
re_mora = re.compile(cond)

#Returns a list of katakana strings divided into moura units
def mora_wakachi(kana_text):
    return re_mora.findall(kana_text)

Convert long vowels to vowels that correspond to the actual pronunciation

Of the character strings divided into moura units above, create a function that converts long notes (stretched bars) into vowels that correspond to pronunciation. Convert according to the following rules.

First, in order to determine the "stage" of the immediately preceding element, create a function that converts one unit of moura into a vowel. Then, use that function to create a function that converts long vowels into vowels.

#Convert one unit of received kana into a vowel and return it
#Input is a kana character string equivalent to one unit of moura (length 1 or 2)
def char2vowel(text):
  t = text[-1] #Just look at the last letter to convert to a vowel
  if t in "Akasatana Hamayarawagazada Bapaya":
    return "A"
  elif t in "Ikishini Himiligi Jijibipi":
    return "I"
  elif t in "Ukusutsunufmuyuruguzudubupuvu":
    return "C"
  elif t in "Ekesetene Hemeregese de Bepee":
    return "D"
  elif t in "Okosotonohomoyorowogozodobopoyo":
    return "Oh"
  elif t == "N":
    return "N"
  elif t == "Tsu":
    return "Tsu"
  else:
    print("no match")
    return text

#Input is a list of kana divided by moura
def bar2vowel(kana_list):
  output = []
  output.append(kana_list[0])
  #I don't expect a long vowel to come first, so start the loop from the second element
  for i,v in enumerate(kana_list[1:]):
    if v == "-":
      kana = char2vowel(output[i])#Get the previous element from the output just in case the long vowels are continuous
    else:
      kana = v
    output.append(kana)
  return output

Convert a kana string to the corresponding vowel string

Create a function that converts a kana string to the corresponding vowel string. The input is a list of kana divided by moura. However, it is assumed that long vowels have been converted to vowels. Vowels in the output are represented by katakana in line A. In other words

def kana2vowel(kana_list):
  output = []
  for v in kana_list:
    kana = char2vowel(v)
    output.append(kana)
  return output

Convert a kana string to the corresponding consonant string

Create a function that converts a kana string to the corresponding consonant string. The input is a list of kana divided by moura. However, it is assumed that long vowels have been converted to vowels. Consonants in the output shall be represented by the corresponding alphabet. In other words

The rules for converting to consonants are based on the table below.

consonant Kana
a A line, wa line, ya line
k Ka line, Ka line, Qua line
s Sa line, Sha line, Su line
t Ta line, Tya line, Cha line, Tsa line
n Na line, Nya line, Nua line
h Ha line, Fa line, Hya line
m Ma line, Mya line, Mu line
y Not applicable(Integrated into consonant a)
r La line, Lya line, Lua line
w Not applicable (integrated into consonant a)
g Ga line, Ga line, Gua line
z The line, Ja line, Ja line, Za line, Za line
d Da line, Da line
b Ba line, Va line, Bya line, Ba line
p Pa line, Pya line, Pu line
q Tsu
N N
ky Not applicable (integrated into consonant k)
sh Not applicable(Integrated into consonant s)
ch Not applicable (integrated into consonant t)
ny Not applicable(Integrated into consonant n)
hy Not applicable(Integrated into consonant h)
f Not applicable(Integrated into consonant h)
my Not applicable(Integrated into consonant m)
ry Not applicable(Integrated into consonant r)
gy Not applicable(Integrated into consonant g)
j Not applicable(Integrated into consonant z)
by Not applicable(Integrated into consonant b)
py Not applicable(Integrated into consonant p)
v Not applicable(Integrated into consonant b)

In the above table, consonants that are intuitively similar, such as consonants that are not yoon, are integrated. If possible, it is best to define the similarity between each consonant as a real number from 0 to 1 and find the weighted editing distance. In addition, the consonants of A and N are originally correct as silence (sp), but here, in order to distinguish them, the symbols a, N, and q are given.

Based on the above table, first create a function that converts a single kana character to a consonant, and then use it to convert the entire character string to a consonant.

#Convert one unit of received kana into a vowel and return it
#Input is a kana character string equivalent to one unit of moura (length 1 or 2)
def char2vowel(text):
  t = text[0] #Just look at the first letter to convert to a consonant
  if t in "Aiueoyayuyowawo":
    return "a"
  elif t in "Kakikukeko":
    return "k"
  elif t in "Sashisuseso":
    return "s"
  elif t in "Tachitsuteto":
    return "t"
  elif t in "Na ni nu ne no":
    return "n"
  elif t in "Hahifuheho":
    return "h"
  elif t in "Mamim memo":
    return "m"
  elif t in "Larilrero":
    return "r"
  elif t in "Gagigugego":
    return "g"
  elif t in "Zazizu Zezojizu":
    return "z"
  elif t in "Dadedo":
    return "d"
  elif t in "Babib Bebov":
    return "b"
  elif t in "Papipupepo":
    return "p"
  elif t == "Tsu":
    return "q"
  elif t == "N":
    return "N"
  else:
    print("no match")
    return text

def kana2consonant(kana_list):
  output = []
  for v in kana_list:
    kana = char2consonant(v)
    output.append(kana)
  return output

Edit distance calculation

In order to use it differently depending on the situation, we will define some methods to find the editing distance.

General editing distance

The edit distance as defined is calculated by a function called eval in the editdistance library. To use it, just enter two strings as arguments.

import editdistance as ed
kana_list1 = ["Ko ","N","D","Ji","Wa"]
kana_list2 = ["Ko ","N","Ba","N","Wa"]
dist = ed.eval(kana_list1,kana_list2)
print(dist)#Output is 2

Editing distance by replacement only

You can also find the edit distance when you cannot insert or delete and only replace. This is equivalent to comparing character string A and character string B one character at a time to find the number of different characters. Note that you can't insert and delete, that is, change the string length, so you can only use it when the two strings you want to compare are equal in length.

From the point of view of pronunciation similarity, the appearance of the same sound at the same timing may have a strong effect. If this effect is important, it is advantageous to adopt the edit distance by replacement only. In addition, since the comparison is done in order from the beginning, there is an advantage that the amount of calculation can be saved compared to the general editing distance.

#Find the edit distance by replacement only
def replace_ed(kana_list1, kana_list2):
  if len(kana_list1) != (kana_list2): #kana_list1 and kana_Issue a warning when the string length of list2 is different
    print("warning: length of kana_list1 is different from one of kana_list2")
  dist = 0
  for k1,k2 in zip(kana_list1, kana_list2):
    if k1 != k2:
      dist += 1
  return dist

Relative edit distance

The longer the string, the larger the edit distance tends to be, so if you want to treat it as an absolute similarity, it may be better to standardize by the string length. Below is the code to calculate the standardized value of the general edit distance and the edit distance by replacement only by the character string length.

#General relative edit distance(kana_Normalized with the length of list1)
def relative_ed(kana_list1, kana_list2):
  dist = ed.eval(kana_list1, kana_list2)
  dist /= len(kana_list1)
  return dist

#Relative edit distance by replacement only(kana_Normalized with the length of list1)
def relative_replace_ed(kana_list1, kana_list2):
  dist = replace_ed(kana_list1, kana_list2)
  dist /= len(kana_list1)
  return dist

There is room for consideration as to which of the two character strings to be compared is used as the reference. For example, if you have a specific word A and you want to find the similarity between word A and some word candidates, it seems better to use the length of word A as a reference. You may want to use the average of the lengths of the two words, especially if you don't think of a reference word. This time, I showed the code to normalize by the character string length of the word of the first argument.

Calculation of editing distances for kana, vowels, and consonants

By combining the chords up to this point, create a function that calculates the editing distance of kana, vowels, and consonants. The editing distance for vowels can be calculated after converting kana to vowels. The same is true for consonants. From the contents introduced so far (Absolute or relative) x (general or permutation only) x (kana or vowel or consonant) So, there are 12 possible editing distances. I hope you can create the necessary functions according to your purpose. Here, 12 examples will be long, so here is an example of a function that branches the output according to the flag argument. Input is a katakana character string that meets the conditions raised in the "Input" section.

#text1, text2:Katakana character string that can be pronounced uniquely
#is_relative:When False and True, find the absolute and relative editing distances, respectively.
#ed_type: "normal","replace"Find the editing distance by replacement only, which is common when
#kana_type: "kana","vowel","consonant"Find the editing distances for kana, vowels, and consonants, respectively.
def calc_ed(text1, text2, is_relative=False, ed_type="normal", kana_type="kana"):
  #Divide the input into Moura units
  kana_list1 = mora_wakachi(text1)
  kana_list2 = mora_wakachi(text2)

  #Convert long vowels to vowels
  kana_list1 = bar2vowel(kana_list1)
  kana_list2 = bar2vowel(kana_list2)

  #kana_Do nothing or convert to vowels or consonants depending on type
  if kana_type=="kana":
    pass
  elif kana_type == "vowel":
    kana_list1 = kana2vowel(kana_list1)
    kana_list2 = kana2vowel(kana_list2)
  elif kana_type == "consonant":
    kana_list1 = kana2consonant(kana_list1)
    kana_list2 = kana2consonant(kana_list2)
  else:
    print("warning: kana_type is invalid")

  #ed_Find the editing distance according to type
  dist = 0
  if ed_type == "normal":
    dist = ed.eval(kana_list1, kana_list2)
  elif ed_type == "replace": 
    dist = replace_ed(kana_list1, kana_list2)
  else:
    print("warning: ed_type is invalid")

  #is_If relative is True, kana_Normalize with the character string length of list1
  if is_relative: 
    dist /= len(kana_list1)

  return dist

Whole code

The whole code for copy and paste. Classes are provided for easy management.

import re
import editdistance as ed

class EditDistanceUtil():
  def __init__(self):
    self.re_mora = self.__get_mora_unit_re()

  #Get a regular expression pattern to split into moula units
  def __get_mora_unit_re(self):
    #Represent each condition with a regular expression
    c1 = '[Ukusutsunufumyuruguzudubupuvu][ヮ yeo]' #Udan + "ヮ/A/I/E/Oh "
    c2 = '[Ixishini Himirigi Jijibipi][Nyayo]' #I-dan (excluding "I") + "Ya"/Yu/E/Yo "
    c3 = '[Tedde][Yayo]' #"Te/De "+"/I/Yu/Yo "
    c4 = '[A-Vu]' #One katakana character (including long vowels)

    cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
    re_mora = re.compile(cond)
    return re_mora
  
  #Returns a list of katakana strings divided into moura units
  def mora_wakachi(self, kana_text):
    return self.re_mora.findall(kana_text)

  def char2vowel(self, text):
    t = text[-1] #Just look at the last letter to convert to a vowel
    if t in "Akasatana Hamayarawagazada Bapaya":
      return "A"
    elif t in "Ikishini Himiligi Jijibipi":
      return "I"
    elif t in "Ukusutsunufmuyuruguzudubupuvu":
      return "C"
    elif t in "Ekesetene Hemeregese de Bepee":
      return "D"
    elif t in "Okosotonohomoyorowogozodobopoyo":
      return "Oh"
    elif t == "N":
      return "N"
    elif t == "Tsu":
      return "Tsu"
    else:
      print(text, "no match")
      return text

  #Convert long vowels to vowels
  #Input is a list of kana divided by moura
  def bar2vowel(self, kana_list):
    output = []
    output.append(kana_list[0])
    #I don't expect a long vowel to come first, so start the loop from the second element
    for i,v in enumerate(kana_list[1:]):
      if v == "-":
        kana = self.char2vowel(output[i])#Get the previous element from the output just in case the long vowels are continuous
      else:
        kana = v
      output.append(kana)
    return output

  #Convert kana to vowels
  def kana2vowel(self, kana_list):
    output = []
    for v in kana_list:
      kana = self.char2vowel(v)
      output.append(kana)
    return output

  #Converts 1 unit of received kana into consonants and returns
  #Input is a kana character string equivalent to one unit of moura (length 1 or 2)
  def char2consonant(self, text):
    t = text[0] #Just look at the first letter to convert to a consonant
    if t in "Aiueoyayuyowawo":
      return "a"
    elif t in "Kakikukeko":
      return "k"
    elif t in "Sashisuseso":
      return "s"
    elif t in "Tachitsuteto":
      return "t"
    elif t in "Na ni nu ne no":
      return "n"
    elif t in "Hahifuheho":
      return "h"
    elif t in "Mamim memo":
      return "m"
    elif t in "Larilrero":
      return "r"
    elif t in "Gagigugego":
      return "g"
    elif t in "Zazizu Zezojizu":
      return "z"
    elif t in "Dadedo":
      return "d"
    elif t in "Babib Bebov":
      return "b"
    elif t in "Papipupepo":
      return "p"
    elif t == "Tsu":
      return "q"
    elif t == "N":
      return "N"
    else:
      print(text, "no match")
      return text
  #Returns a kana string as a consonant
  def kana2consonant(self, kana_list):
    output = []
    for v in kana_list:
      kana = self.char2consonant(v)
      output.append(kana)
    return output
  
  #Find the edit distance by replacement only
  def replace_ed(self, kana_list1, kana_list2):
    if len(kana_list1) != len(kana_list2): #kana_list1 and kana_Issue a warning when the string length of list2 is different
      print("warning: length of kana_list1 is different from one of kana_list2")
    dist = 0
    for k1,k2 in zip(kana_list1, kana_list2):
      if k1 != k2:
        dist += 1
    return dist
        
  #text1, text2:Katakana character string that can be pronounced uniquely
  #is_relative:When False and True, find the absolute and relative editing distances, respectively.
  #ed_type: "normal","replace"Find the editing distance by replacement only, which is common when
  #kana_type: "kana","vowel","consonant"Find the editing distances for kana, vowels, and consonants, respectively.
  def calc_ed(self, text1, text2, is_relative=False, ed_type="normal", kana_type="kana"):
    #Divide the input into Moura units
    kana_list1 = self.mora_wakachi(text1)
    kana_list2 = self.mora_wakachi(text2)
  
    #Convert long vowels to vowels
    kana_list1 = self.bar2vowel(kana_list1)
    kana_list2 = self.bar2vowel(kana_list2)
  
    #kana_Do nothing or convert to vowels or consonants depending on type
    if kana_type=="kana":
      pass
    elif kana_type == "vowel":
      kana_list1 = self.kana2vowel(kana_list1)
      kana_list2 = self.kana2vowel(kana_list2)
    elif kana_type == "consonant":
      kana_list1 = self.kana2consonant(kana_list1)
      kana_list2 = self.kana2consonant(kana_list2)
    else:
      print("warning: kana_type is invalid")
  
    #ed_Find the editing distance according to type
    dist = 0
    if ed_type == "normal":
      dist = ed.eval(kana_list1, kana_list2)
    elif ed_type == "replace": 
      dist = self.replace_ed(kana_list1, kana_list2)
    else:
      print("warning: ed_type is invalid")
  
    #is_If relative is True, kana_Normalize with the character string length of list1
    if is_relative: 
      dist /= len(kana_list1)  
    return dist  
    
  #Find the editing distance of Kana
  def kana_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "normal", "kana")
  
  #Find the vowel editing distance
  def vowel_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "normal", "vowel")
  
  #Find the editing distance of consonants
  def consonant_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "normal", "consonant")    
    
  #Find the edit distance only by replacing the kana
  def kana_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "replace", "kana")
  
  #Find the editing distance by replacing vowels only
  def vowel_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "replace", "vowel")
  
  #Find the editing distance by replacing consonants only
  def consonant_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, False, "replace", "consonant")    
    
  #Find the relative editing distance of Kana
  def relative_kana_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "normal", "kana")
  
  #Find the relative editing distance of vowels
  def relative_vowel_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "normal", "vowel")
  
  #Find the relative editing distance of consonants
  def relative_consonant_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "normal", "consonant")    
    
  #Find the relative editing distance only by replacing the kana
  def relative_kana_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "replace", "kana")
  
  #Find the relative editing distance by replacing vowels only
  def relative_vowel_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "replace", "vowel")
  
  #Find the relative editing distance by replacing consonants only
  def relative_consonant_replace_ed(self, text1,text2):
    return self.calc_ed(text1, text2, True, "replace", "consonant")    
  
if __name__=="__main__":
  edu = EditDistanceUtil()
  text1 = "Hello"
  text2 = "Wacontara"

  print("normal edit distance")
  print(edu.kana_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.vowel_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.consonant_ed(text1,text2))
  
  print("")
  print("replace edit distance")
  print(edu.kana_replace_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.vowel_replace_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.consonant_replace_ed(text1,text2))
  
  print("")
  print("relative normal edit distance")
  print(edu.relative_kana_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.relative_vowel_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.relative_consonant_ed(text1,text2))
  
  print("")
  print("relative replace edit distance")
  print(edu.relative_kana_replace_ed(text1,text2))
  print("")
  print(edu.kana2vowel(text1), edu.kana2vowel(text2))
  print(edu.relative_vowel_replace_ed(text1,text2))
  print("")
  print(edu.kana2consonant(text1), edu.kana2consonant(text2))
  print(edu.relative_consonant_replace_ed(text1,text2))

Recommended Posts

Calculate the editing distance as an index of pronunciation similarity [python]
Calculate the similarity between sentences with Doc2Vec, an evolution of Word2Vec
The story of making Python an exe
Calculate the total number of combinations with python
[Python numpy] Dynamically specify the index of the array
[Python] How to calculate the approximation formula of the same intercept 0 as Excel [scikit-learn] Memo
[Python] Calculate the average value of the pixel value RGB of the object
the zen of Python
Calculate the regression coefficient of simple regression analysis with python
Note: The meaning of specifying only * (asterisk) as an argument in the Python function definition.
Save screenshot of [Python] [Windows] screen as an image
Lambda expression (correction) that creates an index (dictionary with members as keys) of the members of the object being collected in python
Don't take an instance of a Python exception class directly as an argument to the exception class!
Calculate the square root of 2 in millions of digits with python
Write a python program to find the editing distance [python] [Levenshtein distance]
Various ways to calculate the similarity between data in python
[Python] Calculate the angle consisting of three points on the coordinates
Get the index of each element of the confusion matrix in Python
Towards the retirement of Python2
About the ease of Python
Calculate the number of changes
About the features of Python
The Power of Pandas: Python
Understand the function of convolution using image processing as an example
Save the result of the life game as a gif with python
How to know the internal structure of an object in Python
Tips: [Python] Calculate the average value of the specified area with bedgraph
Get the formula in an excel file as a string in Python
The story of making a tool to load an image with Python ⇒ save it as another name
The story of Python and the story of NaN
[Python] The stumbling block of import
First Python 3 ~ The beginning of repetition ~
Existence from the viewpoint of Python
pyenv-change the python version of virtualenv
Ideone> Python version: 3.5 (as of August 29, 2017)
Get the attributes of an object
Find the Levenshtein Distance with python
Change the Python version of Homebrew
[Python] Understanding the potential_field_planning of Python Robotics
Review of the basics of Python (FizzBuzz)
Calculate the previous month in Python
About the basics list of Python basics
Learn the basics of Python ① Beginners
[Python] Yuriko Koike Calculate the number of votes you need exactly [matplotlib]
If you want a singleton in python, think of the module as a singleton
Comparing the basic grammar of Python and Go in an easy-to-understand manner
Python Note: When you want to know the attributes of an object
[Python] Calculate the number of digits required when filling in 0s [Note]
Calculate the probability of being a squid coin with Bayes' theorem [python]
Open an Excel file in Python and color the map of Japan
The eval () function that calculates a string as an expression in python
Get the return value of an external shell script (ls) with python3
An example of the answer to the reference question of the study session. In python.
Using the naive Bayes classifier implemented in Python 3.3, calculate the similarity from the co-occurrence frequency of words in sentences and strings.