Overview

I wrote a program that calculates how closely the break positions match in two strings that contain breaks. I made it for parodies in the "I tried to sing with XX" genre, to check how well the word breaks of the parody lyrics line up with the phrase breaks of the original lyrics.

Background

"I tried to sing with XX" is a kind of parody that reproduces the pronunciation of the original lyrics using only nouns from one specific category. The figure below shows an example in which the lyrics of the children's song "Furusato" are replaced with station names. In these parodies, the word breaks of the parody lyrics tend to be aligned fairly closely with the phrase breaks of the original lyrics. The reason is not clear, but one possibility is that this makes the parody easier to sing. I wanted to find out how closely a given parody matches, so I decided to write a program that evaluates the phrase match rate of two sentences.

The problem to solve

Given string A and string B, the goal is to find the phrase match rate between the two.

Input

For simplicity, assume that the two strings have the same number of morae, and that both the pronunciation (reading) and the phrase (word) breaks are known. The reading is written in katakana, and phrase breaks are marked with slashes. For example, assume the following strings as input.

String A (parody lyrics): Seagull/Seagull/Kagu/Naganogori/Rhinoceros/Kui/Sparrowhawk/Rat/Cow/Donkey/Leopard/Menada/Abi
String B (original lyrics): Kagome/Kagome/Kagono/Nakano/Triwa/It/It/Deatta/Ushirono/Showmen/Daare

Output

The program outputs the phrase match rate of string A against string B. The phrase match rate is defined as the proportion of phrase breaks in string A that fall at the same position as a phrase break in string B. Two breaks are at the same position when the same number of morae precedes them. The end of the string is not counted as a phrase break. In the Kagome example above, the breaks in string A that match a break in string B are marked with double quotation marks:

String A (parody lyrics): Seagull "/" Seagull "/" Kagu "/" Naganogori/Rhinoceros/Kui/Sparrowhawk/Rat "/" Cow/Donkey "/" Leopard/Menada/Abi
String B (original lyrics): Kagome/Kagome/Kagono/Nakano/Triwa/It/It/Deatta/Ushirono/Showmen/Daare

String A has 12 phrase breaks (= the number of slashes), and 5 of them match a phrase break in string B, so the phrase match rate is 5/12 ≈ 0.42. Note that positions are counted in morae, not in characters (for example, the ヒョ in ヒョウ, "leopard", is two characters but one mora).
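The definition above can be sketched in a few lines once each phrase has been reduced to its mora count. This is only a minimal illustration: the helper names `break_positions` and `match_rate` and the mora counts are hypothetical and are not taken from the lyrics above.

```python
from itertools import accumulate


def break_positions(mora_counts):
    """Return break positions as cumulative mora counts.

    The last cumulative value is the end of the string, which is not
    counted as a break, so it is dropped.
    """
    return list(accumulate(mora_counts))[:-1]


def match_rate(counts_a, counts_b):
    """Fraction of the breaks in A that fall on a break in B."""
    breaks_a = break_positions(counts_a)
    breaks_b = set(break_positions(counts_b))
    matched = sum(1 for p in breaks_a if p in breaks_b)
    return matched / len(breaks_a)


# Hypothetical mora counts per phrase (both strings have 9 morae in total)
counts_a = [3, 2, 4]  # breaks after 3 and 5 morae
counts_b = [3, 3, 3]  # breaks after 3 and 6 morae

print(match_rate(counts_a, counts_b))  # 0.5
```

Only the break after 3 morae is shared, so one of the two breaks in A matches and the rate is 0.5.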

Environment

macOS Catalina 10.15.7
Python 3.8.0

Code

The program decomposes each given string into morae and then compares the positions of the slashes, counted in morae.

import re

# Represent each mora pattern with a regular expression
c1 = '[ウクスツヌフムユルグズヅブプヴ][ヮァィェォ]'  # u-row kana + small ヮ/ァ/ィ/ェ/ォ
c2 = '[イキシチニヒミリギジヂビピ][ャュェョ]'  # i-row kana + small ャ/ュ/ェ/ョ
c3 = '[テデ][ャィュョ]'  # テ/デ + small ャ/ィ/ュ/ョ
c4 = '[ァ-ヴー]'  # a single katakana character (including the long-vowel mark ー)

cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
re_mora = re.compile(cond)

# Return a list of the morae in a katakana string
def mora_wakachi(kana_text):
    return re_mora.findall(kana_text)


def phrase_partition_concordance(text1, text2):
    partition = "/"
    # Split each string at the delimiter
    kana_list1 = text1.split(partition)
    kana_list2 = text2.split(partition)
    # Divide each phrase into morae
    kana_list1 = [mora_wakachi(k) for k in kana_list1]
    kana_list2 = [mora_wakachi(k) for k in kana_list2]

    # Get the break positions of text1 as cumulative mora counts
    partition_position1 = [0]
    for k in kana_list1:
        pos = partition_position1[-1] + len(k)
        partition_position1.append(pos)
    # The start (0) and the end of the string are not counted as breaks
    partition_position1 = partition_position1[1:-1]

    # Get the break positions of text2 as cumulative mora counts
    partition_position2 = [0]
    for k in kana_list2:
        pos = partition_position2[-1] + len(k)
        partition_position2.append(pos)
    # The start (0) and the end of the string are not counted as breaks
    partition_position2 = partition_position2[1:-1]

    # Count the breaks of text1 that are also breaks of text2
    same_pos_num = 0
    for p in partition_position1:
        if p in partition_position2:
            same_pos_num += 1
    # Number of breaks in text1
    partition_num = len(partition_position1)
    return same_pos_num / partition_num

# Example input (the function expects the katakana readings of these words)
text1 = "Seagull/Seagull/Kagu/Naganogori/Rhinoceros/Kui/Sparrowhawk/Rat/Cow/Donkey/Leopard/Menada/Abi"
text2 = "Kagome/Kagome/Kagono/Nakano/Triwa/It/It/Deatta/Ushirono/Showmen/Daare"

print(phrase_partition_concordance(text1, text2))
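To make the mora counting concrete, here is a small self-contained sketch that can be run on katakana input. The character classes are an assumption reconstructed from the comments in the code above, and the names `split_morae`, `break_positions`, and `match_rate`, as well as the two short input strings, are hypothetical and chosen only for illustration.

```python
import re

# Assumed mora patterns, reconstructed from the comments in the code above
MORA = re.compile('|'.join([
    '[ウクスツヌフムユルグズヅブプヴ][ヮァィェォ]',  # u-row kana + small ヮ/ァ/ィ/ェ/ォ
    '[イキシチニヒミリギジヂビピ][ャュェョ]',        # i-row kana + small ャ/ュ/ェ/ョ
    '[テデ][ャィュョ]',                              # テ/デ + small ャ/ィ/ュ/ョ
    '[ァ-ヴー]',                                     # any single katakana or ー
]))


def split_morae(kana):
    """Split a katakana string into a list of morae."""
    return MORA.findall(kana)


def break_positions(text, sep="/"):
    """Mora counts preceding each break, excluding the end of the string."""
    positions = []
    total = 0
    for phrase in text.split(sep):
        total += len(split_morae(phrase))
        positions.append(total)
    return positions[:-1]


def match_rate(text_a, text_b):
    """Fraction of the breaks in text_a that coincide with a break in text_b."""
    breaks_a = break_positions(text_a)
    breaks_b = set(break_positions(text_b))
    return sum(p in breaks_b for p in breaks_a) / len(breaks_a)


print(split_morae("ヒョウ"))      # ['ヒョ', 'ウ']
print(split_morae("ショーメン"))  # ['ショ', 'ー', 'メ', 'ン']

# Hypothetical 7-mora inputs: A breaks after 3 and 5 morae, B after 3
print(match_rate("カモメ/ヒョウ/ウシ", "カゴメ/ショーメン"))  # 0.5
```

The patterns for two-character morae are listed before the single-character pattern so that ヒョ is consumed as one mora instead of being split into ヒ and ョ.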

Calculation of match rate of character string breaks [python]