Calculating the match rate of phrase breaks between two strings [Python]

Overview

I wrote a program that calculates how well the break positions match between two strings that contain breaks. I made it to check how closely the word breaks of the parody lyrics match the phrase breaks of the original lyrics in parodies of the "I tried to sing with XX" genre.

Background

"I tried to sing with XX" is a genre of parody that reproduces the pronunciation of the original lyrics using only nouns from a specific category. The figure below shows an example in which the lyrics of the children's song "Furusato" are replaced with station names.

[Figure: the lyrics of "Furusato" rewritten with station names]

In "I tried to sing with XX" parodies, the word breaks of the parody lyrics tend to line up with the phrase breaks of the original lyrics. The reason is not clear, but one possibility is that this makes the parody easier to sing. I wanted to find out how closely they actually match, so I wrote a program that evaluates the phrase match rate of two sentences.

The problem to solve

Given string A and string B, the goal is to find the phrase match rate between the two.

Input

For simplicity, assume that the two strings have the same number of morae. Also assume that the pronunciation (reading) and the phrase (word) breaks are already known. The pronunciation is written in katakana, and the phrase breaks are marked with slashes. For example, the input used in this article is the reading of the song "Kagome Kagome" and of a parody of it, each with slashes at the phrase breaks.
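The input format described above can be checked before any counting is done. The following is only a minimal sketch: the function name `is_valid_input` and the sample strings are hypothetical, and it checks the character set only, not whether the two strings have the same number of morae.

```python
import re

# Katakana (including the long-vowel mark) plus "/" as the phrase separator
VALID_INPUT = re.compile(r"[ァ-ヴー/]+")


def is_valid_input(text):
    """Return True if text consists only of katakana and slashes."""
    return VALID_INPUT.fullmatch(text) is not None


print(is_valid_input("ヒョウ/シャチ"))  # True
print(is_valid_input("hyou/shachi"))    # False
```

A check like this catches readings that were left in hiragana or romaji, which the mora-splitting step would otherwise silently count as zero morae.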

Output

The program outputs the phrase match rate of string A and string B. The phrase match rate is defined as the fraction of the phrase breaks in string B that fall at the same position as a phrase break in string A. Two breaks are at the same position when the same number of morae precede them. The end of the string is not counted as a phrase break.

In the Kagome example, string B has 12 phrase breaks (= the number of slashes), and 5 of them match a phrase break in string A, so the phrase match rate is 5/12 ≈ 0.42. Positions are counted in morae, not characters (for example, ヒョ is two characters but one mora). In the code below, the string whose breaks form the denominator is passed as the first argument.
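The definition can be illustrated with plain numbers before any katakana handling is involved. This is a sketch under stated assumptions: each string is given as a hypothetical list of phrase lengths in morae, the names `break_positions` and `match_rate` are made up for the illustration, and the denominator is the number of breaks in string B, as in the 5-out-of-12 count above.

```python
from itertools import accumulate


def break_positions(mora_counts):
    """Mora offsets of the phrase breaks, excluding the end of the string."""
    return list(accumulate(mora_counts))[:-1]


def match_rate(counts_a, counts_b):
    """Fraction of B's breaks that coincide with a break in A."""
    breaks_a = set(break_positions(counts_a))
    breaks_b = break_positions(counts_b)
    if not breaks_b:
        raise ValueError("string B has no phrase breaks")
    return sum(p in breaks_a for p in breaks_b) / len(breaks_b)


# Hypothetical phrase lengths in morae (both strings total 10 morae)
counts_a = [3, 3, 4]     # breaks after 3 and 6 morae
counts_b = [3, 2, 2, 3]  # breaks after 3, 5 and 7 morae

print(break_positions(counts_a))  # [3, 6]
print(break_positions(counts_b))  # [3, 5, 7]
print(match_rate(counts_a, counts_b))  # one of B's three breaks matches
```

Only the break after 3 morae is shared, so one of the three breaks in B matches and the rate is 1/3.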

Environment

macOS Catalina 10.15.7, Python 3.8.0

Code

The code splits each string into morae and then records the position of each slash as the number of morae that precede it.
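As a rough illustration of these two steps on a single string, the sketch below splits a few words into morae and lists the break positions. The character classes in the pattern are an assumption about which kana combine with a following small kana, and the sample words and helper names (`split_morae`, `break_positions`) are hypothetical.

```python
import re

# A kana followed by a small kana counts as one mora; otherwise one
# katakana character (including the long-vowel mark) is one mora.
# The character classes are an assumption made for this sketch.
MORA = re.compile(
    "[ウクスツヌフムユルグズヅブプヴ][ヮァィェォ]"
    "|[イキシチニヒミリギジヂビピ][ャュェョ]"
    "|[テデ][ャィュョ]"
    "|[ァ-ヴー]"
)


def split_morae(kana):
    """Split a katakana string into a list of morae."""
    return MORA.findall(kana)


def break_positions(text, sep="/"):
    """Number of morae preceding each separator in text."""
    positions, total = [], 0
    for phrase in text.split(sep):
        total += len(split_morae(phrase))
        positions.append(total)
    return positions[:-1]  # the end of the string is not a break


sample = "ヒョウ/シャチ/キュウリ"
print(split_morae("ヒョウ"))    # ['ヒョ', 'ウ']
print(break_positions(sample))  # [2, 4]
```

ヒョウ and シャチ are each three characters but two morae, so the two slashes sit after 2 and 4 morae rather than after 3 and 6 characters.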

import re

#Represent each condition with a regular expression
c1 = '[ウクスツヌフムユルグズヅブプヴ][ヮァィェォ]' #U-row kana + "ヮ/ァ/ィ/ェ/ォ"
c2 = '[イキシチニヒミリギジヂビピ][ャュェョ]' #I-row kana + "ャ/ュ/ェ/ョ"
c3 = '[テデ][ャィュョ]' #"テ/デ" + "ャ/ィ/ュ/ョ"
c4 = '[ァ-ヴー]' #One katakana character (including the long vowel mark)

cond = '('+c1+'|'+c2+'|'+c3+'|'+c4+')'
re_mora = re.compile(cond)

#Returns a list of the morae that make up a katakana string
def mora_wakachi(kana_text):
    return re_mora.findall(kana_text)


#Returns the mora offsets of the phrase breaks in a slash-delimited katakana string
def partition_positions(text, partition="/"):
    positions = []
    pos = 0
    for phrase in text.split(partition):
        #Add the number of morae in this phrase
        pos += len(mora_wakachi(phrase))
        positions.append(pos)
    #The end of the string is not counted as a phrase break
    return positions[:-1]


def phrase_partition_concordance(text1, text2):
    #Get the phrase break positions of both strings
    partition_position1 = partition_positions(text1)
    partition_position2 = set(partition_positions(text2))

    #Count the breaks of text1 that are also breaks of text2
    same_pos_num = sum(1 for p in partition_position1 if p in partition_position2)
    #Number of phrase breaks in text1
    partition_num = len(partition_position1)
    if partition_num == 0:
        #text1 has no breaks, so the rate is undefined; return 0 instead of dividing by zero
        return 0.0
    return same_pos_num / partition_num

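#text1 and text2 must be katakana readings with "/" at each phrase break.
#The sample values below are English renderings of the katakana originals; substitute the katakana readings before running.
text1 = "Seagull/Seagull/Kagu/Naganogori/Rhino/Cuy/Sparrowhawk/Rat/cow/Donkey/Leopard/Menada/Red-throated loon"
text2 = "Kagome/Kagome/Kagono/Nakano/TRIWA/It/It/Deatta/Ushirono/Showmen/Daare"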

print(phrase_partition_concordance(text1,text2))
