Divide Japanese (katakana) into syllable units [Python]

Introduction

I wrote a Python function that splits a Japanese katakana string into syllable units (syllable segmentation).

Original text          Mora segmentation            Syllable segmentation
ガッキューシンブン      ガ/ッ/キュ/ー/シ/ン/ブ/ン      ガッ/キュー/シン/ブン
アウトバーン            ア/ウ/ト/バ/ー/ン              ア/ウ/ト/バーン

Mora and syllable are the two typical segmentation units in Japanese phonology. The mora is the unit counted in the "5-7-5" of haiku: the long vowel mark (ー), the sokuon (ッ), and the moraic nasal (ン) each count as one beat. In syllable segmentation, by contrast, ー, ッ, and ン are not counted on their own; each is merged with the preceding kana that can form a syllable by itself, and the combination counts as one unit. When they occur consecutively, as in バーン, three or more moras form a single syllable. See the Wikipedia article on mora for more information.
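The relationship between the two units can be sketched in a few lines. This is only an illustration, assuming the mora list has already been produced by hand; `DEPENDENT_MORAS` and `moras_to_syllables` are hypothetical names, not part of the function described in this article.

```python
# Moras that cannot stand alone as a syllable: they attach to the preceding unit.
DEPENDENT_MORAS = {'ー', 'ッ', 'ン'}

def moras_to_syllables(moras):
    """Merge long-vowel, sokuon, and nasal moras into the preceding unit."""
    syllables = []
    for mora in moras:
        if mora in DEPENDENT_MORAS and syllables:
            # Attach to the previous syllable (e.g. バ + ー + ン becomes バーン).
            syllables[-1] += mora
        else:
            syllables.append(mora)
    return syllables

moras = ['ア', 'ウ', 'ト', 'バ', 'ー', 'ン']  # アウトバーン, split into moras by hand
syllables = moras_to_syllables(moras)
print(len(moras), len(syllables))  # 6 4
print(syllables)                   # ['ア', 'ウ', 'ト', 'バーン']
```

Six moras collapse into four syllables here because バ, ー, and ン form a single syllable.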

This article covers syllable segmentation. Mora segmentation is explained in a separate article: Separate Japanese (katakana) in mora units [Python]

Environment

Policy

To keep things simple, the input is assumed to be a full-width katakana string that contains no symbols. It is also assumed that anything that can be written as a long vowel has already been converted to the long vowel mark; for example, ガッキュウ is written as ガッキュー. For how to convert mixed kanji-kana text into a katakana pronunciation string, please see the separate article. Note that because that approach uses MeCab, words that are not in its dictionary cannot be converted.
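The input assumption can be checked before segmenting. The following is a minimal sketch, not part of the original function: `normalize_kana` and `RE_KATAKANA_ONLY` are hypothetical names, and the use of NFKC normalization to turn half-width katakana into full-width is an added assumption. It does not convert ウ into the long vowel mark; that step is left to the pronunciation conversion mentioned above.

```python
import re
import unicodedata

# Full-width katakana (ァ to ヴ) plus the long vowel mark ー.
RE_KATAKANA_ONLY = re.compile(r'[ァ-ヴー]+')

def normalize_kana(text):
    """Hypothetical helper: NFKC-normalize, then require a katakana-only string."""
    # NFKC maps half-width katakana to full-width and composes voiced marks.
    normalized = unicodedata.normalize('NFKC', text)
    if not RE_KATAKANA_ONLY.fullmatch(normalized):
        raise ValueError(f'not a pure katakana string: {text!r}')
    return normalized

print(normalize_kana('ｶﾞｯｷｭｰ'))  # ガッキュー
```

Rejecting anything outside the katakana range up front makes it obvious when the input violates the assumption, instead of producing a silently incomplete segmentation later.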

Under these assumptions, a syllable is defined as follows.

One of 1-4 below, followed by a run (possibly empty) of ー/ッ/ン

  1. u-row kana + ァ/ィ/ェ/ォ
  2. i-row kana (excluding イ) + ャ/ュ/ェ/ョ
  3. テ/デ + ィ/ュ
  4. Any single full-size (non-small) kana not covered above

Written as regular expressions, these become:

Regular expression                              Meaning
[ウクスツヌフムユルグズヅブプヴ][ァィェォ]          ① u-row kana + ァ/ィ/ェ/ォ
[イキシチニヒミリギジヂビピ][ャュェョ]              ② i-row kana (excluding イ) + ャ/ュ/ェ/ョ
[テデ][ィュ]                                     ③ テ/デ + ィ/ュ
[アイウエオカ-ヂツ-モヤユヨ-ヲヴ]                   ④ one full-size kana other than ①②③
[ーッン]*                                        ⑤ a run (possibly empty) of ー/ッ/ン

With these, the whole pattern can be written as '(①|②|③|④)⑤'.
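The order of the alternatives matters: the two-character patterns have to be tried before the single-kana pattern ④, because the regex engine takes the first alternative that matches. The sketch below illustrates this with only ② and ④; the variable names `two_char_first` and `single_first` are made up for the illustration.

```python
import re

c2 = '[イキシチニヒミリギジヂビピ][ャュェョ]'  # i-row kana + small ャ/ュ/ェ/ョ
c4 = '[アイウエオカ-ヂツ-モヤユヨ-ヲヴ]'       # one full-size kana
c5 = '[ーッン]*'                              # trailing run of ー/ッ/ン

# Two-character alternative first (the order used in this article).
two_char_first = re.compile('(?:' + c2 + '|' + c4 + ')' + c5)
# Single-kana alternative first (wrong order, for comparison).
single_first = re.compile('(?:' + c4 + '|' + c2 + ')' + c5)

print(two_char_first.findall('キョー'))  # ['キョー']
print(single_first.findall('キョー'))    # ['キ']
```

With the wrong order, キ is consumed alone, and the remaining ョ and ー cannot start a match, so findall drops them without any error.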

Code

import re

# Regular expression for:
# ((u-row kana + ァ/ィ/ェ/ォ)|(i-row kana + ャ/ュ/ェ/ョ)|(テ/デ + ィ/ュ)|(full-size kana))(run of ー/ッ/ン, possibly empty)
c1 = '[ウクスツヌフムユルグズヅブプヴ][ァィェォ]'  # u-row kana + ァ/ィ/ェ/ォ
c2 = '[イキシチニヒミリギジヂビピ][ャュェョ]'  # i-row kana + ャ/ュ/ェ/ョ (the class as written includes イ, so イェ also matches)
c3 = '[テデ][ィュ]'  # テ/デ + ィ/ュ
c4 = '[アイウエオカ-ヂツ-モヤユヨ-ヲヴ]'  # one full-size kana
c5 = '[ーッン]*'  # run of ー/ッ/ン (possibly empty)

cond = '(?:'+c1+'|'+c2+'|'+c3+'|'+c4+')'+c5  # (?:...) groups the alternatives without capturing
cond = '('+cond+')'  # one capturing group, so findall returns each whole syllable
re_syllable = re.compile(cond)

def syllableWakachi(kana_text):
    # Return the list of syllables found in a katakana string.
    return re_syllable.findall(kana_text)

text = 'シンシュンシャンソンショー'
print(text)
print(syllableWakachi(text))
print('')

text = 'トーキョートッキョキョカキョク'
print(text)
print(syllableWakachi(text))
print('')

text = 'アウトバーン'
print(text)
print(syllableWakachi(text))
print('')

text = 'ガッキュウホウカイ'  # long vowels left as ウ instead of ー
print(text)
print(syllableWakachi(text))

The output is below.

シンシュンシャンソンショー
['シン', 'シュン', 'シャン', 'ソン', 'ショー']

トーキョートッキョキョカキョク
['トー', 'キョー', 'トッ', 'キョ', 'キョ', 'カ', 'キョ', 'ク']

アウトバーン
['ア', 'ウ', 'ト', 'バーン']

ガッキュウホウカイ
['ガッ', 'キュ', 'ウ', 'ホ', 'ウ', 'カ', 'イ']
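One caveat of a findall-based approach is that characters matching no alternative (a leading ン, a stray small kana, symbols) are skipped silently. A stricter variant can compare the joined result with the input. This is a sketch under that assumption; `split_syllables_strict` and `RE_SYLLABLE` are hypothetical names, and the pattern is the same one described in this article, written without a capturing group.

```python
import re

# Same pattern as in the article; with no capturing group,
# findall returns each whole match.
RE_SYLLABLE = re.compile(
    '(?:[ウクスツヌフムユルグズヅブプヴ][ァィェォ]'
    '|[イキシチニヒミリギジヂビピ][ャュェョ]'
    '|[テデ][ィュ]'
    '|[アイウエオカ-ヂツ-モヤユヨ-ヲヴ])[ーッン]*'
)

def split_syllables_strict(kana_text):
    """Hypothetical variant that raises instead of dropping characters."""
    syllables = RE_SYLLABLE.findall(kana_text)
    # If anything was skipped, the pieces no longer rebuild the input.
    if ''.join(syllables) != kana_text:
        raise ValueError(f'unsegmentable characters in {kana_text!r}')
    return syllables

print(split_syllables_strict('アウトバーン'))  # ['ア', 'ウ', 'ト', 'バーン']
```

Whether to raise, or to keep the leftover characters as separate units, depends on how the result is used; raising is simply the easiest way to notice input that violates the assumptions.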
