[PYTHON] Unreliability of unidecode for Japanese in the Tacotron 2 series

Introduction

Converting text into synthetic speech is called Text-to-Speech (TTS). I have not actually trained a TTS model this time. This post records a failure: when the input text is Japanese, the transliteration_cleaners used by the Tacotron 2 family does not convert it to romaji correctly.

Tacotron 2 series

NVIDIA publishes the Text-to-Speech implementations listed below. This time I tried flowtron, but the unidecode failure on Japanese input appears to be common to the others as well.

https://github.com/NVIDIA/flowtron
https://github.com/NVIDIA/mellotron
https://github.com/NVIDIA/tacotron2

(By the way, I don't know in detail how they differ.)

Training on your own Japanese data

I have not prepared training data or trained a model, but to train on your own data you create the file lists yourself and point train.py at them, as shown below.

train.py


    ...
    data_config['training_files'] = 'filelists/train_filelist.txt'
    data_config['validation_files'] = 'filelists/validation_filelist.txt'
    data_config['text_cleaners'] = ['transliteration_cleaners']
    train(n_gpus, rank, **train_config)

As far as I can tell, each line of the file list needs the audio file location, the spoken text, and a speaker ID. When the training data mixes several speakers, I think each speaker needs an ID that does not collide with another speaker's. (Perhaps.)
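A minimal sketch of reading such a file list, assuming each line has the form `audio_path|text|speaker_id` with an integer speaker ID (the `|` separator matches the `split="|"` default in data.py; the integer ID and the sample entries are my assumptions, made up for illustration). Counting utterances per speaker ID is a quick way to notice when two speakers accidentally share an ID.

```python
from collections import Counter


def parse_filelist_line(line, split="|"):
    """Split one file-list line into (audio_path, text, speaker_id)."""
    audio_path, text, speaker_id = line.rstrip("\n").split(split)
    return audio_path, text, int(speaker_id)


# Hypothetical entries, made up for illustration
lines = [
    "wavs/spk0_0001.wav|テストです。|0",
    "wavs/spk1_0001.wav|マイクのテスト中。|1",
]

entries = [parse_filelist_line(line) for line in lines]

# Number of utterances per speaker ID
utterances_per_speaker = Counter(speaker_id for _, _, speaker_id in entries)
print(utterances_per_speaker)
```

Note that this simple split fails if the spoken text itself contains a `|` character.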

data.py


def load_filepaths_and_text(filename, split="|"):
    with open(filename, encoding='cp932') as f:   # changed the encoding to cp932 (for Windows)
...
    def get_text(self, text):
        print(text)               # added: text before cleaning
        text = _clean_text(text, self.text_cleaners)
        print(text)               # added: text after cleaning

text/cleaners.py


def transliteration_cleaners(text):
    '''Pipeline for non-English text that transliterates to ASCII.'''
    text = convert_to_ascii(text)
    text = lowercase(text)
    text = collapse_whitespace(text)
    return text

So to read Japanese instead of English, it looks as though you change the encoding to cp932 and set the cleaners to ['transliteration_cleaners']. Its docstring says "Pipeline for non-English text that transliterates to ASCII", so for a moment I thought this was the right choice for Japanese input.
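Hard-coding cp932 only works for files saved in that encoding. As a minimal sketch (the helper name `read_lines_with_fallback` and the order of encodings tried are my own assumptions, not part of the NVIDIA code), the file list can be read by trying UTF-8 first and falling back to cp932:

```python
import os
import tempfile


def read_lines_with_fallback(path, encodings=("utf-8", "cp932")):
    """Return (lines, encoding) using the first encoding that decodes the file."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read().splitlines(), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo: write a cp932 file list to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train_filelist.txt")
    with open(demo_path, "w", encoding="cp932") as f:
        f.write("wavs/0001.wav|テストです。|0\n")
    demo_lines, demo_encoding = read_lines_with_fallback(demo_path)

print(demo_encoding, demo_lines)
```

Japanese text saved as cp932 is normally not valid UTF-8, so the first attempt raises UnicodeDecodeError and the second one succeeds.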

But the conversion doesn't work

This is the output of the print() statements added to get_text. The sentence written in hiragana and katakana (the "test" line) was converted correctly. Kanji, on the other hand, were converted to their Chinese (pinyin) readings.

python


Epoch: 0
テストです。
tesutodesu.
東京特許許可局
dong jing te xu xu ke ju
マイクのテスト中。
maikunotesutozhong .
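Since the failure shows up wherever kanji appear, one way to find affected lines before training is to scan the text for CJK ideographs. This is a minimal standard-library sketch. The helper name `contains_kanji` is hypothetical, and the check relies on Unicode character names, so it does not cover marks such as 々.

```python
import unicodedata


def contains_kanji(text):
    """Return True if any character is a CJK unified ideograph."""
    return any(
        unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")
        for ch in text
    )


for sample in ["テストです。", "東京特許許可局", "マイクのテスト中。"]:
    print(sample, contains_kanji(sample))
```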

Hiragana and katakana have problems too

In the first place, the conversion from Japanese (Unicode) to ASCII is done by a library called unidecode.

python


from unidecode import unidecode

def convert_to_ascii(text):
    return unidecode(text)

I checked how unidecode converts a number of Japanese strings.

python


# coding: cp932
from unidecode import unidecode

# Test strings. Whether each kana sample was written in hiragana or katakana
# is inferred from the outputs listed below; unidecode romanizes both scripts
# identically, so the results are the same either way.
samples = [
    'あいうえお',
    'ぁぃぅぇぉ',
    '相性',
    '相談',
    'こうてい',
    'コウテイ',
    'こおてい',
    'こてい',
    'こ～てい',
    'こーてい',
    'キャット',
    'キヤツト',
    'かんい',
    'かに',
]

for text1 in samples:
    text2 = unidecode(text1)
    print(text1)
    print(text2)

......

あいうえお
aiueo
ぁぃぅぇぉ
aiueo
相性
Xiang Xing
相談
Xiang Tan
こうてい
koutei
コウテイ
koutei
こおてい
kootei
こてい
kotei
こ～てい
ko~tei
こーてい
ko-tei
キャット
kiyatsuto
キヤツト
kiyatsuto
かんい
kani
かに
kani

・ Kanji are converted to Chinese readings.
・ Small kana (ぁぃぅぇぉゃゅょっ) come out the same as full-size kana (あいうえおやゆよつ).
・ The long-vowel mark "ー" is not interpreted.
・ かんい and かに are both converted to "kani".

There are many problems.

Therefore, unidecode is not suitable for converting Japanese in the first place.
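To make the kana problems concrete, here is a minimal sketch of the rules that a character-by-character table cannot express: small ゃゅょ merge with the preceding kana, small っ doubles the following consonant, ー lengthens the preceding vowel, and ん before a vowel needs an apostrophe. The function `romanize_kana` and its tables are hypothetical and deliberately incomplete (no voiced kana, no small vowels, no kanji). It only illustrates the rules and is not how pykakasi or any other library works internally.

```python
# Basic hiragana -> Hepburn-style romaji (no voiced kana; illustrative only)
BASE = dict(zip(
    "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん",
    ("a i u e o ka ki ku ke ko sa shi su se so ta chi tsu te to "
     "na ni nu ne no ha hi fu he ho ma mi mu me mo ya yu yo "
     "ra ri ru re ro wa wo n").split(),
))
SMALL_Y = {"ゃ": "ya", "ゅ": "yu", "ょ": "yo"}


def kata_to_hira(text):
    """Shift katakana (U+30A1..U+30F6) down to the matching hiragana."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def romanize_kana(text):
    out = []
    geminate = False  # True right after a small っ
    for ch in kata_to_hira(text):
        if ch == "っ":
            geminate = True
        elif ch == "ー" and out and out[-1][-1] in "aiueo":
            out.append(out[-1][-1])      # long-vowel mark: repeat the vowel
        elif ch in SMALL_Y and out and out[-1].endswith("i"):
            stem = out[-1][:-1]          # "ki" -> "k", "shi" -> "sh"
            glide = SMALL_Y[ch]
            out[-1] = stem + (glide[1:] if stem.endswith(("sh", "ch")) else glide)
        elif ch in BASE:
            roma = BASE[ch]
            if out and out[-1] == "n" and roma[0] in "aiueoy":
                out[-1] = "n'"           # ん before a vowel: kan'i, not kani
            if geminate:
                roma = roma[0] + roma    # small っ doubles the consonant
            geminate = False
            out.append(roma)
        else:
            geminate = False
            out.append(ch)               # leave anything else untouched
    return "".join(out)


for word in ["キャット", "キヤツト", "こーてい", "かんい", "かに"]:
    print(word, romanize_kana(word))
```

With these rules キャット and キヤツト, and かんい and かに, no longer collapse to the same string.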

pykakasi example

With pykakasi the results were as follows. The incomplete conversions seen with unidecode are improved. In addition, setMode('s', True) automatically inserts a space between words.

python


# coding: cp932
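from pykakasi import kakasi

kks = kakasi()  # separate name so the kakasi class is not shadowed

kks.setMode('H', 'a')   # hiragana -> ASCII
kks.setMode('K', 'a')   # katakana -> ASCII
kks.setMode('J', 'a')   # kanji -> ASCII
kks.setMode('E', 'a')
kks.setMode('s', True)  # insert a space between words

conv = kks.getConverter()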

text = 'Aiueo and Aiueo.'
print(conv.do(text))

text = 'Compatibility and consultation'
print(conv.do(text))

text = 'Cat and cat'
print(conv.do(text))

text = 'Files and files'
print(conv.do(text))

text = 'Kotei Kotei Kotei Kotei Kotei'
print(conv.do(text))

text = 'Tokyo Patent Approval Office'
print(conv.do(text))

text = 'Simple and crab'
print(conv.do(text))

......

aiueo, to aiueo.
aishou to soudan
kyatto to kiyatsuto
fairu to fuairu
koutei  to  koutei  to  kootei  to  kootei  to  ko ~ tei
toukyou tokkyo kyoka kyoku
kan'i to kani
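As far as I know, pykakasi 2.x deprecates the setMode / getConverter / do interface in favor of a `convert()` method that returns a list of dicts, one per chunk, with keys such as `orig`, `hira`, `kana`, and `hepburn`. A minimal sketch under that assumption follows. `join_hepburn` and `to_romaji` are hypothetical helper names, and the demo uses a hand-written stand-in for the `convert()` result so that it runs without pykakasi installed.

```python
def join_hepburn(items):
    """Join the 'hepburn' field of each converted chunk with spaces."""
    return " ".join(item["hepburn"] for item in items)


def to_romaji(text):
    # Imported lazily so join_hepburn can be used without pykakasi installed
    import pykakasi
    return join_hepburn(pykakasi.kakasi().convert(text))


# Hand-written stand-in for a convert() result, reduced to the keys used here
fake_items = [
    {"orig": "東京", "hepburn": "toukyou"},
    {"orig": "特許", "hepburn": "tokkyo"},
]
print(join_hepburn(fake_items))
```

Joining the chunks with a single space plays the role that setMode('s', True) plays in the older interface.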

pyopenjtalk example

Does OpenJTalk need to be installed for this? In this case the text seems to be broken down into phonemes rather than words. I don't know whether splitting by word or by phoneme is better (it probably depends on the model being trained).

python


import pyopenjtalk

print(pyopenjtalk.g2p("こんにちは"))
# k o N n i ch i w a
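One practical difference is that the phoneme string is space-separated, and some phonemes ("ch") are longer than one character. A front end that maps individual characters to IDs would split "ch" into "c" and "h", so the phoneme output probably needs its own token table. A minimal sketch, with hypothetical names `phoneme_tokens` and `symbol_to_id`:

```python
def phoneme_tokens(g2p_output):
    """Split a space-separated phoneme string into a list of phonemes."""
    return g2p_output.split()


phonemes = phoneme_tokens("k o N n i ch i w a")

# One ID per distinct phoneme (sorted only to make the mapping deterministic)
symbol_to_id = {p: i for i, p in enumerate(sorted(set(phonemes)))}
sequence = [symbol_to_id[p] for p in phonemes]

print(phonemes)
print(sequence)
```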

Summary

The unidecode-based conversion in the Tacotron 2 family is not suitable for Japanese input, and using transliteration_cleaners for it is a mistake. So if you want to train on Japanese data, you should write your own japanese_cleaners in text/cleaners.py. (Or perhaps prepare training data that has already been converted to romaji?)
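A minimal sketch of what such a cleaner could look like, assuming it keeps the lowercase and whitespace steps of transliteration_cleaners and only swaps out the romanization step. `make_japanese_cleaners` is a hypothetical helper, and the romanizer is passed in as a callable so that pykakasi, pyopenjtalk, or anything else can be plugged in. The demo uses a lookup table as a stand-in.

```python
import re

_whitespace_re = re.compile(r"\s+")


def make_japanese_cleaners(to_romaji):
    """Build a cleaner from any callable that maps Japanese text to romaji."""
    def japanese_cleaners(text):
        text = to_romaji(text)                # replaces convert_to_ascii
        text = text.lower()
        text = _whitespace_re.sub(" ", text)  # collapse runs of whitespace
        return text
    return japanese_cleaners


# Stand-in romanizer for the demo: a lookup table, not a real converter
demo_table = {"東京特許許可局": "Toukyou  tokkyo kyoka kyoku"}
cleaner = make_japanese_cleaners(lambda s: demo_table.get(s, s))

print(cleaner("東京特許許可局"))
```

As far as I can tell, the cleaners are looked up by name, so the resulting function would have to be available as `japanese_cleaners` in text/cleaners.py before `['japanese_cleaners']` can be used in the config.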
