This is an introductory memo on "**Cloud Text-to-Speech**", the speech synthesis service provided by Google. It explains, in as much detail as possible, the flow from enabling the API service to obtaining the credentials file and calling the service from your own program (C# or Python).
It basically follows the contents of the official "Quickstart: Using the Client Library" with screenshots of each step (note that steps that seemed unnecessary are skipped).
**Cloud Text-to-Speech** is a cloud service that generates **read-aloud audio data** (.mp3) from text data (Japanese is supported). It can produce natural-sounding speech that is quite close to a human voice. You can check the quality by entering any text (Japanese is also fine) on the official page.
Register as a user from Google Cloud (https://cloud.google.com/?hl=ja).
You can use your Google account for the free trial; however, a credit card is required at the time of registration. **The trial is not automatically converted to a paid account after the period ends**, and even if you do switch to a paid account, the fees are (in my opinion) very reasonable, so feel free to register.
- Text-to-Speech pricing
<img width="565" alt="2020-04-30_14h28_42.png" src="https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/152805/c1970995-c98a-5d21-529d-23d894916cfb.png">
Below, the explanation proceeds on the assumption that you have **registered** on Google Cloud.
Access Google Cloud Platform and log in.
A dialog will be displayed. Select "**New Project**".
Enter an appropriate project name (here, Text To Speech 20xxx) and click "**Create**".
You will be returned to the dashboard; **switch to the project you just created**.
Click the menu in the upper left and go to "**APIs and Services**" > "**Dashboard**".
Select "**Enable APIs and Services**".
Type Text to Speech in the text box.
Select **Cloud Text-to-Speech API**.
Select **Enable**.
Select "**Create Credentials**", which is required to use the service from your own program.
You will be taken to the "**Add Credentials to Project**" screen. Select "**Cloud Text-to-Speech API**" from the drop-down list, then click "**Required credentials**".
The display will change; select "**No, not in use**" and click "**Required credentials**" again.
Enter an appropriate **service account name** (here, test). **Do not select a role**. The "service account ID" is generated automatically. Select **Next**.
The following dialog will be displayed. Select "**Create without role**".
The next dialog will then be displayed, and a **JSON file** containing the credentials will be downloaded to your PC.
Suppose you rename this file to "credentials.json" and place it in "C:\Users\xxx\Desktop".
The official Quickstart explains how to register the path of this file as the **environment variable** GOOGLE_APPLICATION_CREDENTIALS and have the program pick up the credentials through that environment variable. **Here, instead, the program refers to the credentials by specifying the file path directly, without registering an environment variable.**
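For reference, the environment-variable approach looks roughly like the following sketch in C# (it assumes the client library installed in the next section; the class name EnvVarQuickStart and the file path are just examples for illustration). This article instead passes the path to the client builder directly, as shown later.

```csharp
using System;
using Google.Cloud.TextToSpeech.V1;

public class EnvVarQuickStart
{
    public static void Main(string[] args)
    {
        // Register the key file path as GOOGLE_APPLICATION_CREDENTIALS for this process.
        // (The official Quickstart sets the variable system-wide instead.)
        Environment.SetEnvironmentVariable(
            "GOOGLE_APPLICATION_CREDENTIALS",
            @"C:\Users\xxx\Desktop\credentials.json");

        // With the variable set, Create() resolves the credentials automatically,
        // so no explicit CredentialsPath is needed.
        var client = TextToSpeechClient.Create();
        Console.WriteLine(client != null ? "Client created." : "Client not created.");
    }
}
```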
Start Visual Studio, select [**File**]-[**New**]-[**Project**], and then choose "**Visual C#**"-"**Console App (.NET Core)**".
Select [**Tools**]-[**NuGet Package Manager**]-[**Package Manager Console**] from the menu. Enter `Install-Package Google.Cloud.TextToSpeech.V1 -Pre` in the console and run it.
```
PM> Install-Package Google.Cloud.TextToSpeech.V1 -Pre
```
Rewrite the contents of Program.cs as follows.

Program.cs

```csharp
using System;
using System.IO;
using Google.Cloud.TextToSpeech.V1;
using System.Diagnostics;

public class QuickStart
{
    public static void Main(string[] args)
    {
        // Build a client that reads the credentials directly from the JSON key file
        var credentialsFilePath = @"C:\Users\xxx\Desktop\credentials.json";
        var textToSpeechClientBuilder = new TextToSpeechClientBuilder()
        {
            CredentialsPath = credentialsFilePath
        };
        var client = textToSpeechClientBuilder.Build();

        // Read-aloud text settings
        SynthesisInput input = new SynthesisInput
        {
            Text = "The destination is Nihonbashi."
        };

        // Voice type settings
        VoiceSelectionParams voice = new VoiceSelectionParams
        {
            Name = "ja-JP-Wavenet-D",
            LanguageCode = "ja-JP",
            SsmlGender = SsmlVoiceGender.Neutral
        };

        // Audio output settings
        AudioConfig config = new AudioConfig
        {
            AudioEncoding = AudioEncoding.Mp3,
            Pitch = -2.0
        };

        // Send the Text-to-Speech request
        var response = client.SynthesizeSpeech(new SynthesizeSpeechRequest
        {
            Input = input,
            Voice = voice,
            AudioConfig = config
        });

        // Save the Text-to-Speech response (audio) as an MP3 file
        var fileName = DateTime.Now.ToString("yyyy-MM-dd_HHmmss") + ".mp3";
        using (Stream output = File.Create(fileName))
        {
            response.AudioContent.WriteTo(output);
            Console.WriteLine($"Audio content saved as '{fileName}'.");
        }

        Console.WriteLine("Do you want to open the folder where the file was output? [Y]/n");
        var k = Console.ReadKey();
        if (k.Key != ConsoleKey.N && k.Key != ConsoleKey.Escape)
        {
            Process.Start("explorer.exe", Directory.GetCurrentDirectory());
        }
    }
}
```
When executed, an MP3 file that says "The destination is Nihonbashi." will be generated.
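If you want to try a voice other than ja-JP-Wavenet-D, you can ask the API which voices are available and use one of the returned names in VoiceSelectionParams. A minimal sketch, reusing the same credentials file as above (the class name ListJapaneseVoices is just for illustration):

```csharp
using System;
using Google.Cloud.TextToSpeech.V1;

public class ListJapaneseVoices
{
    public static void Main(string[] args)
    {
        var builder = new TextToSpeechClientBuilder
        {
            CredentialsPath = @"C:\Users\xxx\Desktop\credentials.json"
        };
        var client = builder.Build();

        // List every voice that supports Japanese and print its name and gender
        var response = client.ListVoices(new ListVoicesRequest { LanguageCode = "ja-JP" });
        foreach (var voice in response.Voices)
        {
            Console.WriteLine($"{voice.Name} ({voice.SsmlGender})");
        }
    }
}
```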
In addition, **Speech Synthesis Markup Language** (SSML) is also supported. If you change the input as follows, it reads out "The destination is not **Nihonbashi**, but **Nipponbashi**." You can also insert **pauses** with, for example, `<break time="200ms"/>` (see the sketch after the SSML example below).
SSML format

```csharp
SynthesisInput input = new SynthesisInput {
    Ssml = "<speak>The destination is not Nihonbashi, but <sub alias='Nipponbashi'>Nihonbashi</sub>.</speak>".Replace("'", "\"")
};
```
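To insert a pause as mentioned above, the input setting in Program.cs can be replaced with something like the following sketch (the sentence and the 200 ms value are just examples):

```csharp
// Insert a 200 ms pause between "The destination is" and the place name
SynthesisInput input = new SynthesisInput {
    Ssml = "<speak>The destination is <break time=\"200ms\"/> Nihonbashi.</speak>"
};
```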
To call the service from Python as well, install the client library (the code below uses the `texttospeech.types` / `texttospeech.enums` interface of the 1.x client library):

```
pip install --upgrade google-cloud-texttospeech
```
```python
from datetime import datetime
from pytz import timezone
from google.cloud import texttospeech
from google.oauth2 import service_account

# Build a client that reads the credentials directly from the JSON key file
credentials = service_account.Credentials.from_service_account_file('credentials.json')
client = texttospeech.TextToSpeechClient(credentials=credentials)

# Read-aloud text settings
synthesis_input = texttospeech.types.SynthesisInput(
    text='The destination is Akihabara.')

# Voice type settings
voice = texttospeech.types.VoiceSelectionParams(
    language_code='ja-JP',
    name='ja-JP-Wavenet-D',
    ssml_gender=texttospeech.enums.SsmlVoiceGender.NEUTRAL)

# Audio output settings
audio_config = texttospeech.types.AudioConfig(
    audio_encoding=texttospeech.enums.AudioEncoding.MP3,
    pitch=-2.0)

# Send the Text-to-Speech request and save the response (audio) as an MP3 file
response = client.synthesize_speech(synthesis_input, voice, audio_config)
now = datetime.now(timezone('Asia/Tokyo'))
filename = now.strftime('%Y-%m-%d_%H%M%S.mp3')
with open(filename, 'wb') as out:
    out.write(response.audio_content)
    print(f'Audio content written to file {filename}')
```