[PYTHON] Speech recognition of wav files with Google Cloud Speech API Beta

*** Information as of August 2016 ***

Notes from trying speech recognition on wav files with the Google Cloud Speech API Beta.

CLOUD SPEECH API

As described on the Google Cloud Speech API Beta page, this is Google's API for speech recognition. Its advertised features:

- Supports 80 languages
- Resistant to noise
- Contextual recognition
- Device independent
- Supports both real-time input and recorded files

It seems to be an easy-to-use high-performance ASR.

Documentation

- Official documentation
- Python sample code

How to use from CLI (Google Cloud SDK + curl)

Following the Quickstart:

  1. Create a Google Cloud Platform account.
  2. Create a project and enable the Speech API.
  3. Generate a Service Account key file and download it to your machine.
  4. Install the Google Cloud SDK command line tool and use the Service Account key file above to get an authentication token.
  5. Using that authentication token, send audio data (such as a wav file prepared in advance) to the API to obtain the recognition result.

In short, you generate a Service Account key file (JSON) containing a private key and use it to obtain an authentication token each time.

From project creation to service account key file acquisition

Follow the Set Up Your Project section of the Quickstart.

However, when creating a "new service account" in step 6 (Service Account creation), there is a field called Role that the documentation does not mention, which was confusing.

After registering the Service Account you can download the JSON file, so save it wherever you like. Do not make it public, as it contains a private key.

Get an authentication token with the Google Cloud SDK

  1. Install the Google Cloud SDK so that you can run the `gcloud` command.
  2. Activate the Service Account with the key file obtained above, then print an authentication token:
$ gcloud auth activate-service-account --key-file=<path to Service Account key file>
$ gcloud auth print-access-token

Make a note of the authentication token that is returned.

API call with Curl

Create `sync-request.json` as described in Make a Speech API Request in the Quickstart:

sync-request.json


{
  "config": {
      "encoding":"FLAC",
      "sample_rate": 16000
  },
  "audio": {
      "uri":"gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}

In the directory containing sync-request.json, run:

$ curl -s -k -H "Content-Type: application/json" \
    -H "Authorization: Bearer <authentication token obtained above>" \
    https://speech.googleapis.com/v1beta1/speech:syncrecognize \
    -d @sync-request.json

If all goes well, the recognition result is returned as JSON.
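For reference, the same request can be put together in Python using only the standard library. This is a minimal sketch, assuming Python 3; the function name `build_sync_request` and the placeholder token are made up for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request

SPEECH_URL = "https://speech.googleapis.com/v1beta1/speech:syncrecognize"


def build_sync_request(token, body):
    """Build (but do not send) the POST request that the curl command makes."""
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


# Same content as sync-request.json above.
body = {
    "config": {"encoding": "FLAC", "sample_rate": 16000},
    "audio": {"uri": "gs://cloud-samples-tests/speech/brooklyn.flac"},
}
request = build_sync_request("ACCESS_TOKEN_PLACEHOLDER", body)

# Sending is left commented out because it needs a real token:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read().decode("utf-8")))
```

Keeping the request construction separate from the sending step makes it easy to inspect the headers and body before spending an API call.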

Specifying the audio data and recognition settings

The location and format of the input file are specified as JSON in the request body (`sync-request.json` in the example above). That example uses a sample FLAC file already stored in Google Cloud Storage, but you can of course send audio data from your own machine, and encodings other than FLAC are also supported.

Sending your own audio file

As described for syncrecognize in the REST API reference, the `config` field of the request body specifies the audio source and recognition settings, and the `audio` field specifies the audio data.

As described in [RecognitionAudio](https://cloud.google.com/speech/reference/rest/v1beta1/RecognitionAudio), the audio is given as either `uri` or `content`. To send an audio file from your own machine, encode it as a Base64 string and send it as `content`.



The sample specifies FLAC encoding and a sampling rate of 16000 (16 kHz), so change these values to match the audio data you actually send.
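The Base64 step can be sketched as follows. This is only an illustration, assuming Python 3: `build_content_request` is a hypothetical helper name, the field names follow `sync-request.json` above, and the four bytes stand in for a real audio file.

```python
import base64


def build_content_request(audio_bytes, encoding="FLAC", sample_rate=16000):
    """Return a request body that embeds the audio as a Base64 string."""
    return {
        # encoding and sample_rate must match the audio actually sent
        "config": {"encoding": encoding, "sample_rate": sample_rate},
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


# Stand-in bytes; in practice something like open(path, "rb").read()
fake_audio = b"\x00\x01\x02\x03"
body = build_content_request(fake_audio)
print(body["audio"]["content"])
```

The `.decode("ascii")` matters because `base64.b64encode` returns bytes, which `json.dumps` cannot serialize directly.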

## Use Speech API with python

As the [Tutorial](https://cloud.google.com/speech/docs/rest-tutorial) shows, you can also call the Speech API from Python instead of using the `gcloud` command plus curl (a Node.js sample is also available).
This procedure doesn't require the Google Cloud SDK, but it does require the [Google API Client Library](https://developers.google.com/api-client-library/python/start/installation). I assumed I wouldn't need a library because curl was enough, but the [API Discovery Service](https://developers.google.com/discovery/) and the Google API Client Library are used to obtain the authentication token. If you don't need these, you can call the API without a library in the same way as the curl example above.

### Get Service Account key file

Same as steps 1-3 of the CLI procedure above.

### Application Default Credential settings

The procedure follows the [Sample Code](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/api/speech_rest.py), but the path to the Service Account key file used to obtain the authentication token must be set in the environment variable `GOOGLE_APPLICATION_CREDENTIALS` beforehand:

`$ export GOOGLE_APPLICATION_CREDENTIALS=<path to Service Account key file>`

When the sample obtains the authentication token, its `GoogleCredentials.get_application_default().create_scoped()` call picks this file up as the [Application Default Credential](https://cloud.google.com/speech/docs/common/auth#authenticating_with_application_default_credentials).
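Before running the sample, it can help to confirm that the environment variable really points at a usable key file. The sketch below is a hypothetical pre-flight check, not part of the sample code, assuming Python 3. The field names `type`, `client_email`, and `private_key` are assumptions about the key file layout, and the function returns only the non-secret fields.

```python
import json
import os


def load_key_file_info(env=os.environ):
    """Return non-secret fields of the key file named by the environment variable."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path, encoding="utf-8") as f:
        key = json.load(f)
    if "private_key" not in key:
        raise RuntimeError("no private_key field in " + path)
    # Deliberately leave the private key out of the returned value.
    return {"type": key.get("type"), "client_email": key.get("client_email")}


# Usage (raises RuntimeError if the variable is unset or the file looks wrong):
# print(load_key_file_info())
```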

### API call

Follow the [Sample Code](https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/api/speech_rest.py). Running the command below displays the recognition result:

`$ python speech_rest.py <audio file>.wav`

Caution

- When recognizing Japanese speech, change `languageCode` in the body from `en-US` to `ja-JP`.
- If you want to send FLAC-encoded data, set `encoding` in the body to `FLAC`.
- The sample only calls `json.dumps()` on the recognition result, so Japanese results need extra handling to be displayed correctly.
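One way to handle the Japanese display issue is the `ensure_ascii=False` option of `json.dumps`. The following is a small sketch, assuming Python 3; the response dictionary is a made-up stand-in, and its shape is an assumption rather than something taken from the API reference.

```python
import json

# Hypothetical recognition result; the real response shape may differ.
response = {"results": [{"alternatives": [{"transcript": "こんにちは"}]}]}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(response)

# With ensure_ascii=False the Japanese text is kept as-is.
readable = json.dumps(response, ensure_ascii=False, indent=2)

print(escaped)
print(readable)
```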

This sample handles a single input file, so if you want to recognize multiple files it seems better not to repeat the API Discovery call and token acquisition for each one.
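The idea of doing the setup once can be sketched like this. Everything here is hypothetical: `recognize_all`, `fake_make_client`, and `fake_recognize` are illustrative names, and the stand-in functions replace the real discovery, token, and recognition calls so that the sketch runs without network access.

```python
def recognize_all(paths, make_client, recognize):
    """Build the client once, then reuse it for every file in paths."""
    client = make_client()  # the costly part: discovery + token acquisition
    return {path: recognize(client, path) for path in paths}


# Stand-ins so the sketch runs offline.
setup_calls = []


def fake_make_client():
    setup_calls.append(1)
    return "client"


def fake_recognize(client, path):
    return "transcript of " + path


results = recognize_all(["a.wav", "b.wav"], fake_make_client, fake_recognize)
print(results, "setup calls:", len(setup_calls))
```

In a real script, `make_client` would wrap whatever the sample does to build its service object, and `recognize` would wrap the per-file request.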

The authentication token appears to be renewed fairly often, so you also need to handle reacquiring it. During testing a 401 suddenly started coming back (after roughly 15-30 minutes, in my experience), and the cause turned out to be that the token had been renewed.
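A simple way to cope with this is to fetch a fresh token and retry once when a 401 comes back. The sketch below shows only the retry logic under stated assumptions: `Unauthorized`, `call_with_token_refresh`, and the stand-in functions are hypothetical, and in real code the exception would be whatever your HTTP layer raises for a 401.

```python
class Unauthorized(Exception):
    """Hypothetical error standing in for an HTTP 401 response."""


def call_with_token_refresh(get_token, call_api, max_attempts=2):
    """Call the API, fetching a fresh token and retrying when a 401 occurs."""
    token = get_token()
    for attempt in range(max_attempts):
        try:
            return call_api(token)
        except Unauthorized:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            token = get_token()


# Stand-ins: the first token is treated as expired.
tokens = iter(["expired-token", "fresh-token"])


def fake_call_api(token):
    if token == "expired-token":
        raise Unauthorized()
    return "recognized with " + token


result = call_with_token_refresh(lambda: next(tokens), fake_call_api)
print(result)
```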

Usability

Sorry that this is not quantitative:

- Recognition takes a little while (about 2-4 seconds?).
- Recognition accuracy is quite high. Even with fairly loud noise (music playing near the microphone), it picks up the speech properly. Getting this accuracy without configuring anything is impressive.
  - I'd like to try what happens when the noise is a human voice.
- I haven't tried the context-related options, so I'd like to use them in the future.
- The QuickStart says **Learn in 5 minutes**, but 5 minutes was completely impossible for me, which made me sad.
