[PYTHON] Speech transcription procedure using Google Cloud Speech API

Purpose

When I attend English-language lectures and conferences, I want to keep the recorded audio as text data. An article on transcribing audio with the Google Cloud Speech API (items/659bde4cdc8ce5c78e29) was helpful as one way to do this, so I reorganize the procedure below as a procedure memo.

Advance preparation

This procedure uses Google Cloud Platform, so it assumes that you have already created a project by completing the common service section (P9-P20) of the Google Cloud Platform Easy Startup Guide.
It also assumes that you have already prepared a sound source converted to the following format (reference: voice conversion site; sound source actually used: PyConJP2017 English keynote speech).
- FLAC
- Monaural
- 16000Hz
- 16bit
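
As a quick sanity check before uploading, the format listed above can be verified from the file header. The following is a minimal sketch for Python 3, assuming the file starts with the `fLaC` marker followed directly by the STREAMINFO metadata block (no ID3 tag in front). The function names `read_flac_streaminfo`, `matches_expected_format`, and `check_file` are hypothetical helpers for illustration and are not part of any Google library.

```python
def read_flac_streaminfo(data):
    """Parse sample rate, channel count and bit depth from the first bytes of a FLAC file."""
    if len(data) < 26 or data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing fLaC marker)")
    # Byte 4 holds a 1-bit "last block" flag and a 7-bit block type; type 0 is STREAMINFO.
    if data[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 8-17 hold block and frame sizes. Bytes 18-25 pack, from the top bit down:
    # 20 bits sample rate, 3 bits (channels - 1), 5 bits (bits per sample - 1),
    # 36 bits total sample count.
    packed = int.from_bytes(data[18:26], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_expected_format(info):
    """True if the stream is monaural, 16000 Hz, 16 bit, as assumed in this procedure."""
    return (
        info["sample_rate"] == 16000
        and info["channels"] == 1
        and info["bits_per_sample"] == 16
    )


def check_file(path):
    """Read only the header of the file at `path` and check it against the expected format."""
    with open(path, "rb") as f:
        return matches_expected_format(read_flac_streaminfo(f.read(42)))
```

Calling `check_file` on the converted sound source should return `True` if the conversion produced the format above; a `ValueError` suggests the file is not a plain FLAC stream.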

Procedure

1. Enter the console screen

Go to the Google Cloud Platform URL (https://cloud.google.com/?hl=ja) and press Open Console to enter the console screen.

(Screenshot: Google Cloud Platform console login screen)

(Screenshot: console screen)

2. Enable the Google Cloud Speech API

Select [Tools & Services] > [APIs & Services] > [Library] at the top left of the console screen, select Speech API from the list of APIs, and press [Enable] to enable the Google Cloud Speech API.


(Screenshot: API list screen)

(Screenshot: enabling the API; [Disable] is displayed because it is already enabled)

You can confirm that the Google Cloud Speech API is enabled under [APIs & Services] > [Dashboard].

3. Create API credentials (create service account key)

Select [APIs & Services] > [Credentials] > [Create Credentials] > [Service Account Key] on the left, set an appropriate [Service Account Name] (arkbbb is assumed here), and press the [Create] button to download the JSON file.

(Screenshot: service account key creation screen)

4. API authentication with Google Cloud Shell (service account key JSON upload & environment variable registration)

Start Google Cloud Shell with the Google Cloud Shell button at the top right of the Google Cloud Platform console screen, upload the JSON file obtained in step 3, and register its path in an environment variable.

(Screenshot: Google Cloud Shell button)

(Screenshot: JSON upload)

`Environment variable setting command`


$ export GOOGLE_APPLICATION_CREDENTIALS=[name of the JSON file obtained in step 3].json
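
If the transcription later fails with an authentication error, it can help to confirm that the variable is set and points at a service account key. The following is a minimal sketch; `check_credentials` is a hypothetical helper, and it assumes the key file follows the usual service account layout with `type` and `client_email` fields.

```python
import json
import os


def check_credentials(env=None):
    """Return the service account e-mail if the credentials variable points at a usable key file."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("credentials file not found: {}".format(path))
    with open(path) as f:
        key = json.load(f)
    # Assumption: service account keys carry "type": "service_account".
    if key.get("type") != "service_account":
        raise RuntimeError("not a service account key file: {}".format(path))
    return key.get("client_email")
```

Running `print(check_credentials())` in the same Cloud Shell session should print the e-mail address of the service account created in step 3.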

5. Upload voice data

Upload the prepared audio data to Google Cloud Storage. First, select [Tools & Services] > [Storage] > [Browser] at the top left of the screen and create a bucket with [Create Bucket]. Then double-click the created bucket and upload the audio data with [Upload File].

(Screenshot: going to the Google Cloud Storage screen)

(Screenshot: creating a bucket; the bucket name and other settings are arbitrary)

(Screenshot: uploading files into your bucket)
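
The upload can also be done from Python instead of the console. The sketch below assumes that the `google-cloud-storage` package is installed, that the credentials from step 4 are active, and that the bucket already exists. The helper names `gcs_uri` and `upload_audio` are hypothetical. The returned `gs://` URI is the form of path that the transcription script takes as its argument.

```python
def gcs_uri(bucket_name, object_name):
    """Build the gs:// URI that identifies an object in a bucket."""
    return "gs://{}/{}".format(bucket_name, object_name)


def upload_audio(bucket_name, local_path, object_name):
    """Upload a local audio file to an existing bucket and return its gs:// URI."""
    # Imported here so that gcs_uri() works even without the library installed.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, object_name)
```

For example, `upload_audio("my-bucket", "keynote.flac", "keynote.flac")` would return `gs://my-bucket/keynote.flac`, where both names are placeholders.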

6. Create the Python script for transcription

Create the Python script that runs the transcription on Google Cloud Shell.

`Python file editing command (use any editor you like)`


$ nano transcribe.py

Python script for transcription (for English voice):

`transcribe.py`


#!/usr/bin/env python
# coding: utf-8
import argparse
import codecs
import datetime

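def transcribe_gcs(gcs_uri):
    """Transcribe the FLAC audio at the given gs:// URI and write the text to a file."""
    # NOTE: the enums and types modules exist in google-cloud-speech 1.x;
    # they were removed in the 2.0 release of the library.
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()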

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        sample_rate_hertz=16000,
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        language_code='en-US')

    # Start recognition as a long-running (asynchronous) operation.
    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    # Block until the server-side operation has finished.
    operationResult = operation.result()

    # Write every alternative of every result to a timestamped output file.
    d = datetime.datetime.today()
    today = d.strftime("%Y%m%d-%H%M%S")
    fout = codecs.open('output{}.txt'.format(today), 'a', 'shift_jis')

    for result in operationResult.results:
        for alternative in result.alternatives:
            fout.write(u'{}\n'.format(alternative.transcript))
    fout.close()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument(
        'path', help='GCS path for audio file to be recognized')
    args = parser.parse_args()
    transcribe_gcs(args.path)

If you want to transcribe Japanese, modify the following line:

language_code='en-US')

↓

language_code='ja-JP')
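
The script in step 6 imports `enums` and `types`, which were removed in the 2.0 release of the `google-cloud-speech` package. For a newer installation, the same request can be sketched as follows. This is an assumption-laden sketch and not the original author's code: the names `transcribe_gcs_v2` and `collect_transcripts` and the `timeout` default are hypothetical, and unlike the original script it keeps only the top alternative of each result.

```python
def collect_transcripts(results):
    """Return the top alternative's transcript for each recognition result."""
    return [r.alternatives[0].transcript for r in results if r.alternatives]


def transcribe_gcs_v2(gcs_uri, language_code="en-US", timeout=3600):
    """Transcribe 16000 Hz FLAC audio stored in GCS with google-cloud-speech 2.x."""
    # Assumption: google-cloud-speech >= 2.0 is installed.
    from google.cloud import speech

    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code=language_code,  # 'ja-JP' for Japanese
    )
    # In 2.x the request fields are passed as keyword arguments.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=timeout)
    return collect_transcripts(response.results)
```

The result is a list of strings, one per recognized segment, which can be written to a file in the same way as in the original script.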

7. Performing voice transcription

Run the transcription with the following command on Google Cloud Shell.

$ python transcribe.py gs://[bucket name]/[audio file name].flac

8. Execution result

After execution, run the ls command on Google Cloud Shell to confirm that a text file named output*.txt has been created, then open it to check the result. The result for the first 1-2 minutes is shown below. If you listen to the sound source while reading it, you will find some mistakes, but you can see that most of the speech is transcribed.

and not.
 We have just attended this big Tatum Outlet
 and we held a pydata event it was actually the first I did it
 and some of these slides are actually problem, says talk to and so at strata we saw many people talking about the Duke talking about Big Data there were looking at using Java in a management
 and there was a whole lot of our versus Python language rewards on Facebook
 the Travis and I were not content with the state of things we saw that python to play a very significant role Travis made the slide that's from The Little Prince that shows a snake swallowing the open
 he was also talking about using compilers make python faster
 it was also not that pilot event that we were very fortunate to have weido been awesome stopping by and we talked to him about things like the matrix multiplication operator we talked about coding expressions and things like that
 and so this actually his picture show does Travis and West McKinney who's the greater pandas and Guido van Rossum
 add
 and we ask we don't fix the packaging problem he told us that we should do it ourselves
 and so we did and that's how it came up with Honda and Anaconda which I think quite elegantly solves the difficult packaging problems for the Scientific Games
 so we accepted the challenge and so for those who don't know what Anaconda is very quickly I'll give you it is basically a very simple way and very reliable way to get final versions of many very popular typical to build packages in libraries in the python ecosystem

By the way, the actual result data is here
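
To inspect the result from Python instead of opening the file by hand, the newest output file can be located by name, since the script embeds a sortable timestamp in each file name. This is a minimal sketch; `read_latest_output` is a hypothetical helper, and it assumes the file was written with the `shift_jis` encoding used by the script above.

```python
import codecs
import glob
import os


def read_latest_output(directory="."):
    """Return (path, non-empty lines) for the newest output*.txt, or (None, []) if none exists."""
    paths = sorted(glob.glob(os.path.join(directory, "output*.txt")))
    if not paths:
        return None, []
    # Names look like outputYYYYMMDD-HHMMSS.txt, so the last one in sorted order is the newest.
    latest = paths[-1]
    with codecs.open(latest, "r", "shift_jis") as f:
        lines = [line.strip() for line in f if line.strip()]
    return latest, lines
```

Each returned line corresponds to one transcript written by the script.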

Important points

- Note that free usage is limited (up to 60 minutes of audio per month); usage beyond that is charged. See the detailed reference.
- Transcription takes about 7-8 minutes per 25 minutes of audio (the audio data above is about 25 minutes long).
- As the results above show, the audio is simply converted to text, so the output does not appear to be divided into paragraphs.
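
The two figures above (60 free minutes per month, and about 7-8 minutes of processing per 25 minutes of audio) allow a rough estimate before submitting a batch of recordings. The sketch below assumes that processing time scales linearly with audio length, which is an extrapolation from the single measurement in this memo. The function names are hypothetical, and the price of usage beyond the free quota is deliberately left out because it is not stated here.

```python
FREE_MINUTES_PER_MONTH = 60.0  # free quota stated in this memo; may change


def estimate_processing_minutes(audio_minutes):
    """Return a (low, high) estimate of processing time, scaled from 7-8 min per 25 min."""
    return (audio_minutes * 7.0 / 25.0, audio_minutes * 8.0 / 25.0)


def minutes_beyond_free_quota(audio_minutes_list):
    """Return how many minutes of this month's audio would exceed the free quota."""
    return max(0.0, sum(audio_minutes_list) - FREE_MINUTES_PER_MONTH)
```

For instance, three 25-minute recordings in one month total 75 minutes, so `minutes_beyond_free_quota([25, 25, 25])` reports 15 minutes above the free quota.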

Reference

An article on transcribing audio with the Google Cloud Speech API was helpful as relatively recent information (as of September 2017).

Impressions

I confirmed that the Google Cloud Speech API can transcribe (English) speech with a reasonable degree of accuracy (based on only one example).
Once you are used to Google Cloud Platform itself, the procedure is not much trouble, but explaining it from scratch as above gave me the impression that there are many steps.
The Google Cloud Speech API itself is a beta version, so usage limits and pricing plans may change in the future; it seems best to check them each time you use it.