[PYTHON] Google Cloud Speech API vs. Amazon Transcribe

A head-to-head showdown of transcription APIs

スクリーンショット 2020-01-10 23.02.46.png

Most of the "transcription API comparison" articles you can find with a quick search judge quality by transcribing only a few short lines (or a few minutes of audio), or they target unrealistically clean sound sources such as news videos. The blog post that made Amazon Transcribe go viral also reported high-accuracy transcription for English. English is known to be well served in the natural language processing field, but I was curious how well these APIs handle Japanese.

What I wanted to know is how far the APIs alone can go (i.e., whether they can produce usable transcriptions) on audio with the following characteristics:

- a Japanese sound source
- a somewhat noisy sound source recorded by amateurs, like a podcast
- a long sound source of about 1 hour
- a sound source in which multiple speakers talk over each other

So I ran a series of experiments to find out.

In the first article, I summarized how to use the Google Cloud Speech API and hypothesized that its transcription accuracy would be low: "Speech transcription procedure using Python and the Google Cloud Speech API".

In the second article, I experimented with preprocessing methods to improve transcription accuracy with the Google Cloud Speech API: "Investigation of the relationship between speech preprocessing and transcription accuracy in the Google Cloud Speech API".

This time, I would like to pit the best Google Speech API result obtained last time against Amazon Transcribe, which has recently become a hot topic, and summarize the **current limits** of transcription APIs.

If you only want the results, read just the "Google Cloud Speech API vs. Amazon Transcribe Results Summary" section at the bottom.

***Note: the conclusions here reflect only the audio data and preprocessing used in this experiment. Please understand that these results are not a definitive judgment of either API's performance.***

Transcription on Amazon Transcribe

Amazon Transcribe, Amazon's automatic transcription API, has been around for a while, but it only gained Japanese support at the end of November 2019.

It is much easier to use than the Google Speech API, so I will omit the setup steps here; see the official tutorial, Classmethod's blog (/cloud/aws/amazontranscribe-japanese/), and so on.

Needless to say, the transcription here targets Japanese.

The file formats Amazon Transcribe can handle are mp3, mp4, wav, and flac. This is a nice point, since the Google Speech API would not accept common formats such as mp3 and wav. The audio sampling rate is also detected automatically, so there is no need to specify it manually as with Google. Convenient.

Incidentally, Amazon Transcribe also lets you specify optional parameters in addition to the required ones.

スクリーンショット 2020-01-04 18.09.39.png

To summarize briefly

The sound source used this time has two speakers, so "speaker identification" should be set to 2, but that appears to be the default, so I did not specify it explicitly and ran everything with the defaults (no optional parameters at all).
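For reference, a job submission via the SDK would look roughly like the sketch below, whether you ultimately drive it from the console or from code (a minimal boto3 sketch; the bucket, object key, and job name are hypothetical placeholders, and the commented-out `Settings` block shows where speaker identification could be set to 2):

```python
import time

import boto3

# NOTE: the bucket, key, and job names below are hypothetical placeholders.
transcribe = boto3.client("transcribe", region_name="ap-northeast-1")

transcribe.start_transcription_job(
    TranscriptionJobName="podcast-transcription-no2",
    LanguageCode="ja-JP",   # Japanese support was added at the end of November 2019
    MediaFormat="flac",     # mp3 / mp4 / wav / flac are accepted
    Media={"MediaFileUri": "s3://my-audio-bucket/02_001_NoiRed-true_lev-true_samp44k.flac"},
    # Optional parameters; the runs in this article used the defaults (nothing specified).
    # Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
)

# Poll until the job finishes (the 1-hour source here took roughly 10 minutes).
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="podcast-transcription-no2")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)

if status == "COMPLETED":
    # URL of the JSON transcript produced by Transcribe
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```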

Processing a 1-hour sound source took about 10 minutes. The Google Speech API took a little over 15 minutes, so Amazon Transcribe is faster.

Validation dataset and evaluation method

As last time, I use the No. 1 to No. 8 sound sources (flac files) created by combining various preprocessing parameters. The sound source data is available here, so feel free to use it if you like.

The Amazon Transcribe runs target the same files and use the defaults without any optional parameters, so it should be fair to compare them on the same footing as the Google Speech API results.

Amazon Transcribe outputs its transcription result as JSON. The numbers of characters and words were counted in the same way as last time. (Click here for the actual processing code.)
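The linked code has the details; as a rough illustration of the idea, the Transcribe JSON can be reduced to plain text and then tokenized with a Japanese morphological analyzer, for example janome (this is my own sketch, not the linked code, and the input file name is a placeholder):

```python
import json
from collections import Counter

from janome.tokenizer import Tokenizer  # any Japanese morphological analyzer would do

# "transcribe_output.json" is a placeholder for the JSON downloaded from Transcribe.
with open("transcribe_output.json", encoding="utf-8") as f:
    result = json.load(f)

# The full transcript lives under results -> transcripts -> [0] -> transcript.
text = result["results"]["transcripts"][0]["transcript"]
print("characters transcribed:", len(text))

tokenizer = Tokenizer()
nouns = [token.surface for token in tokenizer.tokenize(text)
         if token.part_of_speech.startswith("名詞")]  # keep noun tokens only

print("noun words (with duplicates):   ", len(nouns))
print("noun words (without duplicates):", len(set(nouns)))

# Occurrence counts, reused later for the frequent-noun comparison.
noun_counts = Counter(nouns)
```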

To allow an apples-to-apples comparison, the evaluation method is the same as in the previous article.

Results

Quantitative results

| No. | File name | Noise reduction | Volume adjustment | Sample rate | Characters transcribed | Total words (with duplicates) | Noun words (with duplicates) | Total words (without duplicates) | Noun words (without duplicates) |
|----|----|----|----|----|----|----|----|----|----|
| 1 | 01_001_NoiRed-true_lev-true_samp16k.flac | True | True | 16k | 19320 | 10469 | 3150 | 1702 | 1057 |
| 2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 19317 | 10463 | 3152 | 1708 | 1060 |
| 3 | 03_001_NoiRed-true_lev-false_samp16k.flac | True | False | 16k | 19278 | 10429 | 3166 | 1706 | 1059 |
| 4 | 04_001_NoiRed-true_lev-false_samp44k.flac | True | False | 44k | 19322 | 10453 | 3170 | 1706 | 1058 |
| 5 | 05_001_NiRed-false_lev-true_samp16k.flac | False | True | 16k | 19660 | 10664 | 3209 | 1713 | 1054 |
| 6 | 06_001_NiRed-false_lev-true_samp44k.flac | False | True | 44k | 19653 | 10676 | 3211 | 1701 | 1052 |
| 7 | 07_001_NiRed-false_lev-false_samp16k.flac | False | False | 16k | 19639 | 10653 | 3209 | 1702 | 1052 |
| 8 | 08_001_NiRed-false_lev-false_samp44k.flac | False | False | 44k | 19620 | 10638 | 3213 | 1702 | 1047 |

The corresponding figures are shown below.

スクリーンショット 2020-01-10 23.07.02.png スクリーンショット 2020-01-10 23.07.30.png スクリーンショット 2020-01-10 23.07.40.png

The results were almost identical across all the samples. What the overall results tell us is:

- Unlike with the Google Speech API, Amazon Transcribe is **not affected by audio preprocessing** (*the preprocessing performed here was noise reduction with Audacity and volume adjustment with Levelator; results may differ under other conditions*).

Here, the No. 2 result, which has the highest **number of noun words without duplication** (although the differences are practically within the margin of error), is taken as the representative best result for Amazon Transcribe.

Since preprocessing does not seem to matter, I also ran Amazon Transcribe on the **raw recorded wav file (No. 0)** and got the following result.

The only difference between this completely unprocessed wav file and the No. 2 file is **stereo vs. monaural**: for No. 2, the stereo-to-monaural conversion was done at the same time as the wav-to-flac conversion. This was necessary because the Google Speech API only accepts monaural files.
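For reference, this kind of stereo-to-mono plus wav-to-flac conversion can be done in a few lines, for example with pydub (a minimal sketch assuming ffmpeg is installed; the output file name is a placeholder):

```python
from pydub import AudioSegment  # requires ffmpeg (or libav) on the system

# Load the raw stereo recording (No. 0)...
audio = AudioSegment.from_wav("001.wav")

# ...down-mix to mono, as required by the Google Speech API...
mono = audio.set_channels(1)

# ...optionally resample (e.g. to 16 kHz) and export as flac.
mono.set_frame_rate(16000).export("001_mono_16k.flac", format="flac")
```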

| No. | File name | Noise reduction | Volume adjustment | Sample rate | Characters transcribed | Total words (with duplicates) | Noun words (with duplicates) | Total words (without duplicates) | Noun words (without duplicates) |
|----|----|----|----|----|----|----|----|----|----|
| 0 | 001.wav | False | False | 44k | 19620 | 10637 | 3212 | 1701 | 1046 |
| 2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 19317 | 10463 | 3152 | 1708 | 1060 |

Strictly speaking, the "total words without duplicates" and "noun words without duplicates" are higher for No. 2, but the difference is small. If you can get almost the same accuracy without preprocessing or stereo-to-monaural conversion, the simplest option is to feed in the raw wav file, which requires no preprocessing at all.

Since the Amazon Transcribe results were almost identical from No. 1 through No. 8, I will skip the qualitative comparison among them and instead compare the "best result of the Google Cloud Speech API" against the "best result of Amazon Transcribe".

Google Cloud Speech API vs. Amazon Transcribe

I compare the figures for the Google Cloud Speech API (best result, No. 8) confirmed last time with those for Amazon Transcribe (best result, No. 2) confirmed this time. The Google figures are taken from the previous results.

Quantitative result comparison

Transcription character count comparison

スクリーンショット 2020-01-10 23.13.45.png

In terms of sheer volume, Amazon Transcribe transcribed more characters from the 1-hour sound source.

Word count comparison

スクリーンショット 2020-01-10 23.13.55.png

Perhaps because of the larger transcription volume, the total numbers of words and of nouns with duplicates were also higher for Amazon Transcribe.

On the other hand, the total numbers of words and of nouns excluding duplicates are almost the same. Roughly as expected...

Qualitative result comparison

As last time, I will roughly compare what the transcribed text actually looks like.

Transcription result

The images are arranged side by side so that the beginning of the transcription is easy to compare; Google is on the left and Amazon is on the right.

スクリーンショット 2020-01-10 23.15.54.png

It is hard to judge, but I feel Google's transcription is still slightly better, though honestly it is a contest between two equally mediocre results. (*In the image only the Google result contains line breaks, but both Google and Amazon originally produce transcripts with line breaks; since the accuracy of those breaks is questionable, I removed the line breaks from both in post-processing.)

Frequent nouns

Let's compare the "noun words without duplication" and their occurrence counts for Google and Amazon, displaying the words that appeared 11 times or more.

スクリーンショット 2020-01-10 23.16.05.png
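A comparison like this can be produced with a few lines of set arithmetic on the noun counts (a small sketch, assuming per-API Counters built in the same way as the counting sketch above):

```python
from collections import Counter
from typing import Set


def frequent_nouns(counts: Counter, min_count: int = 11) -> Set[str]:
    """Return the nouns that appeared at least min_count times."""
    return {word for word, count in counts.items() if count >= min_count}


def compare_frequent_nouns(google_counts: Counter, amazon_counts: Counter) -> None:
    google_frequent = frequent_nouns(google_counts)
    amazon_frequent = frequent_nouns(amazon_counts)
    print("recognized by Google only:", sorted(google_frequent - amazon_frequent))
    print("recognized by Amazon only:", sorted(amazon_frequent - google_frequent))
    print("recognized by both       :", sorted(google_frequent & amazon_frequent))
```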

Amazon appears to recognize more words. However, most of the words are shared between Google and Amazon, which also shows that the transcription performance of the two is not significantly different. It is also nice that Amazon's result picks up our company name, "BrainPad".

If you want to recognize as many words as possible (at least for this audio data), Amazon seems to be the better choice. (Whether each recognized word is actually meaningful needs to be checked.)

Word cloud

Continuing along the same lines, here is a word cloud of the nouns, visualizing the data above. Google is on the left and Amazon is on the right.

スクリーンショット 2020-01-10 23.16.15.png
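A word cloud like this can be generated from the noun counts with the wordcloud package (a minimal sketch with dummy data; a font that supports Japanese must be supplied, and the font path below is just an example):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Dummy noun -> occurrence counts; in practice, use the Counter built from each transcript.
noun_counts = {"文字起こし": 42, "API": 35, "音源": 20, "ブレインパッド": 12}

wc = WordCloud(
    font_path="/Library/Fonts/Arial Unicode.ttf",  # a Japanese-capable font is required
    background_color="white",
    width=800,
    height=400,
).generate_from_frequencies(noun_counts)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```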

Google Cloud Speech API vs. Amazon Transcribe Results Summary

Comparing the Google Cloud Speech API and Amazon Transcribe, the results were:

- For both Google and Amazon, **practical Japanese transcription does not seem feasible** (*at least when simply calling the API with light preprocessing on this audio data).
- **The results are almost identical when compared by the number of words transcribed.**
- Amazon Transcribe achieved the same accuracy as Google without any preprocessing, so **Amazon Transcribe wins on convenience**.
- If you operate the console in the browser and transcribe via the GUI, rather than installing an SDK and calling the API from the CLI (which is how almost all non-engineers would use these services), **Amazon Transcribe wins on ease of use, no question**; frankly, the Google API is too difficult for non-engineers. (Incidentally, transcription via Google Docs voice input has apparently become popular among non-engineers recently.)
- **Processing time is somewhat faster with Amazon Transcribe**: for a 1-hour file, Google takes a little over 15 minutes, while Amazon takes about 10 minutes.

My personal take:

For Japanese transcription, **both are far from a practical level of accuracy**, so my impression is that these transcription APIs can only really be used for **word extraction**. (And even if words can be extracted, there are very few uses for that alone...)

And if the use is limited to word extraction, my personal conclusion is that **Amazon Transcribe is the better choice**, since it works without preprocessing, is easy to use via the GUI, and processes files faster.

I have not given up on the possibility that transcription accuracy would improve with clearer audio recorded on better equipment (i.e., by improving the quality of the input), but my recording setup (an external microphone costing around 16,000 yen) is already about as much as an ordinary user would prepare, so my conclusion is that "fast, cheap Japanese transcription using only an API" is not achievable with current technology. It seems Japanese transcription will not be solved overnight.

This ends on a somewhat inconclusive note, so if you know any tricks along the lines of "do this and the transcription/recognition will work", please leave a comment!
