[PYTHON] Google Cloud Speech API vs. Amazon Transcribe

A head-to-head showdown of transcription APIs

スクリーンショット 2020-01-10 23.02.46.png

Most of the "transcription API comparison" articles you can find with a quick search judge quality by transcribing only a few short lines (or a few minutes of audio), or they target unrealistically clean sound sources such as news videos. The blog post that made Amazon Transcribe go viral also reported high-accuracy transcription for English. English is known to be well served in the natural language processing field, but I was curious how well these APIs handle Japanese.

What I wanted to know is how far the APIs alone can go (i.e., whether they can produce usable transcriptions) on audio with the following characteristics:

- a Japanese sound source
- a somewhat noisy sound source recorded by amateurs, like a podcast
- a long sound source of about 1 hour
- a sound source in which multiple speakers talk over each other

So I ran a series of experiments to find out.

In the first article, I summarized how to use the Google Cloud Speech API and hypothesized that its transcription accuracy would be low: "Speech transcription procedure using Python and the Google Cloud Speech API".

In the second article, I experimented with preprocessing methods to improve transcription accuracy with the Google Cloud Speech API: "Investigation of the relationship between speech preprocessing and transcription accuracy in the Google Cloud Speech API".

This time, I would like to pit the best Google Speech API result obtained last time against Amazon Transcribe, which has recently become a hot topic, and summarize the **current limits** of transcription APIs.

If you only want the results, read just the "Google Cloud Speech API vs. Amazon Transcribe Results Summary" section at the bottom.

***Note: the conclusions here reflect only the audio data and preprocessing used in this experiment. Please understand that these results are not a definitive judgment of either API's performance.***

Transcription on Amazon Transcribe

Amazon Transcribe, Amazon's automatic transcription API, has been around for a while, but it only gained Japanese support at the end of November 2019.

It is much easier to use than the Google Speech API, so I will omit the setup steps here; see the official tutorial, Classmethod's blog (/cloud/aws/amazontranscribe-japanese/), and so on.

Needless to say, the transcription here targets Japanese.

The file formats Amazon Transcribe can handle are mp3, mp4, wav, and flac. This is a nice point, since the Google Speech API would not accept common formats such as mp3 and wav. The audio sampling rate is also detected automatically, so there is no need to specify it manually as with Google. Convenient.

Incidentally, Amazon Transcribe also lets you specify optional parameters in addition to the required ones.

スクリーンショット 2020-01-04 18.09.39.png

To summarize briefly

The sound source used this time has two speakers, so "speaker identification" should be set to 2, but that appears to be the default, so I did not specify it explicitly and ran everything with the defaults (no optional parameters at all).
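For reference, a job submission via the SDK would look roughly like the sketch below, whether you ultimately drive it from the console or from code (a minimal boto3 sketch; the bucket, object key, and job name are hypothetical placeholders, and the commented-out `Settings` block shows where speaker identification could be set to 2):

```python
import time

import boto3

# NOTE: the bucket, key, and job names below are hypothetical placeholders.
transcribe = boto3.client("transcribe", region_name="ap-northeast-1")

transcribe.start_transcription_job(
    TranscriptionJobName="podcast-transcription-no2",
    LanguageCode="ja-JP",   # Japanese support was added at the end of November 2019
    MediaFormat="flac",     # mp3 / mp4 / wav / flac are accepted
    Media={"MediaFileUri": "s3://my-audio-bucket/02_001_NoiRed-true_lev-true_samp44k.flac"},
    # Optional parameters; the runs in this article used the defaults (nothing specified).
    # Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2},
)

# Poll until the job finishes (the 1-hour source here took roughly 10 minutes).
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName="podcast-transcription-no2")
    status = job["TranscriptionJob"]["TranscriptionJobStatus"]
    if status in ("COMPLETED", "FAILED"):
        break
    time.sleep(30)

if status == "COMPLETED":
    # URL of the JSON transcript produced by Transcribe
    print(job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"])
```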

Processing a 1-hour sound source took about 10 minutes. The Google Speech API took a little over 15 minutes, so Amazon Transcribe is faster.

Validation dataset and evaluation method

As last time, I use the No. 1 to No. 8 sound sources (flac files) created by combining various preprocessing parameters. The sound source data is available here, so feel free to use it if you like.

The Amazon Transcribe runs target the same files and use the defaults without any optional parameters, so it should be fair to compare them on the same footing as the Google Speech API results.

Amazon Transcribe outputs its transcription result as JSON. The numbers of characters and words were counted in the same way as last time. (Click here for the actual processing code.)
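The linked code has the details; as a rough illustration of the idea, the Transcribe JSON can be reduced to plain text and then tokenized with a Japanese morphological analyzer, for example janome (this is my own sketch, not the linked code, and the input file name is a placeholder):

```python
import json
from collections import Counter

from janome.tokenizer import Tokenizer  # any Japanese morphological analyzer would do

# "transcribe_output.json" is a placeholder for the JSON downloaded from Transcribe.
with open("transcribe_output.json", encoding="utf-8") as f:
    result = json.load(f)

# The full transcript lives under results -> transcripts -> [0] -> transcript.
text = result["results"]["transcripts"][0]["transcript"]
print("characters transcribed:", len(text))

tokenizer = Tokenizer()
nouns = [token.surface for token in tokenizer.tokenize(text)
         if token.part_of_speech.startswith("名詞")]  # keep noun tokens only

print("noun words (with duplicates):   ", len(nouns))
print("noun words (without duplicates):", len(set(nouns)))

# Occurrence counts, reused later for the frequent-noun comparison.
noun_counts = Counter(nouns)
```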

To allow an apples-to-apples comparison, the evaluation method is the same as in the previous article.

Results

Quantitative results

| No. | File name | Noise reduction | Volume adjustment | Sample rate | Characters transcribed | Total words (with duplicates) | Noun words (with duplicates) | Total words (without duplicates) | Noun words (without duplicates) |
|----|----|----|----|----|----|----|----|----|----|
| 1 | 01_001_NoiRed-true_lev-true_samp16k.flac | True | True | 16k | 19320 | 10469 | 3150 | 1702 | 1057 |
| 2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 19317 | 10463 | 3152 | 1708 | 1060 |
| 3 | 03_001_NoiRed-true_lev-false_samp16k.flac | True | False | 16k | 19278 | 10429 | 3166 | 1706 | 1059 |
| 4 | 04_001_NoiRed-true_lev-false_samp44k.flac | True | False | 44k | 19322 | 10453 | 3170 | 1706 | 1058 |
| 5 | 05_001_NiRed-false_lev-true_samp16k.flac | False | True | 16k | 19660 | 10664 | 3209 | 1713 | 1054 |
| 6 | 06_001_NiRed-false_lev-true_samp44k.flac | False | True | 44k | 19653 | 10676 | 3211 | 1701 | 1052 |
| 7 | 07_001_NiRed-false_lev-false_samp16k.flac | False | False | 16k | 19639 | 10653 | 3209 | 1702 | 1052 |
| 8 | 08_001_NiRed-false_lev-false_samp44k.flac | False | False | 44k | 19620 | 10638 | 3213 | 1702 | 1047 |

The corresponding figures are shown below.

スクリーンショット 2020-01-10 23.07.02.png スクリーンショット 2020-01-10 23.07.30.png スクリーンショット 2020-01-10 23.07.40.png

The results were almost identical across all the samples. What the overall results tell us is:

- Unlike with the Google Speech API, Amazon Transcribe is **not affected by audio preprocessing** (*the preprocessing performed here was noise reduction with Audacity and volume adjustment with Levelator; results may differ under other conditions*).

Here, the No. 2 result, which has the highest **number of noun words without duplication** (although the differences are practically within the margin of error), is taken as the representative best result for Amazon Transcribe.

Since preprocessing does not seem to matter, I also ran Amazon Transcribe on the **raw recorded wav file (No. 0)** and got the following result.

The only difference between this completely unprocessed wav file and the No. 2 file is **stereo vs. monaural**: for No. 2, the stereo-to-monaural conversion was done at the same time as the wav-to-flac conversion. This was necessary because the Google Speech API only accepts monaural files.
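For reference, this kind of stereo-to-mono plus wav-to-flac conversion can be done in a few lines, for example with pydub (a minimal sketch assuming ffmpeg is installed; the output file name is a placeholder):

```python
from pydub import AudioSegment  # requires ffmpeg (or libav) on the system

# Load the raw stereo recording (No. 0)...
audio = AudioSegment.from_wav("001.wav")

# ...down-mix to mono, as required by the Google Speech API...
mono = audio.set_channels(1)

# ...optionally resample (e.g. to 16 kHz) and export as flac.
mono.set_frame_rate(16000).export("001_mono_16k.flac", format="flac")
```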

| No. | File name | Noise reduction | Volume adjustment | Sample rate | Characters transcribed | Total words (with duplicates) | Noun words (with duplicates) | Total words (without duplicates) | Noun words (without duplicates) |
|----|----|----|----|----|----|----|----|----|----|
| 0 | 001.wav | False | False | 44k | 19620 | 10637 | 3212 | 1701 | 1046 |
| 2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 19317 | 10463 | 3152 | 1708 | 1060 |

Strictly speaking, the "total words without duplicates" and "noun words without duplicates" are higher for No. 2, but the difference is small. If you can get almost the same accuracy without preprocessing or stereo-to-monaural conversion, the simplest option is to feed in the raw wav file, which requires no preprocessing at all.

Since the Amazon Transcribe results were almost identical from No. 1 through No. 8, I will skip the qualitative comparison among them and instead compare the "best result of the Google Cloud Speech API" against the "best result of Amazon Transcribe".

Google Cloud Speech API vs. Amazon Transcribe

I compare the figures for the Google Cloud Speech API (best result, No. 8) confirmed last time with those for Amazon Transcribe (best result, No. 2) confirmed this time. The Google figures are taken from the previous results.

Quantitative result comparison

Transcription character count comparison

スクリーンショット 2020-01-10 23.13.45.png

In terms of sheer volume, Amazon Transcribe transcribed more characters from the 1-hour sound source.

Word count comparison

スクリーンショット 2020-01-10 23.13.55.png

Perhaps because of the larger transcription volume, the total numbers of words and of nouns with duplicates were also higher for Amazon Transcribe.

On the other hand, the total numbers of words and of nouns excluding duplicates are almost the same. Roughly as expected...

Qualitative result comparison

As last time, I will roughly compare what the transcribed text actually looks like.

Transcription result

The images are arranged side by side so that the beginning of the transcription is easy to compare; Google is on the left and Amazon is on the right.

スクリーンショット 2020-01-10 23.15.54.png

It is hard to judge, but I feel Google's transcription is still slightly better, though honestly it is a contest between two equally mediocre results. (*In the image only the Google result contains line breaks, but both Google and Amazon originally produce transcripts with line breaks; since the accuracy of those breaks is questionable, I removed the line breaks from both in post-processing.)

Frequent nouns

Let's compare the "noun words without duplication" and their occurrence counts for Google and Amazon, displaying the words that appeared 11 times or more.

スクリーンショット 2020-01-10 23.16.05.png
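A comparison like this can be produced with a few lines of set arithmetic on the noun counts (a small sketch, assuming per-API Counters built in the same way as the counting sketch above):

```python
from collections import Counter
from typing import Set


def frequent_nouns(counts: Counter, min_count: int = 11) -> Set[str]:
    """Return the nouns that appeared at least min_count times."""
    return {word for word, count in counts.items() if count >= min_count}


def compare_frequent_nouns(google_counts: Counter, amazon_counts: Counter) -> None:
    google_frequent = frequent_nouns(google_counts)
    amazon_frequent = frequent_nouns(amazon_counts)
    print("recognized by Google only:", sorted(google_frequent - amazon_frequent))
    print("recognized by Amazon only:", sorted(amazon_frequent - google_frequent))
    print("recognized by both       :", sorted(google_frequent & amazon_frequent))
```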

Amazon appears to recognize more words. However, most of the words are shared between Google and Amazon, which also shows that the transcription performance of the two is not significantly different. It is also nice that Amazon's result picks up our company name, "BrainPad".

If you want to recognize as many words as possible (at least for this audio data), Amazon seems to be the better choice. (Whether each recognized word is actually meaningful needs to be checked.)

Word cloud

Continuing along the same lines, here is a word cloud of the nouns, visualizing the data above. Google is on the left and Amazon is on the right.

スクリーンショット 2020-01-10 23.16.15.png
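A word cloud like this can be generated from the noun counts with the wordcloud package (a minimal sketch with dummy data; a font that supports Japanese must be supplied, and the font path below is just an example):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Dummy noun -> occurrence counts; in practice, use the Counter built from each transcript.
noun_counts = {"文字起こし": 42, "API": 35, "音源": 20, "ブレインパッド": 12}

wc = WordCloud(
    font_path="/Library/Fonts/Arial Unicode.ttf",  # a Japanese-capable font is required
    background_color="white",
    width=800,
    height=400,
).generate_from_frequencies(noun_counts)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```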

Google Cloud Speech API vs. Amazon Transcribe Results Summary

Comparing the Google Cloud Speech API and Amazon Transcribe, the results were:

- For both Google and Amazon, **practical Japanese transcription does not seem feasible** (*at least when simply calling the API with light preprocessing on this audio data).
- **The results are almost identical when compared by the number of words transcribed.**
- Amazon Transcribe achieved the same accuracy as Google without any preprocessing, so **Amazon Transcribe wins on convenience**.
- If you operate the console in the browser and transcribe via the GUI, rather than installing an SDK and calling the API from the CLI (which is how almost all non-engineers would use these services), **Amazon Transcribe wins on ease of use, no question**; frankly, the Google API is too difficult for non-engineers. (Incidentally, transcription via Google Docs voice input has apparently become popular among non-engineers recently.)
- **Processing time is somewhat faster with Amazon Transcribe**: for a 1-hour file, Google takes a little over 15 minutes, while Amazon takes about 10 minutes.

My personal take:

For Japanese transcription, **both are far from a practical level of accuracy**, so my impression is that these transcription APIs can only really be used for **word extraction**. (And even if words can be extracted, there are very few uses for that alone...)

And if the use is limited to word extraction, my personal conclusion is that **Amazon Transcribe is the better choice**, since it works without preprocessing, is easy to use via the GUI, and processes files faster.

I have not given up on the possibility that transcription accuracy would improve with clearer audio recorded on better equipment (i.e., by improving the quality of the input), but my recording setup (an external microphone costing around 16,000 yen) is already about as much as an ordinary user would prepare, so my conclusion is that "fast, cheap Japanese transcription using only an API" is not achievable with current technology. It seems Japanese transcription will not be solved overnight.

This ends on a somewhat inconclusive note, so if you know any tricks along the lines of "do this and the transcription/recognition will work", please leave a comment!
