As I mentioned at the end of the previous article summarizing how to use the Google Speech API, I ran into the problem that transcription accuracy was lower than I expected.
rebuild.fm reportedly gets about 80% of its speech transcribed, but in my case my impression was that [roughly half was not recognized at all](https://github.com/ysdyt/podcast_app/blob/master/text/google_speech_api/001.txt). I did not expect perfection, but the result was pretty devastating, given that I had hoped to be able to follow the conversation just by reading the transcribed text.
Working on the premise that "the Speech API is not at fault; my preprocessing is," I tried various combinations of parameters and preprocessing steps and compared their accuracy. The goal this time is to find the best preprocessing for the Google Speech API.
I used the first episode of the podcast "Shirokane Mining.FM", which I record and distribute myself. The published audio has been edited, but since I am careful about recording conditions (for example, recording in a quiet room), the sound is clear enough that there is little difference even in the raw data.
The audio is exactly 1h long; I cut it at the 1h mark during editing. There is no particular significance to 1h; I simply wanted to try a long source, and this made a convenient test target.
To confirm the hypothesis raised at the end of the previous article, this time I verify three items: **presence or absence of noise reduction, presence or absence of volume adjustment, and the sample rate hertz value**.
- **Noise reduction** ... white-noise removal performed in Audacity. The procedure is described here.
- **Volume adjustment** ... automatic volume leveling performed with The Levelator. The procedure is described here.
- **sample rate hertz** ... the audio sampling rate. The Speech API recommends sampling at 16kHz, but its documentation notes that audio originally recorded at more than 16kHz should not be resampled down to 16kHz; it should be sent to the Speech API at the sampling rate at which it was recorded. Details and the procedure are here. Since the default sampling rate of my recording microphone is 44kHz, I try two patterns this time: 16kHz and 44kHz.
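To give an intuition for what resampling 44kHz audio down to 16kHz does, here is a minimal pure-Python sketch using linear interpolation. This is only an illustration of the idea; real resampling (as done by Audacity, sox, or ffmpeg) applies a proper low-pass filter first, so do not use this for actual audio.

```python
# Toy sketch: downsample a mono sample list by linear interpolation.
# Illustrative only; real resamplers low-pass filter to avoid aliasing.

def resample(samples, src_rate, dst_rate):
    """Resample `samples` from src_rate (Hz) to dst_rate (Hz)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

one_second_44k = [0.0] * 44100                 # one second of silence at 44.1kHz
print(len(resample(one_second_44k, 44100, 16000)))  # -> 16000
```

As the output shows, one second of audio shrinks from 44100 samples to 16000, which is why the 16k files in the table below are so much smaller.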
I prepared every combination of the three items above: 8 files in total, as shown in the table below.
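The 8 combinations can be enumerated mechanically. A small sketch, assuming the file-naming pattern from the table below (the actual file names spell "NiRed" for the noise-reduction-off files, which this simplified pattern does not reproduce):

```python
# Enumerate all 2 x 2 x 2 preprocessing combinations, in the same order
# as the table (noise reduction, volume adjustment, sample rate hertz).
from itertools import product

combos = list(product([True, False],    # noise reduction
                      [True, False],    # volume adjustment (Levelator)
                      ["16k", "44k"]))  # sample rate hertz

for no, (noired, lev, samp) in enumerate(combos, start=1):
    name = f"{no:02d}_001_NoiRed-{str(noired).lower()}_lev-{str(lev).lower()}_samp{samp}.flac"
    print(name)

print(len(combos))  # -> 8
```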
No. | file name | Noise reduction | Volume adjustment | sample rate hertz | file size |
---|---|---|---|---|---|
1 | 01_001_NoiRed-true_lev-true_samp16k.flac | True | True | 16k | 73.1MB |
2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 169.8MB |
3 | 03_001_NoiRed-true_lev-false_samp16k.flac | True | False | 16k | 64.7MB |
4 | 04_001_NoiRed-true_lev-false_samp44k.flac | True | False | 44k | 147.4MB |
5 | 05_001_NiRed-false_lev-true_samp16k.flac | False | True | 16k | 75.8MB |
6 | 06_001_NiRed-false_lev-true_samp44k.flac | False | True | 44k | 180.9MB |
7 | 07_001_NiRed-false_lev-false_samp16k.flac | False | False | 16k | 68.1MB |
8 | 08_001_NiRed-false_lev-false_samp44k.flac | False | False | 44k | 160.2MB |
Looking at file size, setting "sample rate hertz" to 16k reduces the size sharply, which is expected. How the presence or absence of noise reduction and volume adjustment affects file size was less clear.
Incidentally, the audio published for each episode of Shirokane Mining.FM is processed the same way as No.1 (noise reduction → True, volume adjustment → True).
I ran the Google Speech API following the procedure in the previous article.
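The only thing that changes between runs is the sampling rate in the request configuration. A minimal sketch of building that configuration as a plain dict whose field names mirror the `google-cloud-speech` `RecognitionConfig` (the actual run also needs a GCS upload and credentials, which are omitted here):

```python
# Hedged sketch: per-file request config for the Speech API.
# Field names mirror google-cloud-speech's RecognitionConfig.

def build_config(sample_rate_hertz):
    return {
        "encoding": "FLAC",
        "sample_rate_hertz": sample_rate_hertz,  # 16000 or 44100 in this experiment
        "language_code": "ja-JP",
    }

print(build_config(16000)["sample_rate_hertz"])  # -> 16000
```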
Checking one by one whether each passage was transcribed correctly is really tedious, so here I roughly and qualitatively check which parameter set gives the most accurate transcription.
However, a purely qualitative check is hard to judge on its own, so I also output the following quantitative evaluation items:

- Total number of transcribed characters
- Total number of words extracted by MeCab (with duplicates)
- Number of nouns extracted by MeCab (with duplicates)
- Total number of words extracted by MeCab (without duplicates)
- Number of nouns extracted by MeCab (without duplicates)
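The five items above reduce to simple counting once the text is tokenized. A sketch, assuming MeCab has already produced (surface, part-of-speech) pairs; the toy token list here is a stand-in for real tokenizer output:

```python
# Compute the five evaluation items from tokenized text.
# `tokens` is a list of (surface, pos) pairs as a tokenizer like MeCab yields.

def metrics(text, tokens):
    words = [surface for surface, pos in tokens]
    nouns = [surface for surface, pos in tokens if pos == "noun"]
    return {
        "characters": len(text),          # total transcribed characters
        "words_dup": len(words),          # words, with duplicates
        "nouns_dup": len(nouns),          # nouns, with duplicates
        "words_uniq": len(set(words)),    # words, without duplicates
        "nouns_uniq": len(set(nouns)),    # nouns, without duplicates
    }

toy = [("podcast", "noun"), ("transcribe", "verb"), ("podcast", "noun")]
print(metrics("podcast transcribe podcast", toy))
```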
The quantitative results are as follows.
No. | file name | Noise reduction | Volume adjustment | sample rate hertz | Characters | Words (with dup.) | Nouns (with dup.) | Words (no dup.) | Nouns (no dup.) |
---|---|---|---|---|---|---|---|---|---|
1 | 01_001_NoiRed-true_lev-true_samp16k.flac | True | True | 16k | 16849 | 9007 | 2723 | 1664 | 1034 |
2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 16818 | 8991 | 2697 | 1666 | 1030 |
3 | 03_001_NoiRed-true_lev-false_samp16k.flac | True | False | 16k | 16537 | 8836 | 2662 | 1635 | 1026 |
4 | 04_001_NoiRed-true_lev-false_samp44k.flac | True | False | 44k | 16561 | 8880 | 2651 | 1659 | 1019 |
5 | 05_001_NiRed-false_lev-true_samp16k.flac | False | True | 16k | 17219 | 9191 | 2758 | 1706 | 1076 |
6 | 06_001_NiRed-false_lev-true_samp44k.flac | False | True | 44k | 17065 | 9118 | 2727 | 1675 | 1055 |
7 | 07_001_NiRed-false_lev-false_samp16k.flac | False | False | 16k | 16979 | 9045 | 2734 | 1679 | 1047 |
8 | 08_001_NiRed-false_lev-false_samp44k.flac | False | False | 44k | 17028 | 9120 | 2727 | 1664 | 1040 |
The table is a little hard to read, so I also made a graph.
- Considering all items, **the best result was No.5** (no noise reduction, volume adjustment on, 16kHz sampling)
- **The worst were No.3 and No.4** (both roughly equally bad)
What can be said from the quantitative results:

- Noise reduction is better **off (False)**
- Volume adjustment is better **on (True)**
- The presence or absence of noise reduction and volume adjustment matters more than the sampling rate, so there is almost no difference between 16kHz and 44kHz. Strictly speaking, though, 16kHz always scores slightly better on "nouns without duplicates", so **16kHz seems better**.
Let's qualitatively check the transcription results of the best run (No.5) and the worst (No.3; No.4 would also do, but I picked No.3).
The images show the same portion of the full transcription side by side for easy comparison: No.5 on the left, No.3 on the right.
Hmm, I can't really tell the difference.
Since eyeballing the text is inconclusive, let's instead compare the "nouns without duplicates" and their occurrence counts from No.5 and No.3, displaying the words that appear 11 times or more.
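The "11 times or more" filter is a one-liner with `collections.Counter`. A sketch, with toy noun lists standing in for the actual MeCab output of No.5 and No.3:

```python
# Keep only words whose occurrence count reaches the threshold.
from collections import Counter

def frequent(nouns, threshold=11):
    counts = Counter(nouns)
    return {word: c for word, c in counts.items() if c >= threshold}

# Toy stand-in for the extracted nouns of one transcription run.
no5 = ["data"] * 12 + ["python"] * 11 + ["mic"] * 3
print(frequent(no5))  # -> {'data': 12, 'python': 11}
```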
Frequently occurring words look almost the same.
Incidentally, although it adds little new information, I also generated word clouds to get a feel for the data: No.5 on the left, No.3 on the right.
By the way, "shochu" is a mistranscription of "resident" (the two words sound similar in Japanese).
Of the three items verified, the combination that gave the best quantitative results was:

- Noise reduction → off
- Volume adjustment → on

As the official API documentation states, it seems better not to apply noise reduction. On the other hand, volume adjustment helps, so audio with a clear (not too quiet) volume appears to suit the API. Finally, although the documentation says audio recorded above 16kHz should not be resampled, even for 44kHz recordings resampling down to 16kHz seemed slightly better for the API (though the effect on the overall result is negligible).
Qualitatively comparing the transcription produced by the best combination (No.5) with that produced by the worst (No.3), there was almost no difference in the frequently occurring words that were transcribed successfully; the parameter differences did not produce a large difference in the transcription content itself. For rarely occurring words, some may be newly transcribed successfully, but I did not confirm this, as it is outside the scope of this verification (and I did not have the energy to check that far).
The mystery of "rebuild.fm can transcribe about 80%" only deepens, but I suspect this is the limit of the Google Speech API's accuracy for audio of the quality I can record. The road to automatic transcription remains steep.
Future Work
Next, I would like to pit the best Google Speech API transcription obtained this time against Amazon Transcribe.
Many of the comparison articles I have seen judge quality from transcribing only a few lines (or a few minutes), or use "too clean" sources such as news videos.
A widely shared blog post about Transcribe also reports high-accuracy transcription, but in English. It is well known in natural language processing that accuracy is high for English; the question is how it fares with Japanese.
What I want to know is how far the API alone can cope with Japanese audio: noisy sources recorded by amateurs (like podcasts), long sources of about 1h, and sources where multiple speakers talk over each other. That is what I want to verify.