As I mentioned at the end of the previous article summarizing how to use the Google Speech API, I ran into the problem that transcription accuracy was lower than I expected.
rebuild.fm reportedly gets about 80% of its speech transcribed, but in my case my impression was that [roughly half was not recognized at all](https://github.com/ysdyt/podcast_app/blob/master/text/google_speech_api/001.txt). I did not expect perfection, but the result was pretty devastating, given that I had hoped to be able to follow the conversation just by reading the transcribed text.
Working on the premise that "the Speech API is not at fault; my preprocessing is," I tried various combinations of parameters and preprocessing steps and compared their accuracy. The goal this time is to find the best preprocessing for the Google Speech API.
I used the first episode of the podcast "Shirokane Mining.FM", which I record and distribute myself. The published audio has been edited, but since I am careful about recording conditions (for example, recording in a quiet room), the sound is clear enough that there is little difference even in the raw data.
The audio is exactly 1h long; I cut it at the 1h mark during editing. There is no particular significance to 1h; I simply wanted to try a long source, and this made a convenient test target.
To confirm the hypothesis raised at the end of the previous article, this time I verify three items: **presence or absence of noise reduction, presence or absence of volume adjustment, and the sample rate hertz value**.
- **Noise reduction** ... white-noise removal performed in Audacity. The procedure is described here.
- **Volume adjustment** ... automatic volume leveling performed with The Levelator. The procedure is described here.
- **sample rate hertz** ... the audio sampling rate. The Speech API recommends sampling at 16kHz, but its documentation notes that audio originally recorded at more than 16kHz should not be resampled down to 16kHz; it should be sent to the Speech API at the sampling rate at which it was recorded. Details and the procedure are here. Since the default sampling rate of my recording microphone is 44kHz, I try two patterns this time: 16kHz and 44kHz.
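To give an intuition for what resampling 44kHz audio down to 16kHz does, here is a minimal pure-Python sketch using linear interpolation. This is only an illustration of the idea; real resampling (as done by Audacity, sox, or ffmpeg) applies a proper low-pass filter first, so do not use this for actual audio.

```python
# Toy sketch: downsample a mono sample list by linear interpolation.
# Illustrative only; real resamplers low-pass filter to avoid aliasing.

def resample(samples, src_rate, dst_rate):
    """Resample `samples` from src_rate (Hz) to dst_rate (Hz)."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

one_second_44k = [0.0] * 44100                 # one second of silence at 44.1kHz
print(len(resample(one_second_44k, 44100, 16000)))  # -> 16000
```

As the output shows, one second of audio shrinks from 44100 samples to 16000, which is why the 16k files in the table below are so much smaller.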
I prepared every combination of the three items above: 8 files in total, as shown in the table below.
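The 8 combinations can be enumerated mechanically. A small sketch, assuming the file-naming pattern from the table below (the actual file names spell "NiRed" for the noise-reduction-off files, which this simplified pattern does not reproduce):

```python
# Enumerate all 2 x 2 x 2 preprocessing combinations, in the same order
# as the table (noise reduction, volume adjustment, sample rate hertz).
from itertools import product

combos = list(product([True, False],    # noise reduction
                      [True, False],    # volume adjustment (Levelator)
                      ["16k", "44k"]))  # sample rate hertz

for no, (noired, lev, samp) in enumerate(combos, start=1):
    name = f"{no:02d}_001_NoiRed-{str(noired).lower()}_lev-{str(lev).lower()}_samp{samp}.flac"
    print(name)

print(len(combos))  # -> 8
```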
No. | file name | Noise reduction | Volume adjustment | sample rate hertz | file size |
---|---|---|---|---|---|
1 | 01_001_NoiRed-true_lev-true_samp16k.flac | True | True | 16k | 73.1MB |
2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 169.8MB |
3 | 03_001_NoiRed-true_lev-false_samp16k.flac | True | False | 16k | 64.7MB |
4 | 04_001_NoiRed-true_lev-false_samp44k.flac | True | False | 44k | 147.4MB |
5 | 05_001_NiRed-false_lev-true_samp16k.flac | False | True | 16k | 75.8MB |
6 | 06_001_NiRed-false_lev-true_samp44k.flac | False | True | 44k | 180.9MB |
7 | 07_001_NiRed-false_lev-false_samp16k.flac | False | False | 16k | 68.1MB |
8 | 08_001_NiRed-false_lev-false_samp44k.flac | False | False | 44k | 160.2MB |
Looking at file size, setting "sample rate hertz" to 16k reduces the size sharply, which is expected. How the presence or absence of noise reduction and volume adjustment affects file size was less clear.
Incidentally, the audio published for each episode of Shirokane Mining.FM is processed the same way as No.1 (noise reduction → True, volume adjustment → True).
I ran the Google Speech API following the procedure in the previous article.
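The only thing that changes between runs is the sampling rate in the request configuration. A minimal sketch of building that configuration as a plain dict whose field names mirror the `google-cloud-speech` `RecognitionConfig` (the actual run also needs a GCS upload and credentials, which are omitted here):

```python
# Hedged sketch: per-file request config for the Speech API.
# Field names mirror google-cloud-speech's RecognitionConfig.

def build_config(sample_rate_hertz):
    return {
        "encoding": "FLAC",
        "sample_rate_hertz": sample_rate_hertz,  # 16000 or 44100 in this experiment
        "language_code": "ja-JP",
    }

print(build_config(16000)["sample_rate_hertz"])  # -> 16000
```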
Checking one by one whether each passage was transcribed correctly is really tedious, so here I roughly and qualitatively check which parameter set gives the most accurate transcription.
However, a purely qualitative check is hard to judge on its own, so I also output the following quantitative evaluation items:

- Total number of transcribed characters
- Total number of words extracted by MeCab (with duplicates)
- Number of nouns extracted by MeCab (with duplicates)
- Total number of words extracted by MeCab (without duplicates)
- Number of nouns extracted by MeCab (without duplicates)
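The five items above reduce to simple counting once the text is tokenized. A sketch, assuming MeCab has already produced (surface, part-of-speech) pairs; the toy token list here is a stand-in for real tokenizer output:

```python
# Compute the five evaluation items from tokenized text.
# `tokens` is a list of (surface, pos) pairs as a tokenizer like MeCab yields.

def metrics(text, tokens):
    words = [surface for surface, pos in tokens]
    nouns = [surface for surface, pos in tokens if pos == "noun"]
    return {
        "characters": len(text),          # total transcribed characters
        "words_dup": len(words),          # words, with duplicates
        "nouns_dup": len(nouns),          # nouns, with duplicates
        "words_uniq": len(set(words)),    # words, without duplicates
        "nouns_uniq": len(set(nouns)),    # nouns, without duplicates
    }

toy = [("podcast", "noun"), ("transcribe", "verb"), ("podcast", "noun")]
print(metrics("podcast transcribe podcast", toy))
```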
The quantitative results are as follows.
No. | file name | Noise reduction | Volume adjustment | sample rate hertz | Characters | Words (with dup.) | Nouns (with dup.) | Words (no dup.) | Nouns (no dup.) |
---|---|---|---|---|---|---|---|---|---|
1 | 01_001_NoiRed-true_lev-true_samp16k.flac | True | True | 16k | 16849 | 9007 | 2723 | 1664 | 1034 |
2 | 02_001_NoiRed-true_lev-true_samp44k.flac | True | True | 44k | 16818 | 8991 | 2697 | 1666 | 1030 |
3 | 03_001_NoiRed-true_lev-false_samp16k.flac | True | False | 16k | 16537 | 8836 | 2662 | 1635 | 1026 |
4 | 04_001_NoiRed-true_lev-false_samp44k.flac | True | False | 44k | 16561 | 8880 | 2651 | 1659 | 1019 |
5 | 05_001_NiRed-false_lev-true_samp16k.flac | False | True | 16k | 17219 | 9191 | 2758 | 1706 | 1076 |
6 | 06_001_NiRed-false_lev-true_samp44k.flac | False | True | 44k | 17065 | 9118 | 2727 | 1675 | 1055 |
7 | 07_001_NiRed-false_lev-false_samp16k.flac | False | False | 16k | 16979 | 9045 | 2734 | 1679 | 1047 |
8 | 08_001_NiRed-false_lev-false_samp44k.flac | False | False | 44k | 17028 | 9120 | 2727 | 1664 | 1040 |
The table is a little hard to read, so I also made a graph.
- Considering all items, **the best result was No.5** (no noise reduction, volume adjustment on, 16kHz sampling)
- **The worst were No.3 and No.4** (both roughly equally bad)
What can be said from the quantitative results:

- Noise reduction is better **off (False)**
- Volume adjustment is better **on (True)**
- The presence or absence of noise reduction and volume adjustment matters more than the sampling rate, so there is almost no difference between 16kHz and 44kHz. Strictly speaking, though, 16kHz always scores slightly better on "nouns without duplicates", so **16kHz seems better**.
Let's qualitatively check the transcription results of the best run (No.5) and the worst (No.3; No.4 would also do, but I picked No.3).
The images show the same portion of the full transcription side by side for easy comparison: No.5 on the left, No.3 on the right.
Hmm, I can't really tell the difference.
Since eyeballing the text is inconclusive, let's instead compare the "nouns without duplicates" and their occurrence counts from No.5 and No.3, displaying the words that appear 11 times or more.
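The "11 times or more" filter is a one-liner with `collections.Counter`. A sketch, with toy noun lists standing in for the actual MeCab output of No.5 and No.3:

```python
# Keep only words whose occurrence count reaches the threshold.
from collections import Counter

def frequent(nouns, threshold=11):
    counts = Counter(nouns)
    return {word: c for word, c in counts.items() if c >= threshold}

# Toy stand-in for the extracted nouns of one transcription run.
no5 = ["data"] * 12 + ["python"] * 11 + ["mic"] * 3
print(frequent(no5))  # -> {'data': 12, 'python': 11}
```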
Frequently occurring words look almost the same.
Incidentally, although it adds little new information, I also generated word clouds to get a feel for the data: No.5 on the left, No.3 on the right.
By the way, "shochu" is a mistranscription of "resident" (the two words sound similar in Japanese).
Of the three items verified, the combination that gave the best quantitative results was:

- Noise reduction → off
- Volume adjustment → on

As the official API documentation states, it seems better not to apply noise reduction. On the other hand, volume adjustment helps, so audio with a clear (not too quiet) volume appears to suit the API. Finally, although the documentation says audio recorded above 16kHz should not be resampled, even for 44kHz recordings resampling down to 16kHz seemed slightly better for the API (though the effect on the overall result is negligible).
Qualitatively comparing the transcription produced by the best combination (No.5) with that produced by the worst (No.3), there was almost no difference in the frequently occurring words that were transcribed successfully; the parameter differences did not produce a large difference in the transcription content itself. For rarely occurring words, some may be newly transcribed successfully, but I did not confirm this, as it is outside the scope of this verification (and I did not have the energy to check that far).
The mystery of "rebuild.fm can transcribe about 80%" only deepens, but I suspect this is the limit of the Google Speech API's accuracy for audio of the quality I can record. The road to automatic transcription remains steep.
Future Work
Next, I would like to pit the best Google Speech API transcription obtained this time against Amazon Transcribe.
Many of the comparison articles I have seen judge quality from transcribing only a few lines (or a few minutes), or use "too clean" sources such as news videos.
A widely shared blog post about Transcribe also reports high-accuracy transcription, but in English. It is well known in natural language processing that accuracy is high for English; the question is how it fares with Japanese.
What I want to know is how far the API alone can cope with Japanese audio: noisy sources recorded by amateurs (like podcasts), long sources of about 1h, and sources where multiple speakers talk over each other. That is what I want to verify.