Translate Coursera's WebVTT format subtitles with the GCP Cloud Translation API

This article is the 24th day of Shinagawa Advent Calendar 2019.

Recently, I have been taking Coursera courses with support from my company. Most courses are in English, though some have Japanese subtitles contributed by volunteers. However, a course may lack subtitles for some videos even if it is listed as "with Japanese subtitles". For example, the Deep Learning Specialization has almost no Japanese subtitles in the 2nd, 3rd, and 5th weeks. So I made my own Japanese subtitles with the GCP Cloud Translation API.

Introduction

I have put the full set of files in a gist; this article is a commentary on it. I didn't write it with the intention of showing it to anyone, so please forgive the file names.

What is Coursera?

Coursera is an online course service. Learning is not limited to videos: there are also programming exercises and assignments using Jupyter Notebook, and you can discuss with other students and TAs in the forums. I will leave reviews of individual courses to other articles, but I found it a much more efficient way to learn than researching things on my own.

What is WebVTT?

WebVTT is a text format for subtitling videos within the <video> element of HTML. It is a simple format with the following structure [^1].

WEBVTT

1
00:00:00.000 --> 00:00:01.755
Hello, I'm Carolyn.

2
00:00:01.755 --> 00:00:03.795
I'd like to welcome
you to our course

3
00:00:03.795 --> 00:00:06.505
on Machine Learning for
Business Professionals.

As you can see, it is a sequence of pairs: the playback time range during which a subtitle is displayed, and the subtitle text. The detailed format is described in the MDN WebVTT documentation. Decorative tags are also available, but Coursera's subtitles appear to use the simple plain-text form shown above.
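For illustration, cues like the ones above can be read programmatically with the webvtt-py library that the script uses (described later); the file name subtitles.vtt is just a placeholder.

import webvtt  # pip install webvtt-py

# Each cue has a start time, an end time, and the caption text,
# which may span multiple lines.
for caption in webvtt.read("subtitles.vtt"):
    print(caption.start, "-->", caption.end)
    print(caption.text)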

Nowadays, many media players such as VLC Media Player also support WebVTT subtitles. For example, VLC will automatically read a .vtt file with the same file name as the video if it is in the same folder. The same applies to the Android version of the app.

What is Cloud Translation API?

Unlike the Google Translate web service, this is a paid API [^2]. The introduction page lists two APIs, AutoML Translation and the Translation API; I used the latter.

Billing is per character. As a rough guideline, a 30-minute video comes to about 20,000 characters, or roughly $0.80. To give another sense of scale: translating every video without Japanese subtitles in weeks 1 through 5 of the Deep Learning Specialization cost me about 1,500 yen [^3].
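As a minimal sketch of how a single string is translated from Python with the google-cloud-translate library described later (assuming the GOOGLE_APPLICATION_CREDENTIALS environment variable points at a service account key file):

from google.cloud import translate_v2 as translate

# The client picks up credentials from the file referenced by the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
client = translate.Client()

result = client.translate(
    "I'd like to welcome you to our course.",
    source_language="en",
    target_language="ja",
)
print(result["translatedText"])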

Process flow

From here on I will explain the WebVTT translation script. The overall processing flow is as follows.

  1. A single subtitle cue may end in the middle of a sentence, so read the cues in sequence and assemble a complete sentence whenever a period or exclamation mark appears (see the sketch after this list).
  2. Translate the assembled sentence.
  3. To split the translated text back into cues like the original, record the character count of each cue before translation and divide the translated sentence according to that ratio.
  4. Reassemble everything into WebVTT format.
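The following is a simplified sketch of step 1, together with the bookkeeping needed for step 3; it is not the exact code from the gist. It groups consecutive cues into a sentence and records the character count of each original cue.

import webvtt

def collect_sentences(vtt_path):
    """Group consecutive cues into full sentences, keeping the
    character count of each original cue for the later split."""
    sentences = []
    buffer_cues = []
    for caption in webvtt.read(vtt_path):
        buffer_cues.append(caption)
        text = " ".join(c.text.replace("\n", " ") for c in buffer_cues)
        # A sentence-ending mark closes the current sentence.
        if text.rstrip().endswith((".", "!", "?")):
            counts = [len(c.text) for c in buffer_cues]
            sentences.append((text, list(buffer_cues), counts))
            buffer_cues = []
    return sentences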

Example

For example, suppose you have the following subtitles [^1].

2
00:00:01.755 --> 00:00:03.795
I'd like to welcome
you to our course

3
00:00:03.795 --> 00:00:06.505
on Machine Learning for
Business Professionals.

First, join them into a single sentence up to the period, remembering that cue 2 has 37 characters and cue 3 has 47. Then translate the sentence.

Before translation: I'd like to welcome you to our course on Machine Learning for Business Professionals.
After translation: Welcome to the Machine Learning course for business professionals.

Now split the translation back into the original two cues. The character ratio of the original cues is 37:47, so the split is based on that ratio. However, the result is hard to read if the cut falls in the middle of a word or just before a particle such as の ("no") or を ("o"), so the split point is adjusted to avoid that. The result is as follows.

00:00:01.755 --> 00:00:03.795
Machine learning for business professionals

00:00:03.795 --> 00:00:06.505
Welcome to the course.

If you translated this sentence as a whole, it is short enough to display as a single subtitle, so splitting it like this feels a little odd. I think there is plenty of room for improvement here.
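For reference, a bare ratio-based split, without the adjustment for words and particles, might look like the sketch below; the Japanese sentence is only an illustrative translation, not the actual API output.

def split_by_ratio(translated, char_counts):
    """Split a translated sentence into pieces whose lengths follow
    the character-count ratio of the original cues (e.g. 37:47)."""
    total = sum(char_counts)
    pieces = []
    start = 0
    for i, count in enumerate(char_counts):
        if i == len(char_counts) - 1:
            end = len(translated)  # the last piece takes the remainder
        else:
            end = start + round(len(translated) * count / total)
        pieces.append(translated[start:end])
        start = end
    return pieces

# Split into two cues with a 37:47 character ratio.
print(split_by_ratio("ビジネスプロフェッショナルのための機械学習コースへようこそ。", [37, 47]))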

As another example, here is the sentence that follows the one above:

4
00:00:06.505 --> 00:00:08.160
I lead a team of machine learning

5
00:00:08.160 --> 00:00:10.080
engineers who have successfully

6
00:00:10.080 --> 00:00:12.450
implemented many
machine learning projects

7
00:00:12.450 --> 00:00:14.475
across various industries.

After translation and splitting, this becomes:

00:00:06.505 --> 00:00:08.160
I am in various industries

00:00:08.160 --> 00:00:10.080
Many machine learning projects

00:00:10.080 --> 00:00:12.450
A team of machine learning engineers who successfully implemented

00:00:12.450 --> 00:00:14.475
I'm leading.

Setting aside the quality of the machine-translated sentences themselves, I think the subtitles are split at pretty reasonable points.

Other libraries, etc.

- webvtt-py: used to parse and rebuild WebVTT. It does not seem to support cue identifiers, so they disappear from the translated file, but this is not a problem because identifiers are not required to display subtitles.
- MeCab: used to split Japanese text into words and determine their parts of speech.
- google-cloud-translate: a Python client library that makes the Cloud Translation API easy to use. Authentication is handled by pointing an environment variable at the API credentials file.
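For the particle check, MeCab can be used roughly as in the sketch below (assuming the mecab-python3 package and an IPAdic-style dictionary, where the first feature field is the part of speech). A cue should not start with a particle (助詞) or a punctuation mark (記号), so the split position is shifted when that would happen.

import MeCab

tagger = MeCab.Tagger()

def parts_of_speech(text):
    """Return (surface, part-of-speech) pairs for a Japanese string."""
    pairs = []
    node = tagger.parseToNode(text)
    while node:
        if node.surface:                      # skip BOS/EOS nodes
            pos = node.feature.split(",")[0]  # e.g. 名詞, 助詞, 記号
            pairs.append((node.surface, pos))
        node = node.next
    return pairs

print(parts_of_speech("コースへようこそ。"))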

Known issues

- Sometimes the split points are not good
  - Due to various issues, the subtitle division can become unbalanced.
  - Three or more consecutive particles and punctuation marks are not separated well. I have simply not built the logic for it, but in rare cases a line ends up starting with a punctuation mark.
- It cannot handle sentences without a period
  - "Parameters vs Hyperparameters" in the 4th week of the Deep Learning Specialization was such a case: the transcript has no periods, and as expected this logic could not cope with it.
- Translation of technical terms does not always work
  - Google Translate handles AI jargon fairly well, but some terms still come out as literal translations.
  - The API has a feature for creating and registering a glossary, but I can convert those terms in my head without much effort, so I have left it as is.

Summary

Even though I am not good at listening to English, I feel that subtitles, even machine-translated ones, help a great deal with comprehension. Coursera has many great courses, and I think it is a much more efficient way to gain systematic knowledge than reading articles or books, so why not give it a try?

[^1]: Quoted from the introduction video in the first week of Machine Learning for Business Professionals. This course is free if you do not need a certificate.

[^2]: There is also "Try to use Google Translate API for free". I haven't tried it.

[^3]: Since I was developing the script, this amount includes what I spent on trial and error.
