[PYTHON] Subtitle data created with Amazon Transcribe

Motivation

I bought a DVD box set of "Mayday!: The Truth and Truth of Aircraft Accidents" (English title: Air Crash Investigation). It is the English-language edition, so I assumed it would at least include subtitles, but it did not even have English ones...

Fortunately, the discs were not protected by CSS (Content Scramble System), so I decided to try making the subtitles myself.

What I did

  1. Extract only audio from video data (ffmpeg)
  2. Transcribe the audio data (Amazon Transcribe)
  3. Convert the transcription result to SubRip subtitle data (Python)
  4. Embed subtitle data in video data (mkvmerge)

Installing the required tools

On macOS, you can install them with Homebrew. Python 3 and pip are assumed to be available already.

brew install ffmpeg mkvtoolnix
pip3 install boto3

1. Extract only audio from video data

With ffmpeg, you can copy just the audio stream into its own file without re-encoding it.

ffmpeg -i original.m4v -acodec copy -vn output.m4a
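
If several episodes need processing, the same call can be driven from Python. The following is only a sketch: the helper names `build_extract_command` and `extract_audio` are made up for illustration, and the flags are exactly the ones in the command above (`-acodec copy` copies the audio stream as-is, `-vn` drops the video).

```python
import subprocess


def build_extract_command(src, dst):
    """Return the ffmpeg argument list that copies only the audio stream."""
    return ["ffmpeg", "-i", src, "-acodec", "copy", "-vn", dst]


def extract_audio(src, dst):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_extract_command(src, dst), check=True)
```

Calling `extract_audio("original.m4v", "output.m4a")` would then be equivalent to the command line above.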

2. Transcribe the audio data

To use Amazon Transcribe, the audio data must first be uploaded to S3. To keep things simple, I wrote a short Python script that just uploads the file to S3 and submits a job to Amazon Transcribe.

The code below took me a little over 10 minutes to write, so it is fairly rough overall...

01-transcribe.py


from boto3 import client, resource
import os
import sys

# Placeholder values: replace them with your own credentials and bucket name.
AWS_ACCESS_KEY = "hogehoge"
AWS_SECRET_ACCESS_KEY = "fugafuga"
BUCKET = "somebucket"

def upload(filepath):
    """Upload the file to the root of BUCKET under its base name."""
    basename = os.path.basename(filepath)

    # resource() returns a high-level S3 resource object, not a client
    s3 = resource(
        "s3",
        region_name="ap-northeast-1",
        aws_access_key_id=AWS_ACCESS_KEY,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )
    s3.Bucket(BUCKET).upload_file(filepath, basename)


def transcribe(filename):
    """Start a transcription job for a file already uploaded to BUCKET."""
    url = "s3://{}/{}".format(BUCKET, filename)

    transcribe_client = client(
        "transcribe",
        region_name="ap-northeast-1",
        aws_access_key_id=AWS_ACCESS_KEY,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    )
    # The file name doubles as the job name, which must be unique.
    # The result is written to BUCKET as JSON when the job completes.
    transcribe_client.start_transcription_job(
        TranscriptionJobName=filename,
        LanguageCode="en-US",
        MediaFormat="mp4",
        Media={"MediaFileUri": url},
        OutputBucketName=BUCKET,
    )

def main():
    filepath = sys.argv[1]
    upload(filepath)
    transcribe(os.path.basename(filepath))


if __name__ == "__main__":
    main()

Run the script, passing the audio file extracted above as the argument.

python 01-transcribe.py output.m4a

When the job has finished, download the resulting JSON file from the web console (details omitted).
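
The download can probably be scripted as well. The sketch below is not part of the original workflow: `wait_and_download` is a hypothetical helper that takes already-created boto3 clients, polls `get_transcription_job` until the job ends, and then fetches the transcript. It assumes that, because only `OutputBucketName` was given when the job was started, the transcript is stored at the root of the bucket as `<job name>.json`.

```python
import time


def wait_and_download(transcribe_client, s3_client, job_name, bucket, dest,
                      poll_seconds=30):
    """Poll the job until it finishes, then download <job_name>.json."""
    while True:
        job = transcribe_client.get_transcription_job(
            TranscriptionJobName=job_name
        )["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            break
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        time.sleep(poll_seconds)

    # Assumption: no OutputKey was set, so the transcript sits at the
    # bucket root under the job name plus ".json".
    s3_client.download_file(bucket, job_name + ".json", dest)


# Usage sketch (needs boto3 and valid credentials):
#   from boto3 import client
#   wait_and_download(client("transcribe", region_name="ap-northeast-1"),
#                     client("s3", region_name="ap-northeast-1"),
#                     "output.m4a", "somebucket", "result.json")
```

Passing the clients in as arguments keeps the credential handling out of the helper and makes it easy to try with stand-in objects.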

3. Convert the transcription result to SubRip subtitle data

The resulting JSON contains an array of the recognized words, each with its start and end time. Showing every word as its own subtitle would be far too hard to read.
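
To make that structure concrete, here is a small sketch. `SAMPLE` is a hand-written fragment in the shape the conversion script expects (`results.items`, with times stored as strings and punctuation entries carrying no times); it is not real Transcribe output, and `iter_items` is a hypothetical helper name.

```python
import json

# Hand-written fragment in the shape of a Transcribe result (not real output).
SAMPLE = json.loads("""
{
  "results": {
    "items": [
      {"start_time": "0.54", "end_time": "0.81", "type": "pronunciation",
       "alternatives": [{"confidence": "0.99", "content": "We're"}]},
      {"start_time": "0.81", "end_time": "1.42", "type": "pronunciation",
       "alternatives": [{"confidence": "0.98", "content": "descending"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
""")


def iter_items(data):
    """Yield (start, end, content); start and end are None for punctuation."""
    for item in data["results"]["items"]:
        content = item["alternatives"][0]["content"]
        if "start_time" in item:
            yield float(item["start_time"]), float(item["end_time"]), content
        else:
            yield None, None, content


for start, end, content in iter_items(SAMPLE):
    print(start, end, content)
```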

To keep the subtitles readable, I decided to group the words into display ranges according to a few rules and generate the subtitle data (srt) from those.

The rules are fairly arbitrary, so feel free to adjust them as you like.
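
Read on their own, the rules in the script amount to a small predicate. The sketch below restates them under that reading; the function name `should_break` and its parameter names are hypothetical, while the thresholds (3 seconds, 110 and 80 characters) are the ones the script uses.

```python
def should_break(text, gap, max_gap=3, max_len=110, comma_len=80):
    """Return True if the subtitle collected so far should be emitted.

    text: the words collected so far
    gap:  seconds between the previous word's end and the next word's start
    """
    if gap > max_gap or len(text) >= max_len:
        return True
    if not text:
        return False
    # An upper-case letter right before the last character (e.g. "U.")
    # is taken to be an abbreviation, not the end of a sentence.
    if len(text) > 1 and text[-2].isupper():
        return False
    if text[-1] in (".", "?", "!"):
        return True
    # A comma only ends the subtitle once the line is already fairly long
    return text[-1] == "," and len(text) > comma_len
```

Keeping the thresholds as keyword arguments makes it easy to experiment, for example with a shorter `max_len` for small screens.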

02-makesrt.py


import json
import sys


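def sec2time(sec):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Work in whole milliseconds so that floating-point error
    # cannot truncate the last millisecond
    total_ms = int(round(sec * 1000))
    h, rem = divmod(total_ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, mils = divmod(rem, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(h, m, s, mils)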

def convert2srt(filepath):
    with open(filepath, "r") as f:
        data = json.load(f)

    start_time = 0
    end_time = 0
    s = ""
    index = 1  # SRT cue numbers conventionally start at 1
    for item in data["results"]["items"]:
        is_output = False

        if "start_time" in item:
            item["start_time"] = float(item["start_time"])
            item["end_time"] = float(item["end_time"])

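            if item["start_time"] - end_time > 3:
                # More than 3 seconds have passed since the previous word
                is_output = True

            elif len(s) >= 110:
                # The current line is getting too long
                is_output = True

            if s != "":
                if len(s) > 1 and s[-2].isupper():
                    # An upper-case letter before the last character
                    # (e.g. "U.") is probably an abbreviation, so do not
                    # treat it as the end of a sentence
                    pass
                else:
                    last = s[-1]
                    if last in (".", "?", "!"):
                        is_output = True

                    if last == "," and len(s) > 80:
                        is_output = True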

        if is_output:

            end_time = min(item["start_time"], end_time+2.0)

            if s != "":
                print(index)
                index += 1
                print("{0} --> {1}".format(sec2time(start_time), sec2time(end_time)))
                print(s)
                print("")

            start_time = 0
            end_time = 0
            s = ""

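        content = item["alternatives"][0]["content"]
        if "start_time" in item:
            # A spoken word: remember when the current subtitle started
            if start_time == 0:
                start_time = item["start_time"]
            end_time = item["end_time"]
            # Join with a space, except for a single letter right after a
            # period (e.g. the "S" in "U.S"), which is appended directly
            if s and (len(content) > 1 or s[-1] != "."):
                s += " " + content
            else:
                s += content
        else:
            # Punctuation has no timestamps and is attached without a space
            s += content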

    if s != "":
        print(index)
        index += 1
        print("{0} --> {1}".format(sec2time(start_time), sec2time(end_time+2.0)))
        print(s)
        print("")

def main():
    filepath = sys.argv[1]
    convert2srt(filepath)

if __name__ == "__main__":
    main()

Pass the JSON file as the argument and redirect the output to save it as a text file.

python 02-makesrt.py result.json > result.srt

The output should look like this:

output


64
00:06:00,139 --> 00:06:16,839
Then the next Nano second, it was pure, unadulterated pandemonium Way number three going down.

65
00:06:16,839 --> 00:06:18,720
It looks like we lost number three engine.

66
00:06:18,720 --> 00:06:23,149
We're descending rapidly coming back.
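
Each SubRip cue consists of a number, a timing line, and the text, followed by a blank line. Before embedding the file, a quick sanity check along the following lines may be useful. This is only a sketch with hypothetical helper names (`check_srt`, `to_seconds`), and it assumes one line of text per cue, as produced by the script above.

```python
import re

TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})$"
)


def to_seconds(h, m, s, ms):
    """Convert the captured timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def check_srt(text):
    """Return a list of (cue_number, problem) for cues that look wrong."""
    problems = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        match = TIMING.match(lines[1]) if len(lines) >= 3 else None
        if match is None:
            problems.append((lines[0] if lines else "", "malformed cue"))
            continue
        start = to_seconds(*match.groups()[:4])
        end = to_seconds(*match.groups()[4:])
        if end < start:
            problems.append((lines[0], "ends before it starts"))
    return problems
```

Running `check_srt(open("result.srt").read())` and getting an empty list back means every cue has a well-formed timing line.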

4. Embed subtitle data in video data

With mkvmerge, you can easily embed subtitle data in mkv files.

mkvmerge -o output.mkv original.m4v --language 0:eng --track-name 0:English result.srt

The embedded subtitles can be displayed when playing the file in VLC.

Result

The subtitles are displayed with almost nothing that feels out of place.

vlcsnap-2020-03-03-00h47m23s630.png

vlcsnap-2020-03-03-00h50m06s662.png

The real thrill of programming is being able to quickly build a tool like this when you need one.
