[PYTHON] Forcibly introduce real-time translation to Zoom

I want to display real-time translations on Zoom's webcam video!

An example of real-time translation from English to Japanese in a Zoom meeting. Screenshot 2020-05-01 at 20.54.58.jpg

Zoom easily crosses national borders, but if you can't communicate in English you can't really benefit from that, so I tried putting together a simple mechanism to help.

As a rough flow,

・Use Soundflower to route Zoom's audio output back in internally as audio input.
・In Python, feed that audio to Microsoft Azure's real-time speech translation API.
・Send the translation result from Python via OSC, and composite it as subtitles over the webcam input in TouchDesigner.
・Output the composite with the Syphon Spout Out TOP in TouchDesigner and feed it to Zoom as a virtual webcam via CamTwist.

Very much a brute-force setup.

You don't need a paid Zoom account for any of this.

Environment / software used

・macOS Catalina
・Python 3.7
・Microsoft Azure account
・TouchDesigner
・Soundflower
・CamTwist

Install Soundflower and CamTwist

Download from here

Soundflower https://github.com/mattingalls/Soundflower/releases/tag/2.0b2 (read the notes on the release page carefully)

CamTwist http://camtwiststudio.com/

Soundflower settings

Once installed, "Soundflower" appears as both an input and an output device in the Mac sound settings, so set both the output and the input to Soundflower (2ch). This lets you treat the audio you hear in Zoom as microphone input. On Windows, VoiceMeeter Banana works well for the same purpose; on the Mac, Soundflower is the only thing I've found so far that works properly. Screenshot 2020-05-01 at 21.11.47.jpg Screenshot 2020-05-01 at 21.14.58.jpg

Real-time speech translation using Azure

Within Azure, we use an API from Cognitive Services. https://azure.microsoft.com/ja-jp/services/cognitive-services/ Register from this page. I'm on the free trial; if you want to use it seriously, it does of course cost money.

After registering, make a note of your subscription key and service region.

Call Azure real-time translation from a Python environment on the Mac

The sample code is here: https://github.com/Azure-Samples/cognitive-services-speech-sdk Download it, and in each of the files in the python/console folder, replace "YourSubscriptionKey" and "YourServiceRegion" with your own values.
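In the sample files those values are defined near the top, something like this (the region string below is only an example; use the region of your own resource):

# near the top of each sample under python/console
speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"
# for example:
# speech_key, service_region = "0123456789abcdef0123456789abcdef", "japaneast"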

Rewrite translation_sample.py so that it takes the Mac's audio input and pulls out the real-time translation results.

Settings for OSC

# At the top of the file
from pythonosc import udp_client
from pythonosc.osc_message_builder import OscMessageBuilder

IP = '~'  # IP address of the machine running TouchDesigner ('127.0.0.1' if it is the same machine)
PORT = 10000  # any free port; must match the OSC In DAT in TouchDesigner (example value)
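
To check the OSC link on its own, you can send a single test message with the same client before wiring up Azure (a quick sketch using the IP and PORT above; the /translation address and the Text TOP are the ones used later in TouchDesigner):

# one-off test: the Text TOP in TouchDesigner should show "OSC test"
client = udp_client.UDPClient(IP, PORT)
msg = OscMessageBuilder(address='/translation')
msg.add_arg('OSC test')
client.send(msg.build())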

Set the translation target language to Japanese, and add code that sends the result to TouchDesigner via OSC.


def translation_continuous():
    """performs continuous speech translation of the default microphone input (Soundflower)"""
    # <TranslationContinuous>
    # set up translation parameters: source language and target languages
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region,
        speech_recognition_language='en-US',
        target_languages=('ja', 'fr'), voice_name="de-DE-Hedda")
    # Soundflower is the Mac's default input device, so the "default microphone"
    # here is actually Zoom's audio output
    audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

    # Creates a translation recognizer using the default microphone as input.
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)

    def result_callback(event_type, evt):
        """callback that prints a translation result and forwards it to TouchDesigner via OSC"""
        print("{}: {}\n\tTranslations: {}\n\tResult Json: {}".format(
            event_type, evt, evt.result.translations['ja'], evt.result.json))
        client = udp_client.UDPClient(IP, PORT)
        msg = OscMessageBuilder(address='/translation')
        msg.add_arg(evt.result.translations['ja'])
        m = msg.build()
        client.send(m)

    done = False

    # (the rest of the sample is omitted)
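
The omitted part hooks the callback up to the recognizer and runs continuous recognition. A rough sketch following the structure of the Azure sample (it assumes import time at the top of the file, and details may differ from the version you downloaded):

    def stop_cb(evt):
        """callback that signals continuous recognition to stop"""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True

    # fire result_callback for every final recognition result,
    # and stop when the session ends or is canceled
    recognizer.recognized.connect(lambda evt: result_callback('RECOGNIZED', evt))
    recognizer.session_stopped.connect(stop_cb)
    recognizer.canceled.connect(stop_cb)

    recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)
    recognizer.stop_continuous_recognition()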

Now, if you run main.py in the console folder from the terminal and play an English YouTube video, the translation results should appear in the console like this.

Screenshot 2020-05-01 at 21.45.09.jpg

Composite the translation results and the webcam feed in TouchDesigner

I've only used TouchDesigner a few times, so I'm feeling my way through this. It could probably also be implemented in oF.

Select the following nodes from the menu and connect them.

・(TOP) Video Device In: webcam input
・(TOP) Text: displays the translated subtitles
・(DAT) OSC In: updates the subtitle text when an OSC message arrives
・(TOP) Over: composites the webcam video and the subtitles
・(TOP) Syphon Spout Out: outputs the composite as a Syphon stream

By the way, Syphon is an open-source framework for sharing video between applications on macOS.

Screenshot 2020-05-01 at 21.53.42.jpg

In the OSC In DAT, set the port to the one you chose on the Python side and rewrite its callback code as follows (the translated string arrives as the first OSC argument):

def onReceiveOSC(dat, rowIndex, message, bytes, timeStamp, address, args, peer):
	# args[0] is the translated string sent to the /translation address from Python
	op("text2").par.text = args[0]
	return

You should now see something like this: Screenshot 2020-05-01 at 22.05.19.jpg

Send the TouchDesigner output to Zoom through CamTwist

Start CamTwist. Select Syphon as the video source and you should see a TouchDesigner item. This software turns the output from TouchDesigner into a virtual webcam.

Screenshot 2020-05-01 at 22.09.58.jpg

Now start Zoom. Screenshot 2020-05-01 at 22.12.32.jpg CamTwist should appear in Zoom's camera selection, and if you select it, the TouchDesigner output becomes your camera feed.

Screenshot 2020-05-01 at 20.54.58.jpg

The accuracy is pretty good. If you swap the languages in the Python code, translating from Japanese to English works just as easily. None of this is particularly difficult, but it involves a lot of software, so I'm leaving it here as a note. Please comment if you know a better way.
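For the reverse direction, the only changes are the language settings. A minimal sketch of the same config with the languages swapped (and read translations['en'] in result_callback instead of 'ja'):

    # Japanese → English instead of English → Japanese
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region,
        speech_recognition_language='ja-JP',
        target_languages=('en',))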
