[PYTHON] Speech synthesis and speech recognition by Microsoft Project Oxford

As of May 1, 2015, Microsoft has released a set of machine learning APIs as part of a project called Project Oxford.

[Face, image, and voice recognition APIs available from Microsoft's Project Oxford](http://jp.techcrunch.com/2015/05/01/20150430microsofts-project-oxford-gives-developers-access-to-facial-image-and-speech-recognition-apis/)

This time, we will take up the Speech API that performs speech synthesis and speech recognition.

The reason is that while quite a few services offer speech synthesis, speech recognition services that can be used as an API are quite limited. Most are Android / iOS SDKs, and the ones that do work on the web are browser-dependent. Google also has a Speech API, but I can't find any official documentation for it, and its limit of 50 requests per day is quite strict (as of July 2015; it doesn't seem to increase even if you pay).

Project Oxford is in Public Beta as of July 2015, and for now it is free and can be used without any restrictions (Japanese is also supported). Besides speech, there are other APIs such as face recognition, so please give them a try.

Environment setup

First, prepare the environment for using the Speech API. A Microsoft Azure account is required, so sign up for one.

Microsoft Azure

The sign-up page says the account is a one-month trial, but since the Speech API used here is free, I think it will probably keep working after the month is over.

Once you have created an account, open the portal. The Speech API appears to be purchased through the Marketplace, so press the "New" button at the bottom left and select Marketplace.

From there, select the Speech API. The Face API and others are listed as well, so I think the other Project Oxford APIs can be purchased the same way (currently FREE).

After purchase, you can refer to the key required to access the API by pressing the "Manage" button below.

At this point, the environment preparation is complete.

Using the API

As with other speech recognition services, SDKs are available, but the API can also be called as a plain Web API. You can download the SDK that suits your environment and use case from the following.

Software Development Kit (SDK)

The official documentation is below.

This time, we will describe how to use the Web API, with sample code in Python 3 (any language that can send HTTP requests, such as JavaScript / Ruby / PHP / Java, will work). Making HTTP requests with Python's standard library is quite cumbersome, so we use requests. For those who find all of this tedious and just want to use the API right away, I made a simple library, linked below, so please try it.

icoxfog417/pyoxford

Authentication

First, authenticate using the keys for API access obtained during the environment preparation earlier. There are two keys: the primary key is used as client_id and the secondary key as client_secret (the secret token). Below is sample code for authentication (an excerpt from the repository above).

    import urllib.parse

    import requests

    # Excerpt: authorize is a method of the client class in the repository.
    def authorize(self, client_id, client_secret):
        # Token endpoint of the Speech API
        url = "https://oxford-speech.cloudapp.net//token/issueToken"

        headers = {
            "Content-type": "application/x-www-form-urlencoded"
        }

        # Client credentials request, sent as a form-encoded body
        params = urllib.parse.urlencode(
            {"grant_type": "client_credentials",
             "client_id": client_id,
             "client_secret": client_secret,
             "scope": "https://speech.platform.bing.com"}
        )

        response = requests.post(url, data=params, headers=headers)
        if response.ok:
            _body = response.json()
            return _body["access_token"]
        else:
            # Raises requests.HTTPError for 4xx / 5xx responses
            response.raise_for_status()

The authentication token (_body["access_token"]) obtained here is used for the synthesis and recognition requests that follow.
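
To make it concrete what the token request carries, the sketch below builds the same form-encoded body without touching the network. The helper name build_token_body and the key values are made up for illustration; only the field names and the scope come from the authorize code above.

```python
import urllib.parse


def build_token_body(client_id, client_secret):
    """Return the form-encoded body posted to the token endpoint.

    Hypothetical helper for illustration: it mirrors the fields used in
    authorize and sends nothing over the network.
    """
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://speech.platform.bing.com",
    })


# Placeholder values, not real keys.
body = build_token_body("my-primary-key", "my-secondary-key")
print(body)

# urlencode percent-encodes reserved characters, so the scope URL can be
# sent safely in an application/x-www-form-urlencoded body.
fields = urllib.parse.parse_qs(body)
print(fields["scope"][0])
```

Decoding the body with parse_qs gives back the original values, which is a quick way to check what is actually being sent when authentication fails.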

Speech synthesis

Now, let's try speech synthesis. In the code below, the argument text is the string to synthesize, and token is the authentication token obtained earlier.

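    def text_to_speech(self, text, token, lang="en-US", female=True):
        # SSML template: {0}=language, {1}=gender, {2}=voice font, {3}=text
        template = """
        <speak version='1.0' xml:lang='{0}'>
            <voice xml:lang='{0}' xml:gender='{1}' name='{2}'>
                {3}
            </voice>
        </speak>
        """

        url = "https://speech.platform.bing.com/synthesize"
        headers = {
            "Content-type": "application/ssml+xml",
            # Ask for 16 kHz, 16-bit, mono PCM in a RIFF (WAV) container
            "X-Microsoft-OutputFormat": "riff-16khz-16bit-mono-pcm",
            "Authorization": "Bearer " + token,
            "X-Search-AppId": "07D3234E49CE426DAA29772419F436CA",
            "X-Search-ClientID": "1ECFAE91408841A480F00935DC390960",
            "User-Agent": "OXFORD_TEST"
        }
        # The voice font is fixed to an en-US female voice here;
        # change it to match lang / female when using other locales.
        name = "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)"
        data = template.format(lang, "Female" if female else "Male", name, text)

        # Encode explicitly so that non-ASCII text (e.g. Japanese) is sent as UTF-8
        response = requests.post(url, data=data.encode("utf-8"), headers=headers)

        if response.ok:
            # Binary audio data in the requested output format
            return response.content
        else:
            # raise_for_status() raises by itself; no extra raise is needed
            response.raise_for_status()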

As you can see in the template above, the request is sent in SSML, an XML format for speech. The docomo site explains it in detail. The synthesized audio is limited to 15 seconds. The result is returned as binary data, so if you save it as an audio file (.wav, etc.), you can listen to the synthesized speech.
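
One thing to watch when filling in an SSML template is that characters such as & or < in the text would make the XML invalid. Below is a minimal sketch of escaping them with the standard library. build_ssml is a hypothetical helper, not part of pyoxford, and the default voice name is simply the one used in the code above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, lang="en-US", gender="Female",
               voice_name="Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)"):
    """Build an SSML document with the text and attribute values escaped.

    Hypothetical helper for illustration. quoteattr() returns the value
    already wrapped in quotes, so the template adds none itself.
    """
    return (
        "<speak version='1.0' xml:lang={lang}>"
        "<voice xml:lang={lang} xml:gender={gender} name={name}>"
        "{text}"
        "</voice>"
        "</speak>"
    ).format(
        lang=quoteattr(lang),
        gender=quoteattr(gender),
        name=quoteattr(voice_name),
        text=escape(text),
    )


ssml = build_ssml("Tom & Jerry <3")
print(ssml)

# Parsing the result confirms it is well-formed and the text survives intact.
voice = ET.fromstring(ssml).find("voice")
print(voice.text)
```

The string returned by build_ssml could be passed as the request body in place of the formatted template, encoded as UTF-8.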

Other detailed parameters are as follows.

X-Search-AppId, X-Search-ClientID
A GUID without hyphens (such as 07D3234E49CE426DAA29772419F436CA). These are probably intended for mobile apps: AppId seems to be the ID of the application and ClientID the ID of the client (one per installation). When calling over HTTP, any GUID is fine for both.
name
The voice type, called a voice font, which is specified with something like a person's name. The SupportedLocalesVoiceFonts page of the documentation lists the available ones.
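
Since any hyphen-free GUID seems to be accepted for X-Search-AppId and X-Search-ClientID, one way to produce them is with the standard uuid module. This is only a sketch; the function name is made up for illustration.

```python
import uuid


def new_guid_without_hyphens():
    """Return a random GUID as 32 uppercase hex characters, without hyphens."""
    return uuid.uuid4().hex.upper()


app_id = new_guid_without_hyphens()
client_id = new_guid_without_hyphens()
print(app_id, client_id)
```

If ClientID really does identify an installation, it probably makes sense to generate it once and reuse it, rather than creating a new one for every request.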

Speech recognition

Next, let's try speech recognition, feeding in the synthesized audio (binary) as is. The API seems to be designed for transcribing continuously while recognizing, and the limits appear to be 10 seconds per request and 14 seconds in total (per requestid?).

    def speech_to_text(self, binary, token, lang="en-US", samplerate=8000, scenarios="ulm"):
        # The audio itself goes in the request body
        data = binary
        # Metadata about the request goes in the query string
        params = {
            "version": "3.0",
            "appID": "D4D52672-91D7-4C74-8AD8-42B1D98141A5",
            "instanceid": "1ECFAE91408841A480F00935DC390960",
            "requestid": "b2c95ede-97eb-4c88-81e4-80f32d6aee54",
            "format": "json",
            "locale": lang,
            "device.os": "Windows7",
            "scenarios": scenarios,
        }

        url = "https://speech.platform.bing.com/recognize/query?" + urllib.parse.urlencode(params)
        # samplerate must match the audio being sent
        # (the synthesis example above requests 16 kHz output)
        headers = {"Content-type": "audio/wav; samplerate={0}".format(samplerate),
                   "Authorization": "Bearer " + token,
                   "X-Search-AppId": "07D3234E49CE426DAA29772419F436CA",
                   "X-Search-ClientID": "1ECFAE91408841A480F00935DC390960",
                   "User-Agent": "OXFORD_TEST"}

        response = requests.post(url, data=data, headers=headers)

        if response.ok:
            # Take the first candidate in results
            result = response.json()["results"][0]
            return result["lexical"]
        else:
            # raise_for_status() raises by itself; no extra raise is needed
            response.raise_for_status()

The request feels slightly acrobatic, like GET and POST combined: the information about the file goes in the query parameters, and the file itself goes in the body.
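
Because the sample rate has to be declared in the Content-type header and the audio length is limited, it can help to inspect a WAV file before sending it. The sketch below reads both values with the standard wave module. It assumes uncompressed PCM audio whose header carries an accurate length; wav_info and make_silent_wav are hypothetical helpers, and the silent clip only stands in for real audio.

```python
import io
import wave


def wav_info(binary):
    """Return (sample_rate, duration_in_seconds) for PCM WAV data in memory."""
    with wave.open(io.BytesIO(binary), "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    return rate, seconds


def make_silent_wav(seconds, rate=16000):
    """Create a silent 16-bit mono WAV clip, used here only as test input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


rate, seconds = wav_info(make_silent_wav(2.0))
print(rate, seconds)

# 10 seconds is the per-request limit mentioned in this article.
if seconds > 10:
    print("audio may be too long for a single request")
```

The rate returned by wav_info could then be passed as the samplerate argument of speech_to_text.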

appID
Fixed value: "D4D52672-91D7-4C74-8AD8-42B1D98141A5"
instanceid
Device-specific ID. Same meaning as X-Search-ClientID?
requestid
A unique value for each request
device.os
The OS type of the device, such as Windows7. Any plausible value seems to be accepted.
scenarios
Either ulm or websearch. The default is ulm, but what it stands for is a mystery.
sourcerate
Sampling rate of the audio file: 8000 or 16000, with 8000 as the default. If you are confident in this value (for example, because the audio was recorded with known settings), setting trustsourcerate=true seems to make the service treat it strictly.

You can also optionally specify the following:

maxnbest
Number of recognition candidates to return. The default is 3.
result.profanitymarkup
The recognized result may contain words unsuitable for children, so setting this to 1 removes such words (the default is 1, meaning the filter is enabled).
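
Putting the parameters above together, the query string could be assembled as in the sketch below, with a fresh requestid generated for every call. build_recognize_url is a hypothetical helper; the parameter names and the fixed appID come from this article, while generating the IDs with uuid4 is my own assumption.

```python
import urllib.parse
import uuid

RECOGNIZE_URL = "https://speech.platform.bing.com/recognize/query"


def build_recognize_url(lang="en-US", scenarios="ulm", maxnbest=3, instance_id=None):
    """Build the recognition URL with a unique requestid per call."""
    params = {
        "version": "3.0",
        "appID": "D4D52672-91D7-4C74-8AD8-42B1D98141A5",  # fixed value
        # Assumption: any hyphen-free GUID is accepted as the device ID.
        "instanceid": instance_id or uuid.uuid4().hex.upper(),
        "requestid": str(uuid.uuid4()),  # new value on every call
        "format": "json",
        "locale": lang,
        "device.os": "Windows7",
        "scenarios": scenarios,
        "maxnbest": maxnbest,
        "result.profanitymarkup": 1,
    }
    return RECOGNIZE_URL + "?" + urllib.parse.urlencode(params)


url = build_recognize_url(lang="ja-JP", maxnbest=5)
print(url)
```

Passing the same instance_id on every call while letting requestid change matches the descriptions above of a per-device ID and a per-request ID.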

The response contains several recognized strings in descending order of probability. They are stored as an array in results, where lexical is the recognized string and confidence is its confidence score.
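
As a small illustration of reading that structure, the sketch below picks the candidate with the highest confidence. The exact JSON types are not shown in this article, so it assumes confidence may arrive either as a number or as a numeric string, and the sample payload is invented for the example.

```python
def best_result(body):
    """Return (lexical, confidence) for the most confident candidate, or None."""
    results = body.get("results", [])
    if not results:
        return None
    # float() accepts both numbers and numeric strings such as "0.91"
    top = max(results, key=lambda r: float(r.get("confidence", 0)))
    return top["lexical"], float(top.get("confidence", 0))


# Invented sample in the shape described above, not real API output.
sample = {
    "results": [
        {"lexical": "hello world", "confidence": "0.91"},
        {"lexical": "hello word", "confidence": "0.42"},
    ]
}

print(best_result(sample))
print(best_result({"results": []}))
```

Returning None for an empty results array avoids the IndexError that indexing results[0] directly would raise when nothing was recognized.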

That is all for the explanation. You can easily synthesize / recognize voice, so please give it a try.