[PYTHON] Convert voice to text using Azure Speech SDK

Introduction

Let's convert voice to text using Azure Speech SDK.

Development environment

Recognize voice from microphone

1. Log in to the Azure portal and create a Speech service resource.
2. Go to the resource you created and copy the key and location (region).

3. Create a Python 3.6 environment.
conda create -n py36 python=3.6
conda activate py36

4. Install the library.

pip install azure-cognitiveservices-speech

5. Create a program.

This program takes voice input once and displays the recognition result. Paste the key you copied earlier into "YourSubscriptionKey" and the location into "YourServiceRegion". To recognize Japanese, set the language to "ja-JP".

import azure.cognitiveservices.speech as speechsdk

# Creates an instance of a speech config with specified subscription key and service region.
# Replace with your own subscription key and service region (e.g., "westus").
speech_key, service_region, language = "YourSubscriptionKey", "YourServiceRegion", "ja-JP"
# Pass the language explicitly; otherwise the SDK falls back to its default (en-US).
speech_config = speechsdk.SpeechConfig(
    subscription=speech_key, region=service_region, speech_recognition_language=language)

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Say something...")

# Starts speech recognition, and returns after a single utterance is recognized. The end of a
# single utterance is determined by listening for silence at the end or until a maximum of 15
# seconds of audio is processed. The task returns the recognition text as result.
# Note: Since recognize_once() returns only a single utterance, it is suitable only for single
# shot recognition like command or query.
# For long-running multi-utterance recognition, use start_continuous_recognition() instead.
result = speech_recognizer.recognize_once()

# Checks result.
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))
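Instead of pasting the key into the source file, you may prefer to read it from environment variables so that it does not end up in version control. The following is only a minimal sketch: the variable names SPEECH_KEY, SPEECH_REGION, and SPEECH_LANGUAGE and the helper load_speech_settings are names made up for this example, and the Speech SDK does not read them by itself.

```python
import os


def load_speech_settings(env=os.environ):
    """Return (key, region, language) read from environment variables.

    SPEECH_KEY, SPEECH_REGION and SPEECH_LANGUAGE are hypothetical names
    chosen for this sketch, not something the Speech SDK defines.
    """
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION before running.")
    # Fall back to Japanese, the language used in this article.
    language = env.get("SPEECH_LANGUAGE", "ja-JP")
    return key, region, language
```

In the program above, the line that assigns speech_key, service_region, and language could then become `speech_key, service_region, language = load_speech_settings()`.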

This program takes voice input continuously and displays each recognition result. Set the key, location, and language in the same way.

import azure.cognitiveservices.speech as speechsdk
import time

# Creates an instance of a speech config with specified subscription key and service region.
# Replace with your own subscription key and service region (e.g., "westus").
speech_key, service_region, language = "YourSubscriptionKey", "YourServiceRegion", "ja-JP"
speech_config = speechsdk.SpeechConfig(
    subscription=speech_key, region=service_region, speech_recognition_language=language)

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

print("Say something...")

def recognized(evt):
    print('「{}」'.format(evt.result.text))
    # do something

def start(evt):
    print('SESSION STARTED: {}'.format(evt))

def stop(evt):
    print('SESSION STOPPED {}'.format(evt))

speech_recognizer.recognized.connect(recognized)
speech_recognizer.session_started.connect(start)
speech_recognizer.session_stopped.connect(stop)

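try:
    speech_recognizer.start_continuous_recognition()
    time.sleep(60)  # keep listening for 60 seconds; press Ctrl+C to quit earlier
except KeyboardInterrupt:
    print("bye.")
finally:
    # Stop recognition and detach the handlers on both normal exit and Ctrl+C.
    speech_recognizer.stop_continuous_recognition()
    speech_recognizer.recognized.disconnect_all()
    speech_recognizer.session_started.disconnect_all()
    speech_recognizer.session_stopped.disconnect_all()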
6. Run the following command and speak into the microphone.
python stt.py

The recognition result is displayed in the terminal.
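The "# do something" placeholder in the recognized callback is where you would process each result. As one possible sketch, the hypothetical helper make_collector below (not part of the SDK) builds a callback that stores every non-empty text in a list, so the transcript can be used after recognition ends. The SimpleNamespace objects only stand in for SDK events so that the example runs without a microphone.

```python
from types import SimpleNamespace


def make_collector():
    """Return (handler, texts); handler appends each non-empty recognized text."""
    texts = []

    def handler(evt):
        text = evt.result.text
        if text:  # skip empty results, which can occur for silence
            texts.append(text)

    return handler, texts


# Quick check with stand-in events instead of real SDK events.
handler, texts = make_collector()
handler(SimpleNamespace(result=SimpleNamespace(text="こんにちは。")))
handler(SimpleNamespace(result=SimpleNamespace(text="")))
print(texts)
```

With the real recognizer, you would connect it with `speech_recognizer.recognized.connect(handler)` and read `texts` after recognition has stopped.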

Recognize audio from audio files (.wav)

1. The installation method is the same as above.
2. Create a program.

This program reads a .wav file and displays the recognition result. Set the key and location.

import azure.cognitiveservices.speech as speechsdk

# Creates an instance of a speech config with specified subscription key and service region.
# Replace with your own subscription key and region identifier from here: https://aka.ms/speech/sdkregion
speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

# Creates an audio configuration that points to an audio file.
# Replace with your own audio filename.
audio_filename = "aboutSpeechSdk.wav"
audio_input = speechsdk.audio.AudioConfig(filename=audio_filename)

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

print("Recognizing first result...")

# Starts speech recognition, and returns after a single utterance is recognized. The end of a
# single utterance is determined by listening for silence at the end or until a maximum of 15
# seconds of audio is processed. The task returns the recognition text as result.
# Note: Since recognize_once() returns only a single utterance, it is suitable only for single
# shot recognition like command or query.
# For long-running multi-utterance recognition, use start_continuous_recognition() instead.
result = speech_recognizer.recognize_once()

# Checks result.
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized: {}".format(result.text))
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized: {}".format(result.no_match_details))
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation_details = result.cancellation_details
    print("Speech Recognition canceled: {}".format(cancellation_details.reason))
    if cancellation_details.reason == speechsdk.CancellationReason.Error:
        print("Error details: {}".format(cancellation_details.error_details))

For the audio file, use sampledata\audiofiles\aboutSpeechSdk.wav from the cognitive-services-speech-sdk repository.
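If you want to try your own recording instead of the sample file, it may help to check its format first. As far as I know, the SDK's default file input expects 16-bit mono PCM WAV at 16 kHz or 8 kHz; treat that as an assumption and confirm it in the official documentation. The sketch below uses only the standard library's wave module, and describe_wav is a helper name made up for this example.

```python
import wave


def describe_wav(path):
    """Return channel count, bits per sample, sample rate and duration of a WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "sample_rate": rate,
            "duration_sec": frames / rate,
        }
```

Calling `describe_wav("aboutSpeechSdk.wav")` also gives the duration, which is useful later when deciding how long continuous recognition needs to run.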

3. Run the following command and check the result.
python stt_from_file.py

If the key or location is incorrect, you will get an error like the following.

(py36) C:\Users\good_\Documents\PythonProjects\AzureSpeech>python stt_from_file.py
Recognizing first result...
Speech Recognition canceled: CancellationReason.Error
Error details: Connection failed (no connection to the remote host). Internal error: 1. Error details: 11001. Please check network connection, firewall setting, and the region name used to create speech factory. SessionId: 77ad7686a9d94b7882398ae8b855d903
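Because this error only appears after a connection attempt, a quick local check before creating the recognizer can catch the most common mistakes. This is a rough sketch, not an official validation: find_config_problems is a hypothetical helper, and the rule that region identifiers are lowercase with no spaces (like "westus") is an assumption based on the identifiers I have seen.

```python
PLACEHOLDERS = {"YourSubscriptionKey", "YourServiceRegion"}


def find_config_problems(key, region):
    """Return a list of human-readable problems; an empty list means nothing obvious."""
    problems = []
    if not key or key in PLACEHOLDERS:
        problems.append("subscription key is empty or still the placeholder")
    if not region or region in PLACEHOLDERS:
        problems.append("region is empty or still the placeholder")
    elif region != region.strip().lower() or " " in region:
        # Assumption: identifiers look like "westus", not a display name like "West US".
        problems.append("region should be an identifier such as 'westus'")
    return problems
```

For example, `find_config_problems(speech_key, service_region)` could be printed at the top of stt_from_file.py. It cannot tell whether a key is actually valid, only whether the values look plausible.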

With the correct key and location, the recognized text is displayed.

The audio is 52 seconds long, but recognition ends after the first sentence, because recognize_once() returns after a single utterance.

4. To recognize the whole file continuously, use the following program.

import azure.cognitiveservices.speech as speechsdk
import time

# Creates an instance of a speech config with specified subscription key and service region.
# Replace with your own subscription key and region identifier from here: https://aka.ms/speech/sdkregion
speech_key, service_region = "YourSubscriptionKey", "YourServiceRegion"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)

# Creates an audio configuration that points to an audio file.
# Replace with your own audio filename.
audio_filename = "aboutSpeechSdk.wav"
audio_input = speechsdk.audio.AudioConfig(filename=audio_filename)

# Creates a recognizer with the given settings
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

print("Recognizing...")

def recognized(evt):
    print('「{}」'.format(evt.result.text))
    # do something

def start(evt):
    print('SESSION STARTED: {}'.format(evt))

def stop(evt):
    print('SESSION STOPPED {}'.format(evt))

speech_recognizer.recognized.connect(recognized)
speech_recognizer.session_started.connect(start)
speech_recognizer.session_stopped.connect(stop)

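try:
    speech_recognizer.start_continuous_recognition()
    time.sleep(60)  # long enough for the 52-second sample file; press Ctrl+C to quit earlier
except KeyboardInterrupt:
    print("bye.")
finally:
    # Stop recognition and detach the handlers on both normal exit and Ctrl+C.
    speech_recognizer.stop_continuous_recognition()
    speech_recognizer.recognized.disconnect_all()
    speech_recognizer.session_started.disconnect_all()
    speech_recognizer.session_stopped.disconnect_all()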

5. Run it again.

This time, speech is recognized continuously through the whole file!
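The fixed 60-second sleep works for the 52-second sample, but it is too short for longer files and wasteful for shorter ones. One alternative, sketched here under the assumption that the recognizer fires session_stopped or canceled when the file has been fully processed, is to wait on a threading.Event instead. The helper name recognize_until_stopped is made up for this example; it only uses the recognizer's connect, start_continuous_recognition, and stop_continuous_recognition calls.

```python
import threading


def recognize_until_stopped(recognizer, timeout=None):
    """Run continuous recognition until the session stops or is canceled.

    Returns True if a stop or cancel event arrived, False if the timeout expired.
    """
    done = threading.Event()
    # Either event means there is nothing more to recognize.
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())
    recognizer.start_continuous_recognition()
    try:
        return done.wait(timeout)
    finally:
        recognizer.stop_continuous_recognition()
```

In the program above, the try block could then be replaced by `recognize_until_stopped(speech_recognizer, timeout=120)`, after connecting the recognized handler as before.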

Thank you for your hard work.

References

- [Quick Start: Recognize voice from microphone](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/speech-to-text-from-microphone?tabs=dotnet%2Cx-android%2Clinux%2Cjava-runtime%2Cwindowsinstall&pivots=programming-language-python)
- [Quick Start: Recognize voice from audio file](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/speech-to-text-from-file?tabs=linux%2Cbrowser%2Cwindowsinstall&pivots=programming-language-python#sample-code)
- Convert speech to text using Azure Speech Service (STT)
