[PYTHON] Voice authentication & transcription with Raspberry Pi 3 x Julius x Watson (Speech to Text)

Background

I recently got into American football (NFL), but I can't understand English. Even if I can't follow the audio, maybe I could make sense of it in writing? With that in mind, I tried transcribing player interviews using Raspberry Pi 3 × Julius × Watson (Speech to Text).

What I want to do

[Image: img20170324_14192489.jpg]

The overall idea looks like this. Reference: Getting robots to listen: Using Watson’s Speech to Text service

Environment

Prerequisites

The following are assumed to be ready. For reference, I list the pages I consulted for each.

- Microphone enabled on Raspberry Pi 3
  - Easy to do! Conversation with Raspberry pi using speech recognition and speech synthesis
  - Try voice recognition and voice synthesis with Raspberry Pi 2
- Julius installed on Raspberry Pi 3
  - Voice recognition by Julius: utilization of a domestic open source library
- User registration for Watson (all services appear to be free for one month after registration)

Procedure

  1. Talk to Raspberry Pi 3 using Julius (images ①②)
  2. Record voice (image ③)
  3. Connect from Raspberry Pi 3 to Watson (Speech to Text) (image ④)
  4. Transcribe a YouTube player interview with Raspberry Pi 3 x Watson (image ⑤)

■ Talk to Raspberry Pi 3 with Julius

Julius appears to support both a reading file and a grammar file to speed up recognition. After trying both, I decided to use a grammar file this time.

For the comparison results, see "Raspberry Pi 3 x Julius (reading file and grammar file)".

1.1 Overview of voice analysis processing

If you start Julius in module mode, the recognition result is returned as XML. For example, saying "Start Watson" produces the following XML.

<RECOGOUT>
  <SHYPO RANK="1" SCORE="-2903.453613" GRAM="0">
    <WHYPO WORD="Watson" CLASSID="Watson" PHONE="silB w a t o s o N silE" CM="0.791"/>
  </SHYPO>
</RECOGOUT>
<RECOGOUT>
  <SHYPO RANK="1" SCORE="-8478.763672" GRAM="0">
    <WHYPO WORD="Watson started" CLASSID="Watson started" PHONE="silB w a t o s o N k a i sh i silE" CM="1.000"/>
  </SHYPO>
</RECOGOUT>

So, for each spoken word, the script parses the XML and runs the corresponding process. (It's not pretty, since everything is hard-coded ...)

# Decide what to do based on the recognized words
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson
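
Since the word handling above is hard-coded, one possible alternative is a lookup table that maps each recognized word to a reply and a flag. The following is only a sketch, written for Python 3 (the article's listings are Python 2). The names `WORD_ACTIONS` and `decide` are hypothetical and not part of the original script.

```python
# Hypothetical table: recognized word -> (reply to print, hand over to Watson?)
WORD_ACTIONS = {
    "Raspberry pi": ("Yes. What is it?", False),
    "Watson": ("Roger that. prepare.", True),
}


def decide(recognized):
    """Print the reply for each known word; return True if any asks for Watson.

    `recognized` can be any iterable of words, including the
    {word: confidence} dictionary built from the Julius XML.
    """
    start_watson = False
    for word in recognized:
        action = WORD_ACTIONS.get(word)
        if action is None:
            continue  # unknown words are ignored, as in the original
        reply, wants_watson = action
        print(reply)
        start_watson = start_watson or wants_watson
    return start_watson
```

Adding a new command then means adding one entry to the table instead of another `if` branch.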

1.2 Start Julius server and connect to Julius server from client side

Julius is started as a subprocess, and the client then connects to it over a socket.

# Start the Julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # Suppress log output with the -nolog option
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    # Give Julius time to start before the client connects
    time.sleep(3.0)
    return p

# Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500

# Connect to Julius
def create_socket():
    logging.debug("create_socket")
    # Connect to Julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))

    return sock
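
The listing above waits a fixed three seconds before connecting. As a hedged alternative, the client could retry the connection until Julius accepts it. This is a sketch for Python 3, not the author's method; `connect_with_retry` and its `attempts` and `delay` parameters are hypothetical names chosen for illustration.

```python
import socket
import time


def connect_with_retry(host, port, attempts=10, delay=0.5):
    """Try to connect up to `attempts` times; return the connected socket."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            # create_connection resolves the address and connects in one call
            sock = socket.create_connection((host, port), timeout=5.0)
            # Back to blocking mode, so later recv() calls can wait for speech
            sock.settimeout(None)
            return sock
        except OSError as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```

With this approach the startup time adapts to how long Julius actually takes to open its port, at the cost of a few failed attempts in the log.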

1.3 Voice analysis (XML analysis)

As mentioned above, Julius returns XML, so the script extracts everything from the <RECOGOUT> tag to the </RECOGOUT> tag and parses it. Note: if an <s> tag is present, XML parsing fails, so the <s> and </s> markers are stripped out first.

# Extract the RECOGOUT block from the data received from Julius
# (tag_name is currently unused; the RECOGOUT tag is hard-coded)
def extract_xml(tag_name, xml_in, xml_buff, line):
    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line
        final = True
    else:
        if xml_in:
            xml_buff += escape(line)
            xml = True

    return xml, xml_buff, final

# Remove <s> and </s> (they caused an error during XML parsing)
def escape(line):
    cleaned = line.replace("<s>", '')
    cleaned = cleaned.replace('</s>', '')
    return cleaned

# Parse the XML of the Julius recognition result
def parse_recogout(xml_data):

    # Get the recognized words and
    # store them in a dictionary (word -> confidence)
    word_list = []
    score_list = []
    xml_list = {}
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD")
        score = i.get("CM")
        if "[s]" not in word:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list
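
To make the flow of 1.3 easier to follow, here is a compact sketch of the same idea for Python 3: collect the lines between <RECOGOUT> and </RECOGOUT>, strip the <s> markers, parse the block, and build a word-to-confidence dictionary. The function names `collect_recogout` and `words_with_confidence` are hypothetical, and the assumption that an <s> marker shows up inside attribute values is taken from the note above rather than verified against a particular Julius version.

```python
import xml.etree.ElementTree as ET


def collect_recogout(lines):
    """Yield one parsed <RECOGOUT> element per complete block found in lines."""
    buff = None  # None means we are outside a RECOGOUT block
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<RECOGOUT>"):
            buff = [stripped]
        elif stripped.startswith("</RECOGOUT>") and buff is not None:
            buff.append(stripped)
            yield ET.fromstring("\n".join(buff))
            buff = None
        elif buff is not None:
            # Drop the sentence markers that would break the XML parser
            buff.append(stripped.replace("<s>", "").replace("</s>", ""))


def words_with_confidence(recogout):
    """Return {word: confidence} for every WHYPO with a non-empty word."""
    result = {}
    for whypo in recogout.findall(".//WHYPO"):
        word = whypo.get("WORD")
        if word:
            result[word] = float(whypo.get("CM"))
    return result
```

Feeding it the lines of the first XML sample in 1.1 gives a dictionary with the single key "Watson".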

1.4 Overall

It's a little long, but the whole thing from 1.1 to 1.3 looks like this.

import cStringIO
import logging
import socket
import subprocess
import time
import traceback
from itertools import izip
from xml.etree.ElementTree import fromstring

# Extract the RECOGOUT block from the data received from Julius
# (tag_name is currently unused; the RECOGOUT tag is hard-coded)
def extract_xml(tag_name, xml_in, xml_buff, line):

    xml = False
    final = False
    if line.startswith("<RECOGOUT>"):
        xml = True
        xml_buff = line
    elif line.startswith("</RECOGOUT>"):
        xml_buff += line
        final = True
    else:
        if xml_in:
            xml_buff += escape(line)
            xml = True

    return xml, xml_buff, final

# Remove <s> and </s> (they caused an error during XML parsing)
def escape(line):
    cleaned = line.replace("<s>", '')
    cleaned = cleaned.replace('</s>', '')

    return cleaned

# Parse the XML of the Julius recognition result
def parse_recogout(xml_data):

    # Get the recognized words and
    # store them in a dictionary (word -> confidence)
    word_list = []
    score_list = []
    xml_list = {}
    for i in xml_data.findall(".//WHYPO"):
        word = i.get("WORD")
        score = i.get("CM")
        if "[s]" not in word:
            word_list.append(word)
            score_list.append(score)
    xml_list = dict(izip(word_list, score_list))
    return xml_list

# Decide what to do based on the recognized words
def decision_word(xml_list):
    watson = False
    for key, value in xml_list.items():
        if u"Raspberry pi" == key:
            print u"Yes. What is it?"
        if u"Watson" == key:
            print u"Roger that. prepare."
            watson = True
    return watson

# Julius server
JULIUS_HOST = "localhost"
JULIUS_PORT = 10500

# Start the Julius server
def invoke_julius():
    logging.debug("invoke_julius")
    # Suppress log output with the -nolog option.
    # Later, the log could be written to a file with the -logfile option instead.
    reccmd = ["/usr/local/bin/julius", "-C", "./julius-kits/grammar-kit-v4.1/hmm_mono.jconf", "-input", "mic", "-gram", "julius_watson","-nolog"]
    p = subprocess.Popen(reccmd, stdin=None, stdout=None, stderr=None)
    # Give Julius time to start before the client connects
    time.sleep(3.0)
    return p

# Stop the Julius server
def kill_process(julius):
    logging.debug("kill_process")
    julius.kill()
    time.sleep(3.0)

# Connect to Julius
def create_socket():
    logging.debug("create_socket")
    # Connect to Julius over TCP/IP
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((JULIUS_HOST, JULIUS_PORT))

    return sock

# Close the connection to Julius
def close_socket(sock):
    logging.debug("close_socket")
    sock.close()

# Main processing
def main():
    # Start the Julius server
    julius = invoke_julius()
    # Connect to Julius
    sock = create_socket()

    julius_listening = True
    bufsize = 4096

    xml_buff = ""
    xml_in = False
    xml_final = False
    watson = False

    while julius_listening:
        # Get recognition results from Julius
        data = cStringIO.StringIO(sock.recv(bufsize))
        # Read one line from the results
        line = data.readline()
        while line:
            # Extract and process only the lines that belong to
            # the RECOGOUT tag (the speech recognition result).
            xml_in, xml_buff, xml_final = extract_xml('RECOGOUT', xml_in, xml_buff, line)
            if xml_final:
                # Parse the XML
                logging.debug(xml_buff)
                xml_data = fromstring(xml_buff)
                watson = decision_word(parse_recogout(xml_data))
                xml_final = False
                # If the result is "Watson", move on to speech recognition with Watson
                if watson:
                    julius_listening = False  # Done with Julius
                    break
            # Read the next line from the results
            line = data.readline()

    # Close the socket
    close_socket(sock)
    # Stop Julius. Speech to Text records with arecord, so Julius has to
    # release the microphone first (otherwise the two collide on the device).
    kill_process(julius)
    if watson:
        # "Watson" was recognized, so run the recording and Watson steps
        # (images 3 and 4 in the procedure)
        speechToText()

def initial_setting():
    # Log settings
    logging.basicConfig(filename='websocket_julius2.log', filemode='w', level=logging.DEBUG)
    logging.debug("initial_setting")

if __name__ == "__main__":
    try:
        # Initialization
        initial_setting()
        # Main processing
        main()

    except Exception as e:
        print "error occurred", e, traceback.format_exc()
    finally:
        print "websocket_julius2...end"
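
One thing to watch in the main loop: `sock.recv` returns whatever bytes have arrived so far, so a line can be split across two calls, and a check such as `startswith("<RECOGOUT>")` would then miss it. A possible way to handle this is to keep the unfinished tail and only hand complete lines to the parser. The sketch below is for Python 3; the class name `LineBuffer` is hypothetical, and UTF-8 is an assumed encoding (the actual encoding of the Julius output depends on its configuration).

```python
class LineBuffer:
    """Accumulate raw bytes and hand back only complete, decoded lines."""

    def __init__(self, encoding="utf-8"):
        self._pending = b""        # bytes of the last, still unfinished line
        self._encoding = encoding  # assumed encoding of the incoming data

    def feed(self, chunk):
        """Add a chunk from recv() and return the list of complete lines."""
        self._pending += chunk
        # Everything before the final newline is complete; the rest is kept
        *complete, self._pending = self._pending.split(b"\n")
        return [line.decode(self._encoding, errors="replace")
                for line in complete]
```

In the loop, each result of `sock.recv(bufsize)` would go through `feed`, and the returned lines would be passed to the RECOGOUT extraction one by one.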

■ Voice recording

The recording process (the arecord command) runs in a separate thread. Each chunk of recorded binary data is sent to Watson as soon as it is read, so that the voice can be converted to text in real time. (Note: the data exchange with Watson is described later.)

def opened(self):
    self.stream_audio_thread = threading.Thread(target=self.stream_audio)
    self.stream_audio_thread.start()

# Start the recording process
def stream_audio(self):
    # Suppress messages with the -q option
    reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
    p = subprocess.Popen(reccmd, stdout=subprocess.PIPE)
    print 'Ready. Please speak.'
    while self.listening:
        data = p.stdout.read(1024)
        try:
            # Pass the binary data to Watson
            self.send(bytearray(data), binary=True)
        except ssl.SSLError: pass
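
The recording loop above never checks for the end of the recorder's output and never stops the arecord process. As a hedged sketch (Python 3, not the author's code), the chunked read can be wrapped in a generator that ends at EOF and cleans up the child process. `stream_chunks` is a hypothetical helper name.

```python
import subprocess


def stream_chunks(cmd, chunk_size=1024):
    """Run cmd and yield its stdout in chunks until the process closes the pipe."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        while True:
            data = proc.stdout.read(chunk_size)
            if not data:
                break  # EOF: the recorder exited
            yield data
    finally:
        # Make sure the child does not keep holding the microphone
        proc.stdout.close()
        if proc.poll() is None:
            proc.terminate()
        proc.wait()
```

With the article's command this would be used as `for chunk in stream_chunks(["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]): ...`, sending each chunk to Watson inside the loop and breaking out once listening should stop.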

■ Connect from Raspberry Pi 3 to Watson (Speech to Text)

Use the WebSocket version of Speech to Text to convert voice to text in real time. For more on Speech to Text, see also "I tried Watson Speech to Text".

I implemented this with reference to the sample source in "Getting robots to listen: Using Watson’s Speech to Text service".

3.1 Connect to Watson (Speech to Text)

Connect to Watson using the Watson library (watson-developer-cloud-0.23.0).

class SpeechToTextClient(WebSocketClient):
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")

        self.listening = False
        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except: print "Failed to open WebSocket."
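
The constructor above builds an HTTP Basic authentication header by hand. The same step can be sketched in isolation for Python 3, where `base64.encodestring` is no longer available and `base64.b64encode` works on bytes. `basic_auth_header` is a hypothetical helper name, and the credentials in the usage are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Return an ("Authorization", "Basic ...") tuple for the given credentials."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    # b64encode does not insert line breaks, so no newline stripping is needed
    token = base64.b64encode(credentials).decode("ascii")
    return ("Authorization", "Basic " + token)
```

The returned tuple has the same shape as the entry passed in the `headers` list above.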

3.2 Connect to Watson with WebSocket

    # WebSocket (connection opened)
    def opened(self):
        self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
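
The start message above is a hand-written JSON string. A small sketch (Python 3) of building the same message with `json.dumps`, which avoids quoting mistakes when the settings change. The function name `build_start_message` and its parameters are hypothetical; the keys and values are the ones used in the listing above.

```python
import json


def build_start_message(rate=16000, inactivity_timeout=10):
    """Return the JSON text of the "start" action with the article's settings."""
    return json.dumps({
        "action": "start",
        "content-type": "audio/l16;rate={}".format(rate),
        "continuous": True,
        "inactivity_timeout": inactivity_timeout,
        "interim_results": True,
    })
```

The result would be passed to `self.send(...)` in `opened` in place of the literal string.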

3.3 Speech recognition with Watson

The output (audio data) of the arecord command running in the thread described above is sent to Watson. It's a little long, but putting together "2. Voice recording" and "3. Connect from Raspberry Pi 3 to Watson (Speech to Text)" gives the following.

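# Assumption: WebSocketClient is the ws4py threaded client used by the
# referenced sample; the article does not show its imports.
import base64
import ssl
import subprocess
import threading
import time

from ws4py.client.threadedclient import WebSocketClient

class SpeechToTextClient(WebSocketClient):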
    def __init__(self):
        ws_url = "wss://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
        username = "XXXXXXX"
        password = "XXXXXXX"
        auth_string = "%s:%s" % (username, password)
        base64string = base64.encodestring(auth_string).replace("\n", "")

        self.listening = False
        try:
            WebSocketClient.__init__(self, ws_url,headers=[("Authorization", "Basic %s" % base64string)])
            self.connect()
        except: print "Failed to open WebSocket."

    # WebSocket (connection opened)
    def opened(self):
        self.send('{"action":"start","content-type": "audio/l16;rate=16000","continuous":true,"inactivity_timeout":10,"interim_results":true}')
        self.stream_audio_thread = threading.Thread(target=self.stream_audio)
        self.stream_audio_thread.start()

    # Start the recording process
    def stream_audio(self):
        # Wait until self.listening becomes True (it is not set anywhere in this
        # listing, so something else, e.g. received_message, has to set it)
        while not self.listening:
            time.sleep(0.1)

        # Suppress messages with the -q option
        reccmd = ["arecord", "-f", "S16_LE", "-r", "16000", "-t", "raw", "-q"]
        p = subprocess.Popen(reccmd, stdout=subprocess.PIPE)
        print 'Ready. Please speak.'
        while self.listening:
            data = p.stdout.read(1024)
            try:
                # Pass the binary data to Watson
                self.send(bytearray(data), binary=True)
            except ssl.SSLError: pass

■ Transcribe YouTube player interviews with Raspberry Pi 3 x Watson

4.1 Implementation of received_message

When connected over WebSocket, the analysis results from Watson appear to arrive in the received_message event.

    # WebSocket (message received)
    def received_message(self, message):
        print message 

4.2 Watson analysis results

The analysis result appears to be returned as a JSON object.
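
Rather than printing the raw message, the JSON can be decoded to pull out the transcript. The sketch below (Python 3) assumes the message shapes of the Speech to Text WebSocket interface as I understand them: a state message such as `{"state": "listening"}`, and result messages with a `results` list whose items carry `alternatives` and a `final` flag. These shapes are not shown in the article, so treat them as assumptions, and `parse_watson_message` as a hypothetical helper name. The state message would also be a natural place to set the `self.listening` flag that `stream_audio` in 3.3 waits for.

```python
import json


def parse_watson_message(raw):
    """Return (is_listening, transcripts) for one JSON message from the service.

    transcripts is a list of (text, is_final) tuples, one per result entry.
    """
    message = json.loads(raw)
    # Assumed state message: {"state": "listening"}
    is_listening = message.get("state") == "listening"
    transcripts = []
    # Assumed result message: {"results": [{"alternatives": [...], "final": ...}]}
    for result in message.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            text = alternatives[0].get("transcript", "").strip()
            transcripts.append((text, bool(result.get("final", False))))
    return is_listening, transcripts
```

Inside `received_message`, the raw message would first be converted to a string and then passed to this function; interim results (where the flag is false) could be shown in place and replaced once the final version arrives.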

In this way, I was able to convert the voice into text in real time.

[Image: キャプチャ.PNG (screen capture)]

Postscript (2017/4/16): I made a video of it. https://youtu.be/IvWaHISF6nY

Finally

My impression is that recognition does not work well when several people are talking or when there is music in the background. Still, I found it simply amazing that speech turns into text in real time. I want to keep playing with speech recognition.
