Julius seems to offer two ways to speed up recognition: a reading (yomi) file and a grammar file. I will try both to see which one is easier to use.
This is an attempt to use Julius speech recognition to build a Raspberry Pi robot that answers your questions.
As shown in the figure below, the final goal is speech recognition and transcription with Raspberry Pi 3 x Julius x Watson (Speech to Text). (http://qiita.com/nanako_ut/items/1e044eb494623a3961a5)
This time, we will verify Julius in parts ① and ② in the figure below.
The following are assumed to be ready. For reference, I list the sites I referred to.
- Enable the microphone on Raspberry Pi 3
  - Easy to do! Conversation with Raspberry pi using speech recognition and speech synthesis
  - Try voice recognition and voice synthesis with Raspberry Pi 2
- Julius installation on Raspberry Pi 3
  - Voice recognition by Julius - Utilization of domestic open source library
$ cat julius_watson.yomi
Raspberry pi
Watson started
Watson finished Watson Shuryo
Test test
Start
End end
End End
Reply to me
Say something
Regards
iconv -f utf8 -t eucjp ~/julius_watson.yomi | ~/julius-kits/dictation-kit-v4.3.1-linux/bin/yomi2voca.pl > ~/julius-kits/dictation-kit-v4.3.1-linux/julius_watson.dic
$ cat ~/julius-kits/dictation-kit-v4.3.1-linux/julius_watson.jconf
-w julius_watson.dic ← Specify the .dic file converted to dictionary format above
-v model/lang_m/bccwj.60k.htkdic
-h model/phone_m/jnas-tri-3k16-gid.binhmm
-hlist model/phone_m/logicalTri
-lmp 8.0 -2.0
-lmp2 8.0 -2.0
-b 1500
-b2 100
-s 500
-m 10000
-n 30
-output 1
-input mic
-zmeanframe
-rejectshort 800
-charconv EUC-JP UTF-8
$ cd julius-kits/dictation-kit-v4.3.1-linux
~/julius-kits/dictation-kit-v4.3.1-linux $ julius -C julius_watson.jconf -demo
STAT: include config: julius_watson.jconf
WARNING: m_chkparam: "-lmp" only for N-gram, ignored
WARNING: m_chkparam: "-lmp2" only for N-gram, ignored
STAT: jconf successfully finalized
~ (output omitted) ~
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean normalization for real-time decoding: *
* NOTICE: The first input may not be recognized, since *
* no initial mean is available on startup. *
*************************************************************
Stat: adin_oss: device name = /dev/dsp (application default)
Stat: adin_oss: sampling rate = 16000Hz
Stat: adin_oss: going to set latency to 50 msec
Stat: adin_oss: audio I/O Latency = 32 msec (fragment size = 512 samples)
STAT: AD-in thread created
pass1_best:Regards ← Noise
sentence1:Regards ← Noise
pass1_best:Regards ← Say "Regards"
sentence1:Regards
pass1_best:Watson ← Speak "Watson"
sentence1:Watson
pass1_best:Watson start ← Say "Watson start"
sentence1:Watson started
pass1_best:Raspberry Pi ← Say "I want to be hungry"
sentence1:Raspberry pi
<<< please speak >>>^C
Even noise is recognized as "Regards"... Is every input that cannot be matched interpreted as the last entry in the reading file?
$ cat julius_watson.voca
% WATSON
Watson w a t s o n
Raspberry pi r a z u p a i
NFL a m e f u t o
Electric d e n k i
% WO
W o
% PLEASE
Put on t u k e t e
Erase k e sh i t e
% NS_B
[s] silB
% NS_E
[s] silE
$ cat julius_watson.grammar
S : NS_B WATSON_ PLEASE NS_E
WATSON_ : WATSON
WATSON_ : WATSON WO
cp julius-4.3.1/gramtools/mkdfa/mkfa-1.44-flex/mkfa julius-4.3.1/gramtools/mkdfa/mkfa
cp julius-4.3.1/gramtools/dfa_minimize/dfa_minimize julius-4.3.1/gramtools/mkdfa/dfa_minimize
sudo julius-4.3.1/gramtools/mkdfa/mkdfa.pl julius_watson
julius_watson.grammar has 3 rules
julius_watson.voca has 5 categories and 9 words
---
Now parsing grammar file
Now modifying grammar to minimize states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[6/6]
Now making deterministic finite automaton[6/6]
Now making triplet list[6/6]
5 categories, 6 nodes, 6 arcs
-> minimized: 6 nodes, 6 arcs
---
generated: julius_watson.dfa julius_watson.term julius_watson.dict
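The category and word counts reported by mkdfa.pl can be cross-checked by hand. The following is a minimal Python 3 sketch, assuming the usual .voca layout in which a line beginning with % opens a category and every other non-empty line is a word followed by its phonemes (treating # lines as comments is also an assumption here). The function name count_voca and the sample text are made up for illustration and are not part of Julius.

```python
def count_voca(text):
    """Return (number of categories, number of words) found in .voca-style text."""
    categories = 0
    words = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        if line.startswith("%"):
            categories += 1  # "% NAME" opens a new category
        else:
            words += 1  # any other line is "word phoneme phoneme ..."
    return categories, words


# Hypothetical sample, not the author's actual file
sample = """\
% WATSON
watson w a t o s o N
% NS_B
[s] silB
% NS_E
[s] silE
"""

print(count_voca(sample))
```

If the numbers differ from what mkdfa.pl prints, a category header is probably missing or misspelled in the .voca file.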
Stat: adin_oss: device name = /dev/dsp (application default)
Stat: adin_oss: sampling rate = 16000Hz
Stat: adin_oss: going to set latency to 50 msec
Stat: adin_oss: audio I/O Latency = 32 msec (fragment size = 512 samples)
STAT: AD-in thread created
pass1_best: [s]Watson started[s]← Speak "Watson"
pass1_best_wordseq: 3 0 2 4
pass1_best_phonemeseq: silB | w a t o s n | k a i s i | silE
pass1_best_score: -3108.902100
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 23 generated, 23 pushed, 5 nodes popped in 122
sentence1: [s]Watson started[s]
wseq1: 3 0 2 4
phseq1: silB | w a t o s n | k a i s i | silE
cmscore1: 1.000 0.482 0.476 1.000
score1: -3108.899414
pass1_best: [s]Raspberry Pi[s]← Say "Raspberry Pi"
pass1_best_wordseq: 3 0 2 4
pass1_best_phonemeseq: silB | r a z u p a i | s i t e | silE
pass1_best_score: -3268.691406
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 23 generated, 23 pushed, 5 nodes popped in 132
sentence1: [s]Raspberry Pi[s]
wseq1: 3 0 2 4
phseq1: silB | r a z u p a i | s i t e | silE
cmscore1: 1.000 0.959 0.691 1.000
score1: -3268.694824
<<< please speak >>>
When I say a single word, does Julius fill in the rest so that it matches the noun + verb pattern? That does not feel right. It is a little troubling that it makes up for what I did not say...
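One possible way to catch such filled-in words is to look at the per-word confidence values on the cmscore1 line. The sketch below, in Python 3, parses those lines from a saved log and rejects a result when any word falls below a threshold. The function names and the threshold of 0.7 are assumptions chosen for illustration, not values recommended by Julius.

```python
def parse_cmscores(log_text):
    """Collect the per-word confidence list from every 'cmscore1:' line."""
    scores = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("cmscore1:"):
            values = line.split(":", 1)[1].split()
            scores.append([float(v) for v in values])
    return scores


def is_reliable(cmscore, threshold=0.7):
    """Accept a result only if every word reaches the (hypothetical) threshold."""
    return bool(cmscore) and min(cmscore) >= threshold


# Lines taken from the log output above
log = """\
sentence1: [s]Watson started[s]
cmscore1: 1.000 0.482 0.476 1.000
sentence1: [s]Raspberry Pi[s]
cmscore1: 1.000 0.959 0.691 1.000
"""

for cmscore in parse_cmscores(log):
    print(cmscore, "accept" if is_reliable(cmscore) else "reject")
```

With this threshold both results above would be rejected, because each contains a word scored below 0.7. A suitable threshold has to be found by experiment.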
It seems that other programs can connect to Julius when it is started with the -module option. So I will start Julius with the -module option, connect to the Julius server from Python, and print the recognition results.
This is the program that connects to Julius and prints the recognition results. I copied it from someone else's code, but I have lost track of the source. I will add the reference as soon as I find it.
Julius_test.py
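#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script: connect to Julius in module mode and print each <RECOGOUT> block.
import socket
import cStringIO

host = 'XXX.XXX.XX.XX'  # ← Enter the local host address
port = 10500  # default port of Julius module mode

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((host, port))

xml_buff = ""
in_recoguout = False

try:
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break  # Julius closed the connection
        data = cStringIO.StringIO(chunk)
        line = data.readline()
        while line:
            if line.startswith("<RECOGOUT>"):
                # Start of a recognition result: begin buffering
                in_recoguout = True
                xml_buff += line
            elif line.startswith("</RECOGOUT>"):
                # End of the result: print the buffered block and reset
                xml_buff += line
                print xml_buff
                in_recoguout = False
                xml_buff = ""
            else:
                if in_recoguout:
                    xml_buff += line
            line = data.readline()
finally:
    sock.close()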
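The script above handles each recv() result on its own, so a tag or line that happens to be split across two reads would be missed. The following is a rough Python 3 sketch of one way to buffer across reads. The generator name iter_recogout is made up for illustration, and the chunks are plain strings here so the logic can be tried without a running Julius server.

```python
OPEN_TAG = "<RECOGOUT>"
CLOSE_TAG = "</RECOGOUT>"


def iter_recogout(chunks):
    """Yield each complete <RECOGOUT>...</RECOGOUT> block from an iterable of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            start = buffer.find(OPEN_TAG)
            if start == -1:
                # Keep only a tail that could be the beginning of a split opening tag
                buffer = buffer[-(len(OPEN_TAG) - 1):]
                break
            end = buffer.find(CLOSE_TAG, start)
            if end == -1:
                buffer = buffer[start:]  # block is incomplete: wait for more data
                break
            end += len(CLOSE_TAG)
            yield buffer[start:end]
            buffer = buffer[end:]


# Simulated reads in which both tags are split across chunk boundaries
chunks = ["<RECOG", 'OUT>\n<SHYPO RANK="1">\n', "</SHYPO>\n</RECOG", "OUT>\n"]
for block in iter_recogout(chunks):
    print(block)
```

With a real connection, the chunks would come from repeated sock.recv(4096) calls decoded to text. If multi-byte characters can be split across reads, an incremental decoder would be needed as well.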
First, start Julius in module mode
~/julius-kits/dictation-kit-v4.3.1-linux $ julius -C main.jconf -C am-gmm.jconf -module
Execution result
$ python Julius_test.py
<RECOGOUT>
<SHYPO RANK="1" SCORE="-5520.531738">
<WHYPO WORD="" CLASSID="<s>" PHONE="silB" CM="0.200"/>
<WHYPO WORD="voice" CLASSID="voice+noun" PHONE="o N s e:" CM="0.187"/>
<WHYPO WORD="Authentication" CLASSID="Authentication+noun" PHONE="n i N sh o:" CM="0.074"/>
<WHYPO WORD="test" CLASSID="test+noun" PHONE="t e s u t o" CM="0.273"/>
<WHYPO WORD="。" CLASSID="</s>" PHONE="silE" CM="1.000"/>
</SHYPO>
</RECOGOUT>
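Note that the output above contains CLASSID="<s>", an unescaped < inside an attribute value, which a strict XML parser rejects. As a minimal sketch in Python 3, the words and confidence values can instead be pulled out with a regular expression. The function name extract_words is an assumption for illustration, and the pattern assumes WORD comes before CM within each WHYPO element, as in the output shown here.

```python
import re

# Capture the WORD and CM attributes of each WHYPO element.
# ".*?" does not cross line breaks, so each match stays within one element line.
WHYPO_RE = re.compile(r'<WHYPO\s+WORD="([^"]*)".*?CM="([0-9.]+)"')


def extract_words(recogout_text):
    """Return a list of (word, confidence) pairs from one RECOGOUT block."""
    return [(word, float(cm)) for word, cm in WHYPO_RE.findall(recogout_text)]


# Shortened copy of the output shown above
sample = '''<RECOGOUT>
  <SHYPO RANK="1" SCORE="-5520.531738">
    <WHYPO WORD="" CLASSID="<s>" PHONE="silB" CM="0.200"/>
    <WHYPO WORD="test" CLASSID="test+noun" PHONE="t e s u t o" CM="0.273"/>
    <WHYPO WORD="。" CLASSID="</s>" PHONE="silE" CM="1.000"/>
  </SHYPO>
</RECOGOUT>'''

for word, cm in extract_words(sample):
    print(repr(word), cm)
```

The pairs can then be filtered by confidence before the text is passed on to later processing.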
Here is a summary of the differences when running in module mode.
■ Reading file
cd julius-kits/dictation-kit-v4.3.1-linux
julius -C julius_watson.jconf -module
■ Grammar file
julius -C julius-kits/grammar-kit-v4.1/hmm_mono.jconf -input mic -gram julius_watson
※ The -module option is specified in hmm_mono.jconf
The result of saying "Watson started"
■ Grammar file
<RECOGOUT>
<SHYPO RANK="1" SCORE="-2817.017578" GRAM="0">
<WHYPO WORD="[s]" CLASSID="3" PHONE="silB" CM="1.000"/>
<WHYPO WORD="Watson" CLASSID="0" PHONE="w a t s n" CM="0.973"/>
<WHYPO WORD="erase" CLASSID="2" PHONE="k e s h i t e" CM="0.560"/>
<WHYPO WORD="[s]" CLASSID="4" PHONE="silE" CM="1.000"/>
</SHYPO>
</RECOGOUT>
■ Reading file
<RECOGOUT>
<SHYPO RANK="1" SCORE="-2903.453613" GRAM="0">
<WHYPO WORD="Watson" CLASSID="Watson" PHONE="silB w a t o s o N silE" CM="0.791"/>
</SHYPO>
</RECOGOUT>
<RECOGOUT>
<SHYPO RANK="1" SCORE="-8478.763672" GRAM="0">
<WHYPO WORD="Watson started" CLASSID="Watson started" PHONE="silB w a t o s o N k a i sh i silE" CM="1.000"/>
</SHYPO>
</RECOGOUT>
When I say "Watson started":
・ Grammar file ⇒ "Watson" is matched, and Julius returns "Watson erase" built on that match, with a high score.
・ Reading file ⇒ Since nouns and verbs are separated, "Watson" and "Watson started" are judged separately.
⇒ Is there a way to register entries word by word in the grammar file? Even with a short noun + verb utterance, not a full sentence, misrecognition seems likely to increase considerably. This time, the grammar file looks better.
Julius was described as slow on the Raspberry Pi 2, but I found it quite fast on the Raspberry Pi 3. If the only goal is to improve recognition speed, a reading file or grammar file may not be necessary. If the words to be spoken can be limited to some extent, I would consider using a reading file or grammar file to improve the recognition rate.