Using Sphinx4
If you followed the last post, you set up sphinx and got it to work, but now you want to really use it. There are several parts of the recognizer you need to know about, which I will discuss in this order:
1. Configuration File
2. Linguist
3. Acoustic Model
4. Dictionary
5. Grammar
Configuration File
This file tells the recognizer everything about how you want it to work. It is located (if you followed my previous post) in src->wavfile->config.xml. It is an XML file, so you can view it in the XML editor in Eclipse, which can collapse parts of the file, but I prefer a regular text editor. You can open it in the default text editor by right-clicking on it and choosing open with->text editor. You’ll see the first part of the file:
<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-90"/>
<property name="wordInsertionProbability" value="1E-36"/>
<property name="languageWeight" value="8"/>
<property name="silenceInsertionProbability" value="1"/>
<property name="skip" value="0"/>
<property name="logLevel" value="WARNING"/>
<property name="recognizer" value="recognizer"/>
<property name="linguist" value="flatLinguist"/>
<property name="frontend" value="mfcFrontEnd"/>
Linguist
Unless you know what you’re doing, don’t change anything here. The last three lines define which recognizer, linguist, and frontend you’re going to use. Focus on the linguist and note that it’s called “flatLinguist.” If you search through the file, you’ll see a component named “flatLinguist”, which tells the recognizer which grammar and acoustic model to use. One of the purposes of the linguist is to bring these things together: it is the part of sphinx that combines the grammar, dictionary, and acoustic model into the search space the recognizer works through.
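To make that lookup concrete, here is a minimal Java sketch that uses only the JDK's built-in XML parser to find which component the global linguist property points to and to list that component's properties. The class name ConfigInspector and the trimmed sample XML are illustrative assumptions, not part of sphinx; the sample also assumes the file's elements sit inside a single <config> root element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of sphinx: inspects a config.xml-style document.
final class ConfigInspector {

    // Trimmed sample in the style of config.xml (assumed <config> root element).
    static final String SAMPLE =
        "<config>"
      + "<property name=\"linguist\" value=\"flatLinguist\"/>"
      + "<component name=\"flatLinguist\" type=\"edu.cmu.sphinx.linguist.flat.FlatLinguist\">"
      + "<property name=\"grammar\" value=\"jsgfGrammar\"/>"
      + "<property name=\"acousticModel\" value=\"acousticModel\"/>"
      + "</component>"
      + "</config>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Collects name/value pairs of <property> elements that are direct children of parent.
    static Map<String, String> properties(Element parent) {
        Map<String, String> result = new LinkedHashMap<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n instanceof Element && "property".equals(n.getNodeName())) {
                Element e = (Element) n;
                result.put(e.getAttribute("name"), e.getAttribute("value"));
            }
        }
        return result;
    }

    // Returns the properties of the component with the given name, or an empty map if absent.
    static Map<String, String> component(Document doc, String name) {
        NodeList comps = doc.getElementsByTagName("component");
        for (int i = 0; i < comps.getLength(); i++) {
            Element c = (Element) comps.item(i);
            if (name.equals(c.getAttribute("name"))) {
                return properties(c);
            }
        }
        return new LinkedHashMap<>();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Global properties are the <property> elements directly under the root.
        String linguist = properties(doc.getDocumentElement()).get("linguist");
        System.out.println(linguist + " -> " + component(doc, linguist));
    }
}
```

Pointing parse at the text of your own config.xml should show the same chain: the global linguist property names a component, and that component names the grammar and acoustic model.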
Acoustic Model
The acoustic model is essential to recognition, but it’s not actually part of the recognizer itself; it is one of the things you can swap out. If you followed the previous post, you’ll notice that you used “TIDIGITS” (file name is TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar). The acoustic model holds the statistical information that the recognizer needs to put phonemes, syllables, words, and then sentences together. Different models come from different sources and are for different purposes. For example, the TIDIGITS model is specifically designed and created for numbers. If your application only needs to recognize numbers (for example, your cell phone listens to you speak numbers), then you would use this model. Notice that there are two more models. Focus on the one called WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar (by the way, these acoustic models follow a specific naming convention, hence the long, complicated name). This WSJ acoustic model was created from Wall Street Journal spoken text. It, like other acoustic models, comes with a dictionary of the words spoken and how to map phonemes to those words. This model is useful for most everyday words, so I will walk you through getting sphinx set up with it. To do that, you have to set things up in the config.xml file.
Add this info to your config.xml file in the Acoustic Model Configuration Information section:
<component name="acousticModelWSJ"
type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model">
<property name="loader" value="sphinx3LoaderWSJ"/>
<property name="unitManager" value="unitManager"/>
</component>
<component name="sphinx3LoaderWSJ"
type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.ModelLoader">
<property name="logMath" value="logMath"/>
<property name="unitManager" value="unitManager"/>
</component>
This sets up the acoustic model WSJ along with the necessary loader. If you compare it to the TIDIGITS model and loader, you’ll notice that it’s pretty much the same.
Dictionary
With the acoustic model in place, you now need to set up sphinx to use the dictionary that the acoustic model can understand and use. You can paste this into the config.xml file in the section called The Dictionary Configuration:
<component name="dictionaryWSJ"
type="edu.cmu.sphinx.linguist.dictionary.FullDictionary">
<property name="dictionaryPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
<property name="fillerPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
<property name="addSilEndingPronunciation" value="false"/>
<property name="unitManager" value="unitManager"/>
</component>
Grammar
You’re almost ready to run the recognizer using the WSJ acoustic model and dictionary. Now you just have to make sure the grammar and linguist know to use them instead of the TIDIGITS model. Underneath The Linguist Configuration, you’ll notice this line:
<property name="acousticModel" value="acousticModel"/>
But if you look at the acousticModel component lower in the config.xml file, you’ll see that it uses the TIDIGITS model. The one you want is acousticModelWSJ, so you can remove the above line and paste this in its place:
<property name="acousticModel" value="acousticModelWSJ"/>
Similarly, in The Grammar Configuration, you need to set this line:
<property name="dictionary" value="dictionary"/>
to
<property name="dictionary" value="dictionaryWSJ"/>
Be sure to save the config file. You can download my config file made after all of these changes here (be sure to right-click->save as).
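A typo in one of these renamed values is easy to miss, so here is a small, hedged Java sketch of a sanity check: it reports any acousticModel or dictionary property whose value does not match the name of a component defined in the same file. ReferenceCheck is a hypothetical helper built on the JDK's XML parser, not something sphinx provides, and the sample XML is a trimmed stand-in (with placeholder type values) for the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical checker, not part of sphinx.
final class ReferenceCheck {

    // Trimmed stand-in for config.xml: dictionaryWSJ is referenced but never defined.
    static final String SAMPLE =
        "<config>"
      + "<component name=\"flatLinguist\" type=\"placeholder\">"
      + "<property name=\"acousticModel\" value=\"acousticModelWSJ\"/>"
      + "</component>"
      + "<component name=\"jsgfGrammar\" type=\"placeholder\">"
      + "<property name=\"dictionary\" value=\"dictionaryWSJ\"/>"
      + "</component>"
      + "<component name=\"acousticModelWSJ\" type=\"placeholder\"/>"
      + "</config>";

    // Returns "property=value" entries whose value names no defined component.
    static List<String> dangling(String xml, String... propertyNames) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Every name="..." on a <component> element is a valid reference target.
        Set<String> defined = new HashSet<>();
        NodeList comps = doc.getElementsByTagName("component");
        for (int i = 0; i < comps.getLength(); i++) {
            defined.add(((Element) comps.item(i)).getAttribute("name"));
        }

        Set<String> wanted = new HashSet<>(Arrays.asList(propertyNames));
        List<String> missing = new ArrayList<>();
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            String name = p.getAttribute("name");
            String value = p.getAttribute("value");
            if (wanted.contains(name) && !defined.contains(value)) {
                missing.add(name + "=" + value);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Unresolved: " + dangling(SAMPLE, "acousticModel", "dictionary"));
    }
}
```

An empty list means both references resolve. The check only looks at the two properties changed in this post, because not every property value in the file is a component name.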
Now, if you run the recognizer (WavFile) like you did in the previous post, it should still recognize “one two three four five.” This model contains at least some digit information, so it recognized this file just fine.
In order to recognize your own file, there are just a few more things you need to do. First, you need a file that contains some kind of utterance. For example, you might record yourself saying “I like to drive cars” and save it as a 16-bit wav file called “myfile.wav” (sphinx only takes 16-bit wav files). Then copy that file into the src/wavfile folder (the same place where the other wav files are) and, on line 53 of WavFile.java, change what is inside the quotes from “12345.wav” to “myfile.wav”. One more thing: you need to find the WSJ model file (the one with the really long filename), right-click on it, and add it to the build path.
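Since the 16-bit requirement is easy to get wrong when exporting a recording, here is a rough Java sketch that reads a wav header with the JDK's javax.sound.sampled package and reports the sample size. WavCheck and its method names are my own illustrative choices, not part of sphinx. The sketch also prints the sample rate, on the assumption that the 16k in the model's filename refers to 16 kHz audio. When run without an argument it checks a small zero-filled wav built in memory, so it works without a real recording.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper, not part of sphinx: checks the sample size declared in a wav header.
final class WavCheck {

    // Reads the audio file header from the stream and returns the format it declares.
    // The stream must support mark/reset (ByteArrayInputStream and BufferedInputStream do).
    static AudioFormat formatOf(InputStream in) throws Exception {
        return AudioSystem.getAudioFileFormat(in).getFormat();
    }

    static boolean isSixteenBit(AudioFormat format) {
        return format.getSampleSizeInBits() == 16;
    }

    // Builds an in-memory, zero-filled, mono 16-bit PCM wav for demonstration purposes.
    static byte[] sixteenBitWav(float sampleRate, int frames) throws Exception {
        AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
        byte[] pcm = new byte[frames * 2]; // 2 bytes per 16-bit mono frame
        AudioInputStream audio =
            new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        AudioSystem.write(audio, AudioFileFormat.Type.WAVE, out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        AudioFormat format = args.length > 0
            ? AudioSystem.getAudioFileFormat(new File(args[0])).getFormat()
            : formatOf(new ByteArrayInputStream(sixteenBitWav(16000f, 1600)));
        System.out.println("sample rate: " + format.getSampleRate()
            + ", bits: " + format.getSampleSizeInBits()
            + ", 16-bit: " + isSixteenBit(format));
    }
}
```

Running it with the path to myfile.wav as the only argument checks your own recording before you hand it to the recognizer.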
Once you have completed that, you have to change the grammar to listen to what you are going to say. In src/wavfile, find the file called digits.gram and open it (still in eclipse). Notice that this grammar has the digits that were spoken before in the 12345.wav file. It looks like this:
#JSGF V1.0;
/**
* JSGF Digits Grammar for Hello World example
*/
grammar digits;
public <numbers> = (oh | zero | one | two | three | four | five | six | seven | eight | nine) * ;
Change it to look like this (or adapt it to include the words you actually recorded earlier):
#JSGF V1.0;
/**
* JSGF Digits Grammar for Hello World example
*/
grammar digits;
public <numbers> = (I | like | to | drive | cars) * ;
Save the file, then try to recognize it. The recognizer should pick it up. In future posts, we’ll go into specifics on the grammar and how to convert mp3s in sphinx so you don’t always have to convert everything to wav.
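If you change recordings often, typing the word list by hand gets tedious. As a minimal sketch, the Java snippet below builds the same kind of word-loop grammar from a list of words. GrammarBuilder is a hypothetical helper of my own, not a sphinx class, and it leaves out the comment block from digits.gram.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of sphinx: generates a simple JSGF word-loop grammar.
final class GrammarBuilder {

    // Builds a grammar whose single public rule accepts any sequence of the given words.
    static String wordLoop(String grammarName, String ruleName, List<String> words) {
        return "#JSGF V1.0;\n"
             + "grammar " + grammarName + ";\n"
             + "public <" + ruleName + "> = (" + String.join(" | ", words) + ") * ;\n";
    }

    public static void main(String[] args) {
        // Same grammar and rule names as digits.gram, so nothing else has to change.
        System.out.print(wordLoop("digits", "numbers",
            Arrays.asList("I", "like", "to", "drive", "cars")));
    }
}
```

Keeping the grammar name digits and the rule name numbers means the output can replace the contents of digits.gram directly.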
Jim:
Thank god for your website! I’ve been looking into the different software packages that are available for doing forced-alignment, and it’s been an uphill battle trying to figure out how everything is supposed to work. Thanks for your excellent posts, I will try to follow along your tutorials.
May 3, 2008, 2:29 pm
admin:
I’m glad to hear that it was helpful to someone. Let me know if you have any other questions. I’d post more, but I’m in the middle of moving, new job, etc, but I have plans for more posts as soon as possible.
May 3, 2008, 3:33 pm
Magdalena:
Hi, I followed your instructions (on XP and Vista), but after I changed the config file to WSJ (either by editing or by copying the whole file from this website) this error occurred: java.lang.OutOfMemoryError: Java heap space. I changed the amount of memory that the JVM can use to 1 GB but it’s still not working. Perhaps I should change something in my project? Anyway, you did a great job with this tutorial. I find it very useful for my classes at university. Thanks!
June 8, 2008, 5:51 pm
Magdalena:
I figured out what the problem was. The solution is Run -> Open Run Dialog -> (x)=Arguments, and in the VM arguments section you have to put, for example, -Xmx512m (adjust to your amount of memory). It will change the memory that is used to run the project.
June 9, 2008, 11:11 am
Alex:
Your blog is interesting!
Keep up the good work!
August 15, 2008, 1:22 am
Rui:
hi, nice website. How about some words about adapting sphinx to recognize another language like german?
October 8, 2008, 7:34 am
Juliano Menezes:
Friend, I am running into the following error. I am working on a project that needs voice recognition, but I am having a problem.
My software runs normally and prompts me to say the word, but then it freezes and seems to stop executing. I am using NetBeans; could that be the cause?
If you can help me, I would be grateful!
Juliano Menezes
October 13, 2008, 7:42 am
admin:
Rui,
Thanks for bringing that up. All Sphinx4 needs is an acoustic/language model in the language you are using and you’re set. Sadly, there isn’t anything useful for German right now, but there are resources for Spanish and French. In a later post I’ll adapt it, but if you’re feeling ambitious, you can find everything you need here for Spanish: http://speech.mty.itesm.mx/~jnolazco/proyectos.htm
October 13, 2008, 2:40 pm
admin:
Juliano,
NetBeans shouldn’t be much different. You should be able to copy the code I have, along with the Java jar files, and run it. It may be hanging because of a heap space memory problem. Make sure to add -Xmx512m to the command-line arguments.
I don’t use NetBeans. If you have nothing against Eclipse, I suggest downloading it and following the instructions in my first post to get it set up. It is very detailed.
October 13, 2008, 2:44 pm
baküzen » Blog Archive » Acoustic Model Creation in Sphinx4:
[…] places to do that, the dictionary, the loader, and the acoustic model definitions. Refer to my original post on sphinx4 on how to deal with the config file.
January 24, 2009, 10:30 am
Nisha:
Hi..I have tried the Hello Digits Program with tidigits…it works fine when you are providing wav file that contains only digits but if a wav file with words and digits both is given as an input it converts the words also into digits instead of ignoring them…I know it has something to do with configuration file intial options but I tried editing didn’t help…can u suggest something…
December 23, 2009, 11:50 pm
admin:
It’s most likely your model. The TIDIGITS acoustic model is trained only on digits. I would suggest using the Wall Street Journal (WSJ) model, which I think also comes with sphinx, or at least you can find it on the CMU website. It has digits and a lot of news wire, so it covers normal words and digits.
Good luck!
January 1, 2010, 2:25 pm
plk:
i wanna use HUB4 model in the transcriber demo.i configured the config.xml for HUB4 and trigram language model instead of grammer…
program is going into infinite loop trying to recognize the wav file..
thanks in advance if u reply..
PLK
January 30, 2010, 7:28 am
admin:
Does it work when using a grammar? Try that first.
January 31, 2010, 12:22 am
plk:
hi admin,
thanks for the reply. i tried what u said in the above post by using WSJ and manual gram file and its working fine with me.
While during hub4 and linguistic model,when i run it, it shows some warning about missing word that are present in model and not in linguistic dictionary. so it means that atleast both hub4 and linguistic are being used during execution hence have been configured correctly.problem occurs during recognition of wav.
i am using language_model.arpaformat.DMP as linguistic
here is my config.xml file
streamDataSource, accuracyTracker, speedTracker, memoryTracker, recognizerMonitor, beamFinder, streamDataSource, premphasizer, windower, fft, melFilterBank, dct, batchCMN, featureExtraction, concatDataSource, speechClassifier, speechMarker, nonSpeechDataFilter, premphasizer, windower, fft, melFilterBank, dct, liveCMN, featureExtraction, unitExitActiveList, wordActiveList, wordActiveList, activeList, activeList, activeList, configMonitor
Thanks in advance if u reply,
PLK
January 31, 2010, 4:36 am
plk:
oops…the config file didnt came properly in the above comment.
send me ur mail id so that i can mail u….
PLK
January 31, 2010, 4:38 am
Matt:
Hi, I’ve been experimenting with Sphinx 4 over the past couple of days. I am trying to use a dictation grammar instead of a rule based grammar that you are using in this tutorial. I have not been able to find much through google except for using the general java speech api. Is there a way to do this with Sphinx? Could you point me in a direction?
Thanks,
-Matt
February 7, 2010, 2:59 pm
plk:
hello admin,
I tried hub4 with grammar and its working fine.
I want to convert bigger wave file.
Plz help me with that.I need it very urgently.
Plz help to configure language model inside hub4.
Thanking u in advance.
PLK
February 13, 2010, 9:35 pm
kmj:
Great website!
I modified Transcriber to recognize words using WSJ and a trigram language model. Everything seems to work fine, except not one word is recognized correctly!! I already tried changing between 8 and 16 Hz with no luck. Any ideas?
kmj
February 15, 2010, 10:53 am
Tanya:
Hi Admin,
I would like to work Sphinx 4.0 to recognise 20 words of my country’s indigenous language.
I manage to make it by adding the word in library (without creating the acoustic model because there’s one sun programmer suggest that if work under 100 words, there’s no need of building own acoustic model)and compile using HelloWorld.java.
The average recognition of words is around 0.7. And he (the sun programmer) suggested me to play around with absoluteBeamWidth, relativeBeamWidth, wordInsertionProbability, languageWeight in HelloWorld.config, in order to improve accuracy, and there’s nothing relevant to the java coding or grammar structure.
My problem now is I fail to find relevant and useful resources about the beamWidth, wordInsertionProbability, languageWeight, what dose all these parameters for, how they affect the accuracy. Can Sir suggest me the useful resources onto these? I’ve look into sphinx and cmu sphinx website and both do not contain and explain much about it.
Really appreciate your great helps!!
March 21, 2010, 10:26 pm
admin:
Tanya,
You can certainly change around those parameters to improve accuracy. To do that, open up the config.xml file and you’ll see them pretty close to the top. When sphinx is trying to recognize speech, it comes up with a lot of different phoneme strings and then with different word strings. It could literally come up with hundreds of different hypotheses and process all of them to see which one is the best. However, this would be time and memory consuming, so it only keeps a few at a time to work with. That’s what the “beam” is. If you make the beam smaller, it will only hang onto a small number of the possible strings, whereas if you make it larger, it will hold onto more. The absolute beam width is a fixed cap: no matter what the strings are and what their probabilities might be, at most that many are kept. The relative beam width is a threshold measured against the best-scoring string, so the number kept changes depending on the strings that are guessed. I would start by increasing the absolute beam width (maybe double it). If that doesn’t increase your accuracy, then I would say that these settings won’t help you much. You can always play around with the grammar (that did the best for me every time). Make sure your phoneme mappings to the English phoneme system are as close as you can get. Otherwise, the only remaining option might be a new acoustic model. I know it’s cumbersome, but even if you tune your sphinx engine perfectly, if your data can’t get you the results, then you have to get better data.
March 24, 2010, 8:43 am
Tanya:
Thanks Admin! It does point the direction for me
March 24, 2010, 8:46 am