Acoustic Model Creation using SphinxTrain

Before you look at this, you can peruse the official SphinxTrain documentation at the CMU website. It’s not for the faint-hearted, but if you’re a programmer and know how to get around Linux, then use it instead. Even if you’re interested in each step of how to do this, you may want to consider a much easier way….

Getting the programs

First, you need sphinx3 (I had to go back a few versions, or else it wouldn’t work) and SphinxTrain (this is the nightly build location; there isn’t an official release). Again, I assume you’re using Linux as root. Once downloaded, un-tar them:

>tar -xvf sphinx3…
>tar -xvf SphinxTrain-

(Another assumption: you have gcc and g++ (the C and C++ compilers) installed on your machine.) After they expand into their respective folders, go into the sphinx3 folder and run the configure script:
>./configure
If there are no errors, it should have generated a Makefile. Make sure you’re root and run the command:
>make
This will take a while. You will also need to run
>make install

Now move into the SphinxTrain folder and perform the following steps:
>./configure
>make

No need to run “make install” for SphinxTrain

Creating your project/task folder

Okay, now you need to make a project folder. For example’s sake, I’ll call our project myam (>mkdir myam), and it needs to be in the same folder that SphinxTrain and sphinx3 are in. Then navigate into myam and run this command:
>../SphinxTrain/scripts_pl/setup_SphinxTrain.pl -task myam

Notice that myam is the name of the task and is also the name of your folder. It doesn’t have to be, but it makes things easier later.

Collecting your data

Put all of these files, unless otherwise specified, into your myam/etc folder.

First, you need the audio files that you want to use as your model of speech. I happened to have about 160 wav files, each a single-sentence utterance. For example, if you listened to the first one, it might say “a player threw the ball to me” and that is all. Therefore, you need a bunch of single-sentence audio files, preferably in wav or raw format. Put all of your audio files into the myam/wav folder.

Next, you need a control file. It’s just a text file that lists the name of each of your audio files, one per line, without file extensions. Name it myam_train.fileids (you MUST name it [name]_train.fileids, where [name] is the name of your task, if you’re not using myam). For example:

0001
0002
0003
0004
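
If you have a lot of audio files, typing the control file by hand gets old fast. Here is a minimal Python sketch that builds it from the wav folder. It is not part of SphinxTrain; the function names build_fileids and write_fileids are my own, and it assumes every file you want ends in .wav. The names come out sorted, so your transcript lines have to be in that same order.

```python
from pathlib import Path


def build_fileids(wav_dir, extension=".wav"):
    """Return the sorted base names (no extension) of the audio files in wav_dir."""
    wav_dir = Path(wav_dir)
    return sorted(
        p.stem
        for p in wav_dir.iterdir()
        if p.is_file() and p.suffix.lower() == extension
    )


def write_fileids(wav_dir, out_path, extension=".wav"):
    """Write one file id per line, ending the file with a newline."""
    ids = build_fileids(wav_dir, extension)
    Path(out_path).write_text("".join(i + "\n" for i in ids), encoding="utf-8")
    return ids
```

From the folder that holds myam, you would call it as write_fileids("myam/wav", "myam/etc/myam_train.fileids").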

Next, you need a transcription file that has the transcript of everything uttered, where each line has a single file’s utterance on it. It MUST correspond to your control file. For example, my control file says 0001 on the first line, so the first line of my corpus file is “A player threw the ball to me” because that’s the transcript of 0001.wav. The corpus file, another text file named myam.corpus, should have as many lines as your control file. Remove any punctuation. For example:

a player threw the ball to me
does he like to swim out to sea
how many fish are in the water
you are a good kind of person

Corresponds to my 0001, 0002, 0003, and 0004 files in that order.

What if I don’t have any transcripts of my audio files? Well, you’ll have to get some. NLP has to start somewhere, which means some people have to deal with manually creating data to train from. There is also a vast amount of data on the Internet where you can find audio/transcript bundles, some at the LDC (but that requires a membership).

Anyway, you now have your folder of audio files, a control file, and a transcription file. You still have a long ways to go. You still need a main dictionary which includes each word and the phonemes that make up the word, a filler dictionary, and a phone list. Lucky for you, CMU has an online tool that does the dictionary part for free: http://www.speech.cs.cmu.edu/tools/lmtool-adv.html.

This website asks for several files, but you really only need one, and that’s the transcript file (myam.corpus). Browse for your transcription file under the “Sentence corpus file:” field. Then click “Compile Knowledge Base” and wait a few seconds for the results. Download the sentence file and call it myam_train.transcription (notice that this file differs from your corpus file only in that it has start and end sentence tags <s> and </s> and everything is upper-case). Download the dictionary file and call it myam.dic. Download the LM file and call it myam.lm. You’ll only need the first two for SphinxTrain, but the LM file is handy to have for other things. Put all files in your myam/etc folder.
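
If you want to see what that sentence file amounts to, or rebuild it without the website, here is a rough Python sketch. It follows the description above: punctuation stripped, everything upper-case, each line wrapped in <s> and </s>. The function name and the optional file_ids argument are my own. Some SphinxTrain setups expect the utterance id in parentheses at the end of each line, so the sketch can append it, but treat that as an assumption to check against your version. It does not replace the website for the dictionary, which still has to come from somewhere.

```python
import re


def corpus_to_transcription(lines, file_ids=None):
    """Turn plain corpus lines into '<s> TEXT </s>' transcription lines.

    file_ids is optional (an assumption, not something the web tool does):
    when given, each line gets ' (id)' appended, in the same order.
    """
    if file_ids is not None and len(file_ids) != len(lines):
        raise ValueError("need exactly one file id per corpus line")
    out = []
    for i, line in enumerate(lines):
        # Replace punctuation with spaces, but keep apostrophes (DON'T).
        text = re.sub(r"[^\w\s']", " ", line)
        # Upper-case and collapse runs of whitespace.
        text = " ".join(text.upper().split())
        entry = f"<s> {text} </s>"
        if file_ids is not None:
            entry += f" ({file_ids[i]})"
        out.append(entry)
    return out
```

Feed it the lines of myam.corpus and write the result, one entry per line, to myam_train.transcription.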

You will next need a filler dictionary. You can get specific here with different filler sounds, but we’ll just put together a base one. Make a file called myam.filler and paste this into it:

<s>     SIL
<sil>   SIL
</s>    SIL

That leaves one last file, your phone file. This tells the trainer which phonemes are part of your training set. You should only have the phonemes you need, no more, no less. How do you find the phonemes you need? Open up your myam.dic file. You’ll see words and then you’ll see a breakdown of how those words are pronounced. For example, in my dic file I have:

ACTING    AE K T IH NG

The AE, K, T, IH, NG are all phonemes that make up the word acting. You’ll need a list of all the phonemes used, without duplicates. You can either just follow the next steps and let the errors tell you which phonemes are missing, or you can go to the page I made that will extract the phonemes for you:

http://bakuzen.com/extractphoneme.php

Be gentle. I just threw it together as I put together this post. It takes the dictionary file (myam.dic) that was generated by the CMU site and displays all the unique phonemes. The only problem is…they aren’t all completely unique. You may have to go through and take out duplicates. I don’t know why, but some of them aren’t treated as duplicates by PHP’s array_unique function. Anyway, copy the phoneme list into a file called myam.phone.
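
If you would rather do the extraction on your own machine, here is a small Python sketch of the same idea. The function name extract_phones is mine, and it assumes each dictionary line is a word followed by its phonemes, separated by spaces or tabs. Splitting on any whitespace and collecting into a set should avoid the almost-duplicates I ran into, though that is a guess at the cause. It also adds SIL by default, since the filler dictionary above uses it; pass extra=() if you don't want that.

```python
def extract_phones(dic_lines, extra=("SIL",)):
    """Collect the unique phonemes used in a pronunciation dictionary.

    Each line is expected to look like 'ACTING    AE K T IH NG'.
    The first token is the word (possibly 'WORD(2)' for an alternate
    pronunciation); everything after it is treated as a phoneme.
    """
    phones = set(extra)
    for line in dic_lines:
        parts = line.split()  # splits on any run of spaces or tabs
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phones.update(parts[1:])
    return sorted(phones)
```

Read myam.dic line by line, pass the lines in, and write the returned list one phoneme per line into myam.phone.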

That’s it for file collecting. A recap:

  1. All wav audio files into the myam/wav folder
  2. The rest will be in the myam/etc folder
    1. myam.dic
    2. myam.filler
    3. myam.phone
    4. myam_train.fileids
    5. myam_train.transcription
    6. feat.params (created by the setup script)
    7. sphinx_train.cfg (created by the setup script)

NOTE!!! Double check the following…..

  1. your .dic, .filler, .phone, and .transcription files have everything capitalized (the <s>, </s>, and <sil> tags are the exception). If not, you can capitalize everything with Kate in Linux or PSPad in Windows (or a similar program)
  2. you have an empty line at the bottom of each file
  3. you have the same number of lines in the .transcription file as you do in the .fileids file
  4. make sure your .phone file has no duplicate entries
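
That checklist is easy to get wrong by eye, so here is a rough Python sketch that runs the same four checks. The function name check_training_files is my own, and two choices in it are assumptions: I read "an empty line at the bottom" as "the file ends with a newline", and I ignore the <s>, </s>, and <sil> tags when checking capitalization. It only reports problems; it doesn't fix anything.

```python
import re
from pathlib import Path

TAG = re.compile(r"<[^>]*>")  # <s>, </s> and <sil> are allowed to stay lower-case


def check_training_files(etc_dir, task):
    """Return a list of human-readable problems found in the etc folder."""
    etc = Path(etc_dir)
    suffixes = (".dic", ".filler", ".phone",
                "_train.transcription", "_train.fileids")
    texts = {s: (etc / f"{task}{s}").read_text(encoding="utf-8") for s in suffixes}
    problems = []

    for suffix, text in texts.items():
        # Check 1: everything capitalized (file ids are exempt).
        stripped = TAG.sub("", text)
        if suffix != "_train.fileids" and stripped != stripped.upper():
            problems.append(f"{task}{suffix}: contains lower-case text")
        # Check 2: file ends with a newline.
        if not text.endswith("\n"):
            problems.append(f"{task}{suffix}: no newline at end of file")

    # Check 3: same number of non-blank lines in transcription and fileids.
    def count(s):
        return sum(1 for line in texts[s].splitlines() if line.strip())

    if count("_train.transcription") != count("_train.fileids"):
        problems.append("transcription and fileids have different line counts")

    # Check 4: no duplicate phonemes.
    phones = texts[".phone"].split()
    dupes = sorted({p for p in phones if phones.count(p) > 1})
    if dupes:
        problems.append(f"{task}.phone: duplicate entries {dupes}")
    return problems
```

An empty list back from check_training_files("myam/etc", "myam") means the four checks passed.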

You have some configuring to do now. Open up myam/etc/sphinx_train.cfg with an editor (>kate myam/etc/sphinx_train.cfg). It looks like a fairly daunting file, but there won’t be much you have to change here. First, notice that $CFG_DB_NAME = "myam" or whatever you set your task name to be. Many other properties in this file hinge on that name. That’s why we named the .dic, .phone, and other files the way we did. Also notice that $CFG_BASE_DIR is set to the directory where your task folder exists. If you ever move the folder, you’ll need to change this path. The next property, $CFG_SPHINXTRAIN_DIR, is set to the relative path where your SphinxTrain folder is, just in case it needs something from it.

Now, on to editing a few things. First, you’ll want to find the line that has the property $CFG_WAVFILE_EXTENSION in it. To the right of the = sign is the file extension of your audio files. This is appended onto each of the filenames in your myam_train.fileids file. I set mine to 'wav', and you need to be sure the single quotes are there, too. I also had to set $CFG_WAVFILE_TYPE = 'mswav' since my wav files were created in Windows (by someone else). I forgot to set this at first, and it never gives an error; the training sort of just hangs and doesn’t do anything. Save your changes, and close the editor.
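
Because a wrong $CFG_WAVFILE_TYPE hangs instead of complaining, it can save time to confirm up front what your audio files really are. Here is a small sketch using Python's standard wave module, which only opens Microsoft-style RIFF WAV files. The function name describe_wav is mine. If it raises an error (wave.Error for a file that does not start with a RIFF header), the file is probably not what 'mswav' expects.

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate and bit depth of a RIFF WAV file.

    Raises an exception if the file is not a WAV file the wave module can read.
    """
    with wave.open(str(path), "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "bits": w.getsampwidth() * 8,
        }
```

Running describe_wav("myam/wav/0001.wav") on one or two files is usually enough. The model.props file later in this post uses a sample rate of 16000, so a different number here is worth a second look.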

Creating the model

NOW you get to create the acoustic model. First, navigate to your myam folder and then run this command:

>./scripts_pl/make_feats.pl -ctl etc/myam_train.fileids

This creates feature files from the wav files and stores them in the myam/feat directory. It should move through the files fairly quickly. It went through my 160 files, each averaging about 10 words, in a few seconds.

If there were no errors, you can move on to the last part. Run this command from the myam folder:

>./scripts_pl/RunAll.pl

This is where the magic happens, and it could (really should) take several minutes depending on how much data you have. It first goes through and makes sure the data you have are usable, and then it actually goes through the different phases of the acoustic model training. It logs errors to the myam/logdir folder, and it creates an easier-to-read html error log in your myam folder, named myam.html (or the name of your task+.html). The bottom of the file has your latest log information. You will probably have several errors and warnings, but if there was no “fatal error” then your training should be complete.

Making the model usable for Sphinx4

I really just copied an acoustic model jar file, like the WSJ one, renamed it to zip, and created a similar file structure. CMU has a tutorial that is very helpful in putting together your sphinx4 acoustic model, and I refer you to that for further help. To test things out, make your folder structure the same as CMU’s; you’ll end up with this layout:

cd_continuous_8gau/means
cd_continuous_8gau/mixture_weights
cd_continuous_8gau/variances
cd_continuous_8gau/transition_matrices
dict/cmudict.0.6d
dict/fillerdict
etc/TOY_8gau_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef
etc/TOY_8gau_13dCep_16k_40mel_130Hz_6800Hz.ci.mdef

There are several files in cd_continuous_8gau, etc, and dict. The files that belong in the cd_continuous_8gau folder can be found in your myam/model_parameters/myam.ci_cont/ folder (the names correspond). The dict folder wants any dictionary files. Add your myam.dic and your myam.filler dictionaries to it. The etc directory uses two files found in the myam/model_architecture/ directory. The mdef file will be your myam.alltriphones.mdef file, and the ci.mdef file will be your myam.ci.mdef file. Copy each file into the correct folder.
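
The copying described above can be sketched in Python like this. The function name assemble_model and its arguments are my own, and it keeps the original file names rather than renaming anything to CMU's TOY_... names. You pass in the folder of trained parameters yourself (the one ending in .ci_cont under myam/model_parameters/), since its exact name depends on your task.

```python
import shutil
from pathlib import Path


def assemble_model(task_dir, params_dir, task, dest):
    """Copy trained files into a Sphinx4-style folder layout under dest."""
    task_dir, params_dir, dest = Path(task_dir), Path(params_dir), Path(dest)
    plan = []
    # Trained parameters go into cd_continuous_8gau (names correspond).
    for name in ("means", "mixture_weights", "variances", "transition_matrices"):
        plan.append((params_dir / name, dest / "cd_continuous_8gau" / name))
    # Main and filler dictionaries go into dict.
    for name in (f"{task}.dic", f"{task}.filler"):
        plan.append((task_dir / "etc" / name, dest / "dict" / name))
    # Model definition files go into etc.
    for name in (f"{task}.alltriphones.mdef", f"{task}.ci.mdef"):
        plan.append((task_dir / "model_architecture" / name, dest / "etc" / name))

    for src, dst in plan:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
    return [dst for _, dst in plan]
```

It stops with an error at the first missing source file, which is a quick way to find out that training did not produce everything you expected.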

Now you need to create a file in the directory that holds the etc, dict, and cd_continuous_8gau folders. The file needs to be named model.props and you need to add these properties to it:

description = any description of your model file
modelClass = edu.cmu.sphinx.model.acoustic.EI_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model
modelLoader = edu.cmu.sphinx.model.acoustic.EI_8gau_13dCep_16k_40mel_130Hz_6800Hz.ModelLoader

isBinary = true
featureType = 1s_c_d_dd
vectorLength = 39
sparseForm = false

numberFftPoints = 512
numberFilters = 40
gaussians = 8
minimumFrequency = 130
maximumFrequency = 6800
sampleRate = 16000

dataLocation = cd_continuous_8gau
modelDefinition = etc/myam.ci.mdef

Save the file and close the editor. You will also need the Model.class, ModelLoader.class, and PropertiesDumper.class files. You can either get those from an existing jar, or go to the site I referred you to on how to create them correctly.

Now navigate into the etc folder where your mdef files are. Create a file called variables.def and put this info into it:

set exptname = myam
set vector_length = 13
set dictionary = $base_dir/lists/myam.dic
set fillerdict = $base_dir/lists/myam.filler
set statesperhmm = 3
set skipstate = no
set gaussiansperstate = 8
set feature = 1s_c_d_dd
set n_tied_states = 4000
set agc = none
set cmn = current
set varnorm = no

Notice the myam.dic and myam.filler. Be sure to use the link I provided for more information. Save the file and close.

Now, if you want to do it the easy way, go back to your first folder (if you followed the CMU way, the folder “edu”) and create a zip file out of it. Rename the zip file to a jar file extension and you now have an acoustic model. The rest is linking it into Sphinx4 via Eclipse and setting up the information in your config.xml file. There will be three places to do that: the dictionary, the loader, and the acoustic model definitions. Refer to my original post on sphinx4 on how to deal with the config file. I had to play with it for a while before I got everything to work correctly, but it was thrilling to see my own first acoustic model work in Sphinx4.
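
Since a jar is just a zip file with a different extension, the zip-and-rename step can also be done in one go. Here is a minimal Python sketch using the standard zipfile module. The function name make_jar is mine, and it assumes the jar should contain the top folder itself (for example edu/...) and that the jar is written somewhere outside that folder.

```python
import zipfile
from pathlib import Path


def make_jar(root_dir, jar_path):
    """Zip root_dir (including the folder itself) into jar_path."""
    root = Path(root_dir)
    with zipfile.ZipFile(jar_path, "w", zipfile.ZIP_DEFLATED) as jar:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so 'edu/' stays on top.
                jar.write(path, path.relative_to(root.parent).as_posix())
    return jar_path
```

Something like make_jar("edu", "myam.jar") gives you a file you can put on the classpath the same way as the renamed zip.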

If you have any trouble, feel free to leave a comment with your question and I’ll see what I can help you through. There is also a great site called http://voxforge.org/ that is a go-to site for DIY NLP people out there. A site like that may make my site obsolete one day, but I’ll still be around for those of you who aren’t as programming savvy as those folks typically are. It’s an excellent site and I encourage you to look there for data you can use in acoustic model creation, help on problems you run into, and also to contribute by adding data you have, giving insights on aspects of NLP, or helping people by answering questions they may have.

10 Comments

  1. Stefania:

    Hi .. I been trying to train sphinx4 to recognize numbers and a few words in spanish and i read this instructions and i don’t understand why sphinx 3 is necesary to use sphinxtrain .. by the way i couldn’t improve the accuracy by training the framework .. in fact it work better before the training .. i’m guessing that i’m doing it wrong .. but i’m not using sphinx3 .. should i?

    Thanks a lot !

    Stefania

    ps: sorry for my english ..

  2. Peter:

    Hi, I made it!!!!!! I successfully made a spanish trainning!!!!
    Stefania, I can help you if yo want to.
    My e-mail is petermon@gmail.com

  3. admin:

    Nice work, Peter. The English phonemes were enough to cover what you needed then?

  4. admin:

    Meir,

    I wasn’t able to download the zip file you pointed me to. Did you use English phonemes to generate the Arabic model?

  5. Meir:

    I managed to train my model to recognize hebrew words.
    It only works when i’m directing the decoder to use hebrew.ci_cont HMM. It doesn’t work when i direct it to work with any of the hebrew.cd_cont_1000* dirs.
    Do you know why this can happen?
    Also, how can i debug decoder recognition problems? for example, i have 2 very similar wav files - one for training and another for testing (i can see in SoundForge they look very very similar) and still the decoder is unable to match them. Are there any tools/log files i can use to see the course of action the decoder uses to find similarities between the two?

    I re-placed the zip file in: http://dl.dropbox.com/u/344251/hebrew.zip
    Try to copy the url to new browser window - works for me.

  6. admin:

    When you say decoder, do you mean sphinx4? There are different log levels you can set in your config.xml file (under loglevel):
    * SEVERE (highest value) - an error occurs that makes continuing the operation difficult or impossible
    * WARNING - an anomaly has occurred, but the operation is continuing
    * INFO - general information
    * CONFIG - information about a components configuration
    * FINE - tracing messages
    * FINER - finer grained tracing messages (lots of output)
    * FINEST (lowest value) - finest grained tracing messages (huge amounts of output)

    I don’t find any verbose mode in sphinx3, but when you do forced alignment in sphinx2, one can set verbose mode to 9. You might try something like that even if it’s not specified.

    But you probably would have figured that out. So, I’m not sure what you’re using to decode. As for your files, your training files all looked good. I know I’ve had trouble any time I named the DMP file anything other than an4.lm.DMP. I also noticed that you’re using Windows, and I got mine to work on Linux so there might be something there. I know that when I used some other people’s files, I had problems because they were in a Windows format and Linux had a hard time with them. I’m just starting as basic as possible, because that’s usually where the problems are.

  7. Meir:

    I’m using sphinx3 for decoding for now. Haven’t made the conversion to 4. I’ll try to find the logging level in sphinx 3 and set them to as lowest level as i can. Up until now i just read the log files as they were - didn’t think there was logging level anywhere.

    I wanted to ask why the CI files work for decoder and the CD not. In all the tutorials i read so far, the requirement is to use CD files. Is there anything wrong i’m doing here by using the CI files?

    BTW: the problems with linux vs. windows stems for the casing problems of the wording and the filenames in configuration. Usually when this happens it’s very clear states the problem (not finding file / not finding word etc…).

    Meir

  8. Peter:

    Hi, nope, actually I did not an “alphabetical” trainning, I made a “phoneme” trainning. (Using words as phonems).

  9. Tanya:

    Hi,I’m not so familiar with Linux, but currently i’m using Window XP with Sphinx 4.0, can guide me how to use window xp to use SphinxTrain to build up acoustic model? I believe I need the C++ compiler to compile SphinxTrain. Is it correct? Actually I would like to do speech recognition for my country rural language. So it is necessary for me to start build the acoustic model right? Thanks!

  10. admin:

    Just find a c++ compiler for windows. You may need to get a version of Make for windows that compiles it, then there’s a win32 folder that has some of the other things you might need, like bat files and some sample scripts for Windows. It should be mostly the same otherwise.

    Good luck!
