Forced Alignment in Sphinx2

Getting and Installing Sphinx2 and Sphinx3
Sphinx2 was designed to run on any architecture that has a C compiler. For our purposes, we’ll use a Linux system. The steps below can be followed for Sphinx2 or Sphinx3 (at least the installation steps are the same).

1. Download the software from CMU Sphinx

2. Untar the file (tar -xvf filename) and set the permissions to run the config file (chmod a+x configure)

3. Run the configure script (./configure) from the directory you just unpacked. This will generate a Makefile.

4. Run make (just type make). This compiles the software and does not need root privileges.

5. Run make install (make install) as root. This should install everything you need where things belong.

Preparing for Forced Alignment

1. You will need the following files:
-the audio files
-control
-corpus
-dictionary
-language model
-language model DMP
-run-batchalig.sh

NOTE!! Make sure you save your files in a Linux format. In other words, either create each file in Linux, or open a file that already works in Linux, save it under a new name, and then paste in your information. If you don’t, you may spend 3+ hours wondering why your ctl and cor files aren’t being read properly.
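One common way a file ends up in a non-Linux format is Windows-style line endings (a carriage return before every newline). Assuming that is the problem here, this sketch reports which files contain carriage returns and strips them. The function names has_crlf and strip_cr are made up for illustration.

```bash
# Print the names of the given files that contain a carriage return (\r).
has_crlf() {
    local cr
    cr=$(printf '\r')
    grep -l "$cr" "$@" 2>/dev/null
}

# Rewrite each given file in place with every carriage return removed.
strip_cr() {
    local f tmp
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # Write through a temp file, then copy back so permissions are kept.
        tr -d '\r' < "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}
```

For example, has_crlf mystuff.ctl mystuff.cor lists the files that need fixing, and strip_cr mystuff.ctl mystuff.cor cleans them.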

2. Create a folder where all of these files will go

3. Convert the audio files to raw format. If you are taking files from mp3 format to wav and then to raw, BE SURE to remove all of the mp3 tag data or the recognizer will choke. There are many downloadable programs that can remove the tag information, or you can remove it by hand in a text editor if you’re careful. For the conversion itself, one useful tool is sox, which you can download and run on a Windows or Linux machine. Once you have it, run the following loop (in Linux bash):

for f in *.wav; do sox "$f" "${f%.wav}.raw"; done

This will convert all of your wav files to raw and name them appropriately. It won’t move or delete the original files.
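Before handing the .raw files to the recognizer, it can help to check that no header or tag bytes survived. The sketch below is a rough heuristic and not part of Sphinx2: it flags files that start with ID3 (an ID3v2 tag) or RIFF (a wav header), or whose last 128 bytes start with TAG (an ID3v1 tag). The function name check_raw_headers is hypothetical.

```bash
# Report leftover container or tag signatures in audio files.
# Prints one line per finding and nothing if every file looks clean.
check_raw_headers() {
    local f head3 head4 tail3
    for f in "$@"; do
        # NUL bytes are dropped so the shell can hold the result in a variable.
        head3=$(head -c 3 "$f" | tr -d '\0')
        head4=$(head -c 4 "$f" | tr -d '\0')
        tail3=$(tail -c 128 "$f" | head -c 3 | tr -d '\0')
        [ "$head3" = "ID3" ]  && echo "$f: starts with an ID3v2 tag"
        [ "$head4" = "RIFF" ] && echo "$f: starts with a RIFF (wav) header"
        [ "$tail3" = "TAG" ]  && echo "$f: ends with an ID3v1 tag"
    done
    return 0
}
```

Running check_raw_headers *.raw in your folder should print nothing if the files are plain audio samples.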

4. Put all of those converted files into your folder.

5. Create the control file. This file simply says where all of the audio files are. You don’t need to put the .raw extension. For example, if you have two raw files called file1.raw and file2.raw, your control file would look like this:

file1
file2

For the sake of argument, we’ll name the file mystuff.ctl.
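With more than a handful of recordings, the control file can be generated instead of typed. This sketch assumes the .raw files sit in the current directory and that one base name per line is all the control file needs, as described above. The function name make_ctl is made up.

```bash
# Write one line per .raw file in the current directory, extension removed.
make_ctl() {
    local out="$1"
    local f
    : > "$out"                      # start with an empty control file
    for f in *.raw; do
        [ -e "$f" ] || continue     # skip the literal pattern if nothing matches
        echo "${f%.raw}" >> "$out"
    done
}
```

Call it as make_ctl mystuff.ctl. The entries come out in the shell’s sorted order, so the transcript lines in the next step have to follow that same order.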

6. Create a corresponding corpus file. This is basically a transcript of what is said in the files. The lines of the transcript should correspond to the files in your control file. For example, if the file1 audio is “I like to run” and the file2 audio is “I ran 8 miles” then your corpus file should look like this (the first line must have *align_all*):

*align_all*
I like to run
I ran 8 miles

Note that there is no punctuation and that the order of the lines corresponds to the order of the filenames in the control file. For the sake of argument, we’ll name this file mystuff.cor.
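A mismatch between the control file and the corpus file is easy to miss by eye. The sketch below checks the two rules stated above: the first line is exactly *align_all* (lower case), and there is one transcript line per control-file entry. Empty lines are ignored when counting, which is an assumption on my part. The function name check_cor is made up.

```bash
# Check that a corpus file starts with *align_all* and has exactly one
# transcript line for every entry in the control file.
check_cor() {
    local ctl="$1" cor="$2" n_ctl n_cor first
    first=$(head -n 1 "$cor")
    if [ "$first" != "*align_all*" ]; then
        echo "first line of $cor is not *align_all*"
        return 1
    fi
    n_ctl=$(grep -c . "$ctl")                 # non-empty lines in the control file
    n_cor=$(tail -n +2 "$cor" | grep -c .)    # non-empty lines after the header
    if [ "$n_ctl" -ne "$n_cor" ]; then
        echo "$ctl has $n_ctl entries but $cor has $n_cor transcript lines"
        return 1
    fi
    echo "ok: $n_ctl utterances"
}
```

With the two example files from above, check_cor mystuff.ctl mystuff.cor prints ok: 2 utterances.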

7. Next, you need a dictionary. A dictionary file has each word in your corpus file (step 6) with annotated phonemes. That would be a pain to do by hand, especially if you had a lot of words to annotate. CMU has an online tool to help you:

http://www.speech.cs.cmu.edu/tools/lmtool-adv.html

This actually generates more than just the dictionary file. It also generates the language model file that you also need, and a sentence file. Just browse to your sentence corpus file (mystuff.cor). You can give it an optional exception dictionary, which is a dictionary file that already has some annotated words. This will override the default generated annotations for those words with the annotation you give in that exception dictionary. The other thing you can input is an additional words file, which is just a list of words (unannotated) that you want included in your dictionary.

You can tell it which phone set to use. Just use the default and click “Compile Knowledge Base” which will take a few seconds. When it’s done, paste in the text from the two files into your dictionary and language model files. For argument, they would be called mystuff.dic and mystuff.lm. You can save the sentence file also, which may come in handy sometime.

But you’re not done yet. You have to convert the dictionary file because Sphinx2 won’t recognize the phoneme set produced by the website. Run the following command:

stress2sphinx mystuff.dic > mystuff.dic.conv

The converted file, mystuff.dic.conv, is the dictionary you will use.
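If you edit the transcript after generating the dictionary, some words can end up without a pronunciation. The sketch below lists transcript words that have no dictionary entry. It assumes each dictionary line is a word followed by its phonemes (for example, TAKE T EY K) and compares words case-insensitively. The function name missing_words is made up.

```bash
# List corpus words that have no entry in the dictionary (case-insensitive).
missing_words() {
    local cor="$1" dic="$2"
    awk '
        # First file (dictionary): remember the first field of each line.
        NR == FNR { known[toupper($1)] = 1; next }
        # Second file (corpus): skip the header line, then test every word.
        $0 == "*align_all*" { next }
        {
            for (i = 1; i <= NF; i++) {
                w = toupper($i)
                if (!(w in known) && !(w in reported)) {
                    print w
                    reported[w] = 1
                }
            }
        }
    ' "$dic" "$cor"
}
```

Call it as missing_words mystuff.cor mystuff.dic.conv. No output means every word is covered. An empty dictionary file is not handled by this sketch.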

8. Now you need the language model DMP file. This is basically your language model in binary form. Rename your language model file to an4.ug.lm; otherwise it may not work. Then invoke the lm3g2dmp program that came with Sphinx2 (you may need to download lm3g2dmp from CMU and run make):

lm3g2dmp <text LM file> <target directory>

(e.g., lm3g2dmp an4.ug.lm /home/mydir)

This will create a dump file called an4.ug.lm.DMP. Copy that file into your folder.

9. CHECKPOINT: You should now have all the files you need except the run-batchalig.sh. See the files in step 1 above and make sure you have them all in your folder except that one file.
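The checkpoint can be automated. This sketch uses the example file names from this tutorial (mystuff.ctl, mystuff.cor, mystuff.dic.conv, an4.ug.lm, an4.ug.lm.DMP), so change them if yours differ. It also checks that every control-file entry has a matching .raw file. The function name check_inputs is made up.

```bash
# Report which of the expected input files are missing from a folder.
# Returns 0 if everything is present, 1 otherwise.
check_inputs() {
    local dir="$1" f name missing=0
    for f in mystuff.ctl mystuff.cor mystuff.dic.conv an4.ug.lm an4.ug.lm.DMP; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            missing=1
        fi
    done
    # Every control-file entry should have a matching .raw file.
    if [ -f "$dir/mystuff.ctl" ]; then
        while IFS= read -r name; do
            [ -n "$name" ] || continue
            if [ ! -s "$dir/$name.raw" ]; then
                echo "missing or empty: $dir/$name.raw"
                missing=1
            fi
        done < "$dir/mystuff.ctl"
    fi
    return "$missing"
}
```

For example, check_inputs /home/myplace/myfolder prints one line per problem and nothing if the folder is complete.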

Preparing the Script File

1. The final file you’ll need is the run-batchalig.sh file. It’s what brings all of your files together and runs the recognizer on them. Create the file and paste in the following contents:

/usr/local/sphinx2/src/examples/sphinx2-batch \
-adcin TRUE \
-adcext raw \
-ctlfn /home/myplace/myfolder/mystuff.ctl \
-tactlfn /home/myplace/myfolder/mystuff.cor \
-ctloffset 0 \
-ctlcount 100000000 \
-datadir /home/myplace/myfolder/ \
-agcmax TRUE \
-langwt 6.5 \
-fwdflatlw 8.5 \
-rescorelw 9.5 \
-ugwt 0.5 \
-fillpen 1e-10 \
-silpen 0.005 \
-inspen 0.65 \
-top 1 \
-topsenfrm 3 \
-topsenthresh -70000 \
-beam 2e-06 \
-npbeam 2e-06 \
-lpbeam 2e-05 \
-lponlybeam 0.0005 \
-nwbeam 0.0005 \
-fwdflat FALSE \
-fwdflatbeam 1e-08 \
-fwdflatnwbeam 0.0003 \
-bestpath TRUE \
-kbdumpdir /home/myplace/myfolder/mystuff \
-dictfn /home/myplace/myfolder/mystuff/mystuff.dic.conv \
-noisedict /usr/local/sphinx2/hmm/hmm/6k/noisedict \
-phnfn /usr/local/sphinx2/hmm/hmm/6k/phone \
-mapfn /usr/local/sphinx2/hmm/hmm/6k/map \
-hmmdir /usr/local/sphinx2/hmm/hmm/6k \
-hmmdirlist /usr/local/sphinx2/hmm/hmm/6k \
-8bsen TRUE \
-sendumpfn /usr/local/sphinx2/hmm/hmm/6k/sendump \
-cbdir /usr/local/sphinx2/hmm/hmm/6k \
-verbose 9

Explanation: Because this program takes a lot of command-line arguments, it’s easiest to put them into a script file like this. The first line of the file is the actual program invocation. Your sphinx2-batch binary might be in a different place, so make sure you find it. Each \ tells the shell to treat the next line as a continuation of the current one, so every line except the last must end with a \.

Notice all the lines that read one of our mystuff files. For example, the -ctlfn flag looks at the file /home/myplace/myfolder/mystuff.ctl that we made earlier. The same goes for the -tactlfn and your .cor file, the -datadir flag and where all of your stuff is, the -kbdumpdir flag and where you have your DMP file, and the -dictfn flag that tells where your .dic file is. The rest of the args you can leave alone, but make sure the ones that look for a specific file or directory are looking in the right place. For a full description of the flags, look at this website: http://cmusphinx.sourceforge.net/sphinx2/doc/sphinx2.html
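Since a wrong path is the easiest mistake to make in this file, here is a sketch that pulls every argument starting with / out of the script and reports the ones that don’t exist on disk. It is a plain text scan and knows nothing about what each flag means, so a path that Sphinx2 creates for itself could show up as a false alarm. The function name check_script_paths is made up.

```bash
# Print every absolute path named in a script that does not exist on disk.
check_script_paths() {
    local script="$1" p
    # Collect each whitespace-separated word that begins with a slash.
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i }' "$script" |
        sort -u |
        while IFS= read -r p; do
            [ -e "$p" ] || echo "not found: $p"
        done
}
```

Run check_script_paths run-batchalig.sh from your folder. No output means every path in the script points at something that exists.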

Running the Script

1. Make the script executable (chmod a+x run-batchalig.sh) while in your folder, then run it. It may take a while to run. Keep your eye on it in case there are errors. You can redirect the results cleanly to a file like this:

./run-batchalig.sh > results.txt
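The log lines quoted in the comments start with ERROR: when something goes wrong. Assuming that prefix is consistent, the sketch below runs a command, saves both output streams to a log (I haven’t confirmed which stream Sphinx2 writes its messages to), and shows any error lines with their line numbers. The function name run_and_scan is made up.

```bash
# Run a command, save everything it prints, and show any ERROR lines.
# Usage: run_and_scan LOGFILE COMMAND [ARGS...]
run_and_scan() {
    local log="$1"
    shift
    "$@" > "$log" 2>&1             # stdout and stderr both go to the log
    if grep -n '^[[:space:]]*ERROR:' "$log"; then
        return 1                   # at least one error line was found
    fi
    echo "no ERROR lines in $log"
}
```

For example: run_and_scan results.txt ./run-batchalig.sh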

6 Comments

  1. bakuzen » Blog Archive » Word-Level Forced Alignment in Sphinx4:

    […] not sure what forced alignment is, I posted previously on what it is and how to do it in sphinx 2 here. I’ve been working feverishly to find a way to do a phoneme-level alignment like sphinx2 can […]

  2. Marcin Bugala:

    hi,

    i followed your tutorial and i get the following error:

    INFO: d:\boogie\sphinx2-0.6\sphinx2-0.6\src\libsphinx2\fbs_main.c(1798):
    Doing ‘damn -858993460) ╠╠╠╠ […] ╠╠╠╠damn (-858993460 ╠╠╠╠ […] ╠╠╠╠damn’ in utterance E01A_DLG16_05_Ray
    INFO: d:\boogie\sphinx2-0.6\sphinx2-0.6\src\libsphinx2\uttproc.c(1186): Batchmode
    INFO: d:\boogie\sphinx2-0.6\sphinx2-0.6\src\libsphinx2\uttproc.c(1382): Samples histogram (E01A_DLG16_05_Ray) (4/8/16/30/32K): 64.4%(44488) 14.7%(10124) 13.7%(9489) 6.9%(4739) 0.3%(212); max: 32606
    7.195 = AGC MAX
    INFO: d:\boogie\sphinx2-0.6\sphinx2-0.6\src\libsphinx2\sc_vq.c(576): VQ-TIME= 0.0sec, SCR-TIME= 0.0sec (CPU)
    ERROR: “d:\boogie\sphinx2-0.6\sphinx2-0.6\src\libsphinx2\fbs_main.c”, line 2061: Partial alignment not implemented

    My sentence is:

    “damn it take cover”

    My generated grammar is:

    COVER K AH V ER
    DAMN D AE M
    IT IH T
    TAKE T EY K

    it’s the same for other files. I’m using windows vista.
    Any idea what’s wrong?

  3. Marcin Bugala:

    ok, this error is because i wrote ALIGN_ALL instead of align_all, however i get another error:

    ERROR: “d:\boogie\sphinx2-0.6\sphinx2-0.6\src\libsphinx2\time_align.c”, line 3104: Last state not reached at end of utterance

  4. admin:

    I had that error before… sadly, I didn’t write down the solution. It means that you have a set of words that align to phonemes, but you don’t use up all the phonemes by the time the utterance is done with the alignment, so you have some leftover states. You can either fix your audio file (i.e., take off any extra space at the front) or make sure there isn’t any tag information left in it. Of course, there may be another format issue in one of your files; for example, you may need an extra empty line at the end of your text files. It’s usually something silly like that.

  5. Marcin Bugala:

    still the same error

    i tried:
    - running on windows vista and kubuntu 9.10
    - different sound files
    - adding and deleting an extra empty line in files mystuff.cor, mystuff.ctl, mystuff.dic.conv
    - searching for tags in a .raw file using hex editor hxD (not sure though what those tags should look like?)
    - adding silence at the end of a sound file

    also i noticed, that you should put *align_all* in .cor file AFTER you generate dictionary using online tool or else you’ll get phoneme transcription for *align_all*

  6. admin:

    I found this on the CMU website:

    Q. During force-alignment, the log file has many messages which say “Final state not reached” and the corresponding transcripts do not get force-aligned. What’s wrong?

    A. The message means that the utterance likelihood was very low, meaning in turn that the sequence of words in your transcript for the corresponding feature file given to the force-aligner is rather unlikely. The most common reasons are that you may have the wrong model settings or the transcripts being considered may be inaccurate. For more on this go to Viterbi-alignment.
