Word-Level Forced Alignment in Sphinx4

If you’re not sure what forced alignment is, I posted previously on what it is and how to do it in sphinx2 here. I’ve been working feverishly to find a way to do a phoneme-level alignment like sphinx2 can do, but I haven’t been able to without spending many hours deep in the code. Maybe our friends at CMU will make that available to us sometime in the future. For now, we have word-level alignment, and if you must have phoneme-level alignment, refer to my original post.

It’s not much more work than setting up sphinx4 and then getting the right result. There is a ForcedAlignerGrammar that you should use along with the DynamicFlatLinguist. That means some changes to your config file. Add the following:

<component name="forcedGrammar" type="edu.cmu.sphinx.linguist.language.grammar.ForcedAlignerGrammar">
<property name="dictionary" value="dictionaryWSJ"/>
<property name="referenceText" value=""/>
<property name="addSilenceWords" value="true"/>
<property name="addFillerWords" value="false"/>
</component>

<!-- This might already be in your config file -->

<component name="dynamicFlatLinguist"
type="edu.cmu.sphinx.linguist.dflat.DynamicFlatLinguist">
<property name="logMath" value="logMath"/>
<property name="grammar" value="forcedGrammar"/>
<property name="acousticModel" value="wsj"/>
<property name="wordInsertionProbability"
value="${wordInsertionProbability}"/>
<property name="silenceInsertionProbability"
value="${silenceInsertionProbability}"/>
<property name="languageWeight" value="${languageWeight}"/>
<property name="unitManager" value="unitManager"/>
</component>

You can download my full config file here. By the way, I set the log level to INFO. If you don’t like all the output, you can set it back to WARNING.

As for the code, you need to change a few things in the FileRecognizer class we made before. You have to bring in a lot of classes from the configuration manager because you have to allocate a few things by hand at certain times to get things working correctly. At least, this was my experience. You’ll have to add these lines next to where you get the recognizer from the configuration manager:

grammar = (ForcedAlignerGrammar) cm.lookup("forcedGrammar");
ling = (DynamicFlatLinguist) cm.lookup("dynamicFlatLinguist");
ling.allocate();

Notice that the grammar is the ForcedAlignerGrammar that we added to the config file. Be sure that class is being imported.

The next change is setting up the grammar in the call to the recognizer. Recall that the point here isn’t to transcribe spoken audio. The term “forced alignment” means you take the audio and the transcript and you find the timestamps where the audio aligns with the text. Therefore, you need to tell the recognizer what text to align the audio file with. In my case, I have an audio file that says “are you done” so I need to tell the grammar and the recognizer what the reference text is. You can set the reference text in the recognizer as you make the actual call to do the recognition:

grammar.setReferenceText("are you done");
recognizer.allocate();
Result result = recognizer.recognize("are you done");

Note also here that you allocate the recognizer just before you use it. The last thing you need to do is get the result that has the timestamps. That can be done with this line:

System.out.println(result.getTimedBestResult(true, true));

The two boolean parameters are for fillers (silences) and if you want the word token first. That is, the first one is true if you want to see where the pauses or silences start and finish between the words, which can be useful. The second one should probably be set to true because you want to know what word the timestamps are referring to.

When I run my recognition with “are you done” as the reference text, this is what I get:

<s>(0.0,0.32) are(0.32,0.53) <sil>(0.53,0.62) you(0.62,0.81) done(0.81,1.1) <sil>(1.1,1.47)
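If you want to use those timestamps in a program rather than just read them, you can pull the string apart yourself. Here is a minimal sketch that assumes the output always has the shape token(start,end) as shown above. TimedResultParser and TimedWord are hypothetical helper names of my own, not sphinx4 classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of sphinx4): splits a timed result string into entries. */
class TimedResultParser {

    /** One aligned token with its start and end time in seconds. */
    static final class TimedWord {
        final String word;
        final double start;
        final double end;

        TimedWord(String word, double start, double end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }

        /** True for tokens written in angle brackets, such as <s> or <sil>. */
        boolean isFiller() {
            return word.startsWith("<") && word.endsWith(">");
        }
    }

    // token(start,end), tolerating stray spaces around the numbers
    private static final Pattern ENTRY =
            Pattern.compile("([^\\s(]+)\\(\\s*([0-9.]+)\\s*,\\s*([0-9.]+)\\s*\\)");

    static List<TimedWord> parse(String timedResult) {
        List<TimedWord> words = new ArrayList<TimedWord>();
        Matcher m = ENTRY.matcher(timedResult);
        while (m.find()) {
            words.add(new TimedWord(m.group(1),
                    Double.parseDouble(m.group(2)),
                    Double.parseDouble(m.group(3))));
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "<s>(0.0,0.32) are(0.32,0.53) <sil>(0.53,0.62) "
                + "you(0.62,0.81) done(0.81,1.1) <sil>(1.1,1.47)";
        // Print only the real words, skipping silences and sentence markers.
        for (TimedWord w : parse(line)) {
            if (!w.isFiller()) {
                System.out.println(w.word + " " + w.start + " " + w.end);
            }
        }
    }
}
```

In the FileRecognizer class you would pass result.getTimedBestResult(true, true) to parse() instead of the hard-coded string.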

The final java code can be found here. I know I retype the reference text and that’s not good programming practice. I’m not going to tell you where the file needs to go or any other eclipse-specific thing because if you’re wanting to do forced alignment, you probably know what you’re doing. But, if you have any questions even about that, please let me know. Good luck!

The Case for Standards

Standards are a huge issue in the computational linguistics world. At CALICO and LREC there were some big discussions about standards, more so at CALICO. There was talk about standardizing XML schemas, or some format for something so everyone could read it. No one should have proprietary software, according to most people there. Well, here is a practical post on a famous computer science blog about standards: Martian Headsets.

CALICO

The Computer Assisted Language Instruction Consortium, or CALICO, met in San Francisco in March this year. Some work I did using Elicited Imitation as a measure for second language proficiency (and working on automating that measure with speech recognition) was accepted and I was able to attend. The pre-conference workshop was probably the best part of it all, but the rest of the conference was worth going to if you want to see what you can do with technology to teach a second language. I was hoping for more NLP and computational linguistics content, but it was mostly a group of second language instructors at the college level along with language learning software companies trying to see how they could most effectively instruct students to learn language.  If someone happened to be using NLP to get the job done, then that was interesting.

It was interesting for me personally for my own language learning studies. If you are a DIY language learner, then you’ll know how difficult it can be. Some people think that simple regular exposure to the language will suffice while others think that a detailed study of grammar is necessary. The former makes sense because, well, children don’t learn grammar and end up speaking fluently. It takes a few years, so that’s where a study of grammar will come in at the right times to accelerate that learning. This is all anecdotal, of course, but my own experience makes me believe this. I speak Japanese with a level of fluency, but I’m not yet proficient in German or Spanish (the two languages I am working on now). I focus on increasing my Japanese knowledge and vocabulary along with being exposed to German. I listen to music and news podcasts in both languages. I can tell you what Japanese and German music I like if  you ask, but for now I’ll just give you links to the two podcasts I listen to:

Deutsche Welle Podcast

Yomiuri News Podcast

The former is a 10-minute, slowly read podcast specifically for those learning German. The latter is a 20-minute news reel including sports (mostly Japanese baseball), an opinion column, and a daily topic. I am usually able to get a few new vocabulary words each day.  There is another podcast, Learn to Speak German, that is done by a German guy. It’s quite good, but more for the intermediate German student.

You can find a lot of resources for most languages out there. It took me a while to find a good podcast for Japanese news, but there are several out there. Some daily exposure to the language is very helpful, even if you can’t fully understand it. Try to pick out vocabulary and phrases and write them down so you can look them up later. Pick up a grammar book and see what grammar principles you can pull from the phrases. After you feel you understand the news pretty well, purchase a book you know pretty well (I got Harry Potter) for you to read out loud so you can help train your mouth and mind to use the language. The best would be to speak with a native in the language after you feel comfortable. After some time doing  this you’re on your way to learning a new language.

For more ideas on how to best learn a language, see the CALICO website.

Festival

Festival is basically the opposite of Sphinx. Sphinx is a speech recognition engine whereas Festival is a way of taking text and making it into speech. Like Sphinx, it only has a certain level of ability, but it does quite well considering the difficult problem that text to speech (TTS) is. Also, like Sphinx, it is currently a project of Carnegie Mellon University. It’s by no means completely fluent sounding, but it’s easy to understand and very useful for things like dictating news or emails for you among many other applications.

I will walk through two guides to help you get Festival up and running. First I will do it the more difficult way and then show you Ubuntu users the easier way.

Festival on Linux

I assume that you’re using Linux, but Festival will work on Windows with a similar install procedure (though many of the things that it requires, like a C and Scheme compiler, tend to come standard with, or are easy to get for, Linux).

1. Go to the download site: http://festvox.org/packed/festival/latest/
2. Download the following files (or the latest version) from that site:

  • festival-1.96-beta.tar.gz (the TTS engine)
  • speech_tools-1.2.96-beta.tar.gz (some necessary tools)
  • festlex_CMU.tar.gz (the CMU pronunciation lexicon)
  • festlex_POSLEX.tar.gz (the part of speech lexicon)
  • festvox_kallpc16k.tar.gz (the speech model that contains the voice information)

3. Create a folder and put all of the files you just downloaded into that folder
4. Untar all the files in the order that they are named in step 2. eg:

>tar -xvf  festival-1.96-beta.tar.gz

5. You should now have a festival and a speech_tools folder with everything in them
6. Configure and build speech tools by navigating into the speech_tools folder ( > cd speech_tools) and running the command:

>./configure (this will take a few seconds to run)
>make (this could take a few minutes- this compiles everything and creates the files necessary to run the program on your machine)

7. Follow step 6 again, except this time navigate into the festival folder instead of the speech_tools folder
8. You should now be ready to run festival. You can open it by navigating to the festival/bin folder and running the command:

> ./festival

This will open the program. You will see a command line interpreter. You can type “help” to get some help or go to the online manual here. It takes in specific commands or any Scheme command. The TTS system is called by invoking several Scheme commands. You can test it by simply running the command:

> (SayText "Hello World")

and you should hear a voice say “Hello World” from your computer speakers (assuming you have some that work and are turned on).

Now create a file called “myfile.txt” and open it up with an editor. Add a line of text to it (for example, “I know how to use festival”) and save the file. Then run this command in festival (either make sure you save the file in the same directory you are running festival from, or type the full path to the file in the quotes):

> (tts "myfile.txt" nil)

You can quit festival by running the command:

(quit)

Finally, you can run festival on files without having to actually enter the festival interpreter. You can invoke festival with the --tts flag and give it as many filename arguments as you wish. For example (again, you may need to type the full path of the file):

>festival --tts myfile.txt

This is what you would typically use if you were going to use Festival with other programs: save the text to a file and then invoke Festival to read that text. There is a festival server that you can send requests to over a certain port, but it is known to have security issues. There are several voice models you can use, one with a US English accent and one with a British English accent. Both are found on the download site. You can go to festvox.org to see how you can contribute to creating these voice models and improving them.
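Since the other posts here use Java, here is a rough sketch of calling Festival from a Java program with the --tts flag described above. It assumes the festival binary is on your PATH (otherwise pass the full path to it). FestivalRunner is a hypothetical class name of my own, not something that ships with Festival.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper that shells out to the festival command-line program. */
class FestivalRunner {

    /** Builds the command line: festival --tts file1 file2 ... */
    static List<String> buildCommand(String festivalBinary, List<String> textFiles) {
        List<String> command = new ArrayList<String>();
        command.add(festivalBinary);
        command.add("--tts");
        command.addAll(textFiles);
        return command;
    }

    /** Runs festival and waits for it to finish; returns its exit code. */
    static int speak(String festivalBinary, List<String> textFiles)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(festivalBinary, textFiles))
                .inheritIO() // show festival's own messages in this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java FestivalRunner myfile.txt [more files]");
            return;
        }
        // Assumption: "festival" can be found on the PATH.
        int exitCode = speak("festival", Arrays.asList(args));
        System.out.println("festival exited with code " + exitCode);
    }
}
```

Going through the command line like this avoids the festival server and its security issues, at the cost of starting a new process for each request.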

Ubuntu Installation

1. sudo apt-get install festival
2. sudo apt-get install speech-tools

This final step can be replaced with any of the possible voices below.

3. sudo apt-get install festvox-kdlpc16k
Possible voices:
festvox-don - minimal British English male speaker for festival
festvox-ellpc11k - Castilian Spanish male speaker for Festival
festvox-rablpc16k - British English male speaker for festival, 16khz sample rate
festvox-rablpc8k - British English male speaker for festival, 8khz sample rate

You can then run festival simply by typing festival in a shell and following step 8 above.

Asterisk

Someone asked about Asterisk, the open source telephony software. I have to admit that I had never heard of it before, so I took the opportunity to do some research. It does everything telephony, which includes Call Queuing, Call Recording, Call Retrieval, Call Routing (DID & ANI), Call Transfer, Call Waiting, Caller ID, VoIP Gateways, and the list goes on. It supports several protocols, as well. Have a look at the features page to see what it is capable of. You can even purchase the Asterisk appliance that has a fully functional Asterisk server with all the hardware needed to utilize it.

Sadly, Asterisk isn’t technically an NLP tool, though it does have a transcriber and other speech processing parts to it, and I see why people would want to use it with other NLP technologies. I’m sure it will come up again, and I may end up using it someday. I went ahead and installed it and ran into some problems that I found to be pretty common, so I’ll finish by including the installation procedure that I followed all the way up to running and accessing Asterisk. You should be able to go from there.

Installing Asterisk

After you download the latest stable version, type these commands in a console (assuming Ubuntu):

> tar -xvf asterisk

> ./configure

If you get an error that says you need the termcap, do this:

> sudo apt-get install libncurses5-dev

and then re-try > ./configure.

If you’re not using Ubuntu, then you can search for the libtermcap libraries. There are RPMs out there.

> make

> make install

> asterisk -r

By now you should be at the Asterisk console. It is waiting for scripts. If you type ‘help’ you should see a very long list of commands. If you’re serious about using Asterisk commercially or spending a lot of time with it, then buy the book.

Using Sphinx4

If you followed the last post, you set up sphinx and got it to work, but now you want to really use it. There are a number of parts of the recognizer you need to know about that I will discuss:

1. Configuration File

2. Grammar

3. Acoustic Model

4. Linguist

5. Dictionary

Configuration File

This file tells the recognizer everything about how you want the recognizer to work. It is located (if you followed my previous post) in src->wavfile-> config.xml. It is an xml file, so you can view it in the xml editor in eclipse which has the ability to collapse part of the file, but I prefer a regular text editor. You can view it in the default text editor by right-clicking on it, open with->text editor. You’ll see the first part of the file:

<property name="absoluteBeamWidth" value="-1"/>
<property name="relativeBeamWidth" value="1E-90"/>
<property name="wordInsertionProbability" value="1E-36"/>
<property name="languageWeight" value="8"/>
<property name="silenceInsertionProbability" value="1"/>
<property name="skip" value="0"/>
<property name="logLevel" value="WARNING"/>

<property name="recognizer" value="recognizer"/>
<property name="linguist" value="flatLinguist"/>
<property name="frontend" value="mfcFrontEnd"/>

Linguist

Unless you know what you’re doing, don’t change anything here. The last three lines define which recognizer, linguist, and frontend you’re going to use. Focus your attention on the linguist and note that it’s called “flatLinguist.” If you search through the file, you’ll see a component named “flatLinguist” which tells the recognizer which grammar and acoustic model to use. One of the purposes of the linguist is to bring these things together. It’s simply a part of sphinx that helps the recognizer.

Acoustic Model

The acoustic model is a very important piece of the system. Strictly speaking, it’s not part of the recognizer itself; it is one of the things you can swap out. If you followed the previous post, you’ll notice that you used “TIDIGITS” (file name is TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar). The acoustic model holds the statistical information that the recognizer needs to put phonemes, syllables, words, and then sentences together. Different models come from different sources and are for different purposes. For example, the TIDIGITS model is specifically designed and created for numbers. If your application only needs to recognize numbers (for example, your cell phone listens to you speak numbers) then you would use this model. Notice that there are two more models. Focus your attention on the one called WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar (by the way, these acoustic models follow a specific naming convention, hence the long, complicated name). This WSJ acoustic model was created from Wall Street Journal spoken text. It, like other acoustic models, contains a dictionary of words spoken, and how to map the phonemes to those words. This model would be very useful for most spoken words you would say, so I will walk you through getting sphinx set up with this model. In order to do that, you have to get things set up in the config.xml file.

Add this info to your config.xml file in the Acoustic Model Configuration Information section:

<component name="acousticModelWSJ"
type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model">
<property name="loader" value="sphinx3LoaderWSJ"/>
<property name="unitManager" value="unitManager"/>
</component>

<component name="sphinx3LoaderWSJ"
type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.ModelLoader">
<property name="logMath" value="logMath"/>
<property name="unitManager" value="unitManager"/>
</component>

This sets up the acoustic model WSJ along with the necessary loader. If you compare it to the TIDIGITS model and loader, you’ll notice that it’s pretty much the same.

Dictionary

With the acoustic model in place, you now need to set up sphinx to use the dictionary that the acoustic model can understand and use. You can paste this into the config.xml file in the section called The Dictionary Configuration:

<component name="dictionaryWSJ"
type="edu.cmu.sphinx.linguist.dictionary.FullDictionary">
<property name="dictionaryPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
<property name="fillerPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
<property name="addSilEndingPronunciation" value="false"/>
<property name="unitManager" value="unitManager"/>
</component>

Grammar

You’re almost ready to run the recognizer using the WSJ acoustic model and dictionary. Now you just have to make sure the grammar and linguist know to use it instead of the TIDIGITS model. Underneath The Linguist Configuration, you’ll notice this line:

<property name="acousticModel" value="acousticModel"/>

But if you look at the acousticModel component lower in the config.xml file, that uses the TIDIGITS model. The one you want is acousticModelWSJ, so you can remove the above line and paste this in its place:

<property name="acousticModel" value="acousticModelWSJ"/>

Similarly, in The Grammar Configuration, you need to change this line:

<property name="dictionary" value="dictionary"/>

to

<property name="dictionary" value="dictionaryWSJ"/>

Be sure to save the config file. You can download my config file made after all of these changes here (be sure to right-click->save as).

Now, if you run the recognizer (WavFile) like you did in the previous post, it should still recognize “one two three four five.” This model contains at least some digit information, so it recognized this file just fine.

In order to recognize your own file, there are just a few more things you need to do. First, you need a file that contains some kind of utterance. For example, you might record yourself saying “I like to drive cars” and save it as a 16-bit wav file called “myfile.wav” (sphinx only takes 16-bit wav files). Copy that file into the src/wavfile folder (the same place where the other wav files are), and on line 53 of WavFile.java, change what is inside the quotes from “12345.wav” to “myfile.wav”. One more thing: you need to find the WSJ model file (the one with the really long filename), right-click on it, and add it to the build path.

Once you have completed that, you have to change the grammar to listen to what you are going to say. In src/wavfile, find the file called digits.gram and open it (still in eclipse). Notice that this grammar has the digits that were spoken before in the 12345.wav file. It looks like this:

#JSGF V1.0;

/**
* JSGF Digits Grammar for Hello World example
*/

grammar digits;

public <numbers> = (oh | zero | one | two | three | four | five | six | seven | eight | nine) * ;

Change it to look like this (or adapt it to include the words you actually recorded earlier):

#JSGF V1.0;

/**
* JSGF Digits Grammar for Hello World example
*/

grammar digits;

public <numbers> = (I | like | to | drive | cars) * ;

Save the file, then try to recognize it. The recognizer should pick it up. In future posts, we’ll go into specifics on the grammar and how to convert mp3s in sphinx so you don’t always have to convert everything to wav.
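If you record a lot of different sentences, editing the .gram file by hand gets old. As a rough sketch, a word-loop grammar like the one above can be generated from a transcript. JsgfWordLoop is a hypothetical helper of my own (not part of sphinx4), and it assumes every word in the transcript is already in the dictionary you configured.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: builds a JSGF word-loop grammar from a transcript. */
class JsgfWordLoop {

    static String build(String grammarName, String ruleName, String transcript) {
        // Keep each distinct word once, in the order it was first spoken.
        Set<String> words = new LinkedHashSet<String>();
        for (String word : transcript.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }

        // Join the words as alternatives: (I | like | to | ...)
        StringBuilder alternatives = new StringBuilder();
        for (String word : words) {
            if (alternatives.length() > 0) {
                alternatives.append(" | ");
            }
            alternatives.append(word);
        }

        return "#JSGF V1.0;\n\n"
                + "grammar " + grammarName + ";\n\n"
                + "public <" + ruleName + "> = (" + alternatives + ") * ;\n";
    }

    public static void main(String[] args) {
        // Same grammar and rule names as digits.gram, so config.xml needs no change.
        System.out.print(build("digits", "numbers", "I like to drive cars"));
    }
}
```

Save the printed text over digits.gram (or into a new .gram file, in which case the grammarName property in config.xml has to match the new name).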


Sphinx4

What is Sphinx?

Sphinx is an open source project by Carnegie Mellon University that deals with Natural Language Processing. I primarily use it for speech recognition.

Read more about it on the CMU Sphinx project page.

A four page PDF overview of the sphinx four system.
There are several versions, the latest being written in Java, which is what I’m going to walk through below.

Getting Sphinx

You can get the sphinx code or binaries from sourceforge. If you’re feeling really lucky, get the source, or check it out from subversion, but if you just want to use the engine for speech recognition, just download the binaries. It comes in a jar. We’ll step through that in this post.

When you go to the above sourceforge link, select sphinx4 then download the sphinx4 bin file. Once you download and unzip it, you’ll see a few jars in the bin folder, a demo folder, some documentation, and a lib folder (among other things). The lib folder has the sphinx4 jar in it. What you want is the entire lib folder because it has everything you need in it to do some speech recognition.

Running Sphinx

Get an IDE. You can use whatever IDE you want (IntelliJ, NetBeans) but I will step you through eclipse, which is free. You can read about how to get eclipse in my previous post. In eclipse, create a new Java project (file->new->Java Project) and give it a name (I called mine sphinx4). You’ll see that it made a src folder. Copy the lib folder from the sphinx4 download folder you just unzipped by pasting it into the root folder of the project. Also go into the demo folder and copy the wavfile folder and paste it into the src folder in your eclipse project.

There’s one more file you need. The jsapi.jar file is necessary, but it doesn’t show up anywhere. There is a legal issue about just downloading the jar file, so in the lib folder you’ll see the jsapi.exe file. Run that and the jsapi.jar file will magically appear in the same folder as the jsapi.exe file. In linux, run the jsapi.sh file and it should have the same result. If you can’t get it, Google for it and you should be able to find it. If all else fails, let me know and I’ll help you get it. It must be in your lib folder before we move on.

With the wavfile folder in the src folder and the lib folder under your project root (and with the jsapi.jar file in the lib folder), you can start to link in the jars that you will need to do some simple speech recognition. Expand the lib folder and you’ll see the following jar files in it:

js.jar
jsapi.jar
sphinx4.jar
TIDIGITS_.jar

Right-click on the sphinx4 jar->build path->add to build path. This adds (links) the jar to your build path, which allows the IDE to use code from the jar for your project. Do the same for all of the jars above.

When that is done, your folder structure should look something like this:

[Screenshot: snapshot1.png, the project folder structure in eclipse]

Notice that there are a few wav files, a .gram file, and a config.xml file. You’ll need to open the config.xml file (right-click on the file->open with->text editor; otherwise it’ll open some xml editor that is hard to understand). Find the part of the file that looks like this:

<component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
<property name="dictionary" value="dictionary"/>
<property name="grammarLocation"
value="resource:/demo.sphinx.wavfile.WavFile!/demo/sphinx/wavfile/"/>
<property name="grammarName" value="digits"/>
<property name="logMath" value="logMath"/>
</component>

It’s about halfway into the file. You need to make some changes here. Instead, it should look like this (you can paste this in or just remove the demo.sphinx from the first part and the /demo/sphinx from the second part of the middle line):

<component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
<property name="dictionary" value="dictionary"/>
<property name="grammarLocation"
value="resource:/wavfile.WavFile!/wavfile/"/>
<property name="grammarName" value="digits"/>
<property name="logMath" value="logMath"/>
</component>

Save the file (ctrl-s). Now you’re ready to run sphinx and recognize some simple speech.

Go ahead and open up the wav file named 12345.wav and listen to it. Notice that the spoken words are just that: one two three four five. If all goes well, that’s what sphinx should recognize.

You can run the program by right-clicking on the WavFile.java file (located in the src/wavfile folder)->Run As->Java Application. This should run the recognizer. After a few seconds, you’ll see the text “one two three four five” show up in the Console portion of eclipse. If you got that far, nice work. You were able to perform some speech recognition with sphinx.

eclipse

For most of the things I am going to post about, I use eclipse, the open source IDE primarily used for Java. It can also be used for C, C++, PHP, JSP, and other languages. I recommend setting it up with some kind of CVS or subversion, which I will step through here.

1. Download eclipse. You can download it here and get whatever flavor you want, but I prefer the Europa version (the PHP and web programming flavor) for now. Once you download it, unzip it and put it where you’d like. It doesn’t have an installer. Go into the eclipse folder and open eclipse.exe. It’ll ask where you want your workspace. That doesn’t matter. Go into your workspace. At this point it’s waiting for you to make some projects.

2. Set up eclipse with subversion. Do this by clicking on Help->Software Updates->Find and Install. Select Search for New Features and click Next. Click on New Remote Site. Give it the name “Subclipse” and put http://subclipse.tigris.org/update_1.2.x in the url (if you find out that it won’t let you download because of some version issues, then change the update_1.2.x to update_1.0.x). Click OK. Make sure the Subclipse check box is checked. Click Finish. This will take you to a screen where you decide which packages to get. Just select everything and download everything. If there is an end-user license agreement, then agree to it, etc. NOTE: This is version 1.2 of Subclipse. If you have an older version of eclipse, you’ll have to use the older version (just Google for subclipse).

It will ask you to restart eclipse. Do so. You can now make new projects with eclipse or through subversion.

Forced Alignment in Sphinx2

Getting and Installing Sphinx2 and Sphinx3
Sphinx2 was designed to run on any architecture that has a C compiler. For our purposes, we’ll use a Linux system. The steps below can be followed for Sphinx2 or Sphinx3 (at least the installation steps are the same).

1. Download the software from CMU Sphinx

2. Untar the file (tar -xvf filename) and set the permissions to run the config file (chmod a+x configure)

3. Run the config file: (./configure) from the proper directory. This will generate a make file

4. Run make (just type make). This compiles everything.

5. Run make install (make install) as root. This should install everything you need where things belong.

Preparing for Forced Alignment

1. You will need the following files:
-the audio files
-control
-corpus
-dictionary
-language model
-language model DMP
-run-batchalig.sh

NOTE!! Make sure you save your files in a linux format. In other words, either create the file in linux or open a file that does work in linux, save it as something else, then paste in your information. If you don’t, you may spend 3+ hours wondering why your ctl and cor files aren’t being read properly.

2. Create a folder where all of these files will go

3. Convert the audio files to raw format. If you are taking files from mp3 format to wav and then to raw, BE SURE to remove all of the mp3 tag data or the recognizer will choke. There are many downloadable programs that can remove the tag information, and you can remove it by hand in a text editor if you’re careful. For the conversion itself there are many tools out there, but one useful one is called sox. You can download it and run it on a Windows or Linux machine. Once you have it, run the following script (in Linux bash):

for f in *.wav; do sox "$f" "${f%.wav}.raw"; done

This will convert all of your wav files to raw and name them appropriately. It won’t move or delete the original files.

4. Put all of those converted files into your folder.

5. Create the control file. This file simply says where all of the audio files are. You don’t need to put the .raw extension. For example, if you have two raw files called file1.raw and file2.raw, your control file would look like this:

file1
file2

For the sake of argument, we’ll name the file mystuff.ctl

6. Create a corresponding corpus file. This is basically a transcript of what is said in the files. The lines of the transcript should correspond to the files in your control file. For example, if the file1 audio is “I like to run” and the file2 audio is “I ran 8 miles” then your corpus file should look like this (the first line must have *align_all*):

*align_all*

I like to run
I ran 8 miles

Note that there is no punctuation and that the order of lines corresponds to the filenames in the control file. For the sake of argument, we’ll name this file mystuff.cor

7. Next, you need a dictionary. A dictionary file has each word in your corpus file (step 6) with annotated phonemes. That would be a pain to do by hand, especially if you had a lot of words to annotate. CMU has an online tool to help you:

http://www.speech.cs.cmu.edu/tools/lmtool-adv.html

This actually generates more than just the dictionary file. It also generates the language model file that you also need, and a sentence file. Just browse to your sentence corpus file (mystuff.cor). You can give it an optional exception dictionary, which is a dictionary file that already has some annotated words. This will override the default generated annotations for those words with the annotation you give in that exception dictionary. The other thing you can input is an additional words file, which is just a list of words (unannotated) that you want included in your dictionary.

You can tell it which phone set to use. Just use the default and click “Compile Knowledge Base” which will take a few seconds. When it’s done, paste in the text from the two files into your dictionary and language model files. For the sake of argument, they would be called mystuff.dic and mystuff.lm. You can save the sentence file also, which may come in handy sometime.

But you’re not done yet. You have to convert the dictionary file because sphinx2 won’t recognize the phoneme set given from the website. You need to run the following command:

stress2sphinx mystuff.dic >> mystuff.dic.conv

Now the latter file is the dictionary you will use.

8. Now you need the language model DMP file. This is basically your language model in binary form. Rename your file to an4.ug.lm otherwise it may not work. Then, invoke the program that came with Sphinx2 (or you may need to download the lm3g2dmp from CMU and run make):

lm3g2dmp <text LM file> <target directory>

(eg, lm3g2dmp an4.ug.lm /home/mydir)

This will create a dump file called an4.ug.lm.DMP. Copy that file into your folder.

9. CHECKPOINT: You should now have all the files you need except the run-batchalig.sh. See the files in step 1 above and make sure you have them all in your folder except that one file.
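If you have more than a handful of audio files, the control and corpus files from steps 5 and 6 are easy to generate. Here is a minimal sketch, assuming one transcript per raw file and that the transcripts start right after the *align_all* line. AlignmentFiles is a hypothetical helper name of my own, and the file names are the ones used above. It writes plain \n line endings, which also takes care of the linux-format warning in step 1.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: writes mystuff.ctl and mystuff.cor with Unix line endings. */
class AlignmentFiles {

    /** One audio file name per line, without the .raw extension. */
    static String controlFile(Map<String, String> utterances) {
        StringBuilder sb = new StringBuilder();
        for (String name : utterances.keySet()) {
            sb.append(stripRawExtension(name)).append('\n');
        }
        return sb.toString();
    }

    /** *align_all* first, then one transcript per line in the same order. */
    static String corpusFile(Map<String, String> utterances) {
        StringBuilder sb = new StringBuilder("*align_all*\n");
        for (String transcript : utterances.values()) {
            // Drop any Windows carriage returns that came along with the text.
            sb.append(transcript.replace("\r", "").trim()).append('\n');
        }
        return sb.toString();
    }

    static String stripRawExtension(String name) {
        return name.endsWith(".raw") ? name.substring(0, name.length() - 4) : name;
    }

    public static void main(String[] args) throws IOException {
        // A LinkedHashMap keeps the files and transcripts in matching order.
        Map<String, String> utterances = new LinkedHashMap<String, String>();
        utterances.put("file1.raw", "I like to run");
        utterances.put("file2.raw", "I ran 8 miles");

        // Write into the folder given on the command line, or the current one.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        Files.write(dir.resolve("mystuff.ctl"),
                controlFile(utterances).getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("mystuff.cor"),
                corpusFile(utterances).getBytes(StandardCharsets.UTF_8));
    }
}
```

The transcripts still have to be free of punctuation, and every word in them still has to end up in the dictionary from step 7.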

Preparing the Script File

1. The final file you’ll need is the run-batchalig.sh file. It’s what brings all of your files together and runs the recognizer on them. Create the file and paste in the following contents:

/usr/local/sphinx2/src/examples/sphinx2-batch \
-adcin TRUE \
-adcext raw \
-ctlfn /home/myplace/myfolder/mystuff.ctl \
-tactlfn /home/myplace/myfolder/mystuff.cor \
-ctloffset 0 \
-ctlcount 100000000 \
-datadir /home/myplace/myfolder/ \
-agcmax TRUE \
-langwt 6.5 \
-fwdflatlw 8.5 \
-rescorelw 9.5 \
-ugwt 0.5 \
-fillpen 1e-10 \
-silpen 0.005 \
-inspen 0.65 \
-top 1 \
-topsenfrm 3 \
-topsenthresh -70000 \
-beam 2e-06 \
-npbeam 2e-06 \
-lpbeam 2e-05 \
-lponlybeam 0.0005 \
-nwbeam 0.0005 \
-fwdflat FALSE \
-fwdflatbeam 1e-08 \
-fwdflatnwbeam 0.0003 \
-bestpath TRUE \
-kbdumpdir /home/myplace/myfolder/mystuff \
-dictfn /home/myplace/myfolder/mystuff/mystuff.dic.conv \
-noisedict /usr/local/sphinx2/hmm/hmm/6k/noisedict \
-phnfn /usr/local/sphinx2/hmm/hmm/6k/phone \
-mapfn /usr/local/sphinx2/hmm/hmm/6k/map \
-hmmdir /usr/local/sphinx2/hmm/hmm/6k \
-hmmdirlist /usr/local/sphinx2/hmm/hmm/6k \
-8bsen TRUE \
-sendumpfn /usr/local/sphinx2/hmm/hmm/6k/sendump \
-cbdir /usr/local/sphinx2/hmm/hmm/6k \
-verbose 9

Explanation: Because this program takes a lot of command line arguments, it’s easiest to put those into a script file like this. The first line of this file is the actual program invocation. Your sphinx2-batch binary might be in a different place, so make sure you find it. The \’s in the file tell the shell to continue reading the next line as if it’s on the same line, so you must have a \ at the end of each line except the last.

Notice all the lines that read one of our mystuff files. For example, the -ctlfn flag looks at the file /home/myplace/myfolder/mystuff.ctl that we made earlier. The same goes for the -tactlfn and your .cor file, the -datadir flag and where all of your stuff is, the -kbdumpdir flag and where you have your DMP file, and the -dictfn flag that tells where your .dic file is. The rest of the args you can leave alone, but make sure the ones that look for a specific file or directory are looking in the right place. For a full description of the flags, look at this website: http://cmusphinx.sourceforge.net/sphinx2/doc/sphinx2.html

Running the Script

1. Make the script executable (chmod a+x run-batchalig.sh) while in your folder, then run it. It may take a while to run. Keep your eye on it in case there are errors. You can pipe the results cleanly to a file like this:

./run-batchalig.sh > results.txt