Word-Level Forced Alignment in Sphinx4
If you’re not sure what forced alignment is, I posted previously on what it is and how to do it in Sphinx2 here. I’ve been working feverishly to find a way to do a phoneme-level alignment like Sphinx2 can do, but I haven’t been able to without spending many hours deep in the code. Maybe our friends at CMU will make that available to us sometime in the future. For now, we have word-level alignment, and if you must have phoneme-level alignment, refer to my original post.
Word-level alignment is not much more work than setting up Sphinx4 and getting it to produce the right result. There is a ForcedAlignerGrammar that you should use along with the DynamicFlatLinguist. That means some changes to your config file. Add the following:
<component name="forcedGrammar" type="edu.cmu.sphinx.linguist.language.grammar.ForcedAlignerGrammar">
    <property name="dictionary" value="dictionaryWSJ"/>
    <property name="referenceText" value=""/>
    <property name="addSilenceWords" value="true"/>
    <property name="addFillerWords" value="false"/>
</component>
<!-- This might already be in your config file -->
<component name="dynamicFlatLinguist"
           type="edu.cmu.sphinx.linguist.dflat.DynamicFlatLinguist">
    <property name="logMath" value="logMath"/>
    <property name="grammar" value="forcedGrammar"/>
    <property name="acousticModel" value="wsj"/>
    <property name="wordInsertionProbability"
              value="${wordInsertionProbability}"/>
    <property name="silenceInsertionProbability"
              value="${silenceInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
    <property name="unitManager" value="unitManager"/>
</component>
You can download my full config file here. By the way, I set the log level to INFO. If you don’t like all the output, you can set it back to WARNING.
As for the code, you need to change a few things in the FileRecognizer class we made before. You have to bring in a lot of classes from the configuration manager because you have to allocate a few things by hand at certain times to get things working correctly. At least, this was my experience. You’ll have to add these lines next to where you get the recognizer from the configuration manager:
grammar = (ForcedAlignerGrammar) cm.lookup("forcedGrammar");
ling = (DynamicFlatLinguist) cm.lookup("dynamicFlatLinguist");
ling.allocate();
Notice that the grammar is the ForcedAlignerGrammar that we added to the config file. Be sure that class is being imported.
The next change is setting up the grammar in the call to the recognizer. Recall that the point here isn’t to transcribe spoken audio. The term “forced alignment” means you take the audio and the transcript and you find the timestamps where the audio aligns with the text. Therefore, you need to tell the recognizer what text to align the audio file with. In my case, I have an audio file that says “are you done”, so I need to tell both the grammar and the recognizer what the reference text is. You set it on the grammar first and then pass it again as you make the actual call to do the recognition:
grammar.setReferenceText("are you done");
recognizer.allocate();
Result result = recognizer.recognize("are you done");
Note also here that you allocate the recognizer just before you use it. The last thing you need to do is get the result that has the timestamps. That can be done with this line:
System.out.println(result.getTimedBestResult(true, true));
The two boolean parameters control whether fillers (silences) are included and whether you want the word token first. That is, the first one is true if you want to see where the pauses or silences between the words start and finish, which can be useful. The second one should probably be set to true because you want to know what word the timestamps are referring to.
When I run my recognition with “are you done” as the reference text, this is what I get:
<s>(0.0,0.32) are(0.32,0.53) <sil>(0.53,0.62) you(0.62,0.81) done(0.81,1.1) <sil>(1.1,1.47)
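If you want to use those timestamps in code rather than read them off the console, one option is to parse the string. Here is a minimal sketch, assuming the output always has the token(start,end) shape shown above, with times in seconds. TimedResultParser and Segment are hypothetical names of my own, not Sphinx4 classes, and I treat anything wrapped in angle brackets (such as <s> and <sil>) as a filler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Sphinx4): turns the printed timed
// result, e.g. "are(0.32,0.53) <sil>(0.53,0.62)", into a list of segments.
class TimedResultParser {

    /** One aligned token with its start and end time in seconds. */
    static final class Segment {
        final String token;
        final double start;
        final double end;

        Segment(String token, double start, double end) {
            this.token = token;
            this.start = start;
            this.end = end;
        }

        /** Assumption: fillers are the tokens written like <s> or <sil>. */
        boolean isFiller() {
            return token.startsWith("<") && token.endsWith(">");
        }

        double duration() {
            return end - start;
        }
    }

    // A token followed by "(start,end)"; whitespace inside the
    // parentheses is tolerated.
    private static final Pattern ENTRY = Pattern.compile(
            "(\\S+?)\\(\\s*([0-9.]+)\\s*,\\s*([0-9.]+)\\s*\\)");

    /** Parses every token(start,end) entry found in the string, in order. */
    static List<Segment> parse(String timedResult) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = ENTRY.matcher(timedResult);
        while (m.find()) {
            segments.add(new Segment(
                    m.group(1),
                    Double.parseDouble(m.group(2)),
                    Double.parseDouble(m.group(3))));
        }
        return segments;
    }

    /** Returns only the spoken words, dropping the filler segments. */
    static List<Segment> wordsOnly(List<Segment> segments) {
        List<Segment> words = new ArrayList<>();
        for (Segment s : segments) {
            if (!s.isFiller()) {
                words.add(s);
            }
        }
        return words;
    }
}
```

You would call it as TimedResultParser.parse(result.getTimedBestResult(true, true)) and then loop over the segments. For my “are you done” example, wordsOnly leaves the three words with their start and end times. If your version of Sphinx4 prints the result in a different shape, the regular expression will need adjusting.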
The final Java code can be found here. I know I retype the reference text and that’s not good programming practice. I’m not going to tell you where the file needs to go or any other Eclipse-specific thing because if you’re wanting to do forced alignment, you probably know what you’re doing. But if you have any questions, even about that, please let me know. Good luck!