
BLEU Statistical Significance

Sometimes BLEU scores in Machine Translation experiments are close together, and one would wish to see if the difference is significant or not. This tutorial can help you with that.

I have two MT result sets, stlm and srilm. Here is an example of how to test for statistical significance, based on the tutorial:

perl generateSGMLfromText.pl stlm.hyp stlm /home/newssyscomb2009-src.en.sgm de > stlm.sgm

perl generateSGMLfromText.pl srilm.hyp srilm /home/newssyscomb2009-src.en.sgm de > srilm.sgm

perl generateLog-v11.pl -r /home/newssyscomb2009-ref.de.sgm -s /home/newssyscomb2009-src.en.sgm -t stlm.sgm > stlm.log

perl generateLog-v11.pl -r /home/newssyscomb2009-ref.de.sgm -s /home/newssyscomb2009-src.en.sgm -t srilm.sgm > srilm.log

perl -I /usr/local/share/perl/5.10.1/ bootstrapCompare-v11.2.pl srilm.log stlm.log
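To give a feel for what a bootstrap comparison does, here is a minimal Python sketch of paired bootstrap resampling. It is not what the Perl script above does internally. It assumes you already have one made-up quality score per sentence for each system and simply averages them, whereas real BLEU is computed at the corpus level from n-gram counts. The function name and parameters are my own invention for illustration.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A beats B.

    scores_a / scores_b are hypothetical per-sentence scores for the same
    sentences, in the same order. Averaging them is a stand-in for a real
    corpus-level BLEU computation (an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_samples):
        # Draw sentence indices with replacement; the SAME indices are used
        # for both systems so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_samples
```

If system A wins on, say, 95% or more of the resampled sets, that is commonly read as A being better at roughly the 0.05 level. For real experiments, stick with the scripts above.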

Sphinx4 Revisited

Sphinx4 has changed a lot since my original posts about it. The conversion to mp3 is the same, but the original post has changed a little bit. It’s easier than ever to simply download and use sphinx4 without too much effort.

The new and improved website: http://cmusphinx.sourceforge.net/

You can either checkout the svn, or just get the tar. You will also need some acoustic models.

If you download the tar, untar it (tar -xvf sphinx4.tar). You will need to go to the lib folder and extract the jsapi jar from either the .exe file or the .sh file (just run ./jsapi.sh and follow the instructions). You should also download the WSJ acoustic model and copy it into your lib folder if it isn’t already there. After that, you can navigate to the sphinx4 root folder and run the following command:

java -mx312m -jar bin/HelloWorld.jar

This will invoke the HelloWorld jar which uses a simple grammar and records your microphone input.

If you want to get it working in eclipse, you can checkout subversion or create your own project by including the sphinx4.jar and jsapi.jar. As before, you will need a config.xml file. If you check it out from svn, you will see several apps that you can build from src/apps (edu.cmu.sphinx.demo). Each has its own class and config file which you can use as an example.

For example, HelloWorld uses a grammar to determine which words to use, whereas HelloNGram uses a language model. One nice thing about this version of sphinx4 is that it separates the dictionary and language model from the acoustic model. For example, you can use the WSJ acoustic model, but create your own grammar or language model tailored to your specific needs. You can use any ARPA formatted language model file (you can specify the order that you want to use in the config file). You can always use the online tool which makes simple trigram language model files: http://www.speech.cs.cmu.edu/tools/lmtool-adv.html
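If you are not sure what order an ARPA file actually contains, its header tells you. Here is a small Python sketch (not part of sphinx4) that reads the "ngram N=count" lines from the \data\ section. The function name is made up, and it assumes a plain-text, uncompressed ARPA file.

```python
def arpa_ngram_counts(lines):
    """Map n-gram order -> number of entries, read from an ARPA header."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            # Header lines look like "ngram 2=1234".
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            # The first section marker such as "\1-grams:" ends the header.
            break
    return counts
```

With a file, you would pass the open file object (for example open("my.lm"), where my.lm is a placeholder name) and take max() of the keys to get the highest order available.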

Feel free to ask questions, but the documentation and demos make it easy to understand and begin to use.

 

MEMT

MEMT is a package by Kenneth Heafield (CMU) which can combine MT hypotheses, use them for tuning zmert, and return a final, better nbest and 1best hypothesis. I have noticed improvements in combining output from several systems by using this tool. I had to give someone specific instructions on how to install and use it, so I will include those here. I don’t guarantee that this will work for you, as your settings will be different, but this worked for me. You need about 4GB of RAM and a language model that’s not terribly big to make it work (though if you have more memory, by all means, it works fine with huge language models as well). Keep in mind that this only gets you past the compilation of MEMT and the required libraries; the rest is straightforward from the README. I’m happy to answer questions about compilation and usage, but you should know that the creator has responded to my own questions in a timely and kind manner and might appreciate knowing that his tool is being used. You can replace vi with the editor of your choice.

From your home folder (in all cases, replace the ~ with the absolute path of your home folder):

wget http://kheafield.com/code/memt.tar.gz
tar -xvf memt.tar.gz
cd avenue
vi Jamroot (change <address-model>32 to <address-model>64 if you are running on a 64-bit machine)
mkdir pre
cd install
vi ant.sh (change 1.7.1 to 1.8.2)
./install.sh ~/avenue/pre "icu ruby ant zmert boost_jam boost"
The last step takes a while. It downloads specific versions of those packages, compiles them, and installs them into the "~/avenue/pre" folder. If everything compiles correctly, then these are the final steps:
vi ~/avenue/pre/environment.bash
copy contents (it will be several lines that set environment variables….be sure to get the first line as well)
:q (quit vi)
paste into console (shift insert)
cd ~/avenue/MEMT
bjam release
./Alignment/compile.sh

I just tried these steps again and it worked fine. Of course, you’ll need things like gcc, java and python pre-installed (that is, in the PATH variable). Good luck!
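Before starting, it can save time to check that those tools really are on the PATH. Here is a minimal Python sketch of such a check. The function name and the default tool list are just my illustration; MEMT itself ships nothing like this.

```python
import shutil


def missing_tools(tools=("gcc", "java", "python")):
    """Return the names from `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

An empty list from missing_tools() means all three were found. Anything it returns needs to be installed (or added to the PATH) before running install.sh.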

*I did run into two problems. The first is that the joshua.zmert.ZMERT classpath could not be found. This can usually be fixed by opening the MEMT/scripts/zmert/zmert.rb file and on line 109 changing the class to just ZMERT (that is, remove joshua.zmert.). Another problem was the config file that was being outputted. To fix this, remove or comment out lines 52 and 52 (properties -passIt 1 and  -thrCnt #{CONCURRENCY})

Boost in Ubuntu Maverick

It’s easy to get boost in Ubuntu Maverick:

sudo apt-get install libboost-dbg libboost-dev libboost-all-dev libboost-doc libboost-date-time-dev libboost-filesystem-dev libboost-graph-dev libboost-iostreams-dev libboost-math-dev libboost-program-options-dev libboost-python-dev libboost-regex-dev libboost-serialization-dev libboost-signals-dev libboost-system-dev libboost-test-dev libboost-thread-dev libboost-wave-dev

This installs boost 1.42. If you need an earlier version, you might have to get the tarball and compile it yourself, which is pretty straightforward if you can get past bjam.

LREC 2010

I attended the LREC Conference in Malta back in May. I found it to be a pretty good conference with some useful talks and posters, as well as interesting new insights in the field. One can of course just find it on Google, but here’s a link anyway: http://www.lrec-conf.org/lrec2010/

I haven’t added anything for a while now because I am quite busy with my master’s program. It is very interesting and I have many things I could post about (and will post about) based on what I’ve learned. There are some updates to Sphinx, more useful things about Moses, and some code I’ve written and used for different things that I will hopefully post about soon. I start my master’s thesis in January and might have more time then.

Moses Machine Translation

Moses is, as stated in the title, a machine translation system. That is, it’s an open-source system that one can download and use to translate, potentially, from any language to any language. That’s saying a lot, but there’s also a lot to do in order to get it working.

The moses website is actually quite good, so before you try my instructions, follow the instructions there: http://www.statmt.org/moses/

But if you still don’t know what to do, I might be able to help. I’m going to step through getting the source code from an SVN repository via eclipse. You can get the source code via the website in a tar file, or you can just download the binary and run it without compiling, assuming your computer can handle the binary.

The Environment: One could potentially use any means of an editor and svn, but, like I said, I will use eclipse. Eclipse was written for Java, but I find it, though not perfect, pretty functional with other languages, including C/C++ in which moses was written. So, you will need a version of eclipse that has the plugin for C/C++. If you don’t know how to install it, just download a fresh eclipse with it already included here: http://www.eclipse.org/cdt/downloads.php

The other thing eclipse will need is a way to access svn. If you know of one, add it and make it work. Or, you can just install subclipse. Once you have your eclipse up and running with C/C++ capabilities, you can then go to help->install new software->add (type Subclipse for the name and the URL is: http://subclipse.tigris.org/svn/subclipse/tags/subclipse/1.6.5) then just continue on until it downloads and installs it. You’ll need to restart eclipse to have it take effect.

Now, with C/C++ and SVN capabilities in eclipse, you are ready to get your hands dirty with moses.

Getting Moses: In eclipse, go to File->new->other->svn->Checkout Projects from SVN. Create a new location and click next. Add the following:

https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder

and hit next. It will take a second to download the information. There are many branches that people are working on (I currently use the config-switching branch for several reasons) but if you just want to get in and play, you can just click on “trunk” and then click next. Lucky for you, the people who maintain moses include eclipse project files, so it is quite easy to get set up with eclipse. Anyway, you may want to change the project name, or you can leave it as trunk; it doesn’t really matter. Make sure it sets it up as a C/C++ project, or things won’t work out right later. It should be an empty C++ project. Then click finish. It’ll take a few minutes to download. Once it’s done, all is not quite ready.

To Build: I went to Project and de-selected “Build Automatically” so I could tell it when to build (compile).

There are a few things to be done before we can build. The makefiles that tell eclipse what to compile aren’t even there yet; we have to generate them. It’s quite easy, however. Open up a console and navigate to the directory where your moses code is. Then run:

./regenerate-makefiles.sh

This will take a few seconds and will generate your makefiles based on how your computer is configured. The only problem I ran into was this: possibly undefined macro: AC_PROG_LIBTOOL and I was able to fix it easily by installing libtool (in ubuntu: sudo apt-get install libtool -thanks to this site for the info). I then tried again and it worked fine.

Now, this next step will separate those who just want to get their homework done from those who really want to use moses. You have your makefiles, but you also need to tell moses where to find a few things. If you already have the phrase tables, that is the data required to train the moses (statistical) machine translation engine, then you can simply type:

./configure

and let it run. If you want to make your own phrase tables, you’ll need to install either srilm or irstlm, or both. These are separate pieces of software that do a lot of the data building necessary to make moses work (why reinvent the wheel?). Moses is compatible with both, so pick the one you want. Installation for both can be tricky (I found irstlm much easier), but doable. Perhaps in a later post I’ll explain how to install them. Until then, be happy with the little bit that moses comes with.

Now, go back to eclipse, right-click on your project and hit refresh. Then, click on Project->Properties->C/C++ build. I deselected “Generate Makefiles automatically”, clicked on “Workspace”, selected the root workspace folder and clicked okay (something like ${workspace_loc:/moses} showed in the Build directory field, where moses was the name of my project). This tells it to look in the moses folder for a Makefile, which we generated earlier.

Now, press Ctrl+B and it will take a few minutes to build. It’s compiling C and C++ code using make, so eclipse really isn’t doing much but calling it for you. You can click on the Console tab in the lower part of eclipse to see what it’s doing at any given time.

To Run: When it’s done compiling, you can give it a test. Have fun. Just kidding, this is how you try it out: notice that you have a new list called “Binaries” in your project explorer. Expand that and you’ll see everything that was just compiled. Right-click on moses and run as local C/C++ Application. Then it will run, but not really. It just spits out the help information because you provided no command line arguments. The problem is, we don’t have any phrase tables for it to use to actually do translation. The moses website provides a sample one for testing to see if your compile worked. You can download it here:

http://www.statmt.org/moses/download/sample-models.tgz

Now, by no means is this going to be what you use to actually do some translating. This is just a tiny sample that utilizes the moses MT system, but with a very small amount of training data and only select phrases to translate. Sorry, if you want data, you’ll have to make your own using some parallel corpora (something I hope to discuss later).

Once you have downloaded the sample-models.tgz file, you can open a command window and navigate to where it is and then run:

tar -xzf sample-models.tgz

and then go into the new sample-models directory that it just made. Then go into the phrase-model directory. This is what you need. Open up the moses.ini file with an editor and change the line under [ttable-file] to the path where it currently is (in my case it was on my desktop under Desktop/sample-models/phrase-table/phrase-table) and then save it and close.

We’re getting close. Now, you need to note where the moses.ini file is on your computer. Now, go to eclipse and then click on the green “Run” button (looks like a “Play” button), making sure you hit the down-arrow part, and then click on Run Configurations. Under C/C++ Application, you’ll see moses (it’s there because you tried to run it before). Select it and go to the Arguments tab on the right. Then type in the following:

-f {path to the moses.ini file}

In my case it was something like:

-f /home/something/Desktop/phrase-table/phrase-model/moses.ini

That’s all you need. Now, click “Apply” and then “Run” and then you’ll notice that it gets into motion. After a few seconds of running, it stops. At this point it is waiting for input. As this is a German-English sample phrase table, you can type:

das ist ein kleines haus

press enter, and you should see the translation:

this is a small house

That’s it. You’ve successfully used moses to translate something. Congratulations. Now, to actually use moses in a big way is up to you. You can look into the boost library and work with multi-threading, or you can get the srilm and create your own phrase tables to feed to moses to do your own translations. There is a lot you can do, so check the website and see what’s available.
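If you would rather drive the compiled binary from a script than from eclipse, the same "-f moses.ini" invocation can be wrapped in a few lines of Python. This is only a sketch under some assumptions: that the binary is called moses and is on your PATH (or you pass its full path), that it reads sentences from standard input, and that it writes the translation to standard output while logging elsewhere. The function names are hypothetical.

```python
import subprocess


def moses_command(ini_path, moses_bin="moses"):
    """Build the argument list: the binary plus -f <path to moses.ini>."""
    return [moses_bin, "-f", ini_path]


def translate(sentence, ini_path, moses_bin="moses"):
    """Send one sentence to moses on stdin and return what it prints on stdout."""
    result = subprocess.run(
        moses_command(ini_path, moses_bin),
        input=sentence + "\n",
        capture_output=True,  # keep stdout (translation) and stderr (log) apart
        text=True,
        check=True,  # raise CalledProcessError if moses exits with an error
    )
    return result.stdout.strip()
```

Calling translate("das ist ein kleines haus", "/path/to/moses.ini") with your own path to moses.ini should then return the same translation you saw in the eclipse console.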

Appendix: You may need some other things, so I included them here without descriptive steps.

If you need tcl (srilm uses tcl):

sudo apt-get install tcl tcllib tcl-dev

TCL_INCLUDE, TCL_LIBRARY: to whatever is needed to find the Tcl header files and library. If Tcl is not available, set NO_TCL=X and leave the above variables empty.

I had to copy the /usr/include/tcl8.5/ files to the srilm/misc/src dir

Also, exclude LanguageModelRandLM from compile!
I also put the srilm directory in the same directory as moses
To get boost:

http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/dists/jaunty/all/

/etc/ld.so.conf

boost:

sudo apt-get install libboost-date-time-dev libboost-date-time1.34.1 libboost-dev libboost-doc libboost-filesystem-dev libboost-filesystem1.34.1 libboost-graph-dev libboost-graph1.34.1 libboost-iostreams-dev libboost-iostreams1.34.1 libboost-program-options-dev libboost-program-options1.34.1 libboost-python-dev libboost-python1.34.1 libboost-regex-dev libboost-regex1.34.1 libboost-signals-dev libboost-signals1.34.1 libboost-test-dev libboost-test1.34.1 libboost-thread-dev libboost-thread1.34.1

Use flag for compiler:

-std=c++0x

http://www.52nlp.com/moses-support-digest-moses-compilation-problem-on-fedora-11/

in sphinx, i removed the randlm related flags to the compile

http://www.statmt.org/moses_steps.html

binarize: (make sure you use the .gz compressed version)

processPhraseTable
processLexicalTable

Mary Text to Speech

This is the second TTS system I’m posting about, not because I’m a TTS guy, but because this system was shown to me recently and I found it very well done and intuitive. The installation was so easy, that I’m not even going to post a step-by-step how-to. It comes with an installer and it was written in Java so it should theoretically run on any platform. It does install several different executables that serve different purposes. It’s one of the projects of DFKI.

http://mary.dfki.de/Download

Note

I’ve been getting a lot of comments, questions, and emails lately which is great, but this week is finals week for me in my master’s program. Therefore, I’m quite busy. I’ll have time to look at some of your comments in a week or two.

Python in Eclipse

There are many IDEs available for any language, and Python has plenty to choose from. However, I’ve been programming in eclipse for Java, JSP/Servlets, Flex, and PHP for a while now and found it to be a solid IDE for at least those languages. I’ve found it to be quite good for Python, as well. If you already have something you like, then use that, of course. But for those of you computational linguists out there who are new to programming, but realize that you had better get with the computational side of things, including programming, this is where you can start.

First, download eclipse. You can either search in Google for “eclipse” and choose what you want, or you can click here. I would recommend the eclipse for PHP developers. WHY?!? Because it has built-in web tools that might come in handy later if you really get into Python. So, look to the right of that option and choose your operating system. Clicking on that link should take you to the download site which will choose the closest download mirror for you. It’s about 138 mb to download.

Once you’ve downloaded it, unzip it. If all is well, you should be able to just run it. In windows, just double-click on the eclipse file inside your eclipse folder. In linux, you may have to open a terminal, navigate to the folder, and type: ./eclipse

If that worked, you should see the eclipse Galileo splash screen. If it doesn’t load up, the problem could be anything. One problem might be that you downloaded the wrong one for your OS. The other problem might be your java version (eclipse uses java to run). I don’t know what version it was built with, but my eclipse works fine and I am using java 1.6.16. You can check your java version in Linux (or windows) by opening a terminal (command window) and typing: java -version
If you type that and it says it can’t find java, then you don’t have it installed. So, install it. I won’t go over that here. If someone needs help with that, shoot me an email or leave a comment.
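The same check can be scripted. Here is a small Python sketch. One thing to know is that java prints its version banner to stderr, not stdout. The function names are mine, and the regular expression assumes the banner contains something like version "1.6.0_16" in double quotes, which is what I have seen but is not guaranteed for every Java distribution.

```python
import re
import subprocess


def parse_java_version(text):
    """Pull the quoted version string out of a `java -version` banner."""
    match = re.search(r'version "([^"]+)"', text)
    return match.group(1) if match else None


def installed_java_version(java_bin="java"):
    """Return the version reported by java, or None if it cannot be run."""
    try:
        result = subprocess.run([java_bin, "-version"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None  # java is not installed or not on the PATH
    # The banner normally lands on stderr; stdout is checked as a fallback.
    return parse_java_version(result.stderr + result.stdout)
```

If installed_java_version() returns None, either java is missing or the banner did not match the assumed format.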

Now that your eclipse is open, you’re halfway there. Now, click on Help->Install New Software->Add. Type in “Python” in the name area and:

http://pydev.org/updates/

in the Location box and click OK. Now, you’ll see a drop-box to the left of the Add button. Click on that and find Python. Now, select the box next to “Pydev” and click Next. You may be taken to a place where you choose your mirror site (just choose any one), and you’ll need to read some licensing agreement. It will take a few minutes to download and it will ask if you want to restart eclipse when it is done. Yes, do restart eclipse.

Now, you have the ability to program in Python, but we’re not quite there yet. With eclipse open, click on File->New->Other. Scroll to Pydev, and expand the tree. Then choose Pydev Project and click Next. Type in the name of your project; anything will do. This will create a project folder for your code under the name you give. Notice that you can’t go on yet because you don’t have an interpreter. Click on the link to configure the interpreter. Now, click on Auto Config and then OK. You’ll see it spend some time looking through your computer for libraries. This means you won’t have to set your PYTHONPATH variable; eclipse takes care of that (assuming you already installed the nltk).

When it’s done, click Next until it creates your project. You’ll see your project folder on the left. Now, expand your project folder and find the src folder. Right-click on that src folder, and select New->Pydev Module. Skip the package name and just put in a name for the file (eg, “test”). You’ll notice that it can fill in the template stuff for classes, etc., but you can just choose “none” and click OK. You should now see a new file.

Now let’s test it. First, type:
print("Hello World")

And press the green play button on top. It will ask you how you want to run the file. Just scroll down and choose “Python Run” and you can set it to autosave the file when you click run so you don’t have to. Then click OK and it should say “Hello World” in the output below.

To see if the nltk works, type:
import nltk
nltk.probability.demo()

And run it again. It should show some probabilities in the output. Now you know that you can use the nltk. You can make as many files as you’d like, classes, access those classes easily with eclipse, etc. Happy programming. Feel free to ask questions of any kind.

Python NLTK

The Python Natural Language Tool Kit has a lot of stuff to offer the DIY NLPer. It has a parser, POS tagger, lambda calculus, a chunker, a classifier, a tokenizer, even a WordNet interface, and much, much…..much more. It’s loaded, and it’s not terribly difficult to use provided you know some python and at least a little bit about NLP.

First, you need python, the programming language in which the tool kit is developed. Most distros of Linux will have python installed, but if you don’t have it you can go to www.python.org to get it and download it. If you don’t know how to do that, you’re hard-pressed to know how to use the tool kit anyway. So, spend some time learning python before you go crazy with the tool kit.

If you’re beyond that and you’re ready for the tool kit, you can go to www.nltk.org and download it. I tried a few different things and ended up just downloading the zip file, extracting it, then going into the directory with my console and typing (as root or sudo):

python setup.py install

and you’re almost done. Run python by just typing:

python

and you’ll see the python command-line interpreter interface. Type:

import nltk
nltk.probability.demo()

and you should see some output with some frequency distributions. There are more tutorials on how to use the tools individually on the www.nltk.org website.

One more thing. You might want to include some of the other optional packages, like numpy. Go back to the same download site as the nltk and grab what you want. Open a console and get to where you downloaded the file, then run:

tar -xvf numpy[ver]

Then go into your numpy directory and run (again, as root/sudo):

python setup.py install

It will take some time because it is also compiling a lot of C code. Best of luck. I’ll post more as I learn more about it.
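Once the installs finish, a quick way to confirm that python can actually see the packages is to ask the import machinery directly. Here is a minimal sketch using only the standard library. The function names are my own, and it only handles top-level module names such as nltk or numpy.

```python
import importlib.util


def is_installed(module_name):
    """True if a top-level module with this name can be found for import."""
    return importlib.util.find_spec(module_name) is not None


def install_report(module_names=("nltk", "numpy")):
    """Map each module name to whether it is importable."""
    return {name: is_installed(name) for name in module_names}
```

Running install_report() gives a dictionary with True or False for each name. A False for nltk or numpy means the setup.py install step did not land in the python you are running.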