Archive for the ‘Corpora’ Category.
April 27, 2009, 1:43 pm
There are many different versions of “Key-Word in Context” (KWIC) dictionaries out there, but for the most part they simply take a search string and line up each occurrence of it in context as found in a given corpus. For example, I searched for “beef” in the Online BLC KWIC Concordance Dictionary and got the following results:
1 remises of our supplying less expensive beef and management know-how in running the
2 At the time, I proposed to supply dried beef continuously, but you made a counter of
3 antage of your supplying less expensive beef is considered, and we most reluctantly
4 cities in Japan; and 2) that you supply beef to the restaurant chain.
5 lp sell chicken and pork in addition to beef.
6 0 tons of Nebraska USDA choice corn-fed beef.
Some KWIC dictionaries are rich in features. For example, you can change the justification of the search string, increase the number of words returned with each result, etc. Is it useful? That depends on what you need to do. For me, I’ve been studying German and haven’t been able to see certain things in context. For example, I want to understand the difference, in context, between the accusative and dative uses of those prepositions that can take either case depending on what you’re trying to say. I want to see examples, but can’t find any real German KWIC online (that responds faster than 10 minutes). So, I wrote a DIY KWIC.
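To make the idea concrete, here is a minimal sketch of the alignment step in Python. It is not the code behind my tool; the function name `kwic` and the `width` parameter are made up for illustration, and it assumes a plain case-insensitive substring match rather than whole-word matching.

```python
import re


def kwic(text, keyword, width=40):
    """Return one line per hit, with the keyword starting at column `width`."""
    # Collapse whitespace so line breaks in the source do not break the alignment.
    flat = " ".join(text.split())
    lines = []
    for match in re.finditer(re.escape(keyword), flat, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the hit.
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-justify the left context so every hit lines up vertically.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


if __name__ == "__main__":
    sample = "We supply beef to the restaurant chain. They sell pork in addition to beef."
    for line in kwic(sample, "beef", width=30):
        print(line)
```

Changing `width` is the "add more words" feature in its crudest form, and swapping `>` for `<` in the format string would change the justification.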
What it requires: A URL and a search string. The search string you understand. The URL is going to be the corpus. When you click “Go” it will actually go to the URL, parse the HTML and find links. It will follow several of those links and get the text from those pages, as well. It will then merge all of that multi-page text into a single corpus, which is then searched for your search string. Just open up the link to the right called “KWIC” and click “Go” with what’s there to see what happens. It’s not feature rich, nor is it pretty, but it gets the job done. I’m happy for feedback. Now, when I want to study German, I just throw in a Wikipedia article in German and search for a string.
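For the curious, the corpus-building step described above could be sketched roughly like this, using only the Python standard library. This is an illustration, not the tool's actual source: `PageParser`, `fetch`, `build_corpus`, and the `max_links` limit are all hypothetical names, and the sketch assumes pages are UTF-8 and only follows links found on the first page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collect visible text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not prose

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def fetch(url):
    """Download a page as text (assumes UTF-8; undecodable bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def build_corpus(start_url, fetch_page=fetch, max_links=5):
    """Merge the text of the start page and up to `max_links` pages it links to."""
    first = PageParser(start_url)
    first.feed(fetch_page(start_url))
    texts = [" ".join(first.chunks)]
    seen = {start_url}
    for link in first.links:
        if len(seen) > max_links:
            break
        if link in seen or not link.startswith(("http://", "https://")):
            continue
        seen.add(link)
        try:
            page = PageParser(link)
            page.feed(fetch_page(link))
        except OSError:
            continue  # skip pages that fail to download
        texts.append(" ".join(page.chunks))
    return " ".join(texts)
```

The string returned by `build_corpus` is the single corpus that then gets searched for the search string. Passing the download function in as `fetch_page` is just a convenience so the crawl logic can be tried on canned HTML without touching the network.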
Disclaimer: This is only to be used for personal purposes. It parses any website given, so you are responsible for the URL you search. You are not able to copy text, so it is read-only. The primary purpose is to help you with language study.
August 15, 2008, 2:11 pm
The International Conference on Language Resources and Evaluation (LREC) took place at the end of May this year in Marrakesh, Morocco. I was able to go for the same research we did on second language proficiency testing. We presented a poster in one of the poster sessions and had a lot of interested people ask many questions.
There was a big difference between the conference goers here and the ones at CALICO. The CALICO conference sported mostly educators looking for ways to improve language teaching in the classroom, whereas LREC focused more on natural language processing. There were more software engineers and linguists than educators. The talks ranged from very in-depth statistical theory to corpora. I mostly sat in on what people were doing with machine translation or the Japanese language.
Now a word on corpora. For some naive reason, I thought that we had a pretty good amount of corpora for most purposes, like POS tagging, word chunking, parsing, etc. But, from this conference, I found that many organizations are working on new corpora all the time. They range from general corpora like the Wall Street Journal spoken English to more specific corpora like the utterances of drunk people. Corpora are huge in NLP whether it’s statistical NLP or otherwise. The big corpora repositories are the LDC in the United States and ELRA in Europe. There are a few in Asia, as well. The problem is that most useful corpora aren’t freely available. You can either 1. contribute or 2. pay for membership to get corpora. They will give corpora away for free, but not typically to an individual hobbyist. They like to let universities use the data and they like to know why. That doesn’t mean the individual can’t have fun; he/she just has to be more creative.
Big companies like Microsoft presented some things at the conference, as well. Companies use NLP more and more these days even if they aren’t a specific NLP company like, say, Nuance. Microsoft can use NLP in MS Word. I worked for a company where we developed a part-of-speech tagger to automatically tag new dialogs so someone wouldn’t have to go in and do it by hand, something that didn’t necessarily affect the end user. Cell phone companies, car companies, and many different software companies are using NLP more and more. This conference may not show the bleeding edge of NLP technology of our time, but it is a great conference for seeing what’s going on in the field and possibly finding a job doing NLP.