31 Mar 2009

New publication:

Active Learning for Statistical Phrase-based Machine Translation. Gholamreza Haffari, Maxim Roy and Anoop Sarkar. In Proceedings of the annual meeting of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT). Boulder, Colorado. May 31-June 5, 2009.


Statistical machine translation (SMT) models need large bilingual corpora for training, and such corpora are unavailable for some language pairs. This paper provides the first serious experimental study of active learning for SMT. We use active learning to improve the quality of a phrase-based SMT system, and show significant improvements in translation quality over a random sentence selection baseline, both when test and training data are taken from the same domain and when they come from different domains. Experimental results are shown in a simulated setting using three language pairs, and in a realistic situation for Bangla-English, a language pair with limited translation resources.