26 Feb 2010

Time: Tuesday, March 2nd, 2010, 12:30 p.m.

Place: Bennett Library Rm 2020

Title: APPROACHES TO HANDLE SCARCE RESOURCES FOR BENGALI STATISTICAL MACHINE TRANSLATION

Abstract:

Machine translation (MT) is a hard problem because of the highly complex, irregular and diverse nature of natural language. MT refers to computerized systems that use software to translate text from one natural language into another, with or without human assistance. It is impossible to accurately model all the linguistic rules and relationships that shape the translation process, so MT has to make decisions based on incomplete data. A principled way to handle this is to use statistical methods to make optimal decisions given incomplete data. Statistical machine translation (SMT) uses a probabilistic framework to automatically translate text from one language to another. Using the co-occurrence counts of words and phrases in bilingual parallel corpora, where sentences are aligned with their translations, SMT learns the translations of words and phrases.
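As a rough illustration of how co-occurrence counts over aligned sentences can suggest word translations, here is a minimal Python sketch. It is not the method used in the thesis: the function names, the Dice association score, and the tiny romanized Bengali-English corpus are all assumptions made up for this example, and real SMT systems estimate word and phrase translation probabilities with far more robust alignment models.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(parallel_corpus):
    """Count, per aligned sentence pair, which source and target words appear together."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in parallel_corpus:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        pair_counts.update(product(src_words, tgt_words))
    return src_counts, tgt_counts, pair_counts


def dice_scores(word, src_counts, tgt_counts, pair_counts):
    """Score each target word against `word` with the Dice coefficient."""
    return {
        tgt: 2 * joint / (src_counts[src] + tgt_counts[tgt])
        for (src, tgt), joint in pair_counts.items()
        if src == word
    }


# Invented toy corpus (romanized Bengali, English); hypothetical data for illustration only.
toy_corpus = [
    ("ami bhat khai", "i eat rice"),
    ("ami jol khai", "i drink water"),
    ("ami bhat chai", "i want rice"),
]

src_counts, tgt_counts, pair_counts = count_cooccurrences(toy_corpus)
bhat_scores = dice_scores("bhat", src_counts, tgt_counts, pair_counts)
best_for_bhat = max(bhat_scores, key=bhat_scores.get)
```

With only three sentence pairs the scores are crude, and several words remain ambiguous. This reflects the data scarcity problem: the less parallel text is available, the less reliable such counts become.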

We apply SMT techniques to translation between Bengali and English. SMT systems require a large amount of bilingual data for a language pair to achieve high translation accuracy. However, Bengali is a low-density language, and such resources are not available for it. In this thesis, we therefore investigate language-independent and language-dependent techniques that can help improve the translation accuracy of Bengali SMT systems.

We explore a transliteration module and modules for handling prepositions and Bengali compound words in the context of Bengali-to-English SMT. We further look into semi-supervised and active learning techniques for Bengali SMT to deal with scarce resources.

Because Bengali and English have different word orders, we also propose several syntactic phrase reordering techniques for Bengali SMT. We further contribute to Bengali SMT by creating a new test set and lexicon, and by developing Bengali text processing tools such as a tokenizer, sentence segmenter, and morphological analyzer.

Overall, the main objective of this thesis is to contribute to Bengali language processing, provide a general foundation for research in Bengali SMT, and improve the quality of Bengali SMT.

Belorussian translation of the thesis abstract by Amanda Lynn.