Natural Language Laboratory
School of Computing Science
Simon Fraser University

Ph.D. Thesis Seminar - Maxim Roy

26 Feb 2010

Time: Monday March 1st, 2010 10:30 a.m.

Place: TASC1 9204 West

Title: APPROACHES TO HANDLE SCARCE RESOURCES FOR BENGALI STATISTICAL MACHINE TRANSLATION

Abstract:

Machine translation (MT) is a hard problem because natural language is highly complex, irregular, and diverse. MT refers to computer systems that translate text from one natural language into another, with or without human assistance. It is impossible to accurately model all the linguistic rules and relationships that shape the translation process, so MT must make decisions based on incomplete data. A principled way to handle this is to use statistical methods that make optimal decisions given incomplete data. Statistical machine translation (SMT) uses a probabilistic framework to automatically translate text from one language to another. SMT learns the translations of words and phrases from their co-occurrence counts in bilingual parallel corpora, in which each sentence is aligned with its translation.
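The co-occurrence idea can be illustrated with a minimal sketch. It is not the method used in the thesis: it simply counts how often each source word appears in a sentence pair together with each target word and normalizes the counts into relative frequencies. Real SMT systems refine such estimates with iterative word alignment models. The function name `cooccurrence_translation_table` and the placeholder tokens in the toy corpus are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cooccurrence_translation_table(parallel_corpus):
    """Estimate rough word translation scores from sentence-aligned text.

    parallel_corpus: iterable of (source_sentence, target_sentence) strings.
    Returns {source_word: {target_word: relative co-occurrence frequency}}.
    This is a simplification, not a full alignment model.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for source_sentence, target_sentence in parallel_corpus:
        target_words = target_sentence.split()
        for source_word in source_sentence.split():
            # Every target word in the aligned sentence is a candidate translation.
            for target_word in target_words:
                counts[source_word][target_word] += 1

    table = {}
    for source_word, target_counts in counts.items():
        total = sum(target_counts.values())
        table[source_word] = {t: c / total for t, c in target_counts.items()}
    return table


if __name__ == "__main__":
    # Hypothetical toy corpus with placeholder tokens (not real Bengali or English).
    toy_corpus = [("a b", "x y"), ("a c", "x z")]
    table = cooccurrence_translation_table(toy_corpus)
    best = max(table["a"], key=table["a"].get)
    print(best, table["a"][best])
```

In the toy corpus, the source token `a` appears in both sentence pairs alongside `x`, so `x` receives the highest score for `a`. With only a few sentence pairs such counts stay ambiguous, which is the scarce-resource problem this thesis addresses.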

We apply SMT techniques to translation between Bengali and English. SMT systems require a large amount of bilingual data for a language pair to achieve good translation accuracy. However, Bengali is a low-density language, and such resources are not available for it. In this thesis, we therefore investigate language-independent and language-dependent techniques that can help improve the translation accuracy of Bengali SMT systems.

We explore a transliteration module, a preposition handling module, and a Bengali compound word handling module in the context of Bengali-to-English SMT. We further look into semi-supervised and active learning techniques for Bengali SMT to deal with scarce resources.

Because word order differs between Bengali and English, we also propose several syntactic phrase reordering techniques for Bengali SMT. We further contribute to Bengali SMT by creating a new test set and lexicon, and by developing Bengali text processing tools such as a tokenizer, a sentence segmenter, and a morphological analyzer.

Overall, the main objective of this thesis is to contribute to Bengali language processing and to provide a general foundation for conducting research in Bengali SMT.