Maryam Siahbani will do her PhD depth examination on Friday, August 17, 2012.
Siahbani, M. Statistical machine translation without parallel data.
Abstract: In this survey we consider statistical machine translation (SMT) models that can be trained using monolingual sources of data in the source and target languages and thus do not need parallel data for training. Contemporary SMT systems obtain very good translation fluency by leveraging large amounts of parallel data in the source and target languages. But such data is available only for a few language pairs and domains. Using human experts to annotate and curate new parallel corpora that are sufficient for building a good translation system is very expensive. On the other hand, there are readily available resources of monolingual text in many of the world's languages. This has led to a natural research challenge in SMT: How can we train a statistical translation system without parallel data? There has been a long line of research on learning translation from monolingual data, beginning with Rapp (1995). Many have focused on the simpler task of extracting a translation lexicon by mining monolingual data resources. Recent work has extended this from translation lexicons to full translation systems. We survey and categorize previously published approaches to SMT without parallel data based on the end product (translation lexicon or translation system) and the type of available resources that can be used (limited parallel information or just monolingual data). We also analyze these methods in terms of translation quality and scalability.