16 Sep 2021

At tomorrow’s lab meeting, Nishant will present recent work that uses multiple segmentations of a corpus to improve translation quality for low-resource languages.

Learning Improved Word Compositionalities for Low-Resource Languages

Abstract: We propose a novel technique that combines alternative subword tokenizations of a single source-target language pair that allows us to leverage multilingual neural translation training methods. These alternate segmentations function like related languages in multilingual translation. Overall this improves translation accuracy for low-resource languages and produces translations that are lexically diverse and morphologically rich. We also introduce a cross-teaching technique which yields further improvements in translation accuracy and cross-lingual transfer between high- and low-resource language pairs. Using these techniques, we surpass the previous state-of-the-art BLEU scores on three out of four low-resource languages from the multilingual TED-talks dataset with significantly better results on one. Compared to other strong multilingual baselines, our approach yields average gains of +1.7 BLEU across the four low-resource datasets. Our technique does not require additional training data or synthetic data or external resources, and is a drop-in improvement for any existing neural translation system for a single language pair.

Friday, 17 September at 13:30