At tomorrow’s lab meeting, Nishant will present recent work that uses multiple segmentations of a corpus to improve translation quality for low-resource languages.
Learning Improved Word Compositionalities for Low-Resource Languages
Abstract: We propose a novel technique that combines alternative subword tokenizations of a single source-target language pair, allowing us to leverage multilingual neural translation training methods. These alternative segmentations function like related languages in multilingual translation. Overall, this improves translation accuracy for low-resource languages and produces translations that are lexically diverse and morphologically rich. We also introduce a cross-teaching technique which yields further improvements in translation accuracy and cross-lingual transfer between high- and low-resource language pairs. Using these techniques, we surpass the previous state-of-the-art BLEU scores on three out of four low-resource languages from the multilingual TED-talks dataset, with significantly better results on one. Compared to other strong multilingual baselines, our approach yields average gains of +1.7 BLEU across the four low-resource datasets. Our technique requires no additional training data, synthetic data, or external resources, and is a drop-in improvement for any existing neural translation system for a single language pair.
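The core idea can be sketched in a few lines (a toy illustration, not the authors' implementation — the two segmenters and the tag names below are made up): each segmentation scheme of the same source sentence is tagged as if it were a distinct source language, so a standard multilingual NMT pipeline can train on all of them jointly.

```python
# Sketch: treat alternative segmentations of one language pair as if they
# were related source "languages" in a multilingual NMT training corpus.
# char_split and bpe_like_split are toy stand-ins for real segmenters.

def char_split(sentence):
    """Character-level segmentation; word boundaries marked with '_'."""
    return list(sentence.replace(" ", "_"))

def bpe_like_split(sentence):
    """Toy subword segmentation: fixed-size chunks per word."""
    pieces = []
    for word in sentence.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces

def make_multisegment_corpus(pairs):
    """Tag each segmentation as its own pseudo-language, multilingual-NMT style."""
    corpus = []
    for src, tgt in pairs:
        corpus.append((["<seg_char>"] + char_split(src), tgt))
        corpus.append((["<seg_bpe>"] + bpe_like_split(src), tgt))
    return corpus

corpus = make_multisegment_corpus([("hello world", "bonjour le monde")])
for src_tokens, tgt in corpus:
    print(src_tokens, "->", tgt)
```

The pseudo-language tags play the role that language-ID tokens play in ordinary multilingual translation: they let one shared encoder-decoder distinguish, yet share parameters across, the two views of the same sentence.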
Friday, 17 September at 13:30
At the lab meeting tomorrow, Logan plans to present “Attention Flows are Shapley Value Explanations” by Kawin Ethayarajh and Dan Jurafsky at ACL 2021. https://aclanthology.org/2021.acl-short.8/
Attention Flows are Shapley Value Explanations
Abstract: Shapley Values, a solution to the credit assignment problem in cooperative game theory, are a popular type of explanation in machine learning, having been used to explain the importance of features, embeddings, and even neurons. In NLP, however, leave-one-out and attention-based explanations still predominate. Can we draw a connection between these different methods? We formally prove that — save for the degenerate case — attention weights and leave-one-out values cannot be Shapley Values. Attention flow is a post-processed variant of attention weights obtained by running the max-flow algorithm on the attention graph. Perhaps surprisingly, we prove that attention flows are indeed Shapley Values, at least at the layerwise level. Given the many desirable theoretical qualities of Shapley Values — which have driven their adoption among the ML community — we argue that NLP practitioners should, when possible, adopt attention flow explanations alongside more traditional ones.
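As a refresher on the quantity the paper is about (a generic toy game, not tied to the paper's proofs): the Shapley value of a player averages that player's marginal contribution to the coalition value over all orderings of the players.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all n! orderings of the player set."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: t / len(orderings) for p, t in totals.items()}

# Toy cooperative game: two "features" are each worth 1 alone, 3 together,
# so the +1 synergy is split evenly between them.
def v(coalition):
    if len(coalition) == 2:
        return 3.0
    return float(len(coalition))

print(shapley_values(["a", "b"], v))  # → {'a': 1.5, 'b': 1.5}
```

Note the efficiency property visible here: the values sum to the worth of the grand coalition (1.5 + 1.5 = 3), one of the axioms that makes Shapley Values attractive as explanations.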
Tuesday, Aug 10th, 09:30 a.m.
This week, Jetic will give us a practice talk of his depth exam on Memory Networks. A zoom link will be sent tomorrow morning.
Abstract: Incorporating knowledge external to the natural language query into a Natural Language Processing (NLP) task has always been computationally challenging, not only because useful information can come in great quantity but also in great variety. External knowledge can take the form of documents, tables, figures, or even entire databases. Traditional NLP models using RNN- or Transformer-based encoders, despite strong performance when the input query is limited to a sentence or a paragraph, face limitations when handling these types of information. With the introduction of memory networks, a type of neural architecture that allows for separate memory components for storing this external knowledge, neural NLP models have a much better chance of utilising structured knowledge such as a knowledge graph, as well as performing more complex reasoning.
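The core memory-network operation — soft attention over stored memory slots — can be sketched as follows (a minimal illustration with made-up toy vectors, not any specific paper's architecture):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def read_memory(query, memory_keys, memory_values):
    """Attend over memory: score each slot against the query, then
    return the attention-weighted sum of the slot values."""
    weights = softmax([dot(query, k) for k in memory_keys])
    dim = len(memory_values[0])
    return [sum(w * v[i] for w, v in zip(weights, memory_values))
            for i in range(dim)]

# Toy example: two memory slots; the query matches the second most strongly,
# so the read-out is dominated by the second slot's value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [0.1, 2.0]
print(read_memory(query, keys, values))
```

Because the memory is a separate component, it can hold arbitrarily many slots (facts, table rows, knowledge-graph triples) without growing the encoder itself — the property the abstract highlights.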
Tuesday, July 20th, 09:30 a.m.
This week, Hassan will introduce a paper from Google about subword units in neural machine translation models. A zoom link will be sent tomorrow morning.
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Abstract: Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially in low-resource and out-of-domain settings.
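The sampling mechanism can be sketched with a toy unigram model (the vocabulary and piece probabilities below are made up for illustration, not the paper's trained model): enumerate the segmentations the vocabulary admits, score each by the product of unigram piece probabilities, and draw a different sample on different passes over the data.

```python
import random

# Toy unigram piece probabilities (made up for illustration).
UNIGRAM = {"un": 0.3, "f": 0.05, "fold": 0.25, "unfold": 0.3, "old": 0.1}

def segmentations(word):
    """Enumerate every way to split `word` into in-vocabulary pieces."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in UNIGRAM:
            for rest in segmentations(word[end:]):
                results.append([piece] + rest)
    return results

def score(seg):
    """Unigram LM probability of a segmentation: product of piece probs."""
    p = 1.0
    for piece in seg:
        p *= UNIGRAM[piece]
    return p

def sample_segmentation(word, rng):
    """Sample one segmentation in proportion to its unigram probability."""
    segs = segmentations(word)
    weights = [score(s) for s in segs]
    return rng.choices(segs, weights=weights, k=1)[0]

segs = segmentations("unfold")
print(segs)  # "unfold" is ambiguous: several segmentations share one vocabulary
print(sample_segmentation("unfold", random.Random(0)))
```

This is the ambiguity the abstract refers to: the same vocabulary licenses ["unfold"], ["un", "fold"], and ["un", "f", "old"], and the regularizer exposes the model to all of them rather than a single canonical split.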
Tuesday, July 13th, 09:30 a.m.
This week, Nishant will give us a survey about Massively multilingual NMT. A zoom link will be sent tomorrow morning.
The Current State of Massively Multilingual NMT
Abstract: Massively multilingual NMT (MMNMT) models are capable of handling over 100 languages and thousands of translation directions with a single trained model. Apart from scalability, zero-shot translation between languages, a result of the inherent transfer learning, makes such models desirable. In this presentation, we will look at the preconditions and assumptions that are crucial to building an MMNMT model, the widely accepted approaches and results, and an in-depth analysis of various aspects that are critical to achieving a practical MMNMT model, followed by some recent proposals for improving such models. To conclude, we will discuss some shortcomings and open problems in this direction.
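One common precondition for handling thousands of directions with a single model can be shown concretely (a sketch of the widely used target-language-token convention; the exact tag format varies between systems and is illustrative here): each source sentence is prefixed with a token naming the desired target language, so one shared model serves every direction, including zero-shot ones.

```python
def tag_for_direction(src_sentence, tgt_lang):
    """Prefix the source with a target-language token so one shared
    model can serve many translation directions."""
    return f"<2{tgt_lang}> {src_sentence}"

# The same source can be routed to different targets by changing only the tag;
# a direction never seen in training (zero-shot) uses the same mechanism.
batch = [
    tag_for_direction("How are you?", "fr"),
    tag_for_direction("How are you?", "hi"),
    tag_for_direction("¿Cómo estás?", "de"),
]
for line in batch:
    print(line)
```

The transfer learning the abstract mentions falls out of this setup: because all directions share one set of parameters, low-resource and zero-shot directions benefit from what the model learns on high-resource ones.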
Tuesday, July 6th, 09:30 a.m.