This week, Jetic will give us a practice talk for his depth exam on Memory Networks. A Zoom link will be sent tomorrow morning.
Abstract: Incorporating knowledge external to the natural language query into a Natural Language Processing (NLP) task has always been computationally challenging, not only because useful information can come in great quantity, but also in great variety. External knowledge can take the form of documents, tables, figures, or even entire databases. Traditional NLP models using RNN- or Transformer-based encoders, despite strong performance when the input query is limited to a sentence or a paragraph, face limitations when handling these types of information. With the introduction of memory networks, a type of neural architecture that provides a separate memory component for storing such external knowledge, neural NLP models have a much better chance of utilising structured knowledge such as knowledge graphs, as well as capturing complex dynamics.
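The core read operation that makes such a memory component useful can be sketched in a few lines: the model scores each stored memory slot against the query and returns an attention-weighted sum of the slots. The function and toy vectors below are illustrative only, not the implementation of any specific memory-network paper:

```python
import math

def attention_read(query, memory):
    """One memory-network 'read': score each memory slot against the
    query by dot product, softmax the scores, and return the
    attention-weighted sum of the slots."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    peak = max(scores)
    exp = [math.exp(s - peak) for s in scores]  # stable softmax
    total = sum(exp)
    weights = [e / total for e in exp]
    return [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(query))]

# Toy example: the query matches the first memory slot most strongly,
# so the read vector leans toward that slot.
memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]
read = attention_read(query, memory)
```

In a full model, the query and slots would be learned embeddings and the read vector would be fed to a downstream answer module; the mechanics of the weighted read are the same.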
Tuesday, July 20th, 09:30 a.m.
This week, Hassan will introduce a paper from Google about subword units in neural machine translation models. A Zoom link will be sent tomorrow morning.
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Abstract: Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). While sentences are usually converted into unique subword sequences, subword segmentation is potentially ambiguous and multiple segmentations are possible even with the same vocabulary. The question addressed in this paper is whether it is possible to harness the segmentation ambiguity as a noise to improve the robustness of NMT. We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low resource and out-of-domain settings.
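To make the segmentation ambiguity concrete, the toy sketch below enumerates every segmentation of a word under a small hypothetical unigram vocabulary (the probabilities are invented for illustration) and samples one in proportion to its unigram score. This is the essence of subword regularization; the actual method samples from a SentencePiece-trained unigram model via an n-best list or forward-filtering backward-sampling, rather than exact enumeration:

```python
import random

# Hypothetical unigram vocabulary with made-up probabilities; a real
# system learns these via EM over a large corpus.
vocab = {"un": 0.1, "fun": 0.2, "f": 0.05, "u": 0.05, "n": 0.05,
         "funny": 0.3, "ny": 0.1, "y": 0.05}

def segmentations(word, vocab):
    """Enumerate all ways to split `word` into in-vocabulary pieces,
    each scored by the product of its pieces' unigram probabilities."""
    if not word:
        return [([], 1.0)]
    out = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            for rest, p in segmentations(word[i:], vocab):
                out.append(([piece] + rest, vocab[piece] * p))
    return out

def sample_segmentation(word, vocab, rng=random):
    """Sample one segmentation with probability proportional to its
    unigram score -- the per-epoch sampling that exposes the NMT model
    to many segmentations of the same sentence."""
    segs = segmentations(word, vocab)
    total = sum(p for _, p in segs)
    r = rng.random() * total
    for seg, p in segs:
        r -= p
        if r <= 0:
            return seg
    return segs[-1][0]

segs = segmentations("funny", vocab)
```

Under this vocabulary, "funny" has seven valid segmentations; a deterministic tokenizer would always pick the highest-scoring one (the single piece "funny"), whereas sampling occasionally feeds the model alternatives like "fun" + "ny".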
Tuesday, July 13th, 09:30 a.m.
This week, Nishant will give us a survey of massively multilingual NMT. A Zoom link will be sent tomorrow morning.
The Current State of Massively Multilingual NMT
Abstract: Massively multilingual NMT (MMNMT) models are capable of handling over 100 languages and thousands of translation directions with a single trained model. Apart from scalability, zero-shot translation between languages, a result of the inherent transfer learning, makes such models desirable. In this presentation, we will look at the preconditions and assumptions that are crucial to building an MMNMT model, the widely accepted approaches and results, and an in-depth analysis of various aspects that are critical to achieving a practical MMNMT model, followed by some recent proposals for improving such models. To conclude, we will discuss some shortcomings and open problems in this direction.
Tuesday, July 6th, 09:30 a.m.
This week, Vincent will discuss a paper about Federated Learning in the NLP field. A Zoom link will be sent tomorrow morning.
FedNLP: A Research Platform for Federated Learning in Natural Language Processing
Abstract: Increasing concerns and regulations about data privacy necessitate the study of privacy-preserving methods for natural language processing (NLP) applications. Federated learning (FL) provides promising methods for a large number of clients (i.e., personal devices or organizations) to collaboratively learn a shared global model to benefit all clients, while allowing users to keep their data locally. To facilitate FL research in NLP, we present FedNLP, a research platform for federated learning in NLP. FedNLP supports various popular task formulations in NLP such as text classification, sequence tagging, question answering, seq2seq generation, and language modeling. We also implement an interface between Transformer language models (e.g., BERT) and FL methods (e.g., FedAvg, FedOpt, etc.) for distributed training. The evaluation protocol of this interface supports a comprehensive collection of non-IID partitioning strategies. Our preliminary experiments with FedNLP reveal that there exists a large performance gap between learning on decentralized and centralized datasets, opening intriguing and exciting future research directions aimed at developing FL methods suited to NLP tasks.
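The FedAvg aggregation mentioned in the abstract reduces to a dataset-size-weighted average of client parameters: each client trains locally, and the server never sees raw data, only model weights. The sketch below is a minimal illustration over flat parameter lists, not the FedNLP interface itself:

```python
def fedavg(client_weights, client_sizes):
    """One FedAvg server step: average the clients' model parameters,
    weighting each client by the size of its local dataset.
    Parameters are represented here as flat lists of floats."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

# Two clients: one trained on 30 local examples, one on 10. The client
# with more data pulls the global model toward its parameters.
global_model = fedavg([[1.0, 2.0], [5.0, 6.0]], [30, 10])
# -> [2.0, 3.0]
```

Variants like FedOpt replace this plain average with a server-side optimizer applied to the averaged client updates, but the privacy-preserving structure (exchange weights, not data) is the same.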
Tuesday, June 22nd, 09:30 a.m.
This week, Logan will discuss parts of his upcoming publication in Findings of ACL 2021, as well as his submission to EMNLP 2021.
Our recent progress on the undeciphered Proto-Elamite script
Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models
Abstract: We introduce a language modeling architecture which operates over sequences of images, or over multimodal sequences of images with associated labels. We use this architecture alongside other embedding models to investigate a category of signs called complex graphemes (CGs) in the undeciphered proto-Elamite script. We argue that CGs have meanings which are at least partly compositional, and we discover novel rules governing the construction of CGs. We find that a language model over sign images produces more interpretable results than a model over text or over sign images and text, which suggests that the names given to signs may be obscuring signals in the corpus. Our results reveal previously unknown regularities in proto-Elamite sign use that can inform future decipherment efforts, and our image-aware language model provides a novel way to abstract away from biases introduced by human annotators.
Creating a Signlist from Sign Images in an Undeciphered Script using Deep Clustering
Abstract: We propose an architecture for revising transliterations of an undeciphered script by clustering sign images from that script. The clustering is optimized on a multi-part objective that includes unsupervised tasks such as entropy of the sign labels, visual similarity between signs and partial supervision that exploits existing transliterations for language modeling. This allows us to learn revised labelings for an undeciphered script which may be difficult for human annotators to transliterate since distinctions between signs may be relevant or irrelevant based on contextual information spread across the entire corpus. By automating this process we obtain a simplified signlist which we find to give better results than the existing transliterations on downstream tasks.
Tuesday, June 8th, 09:30 a.m.