A Quick Review of 3 Papers from EMNLP 2021
24 Nov 2021

This week’s lab meeting has been moved to Wednesday, 24 Nov, at 13:00. Ashkan has selected three papers to survey from EMNLP:

  • Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT. This paper analyzes how NMT acquires different competencies during training, focusing on the competencies related to the three core SMT components (language modeling, lexical translation, and reordering).

  • RULEBERT: Teaching Soft Rules to Pre-Trained Language Models. This paper introduces a classification task where, given facts and soft rules, pretrained language models should return a prediction with a probability for a given hypothesis.

  • Machine Translation Decoding beyond Beam Search. This paper explores alternatives to beam search and asks whether it can be replaced by a more powerful metric-driven search technique.

Wednesday, 24 November at 13:00

After this week we will have a short end-of-semester hiatus while lab members are traveling or absent.

All Word Embeddings from One Embedding
19 Nov 2021

At today’s lab meeting, Nishant will present work that reduces the number of parameters required to store word embeddings by representing all words as transformations of a single shared vector.

All Word Embeddings from One Embedding

Abstract: In neural network-based models for natural language processing (NLP), the largest part of the parameters often consists of word embeddings. Conventional models prepare a large embedding matrix whose size depends on the vocabulary size. Therefore, storing these models in memory and disk storage is costly. In this study, to reduce the total number of parameters, the embeddings for all words are represented by transforming a shared embedding. The proposed method, ALONE (all word embeddings from one), constructs the embedding of a word by modifying the shared embedding with a filter vector, which is word-specific but non-trainable. Then, we input the constructed embedding into a feed-forward neural network to increase its expressiveness. Naively, the filter vectors occupy the same memory size as the conventional embedding matrix, which depends on the vocabulary size. To solve this issue, we also introduce a memory-efficient filter construction approach. We indicate our ALONE can be used as word representation sufficiently through an experiment on the reconstruction of pre-trained word embeddings. In addition, we also conduct experiments on NLP application tasks: machine translation and summarization. We combined ALONE with the current state-of-the-art encoder-decoder model, the Transformer, and achieved comparable scores on WMT 2014 English-to-German translation and DUC 2004 very short summarization with less parameters.
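
The construction described in the abstract (shared embedding, word-specific filter, feed-forward network) can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation: the filter is assumed to be a binary mask applied by elementwise multiplication, and it is regenerated from a pseudo-random generator seeded with the word id so that it never has to be stored. That seeding trick is a stand-in for the paper's memory-efficient filter construction, which the abstract does not spell out. All names and sizes here (`D`, `make_shared`, `filter_vector`, `make_ffn`, `embed`) are hypothetical.

```python
import random

D = 8  # embedding dimension (illustrative size, an assumption)

def make_shared(seed=0):
    """The single shared embedding; in a real model this would be trainable."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(D)]

def filter_vector(word_id, dim=D):
    """Word-specific, non-trainable binary filter.

    Assumption: regenerated on demand from the word id instead of being
    stored, so memory does not grow with vocabulary size.
    """
    rng = random.Random(word_id)
    return [rng.randint(0, 1) for _ in range(dim)]

def make_ffn(hidden=16, seed=1):
    """Weights of a two-layer feed-forward network (trainable in practice)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(D)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(D)]
    return w1, w2

def ffn(x, w1, w2):
    """Linear layer, ReLU, linear layer."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

def embed(word_id, shared, w1, w2):
    """Filter the shared embedding for this word, then pass it through the FFN."""
    mask = filter_vector(word_id)
    filtered = [m * s for m, s in zip(mask, shared)]
    return ffn(filtered, w1, w2)

shared = make_shared()
w1, w2 = make_ffn()
vector_for_word_3 = embed(3, shared, w1, w2)
```

In this sketch the only stored parameters are the shared vector and the feed-forward weights, so the parameter count is independent of the vocabulary size.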

Friday, 19 November at 13:30

Autoregressive Entity Retrieval
12 Nov 2021

At Friday’s lab meeting, Hassan will present recent work on autoregressive entity retrieval.

Autoregressive Entity Retrieval

Abstract: Entities are at the center of how we represent and aggregate knowledge. For instance, Encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as their descriptions. This approach leads to several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative data has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion and conditioned on the context. This enables us to mitigate the aforementioned technical issues since: (i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach, experimenting with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name. Code and pre-trained models at
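
To make the idea of retrieving an entity by generating its name token by token more concrete, here is a minimal sketch. It assumes that decoding is restricted to a prefix tree of valid entity names, so only real names can be produced; the abstract itself does not describe this mechanism. The scoring function `toy_score` is a hypothetical stand-in for a trained encoder-decoder, the decoding is greedy for brevity, and all names in the code are illustrative.

```python
END = "<eos>"  # marks the end of a complete entity name

def build_trie(names):
    """Prefix tree over whitespace-tokenized entity names."""
    trie = {}
    for name in names:
        node = trie
        for token in name.split() + [END]:
            node = node.setdefault(token, {})
    return trie

def toy_score(context, prefix, token):
    """Hypothetical stand-in for the model: prefer tokens seen in the context."""
    return 1.0 if token.lower() in context.lower() else 0.0

def generate_entity(context, trie, score=toy_score):
    """Greedy left-to-right generation, constrained to names in the trie."""
    node, prefix = trie, []
    while True:
        # Only tokens that continue some valid entity name are considered.
        token = max(sorted(node), key=lambda t: score(context, prefix, t))
        if token == END:
            return " ".join(prefix)
        prefix.append(token)
        node = node[token]

names = ["Paris", "Paris Hilton", "London"]
trie = build_trie(names)
print(generate_entity("She flew to Paris last week", trie))
print(generate_entity("Paris Hilton attended the event", trie))
```

Under these assumptions, adding a new entity only requires inserting its name into the prefix tree; no per-entity vector has to be stored.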

Friday, 12 November at 13:30

NeuralNERE: Neural Named Entity Relationship Extraction for End-to-End Climate Change Knowledge Graph Construction
22 Oct 2021

At tomorrow’s lab meeting, Rylen will present Mishra and Mittal (2021) from the Tackling Climate Change with Machine Learning Workshop at ICML 2021.

NeuralNERE: Neural Named Entity Relationship Extraction for End-to-End Climate Change Knowledge Graph Construction Prakamya Mishra and Rohan Mittal

Abstract: This paper proposes an end-to-end Neural Named Entity Relationship Extraction model (called NeuralNERE) for climate change knowledge graph (KG) construction, directly from the raw text of relevant news articles. The proposed model will not only remove the need for any kind of human supervision for building knowledge bases for climate change KG construction (used in the case of supervised or dictionary-based KG construction methods), but will also prove to be highly valuable for analyzing climate change by summarising relationships between different factors responsible for climate change, extracting useful insights & reasoning on pivotal events, and helping industry leaders in making more informed future decisions. Additionally, we also introduce the Science Daily Climate Change dataset (called SciDCC) that contains over 11k climate change news articles scraped from the Science Daily website, which could be used for extracting prior knowledge for constructing climate change KGs.

Friday, 22 October at 13:30

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
08 Oct 2021

At tomorrow’s lab meeting, Logan will present Aghajanyan et al. (2021), one of the winners of the ACL 2021 Outstanding Paper Award. This work extends the concept of intrinsic dimension to fine-tuning problems, and shows that tuning on as few as 200 dimensions can be effective on some tasks. This result is accompanied by applications to model compression and bounds on a model’s ability to generalize.

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning Armen Aghajanyan, Sonal Gupta, Luke Zettlemoyer

Abstract: Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, we can tune a RoBERTa model to achieve 90% of the full parameter performance levels on MRPC. Furthermore, we empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness. Lastly, we connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
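
The reparameterization mentioned in the abstract, where a small trainable vector is randomly projected back into the full parameter space, can be sketched as follows. This is a toy illustration under stated assumptions: the projection is a dense Gaussian matrix (chosen here for simplicity, not taken from the paper), the "model" is just a parameter vector with a quadratic loss, and every name and size (`random_projection`, `full_params`, `tune_low_dim`, 50 full dimensions, 5 trainable ones) is hypothetical.

```python
import random

def random_projection(full_dim, low_dim, seed=0):
    """Fixed (non-trainable) random matrix mapping low_dim -> full_dim."""
    rng = random.Random(seed)
    scale = full_dim ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(low_dim)]
            for _ in range(full_dim)]

def full_params(theta0, P, d):
    """theta = theta0 + P d: frozen starting weights plus projected update."""
    return [t + sum(p * di for p, di in zip(row, d))
            for t, row in zip(theta0, P)]

def loss(theta, target):
    """Toy quadratic loss standing in for a fine-tuning objective."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(theta, target))

def tune_low_dim(theta0, target, P, steps=100, lr=0.1):
    """Gradient descent on the low-dimensional vector d only."""
    low_dim = len(P[0])
    d = [0.0] * low_dim
    for _ in range(steps):
        theta = full_params(theta0, P, d)
        grad_theta = [t - y for t, y in zip(theta, target)]
        # Chain rule: the gradient w.r.t. d is P^T times the gradient w.r.t. theta.
        grad_d = [sum(P[i][j] * grad_theta[i] for i in range(len(theta)))
                  for j in range(low_dim)]
        d = [di - lr * g for di, g in zip(d, grad_d)]
    return d

rng = random.Random(42)
theta0 = [0.0] * 50                              # stand-in for pre-trained weights
target = [rng.gauss(0.0, 1.0) for _ in range(50)]
P = random_projection(full_dim=50, low_dim=5)
d = tune_low_dim(theta0, target, P)
before = loss(theta0, target)
after = loss(full_params(theta0, P, d), target)
```

Only the five entries of `d` are updated; the full parameter vector is always recomputed from the frozen starting point and the fixed projection.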

Friday, 8 October at 13:30

Recent Publications