This week, Hassan will review a paper on error analysis. It is one of the two “Best Long Paper Award” winners of EACL 2021, which ended 3 days ago. A Zoom link will be sent tomorrow morning.
Error Analysis and the Role of Morphology
Abstract: We evaluate two common conjectures in error analysis of NLP models: (i) Morphology is predictive of errors; and (ii) the importance of morphology increases with the morphological complexity of a language. We show across four different tasks and up to 57 languages that of these conjectures, somewhat surprisingly, only (i) is true. Using morphological features does improve error prediction across tasks; however, this effect is less pronounced with morphologically complex languages. We speculate this is because morphology is more discriminative in morphologically simple languages. Across all four tasks, case and gender are the morphological features most predictive of error.
Tuesday, Apr 27th, 09:30 a.m.
This week, Nishant will review a paper on multilingual BERT. A Zoom link will be sent tomorrow morning.
How multilingual is multilingual BERT?
Abstract: This work by Pires et al. (2019) empirically investigates the degree to which pre-trained contextualized general-purpose linguistic representations generalize across languages. The key finding is that multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, the authors present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.
Tuesday, Apr 20th, 09:30 a.m.
This week, Vincent will review a paper on the interpretability of attention. A Zoom link will be sent tomorrow morning.
The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?
Abstract: There is a recent surge of interest in using attention as explanation of model predictions, with mixed evidence on whether attention can be used as such. While attention conveniently gives us one weight per input token and is easily extracted, it is often unclear toward what goal it is used as explanation. We find that often that goal, whether explicitly stated or not, is to find out what input tokens are the most relevant to a prediction, and that the implied user for the explanation is a model developer. For this goal and user, we argue that input saliency methods are better suited, and that there are no compelling reasons to use attention, despite the coincidence that it provides a weight for each input. With this position paper, we hope to shift some of the recent focus on attention to saliency methods, and for authors to clearly state the goal and user for their explanations.
Tuesday, Apr 6th, 09:30 a.m.
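As a warm-up for the attention versus saliency discussion, here is a minimal sketch of one input saliency method, gradient times input, applied to a toy bag-of-words logistic classifier. The vocabulary, weights, and function names are made up for illustration and do not come from the paper. Each token occurrence is treated as an indicator feature x_i = 1, so its saliency is p(1 - p) * w_i.

```python
import math

# Hypothetical weights of a toy bag-of-words logistic classifier (illustrative only).
weights = {"great": 2.0, "boring": -1.5, "movie": 0.1, "the": 0.0}
bias = 0.0


def predict_proba(tokens):
    """Probability of the positive class under the toy logistic model."""
    z = bias + sum(weights.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def gradient_x_input(tokens):
    """Gradient-times-input saliency per token.

    Each token occurrence is an indicator feature x_i = 1, so
    dp/dx_i * x_i = p * (1 - p) * w_i.
    """
    p = predict_proba(tokens)
    return [(t, p * (1.0 - p) * weights.get(t, 0.0)) for t in tokens]


tokens = ["the", "movie", "great"]
saliency = gradient_x_input(tokens)
# Token with the largest absolute saliency score.
top_token = max(saliency, key=lambda pair: abs(pair[1]))[0]
```

For a real neural model the gradient would come from automatic differentiation rather than a closed form, but the output has the same shape as attention: one score per input token.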
This week, Logan will give a practice talk for his depth exam. The talk will discuss recent work on analogy tasks, with a focus on the pitfalls surrounding their use in model evaluation and interpretation. After the practice talk, we will discuss Ethayarajh 2019: Rotate King to get Queen: Word Relationships as Orthogonal Transformations in Embedding Space. This paper suggests that vector offsets are not the most appropriate way to perform analogical reasoning, and that embedding models learn more complex relations between words than can be expressed by simple vector translation.
Rotate King to get Queen: Word Relationships as Orthogonal Transformations in Embedding Space
Abstract: A notable property of word embeddings is that word relationships can exist as linear substructures in the embedding space. For example, ‘gender’ corresponds to v_woman - v_man and v_queen - v_king. This, in turn, allows word analogies to be solved arithmetically: v_king - v_man + v_woman = v_queen. This property is notable because it suggests that models trained on word embeddings can easily learn such relationships as geometric translations. However, there is no evidence that models exclusively represent relationships in this manner. We document an alternative way in which downstream models might learn these relationships: orthogonal and linear transformations. For example, given a translation vector for ‘gender’, we can find an orthogonal matrix R, representing a rotation and reflection, such that R(v_king) = v_queen and R(v_man) = v_woman. Analogical reasoning using orthogonal transformations is almost as accurate as using vector arithmetic; using linear transformations is more accurate than both. Our findings suggest that these transformations can be as good a representation of word relationships as translation vectors.
Tuesday, Mar 23rd, 09:30 a.m.
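To make the two views in the Rotate King to get Queen abstract concrete, here is a small sketch that solves the same analogy once with a vector offset and once with an orthogonal matrix. The 3-dimensional vectors are invented toy values, not real embeddings, chosen so that "gender" is a sign flip on the first axis. The orthogonal matrix is fitted with the standard orthogonal Procrustes solution via SVD, which is an assumption on our part and may differ from the fitting procedure used in the paper.

```python
import numpy as np

# Toy 3-d vectors (made up for illustration, not real embeddings).
king = np.array([1.0, 0.8, 0.3])
queen = np.array([-1.0, 0.8, 0.3])
man = np.array([1.0, 0.1, 0.9])
woman = np.array([-1.0, 0.1, 0.9])

# View 1: translation. Solve the analogy by vector arithmetic.
offset_prediction = king - man + woman

# View 2: orthogonal transformation. Find an orthogonal R with
# R @ king ~ queen and R @ man ~ woman (orthogonal Procrustes):
# with M = Y^T X = U S V^T, the minimizer of ||Y^T - R X^T||_F is R = U V^T.
X = np.stack([king, man])     # source vectors as rows
Y = np.stack([queen, woman])  # target vectors as rows
U, _, Vt = np.linalg.svd(Y.T @ X)
R = U @ Vt

print("offset prediction:", offset_prediction)
print("R @ king:         ", R @ king)
print("R is orthogonal:  ", np.allclose(R.T @ R, np.eye(3)))
```

In this toy setup both views recover queen, because the data were built to satisfy both. On real embeddings the two generally disagree, which is what the paper measures.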
In our lab meeting tomorrow, Hassan will give us a review of the “Best Overall Paper” of ACL 2020.
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Abstract: Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Tuesday, Mar 16th, 09:30 a.m.
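For anyone who wants a feel for behavioral testing before the CheckList talk, here is a minimal sketch of two kinds of test in plain Python: a minimum functionality test (fixed inputs with expected labels) and an invariance test (a perturbation that should not change the prediction). It does not use the CheckList tool itself. The keyword classifier and all function names below are hypothetical stand-ins for a real model and test harness.

```python
from typing import Callable, List, Tuple


def toy_sentiment(text: str) -> str:
    """Hypothetical keyword classifier standing in for a real model."""
    words = text.lower().replace(".", "").split()
    positive = {"good", "great", "love"}
    negative = {"bad", "terrible", "hate"}
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"


def run_mft(model: Callable[[str], str], cases: List[Tuple[str, str]]) -> List[str]:
    """Minimum functionality test: return inputs whose prediction differs from the expected label."""
    return [text for text, expected in cases if model(text) != expected]


def run_inv(model: Callable[[str], str], template: str, fillers: List[str]) -> List[str]:
    """Invariance test: return filled templates whose prediction differs from the first filler's."""
    texts = [template.format(name=f) for f in fillers]
    baseline = model(texts[0])
    return [t for t in texts[1:] if model(t) != baseline]


# Capability: negation. The toy model ignores "not", so the second case fails.
negation_cases = [("The food is good.", "pos"), ("The food is not good.", "neg")]
mft_failures = run_mft(toy_sentiment, negation_cases)

# Capability: robustness to names. Swapping the name should not change the label.
inv_failures = run_inv(
    toy_sentiment, "{name} said the flight was great.", ["Maria", "Ahmed", "Wei"]
)

print("MFT failures:", mft_failures)
print("INV failures:", inv_failures)
```

The toy model passes the name-swap invariance test but fails the negation test, which is the kind of bug that held-out accuracy alone can hide.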