News

Investigations into the Value of Labeled and Unlabeled Data in Biomedical Entity Recognition and Word Sense Disambiguation
02 Mar 2021

In our lab meeting tomorrow, Golnar will practice her PhD seminar talk on NER and word sense disambiguation.

Investigations into the Value of Labeled and Unlabeled Data in Biomedical Entity Recognition and Word Sense Disambiguation

Abstract: Human annotations, especially in highly technical domains, are expensive and time-consuming to gather, and can also be erroneous. As a result, we never have sufficiently accurate data to train and evaluate supervised methods. In this thesis, we address this problem by taking a semi-supervised approach to biomedical named entity recognition (NER), and by proposing an inventory-independent evaluation framework for supervised and unsupervised word sense disambiguation. Our contributions are as follows:

• We introduce a novel graph-based semi-supervised approach to NER and exploit pre-trained contextualized word embeddings in several biomedical NER tasks (see the sketch after this list).

• We propose a new evaluation framework for word sense disambiguation that permits a fair comparison between supervised methods trained on different sense inventories, as well as unsupervised methods without a fixed sense inventory.
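For readers unfamiliar with the graph-based semi-supervised family, a minimal sketch is plain label propagation over a nearest-neighbour graph built from contextualized token embeddings. Everything below (the graph construction, the hyperparameters, the `label_propagation` helper itself) is an illustrative assumption, not the algorithm proposed in the thesis:

```python
import numpy as np

def label_propagation(emb, labels, n_classes, k=5, n_iters=50):
    """Spread entity labels from labeled to unlabeled tokens over a
    k-nearest-neighbour cosine-similarity graph (illustrative sketch).

    emb:    (n_tokens, dim) contextualized embeddings (e.g. BERT/ELMo)
    labels: (n_tokens,) int class index, or -1 for unlabeled tokens
    """
    n = emb.shape[0]
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, 0.0)
    # Keep only each node's k strongest edges, then symmetrize.
    topk = np.argsort(sim, axis=1)[:, -k:]
    W = np.zeros_like(sim)
    rows = np.repeat(np.arange(n), k)
    W[rows, topk.ravel()] = sim[rows, topk.ravel()]
    W = np.maximum(W, W.T)
    P = W / W.sum(axis=1, keepdims=True).clip(min=1e-12)  # row-stochastic

    Y = np.zeros((n, n_classes))
    labeled = labels >= 0
    Y[labeled, labels[labeled]] = 1.0
    F = Y.copy()
    for _ in range(n_iters):
        F = P @ F
        F[labeled] = Y[labeled]  # clamp gold labels every iteration
    return F.argmax(axis=1)
```

Clamping the labeled rows keeps the gold annotations fixed while their labels diffuse to unlabeled neighbours that are close in embedding space.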

Tuesday, Mar 2nd, 09:30 a.m.

Multilingual Unsupervised Word Alignment Models and Their Application
23 Feb 2021

In our lab meeting tomorrow, Anahita will practice her thesis defence on Word Alignment.

Multilingual Unsupervised Word Alignment Models and Their Application

Abstract: Word alignment is an essential task in natural language processing because of its critical role in training statistical machine translation (SMT) models, analyzing errors in neural machine translation (NMT), building bilingual lexicons, and transferring annotations. In this thesis, we explore models for word alignment, how they can be extended to incorporate linguistically-motivated alignment types, and how they can be neuralized in an end-to-end fashion. In addition to these methodological developments, we apply our word alignment models to cross-lingual part-of-speech projection.

First, we present a new probabilistic model for word alignment in which word alignments are associated with linguistically-motivated alignment types. We propose the novel task of jointly predicting word alignments and alignment types, together with novel semi-supervised learning algorithms for it. We also address the sub-task of predicting the alignment type of a given aligned word pair. The proposed joint generative models (alignment-type-enhanced models) significantly outperform the models without alignment types in terms of both word alignment and translation quality.
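The sub-task of predicting an alignment type for an already-aligned word pair can be pictured as ordinary supervised classification. The feature set and helpers below are hypothetical simplifications; the thesis's actual models are generative:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def pair_features(src_word, tgt_word, src_pos, tgt_pos):
    """Hypothetical features for one aligned word pair."""
    return {
        "src_pos": src_pos,
        "tgt_pos": tgt_pos,
        "pos_pair": f"{src_pos}|{tgt_pos}",
        "identical": src_word.lower() == tgt_word.lower(),
    }

def train_type_classifier(pairs, types):
    """pairs: list of (src_word, tgt_word, src_pos, tgt_pos) tuples;
    types: the gold alignment type of each pair."""
    vec = DictVectorizer()
    X = vec.fit_transform(pair_features(*p) for p in pairs)
    clf = LogisticRegression(max_iter=1000).fit(X, types)
    return vec, clf
```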

Next, we present an unsupervised neural Hidden Markov Model for word alignment, where emission and transition probabilities are modeled using neural networks. The model is simpler in structure, allows for seamless integration of additional context, and can be used in an end-to-end neural network.
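A minimal sketch of such a model, assuming a jump-parameterized transition distribution and an embedding-plus-softmax emission network (layer sizes, the start distribution, and the class interface are all assumptions): because the forward algorithm below is differentiable, the sentence log-likelihood can be maximized end-to-end with gradient descent.

```python
import math
import torch
import torch.nn as nn

class NeuralHMMAligner(nn.Module):
    """Illustrative HMM aligner with neural emission probabilities and a
    learned jump distribution for transitions (not the thesis's exact model)."""

    def __init__(self, src_vocab, tgt_vocab, dim=128, max_jump=5):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.emit = nn.Linear(dim, tgt_vocab)            # p(f_j | e_i)
        self.jump = nn.Parameter(torch.zeros(2 * max_jump + 1))
        self.max_jump = max_jump

    def forward(self, src, tgt):
        """src, tgt: 1-D LongTensors of token ids; returns the sentence NLL."""
        I = src.size(0)
        emit = self.emit(self.src_emb(src)).log_softmax(-1)[:, tgt]  # (I, J)
        # Transition log-probs depend only on the (clamped) jump width.
        jumps = torch.arange(I).unsqueeze(0) - torch.arange(I).unsqueeze(1)
        jumps = jumps.clamp(-self.max_jump, self.max_jump) + self.max_jump
        trans = self.jump[jumps].log_softmax(-1)                     # (I, I)
        alpha = emit[:, 0] - math.log(I)                             # uniform start
        for j in range(1, tgt.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + trans, 0) + emit[:, j]
        return -torch.logsumexp(alpha, 0)
```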

Finally, we tackle part-of-speech (POS) tagging in the zero-resource scenario, where no POS-annotated training data is available. We present a cross-lingual projection approach in which neural HMM aligners are used to obtain high-quality word alignments between resource-poor and resource-rich languages. High-quality neural POS taggers provide annotations for the resource-rich side of the parallel data, and a tagger is then trained on the projected annotations. Our experimental results on truly low-resource languages show that our methods outperform their corresponding baselines.
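The projection step itself can be pictured as a majority vote over alignment links. The `project_pos` helper below is a hypothetical simplification; the actual pipeline also filters alignments and retrains a tagger on the projected data:

```python
from collections import Counter, defaultdict

def project_pos(alignments, rich_tags, poor_len):
    """Project POS tags through word alignments (illustrative sketch).

    alignments: iterable of (rich_idx, poor_idx) aligned word-index pairs
    rich_tags:  tags for the resource-rich sentence (e.g. from a neural tagger)
    poor_len:   number of tokens in the resource-poor sentence
    """
    votes = defaultdict(Counter)
    for r, p in alignments:
        votes[p][rich_tags[r]] += 1
    # Majority vote per token; tokens with no alignment link stay None.
    return [votes[p].most_common(1)[0][0] if p in votes else None
            for p in range(poor_len)]
```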

Tuesday, Feb 23rd, 09:30 a.m.

Translation-based Supervision for Policy Generation in Simultaneous Neural Machine Translation
09 Feb 2021

In our lab meeting tomorrow, Ashkan will introduce his project on Machine Translation.

Translation-based Supervision for Policy Generation in Simultaneous Neural Machine Translation

Abstract: In simultaneous machine translation, finding optimal segments on the source and target sides of each sentence pair that maintain translation quality while minimizing delay remains challenging. We propose a supervised learning approach for training an Agent that detects the minimum number of reads required before generating each target token. Because the Agent's training procedure is decoupled from the translation model, our policy can be applied to NMT components trained in various ways.
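At inference time, a decoupled policy of this kind can sit on top of any NMT decoder through a simple read/write loop. The `agent` and `translator` interfaces below are hypothetical placeholders, not the project's actual API:

```python
def simultaneous_decode(agent, translator, src_tokens, max_len=200):
    """Illustrative read/write loop for simultaneous translation.

    agent(read, written)      -> "READ" or "WRITE" (the trained policy)
    translator(read, written) -> next target token given a source prefix
    """
    read, written = [], []
    while len(written) < max_len:
        # Read another source token while the Agent asks for one and
        # source tokens remain; otherwise commit a target token.
        if len(read) < len(src_tokens) and agent(read, written) == "READ":
            read.append(src_tokens[len(read)])
        else:
            tok = translator(read, written)
            if tok == "</s>":
                break
            written.append(tok)
    return written
```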

Tuesday, Feb 9th, 09:30 a.m.

Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models
02 Feb 2021

In our lab meeting tomorrow, Logan will introduce his research on the Undeciphered Proto-Elamite Script.

Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models

Abstract: We introduce a language modeling architecture which operates over sequences of images, or over multimodal sequences of images with associated labels. We use this architecture alongside other embedding models to investigate a category of signs called complex graphemes (CGs) in the undeciphered proto-Elamite script. We argue that CGs have meanings which are at least partly compositional, and we demonstrate quantifiable differences in the distribution of two categories of signs used in CGs. We find that a language model over sign images produces more interpretable results than a model over text or over sign images and text, which suggests that the names given to signs may be obscuring signals in the corpus. Our results indicate the presence of previously unknown regularities in proto-Elamite sign use, and our image-aware language model provides a novel way to abstract away from biases introduced by human annotators.
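One way to picture the image-aware architecture: a small CNN encodes each sign image and a recurrent layer predicts the next sign. Every detail below (layer sizes, the LSTM backbone, a softmax over a fixed sign inventory) is an assumption for illustration, not the paper's exact model:

```python
import torch
import torch.nn as nn

class SignImageLM(nn.Module):
    """Illustrative language model over sequences of sign images."""

    def __init__(self, n_signs, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(          # per-image CNN encoder
            nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.next_sign = nn.Linear(dim, n_signs)

    def forward(self, images):
        """images: (batch, seq, 1, H, W) grayscale sign images."""
        b, s = images.shape[:2]
        feats = self.encoder(images.flatten(0, 1)).view(b, s, -1)
        hidden, _ = self.lstm(feats)
        return self.next_sign(hidden)          # (batch, seq, n_signs) logits
```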

Tuesday, Feb 2nd, 09:30 a.m.

Towards better substitution-based word sense induction
19 Jan 2021

In our lab meeting tomorrow, Golnar will review a paper on word sense induction.

Towards better substitution-based word sense induction

Abstract: Word sense induction (WSI) is the task of unsupervised clustering of word usages within a sentence to distinguish senses. Recent work obtains strong results by clustering lexical substitutes derived from pre-trained RNN language models (ELMo). Adapting the method to BERT improves the scores even further. We extend the previous method to support a dynamic rather than a fixed number of clusters, as other prominent methods do, and propose a method for interpreting the resulting clusters by associating them with their most informative substitutes. We then perform an extensive error analysis revealing the remaining sources of error in the WSI task.
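A stripped-down version of the substitution-based pipeline, using BERT's masked-LM head for substitutes and agglomerative clustering; note that the model choice, `k`, and the fixed cluster count here are assumptions, and the paper's extension makes the number of clusters dynamic:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def substitutes(sentence, target, k=20):
    """Top-k lexical substitutes for `target` from BERT's masked-LM head."""
    enc = tok(sentence.replace(target, tok.mask_token, 1), return_tensors="pt")
    pos = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = mlm(**enc).logits[0, pos]
    return tok.convert_ids_to_tokens(logits.topk(k).indices.tolist())

def induce_senses(sentences, target, n_clusters=2):
    """Cluster usages of `target` by their substitute distributions."""
    docs = [" ".join(substitutes(s, target)) for s in sentences]
    X = TfidfVectorizer().fit_transform(docs).toarray()
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
```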

https://arxiv.org/pdf/1905.12598.pdf

Tuesday, Jan 19th, 09:30 a.m.
