Simon Fraser University

Welcome to the SFU Natural Language Laboratory

Natural Language Processing (NLP) is the automatic analysis of human languages such as English and Korean by computer algorithms. Unlike programming languages, where the structure and meaning of programs are easy to encode, human languages pose an interesting challenge, both in their analysis and in the learning of language from observations.

Our lab is active in the areas of information extraction, machine translation, summarization, and statistical parsing. We have worked with many languages, including Arabic, Chinese, Czech, English, French, Hindi, Korean and Spanish.

About 10 to 15 students are members of the Natural Language Lab at SFU. We have a strong relationship with the natural language industry in Canada, the National Research Council, and various research groups around the world and within SFU. In 1999, lab researchers formed a company, Axonwave Software Inc., which uses language technology software.

Interested in machines that can learn language, understand language, and translate language? We are always looking for talented graduate students. We also actively explore connections to industrial/commercial applications of our research through collaborative grants and/or internships.

Google Research Award

January 26th, 2010
Analysis of statistical machine translation into Finnish

Associate Professor Anoop Sarkar recently won a Google Research Award for research into statistical machine translation.

The Google Research Awards are for faculty around the world who pursue innovative research that is relevant to Google’s mission, to “organize the world’s information and make it universally accessible and useful.” The award consists of substantial grant money, and the opportunity to share in Google’s facilities and give talks related to the research.

Thanks to Google’s generous funding, members of the Natural Language Lab at SFU will be tackling the problem of translation into languages that are very different from English. Lab members Baskaran Sankaran (Ph.D. student) and Ann Clifton (Master's student) are currently developing such methods as part of their thesis work.

Many of the recently developed statistical machine translation systems focus on translation into English. The machine learning algorithms used for machine translation are implicitly biased towards producing output in English and similar languages. They are much weaker at dealing with translation from English into a target language with very different grammar and word formation rules. New algorithms and models that augment existing statistical machine translation models are needed to deal with these issues.

Accepted for publication: Latent SVMs for Semantic Role Labeling using LTAG Derivation Trees

January 26th, 2010

The following paper was accepted for publication as a long paper at the North American ACL conference: NAACL HLT 2010.

Latent SVMs for Semantic Role Labeling using LTAG Derivation Trees

Authors: Yudong Liu, Gholamreza Haffari and Anoop Sarkar

Yudong Liu defends her Ph.D. thesis

January 20th, 2010

Yudong defends her Ph.D. thesis

Yudong Liu successfully defended her Ph.D. thesis on November 30th, 2009. Her thesis is entitled Semantic Role Labeling using Tree Adjoining Grammars. Her external examiner was Kristina Toutanova. Her internal examiner was Fred Popowich. Her senior supervisor was Anoop Sarkar and her supervisory committee included Eugenia Ternovska. Jim Delgrande chaired the defense.

Here is the abstract of her thesis:

The meaning of a natural language sentence is typically conveyed through an event or action and its participants; the event is expressed as a verb (predicate), and the participants are expressed as the arguments of the verb. The task of semantic role labeling (SRL) is to identify the predicate-argument structures (PAS) and label the relations between the predicate and each of its arguments. It is an important intermediate step towards many natural language processing (NLP) applications, such as text summarization, question answering, and machine translation. Lexicalized Tree Adjoining Grammars (LTAGs), a tree rewriting formalism, are well suited to the SRL task because of their Extended Domain of Locality (EDL).

Our work in this thesis is mainly focused on the development and learning of state-of-the-art discriminative SRL systems with LTAGs. Our contributions include:

  1. We proposed the use of the LTAG formalism as an important additional source of features for the semantic role labeling task. Our experiments show that, compared with the best known set of features used in state-of-the-art SRL systems, LTAG-based features can improve SRL performance significantly.
  2. We explored a novel LTAG formalism, LTAG-spinal, and its treebank for the SRL task, and demonstrated the utility of this new resource for SRL. Deep linguistic information, such as predicate-argument relationships that are either implicit or absent from the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling.
  3. We applied a novel learning framework, Latent Support Vector Machines (LSVMs), to the SRL task by treating LTAG derivation trees as latent structures of Penn Treebank derived trees, which further improved SRL accuracy. Our work widens the possibility for the general applicability of LSVMs to other NLP tasks.

Ph.D. Thesis Seminar - Yudong Liu

November 20th, 2009

Time: Friday November 27th, 2009 1:30 p.m.
Place: TASC1 9204 West

Title: SEMANTIC ROLE LABELING USING LEXICALIZED TREE ADJOINING GRAMMARS

Abstract:
The meaning of a natural language sentence is typically conveyed through an event or action and its participants; the event is expressed as a verb (predicate), and the participants are expressed as the arguments of the verb. The task of semantic role labeling (SRL) is to identify the predicate-argument structures (PAS) and label the relations between the predicate and each of its arguments. It is an important intermediate step towards many natural language processing (NLP) applications, such as text summarization, question answering, and machine translation. Lexicalized Tree Adjoining Grammars (LTAGs), a tree rewriting formalism, are well suited to the SRL task because of their Extended Domain of Locality (EDL).

Our work in this thesis is mainly focused on the development and learning of state-of-the-art discriminative SRL systems with LTAGs. Our contributions include:

  1. We proposed the use of the LTAG formalism as an important additional source of features for the semantic role labeling task. Our experiments show that, compared with the best known set of features used in state-of-the-art SRL systems, LTAG-based features can improve SRL performance significantly.
  2. We explored a novel LTAG formalism, LTAG-spinal, and its treebank for the SRL task, and demonstrated the utility of this new resource for SRL. Deep linguistic information, such as predicate-argument relationships that are either implicit or absent from the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling.
  3. We applied a novel learning framework, Latent Support Vector Machines (LSVMs), to the SRL task by treating LTAG derivation trees as latent structures of Penn Treebank derived trees, which further improved SRL accuracy. Our work widens the possibility for the general applicability of LSVMs to other NLP tasks.

In addition, the empirical success of our work is theoretically intriguing in terms of the advantage that online learning, as presented under our LSVM framework, shows over batch learning.

Keywords: semantic role labeling (SRL), Lexicalized Tree Adjoining Grammars (LTAG), features, Latent Support Vector Machines (LSVM)


M.Sc. Thesis Defence and Seminar - Ajeet Grewal

November 20th, 2009

Time: Tuesday, November 24, 2009 10:45 a.m.

Place: TASC1 9204 West

Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION

Abstract:

Statistical machine translation (SMT) systems use statistical learning methods to learn how to translate from large amounts of parallel training data. Unfortunately, SMT systems are tuned to the domain of the training data and need to be adapted before they can be used to translate data in a different domain.

First, we consider a semi-supervised technique to perform model adaptation. We explore new feature extraction techniques, feature combinations and their effects on performance. In addition, we introduce an unsupervised variant of Minimum Error Rate Training (MERT), which can be used to tune the SMT model parameters. We do this by using another SMT model that translates in the reverse direction. We apply this variant of MERT to the model adaptation task. Both of the techniques we explore in this thesis produce promising results in exhaustive experiments we performed for translation from French to English in different domains.

Ajeet’s M.Sc. Defence Practice Talk

November 20th, 2009

Time: November 23rd at 12 noon, Monday
Place: TASC 9204

Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION

Abstract:
Statistical machine translation (SMT) systems use statistical learning methods to learn how to translate from large amounts of parallel training data. Unfortunately, SMT systems are tuned to the domain of the training data and need to be adapted before they can be used to translate data in a different domain.

First, we consider a semi-supervised technique to perform model adaptation. We explore new feature extraction techniques, feature combinations and their effects on performance. In addition, we introduce an unsupervised variant of Minimum Error Rate Training (MERT), which can be used to tune the SMT model parameters. We do this by using another SMT model that translates in the reverse direction. We apply this variant of MERT to the model adaptation task. Both of the techniques we explore in this thesis produce promising results in exhaustive experiments we performed for translation from French to English in different domains.

M.Sc. Thesis Defence and Seminar - Milan Tofiloski

November 20th, 2009

Time: Wednesday, November 25, 2009 1:00 p.m.
Place: TASC 1 9204 West

Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE

Abstract:
Centering Theory determines the coherence of a text by identifying the most salient entities in adjacent utterances and observing how they change in salience. This thesis extends Centering Theory by tracking all the entities of the utterance in order to improve the measure of a text’s coherence. Accounting for all entities allows the utterance window to be expanded beyond adjacent utterances, which eliminates the difficulties and compromises associated with choosing either a sentence or a clause as the unit of analysis. Experiments show that tracking all entities instead of a single entity improves the evaluation of text coherence compared to traditional measures of coherence. The model proposed is motivated by linguistic principles and does not require training, alleviating the need for costly training data. The entity coherence model was evaluated on two tasks: sentence ordering and summarization. The sentence ordering experiment involved identifying the original text amongst a collection of its permutations.
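
The sentence ordering experiment described above can be illustrated with a small sketch. This is not the thesis's model: the scoring function below is a deliberately simple stand-in (a count of shared entities between utterances within a window, discounted by distance), and the names `coherence`, `rank_of_original`, and the `window` parameter are assumptions made for illustration. Each utterance is assumed to be already reduced to a set of entity strings.

```python
from itertools import permutations


def coherence(utterances, window=2):
    """Toy coherence score (hypothetical, not the thesis's measure).

    Counts entities shared by every pair of utterances that are at most
    `window` positions apart, discounting each pair by its distance.
    """
    score = 0.0
    n = len(utterances)
    for i in range(n):
        for j in range(i + 1, min(i + window, n - 1) + 1):
            shared = utterances[i] & utterances[j]
            score += len(shared) / (j - i)
    return score


def rank_of_original(utterances, window=2):
    """Rank (1 = best) of the original order among all of its permutations."""
    original = coherence(utterances, window)
    better = sum(
        1
        for perm in permutations(utterances)
        if coherence(list(perm), window) > original
    )
    return better + 1


# Each utterance is a set of entities; entity extraction is assumed done.
text = [
    {"john", "store"},
    {"john", "milk"},
    {"milk", "fridge"},
    {"fridge", "kitchen"},
]

print(coherence(text))                                   # 3.0
print(coherence([text[0], text[2], text[1], text[3]]))   # 2.0
print(rank_of_original(text))                            # 1
```

Because the score looks at all entities within a window rather than only the most salient entity of adjacent utterances, a shuffled order that separates related utterances still receives partial credit, which is the intuition the abstract describes. Enumerating all permutations is only feasible for very short texts.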

Milan’s M.Sc. Thesis Practice Talk

November 20th, 2009

Time: 1:30pm November 20, Friday
Place: TASC 9204 (west side)

Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE

Abstract of the talk:

Centering Theory determines the coherence of a text by identifying the most salient entities in adjacent utterances and observing how they change in salience. This thesis extends Centering Theory by tracking all the entities of the utterance in order to improve the measure of a text’s coherence. Accounting for all entities allows the utterance window to be expanded beyond adjacent utterances, which eliminates the difficulties and compromises associated with choosing either a sentence or a clause as the unit of analysis. Experiments show that tracking all entities instead of a single entity improves the evaluation of text coherence compared to traditional measures of coherence. The model proposed is motivated by linguistic principles and does not require training, alleviating the need for costly training data. The entity coherence model was evaluated on two tasks: sentence ordering and summarization. The sentence ordering experiment involved identifying the original text amongst a collection of its permutations.

Reza defends his Ph.D. Thesis

September 6th, 2009

Reza defends his Ph.D. thesis

Gholamreza Haffari successfully defended his Ph.D. thesis on August 17th, 2009. His thesis is entitled Machine Learning Approaches To Dealing with Limited Training Data in Statistical Machine Translation. His external examiner was Kevin Knight. His internal examiner was Oliver Schulte. His senior supervisor was Anoop Sarkar and his supervisory committee included: Greg Mori, Shaojun Wang, and Valentine Kabanets. Torsten Möller chaired the defense.

Here is the abstract of his thesis:

Statistical Machine Translation (SMT) models learn how to translate by examining a bilingual parallel corpus containing sentences aligned with their human-produced translations. However, high quality translation output is dependent on the availability of massive amounts of parallel text in the source and target languages. There are a large number of languages that are considered low-density, either because the population speaking the language is not very large, or because, even though millions of people speak the language, insufficient online resources are available in that language. This thesis covers machine learning approaches for dealing with such situations in statistical machine translation where the amount of available bilingual data is limited.

The problem of learning from insufficient labeled training data has been dealt with in the machine learning community under two general frameworks: (i) semi-supervised learning, and (ii) active learning. The complex nature of the machine translation task poses severe challenges to most of the algorithms developed in the machine learning community for these two learning scenarios. In this thesis, I develop semi-supervised learning as well as active learning algorithms for the Statistical Machine Translation task, specific to cases where there is a shortage of bilingual training data.

I assume that we are given access to a monolingual corpus containing a large number of sentences in the source language, in addition to a small or moderate sized bilingual corpus. The idea is to take advantage of this readily available monolingual data in building a better SMT model using bootstrapping-style methods: selecting an important subset of these monolingual sentences, preparing their translations, and using them together with the original sentence pairs to re-train the SMT model effectively. When preparing the translations of the selected sentences, if we use a human annotator, then the framework fits into the active learning scenario in machine learning. If we instead use the (system generated) translations from the output of the SMT model itself, then the framework fits into the semi-supervised learning scenario. The key points that I address throughout this thesis are (1) how to choose the important sentences, (2) how to provide their translations (with as little effort as possible), and (3) how to use the newly collected information in training the SMT model.
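
The bootstrapping loop in this abstract can be sketched in a few lines. Everything below is a toy under stated assumptions, not the thesis's algorithms: the "SMT model" is a word-to-word dictionary, sentence selection simply prefers sentences with the most unknown words, and the function names (`train`, `translate`, `select`, `bootstrap`) and the `annotate` parameter are hypothetical. The only point is to show how one loop covers both scenarios, depending on who supplies the translations.

```python
def train(pairs):
    """Toy stand-in for SMT training: a word-to-word dictionary
    read off position-aligned sentence pairs (hypothetical)."""
    model = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            model.setdefault(s, t)
    return model


def translate(model, sentence):
    """Translate word by word; unknown words are copied through unchanged."""
    return " ".join(model.get(w, w) for w in sentence.split())


def select(model, pool, k):
    """Key point (1): pick the k sentences with the most unknown words.
    This criterion is an assumption made for illustration."""
    def unknown(sentence):
        return sum(w not in model for w in sentence.split())
    return sorted(pool, key=unknown, reverse=True)[:k]


def bootstrap(bitext, pool, rounds, k, annotate=None):
    """annotate=None: semi-supervised (the model translates its own picks).
    annotate=callable: active learning (a human supplies translations)."""
    bitext, pool = list(bitext), list(pool)
    model = train(bitext)
    for _ in range(rounds):
        if not pool:
            break
        for sentence in select(model, pool, k):
            pool.remove(sentence)
            # Key point (2): where the translation comes from.
            target = annotate(sentence) if annotate else translate(model, sentence)
            bitext.append((sentence, target))
        # Key point (3): re-train on the enlarged bilingual corpus.
        model = train(bitext)
    return model


bitext = [("le chat", "the cat")]
pool = ["un chien", "le chien"]
human = {"un chien": "a dog"}.get  # stands in for a human annotator

active = bootstrap(bitext, pool, rounds=1, k=1, annotate=human)
print(translate(active, "le chien"))       # the dog

self_trained = bootstrap(bitext, pool, rounds=1, k=1)
print(translate(self_trained, "le chien"))  # the chien
```

In this toy, the semi-supervised run learns nothing new, because a dictionary model can only copy unknown words back to itself. A real SMT system can generalize from its own output, so the gap between the two runs here should not be read as a result.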

Lab meeting: GEAF 2009 (ACL workshop) practice talk

July 30th, 2009

In this week’s lab meeting, Yudong Liu gives a practice talk for her paper, which will appear at the Grammar Engineering Across Frameworks (GEAF) 2009 workshop at ACL-IJCNLP 2009.

Time: 11:30am July 30, Thursday
Place: our usual meeting room
Abstract of the talk:

Title: Exploration of the LTAG-Spinal Formalism and Treebank for Semantic Role Labeling

LTAG-spinal is a novel variant of traditional Lexicalized Tree Adjoining Grammar (LTAG) introduced by Shen (2006). The LTAG-spinal Treebank (Shen et al., 2008) combines elementary trees extracted from the Penn Treebank with Propbank annotation. In this paper, we present a semantic role labeling (SRL) system based on this new resource and provide an experimental comparison with CCGBank and a state-of-the-art SRL system based on Treebank phrase-structure trees. Deep linguistic information, such as predicate-argument relationships that are either implicit or absent from the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling.