Natural Language Processing (NLP) is the automatic analysis of human languages such as English, Korean, etc. by computer algorithms. Unlike programming languages, where the structure and meaning of programs are easy to encode, human languages pose an interesting challenge, both in terms of their analysis and the learning of language from observations.
Our lab is active in the areas of information extraction, machine translation, summarization, and statistical parsing. We have worked with many languages, including Arabic, Chinese, Czech, English, French, Hindi, Korean and Spanish.
We have about 10 to 15 students who are members of the Natural Language Lab at SFU. We have a strong relationship with the natural language industry in Canada, the National Research Council, and various research groups around the world and within SFU. In 1999, lab researchers formed a company, Axonwave Software Inc., which uses language technology software.
Interested in machines that can learn language, understand language, and translate language? We are always looking for talented graduate students. We also actively explore connections to industrial/commercial applications of our research through collaborative grants and/or internships.
Posted in News | No Comments »
November 20th, 2009
Time: Friday November 27th, 2009 1:30 p.m.
Place: TASC1 9204 West
Title: SEMANTIC ROLE LABELING USING LEXICALIZED TREE ADJOINING GRAMMARS
Abstract:
The meaning of a natural language sentence is typically conveyed through an event or action and the participants involved in it: the event is expressed as a verb (the predicate), and the participants are expressed as the arguments of that verb. The task of semantic role labeling (SRL) is to identify these predicate-argument structures (PAS) and to label the relation between the predicate and each of its arguments. It is an important intermediate step towards many natural language processing (NLP) applications, such as text summarization, question answering, and machine translation. Lexicalized Tree Adjoining Grammars (LTAGs), a tree-rewriting formalism, are well suited to the SRL task because of their Extended Domain of Locality (EDL).
Our work in this thesis focuses on the development and learning of state-of-the-art discriminative SRL systems with LTAGs. Our contributions include: (1) We proposed the use of the LTAG formalism as an important additional source of features for the semantic role labeling task. Our experiments show that, compared with the best known set of features used in state-of-the-art SRL systems, LTAG-based features can improve SRL performance significantly. (2) We explored a novel LTAG formalism, LTAG-spinal, and its treebank for the SRL task and demonstrated the utility of this new resource for SRL. Deep linguistic information, such as predicate-argument relationships that are either implicit or absent in the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling. (3) We applied a novel learning framework, Latent Support Vector Machines (LSVMs), to the SRL task by treating LTAG derivation trees as latent structures of Penn Treebank derived trees, which yielded a further improvement in SRL accuracy. Our work widens the possibility for the general applicability of LSVMs to other NLP tasks. In addition, the empirical success of our work is theoretically intriguing in terms of the advantage that online learning shows over batch learning under our LSVM framework.
Keywords: semantic role labeling (SRL), Lexicalized Tree Adjoining Grammars (LTAG), features, Latent Support Vector Machines (LSVM)
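To make the notion of a predicate-argument structure concrete, here is a minimal Python sketch of what an SRL system outputs for one sentence. The class names, the example sentence, and the PropBank-style role labels (ARG0, ARG1, ARGM-TMP) are illustrative assumptions, not taken from the thesis, and the analysis is written by hand rather than produced by a labeler.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Argument:
    role: str  # assumed PropBank-style label, e.g. "ARG0"
    text: str  # the span of words filling that role


@dataclass
class PredicateArgumentStructure:
    predicate: str
    arguments: List[Argument] = field(default_factory=list)

    def roles(self) -> Dict[str, str]:
        """Map each role label to the text of its argument."""
        return {arg.role: arg.text for arg in self.arguments}


# Hand-labeled analysis of the invented sentence
# "The lab hired two students yesterday."
pas = PredicateArgumentStructure(
    predicate="hired",
    arguments=[
        Argument("ARG0", "The lab"),        # who did the hiring
        Argument("ARG1", "two students"),   # who was hired
        Argument("ARGM-TMP", "yesterday"),  # when it happened
    ],
)

for role, text in pas.roles().items():
    print(f"{role:9} {text}")
```

An SRL system has to recover both halves of this structure: which spans are arguments of "hired", and which label each span receives.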
Posted in Events | No Comments »
November 20th, 2009
Time: Tuesday, November 24, 2009 10:45 a.m.
Place: TASC1 9204 West
Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION
Abstract:
Statistical machine translation (SMT) systems use statistical learning methods to learn how to translate from large amounts of parallel training data. Unfortunately, SMT systems are tuned to the domain of the training data and need to be adapted before they can be used to translate data in a different domain.
First, we consider a semi-supervised technique to perform model adaptation. We explore new feature extraction techniques, feature combinations and their effects on performance. In addition, we introduce an unsupervised variant of Minimum Error Rate Training (MERT), which can be used to tune the SMT model parameters. We do this by using another SMT model that translates in the reverse direction. We apply this variant of MERT to the model adaptation task. Both of the techniques we explore in this thesis produce promising results in exhaustive experiments we performed for translation from French to English in different domains.
Posted in Events | No Comments »
November 20th, 2009
Time: November 23rd at 12 noon, Monday
Place: TASC 9204
Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION
Posted in Uncategorized | No Comments »
November 20th, 2009
Time: Wednesday, November 25, 2009 1:00 p.m.
Place: TASC 1 9204 West
Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE
Abstract:
Centering Theory determines the coherence of a text by identifying the most salient entities in adjacent utterances and observing how they change in salience. This thesis extends Centering Theory by tracking all the entities of the utterance in order to improve the measure of a text’s coherence. Accounting for all entities allows the utterance window to be expanded beyond adjacent utterances, which eliminates the difficulties and compromises associated with choosing either a sentence or a clause as the unit of analysis. Experiments show that tracking all entities instead of a single entity improves the evaluation of text coherence compared to traditional measures of coherence. The model proposed is motivated by linguistic principles and does not require training, alleviating the need for costly training data. The entity coherence model was evaluated on two tasks: sentence ordering and summarization. The sentence ordering experiment involved identifying the original text amongst a collection of its permutations.
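The sentence ordering setup can be illustrated with a small Python sketch. It is not the model proposed in the thesis: as a stand-in, it scores a text by the mean Jaccard overlap between the entity sets of utterances that lie within a configurable window, and then compares the original order against all of its permutations. The function name `coherence`, the `window` parameter, and the hand-annotated entity sets are all assumptions made for illustration.

```python
from itertools import permutations
from typing import List, Set


def coherence(utterances: List[Set[str]], window: int = 2) -> float:
    """Mean Jaccard overlap of entity sets for utterances up to `window` apart."""
    scores = []
    for i in range(len(utterances)):
        for j in range(i + 1, min(i + window + 1, len(utterances))):
            union = utterances[i] | utterances[j]
            shared = utterances[i] & utterances[j]
            scores.append(len(shared) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hand-annotated entity sets for four utterances of an invented text.
original = [
    {"lab", "students"},
    {"students", "thesis"},
    {"thesis", "committee"},
    {"committee"},
]

# Sentence-ordering setup: score the original against every permutation.
ranked = sorted(
    permutations(original),
    key=lambda order: coherence(list(order), window=1),
    reverse=True,
)
print(coherence(original, window=1))
print(coherence(list(ranked[0]), window=1))
```

Like the model described in the abstract, this toy measure uses every entity in each utterance and needs no training. Because the overlap is symmetric, it cannot tell a text from its exact reverse, which is one of several simplifications in this stand-in.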
Posted in Events | No Comments »
November 20th, 2009
-Time: 1:30pm November 20, Friday
-Place: TASC 9204 (west side)
Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE
Posted in Events | No Comments »
September 6th, 2009

Gholamreza Haffari successfully defended his Ph.D. thesis on August 17th, 2009. His thesis is entitled "Machine Learning Approaches To Dealing with Limited Training Data in Statistical Machine Translation". His external examiner was Kevin Knight. His internal examiner was Oliver Schulte. His senior supervisor was Anoop Sarkar, and his supervisory committee included Greg Mori, Shaojun Wang, and Valentine Kabanets. Torsten Möller chaired the defense.
Here is the abstract of his thesis:
Statistical Machine Translation (SMT) models learn how to translate by examining a bilingual parallel corpus containing sentences aligned with their human-produced translations. However, high-quality translation output depends on the availability of massive amounts of parallel text in the source and target languages. A large number of languages are considered low-density, either because the population speaking the language is not very large or because, even though millions of people speak the language, insufficient online resources are available in it. This thesis covers machine learning approaches for dealing with such situations in statistical machine translation, where the amount of available bilingual data is limited.
The problem of learning from insufficient labeled training data has been addressed in the machine learning community under two general frameworks: (i) semi-supervised learning and (ii) active learning. The complex nature of the machine translation task poses severe challenges to most of the algorithms developed in the machine learning community for these two learning scenarios. In this thesis, I develop semi-supervised learning as well as active learning algorithms to deal with the shortage of bilingual training data for the statistical machine translation task.
I assume that we are given access to a monolingual corpus containing a large number of sentences in the source language, in addition to a small or moderately sized bilingual corpus. The idea is to take advantage of this readily available monolingual data to build a better SMT model using bootstrapping-style methods: by selecting an important subset of these monolingual sentences, preparing their translations, and using them together with the original sentence pairs to re-train the SMT model. If we use a human annotator to prepare the translations of the selected sentences, the framework fits the active learning scenario in machine learning. If we instead use the system-generated translations output by the SMT model itself, the framework fits the semi-supervised learning scenario. The key points that I address throughout this thesis are (1) how to choose the important sentences, (2) how to provide their translations (with as little effort as possible), and (3) how to use the newly collected information in training the SMT model.
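The bootstrapping loop described in the abstract can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the thesis's algorithms: `train` is a toy word-for-word lexicon standing in for SMT training, `select` simply prefers sentences with the most unseen words, and the names `bootstrap`, `human`, `machine`, and `ORACLE`, along with the tiny French-English data, are all hypothetical. The only thing that changes between the two scenarios is the `translate` function passed in.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Model = Dict[str, str]


def train(pairs: List[Pair]) -> Model:
    """Toy stand-in for SMT training: a word-for-word lexicon."""
    lexicon: Model = {}
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if len(s) == len(t):  # only use pairs that align position by position
            lexicon.update(zip(s, t))
    return lexicon


def select(model: Model, pool: List[str], k: int) -> List[str]:
    """(1) Choose the k sentences with the most words the model has not seen."""
    def unseen(sentence: str) -> int:
        return sum(word not in model for word in sentence.split())
    return sorted(pool, key=unseen, reverse=True)[:k]


def bootstrap(bitext: List[Pair], monolingual: List[str],
              translate: Callable[[Model, str], str],
              rounds: int = 2, batch_size: int = 1) -> Tuple[Model, List[Pair]]:
    bitext, pool = list(bitext), list(monolingual)
    model = train(bitext)
    for _ in range(rounds):
        if not pool:
            break
        chosen = select(model, pool, batch_size)
        # (2) Obtain translations for the chosen sentences.
        bitext += [(src, translate(model, src)) for src in chosen]
        pool = [s for s in pool if s not in chosen]
        # (3) Retrain on the enlarged corpus.
        model = train(bitext)
    return model, bitext


# Hypothetical human translations, keyed by source sentence.
ORACLE = {"le chien": "the dog", "un chat dort": "a cat sleeps", "le chat": "the cat"}


def human(model: Model, src: str) -> str:
    """Active learning: ask an annotator (simulated by a lookup)."""
    return ORACLE[src]


def machine(model: Model, src: str) -> str:
    """Semi-supervised learning: use the model's own output."""
    return " ".join(model.get(word, word) for word in src.split())


seed = [("le chat", "the cat")]
mono = ["le chien", "un chat dort", "le chat"]

_, al_bitext = bootstrap(seed, mono, human)
_, ss_bitext = bootstrap(seed, mono, machine)
print(al_bitext)
print(ss_bitext)
```

With the `machine` translator, untranslated words are copied through unchanged, so the self-labeled pairs are noisier than the human ones. That contrast is the practical difference between the two scenarios the abstract describes.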
Posted in News | No Comments »
July 30th, 2009
In this week’s lab meeting, Yudong Liu will give a practice talk for her paper, which is to appear in the Grammar Engineering Across Frameworks (GEAF) 2009 workshop at ACL-IJCNLP 2009.
-Time: 11:30am July 30, Thursday
-Place: our usual meeting room
-Abstract of the talk:
Title: Exploration of the LTAG-Spinal Formalism and Treebank for Semantic Role Labeling
LTAG-spinal is a novel variant of traditional Lexicalized Tree Adjoining Grammar (LTAG) introduced by Shen (2006). The LTAG-spinal Treebank (Shen et al., 2008) combines elementary trees extracted from the Penn Treebank with Propbank annotation. In this paper, we present a semantic role labeling (SRL) system based on this new resource and provide an experimental comparison with CCGBank and a state-of-the-art SRL system based on Treebank phrase-structure trees. Deep linguistic information, such as predicate-argument relationships that are either implicit or absent in the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling.
Posted in Lab Meeting | No Comments »
July 27th, 2009
Reza will give a practice talk for his ACL-IJCNLP 2009 presentation.
Where: TASC1 9408
Date and Time: Tue 7/28 1pm
Active Learning for Multilingual Statistical Machine Translation.
Gholamreza Haffari and Anoop Sarkar.
In Proceedings of the 47th annual meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009). Singapore, August 2-7, 2009.
Abstract:
Statistical machine translation (SMT) models require bilingual corpora for training, and these corpora are often multilingual, with parallel text in multiple languages simultaneously. We introduce an active learning task of adding a new language to an existing multilingual set of parallel text and constructing high-quality MT systems from each language in the collection into this new target language. We show that adding a new language using active learning to the EuroParl corpus provides a significant improvement compared to a random sentence selection baseline. We also provide new, highly effective sentence selection methods that improve active learning (AL) for phrase-based SMT in the multilingual and single language pair settings.
Posted in Lab Meeting | No Comments »
July 15th, 2009
The natlang lab meeting on Thu 7/16/2009 will be in TASC1 9408 at 11:30am.
Manaal Faruqui will be presenting the work he has done during his summer-long MITACS Globalink internship in our lab. Manaal is an undergraduate student at IIT Kharagpur and spent the summer with our lab working on machine translation, collaborating with Baskaran Sankaran and Anoop Sarkar on model adaptation for hierarchical phrase-based machine translation.
Title: Model Adaptation in Statistical Machine Translation for Synchronous Context‐Free Grammars
Abstract:
Hierarchical phrase-based translation uses context-free grammar rules to decode source sentences into target language sentences. We derive translation rules from a parallel corpus and then use them to translate sentences. In effect, we have a large table containing source and target language phrase pairs composed of terminals and non-terminals.
In this talk we go deeper and explore the case where we have a limited amount of in-domain parallel text for deriving the rules of the given language pair. We assume the availability of a very large amount of out-of-domain parallel text for the same language pair. So we have a large parallel corpus for out-of-domain data and a small parallel corpus for the in-domain data. We carry out experiments to improve the translation score by training our system on the rules from the two domains.
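As a rough illustration of the setting, the Python sketch below represents two small rule tables, one in-domain and one out-of-domain, and combines them by linear interpolation. Linear interpolation is one common way to mix models from two domains; the talk abstract does not say which combination method was used, so this technique, the function name `interpolate`, the `weight` value, and the toy rules and probabilities are all assumptions.

```python
from typing import Dict, Tuple

Rule = Tuple[str, str]  # (source side, target side); X1, X2 are non-terminals


def interpolate(in_domain: Dict[Rule, float],
                out_domain: Dict[Rule, float],
                weight: float = 0.8) -> Dict[Rule, float]:
    """Mix two rule tables: p = weight * p_in + (1 - weight) * p_out."""
    rules = set(in_domain) | set(out_domain)
    return {
        rule: weight * in_domain.get(rule, 0.0)
        + (1.0 - weight) * out_domain.get(rule, 0.0)
        for rule in rules
    }


# Invented rule probabilities p(target | source) for each domain.
in_domain = {
    ("X1 de X2", "X1 of X2"): 0.7,
    ("X1 de X2", "X2 's X1"): 0.3,
}
out_domain = {
    ("X1 de X2", "X1 of X2"): 0.5,
    ("X1 de X2", "X2 's X1"): 0.5,
    ("ne X1 pas", "do not X1"): 1.0,
}

mixed = interpolate(in_domain, out_domain)
for rule, p in sorted(mixed.items()):
    print(rule, round(p, 3))
```

A high `weight` lets the small in-domain table dominate where it has evidence, while rules seen only in the large out-of-domain table remain available with a reduced score. This sketch does not renormalise the mixed scores, which a real system would need to consider.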
Posted in Lab Meeting | No Comments »
July 8th, 2009
We will meet in the lab meeting room (TASC1 9408) on Thu 7/9 to discuss the NAACL HLT 2009 papers. Please bring 1-2 papers that you wish to highlight with an overview and discussion of the significance of the paper.
Posted in Lab Meeting | No Comments »