Natural Language Processing (NLP) is the automatic analysis of human languages such as English and Korean by computer algorithms. Unlike programming languages, where the structure and meaning of programs are easy to encode, human languages pose an interesting challenge, both in their analysis and in the learning of language from observations.
Our lab is active in the areas of information extraction, machine translation, summarization, and statistical parsing. We have worked with many languages, including Arabic, Chinese, Czech, English, French, Hindi, Korean and Spanish.
About 10 to 15 students are members of the Natural Language Lab at SFU. We have a strong relationship with the natural language industry in Canada, the National Research Council, and various research groups around the world and within SFU. In 1999, lab researchers formed a company, Axonwave Software Inc., which uses language technology software.
Interested in machines that can learn language, understand language, and translate language? We are always looking for talented graduate students. We also actively explore connections to industrial/commercial applications of our research through collaborative grants and/or internships.
Posted in News | No Comments »
February 26th, 2010
Time: Tuesday March 2nd, 2010 12:30 p.m.
Place: Bennett Library Rm 2020
Title: APPROACHES TO HANDLE SCARCE RESOURCES FOR BENGALI STATISTICAL MACHINE TRANSLATION
Abstract:
Machine translation (MT) is a hard problem because of the highly complex, irregular and diverse nature of natural language. MT refers to computerized systems that use software to translate text from one natural language into another, with or without human assistance. It is impossible to accurately model all the linguistic rules and relationships that shape the translation process, and therefore MT has to make decisions based on incomplete data. A principled way to handle this is to use statistical methods to make optimal decisions given incomplete data. Statistical machine translation (SMT) uses a probabilistic framework to automatically translate text from one language to another. Using the co-occurrence counts of words and phrases in bilingual parallel corpora, in which sentences are aligned with their translations, SMT learns the translation of words and phrases.
We apply SMT techniques to translation between Bengali and English. SMT systems require a significant amount of bilingual data for a language pair to achieve good translation accuracy. However, because Bengali is a low-density language, such resources are not available for it. In this thesis, we therefore investigate language-independent and language-dependent techniques that can help improve the translation accuracy of Bengali SMT systems.
We explore a transliteration module and modules for handling prepositions and Bengali compound words in the context of Bengali-to-English SMT. We further look into semi-supervised and active learning techniques for Bengali SMT to deal with scarce resources.
Because Bengali and English have different word orders, we also propose several syntactic phrase reordering techniques for Bengali SMT. In addition, we contributed to Bengali SMT by creating a new test set and lexicon and by developing Bengali text processing tools such as a tokenizer, a sentence segmenter, and a morphological analyzer.
Overall, the main objective of this thesis is to contribute to Bengali language processing, provide a general foundation for conducting research in Bengali SMT, and improve the quality of Bengali SMT.
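To make the idea of learning word translations from co-occurrence counts concrete, here is a minimal sketch. It is not the method used in the thesis: it simply counts how often a source word and a target word appear in the same aligned sentence pair and normalizes the counts. The function name `cooccurrence_translation_table` and the three-sentence French-English corpus are invented for illustration; real SMT systems use alignment models rather than raw sentence-level counts.

```python
from collections import Counter, defaultdict


def cooccurrence_translation_table(parallel_corpus):
    """Estimate p(target_word | source_word) from sentence-level co-occurrence.

    parallel_corpus is a list of (source_sentence, target_sentence) pairs.
    This is a toy relative-frequency estimate, not a full alignment model.
    """
    pair_counts = defaultdict(Counter)
    for source_sentence, target_sentence in parallel_corpus:
        source_words = set(source_sentence.split())
        target_words = set(target_sentence.split())
        # Count each (source word, target word) pair once per sentence pair.
        for s in source_words:
            for t in target_words:
                pair_counts[s][t] += 1

    table = {}
    for s, counts in pair_counts.items():
        total = sum(counts.values())
        table[s] = {t: c / total for t, c in counts.items()}
    return table


# Hypothetical toy corpus (invented data, not from the thesis).
toy_corpus = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("maison bleue", "blue house"),
]

table = cooccurrence_translation_table(toy_corpus)
best = max(table["maison"], key=table["maison"].get)
print(best, table["maison"][best])
```

With more sentence pairs, the words that are true translations of each other tend to co-occur more consistently than unrelated words, which is the signal the counts are meant to capture.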
Posted in Events | No Comments »
February 26th, 2010
Time: Monday March 1st, 2010 10:30 a.m.
Place: TASC1 9204 West
Title: APPROACHES TO HANDLE SCARCE RESOURCES FOR BENGALI STATISTICAL MACHINE TRANSLATION
Abstract:
Machine translation (MT) is a hard problem because of the highly complex, irregular and diverse nature of natural language. MT refers to computerized systems that use software to translate text from one natural language into another, with or without human assistance. It is impossible to accurately model all the linguistic rules and relationships that shape the translation process, and therefore MT has to make decisions based on incomplete data. A principled way to handle this is to use statistical methods to make optimal decisions given incomplete data. Statistical machine translation (SMT) uses a probabilistic framework to automatically translate text from one language to another. Using the co-occurrence counts of words and phrases in bilingual parallel corpora, in which sentences are aligned with their translations, SMT learns the translation of words and phrases.
We apply SMT techniques to translation between Bengali and English. SMT systems require a significant amount of bilingual data for a language pair to achieve good translation accuracy. However, because Bengali is a low-density language, such resources are not available for it. In this thesis, we therefore investigate language-independent and language-dependent techniques that can help improve the translation accuracy of Bengali SMT systems.
We explore a transliteration module and modules for handling prepositions and Bengali compound words in the context of Bengali-to-English SMT. We further look into semi-supervised and active learning techniques for Bengali SMT to deal with scarce resources.
Because Bengali and English have different word orders, we also propose several syntactic phrase reordering techniques for Bengali SMT. In addition, we contributed to Bengali SMT by creating a new test set and lexicon and by developing Bengali text processing tools such as a tokenizer, a sentence segmenter, and a morphological analyzer.
Overall, the main objective of this thesis is to contribute to Bengali language processing and provide a general foundation for conducting research in Bengali SMT.
Posted in Events | No Comments »
January 26th, 2010

Analysis of statistical machine translation into Finnish
Associate Professor Anoop Sarkar recently won a Google Research Award for research into statistical machine translation.
The Google Research Awards are for faculty around the world who pursue innovative research that is relevant to Google’s mission, to “organize the world’s information and make it universally accessible and useful.” The award consists of substantial grant money, and the opportunity to share in Google’s facilities and give talks related to the research.
Thanks to Google’s generous funding, members of the Natural Language Lab at SFU will be tackling the problem of translation into languages that are very different from English. Two lab members, Ph.D. student Baskaran Sankaran and Master's student Ann Clifton, are currently developing such methods as part of their thesis work.
Many of the recently developed statistical machine translation systems focus on translation into English. The machine learning algorithms used for machine translation are implicitly biased towards producing output in English and similar languages. They are much weaker at dealing with translation from English into a target language with very different grammar and word formation rules. New algorithms and models that augment existing statistical machine translation models are needed to deal with these issues.
Posted in News | No Comments »
January 26th, 2010
The following paper was accepted for publication as a long paper at the North American ACL conference: NAACL HLT 2010.
Latent SVMs for Semantic Role Labeling using LTAG Derivation Trees
Authors: Yudong Liu, Gholamreza Haffari and Anoop Sarkar
Posted in Publications | No Comments »
January 20th, 2010

Yudong defends her Ph.D. thesis
Yudong Liu successfully defended her Ph.D. thesis on November 30th, 2009. Her thesis is entitled Semantic Role Labeling using Tree Adjoining Grammars. Her external examiner was Kristina Toutanova. The internal examiner was Fred Popowich. Her senior supervisor was Anoop Sarkar, and her supervisory committee included Eugenia Ternovska. Jim Delgrande chaired the defense.
Here is the abstract of her thesis:
The meaning of a natural language sentence is typically conveyed through an event or action and the participants involved in it; the event is expressed as a verb (predicate), and the participants are expressed as the arguments of the verb. The task of semantic role labeling (SRL) is to identify the predicate-argument structures (PAS) and label the relations between the predicate and each of its arguments. It is an important intermediate step towards many natural language processing (NLP) applications, such as text summarization, question answering, and machine translation. Lexicalized Tree Adjoining Grammars (LTAGs), a tree rewriting formalism, are well suited to the SRL task because of their Extended Domain of Locality (EDL).
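As a small illustration of what a predicate-argument structure looks like as data, the sketch below represents one labeled sentence. It shows only the output of SRL, not how a system predicts it. The class names, the example sentence, and the PropBank-style role labels (ARG0, ARG1, ARG2) are assumptions chosen for illustration and do not come from the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Argument:
    role: str   # illustrative PropBank-style label, e.g. "ARG0"
    start: int  # index of the first token in the span (inclusive)
    end: int    # index one past the last token in the span (exclusive)


@dataclass
class PredicateArgumentStructure:
    tokens: list
    predicate_index: int
    arguments: list = field(default_factory=list)

    def predicate(self):
        """Return the predicate token (the verb expressing the event)."""
        return self.tokens[self.predicate_index]

    def labeled_spans(self):
        """Map each role label to the text of its argument span."""
        return {a.role: " ".join(self.tokens[a.start:a.end]) for a in self.arguments}


# Hypothetical example sentence with hand-assigned labels.
tokens = "Mary sold the book to John".split()
pas = PredicateArgumentStructure(
    tokens=tokens,
    predicate_index=1,
    arguments=[
        Argument("ARG0", 0, 1),  # the participant doing the selling
        Argument("ARG1", 2, 4),  # the thing sold
        Argument("ARG2", 4, 6),  # the recipient
    ],
)

print(pas.predicate(), pas.labeled_spans())
```

An SRL system takes the tokens (and usually a parse) as input and has to recover both the argument spans and their labels for each predicate.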
Our work in this thesis focuses mainly on the development and learning of state-of-the-art discriminative SRL systems with LTAGs. Our contributions include:
- We proposed the use of the LTAG formalism as an important additional source of features for the semantic role labeling task. Our experiments show that, compared with the best known set of features used in state-of-the-art SRL systems, LTAG-based features can improve SRL performance significantly.
- We explored a novel LTAG formalism, LTAG-spinal, and its treebank for the SRL task and demonstrated the utility of this new resource for SRL. Deep linguistic information, such as predicate-argument relationships that are either implicit in or absent from the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling.
- We applied a novel learning framework, Latent Support Vector Machines (LSVMs), to the SRL task by treating LTAG derivation trees as latent structures of Penn Treebank derived trees, which further improved SRL accuracy. Our work opens up the possibility of applying LSVMs more generally to other NLP tasks.
Posted in Uncategorized | No Comments »
November 20th, 2009
Time: Friday November 27th, 2009 1:30 p.m.
Place: TASC1 9204 West
Title: SEMANTIC ROLE LABELING USING LEXICALIZED TREE ADJOINING GRAMMARS
Abstract:
The meaning of a natural language sentence is typically conveyed through an event or action and the participants involved in it; the event is expressed as a verb (predicate), and the participants are expressed as the arguments of the verb. The task of semantic role labeling (SRL) is to identify the predicate-argument structures (PAS) and label the relations between the predicate and each of its arguments. It is an important intermediate step towards many natural language processing (NLP) applications, such as text summarization, question answering, and machine translation. Lexicalized Tree Adjoining Grammars (LTAGs), a tree rewriting formalism, are well suited to the SRL task because of their Extended Domain of Locality (EDL).
Our work in this thesis focuses mainly on the development and learning of state-of-the-art discriminative SRL systems with LTAGs. Our contributions include: (1) We proposed the use of the LTAG formalism as an important additional source of features for the semantic role labeling task. Our experiments show that, compared with the best known set of features used in state-of-the-art SRL systems, LTAG-based features can improve SRL performance significantly. (2) We explored a novel LTAG formalism, LTAG-spinal, and its treebank for the SRL task and demonstrated the utility of this new resource for SRL. Deep linguistic information, such as predicate-argument relationships that are either implicit in or absent from the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling. (3) We applied a novel learning framework, Latent Support Vector Machines (LSVMs), to the SRL task by treating LTAG derivation trees as latent structures of Penn Treebank derived trees, which further improved SRL accuracy. Our work opens up the possibility of applying LSVMs more generally to other NLP tasks. In addition, the empirical success of our work is theoretically intriguing because of the advantage that online learning shows over batch learning under our LSVM framework.
Keywords: semantic role labeling (SRL), Lexicalized Tree Adjoining Grammars (LTAG), features, Latent Support Vector Machines (LSVM)
Posted in Events | No Comments »
November 20th, 2009
Time: Tuesday, November 24, 2009 10:45 a.m.
Place: TASC1 9204 West
Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION
Abstract:
Statistical machine translation (SMT) systems use statistical learning methods to learn how to translate from large amounts of parallel training data. Unfortunately, SMT systems are tuned to the domain of the training data and need to be adapted before they can be used to translate data in a different domain.
First, we consider a semi-supervised technique to perform model adaptation. We explore new feature extraction techniques, feature combinations and their effects on performance. In addition, we introduce an unsupervised variant of Minimum Error Rate Training (MERT), which can be used to tune the SMT model parameters. We do this by using another SMT model that translates in the reverse direction. We apply this variant of MERT to the model adaptation task. Both of the techniques we explore in this thesis produce promising results in exhaustive experiments we performed for translation from French to English in different domains.
Posted in Events | 1 Comment »
November 20th, 2009
Time: November 23rd at 12 noon, Monday
Place: TASC 9204
Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION
Abstract:
Statistical machine translation (SMT) systems use statistical learning methods to learn how to translate from large amounts of parallel training data. Unfortunately, SMT systems are tuned to the domain of the training data and need to be adapted before they can be used to translate data in a different domain.
First, we consider a semi-supervised technique to perform model adaptation. We explore new feature extraction techniques, feature combinations and their effects on performance. In addition, we introduce an unsupervised variant of Minimum Error Rate Training (MERT), which can be used to tune the SMT model parameters. We do this by using another SMT model that translates in the reverse direction. We apply this variant of MERT to the model adaptation task. Both of the techniques we explore in this thesis produce promising results in exhaustive experiments we performed for translation from French to English in different domains.
Posted in Events | No Comments »
November 20th, 2009
Time: Wednesday, November 25, 2009 1:00 p.m.
Place: TASC 1 9204 West
Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE
Abstract:
Centering Theory determines the coherence of a text by identifying the most salient entities in adjacent utterances and observing how they change in salience. This thesis extends Centering Theory by tracking all the entities of the utterance in order to improve the measure of a text’s coherence. Accounting for all entities allows the utterance window to be expanded beyond adjacent utterances, which eliminates the difficulties and compromises associated with choosing either a sentence or a clause as the unit of analysis. Experiments show that tracking all entities instead of a single entity improves the evaluation of text coherence compared to traditional measures of coherence. The model proposed is motivated by linguistic principles and does not require training, alleviating the need for costly training data. The entity coherence model was evaluated on two tasks: sentence ordering and summarization. The sentence ordering experiment involved identifying the original text amongst a collection of its permutations.
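The sentence ordering task can be illustrated with a toy entity-overlap score. The sketch below is not the coherence model proposed in the thesis: it just counts the entities shared between each utterance and the utterances that follow it within a window, then checks which orderings of the text score highest. The function names, the `window` parameter, and the four entity sets are all invented for illustration.

```python
from itertools import permutations


def coherence_score(utterance_entities, window=1):
    """Count entities shared between each utterance and the next `window` utterances.

    utterance_entities is a list of sets, one set of entity names per utterance.
    window=1 compares adjacent utterances only; larger values look further ahead.
    """
    score = 0
    for i, current in enumerate(utterance_entities):
        last = min(i + 1 + window, len(utterance_entities))
        for j in range(i + 1, last):
            score += len(current & utterance_entities[j])
    return score


def best_orderings(utterance_entities, window=1):
    """Return every ordering (as a tuple of indices) with the highest score."""
    scored = {}
    for order in permutations(range(len(utterance_entities))):
        reordered = [utterance_entities[i] for i in order]
        scored[order] = coherence_score(reordered, window)
    top = max(scored.values())
    return [order for order, s in scored.items() if s == top]


# Hypothetical entity sets for a four-utterance text, in original order.
original = [
    {"lab", "students"},
    {"lab", "thesis"},
    {"thesis", "model"},
    {"model", "experiments"},
]

shuffled = [original[i] for i in (0, 2, 1, 3)]
print(coherence_score(original), coherence_score(shuffled))
print(best_orderings(original))
```

Because this toy score is symmetric, the reversed text ties with the original order; a model that tracks how entities change in salience would need more information than raw overlap to separate the two.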
Posted in Events | 1 Comment »
November 20th, 2009
Time: Friday, November 20, 1:30 p.m.
Place: TASC 9204 (west side)
Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE
Abstract of the talk:
Centering Theory determines the coherence of a text by identifying the most salient entities in adjacent utterances and observing how they change in salience. This thesis extends Centering Theory by tracking all the entities of the utterance in order to improve the measure of a text’s coherence. Accounting for all entities allows the utterance window to be expanded beyond adjacent utterances, which eliminates the difficulties and compromises associated with choosing either a sentence or a clause as the unit of analysis. Experiments show that tracking all entities instead of a single entity improves the evaluation of text coherence compared to traditional measures of coherence. The model proposed is motivated by linguistic principles and does not require training, alleviating the need for costly training data. The entity coherence model was evaluated on two tasks: sentence ordering and summarization. The sentence ordering experiment involved identifying the original text amongst a collection of its permutations.
Posted in Events | Comments Off