Simon Fraser University

Welcome to the SFU Natural Language Laboratory

Natural Language Processing (NLP) is the automatic analysis of human languages such as English and Korean by computer algorithms. Unlike programming languages, where the structure and meaning of programs are easy to encode, human languages pose an interesting challenge, both in their analysis and in the learning of language from observations.

Our lab is active in the areas of information extraction, machine translation, summarization, and statistical parsing. We have worked with many languages, including Arabic, Chinese, Czech, English, French, Hindi, Korean and Spanish.

About 10 to 15 students are members of the Natural Language Lab at SFU. We have a strong relationship with the natural language industry in Canada, the National Research Council, and various research groups around the world and within SFU. In 1999, lab researchers formed a company, Axonwave Software Inc., which uses language technology software.

Interested in machines that can learn language, understand language, and translate language? We are always looking for talented graduate students. We also actively explore connections to industrial/commercial applications of our research through collaborative grants and/or internships.

Ph.D. Thesis Defense - Maxim Roy

February 26th, 2010

Time: Tuesday March 2nd, 2010 12:30 p.m.
Place: Bennett Library Rm 2020

Title: APPROACHES TO HANDLE SCARCE RESOURCES FOR BENGALI STATISTICAL MACHINE TRANSLATION

Abstract:

Machine translation (MT) is a hard problem because of the highly complex, irregular and diverse nature of natural language. MT refers to computerized systems that use software to translate text from one natural language into another, with or without human assistance. It is impossible to accurately model all the linguistic rules and relationships that shape the translation process, and therefore MT has to make decisions based on incomplete data. A principled approach to this problem is to use statistical methods to make optimum decisions given the incomplete data. Statistical machine translation (SMT) uses a probabilistic framework to automatically translate text from one language to another. Using the co-occurrence counts of words and phrases in bilingual parallel corpora, where sentences are aligned with their translations, SMT learns the translations of words and phrases.
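As a rough illustration of learning translations from co-occurrence counts, the following minimal sketch counts how often words appear together in aligned sentence pairs and normalizes the counts into relative frequencies. The three-sentence English-German corpus and the function names are invented for this example and are not taken from the thesis. Real SMT systems estimate word alignments with iterative models rather than raw co-occurrence, so this is a simplification under those assumptions.

```python
from collections import Counter, defaultdict

# Toy sentence-aligned corpus (invented for illustration, not real training data).
parallel_corpus = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]


def cooccurrence_counts(corpus):
    """Count how often each source word shares a sentence pair with each target word."""
    counts = defaultdict(Counter)
    for source, target in corpus:
        for s in source.split():
            for t in target.split():
                counts[s][t] += 1
    return counts


def translation_table(counts):
    """Normalize the counts for each source word into relative frequencies."""
    table = {}
    for s, targets in counts.items():
        total = sum(targets.values())
        table[s] = {t: c / total for t, c in targets.items()}
    return table


counts = cooccurrence_counts(parallel_corpus)
table = translation_table(counts)

# Pick the target word that co-occurs most often with "book".
best = max(table["book"], key=table["book"].get)
print(best, table["book"][best])
```

In this toy corpus "book" co-occurs with "buch" in both of its sentence pairs, so "buch" receives the largest share of the counts.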

We apply SMT techniques to translation between Bengali and English. SMT systems require a significant amount of bilingual data for a language pair to achieve good translation accuracy. However, Bengali is a low-density language, and such resources are not available for it. In this thesis, we therefore investigate language-independent and language-dependent techniques that can help improve the translation accuracy of Bengali SMT systems.

We explore a transliteration module, a prepositional handling module and a Bengali compound word handling module in the context of Bengali-to-English SMT. Further, we look into semi-supervised and active learning techniques in Bengali SMT to deal with scarce resources.

Because Bengali and English have different word orders, we also propose several syntactic phrase reordering techniques for Bengali SMT. We further contribute to Bengali SMT by creating a new test set and lexicon, and by developing Bengali text processing tools such as a tokenizer, a sentence segmenter, and a morphological analyzer.

Overall, the main objective of this thesis is to contribute to Bengali language processing, provide a general foundation for conducting research in Bengali SMT, and improve the quality of Bengali SMT.

Ph.D. Thesis Seminar - Maxim Roy

February 26th, 2010

Time: Monday March 1st, 2010 10:30 a.m.
Place: TASC1 9204 West

Title: APPROACHES TO HANDLE SCARCE RESOURCES FOR BENGALI STATISTICAL MACHINE TRANSLATION

Abstract:

Machine translation (MT) is a hard problem because of the highly complex, irregular and diverse nature of natural language. MT refers to computerized systems that use software to translate text from one natural language into another, with or without human assistance. It is impossible to accurately model all the linguistic rules and relationships that shape the translation process, and therefore MT has to make decisions based on incomplete data. A principled approach to this problem is to use statistical methods to make optimum decisions given the incomplete data. Statistical machine translation (SMT) uses a probabilistic framework to automatically translate text from one language to another. Using the co-occurrence counts of words and phrases in bilingual parallel corpora, where sentences are aligned with their translations, SMT learns the translations of words and phrases.

We apply SMT techniques to translation between Bengali and English. SMT systems require a significant amount of bilingual data for a language pair to achieve good translation accuracy. However, Bengali is a low-density language, and such resources are not available for it. In this thesis, we therefore investigate language-independent and language-dependent techniques that can help improve the translation accuracy of Bengali SMT systems.

We explore a transliteration module, a prepositional handling module and a Bengali compound word handling module in the context of Bengali-to-English SMT. Further, we look into semi-supervised and active learning techniques in Bengali SMT to deal with scarce resources.

Because Bengali and English have different word orders, we also propose several syntactic phrase reordering techniques for Bengali SMT. We further contribute to Bengali SMT by creating a new test set and lexicon, and by developing Bengali text processing tools such as a tokenizer, a sentence segmenter, and a morphological analyzer.

Overall, the main objective of this thesis is to contribute to Bengali language processing and provide a general foundation for conducting research in Bengali SMT.

Google Research Award

January 26th, 2010
Analysis of statistical machine translation into Finnish

Associate Professor Anoop Sarkar recently won a Google Research Award for research into statistical machine translation.

The Google Research Awards are for faculty around the world who pursue innovative research that is relevant to Google’s mission, to “organize the world’s information and make it universally accessible and useful.” The award consists of substantial grant money, and the opportunity to share in Google’s facilities and give talks related to the research.

Thanks to Google’s generous funding, members of the Natural Language Lab at SFU will be tackling the problem of translation into languages that are very different from English. Two lab members, Ph.D. student Baskaran Sankaran and Master's student Ann Clifton, are currently developing such methods as part of their thesis work.

Many of the recently developed statistical machine translation systems focus on translation into English. The machine learning algorithms used for machine translation are implicitly biased towards producing output in English and similar languages. They are much weaker at dealing with translation from English into a target language with very different grammar and word formation rules. New algorithms and models that augment existing statistical machine translation models are needed to deal with these issues.

Accepted for publication: Latent SVMs for Semantic Role Labeling using LTAG Derivation Trees

January 26th, 2010

The following paper was accepted for publication as a long paper at the North American ACL conference: NAACL HLT 2010.

Latent SVMs for Semantic Role Labeling using LTAG Derivation Trees

Authors: Yudong Liu, Gholamreza Haffari and Anoop Sarkar

Yudong Liu defends her Ph.D. thesis

January 20th, 2010

Yudong Liu successfully defended her Ph.D. thesis on November 30th, 2009. Her thesis is entitled Semantic Role Labeling using Tree Adjoining Grammars. Her external examiner was Kristina Toutanova, and the internal examiner was Fred Popowich. Her senior supervisor was Anoop Sarkar, and her supervisory committee included Eugenia Ternovska. Jim Delgrande chaired the defense.

Here is the abstract of her thesis:

The meaning of a natural language sentence is typically conveyed through an event or action and its participants; the event is expressed as a verb (predicate), and the participants are expressed as the arguments of the verb. The task of semantic role labeling (SRL) is to identify the predicate-argument structures (PAS) and label the relations between the predicate and each of its arguments. It is an important intermediate step towards many natural language processing (NLP) applications, such as text summarization, question answering, and machine translation. Lexicalized Tree Adjoining Grammar (LTAG), a tree rewriting formalism, is attractive for the SRL task because of its property of Extended Domain of Locality (EDL).
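To make the notion of a predicate-argument structure concrete, here is a minimal sketch of how one labeled sentence could be represented. The class names, the example sentence and the PropBank-style role labels are assumptions chosen for illustration; they do not describe the data format used in the thesis.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    role: str   # PropBank-style label, e.g. "ARG0" (assumed labeling scheme)
    start: int  # index of the first token of the argument
    end: int    # index one past the last token of the argument


@dataclass
class PredicateArgumentStructure:
    tokens: list
    predicate_index: int
    arguments: list

    def labeled_spans(self):
        """Map each role label to the text of the span it covers."""
        return {a.role: " ".join(self.tokens[a.start:a.end]) for a in self.arguments}


tokens = "John gave Mary a book".split()
pas = PredicateArgumentStructure(
    tokens=tokens,
    predicate_index=1,  # "gave"
    arguments=[
        Argument("ARG0", 0, 1),  # the giver
        Argument("ARG2", 2, 3),  # the recipient
        Argument("ARG1", 3, 5),  # the thing given
    ],
)

print(tokens[pas.predicate_index], pas.labeled_spans())
```

An SRL system has to produce both parts of this structure: the argument spans for each predicate and the role label attached to each span.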

Our work in this thesis is mainly focused on the development and learning of state-of-the-art discriminative SRL systems with LTAGs. Our contributions include:

  1. We proposed the use of the LTAG formalism as an important additional source of features for the semantic role labeling task. Our experiments show that, compared with the best known set of features used in state-of-the-art SRL systems, LTAG-based features can improve SRL performance significantly.
  2. We explored a novel LTAG formalism, LTAG-spinal, and its treebank for the SRL task, and demonstrated the utility of this new resource for SRL. Deep linguistic information, such as predicate-argument relationships that are either implicit in or absent from the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling.
  3. We applied a novel learning framework, Latent Support Vector Machines (LSVMs), to the SRL task by treating LTAG derivation trees as latent structures of Penn Treebank derived trees, which yielded a further improvement in SRL accuracy. Our work widens the possibility for the general applicability of LSVMs to other NLP tasks.

Ph.D. Thesis Seminar - Yudong Liu

November 20th, 2009

Time: Friday November 27th, 2009 1:30 p.m.
Place: TASC1 9204 West

Title: SEMANTIC ROLE LABELING USING LEXICALIZED TREE ADJOINING GRAMMARS

Abstract:
The meaning of a natural language sentence is typically conveyed through an event or action and its participants; the event is expressed as a verb (predicate), and the participants are expressed as the arguments of the verb. The task of semantic role labeling (SRL) is to identify the predicate-argument structures (PAS) and label the relations between the predicate and each of its arguments. It is an important intermediate step towards many natural language processing (NLP) applications, such as text summarization, question answering, and machine translation. Lexicalized Tree Adjoining Grammar (LTAG), a tree rewriting formalism, is attractive for the SRL task because of its property of Extended Domain of Locality (EDL).

Our work in this thesis is mainly focused on the development and learning of state-of-the-art discriminative SRL systems with LTAGs. Our contributions include:

  1. We proposed the use of the LTAG formalism as an important additional source of features for the semantic role labeling task. Our experiments show that, compared with the best known set of features used in state-of-the-art SRL systems, LTAG-based features can improve SRL performance significantly.
  2. We explored a novel LTAG formalism, LTAG-spinal, and its treebank for the SRL task, and demonstrated the utility of this new resource for SRL. Deep linguistic information, such as predicate-argument relationships that are either implicit in or absent from the original Penn Treebank, is made explicit and accessible in the LTAG-spinal Treebank, which we show to be a useful resource for semantic role labeling.
  3. We applied a novel learning framework, Latent Support Vector Machines (LSVMs), to the SRL task by treating LTAG derivation trees as latent structures of Penn Treebank derived trees, which yielded a further improvement in SRL accuracy. Our work widens the possibility for the general applicability of LSVMs to other NLP tasks.

In addition, the empirical success of our work is theoretically intriguing in terms of the advantage that the online learning presented under our LSVM framework has over batch learning.

Keywords: semantic role labeling (SRL), Lexicalized Tree Adjoining Grammars (LTAG), features, Latent Support Vector Machines (LSVM)


M.Sc. Thesis Defence and Seminar - Ajeet Grewal

November 20th, 2009

Time: Tuesday, November 24, 2009 10:45 a.m.

Place: TASC1 9204 West

Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION

Abstract:

Statistical machine translation (SMT) systems use statistical learning methods to learn how to translate from large amounts of parallel training data. Unfortunately, SMT systems are tuned to the domain of the training data and need to be adapted before they can be used to translate data in a different domain.

First, we consider a semi-supervised technique to perform model adaptation. We explore new feature extraction techniques, feature combinations and their effects on performance. In addition, we introduce an unsupervised variant of Minimum Error Rate Training (MERT), which can be used to tune the SMT model parameters. We do this by using another SMT model that translates in the reverse direction. We apply this variant of MERT to the model adaptation task. Both of the techniques we explore in this thesis produce promising results in exhaustive experiments we performed for translation from French to English in different domains.

Ajeet’s M.Sc. Defence Practice Talk

November 20th, 2009

Time: Monday November 23rd, 12 noon
Place: TASC 9204

Title: MODEL ADAPTATION FOR STATISTICAL MACHINE TRANSLATION

Abstract:
Statistical machine translation (SMT) systems use statistical learning methods to learn how to translate from large amounts of parallel training data. Unfortunately, SMT systems are tuned to the domain of the training data and need to be adapted before they can be used to translate data in a different domain.

First, we consider a semi-supervised technique to perform model adaptation. We explore new feature extraction techniques, feature combinations and their effects on performance. In addition, we introduce an unsupervised variant of Minimum Error Rate Training (MERT), which can be used to tune the SMT model parameters. We do this by using another SMT model that translates in the reverse direction. We apply this variant of MERT to the model adaptation task. Both of the techniques we explore in this thesis produce promising results in exhaustive experiments we performed for translation from French to English in different domains.

M.Sc. Thesis Defence and Seminar - Milan Tofiloski

November 20th, 2009

Time: Wednesday, November 25, 2009 1:00 p.m.
Place: TASC 1 9204 West

Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE

Abstract:
Centering Theory determines the coherence of a text by identifying the most salient entities in adjacent utterances and observing how they change in salience. This thesis extends Centering Theory by tracking all the entities of the utterance in order to improve the measure of a text’s coherence. Accounting for all entities allows the utterance window to be expanded beyond adjacent utterances, which eliminates the difficulties and compromises associated with choosing either a sentence or a clause as the unit of analysis. Experiments show that tracking all entities instead of a single entity improves the evaluation of text coherence compared to traditional measures of coherence. The model proposed is motivated by linguistic principles and does not require training, alleviating the need for costly training data. The entity coherence model was evaluated on two tasks: sentence ordering and summarization. The sentence ordering experiment involved identifying the original text amongst a collection of its permutations.
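The sentence ordering task can be pictured with a small sketch: score every permutation of a text by how many entities are shared between sentences within a window, and check where the original order ranks. Everything here is a hypothetical stand-in, including the entity sets, the `coherence` function, the window size and the distance weighting; it is not the entity coherence model proposed in the thesis.

```python
from itertools import permutations

# Entities mentioned in each sentence of a four-sentence toy text (invented data).
sentence_entities = [
    {"alice", "dog"},
    {"dog", "park"},
    {"park", "ball"},
    {"ball", "bob"},
]


def coherence(order, entities, window=2):
    """Sum shared entities over sentence pairs at most `window` apart.

    Each pair is down-weighted by its distance, so adjacent sentences count most.
    """
    score = 0.0
    for i in range(len(order)):
        last = min(i + window, len(order) - 1)
        for j in range(i + 1, last + 1):
            shared = entities[order[i]] & entities[order[j]]
            score += len(shared) / (j - i)
    return score


original = (0, 1, 2, 3)
shuffled = (0, 2, 1, 3)
all_orders = list(permutations(range(len(sentence_entities))))
best_score = max(coherence(o, sentence_entities) for o in all_orders)
best_orders = [o for o in all_orders if coherence(o, sentence_entities) == best_score]

print(coherence(original, sentence_entities), coherence(shuffled, sentence_entities))
print(best_orders)
```

Because this toy score is symmetric, the original order ties with its exact reverse; a measure that is sensitive to direction would be needed to separate the two.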

Milan’s M.Sc. Thesis Practice Talk

November 20th, 2009

Time: Friday November 20th, 1:30 p.m.
Place: TASC 9204 (west side)

Title: EXTENDING CENTERING THEORY FOR THE MEASURE OF ENTITY COHERENCE

Abstract of the talk:

Centering Theory determines the coherence of a text by identifying the most salient entities in adjacent utterances and observing how they change in salience. This thesis extends Centering Theory by tracking all the entities of the utterance in order to improve the measure of a text’s coherence. Accounting for all entities allows the utterance window to be expanded beyond adjacent utterances, which eliminates the difficulties and compromises associated with choosing either a sentence or a clause as the unit of analysis. Experiments show that tracking all entities instead of a single entity improves the evaluation of text coherence compared to traditional measures of coherence. The model proposed is motivated by linguistic principles and does not require training, alleviating the need for costly training data. The entity coherence model was evaluated on two tasks: sentence ordering and summarization. The sentence ordering experiment involved identifying the original text amongst a collection of its permutations.