Simon Fraser University

Welcome to the SFU Natural Language Laboratory

Natural Language Processing (NLP) is the automatic analysis of human languages such as English and Korean by computer algorithms. Unlike programming languages, where the structure and meaning of programs are easy to encode, human languages pose an interesting challenge, both in their analysis and in the learning of language from observations.

Our lab is active in the areas of information extraction, machine translation, summarization, and statistical parsing. We have worked with many languages, including Arabic, Chinese, Czech, English, French, Hindi, Korean and Spanish.

About 10 to 15 students are members of the Natural Language Lab at SFU. We have a strong relationship with the natural language industry in Canada, the National Research Council, and various research groups around the world and within SFU. In 1999, lab researchers formed a company, Axonwave Software Inc., which uses language technology software.

Interested in machines that can learn language, understand language, and translate language? We are always looking for talented graduate students. We also actively explore connections to industrial/commercial applications of our research through collaborative grants and/or internships.

Northwest NLP 2012

May 15th, 2012

We had several items at the Pacific Northwest Regional NLP Workshop this year.

Oral presentation:

Posters:

Ann Clifton’s PhD depth examination

May 15th, 2012

Ann Clifton did her PhD depth examination on April 12, 2012.

Title: Multilingual Statistical Machine Translation
Abstract:

We examine approaches towards using multilingual information for statistical machine translation (SMT). Multilingual information has been successfully leveraged for a variety of monolingual natural language processing tasks, such as word-sense disambiguation. SMT is an inherently multilingual task (since model training requires parallel data between the source and target language), but most work has focused on training translation models in the bilingual setting only. This has meant that robust SMT systems largely only exist for the few language pairs for which extensive parallel corpora are available. However, by injecting multilingual information into the models, we can hope to exploit the orthogonality of ambiguity in different language sources to improve SMT models; in particular, we can make it possible to train viable translation models for resource-poor language pairs.

We categorize multilingual translation methods based on the level at which the multilingual information is combined: at the word alignment level, the phrasal level, or the source-side sentence level. We characterize the methods at each level of combination in terms of their robustness to data sparsity, as well as their extensibility to broader multilingual settings.

Youngchan Kim’s MSc thesis defence

May 15th, 2012

Youngchan Kim successfully defended his MSc thesis on April 2, 2012.

Title: Bidirectional Segmentation for English-Korean Machine Translation
Abstract:

Unlike English or Spanish, which have each word clearly segmented, morphologically rich languages, such as Korean, do not have clear optimal word boundaries for machine translation (MT). Previous work has shown that segmenting such languages by incorporating information available from a parallel corpus can improve MT results. In this paper we show that this can be improved further by segmenting both source and target languages, and we present improvements in BLEU scores for English-Korean translation.

Lab meeting: 5/16 11:00am

May 15th, 2012

The next lab meeting will be on Wednesday May 16, 2012 at 11:00am, in TASC1 9408. We will be watching a talk, followed by discussion.

First lab meetings of Summer 2012

May 15th, 2012

We have already had two lab meetings this semester. On May 2, Anoop Sarkar and Ravikiran Vadlapudi presented work on visualizing Wikipedia text for human history. On May 9, we did practice talks and poster presentations for NW-NLP.

Lab meeting: 4/18 3:30pm

April 17th, 2012

The next lab meeting will be on Wednesday April 18, 2012 at 3:30pm, in TASC1 9408.

Majid Razmara will be presenting the following paper: John DeNero, Shankar Kumar, Ciprian Chelba, and Franz Och, Model Combination for Machine Translation, in Proceedings of NAACL, 2010.

Machine translation benefits from two types of decoding techniques: consensus decoding over multiple hypotheses under a single model and system combination over hypotheses from different models. We present model combination, a method that integrates consensus decoding and system combination into a unified, forest-based technique. Our approach makes few assumptions about the underlying component models, enabling us to combine systems with heterogeneous structure. Unlike most system combination techniques, we reuse the search space of component models, which entirely avoids the need to align translation hypotheses. Despite its relative simplicity, model combination improves translation quality over a pipelined approach of first applying consensus decoding to individual systems, and then applying system combination to their output. We demonstrate BLEU improvements across data sets and language pairs in large-scale experiments.

Lab meeting: 4/11 3:30pm

April 10th, 2012

The next lab meeting will be on Wednesday April 11, 2012 at 3:30pm, in TASC1 9408.

Baskaran Sankaran will be presenting the following paper: Mark Johnson, Thomas L. Griffiths and Sharon Goldwater, Bayesian Inference for PCFGs via Markov Chain Monte Carlo, in NAACL-HLT 2007.

This paper presents two Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference of probabilistic context-free grammars (PCFGs) from terminal strings, providing an alternative to maximum-likelihood estimation using the Inside-Outside algorithm. We illustrate these methods by estimating a sparse grammar describing the morphology of the Bantu language Sesotho, demonstrating that with suitable priors Bayesian techniques can infer linguistic structure in situations where maximum-likelihood methods such as the Inside-Outside algorithm only produce a trivial grammar.

Lab meeting: 4/4 3:30pm

April 3rd, 2012

The next lab meeting will be on Wednesday April 4, 2012 at 3:30pm, in TASC1 9408.

Marzieh Razavi will be presenting the following paper: Huang and Sagae, Dynamic Programming for Linear-Time Incremental Parsing.

Incremental parsing techniques such as shift-reduce have gained popularity thanks to their efficiency, but there remains a major problem: the search is greedy and only explores a tiny fraction of the whole space (even with beam search) as opposed to dynamic programming. We show that, surprisingly, dynamic programming is in fact possible for many shift-reduce parsers, by merging “equivalent” stacks based on feature values. Empirically, our algorithm yields up to a five-fold speedup over a state-of-the-art shift-reduce dependency parser with no loss in accuracy. Better search also leads to better learning, and our final parser outperforms all previously reported dependency parsers for English and Chinese, yet is much faster.

Lab meeting: 3/28 3:30pm

March 27th, 2012

The next lab meeting will be on Wednesday March 28, 2012 at 3:30pm, in TASC1 9408.

Youngchan Kim will be giving a practice talk for his thesis defence:

Unlike English or Spanish, which have each word clearly segmented, morphologically rich languages, such as Korean, do not have clear optimal word boundaries for machine translation (MT). Previous work has shown that segmenting such languages by incorporating information available from a parallel corpus can improve MT results. In this paper we show that this can be improved further by segmenting both source and target languages, and we present improvements in BLEU scores for English-Korean translation.

2012 Publications (so far)

March 24th, 2012

Our lab has published the following papers in 2012 (so far). Links to camera-ready versions will follow.

Bootstrapping via Graph Propagation.
Max Whitney and Anoop Sarkar. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL-2012). Jeju Island, Republic of Korea. July 9-11, 2012.
Mixing Multiple Translation Models in Statistical Machine Translation.
Majid Razmara, George Foster, Baskaran Sankaran and Anoop Sarkar. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL-2012). Jeju Island, Republic of Korea. July 9-11, 2012.
Improved Reordering for Shallow-n Grammar based Hierarchical Phrase-based Translation.
Baskaran Sankaran and Anoop Sarkar. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2012). Montreal, Quebec, Canada. June 3-8, 2012.
Kriya – an end-to-end hierarchical phrase-based MT system. (draft version)
Baskaran Sankaran, Majid Razmara and Anoop Sarkar. The Prague Bulletin of Mathematical Linguistics. March 2012.
Domain Adaptation Techniques for Machine Translation and their Evaluation in a Real-World Setting.
Baskaran Sankaran, Majid Razmara, Atefeh Farzindar, Wael Khreich, Fred Popowich and Anoop Sarkar. In Proceedings of the 25th Canadian Conference on Artificial Intelligence. York University, Ontario. May 28-30, 2012.