02 Feb 2016

On February 4th, Mahdi Soleimani will present his MSc thesis research on bootstrapping classifiers with limited amounts of training data during our lab meeting.

Abstract:

For many NLP tasks, a large amount of unlabelled data is available while labelled data is hard to obtain. Bootstrapping techniques have been shown to be very successful on various NLP tasks, using only a small amount of supervision (labelled data) alongside a large set of unlabelled data. While most previous research and algorithms focus on the parameter estimation step in bootstrapping, here we study the decoding step (classification using the estimated parameters). We show that by using different decoding techniques, similar to the decoding step in the Yarowsky algorithm, the simple EM algorithm can achieve the same results as more complicated learning approaches.
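As an illustration of the decoding step the abstract refers to, here is a toy Yarowsky-style self-training sketch (not the thesis code): a decision list is retrained each round, and unlabeled examples decoded above a confidence threshold are absorbed into the labeled set. The scoring scheme, threshold, and any data fed to it are made-up placeholders.

```python
from collections import Counter

def train_rules(labeled):
    """Count (feature, label) co-occurrences as simple decision-list scores."""
    scores = Counter()
    for feats, label in labeled:
        for f in feats:
            scores[(f, label)] += 1
    return scores

def decode(scores, feats, labels):
    """Decoding step: pick the label whose strongest matching rule scores highest."""
    best_label, best_score = None, 0
    for label in labels:
        s = max((scores[(f, label)] for f in feats), default=0)
        if s > best_score:
            best_label, best_score = label, s
    return best_label, best_score

def bootstrap(labeled, unlabeled, labels, threshold=2, rounds=10):
    """Yarowsky-style loop: retrain, then absorb confidently decoded examples."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        scores = train_rules(labeled)
        decoded = [(feats, decode(scores, feats, labels)) for feats in unlabeled]
        newly = [(feats, lab) for feats, (lab, s) in decoded
                 if lab is not None and s >= threshold]
        if not newly:
            break  # no new confident labels; the bootstrap has converged
        labeled += newly
        taken = {id(feats) for feats, _ in newly}
        unlabeled = [feats for feats in unlabeled if id(feats) not in taken]
    return train_rules(labeled)
```

Different choices of `decode` (e.g. thresholded decision lists versus posterior probabilities from EM) are exactly the kind of variation the thesis compares.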

18 Nov 2015

On Tuesday, November 25th, 10:00 a.m. at TASC1 9204 WEST, Golnar Sheikhshabbafghi will give her PhD Depth Examination talk on

“Graph-Based Semi-Supervised Learning”.

Graph-based semi-supervised learning (SSL) is based on the assumption that similar data points should have similar labels. A graph is constructed whose vertices represent data points and whose edge weights represent how strongly we believe that adjacent vertices (data points) should receive the same label. The graph connects labeled and unlabeled data points, and each vertex is associated with a label distribution that represents the current belief about its label. Given this graph, which encodes the similarities between data points, the goal is to find label distributions for all vertices so that 1) for any labeled vertex v, the associated label distribution is as close as possible to its reference distribution, obtained from the number of times each (data point, label) pair appears in the labeled data; 2) adjacent vertices in the graph have similar label distributions; 3) the label distributions of all vertices comply with prior knowledge, if such knowledge exists.
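The three objectives above can be approximated with a simple iterative label propagation sketch. This is an illustrative toy, not one of the methods surveyed in the talk: `weights` is a hypothetical symmetric adjacency map, and labeled (seed) vertices are clamped to their reference distributions rather than softly penalized.

```python
def propagate(weights, seeds, labels, iters=50):
    """Average neighbours' label distributions; clamp labeled (seed) vertices."""
    nodes = set(weights)
    # Seed vertices start (and stay) at their reference distribution;
    # unlabeled vertices start uniform over the label set.
    dist = {v: dict(seeds[v]) if v in seeds
            else {l: 1.0 / len(labels) for l in labels}
            for v in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            if v in seeds:  # objective 1): stay at the reference distribution
                new[v] = dict(seeds[v])
                continue
            total = sum(weights[v].values())
            # objective 2): take the weighted average of neighbours' beliefs
            new[v] = {l: sum(w * dist[u][l] for u, w in weights[v].items()) / total
                      for l in labels}
        dist = new
    return dist
```

Objective 3), a prior over label distributions, would enter as an extra term mixed into each update; it is omitted here for brevity.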

17 Nov 2015

On Tuesday, November 17th, 3:30 p.m. at TASC1 9204 WEST, Te Bu will defend his MSc thesis on the topic of “Joint Prediction of Word Alignment and Alignment Types for Statistical Machine Translation”.

Here is the abstract of his thesis:

Learning word alignments between parallel sentence pairs is an important task in Statistical Machine Translation. Existing models for word alignment have assumed that word alignment links are untyped. In this work, we propose new machine learning models that use linguistically informed link types to enrich word alignments. We use 11 different alignment link types based on annotated data released by the Linguistic Data Consortium. We first provide a solution to the sub-problem of alignment type prediction given an aligned word pair and then propose two different models to simultaneously predict word alignment and alignment types. Our experimental results show that we can recover alignment link types with an F-score of 81.5%. Our joint model improves the word alignment F-score by 0.9% over a baseline that does not use typed alignment links. We expect typed word alignments to benefit SMT and other NLP tasks that rely on word alignments.
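For a concrete sense of the evaluation, F-score over typed alignment links can be computed by exact matching of (source index, target index, link type) triples. The link-type names and alignments below are invented examples, not the LDC annotation scheme or the thesis data.

```python
def f_score(gold, predicted):
    """Precision/recall/F1 over typed alignment links (exact-match triples)."""
    correct = len(gold & predicted)
    if not correct:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Untyped alignment F-score is the same computation after stripping the type from each triple, which is why a typed model can be scored against an untyped baseline.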

16 Nov 2015

On November 18th, Golnar will give a practice talk about Graph-Based Semi-Supervised Learning during the lab meeting.

Here is the abstract:

Graph-based semi-supervised learning (SSL) is based on the assumption that similar data points should have similar labels. A graph is constructed whose vertices represent data points and whose edge weights represent how strongly we believe that adjacent vertices (data points) should receive the same label. The graph connects labeled and unlabeled data points, and each vertex is associated with a label distribution that represents the current belief about its label. Given this graph, which encodes the similarities between data points, the goal is to find label distributions for all vertices so that 1) for any labeled vertex v, the associated label distribution is as close as possible to its reference distribution, obtained from the number of times each (data point, label) pair appears in the labeled data; 2) adjacent vertices in the graph have similar label distributions; 3) the label distributions of all vertices comply with prior knowledge, if such knowledge exists.

There are two different settings in graph-based SSL: transductive and inductive. In the transductive setting, the graph is constructed over the training and test sets, and the most probable label for each test data point is chosen after propagation. For new test data, the graph must be reconstructed and labels propagated again. In the inductive setting, on the other hand, a model such as a conditional random field is trained and can assign labels to new data points, so there is no need to construct a graph or propagate labels through one. Graph-based SSL has been applied to many NLP applications. In these applications, data points are usually n-grams and edge weights are computed from context features.

08 Nov 2015

On Tuesday, November 10th, Te Bu will give a practice talk on “Joint Prediction of Word Alignment and Alignment Types for Statistical Machine Translation”. The talk will be in TASC1 9408 at 3:30 p.m.

Learning word alignments between parallel sentence pairs is an important task in Statistical Machine Translation. Existing models for word alignment have assumed that word alignment links are untyped. In this work, we propose new machine learning models that use linguistically informed link types to enrich word alignments. We use 11 different alignment link types based on annotated data released by the Linguistic Data Consortium. We first provide a solution to the sub-problem of alignment type prediction given an aligned word pair and then propose two different models to simultaneously predict word alignment and alignment types. Our experimental results show that we can recover alignment link types with an F-score of 81.4%. Our joint model improves the word alignment F-score by 4.6% over a baseline that does not use typed alignment links. We expect typed word alignments to benefit SMT and other NLP tasks that rely on word alignments.