18 Nov 2015

On Tuesday, November 25th, 10:00 a.m. at TASC1 9204 WEST, Golnar Sheikhshabbafghi will give her PhD Depth Examination talk on

“GRAPH-BASED SEMI-SUPERVISED LEARNING”.

Graph-based semi-supervised learning (SSL) is based on the assumption that similar data points should have similar labels. A graph is constructed whose vertices represent data points and whose edge weights represent how strongly we believe the adjacent vertices (data points) should get the same label. The graph connects labeled and unlabeled data points, and each vertex is associated with a label distribution that represents the current belief about its label. Given this graph, which encodes the similarities between data points, the goal is to find label distributions for all vertices so that 1) for any labeled vertex v, the associated label distribution is as close as possible to its reference distribution, obtained from the labeled data based on the number of times each (data point, label) pair appeared together; 2) adjacent vertices in the graph have similar label distributions; 3) the label distributions of all vertices comply with prior knowledge, if such knowledge exists.
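The three-part objective above can be sketched as a simple iterative propagation over a toy graph. This is a minimal illustration, not the talk's exact formulation; the weights mu1, mu2, mu3 (for the seed, neighbor, and prior terms) and the toy chain graph are invented for the example.

```python
import numpy as np

def propagate_labels(W, Y_ref, labeled_mask, prior, mu1=1.0, mu2=1.0, mu3=0.1, iters=50):
    """Iteratively update label distributions so that
    1) labeled vertices stay close to their reference distributions,
    2) adjacent vertices have similar distributions, and
    3) every vertex stays close to a prior distribution."""
    n, k = Y_ref.shape
    Y = np.full((n, k), 1.0 / k)                 # start from uniform distributions
    for _ in range(iters):
        for v in range(n):
            num = mu2 * (W[v] @ Y) + mu3 * prior  # neighbor + prior terms
            den = mu2 * W[v].sum() + mu3
            if labeled_mask[v]:                   # seed term for labeled vertices
                num = num + mu1 * Y_ref[v]
                den = den + mu1
            Y[v] = num / den                      # stays a valid distribution
    return Y

# Toy chain graph 0-1-2-3: vertex 0 is labeled class 0, vertex 3 class 1.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y_ref = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
labeled = np.array([True, False, False, True])
prior = np.array([0.5, 0.5])

Y = propagate_labels(W, Y_ref, labeled, prior)
```

After propagation, the unlabeled vertex adjacent to each seed leans toward that seed's label, as the abstract's second condition requires.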

17 Nov 2015

On Tuesday, November 17th, 3:30 p.m. at TASC1 9204 WEST, Te Bu will defend his M.Sc. thesis on the topic of “Joint Prediction of Word Alignment and Alignment Types for Statistical Machine Translation”.

Here is the abstract of his thesis:

Learning word alignments between parallel sentence pairs is an important task in Statistical Machine Translation. Existing models for word alignment have assumed that word alignment links are untyped. In this work, we propose new machine learning models that use linguistically informed link types to enrich word alignments. We use 11 different alignment link types based on annotated data released by the Linguistic Data Consortium. We first provide a solution to the sub-problem of alignment type prediction given an aligned word pair and then propose two different models to simultaneously predict word alignment and alignment types. Our experimental results show that we can recover alignment link types with an F-score of 81.5%. Our joint model improves the word alignment F-score by 0.9% over a baseline that does not use typed alignment links. We expect typed word alignments to benefit SMT and other NLP tasks that rely on word alignments.
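The F-scores in the abstract compare predicted links against gold links. A minimal sketch of that evaluation, assuming links are (source index, target index, type) triples so a link only counts as correct when its type also matches; the example link type strings are illustrative, not the LDC inventory.

```python
def alignment_f_score(predicted, gold):
    """predicted, gold: sets of links, e.g. (src_idx, tgt_idx, link_type) triples."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy sentence pair with three gold links; one predicted link has the wrong type.
gold = {(0, 0, "SEM"), (1, 2, "FUN"), (2, 1, "SEM")}
pred = {(0, 0, "SEM"), (1, 2, "GIS"), (2, 1, "SEM")}
f = alignment_f_score(pred, gold)  # 2 of 3 typed links match
```

Dropping the type from each triple recovers the usual untyped alignment F-score, which is how the typed and baseline numbers above can be compared on the same footing.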

16 Nov 2015

On November 18th, Golnar will give a practice talk on Graph-Based Semi-Supervised Learning during the lab meeting.

Here is the abstract:

Graph-based semi-supervised learning (SSL) is based on the assumption that similar data points should have similar labels. A graph is constructed whose vertices represent data points and whose edge weights represent how strongly we believe the adjacent vertices (data points) should get the same label. The graph connects labeled and unlabeled data points, and each vertex is associated with a label distribution that represents the current belief about its label. Given this graph, which encodes the similarities between data points, the goal is to find label distributions for all vertices so that 1) for any labeled vertex v, the associated label distribution is as close as possible to its reference distribution, obtained from the labeled data based on the number of times each (data point, label) pair appeared together; 2) adjacent vertices in the graph have similar label distributions; 3) the label distributions of all vertices comply with prior knowledge, if such knowledge exists.

There are two different settings in graph-based SSL: transductive and inductive. In the transductive setting, the graph is constructed over the training and test sets, and the most probable label for any test data point is chosen after propagation. For new test data, the graph must be reconstructed and labels propagated again. In the inductive setting, on the other hand, a model such as a conditional random field is trained and can assign labels to new data points, so there is no need to construct a graph or propagate labels through one. Graph-based SSL has been applied to many NLP applications. In these applications, data points are usually n-grams and edge weights are computed based on context features.

08 Nov 2015

On Tuesday, November 10th, Te Bu will give a practice talk on “Joint prediction of word alignment and alignment types for statistical machine translation”. The talk will be in TASC1 9408 at 3:30 p.m.

Learning word alignments between parallel sentence pairs is an important task in Statistical Machine Translation. Existing models for word alignment have assumed that word alignment links are untyped. In this work, we propose new machine learning models that use linguistically informed link types to enrich word alignments. We use 11 different alignment link types based on annotated data released by the Linguistic Data Consortium. We first provide a solution to the sub-problem of alignment type prediction given an aligned word pair and then propose two different models to simultaneously predict word alignment and alignment types. Our experimental results show that we can recover alignment link types with an F-score of 81.4%. Our joint model improves the word alignment F-score by 4.6% over a baseline that does not use typed alignment links. We expect typed word alignments to benefit SMT and other NLP tasks that rely on word alignments.

04 Nov 2015

Our paper “Learning Segmentations that Balance Latency versus Quality in Spoken Language Translation” by Hassan S. Shavarani, Maryam Siahbani, Ramtin Mehdizadeh Seraj and Anoop Sarkar was accepted for publication at the 12th International Workshop on Spoken Language Translation (IWSLT 2015), to be held in Da Nang, Vietnam, December 3-4, 2015.

Abstract: Segmentation of the incoming speech stream and translating segments incrementally is a commonly used technique that improves latency in spoken language translation. Previous work has explored creating training data for segmentation by finding segments that maximize translation quality with a user-defined bound on segment length. In this work, we provide a new algorithm, using Pareto-optimality, for finding good segment boundaries that can balance the trade-off between latency versus translation quality. Our experimental results show that we can provide qualitatively better segments that improve latency without substantially hurting translation quality.
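The Pareto-optimality idea in the abstract can be sketched in a few lines: among candidate segmentations, keep only those not dominated on both latency (lower is better) and translation quality (higher is better). The candidate numbers below are invented purely for illustration.

```python
def pareto_frontier(candidates):
    """candidates: list of (latency, quality) pairs.
    Returns the candidates not dominated by any other candidate,
    i.e. no other candidate is at least as good on both criteria
    and strictly better on one."""
    frontier = []
    for lat, qual in candidates:
        dominated = any(
            l2 <= lat and q2 >= qual and (l2 < lat or q2 > qual)
            for l2, q2 in candidates
        )
        if not dominated:
            frontier.append((lat, qual))
    return frontier

# Hypothetical (latency in seconds, BLEU-like quality) for four segmentations.
candidates = [(1.0, 0.60), (2.0, 0.70), (2.5, 0.65), (4.0, 0.72)]
frontier = pareto_frontier(candidates)
```

Here (2.5, 0.65) is dominated by (2.0, 0.70), which is both faster and better; the remaining candidates form the latency-quality trade-off curve from which a segmentation can be picked.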