In the lab meeting today, Nishant will give a talk about handling Out-of-Vocabulary (OOV) words in Machine Translation. Here is a brief description of his talk:
Out-of-vocabulary (OOV) words, i.e. words that appear in the input to be translated but not in the training data, are a ubiquitous and difficult problem in machine translation. Data-driven machine translation systems can translate words that they have seen in the training corpora, but translating unseen words remains a bottleneck for even the best-performing systems. Because the amount of parallel data is finite, infrequent terms are often absent from the parallel training corpora. This lack of information can produce incomplete, erroneous, and disfluent translations. In this discussion, we will investigate the different approaches to handling OOVs in Statistical Machine Translation, leading up to Neural Machine Translation.
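To make the definition concrete, here is a minimal Python sketch that finds OOV tokens on the source side of a test set. It assumes whitespace tokenization and toy sentences; the function names `find_oov` and `oov_rate` are hypothetical helpers for illustration, not part of any MT toolkit or of the talk itself.

```python
from collections import Counter


def find_oov(train_sentences, test_sentences):
    """Count test tokens that never occur in the training sentences."""
    # Vocabulary: every whitespace-separated token seen in training (an assumption;
    # real systems typically apply their own tokenization or subword segmentation).
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    return Counter(
        tok for sent in test_sentences for tok in sent.split() if tok not in vocab
    )


def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that are out-of-vocabulary."""
    total = sum(len(sent.split()) for sent in test_sentences)
    if total == 0:
        return 0.0
    return sum(find_oov(train_sentences, test_sentences).values()) / total


# Toy data, invented purely for illustration.
train = ["the cat sat on the mat", "the dog sat"]
test = ["the cat chased the ferret"]

print(find_oov(train, test))   # tokens 'chased' and 'ferret' are unseen
print(oov_rate(train, test))   # 2 of the 5 test tokens are OOV
```

With a finite training corpus, rare words such as "ferret" in this toy example are exactly the ones a data-driven system has no translation evidence for.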