Juri Ganitkevitch visited Natlang on 22nd October and gave a talk about large-scale paraphrasing. His abstract and bio follow.
We present the methods and infrastructure behind the paraphrase database PPDB, a collection of over 220 million syntactically labeled English paraphrase pairs. PPDB is extracted from over 100 million sentences of bilingual parallel data and further scored using distributional similarity information drawn from vast amounts of monolingual text. We discuss the pivoting approach used to extract syntactic paraphrases from bilingual corpora, present challenges and solutions in scaling extraction and application methods to data of this size, and give an overview of our current and forthcoming efforts to expand PPDB’s breadth of coverage and depth of annotation.
Juri is a terminal-stage Ph.D. student at the Center for Language and Speech Processing at Johns Hopkins University, advised by Chris Callison-Burch. His main research interests are scaling paraphrase extraction techniques to large amounts of data and pushing paraphrase applications towards natural language understanding.