Augmenting Bilingual Lexicons and Statistical Machine Translation Models Using Distributional Paraphrases and Hybrid Semantic Distance Measures

Friday, February 17, 2012 - 11:00am - 12:00pm EST

  • Location: 2 MetroTech Center, 10th Floor, 10.099
    NY, US
  • Contact: Lisa Hellerstein
    hstein@poly.edu
    718 260 3689

Speaker: Yuval Marton, IBM Research

Abstract: Many semantic distance measures estimate how close in meaning two words or phrases (or larger text units) are, using human knowledge in the form of a lexical resource: a dictionary, thesaurus, or taxonomy (e.g., WordNet). Distributional semantic distance measures rely instead only on word distributions in a large corpus of non-annotated text. I will present some hybrid measures and their use for paraphrase generation, which in turn is useful for NLP tasks such as statistical machine translation (SMT), information retrieval, summarization, and language generation.

Purely distributional measures have been outperformed by measures that add human knowledge (e.g., using WordNet). However, such knowledge/corpus-based hybrid work is not suitable for specialized domains, resource-poor ("low density") languages, or non-classical semantic relations (roughly, "relatedness" other than "is-a" relations). Previous hybrid work that handled these issues did so only to some extent, by using shallow thesaurus-based "concepts" (lists of related words) to define a coarse-grained aggregated distributional representation. I will show that finer-grained hybrid models can benefit from concept information while retaining a high-coverage word-based distributional representation. I will then present a largely language-independent paraphrase generation method for augmenting bilingual lexicons and translation phrase tables, using the above-mentioned hybrid semantic measures, evaluated in a state-of-the-art SMT framework.

Bio: Yuval Marton is a postdoctoral researcher at IBM, NY, following his postdoctoral research scientist position at Columbia University. His research interests focus on computational linguistics, especially combining linguistic knowledge with statistical and corpus-based machine-learning techniques. He is currently involved in research on syntactic dependency parsing for morphologically rich languages such as Arabic, and on the use of morphological and syntactic knowledge in SMT. He is also involved in research on distributional and hybrid semantic distance measures and on paraphrase generation. He co-organizes the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (SP-Sem-MRL 2012), and is the publication chair for the 2012 NAACL-collocated *SEM conference. Yuval received his Ph.D. in Linguistics from the University of Maryland in Fall 2009, concentrating on computational linguistics, with a Neuroscience and Cognitive Science (NACS) Program Certificate. He received his master's degree in Computer Science from NYU-Poly in 2004.