Crawling a Parallel Corpus from the Web

Computer Science and Engineering
NYU Community Event

Speaker: Luciano Barbosa, AT&T Labs - Research


Parallel texts are translations of the same text in different languages.
Parallel text acquisition from the Web has received increased attention in
recent years, especially for machine translation and cross-language
information retrieval. For many years, the European Parliament proceedings
and official documents of countries with multiple languages were the only
widely available parallel texts. Although these are high-quality corpora,
they have some limitations: (1) they tend to be domain specific (e.g.,
government-related texts); (2) they are available in only a few languages;
and (3) they are sometimes not free or carry usage restrictions. Web data,
by contrast, is free and spans many languages and domains. In this talk, I
will present a two-step strategy for collecting a parallel corpus from the
Web. The first step automatically identifies bilingual Web sites (the
inter-site crawler); the second locates the parallel pages within those
sites (the intra-site crawler). Experimental results show that parallel
text acquired with this methodology significantly improves machine
translation accuracy.
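
The abstract does not say how the intra-site crawler decides that two pages
are translations of each other. As a rough illustration only, the sketch
below assumes one common heuristic: pages on the same site whose URLs differ
only in a language segment (e.g. "/en/" vs. "/es/") are treated as candidate
parallel pages. The names LANG_CODES, url_key, and pair_parallel_pages are
hypothetical and are not taken from the talk.

```python
import re
from collections import defaultdict

# Hypothetical language markers (an assumption for this sketch);
# a real crawler would need a much broader list.
LANG_CODES = ("en", "es")

# Matches a language code that forms a whole path segment, e.g. "/en/".
_LANG_SEGMENT = re.compile(r"/(%s)(?=/|$)" % "|".join(LANG_CODES))


def url_key(url):
    """Return (language, URL with the language segment masked), or None."""
    match = _LANG_SEGMENT.search(url)
    if match is None:
        return None
    masked = url[:match.start()] + "/*" + url[match.end():]
    return match.group(1), masked


def pair_parallel_pages(urls):
    """Group URLs that are identical apart from their language segment."""
    groups = defaultdict(dict)
    for url in urls:
        key = url_key(url)
        if key is not None:
            lang, masked = key
            groups[masked][lang] = url
    # Keep only the groups in which every language is present.
    return sorted(
        tuple(pages[code] for code in LANG_CODES)
        for pages in groups.values()
        if len(pages) == len(LANG_CODES)
    )


if __name__ == "__main__":
    site_urls = [
        "http://example.com/en/about",
        "http://example.com/es/about",
        "http://example.com/en/news",   # no Spanish counterpart
        "http://example.com/contact",   # no language segment
    ]
    for pair in pair_parallel_pages(site_urls):
        print(pair)
```

URL matching of this kind only yields candidates; a real system would
presumably also compare page content or structure before accepting a pair.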


Luciano Barbosa is a Member of Technical Staff at AT&T Labs - Research. He
joined AT&T in 2009, working in the Speech and Language group. He obtained
his Ph.D. at the School of Computing, University of Utah, in 2009 under the
supervision of Prof. Juliana Freire. His dissertation focused on creating
algorithms and techniques to locate, identify, and organize hidden-web sources.
Prior to joining the Ph.D. program, he worked for four years as a lead developer
at RADIX, one of the first Brazilian search engines. His research interests
include Web mining, in particular Web crawling, social media mining, and NLP.