Data Summarization at Scale

Lecture / Panel
For NYU Community

Speaker: Amin Karbasi, Yale University

For the last several years, we have witnessed the emergence of datasets of unprecedented scale across different scientific disciplines. The sheer volume of such datasets presents new computational challenges, as the diverse, feature-rich, and usually high-resolution data does not allow for effective data-intensive inference. In this regard, data summarization is a compelling (and sometimes the only) approach that both exploits the richness of large-scale data and remains computationally tractable: instead of operating on complex and large data directly, carefully constructed summaries not only enable the execution of various data-analytics tasks but also improve their efficiency and scalability.

A systematic way to perform data summarization is to turn the problem into selecting a subset of data elements that optimizes a utility function quantifying the “representativeness” of the selected set. Oftentimes, these objective functions satisfy submodularity, an intuitive notion of diminishing returns stating that selecting any given element earlier helps more than selecting it later. Thus, many problems in data summarization require maximizing a submodular set function subject to a cardinality constraint, and massive data means we have to solve these problems at scale.
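To make this concrete, here is a minimal sketch (not code from the talk) of the classic greedy algorithm for this problem, paired with a coverage function, a standard example of a submodular utility; the element names and sets below are illustrative only.

```python
# Classic greedy for maximizing a monotone submodular set function f
# under a cardinality constraint |S| <= k; it enjoys a (1 - 1/e)
# approximation guarantee but needs O(n * k) function evaluations.
def greedy_max(ground_set, f, k):
    S = set()
    for _ in range(k):
        candidates = [e for e in ground_set if e not in S]
        if not candidates:
            break
        # Pick the element with the largest marginal gain f(S + e) - f(S).
        best = max(candidates, key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S

# Toy utility: a coverage function (how many items the chosen sets cover),
# a textbook example of a submodular, diminishing-returns objective.
cover = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
def coverage(S):
    return len(set().union(*(cover[e] for e in S))) if S else 0

summary = greedy_max(["a", "b", "c"], coverage, k=2)
```

Note the diminishing returns in action: once "a" is chosen, both "b" and "c" contribute only one newly covered item each, less than they would have contributed to the empty set.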

In this talk, I will present our recent efforts in developing practical schemes for data summarization. In particular, I will first discuss Stochastic-Greedy, currently the fastest centralized algorithm for data summarization, whose query complexity is only linear in the data size. However, to truly summarize data at scale we need to opt for streaming or distributed solutions. To this end, I will then present Sieve-Streaming, the first streaming algorithm for data summarization with a constant-factor approximation guarantee. Finally, I will talk about GreeDi, the first distributed approach to data summarization. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference and exemplar-based clustering, on tens of millions of data points using Hadoop.
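As an illustration of the first of these ideas (a sketch under stated assumptions, not the speaker's implementation), Stochastic-Greedy replaces the full scan over remaining elements at each step with a random sample of roughly (n/k) log(1/ε) candidates, which brings the total query complexity down to O(n log(1/ε)) while retaining a (1 - 1/e - ε) guarantee in expectation:

```python
import math
import random

def stochastic_greedy(ground_set, f, k, eps=0.1, seed=0):
    # Per step, evaluate marginal gains only on a random sample of size
    # (n/k) * log(1/eps) rather than on all remaining elements; total
    # queries are O(n log(1/eps)), i.e. linear in n, versus O(n * k)
    # for the classic greedy algorithm.
    rng = random.Random(seed)
    n = len(ground_set)
    sample_size = math.ceil((n / k) * math.log(1 / eps))
    S = set()
    for _ in range(k):
        remaining = [e for e in ground_set if e not in S]
        if not remaining:
            break
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        best = max(sample, key=lambda e: f(S | {e}) - f(S))
        S.add(best)
    return S
```

On tiny inputs the sample covers everything and the procedure coincides with classic greedy; the savings appear when n is large relative to k, which is exactly the massive-data regime the talk targets.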

Amin Karbasi is currently a professor of Electrical Engineering and Computer Science at Yale University and a faculty member of the Yale Institute for Network Science (YINS). He leads the Inference, Information and Decision (I.I.D) Systems Group. His research lies at the intersection of learning theory, large-scale networks, and information processing. He obtained his PhD from EPFL and spent a year as a post-doctoral scholar at ETH Zurich. He is the recipient of the 2013 Patrick Denantes Memorial Prize for the best Ph.D. thesis from the School of Computer and Communication Sciences at EPFL, and the winner of the 2013 ETH Fellowship. He is the co-recipient of several paper awards, including the AISTATS Best Student Paper Award 2015, IEEE Data Storage Best Student Paper Award 2013, ICASSP Best Student Paper Award 2011, ACM SIGMETRICS Best Student Paper Award 2010, and ISIT Best Student Paper Award runner-up 2010.

For more information please contact Prof. Lisa Hellerstein.