Robust Analytics on Data Streams
Speaker: Flip Korn, AT&T Labs
How can one make sense of fast and voluminous data? How can the value of Big Data be extracted when the data is "noisy"? As the quantity of digitized data explodes, its quality can be poor when it is generated by fallible users (e.g., crowdsourcing) and unreliable hardware (e.g., sensors), sent across volatile networks (e.g., wireless), and stored in complex systems (e.g., "the cloud"). Existing data cleansing techniques are aimed at solving specific problems, such as record linkage, but it is the unknown data quality problems that are hardest to detect and often the most pernicious. Hence, analytics queries must be applied robustly to avoid misleading answers.
In this talk I will first discuss how to perform robust, complex analytics, built from quantile and frequent-item primitives, on IP network traffic data at streaming speeds. These queries are implemented as a library of user-defined aggregate functions (UDAFs) in GS Tool, a Data Stream Management System developed at AT&T. In the second part, I will discuss an exploratory approach to data quality in which the user poses hypotheses to test, in the form of constraints such as functional dependencies, and the system performs multidimensional analysis to summarize when and where the data satisfies (or fails) the hypotheses. This approach is useful only if it is implemented scalably, at interactive speeds, and I will describe a fast lazy evaluation strategy for doing so. Finally, I will mention novel constraints (e.g., Sequential Dependencies and Conservation Rules) that exploit structural properties found in many data warehouses to discover potential errors.
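To make the second part concrete, here is a minimal sketch of testing a functional dependency as a hypothesis and summarizing where it fails along one dimension. It is an illustration under stated assumptions, not the system or the lazy evaluation strategy described in the talk: the function name `fd_violation_rate`, the column names, and the sample rows are all hypothetical, and the evaluation is a plain eager pass over the data.

```python
from collections import defaultdict

def fd_violation_rate(rows, lhs, rhs, group_by):
    """For each value of `group_by`, return the fraction of rows that take part
    in a violation of the functional dependency lhs -> rhs within that group.

    A row takes part in a violation when its lhs value is paired with more
    than one distinct rhs value inside the same group.
    """
    # group -> lhs value -> set of distinct rhs values seen
    rhs_values = defaultdict(lambda: defaultdict(set))
    # group -> lhs value -> number of rows carrying that lhs value
    row_counts = defaultdict(lambda: defaultdict(int))

    for row in rows:
        group = row[group_by]
        key = tuple(row[attr] for attr in lhs)
        rhs_values[group][key].add(row[rhs])
        row_counts[group][key] += 1

    summary = {}
    for group, by_key in rhs_values.items():
        total = sum(row_counts[group].values())
        violating = sum(
            row_counts[group][key]
            for key, values in by_key.items()
            if len(values) > 1
        )
        summary[group] = violating / total
    return summary

# Hypothetical data: does `router` determine `city`, and on which days does it fail?
rows = [
    {"day": "Mon", "router": "r1", "city": "NYC"},
    {"day": "Mon", "router": "r1", "city": "NYC"},
    {"day": "Tue", "router": "r1", "city": "NYC"},
    {"day": "Tue", "router": "r1", "city": "LA"},
    {"day": "Tue", "router": "r2", "city": "SF"},
]
summary = fd_violation_rate(rows, lhs=["router"], rhs="city", group_by="day")
```

In this toy input the dependency holds on every Monday row, while two of the three Tuesday rows conflict, so the per-day summary points the user at Tuesday. A scalable system would need to avoid recomputing such summaries from scratch for every dimension, which is where an approach like the lazy evaluation mentioned above would come in.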
Flip Korn is a member of the Database Research Department at AT&T Shannon Labs. His Ph.D. is from the University of Maryland, College Park. His background is in data mining and data streams, and the current focus of his research endeavors is in the area of data quality.