Scaling Big Data Mining Infrastructure: The Twitter Experience

For NYU Community

Speaker: Jimmy Lin, University of Maryland and Twitter


The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, I'll discuss the evolution of Twitter's infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that precedes the application of data mining algorithms, and by substantial follow-up effort to turn preliminary models into robust solutions. In this context, I'll discuss two topics. First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they're insufficient to provide an overall "big picture" of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows; we refer to this as "plumbing".

This talk has two goals: For practitioners, I hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, I hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.


Jimmy Lin is an Associate Professor in the iSchool at the University of Maryland, affiliated with the Department of Computer Science and the Institute for Advanced Computer Studies. He graduated with a Ph.D. in computer science from MIT in 2004. Lin's research lies at the intersection of information retrieval and natural language processing, and he has done work in a variety of areas, including question answering, medical informatics, and bioinformatics. Lin's current research focuses on massively distributed data analytics in cluster-based environments.

Lin recently completed an extended sabbatical at Twitter, where from 2010 to 2012 he worked on services designed to surface relevant content for users, and on the distributed infrastructure that supports mining relevance signals from massive amounts of data.

Please contact Torsten Suel for more information.