Deep Curation: Putting Open Science Data to Work

Lecture / Panel
For NYU Community

Speaker: Bill Howe, University of Washington


Data in public repositories and in the scientific literature remains remarkably underused despite significant investments in open data and open science. Making data available online turns out to be the easy part; making it usable for data science requires new services that support longitudinal, multi-dataset analysis rather than keyword search alone.

In this talk, I'll describe a suite of services my group has been building to improve the utility of public data.

In the Deep Curation project, we have developed a variant of distant supervision and co-learning that can automatically label datasets with zero training data. We have applied this approach to curate gene expression data and identify figures in the scientific literature, outperforming state-of-the-art supervised methods that rely on human-provided labels.

In the Wide Open project, we use a simple text-based approach to identify datasets referenced in the scientific literature that are overdue for publication; our results led to the public release of 400 datasets in a one-week period.

In the Claim Verification project, we extract limited forms of scientific claims from the literature and automatically perform reproducibility experiments against data in public repositories.

In the Viziometrics project, we are developing a platform for large-scale information extraction from the figures in the scientific literature. We have used this platform to automatically build a database of phylogenetic information from tens of thousands of tree diagrams in the literature.

Finally, in the Query2Vec project, we are designing vector embeddings of SQL query logs to automate database administration tasks such as index recommendation and result caching.

Our vision is to provide a richer set of services to make data-intensive science more robust and reproducible, and ultimately improve public trust in science.


Bill Howe is Associate Professor in the Information School, Adjunct Associate Professor in Computer Science & Engineering, and Associate Director of the UW eScience Institute. His research interests are in data management, curation, analytics, and visualization in the sciences.

Howe played a leadership role in the Data Science Environment program at UW through a $32.8 million grant awarded jointly to UW, NYU, and UC Berkeley. With support from the MacArthur Foundation and Microsoft, Howe directs the Urbanalytics group at UW and UW's participation in the Cascadia Urban Analytics Cooperative with the University of British Columbia, where he focuses on data-intensive urban science. He founded the UW Data Science Masters Degree and serves as its inaugural Program Director and Faculty Chair.

He has received two Jim Gray Seed Grant awards from Microsoft Research for work on managing environmental data, has had two papers selected for VLDB Journal's "Best of Conference" issues (2004 and 2010), and co-authored what are currently the most-cited papers from both VLDB 2010 and SIGMOD 2012. Howe serves on the program and organizing committees for a number of conferences in the area of databases and scientific data management, developed a first MOOC on data science that attracted over 200,000 students across two offerings, and founded UW's Data Science for Social Good program. He has a Ph.D. in Computer Science from Portland State University and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.