Publications

Automated hypothesis testing with large scientific data repositories

Abstract

The automation of important aspects of scientific data analysis would significantly accelerate the pace of science and innovation. Although considerable work has been done toward that automation, the hypothesize-test-evaluate discovery cycle is still largely carried out by hand by researchers. This introduces a significant human bottleneck, leading to inefficiencies, potential errors, and incomplete exploration of the hypothesis and data analysis space. We introduce a novel approach to automate the hypothesize-test-evaluate discovery cycle with an intelligent system that a scientist can task with testing hypotheses of interest against a data repository. Our approach captures three types of data analytics knowledge: 1) common data analytic methods, represented as semantic workflows; 2) meta-analysis methods that aggregate their results, represented as meta-workflows; and 3) data analysis strategies that specify, for a given type of hypothesis, what data and methods to use, represented as lines of inquiry. Given a hypothesis specified by a scientist, appropriate lines of inquiry are triggered, which leads to retrieving relevant datasets, running relevant workflows on that data, and finally running meta-workflows on the workflow results. The scientist is then presented with a level of confidence in the initial hypothesis, a revised hypothesis, and possibly new hypotheses. We have implemented this approach in the DISK system and applied it to multi-omics data analysis.
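
The following minimal Python sketch illustrates the control flow described in the abstract: a hypothesis triggers matching lines of inquiry, each of which retrieves data, runs its workflows, and aggregates the results with a meta-workflow. All names here (LineOfInquiry, data_query, meta_workflow, test_hypothesis) are hypothetical illustrations, and the simple string-matching trigger stands in for the semantic matching the paper describes; this is not the DISK implementation.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LineOfInquiry:
    """A data analysis strategy: for one type of hypothesis, which data to
    retrieve, which workflows to run, and which meta-workflow aggregates
    the workflow results. (Illustrative structure, not the DISK schema.)"""
    hypothesis_pattern: str                      # type of hypothesis this line handles
    data_query: Callable[[str], List[str]]       # retrieves relevant datasets
    workflows: List[Callable[[str], dict]]       # analysis methods run per dataset
    meta_workflow: Callable[[List[dict]], dict]  # aggregates workflow results

def test_hypothesis(hypothesis: str, lines_of_inquiry: List[LineOfInquiry]) -> List[dict]:
    """Trigger every line of inquiry that matches the hypothesis, run its
    workflows on the retrieved datasets, and aggregate with its meta-workflow.
    Each returned item would carry, e.g., a confidence value and a possibly
    revised hypothesis."""
    results = []
    for loi in lines_of_inquiry:
        if loi.hypothesis_pattern not in hypothesis:   # naive trigger check (placeholder)
            continue
        datasets = loi.data_query(hypothesis)          # retrieve relevant data
        workflow_results = [wf(ds) for ds in datasets for wf in loi.workflows]
        results.append(loi.meta_workflow(workflow_results))
    return results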

Metadata

publication
Proceedings of the Fourth Annual Conference on Advances in Cognitive Systems …, 2016
year
2016
publication date
2016/6
authors
Yolanda Gil, Daniel Garijo, Varun Ratnakar, Rajiv Mayani, Ravali Adusumilli, Hunter Boyce, Parag Mallick
link
https://dgarijo.com/papers/acs2016.pdf
journal
Proceedings of the Fourth Annual Conference on Advances in Cognitive Systems (ACS)
volume
2
pages
4