Publications

Toward an end-to-end framework for modeling, monitoring and anomaly detection for scientific workflows

Abstract

Modern science is often conducted on large-scale, distributed, heterogeneous and high-performance computing infrastructures. Increasingly, the scale and complexity of both the applications and the underlying execution platforms have been growing. Scientific workflows have emerged as a flexible representation to declaratively express complex applications with data and control dependences. However, it is extremely challenging for scientists to execute their science workflows in a reliable and scalable way due to a lack of understanding of expected and realistic behavior of complex scientific workflows on large-scale and distributed HPC systems. This is exacerbated by failures and anomalies in large-scale systems and applications, which makes detecting, analyzing and acting on anomaly events challenging. In this work, we present a prototype of an end-to-end system for modeling and diagnosing the runtime …

Metadata

publication
2016 IEEE International Parallel and Distributed Processing Symposium …, 2016
year
2016
publication date
2016/5/23
authors
Anirban Mandal, Paul Ruth, Ilya Baldin, Dariusz Krol, Gideon Juve, Rajiv Mayani, Rafael Ferreira Da Silva, Ewa Deelman, Jeremy Meredith, Jeffrey Vetter, Vickie Lynch, Ben Mayer, James Wynne, Mark Blanco, Chris Carothers, Justin Lapre, Brian Tierney
link
https://ieeexplore.ieee.org/abstract/document/7530026/
resource_link
http://scitech.isi.edu/wordpress/wp-content/papercite-data/pdf/mandal-lspp-2016.pdf
conference
2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
pages
1370-1379
publisher
IEEE