A Taxonomy of Error Sources in HPC I/O Machine Learning Models

Mihailo Isakov, Mikaela Currier, Eliakin Del Rosario, Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert B. Ross, Glenn K. Lockwood, Michel A. Kinsy

Research output: Chapter in Book/Report/Conference proceedingConference contribution

1 Scopus citations

Abstract

I/O efficiency is crucial to productivity in scientific computing, but the growing complexity of HPC systems and applications complicates efforts to understand and optimize I/O behavior at scale. Data-driven machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed. We analyze four years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application and system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.

Original languageEnglish (US)
Title of host publicationProceedings of SC 2022
Subtitle of host publicationInternational Conference for High Performance Computing, Networking, Storage and Analysis
PublisherIEEE Computer Society
ISBN (Electronic)9781665454445
DOIs
StatePublished - 2022
Event2022 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2022 - Dallas, United States
Duration: Nov 13 2022Nov 18 2022

Publication series

NameInternational Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume2022-November
ISSN (Print)2167-4329
ISSN (Electronic)2167-4337

Conference

Conference2022 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2022
Country/TerritoryUnited States
CityDallas
Period11/13/2211/18/22

Keywords

  • High performance computing
  • I/O
  • machine learning
  • storage

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Fingerprint

Dive into the research topics of 'A Taxonomy of Error Sources in HPC I/O Machine Learning Models'. Together they form a unique fingerprint.

Cite this