TY - GEN
T1 - Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
AU - Zhao, Mark
AU - Agarwal, Niket
AU - Basant, Aarti
AU - Gedik, Bugra
AU - Pan, Satadru
AU - Ozdal, Mustafa
AU - Komuravelli, Rakesh
AU - Pan, Jerry
AU - Bao, Tianshu
AU - Lu, Haowei
AU - Narayanan, Sundaram
AU - Langman, Jack
AU - Wilfong, Kevin
AU - Rastogi, Harsha
AU - Wu, Carole-Jean
AU - Kozyrakis, Christos
AU - Pol, Parik
N1 - Funding Information:
We would like to thank the many engineers in the numerous infrastructure and hardware teams that build, support, and maintain the systems and hardware that compose Meta’s DSI pipeline. We also thank Daniel Ford, Dheevatsa Mudigere, Chunqiang Tang, Matei Zaharia, and the anonymous reviewers for their feedback on this paper. Christos Kozyrakis is supported by the Stanford Platform Lab and its affiliate members.
Publisher Copyright:
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2022/6/18
Y1 - 2022/6/18
AB - Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSA) are used to train increasingly complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how hundreds of models are collaboratively trained across geo-distributed datacenters via diverse and continuous training jobs. These training jobs read and heavily filter massive and evolving datasets, resulting in popular features and samples used across training jobs. We measure the intense network, memory, and compute resources required by each training job to preprocess samples during training. Finally, we synthesize key takeaways based on our production infrastructure characterization. These include identifying hardware bottlenecks, discussing opportunities for heterogeneous DSI hardware, motivating research in datacenter scheduling and benchmark datasets, and assimilating lessons learned in optimizing DSI infrastructure.
KW - Data ingestion
KW - Data storage
KW - Databases
KW - Distributed systems
KW - Machine learning systems
UR - http://www.scopus.com/inward/record.url?scp=85132841006&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132841006&partnerID=8YFLogxK
U2 - 10.1145/3470496.3533044
DO - 10.1145/3470496.3533044
M3 - Conference contribution
AN - SCOPUS:85132841006
T3 - Proceedings - International Symposium on Computer Architecture
SP - 1042
EP - 1057
BT - ISCA 2022 - Proceedings of the 49th Annual International Symposium on Computer Architecture
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 49th IEEE/ACM International Symposium on Computer Architecture, ISCA 2022
Y2 - 18 June 2022 through 22 June 2022
ER -