GPU-enabled Function-as-a-Service for Machine Learning Inference

Ming Zhao; Kritshekhar Jha; Sungho Hong

doi:10.1109/IPDPS54959.2023.00096

Engineering, Ira A. Fulton Schools of (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Function-as-a-Service (FaaS) is emerging as an important cloud computing service model because it can improve the scalability and usability of a wide range of applications, especially machine learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks rely heavily on GPUs to achieve high performance; however, existing FaaS solutions lack support for GPUs. The event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must account for the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that lets ML inference functions use GPUs efficiently to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it caches ML models in GPU memory to improve the performance of model inference functions, and it manages GPU memories globally to improve cache utilization. Third, it offers co-designed GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and that the proposed locality-aware scheduler achieves a speedup of 48x compared to the default, load-balancing-only schedulers.
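
The abstract names two mechanisms, caching models in GPU memory and locality-aware scheduling, without giving their details. The following is a minimal illustrative sketch of how such a policy could look, not the authors' implementation. Every name in it (`Gpu`, `pick_gpu`, `dispatch`, `max_queue`) is hypothetical, and the LRU eviction policy and the queue-length threshold are assumptions made for the example; each model is also assumed to fit in a single GPU's memory.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """Hypothetical view of one GPU: a memory budget, a queue length, and an LRU model cache."""
    name: str
    capacity_mb: int
    queued: int = 0
    # Maps model name -> size in MB; insertion order doubles as LRU order.
    cache: OrderedDict = field(default_factory=OrderedDict)

    def used_mb(self) -> int:
        return sum(self.cache.values())

    def load(self, model: str, size_mb: int) -> bool:
        """Return True on a cache hit; on a miss, evict LRU models until the new one fits."""
        if model in self.cache:
            self.cache.move_to_end(model)  # mark as most recently used
            return True
        while self.cache and self.used_mb() + size_mb > self.capacity_mb:
            self.cache.popitem(last=False)  # evict the least recently used model
        self.cache[model] = size_mb  # assumes size_mb <= capacity_mb
        return False


def pick_gpu(gpus, model, max_queue=4):
    """Prefer the least-loaded GPU that already caches the model (locality).

    If no GPU caches it, or every holder is at max_queue (an assumed
    threshold), fall back to plain load balancing over all GPUs.
    """
    holders = [g for g in gpus if model in g.cache and g.queued < max_queue]
    candidates = holders or gpus
    return min(candidates, key=lambda g: g.queued)


def dispatch(gpus, model, size_mb):
    """Route one inference request; return the chosen GPU name and whether it hit the cache."""
    gpu = pick_gpu(gpus, model)
    hit = gpu.load(model, size_mb)
    gpu.queued += 1
    return gpu.name, hit
```

In this sketch, a repeated request for the same model is routed back to the GPU that already holds it, which avoids another host-to-GPU transfer of the parameters, while the queue threshold keeps a popular model from overloading one GPU. A load-balancing-only scheduler corresponds to always taking the fallback branch of `pick_gpu`.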

Original language	English (US)
Title of host publication	Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	918-928
Number of pages	11
ISBN (Electronic)	9798350337662
DOIs	https://doi.org/10.1109/IPDPS54959.2023.00096
State	Published - 2023
Event	37th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023 - St. Petersburg, United States
Duration	May 15 2023 → May 19 2023

Publication series

Name	Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023

Conference

Conference	37th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023
Country/Territory	United States
City	St. Petersburg
Period	5/15/23 → 5/19/23

Keywords

Caching
Function-as-a-Service
GPU scheduling
Machine learning inference

ASJC Scopus subject areas

Artificial Intelligence
Computer Networks and Communications
Hardware and Architecture
Information Systems

Cite this

Zhao, M., Jha, K., & Hong, S. (2023). GPU-enabled Function-as-a-Service for Machine Learning Inference. In Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023 (pp. 918-928). (Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IPDPS54959.2023.00096

GPU-enabled Function-as-a-Service for Machine Learning Inference. / Zhao, Ming; Jha, Kritshekhar; Hong, Sungho.
Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023. Institute of Electrical and Electronics Engineers Inc., 2023. p. 918-928 (Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023).

Zhao, M, Jha, K & Hong, S 2023, GPU-enabled Function-as-a-Service for Machine Learning Inference. in Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023. Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023, Institute of Electrical and Electronics Engineers Inc., pp. 918-928, 37th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023, St. Petersburg, United States, 5/15/23. https://doi.org/10.1109/IPDPS54959.2023.00096

Zhao M, Jha K, Hong S. GPU-enabled Function-as-a-Service for Machine Learning Inference. In Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023. Institute of Electrical and Electronics Engineers Inc. 2023. p. 918-928. (Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023). doi: 10.1109/IPDPS54959.2023.00096

Zhao, Ming ; Jha, Kritshekhar ; Hong, Sungho. / GPU-enabled Function-as-a-Service for Machine Learning Inference. Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023. Institute of Electrical and Electronics Engineers Inc., 2023. pp. 918-928 (Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023).

@inproceedings{3bd9872ab04e4b58accffe37b328c161,

title = "GPU-enabled Function-as-a-Service for Machine Learning Inference",

abstract = "Function-as-a-Service (FaaS) is emerging as an important cloud computing service model as it can improve the scalability and usability of a wide range of applications, especially Machine-Learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks heavily rely on GPUs to achieve high performance; however, support for GPUs is currently lacking in the existing FaaS solutions. The unique event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must consider the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that enables ML inference functions to efficiently utilize GPUs to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it provides caching of ML models in GPU memory to improve the performance of model inference functions and global management of GPU memories to improve cache utilization. Third, it offers co-designed GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and the proposed locality-aware scheduler achieves a speedup of 48x compared to the default, load balancing only schedulers.",

keywords = "Caching, Function-as-a-Service, GPU scheduling, Machine learning inference",

author = "Ming Zhao and Kritshekhar Jha and Sungho Hong",

note = "Funding Information: This work is partly supported by National Science Foundation awards CNS-1955593 and OAC-2126291. Publisher Copyright: {\textcopyright} 2023 IEEE.; 37th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023 ; Conference date: 15-05-2023 Through 19-05-2023",

year = "2023",

doi = "10.1109/IPDPS54959.2023.00096",

language = "English (US)",

series = "Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "918--928",

booktitle = "Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023",

}

TY - GEN

T1 - GPU-enabled Function-as-a-Service for Machine Learning Inference

AU - Zhao, Ming

AU - Jha, Kritshekhar

AU - Hong, Sungho

PY - 2023

Y1 - 2023

AB - Function-as-a-Service (FaaS) is emerging as an important cloud computing service model as it can improve the scalability and usability of a wide range of applications, especially Machine-Learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks heavily rely on GPUs to achieve high performance; however, support for GPUs is currently lacking in the existing FaaS solutions. The unique event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must consider the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that enables ML inference functions to efficiently utilize GPUs to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it provides caching of ML models in GPU memory to improve the performance of model inference functions and global management of GPU memories to improve cache utilization. Third, it offers co-designed GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and the proposed locality-aware scheduler achieves a speedup of 48x compared to the default, load balancing only schedulers.

KW - Caching

KW - Function-as-a-Service

KW - GPU scheduling

KW - Machine learning inference

UR - http://www.scopus.com/inward/record.url?scp=85166675105&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85166675105&partnerID=8YFLogxK

U2 - 10.1109/IPDPS54959.2023.00096

DO - 10.1109/IPDPS54959.2023.00096

M3 - Conference contribution

AN - SCOPUS:85166675105

T3 - Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023

SP - 918

EP - 928

BT - Proceedings - 2023 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 37th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023

Y2 - 15 May 2023 through 19 May 2023

ER -
