Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

Zhaoxia Deng; Jongsoo Park; Ping Tak Peter Tang; Haixin Liu; Jie Yang; Hector Yuen; Jianyu Huang; Daya Khudia; Xiaohan Wei; Ellie Wen; Dhruv Choudhary; Raghuraman Krishnamoorthi; Carole Jean Wu; Satish Nadathur; Changkyu Kim; Maxim Naumov; Sam Naghshineh; Mikhail Smelyanskiy

doi:10.1109/MM.2021.3081981

Research output: Contribution to journal › Article › peer-review

7 Scopus citations

Abstract

Tremendous success of machine learning (ML) and the unabated growth in model complexity motivated many ML-specific designs in hardware architectures to speed up the model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, recommender systems important to Facebook's personalization services are demanding and complex: They must serve billions of users per month responsively with low latency while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant effort. In this article, we share our search strategies to adapt reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the tool chain to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering, and advance the state of the art of ML in industry.

Original language	English (US)
Pages (from-to)	93-100
Number of pages	8
Journal	IEEE Micro
Volume	41
Issue number	5
DOIs	https://doi.org/10.1109/MM.2021.3081981
State	Published - Sep 1 2021
Externally published	Yes

ASJC Scopus subject areas

Software
Hardware and Architecture
Electrical and Electronic Engineering

Cite this

Deng, Z., Park, J., Tang, P. T. P., Liu, H., Yang, J., Yuen, H., Huang, J., Khudia, D., Wei, X., Wen, E., Choudhary, D., Krishnamoorthi, R., Wu, C. J., Nadathur, S., Kim, C., Naumov, M., Naghshineh, S., & Smelyanskiy, M. (2021). Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale. IEEE Micro, 41(5), 93-100. https://doi.org/10.1109/MM.2021.3081981

@article{a3f98cb0be4042c588f0d6037321a2be,
title = "Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale",
abstract = "Tremendous success of machine learning (ML) and the unabated growth in model complexity motivated many ML-specific designs in hardware architectures to speed up the model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, recommender systems important to Facebook's personalization services are demanding and complex: They must serve billions of users per month responsively with low latency while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant effort. In this article, we share our search strategies to adapt reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the tool chain to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering, and advance the state of the art of ML in industry.",
author = "Zhaoxia Deng and Jongsoo Park and Tang, {Ping Tak Peter} and Haixin Liu and Jie Yang and Hector Yuen and Jianyu Huang and Daya Khudia and Xiaohan Wei and Ellie Wen and Dhruv Choudhary and Raghuraman Krishnamoorthi and Wu, {Carole Jean} and Satish Nadathur and Changkyu Kim and Maxim Naumov and Sam Naghshineh and Mikhail Smelyanskiy",
note = "Publisher Copyright: {\textcopyright} 1981-2012 IEEE.",
year = "2021",
month = sep,
day = "1",
doi = "10.1109/MM.2021.3081981",
language = "English (US)",
volume = "41",
pages = "93--100",
journal = "IEEE Micro",
issn = "0272-1732",
publisher = "IEEE Computer Society",
number = "5",
}

TY - JOUR
T1 - Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale
AU - Deng, Zhaoxia
AU - Park, Jongsoo
AU - Tang, Ping Tak Peter
AU - Liu, Haixin
AU - Yang, Jie
AU - Yuen, Hector
AU - Huang, Jianyu
AU - Khudia, Daya
AU - Wei, Xiaohan
AU - Wen, Ellie
AU - Choudhary, Dhruv
AU - Krishnamoorthi, Raghuraman
AU - Wu, Carole Jean
AU - Nadathur, Satish
AU - Kim, Changkyu
AU - Naumov, Maxim
AU - Naghshineh, Sam
AU - Smelyanskiy, Mikhail
PY - 2021/9/1
Y1 - 2021/9/1
N2 - Tremendous success of machine learning (ML) and the unabated growth in model complexity motivated many ML-specific designs in hardware architectures to speed up the model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, recommender systems important to Facebook's personalization services are demanding and complex: They must serve billions of users per month responsively with low latency while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant effort. In this article, we share our search strategies to adapt reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the tool chain to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering, and advance the state of the art of ML in industry.

AB - Tremendous success of machine learning (ML) and the unabated growth in model complexity motivated many ML-specific designs in hardware architectures to speed up the model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, recommender systems important to Facebook's personalization services are demanding and complex: They must serve billions of users per month responsively with low latency while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant effort. In this article, we share our search strategies to adapt reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the tool chain to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering, and advance the state of the art of ML in industry.

UR - http://www.scopus.com/inward/record.url?scp=85107209568&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107209568&partnerID=8YFLogxK
U2 - 10.1109/MM.2021.3081981
DO - 10.1109/MM.2021.3081981
M3 - Article
AN - SCOPUS:85107209568
SN - 0272-1732
VL - 41
SP - 93
EP - 100
JO - IEEE Micro
JF - IEEE Micro
IS - 5
ER -
