
Exploiting Structured Feature and Runtime Isolation for High-Performant Recommendation Serving


Abstract:

Recommendation serving with deep learning models is one of the most valuable services of modern e-commerce companies. In production, high-performant recommendation serving systems play an essential role in accommodating billions of recommendation queries under stringent service-level agreements. Unfortunately, existing model serving frameworks fail to achieve efficient serving due to unique challenges such as 1) the input format mismatch between service needs and the model's ability and 2) heavy software contention when concurrently executing the constrained operations. To address these challenges, we propose RecServe, a high-performant serving system for recommendation built on two optimized designs: structured features and SessionGroups. With structured features, RecServe packs single-user-multiple-candidates inputs by semi-automatically transforming computation graphs with annotated input tensors, which significantly reduces redundant network transmission, data movement, and useless computation. With SessionGroups, RecServe further adopts resource isolation across multiple compute streams and a cost-aware operator scheduler with a critical-path-based scheduling policy to enable concurrent kernel execution, further improving serving throughput. Experimental results demonstrate that RecServe achieves maximum speedups of 12.3× and 22.0× over the state-of-the-art serving system on CPU and GPU platforms, respectively.
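The single-user-multiple-candidates packing idea can be illustrated with a minimal sketch. The function names and the dot-product scoring below are hypothetical stand-ins, not RecServe's actual API; the point is only that the shared user feature vector is transmitted once rather than replicated per candidate.

```python
def pack_request(user_feat, item_feats):
    # Send the shared user feature vector once, plus all candidate
    # item vectors, instead of replicating the user vector per candidate.
    return {"user": user_feat, "items": item_feats}

def payload_size(packed):
    # Number of feature values transmitted for the packed layout.
    return len(packed["user"]) + sum(len(v) for v in packed["items"])

def replicated_size(user_feat, item_feats):
    # Values transmitted when the user vector is copied for every candidate
    # (the conventional one-user-one-candidate layout).
    return len(item_feats) * len(user_feat) + sum(len(v) for v in item_feats)

def score(packed):
    # Toy CTR proxy: dot product of the single user vector with each
    # candidate item vector; the user vector is broadcast, not copied.
    u = packed["user"]
    return [sum(a * b for a, b in zip(u, v)) for v in packed["items"]]
```

For one user vector of length d and n candidates, the packed payload carries d + n·d_item values instead of n·(d + d_item), so the saving grows with the number of candidates per query.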
Published in: IEEE Transactions on Computers ( Volume: 73, Issue: 11, November 2024)
Page(s): 2474 - 2487
Date of Publication: 28 August 2024


I. Introduction

Recommendation models have become indispensable in our daily life and exhibit substantial commercial value in areas such as retailing [1], [2], media [3], advertising [4], and social networks [5]. Specifically, recommendation systems generate approximately 35% of purchases on Amazon [6] and over $1B per year for Netflix [3]. Meanwhile, Meta reported that recommendation models consume over 79% of its AI inference cycles [7]. As shown in Fig. 1, a typical serving workflow of a deep learning recommendation system (DLRS) consists of three phases: filtering item candidates with match and recall, predicting the click-through rate (CTR) of all item candidates for each user query, and choosing the candidates with the top few CTR predictions for the recommendation. Among these procedures, recommendation model (a.k.a. CTR model) inference is the most critical for serving latency. To serve a query, recommendation models generally need three types of features (user, query, and item), where query features are required only for recommendations within search engines. In production, Baidu has reported that its search engine must handle billions of advertisement candidates (i.e., items) for each user query [4].
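The three-phase workflow above can be sketched as follows. Here `recall_fn` and `ctr_model` are hypothetical placeholders for the production match/recall stage and the CTR model; the sketch only shows how the phases compose.

```python
import heapq

def serve(user, catalog, recall_fn, ctr_model, k=10):
    # Phase 1: match & recall filters the full catalog down to candidates.
    candidates = recall_fn(user, catalog)
    # Phase 2: the CTR model scores every remaining candidate for this user.
    scores = {item: ctr_model(user, item) for item in candidates}
    # Phase 3: return the k candidates with the highest predicted CTR.
    return heapq.nlargest(k, scores, key=scores.get)
```

Because phase 2 runs the CTR model once per candidate for every query, its inference cost dominates end-to-end serving latency, which is why it is the focus of optimization in this paper.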
