I. Introduction
Recommendation models have become indispensable in daily life and carry substantial commercial value in areas such as retailing [1], [2], media [3], advertising [4], and social networks [5]. For instance, recommendation systems generate approximately 35% of purchases on Amazon [6] and over $1B per year for Netflix [3]. Meanwhile, Meta reported that recommendation models consume over 79% of its AI inference cycles [7]. As shown in Fig. 1, the serving workflow of a deep learning recommendation system (DLRS) mainly consists of three phases: filtering item candidates via match and recall, predicting the click-through rate (CTR) of each candidate for a given user query, and selecting the candidates with the top few CTR predictions as the recommendation. Among these phases, recommendation model (a.k.a. CTR model) inference is the most critical for serving latency. To serve requests, recommendation models typically require three types of features, namely user, query, and item features, where query features are needed only for recommendations within search engines. In production, Baidu has reported that its search engine must handle billions of advertisement candidates (i.e., items) for each user query [4].
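The three-phase serving workflow described above can be sketched in a few lines of Python. This is a toy illustration only: the feature fields, the rule-based recall filter, and the overlap-based CTR score are all hypothetical stand-ins, whereas a production DLRS would run a learned retrieval stage and a deep CTR model.

```python
def recall(candidates, user):
    """Phase 1: cheaply filter item candidates (toy match/recall rule)."""
    return [c for c in candidates if c["category"] in user["interests"]]

def predict_ctr(user, item):
    """Phase 2: toy CTR score; a real system runs a deep model here."""
    overlap = len(set(user["interests"]) & set(item["tags"]))
    return overlap / (1 + len(item["tags"]))

def recommend(candidates, user, k=2):
    """Phase 3: rank recalled items by predicted CTR and return the top k ids."""
    recalled = recall(candidates, user)
    ranked = sorted(recalled, key=lambda c: predict_ctr(user, c), reverse=True)
    return [c["id"] for c in ranked[:k]]

# Hypothetical item and user features for illustration.
items = [
    {"id": "a", "category": "media", "tags": ["film", "drama"]},
    {"id": "b", "category": "retail", "tags": ["shoes"]},
    {"id": "c", "category": "media", "tags": ["film", "comedy", "drama"]},
]
user = {"interests": ["media", "film", "drama"]}
print(recommend(items, user, k=2))  # item "b" is dropped at recall; "a" and "c" are ranked by CTR
```

Note that the expensive step is Phase 2: every recalled candidate requires one model inference per query, which is why CTR model inference dominates serving latency at the scale (billions of candidates per query) cited above.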