Preference-Aware Dynamic Bitrate Adaptation for Mobile Short-Form Video Feed Streaming

Short-form video feeds have become increasingly popular among younger generations: mobile users watch videos of a few seconds one-by-one within a session or a predefined list. The common solution to improve the quality of experience (QoE) for short-form video feeds is to implement dynamic adaptive streaming strategies that decide the bitrate for each video. However, legacy bitrate adaptation strategies fail to differentiate the weights of the videos in the list according to user preferences, wasting bandwidth on downloading content that users skip immediately without watching. In this paper, we propose RecDASH, which consists of an attention-based user modeling module leveraging advanced recommendation algorithms and a reinforcement learning (RL) based bitrate adaptation module. Specifically, we use a gated recurrent unit (GRU) network with an attention mechanism to encode the session into a representation vector. The RL module then combines the session representation with other observations from the playback and yields an appropriate bitrate for the next short-form video to optimize a given QoE objective. The low space and time complexity of the model enables easy deployment on mobile devices. Trace-driven emulations verify the efficiency of RecDASH compared to several state-of-the-art streaming strategies, with at least a 5%-15% improvement in video quality under various QoE objectives.


I. INTRODUCTION
According to statistics from Cisco [1], 75% of all mobile data will be consumed by videos by 2020. Given the abundant bandwidth of today's access networks, communicating through video content has never been easier over feed-based mobile social platforms (e.g., Facebook and Instagram). Compared to long-form video, the short-form video feed is better at capturing attention (especially for younger generations): it grabs the user's attention right away and consolidates the message within a few seconds of video. Many video service providers have launched short-form video services such as TikTok, Vine, and Lasso, which allow users to upload and share self-filmed video clips of a few seconds.
(The associate editor coordinating the review of this manuscript and approving it for publication was Xiaochun Cheng.)
In most short-form video feed applications, there are many scenarios where users follow a predefined video list rather than searching for videos one-by-one, such as:
(1) a billboard chart ordered by the popularity of the videos; (2) a list of videos from a specific creator sorted by release time; (3) a list-based recommendation that avoids overloading servers with highly frequent requests. It cannot be guaranteed that every short-form video attracts the user when it is displayed. Users have two options in short-form video feed applications: (1) watch the entire short video if it is interesting, or (2) quickly skip the videos in the list that are against their interests. Apparently, requesting high bitrates for videos that users skip wastes bandwidth. Hence, to improve the quality of experience (QoE) for users under limited and dynamic bandwidth, video service providers should assign preference-aware bitrates to the videos in the list for a specific user along with the playback.
The past few years have witnessed the success of long-form video bitrate adaptation strategies, which guide clients to request an appropriate bitrate for future video chunks based on playback statistics. However, these bitrate adaptation strategies may not work well for streaming short-form video feeds due to the following challenges.
(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
• Differentiated weights of short videos. Long-form video bitrate adaptation strategies treat all chunks of the entire video equally. However, it is not worthwhile to request a high bitrate for short-form videos that will probably be skipped by users immediately; otherwise, bandwidth is wasted on downloading content that is never displayed.
• Balance between efficiency and accuracy. Since short-form video services mainly target mobile devices, it is unwise to use a complex model to pursue a marginal gain in the accuracy of recommendation and bitrate allocation. We must balance efficiency against accuracy, using models with low space and time complexity that still yield satisfactory performance.
• QoE of short-form video feed. Strategies for streaming long-form video have reached a consensus on maximizing bitrate, minimizing rebuffering, and maximizing smoothness. However, such streaming strategies cannot guarantee the quality of specific video chunks, while a short-form QoE should reward high quality on the user-preferred videos.
To achieve a better user experience, it is essential for the short-form video recommendation module to deliver videos that the user prefers. With frequent user interactions with different videos, session-based recommendation for short-form videos appears adequate and easy at first glance. However, due to the short length of each video, the recommendation module generates a list of videos the user may favor based on the user's feedback on previous video lists (as in list-wise recommendation in the news-feed scenario [2]), instead of reacting instantly to deliver the single next video. This means the videos in the currently playing list are fixed. When the recommendation is inaccurate or, more likely, the user's preference shifts during the playback of the list, some not-yet-played videos in the current list can be identified as unattractive. These disliked videos still have to be played since the list is fixed, but allocating high bitrates to them is an obvious waste of bandwidth.
Our research findings indicate that recent advances in recommender systems and reinforcement learning (RL) based streaming strategies hold promise for addressing these challenges. In the literature, many methods have been proposed for capturing user interests and predicting click-through rate (CTR); they model users and items (i.e., short videos in this paper) as embedding vectors and predict whether a certain group of users would click (or fully watch) the items in the list given the impressions (i.e., displays). Though the order of a short-form video feed list is usually generated by state-of-the-art prediction models, few works focus on integrating the intermediate or final recommendation results into the bitrate allocation scheme. Meanwhile, because users weigh the QoE metrics of videos differently, RL-based streaming strategies have been proposed to adaptively learn the optimal policy for maximizing a given QoE objective in any form. A prior attempt uses a cascaded RL design to ensure the quality of annotated video chunks. In the short-form video feed streaming task, we are expected to dynamically allocate appropriate bitrates to the short-form videos in the list, along with predicting whether the user will fully watch the next short-form video.
In this paper, we propose RecDASH, a pull-based video streaming strategy that leverages an advanced recommendation algorithm for preference prediction and adaptively requests a bitrate for the next short-form video. RecDASH applies an attention mechanism over a gated recurrent unit (GRU) network to identify user preferences from the previous records of the current session and yields a session-level representation of the user behavior. A bitrate adaptation module then differentiates the value of each video to the user and decides the bitrate to request for the next video based on the playback observations and the session-level representation. The RL strategy, i.e., Q Actor-Critic, is adopted for bitrate adaptation, which adaptively optimizes the given objective under various weights on the QoE metrics. We construct a dataset from real-world records in the Avazu CTR prediction dataset and public bandwidth traces from HSDPA to examine the performance of RecDASH. Compared to several state-of-the-art bitrate adaptation strategies for long-form videos, RecDASH improves video quality by at least 5%-15%. Extensive analysis reveals that RecDASH does improve the viewing quality of the short-form videos that interest users, indicating that RecDASH is practical for real-world short-form video feed applications.
We summarize the contributions of this paper as follows:
• We propose a pull-based video streaming strategy for mobile short-form video feeds with a state-of-the-art recommendation technique and an RL model, which adaptively determines the bitrate for the next short-form video based on the records in the current session.
• The proposed RecDASH can adapt to QoE objectives with various weights on the QoE metrics in the short-form video feed by integrating the session-level representation learned by the attention-based user behavior modeling method.
• We analyze the complexity of RecDASH and find that RecDASH can be executed on off-the-shelf mobile devices in real time while periodically updating the model parameters.
II. RELATED WORK
A. BITRATE ADAPTATION FOR VIDEO STREAMING
Video service providers have agreed on the standard of Dynamic Adaptive Streaming over HTTP (DASH) [3] to make bitrate adaptation decisions based on the playback statistics of long-form videos. In general, a common bitrate adaptation strategy attempts to optimize QoE by maximizing the quality of video chunks, minimizing rebuffering duration, and maximizing smoothness between consecutive video chunks [4]-[7]. These strategies can be divided into two categories, QoE-sensitive and QoE-insensitive, according to whether a strategy behaves differently given QoE objectives in various forms. For example, buffer-based (BB) rate adaptation [8] is a typical QoE-insensitive strategy that avoids rebuffering based on the playback buffer condition of the mobile device. Among QoE-sensitive video streaming strategies, theoretical control, such as MPC [5], convex optimization, such as BOLA [6], and RL techniques, such as Pensieve [7], are the three most common methods, each of which works well in specific scenarios.
Recently, deep learning methods [9]-[11] have also been leveraged to improve the performance of adaptive video streaming. Beyond that, much effort has been made to exploit collaborative information to facilitate video streaming, e.g., CS2P [12], which improves wireless bandwidth estimation for bitrate adaptation, and SDNDASH [13], which introduces central control to coordinate the usage of limited bandwidth resources. Although these bitrate adaptation strategies attempt to optimize the overall QoE of the playback, they are not appropriate for the short-form video feed because they cannot guarantee high quality for the specific short-form videos that interest users. This paper presents a task-aware QoE objective for the short-form video feed and proposes a pull-based strategy with an RL model for the optimization.

B. SESSION-BASED RECOMMENDATION OF VIDEOS
Recommender systems are among the most successful and widespread applications of machine learning in business, creating a delightful user experience while driving incremental revenue. For online services such as video watching, session-based recommendation is essential to predict user preferences using only the records within the session. Many researchers transform sessions into sequences and capture user interests within the sequences with recurrent neural networks, such as GRU4Rec [14]. Furthermore, some works introduce the attention mechanism [15] and the self-attention mechanism [16] to enhance the learning capability of the recurrent models. Compared with recommending generic items, recommender systems for online videos can retrieve both video-relevant features and collaborative features [17], [18] for accurate recommendation. Content providers usually deploy list-wise recommendation for feed-like content [2] such as videos. In this case, click-through rate (CTR) and full-view rate are two practical indicators that help companies decide which short-form videos should be collected in the list and in what order the videos should be transferred to the user's mobile device [19]. However, most of these recommender systems must be executed on the server side due to their complex models. RecDASH learns the session-level representation via a GRU network with an attention mechanism to predict, on the mobile device side, whether a user will fully view a short-form video, considering both accuracy and efficiency.

III. SYSTEM MODEL
A. SHORT-FORM VIDEO FEED STREAMING
In the scenario of short-form video feed streaming, users watch short-form videos one-by-one following a predefined or selected watch list v = (v_1, v_2, . . . , v_N), where each short-form video v_i is composed of a series of frames with length l_i and N is the length of the watch list. The watch lists are generated by the recommender system or selected by the users themselves. A list is fixed throughout the watching process until it reaches the end, after which the user can leave or watch the short-form videos in another list. During the playback, users can watch the entire short-form videos they like or quickly skip the videos against their interests. Fig. 1 illustrates the collaboration between the bitrate adaptation module and the recommendation module inside the client. When the current video list is about to finish, for example, when 8 out of 10 videos in the list have been watched, the recommendation module generates a list-wise video recommendation, i.e., the next video list, to be forwarded to the bitrate adaptation module. The bitrate adaptation module then takes the user preference information and the bandwidth condition into consideration to provide the client with bitrate allocation decisions for the videos that have not been downloaded yet.
Meanwhile, the client deployed on the mobile device repeatedly sends requests for downloading the short-form videos in the watch list. Since each short-form video v_i may have multiple sizes corresponding to different bitrates, the client needs to determine a bitrate br_i ∈ BR for v_i from the candidate bitrate set BR before the request. The downloaded videos are stored in the playback buffer, and the buffer is consumed by the user's playback. We introduce the client in detail in the following sections.

1) SHORT-FORM VIDEO REQUEST
Different from long-form video streaming, which divides videos into chunks (short segments of consecutive frames) or tiles (spatial portions of a chunk), in the short-form video feed streaming setting the client requests an entire video in a single request. Suppose the client begins to send the request for downloading the i-th video v_i at bitrate br_i at timestamp t_i. Since the network condition is dynamic, we assume the bandwidth varies with time and denote its value at timestamp t by N(t). Then, the download time t^d_i of v_i is determined by:

∫_{t_i}^{t_i + t^d_i} N(t) dt = q_{i,br_i},

where q_{i,br_i} denotes the quality (size) of video i at bitrate br_i. Currently, videos on short-form video platforms are mostly home-made vlogs (short videos sharing daily life), which makes it feasible to use the video size to represent the video quality in the short-form video streaming scenario. Because of this fact, and to simplify the mathematical notation, we assume that videos of the same length and bitrate have the same content size. Note that there may be a short delay Δt_i between the i-th and (i + 1)-th short-form videos. Hence, the relationship between t_i and t_{i+1} can be expressed by:

t_{i+1} = t_i + t^d_i + Δt_i.

2) BUFFER UPDATE
Assume the size of the playback buffer is b_f, where the unit size is measured in seconds of playback time. Define B(t) ∈ [0, b_f] as the remaining playback time of the video content in the buffer, i.e., the buffer occupancy, at timestamp t. The buffer occupancy decreases along with video playback and increases by l_i when the i-th short-form video has been completely downloaded. Let B_i = B(t_i) be the buffer occupancy when the request for v_i is sent. To overcome the cold-start problem, the client first downloads n short-form videos before the playback begins. After this warm-up stage, the buffer occupancy changes according to both the downloading and the playback of the user. In addition, the client drops the remaining playback of the current video in the buffer when the user skips it.
Therefore, the buffer occupancy at t_{i+1} can be calculated by:

B_{i+1} = ρ(B_i + l_i − (t_{i+1} − t_i) − t^s_i),

where t^s_i represents the total length of the user-skipped video content between t_i and t_{i+1}, and ρ(x) = max(x, 0) guarantees that the buffer occupancy is no less than 0.
During the playback, two special events are worth noting:
• Rebuffering given an empty buffer. When the buffer occupancy reaches 0, the playback suffers from rebuffering, and the user must wait until the current short-form video is downloaded into the buffer. Rebuffering hurts the user experience.
• Downloading suspended given a full buffer. When there is no buffer space for a new short-form video v_i, i.e., B_i + l_i > b_f, the downloading process is suspended for a short delay Δt_i:

Δt_i = B_i + l_i − b_f.
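The request/buffer dynamics described above can be sketched in a few lines of Python. This is a crude illustrative toy, not the paper's emulator: it assumes a constant bandwidth, drains the buffer only during downloads, and uses hypothetical names such as `simulate_session`.

```python
def simulate_session(sizes, lengths, bandwidth, buf_cap, warmup=2):
    """Toy model of the buffer dynamics in Section III-A (hypothetical names).

    sizes[i]   -- size of video i at its chosen bitrate (q_{i,br_i}), in Mb
    lengths[i] -- playback length l_i of video i, in seconds
    bandwidth  -- constant bandwidth N(t), in Mb/s (time-varying N(t) omitted)
    buf_cap    -- buffer capacity b_f, in seconds of playback
    warmup     -- number of videos downloaded before playback starts
    """
    buffer_s = 0.0      # B(t): seconds of content currently buffered
    rebuffer_s = 0.0    # accumulated rebuffering time
    for i, (q, l) in enumerate(zip(sizes, lengths)):
        # full buffer: suspend downloading until there is room for video i
        if buffer_s + l > buf_cap:
            delay = buffer_s + l - buf_cap   # playback drains the buffer meanwhile
            buffer_s -= delay
        t_down = q / bandwidth               # download time under constant bandwidth
        if i >= warmup:                      # playback runs concurrently after warm-up
            drained = min(buffer_s, t_down)  # playback consumes buffered seconds
            rebuffer_s += t_down - drained   # empty buffer => rebuffering
            buffer_s -= drained
        buffer_s += l                        # finished download enters the buffer
    return buffer_s, rebuffer_s
```

For example, a single 8 Mb video over a 4 Mb/s link with no warm-up takes 2 s to download, all of which is rebuffering time since the buffer starts empty.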

3) BITRATE ALLOCATION
When the (i − 1)-th video has been completely downloaded into the buffer, the client allocates a bitrate br_i from the candidate bitrate set for the i-th video. The bitrate is usually generated by an algorithm based on the environmental state to optimize several QoE objectives. Formally, given the environmental state s_i, including the bandwidth, buffer occupancy, requested video length, etc., the allocated bitrate br_i for v_i is calculated by br_i = f(s_i), where f is the proposed allocation algorithm.

B. PREFERENCE-AWARE DYNAMIC BITRATE ALLOCATION
To adapt to the changing environment, we follow the DASH (Dynamic Adaptive Streaming over HTTP) framework proposed for long-form video streaming and introduce RecDASH, which adaptively allocates bitrates and optimizes the QoE objectives based on the playback statistics during short-form video feed streaming. Specifically, RecDASH uses reinforcement learning because it is inherently suitable for modeling a dynamic action-picking process (bitrate allocation) in a changing environment (video streaming) to optimize accumulated rewards (QoE).

1) PREFERENCE-AWARE BITRATE ALLOCATION
In short-form video feed streaming, the watch list is fixed for each user. However, users may not be interested in all the videos in the predefined watch list, and they will skip the videos that are against their interests. It is therefore beneficial to take user preferences into consideration during bitrate allocation. For example, under limited bandwidth, the client may allocate low bitrates to the short-form videos that are against the user's interests.
In this paper, we propose a preference prediction module in RecDASH. Let I_i denote the indicator of whether the user is interested in v_i: I_i = 1 if the user fully watches v_i, and I_i = 0 otherwise. The preference prediction module predicts the user preference I_i (1 ≤ i ≤ N) based on the historical watch records before t_i. Then, the user preference results, as well as the playback features, are fed into RecDASH for dynamic bitrate allocation.

2) ANALYSIS OF USER WATCHING TIME
Intuitively, it is reasonable to allocate low bitrates to the short-form videos that users may not like. In this section, we analyze the actual watching time of the videos that are against user interests. Assume users watch the entirety of videos that meet their interests, while videos that do not attract users are only partially watched. Suppose the short-form videos in the watch list all have the same length L. Following the statistical analysis in [20], we assume that the watch time T of each disliked short-form video follows a truncated exponential distribution with density function:

f(t) = λe^{−λt} / (1 − e^{−λL}), 0 ≤ t ≤ L,

where λ is the rate parameter; a larger λ indicates that users tend to skip the videos at an earlier stage. Then, we have the following proposition:
Proposition 1: Assume the user prefers kN short-form videos out of N, where k ∈ (0, 1]. Then, the expected playback time against the user's interests accounts for less than (1 − k)/(kλL) of the total playback time.
Proof: The expectation of the watching time T can be calculated by:

E[T] = ∫_0^L t · λe^{−λt} / (1 − e^{−λL}) dt = 1/λ − L e^{−λL} / (1 − e^{−λL}) < 1/λ.

Since the user only prefers kN videos in the watch list, the proportion of the playback time that is against the user's interests is at most:

(1 − k)N · E[T] / (kN · L) = (1 − k)E[T] / (kL) < (1 − k)/(kλL).

Hence, the proposition is proved.
In real-world scenarios, λ is usually large. Hence, the actual playback time that is against the user's interests is very short, so we can neglect the effect of allocating low bitrates to those videos.
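The claim that disliked videos contribute only a small share of total playback time can be checked numerically. The sketch below samples watch times from the truncated exponential distribution via inverse-CDF sampling; the function names and default parameters are illustrative, not from the paper.

```python
import math
import random

def sample_truncexp(lam, L, rng):
    # Inverse-CDF sampling from the exponential truncated to [0, L]:
    # F(t) = (1 - exp(-lam*t)) / (1 - exp(-lam*L))
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-lam * L))) / lam

def disliked_time_fraction(lam=1.0, L=15.0, k=0.5, n=100000, seed=0):
    """Monte Carlo estimate of the playback-time share of disliked videos.

    Liked videos (fraction k of the list) are watched in full (L seconds);
    disliked ones are watched for a truncated-exponential duration.
    """
    rng = random.Random(seed)
    disliked = sum(sample_truncexp(lam, L, rng) for _ in range(n))
    liked_per_disliked = k / (1.0 - k)   # k*N liked videos per (1-k)*N disliked
    total = disliked + n * liked_per_disliked * L
    return disliked / total
```

With λ = 1 and 15-second videos, the disliked share comes out around 6%, and it shrinks further as λ grows, consistent with the analysis above.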

C. QUALITY OF EXPERIENCE METRICS
The quality of experience (QoE) is a measure of the delight or annoyance of a user's experience with an application, and it is the optimization objective of the short-form video feed streaming system. There are many kinds of QoE components under different scenarios. However, existing QoE measurements are mainly designed for long videos split into consecutive chunks and do not capture user satisfaction as indicated by the viewing duration of short videos. In this paper, we focus on the following metrics that significantly affect the user experience for video v_i (1 ≤ i ≤ N):
1) Effective Video Quality. According to the numerical analysis in the previous section, we only consider the quality of the effective videos, i.e., the videos in the watch list that conform to the user's preferences. The effective video quality can be computed as:

q^e_i = I_i · q_{i,br_i}.

2) Rebuffering. Since rebuffering greatly hurts the user experience, we also take the rebuffering time into consideration:

T^r_i = ρ(t^d_i − B_i).

3) Bitrate Fluctuation. During the video playback, frequent bitrate changes (fluctuation) have a negative impact on the user experience. Hence, it is necessary to minimize the fluctuation, which can be measured by:

F_i = |br_i − br_{i−1}|,

where, specially, we assume br_0 = br_1. Then, the QoE objective for v_i can be formulated as:

QoE_i = µ_1 · q^e_i − µ_2 · T^r_i − µ_3 · F_i,

where µ = (µ_1, µ_2, µ_3) are non-negative weighting parameters. Note that different settings of µ result in different optimization targets. We follow [5], [7] in using the additive form of the QoE objective, which can flexibly adapt to different user preferences. For example, if users have a strong preference for high-quality videos, µ_1 should be set relatively larger than µ_2 and µ_3. On the other hand, if users cannot tolerate delays but can bear low video bitrates, it would be better to set µ_2 larger than the other weights.
To provide high QoE for users, we propose a preference-aware dynamic bitrate allocation mechanism named RecDASH, which determines the bitrates in real time and maximizes the overall QoE objective:
Problem 1: arg max_{br_1, ..., br_N ∈ BR} Σ_{i=1}^{N} QoE_i.
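The additive QoE objective can be made concrete in a few lines. The exact per-term forms below (interest-gated quality, rebuffering penalty, absolute bitrate difference) are an assumed instantiation for illustration; the variable names and default weights are hypothetical.

```python
def qoe(bitrates, interest, rebuf, mu=(1.0, 4.0, 1.0)):
    """Sketch of the additive per-session QoE objective (assumed concrete form).

    bitrates[i] -- br_i allocated to video i (used as a quality proxy)
    interest[i] -- I_i: 1 if the user is interested in video i, else 0
    rebuf[i]    -- rebuffering time incurred while downloading video i
    mu          -- (mu1, mu2, mu3) weights for quality / rebuffering / fluctuation
    """
    mu1, mu2, mu3 = mu
    total = 0.0
    prev = bitrates[0]                 # convention from the text: br_0 = br_1
    for br, I, rb in zip(bitrates, interest, rebuf):
        quality = I * br               # effective quality: only liked videos count
        fluct = abs(br - prev)         # fluctuation vs. the previous video
        total += mu1 * quality - mu2 * rb - mu3 * fluct
        prev = br
    return total
```

Changing `mu` reproduces the trade-offs discussed above: raising mu2 favors policies that avoid rebuffering at the cost of lower bitrates.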

IV. PREFERENCE-AWARE DYNAMIC BITRATE ALLOCATION
RecDASH consists of two major modules, as illustrated in Fig. 2. The Preference Prediction Module (Part I) predicts the user preference over the watch list, while the Bitrate Adaptation Module (Part II) performs dynamic bitrate allocation with deep reinforcement learning based on the monitored statistics of the playback environment. The target of RecDASH is to maximize the user's quality of experience.
A. PREFERENCE PREDICTION MODULE
Given the watch records of the current session, the preference prediction module outputs the probability E(I_i) that the user is interested in short-form video v_i. In short-form video feed streaming, we assume that completely watching a video signifies the user's preference for it, while skipping indicates that the short-form video is against the user's interests. The network architecture of the preference prediction module is shown in Fig. 3.

1) GATED RECURRENT UNIT BASED SEQUENCE ENCODER
In RecDASH, we utilize the Gated Recurrent Unit (GRU) [21] to encode the sequential watching behavior. The GRU is a variant of the RNN that aims at mitigating the vanishing gradient problem [22]. Specifically, it adds a reset gate and an update gate to explicitly determine how much information should be discarded from the previous state and how much should be retained from the current input. The GRU is widely used in session-based recommendation [14], [15], [23], [24]; it is lightweight enough to be deployed on mobile devices and capable of capturing the user's most recent interest, which is crucial for user preference prediction in the short-form video scenario. Hence, we follow this convention and adopt the GRU as the base model of our preference prediction module.
Denote x_i as the embedding of video v_i, 1 ≤ i ≤ N. The activation g_i of the GRU can be computed by:

g_i = (1 − z_i) ⊙ g_{i−1} + z_i ⊙ ĝ_i,

where the update gate z_i is given by:

z_i = σ(W_z x_i + U_z g_{i−1}).

The candidate activation ĝ_i is computed as:

ĝ_i = tanh(W x_i + U(re_i ⊙ g_{i−1})),

where the reset gate re_i is given by:

re_i = σ(W_r x_i + U_r g_{i−1}).

We simply use g_i as the GRU-based representation of v_i.
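The gate interplay is easiest to see in a scalar toy. The sketch below runs one step of the standard GRU recurrence with made-up scalar weights (`W_z`, `U_z`, etc. are hypothetical placeholders for the weight matrices, and biases are omitted for brevity):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, g_prev, p):
    """One scalar GRU step following the standard GRU equations.

    x      -- current input x_i (a scalar stand-in for the video embedding)
    g_prev -- previous activation g_{i-1}
    p      -- dict of hypothetical scalar weights: W_z, U_z, W_r, U_r, W, U
    """
    z = sigmoid(p["W_z"] * x + p["U_z"] * g_prev)            # update gate z_i
    re = sigmoid(p["W_r"] * x + p["U_r"] * g_prev)           # reset gate re_i
    g_hat = math.tanh(p["W"] * x + p["U"] * (re * g_prev))   # candidate g-hat_i
    return (1.0 - z) * g_prev + z * g_hat                    # new activation g_i
```

With all weights at zero, both gates sit at 0.5 and the candidate is 0, so each step simply halves the previous activation, illustrating how the update gate blends old state with new input.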

2) ATTENTIVE SEQUENCE ENCODER
To capture the latest information in the watch list, we also involve the attention mechanism [25], which allows the network to dynamically select and combine different parts of the input sequence, generating the attentive representation c_i of the sequence up to v_i:

c_i = Σ_{j=1}^{i} α_{ij} g_j,

where α_{ij} is the normalized weight factor that determines which part of the activation sequence should be emphasized or ignored in the prediction.

3) USER PREFERENCE DECODER
We utilize both the GRU-based representation and the attentive representation for the user preference prediction E(I_i). Specifically, E(I_i) is decoded from g_i and c_i, and the preference prediction model is trained with the cross-entropy loss between I_i and E(I_i).

B. BITRATE ADAPTATION MODULE
During playback, before the request of video v i , the bitrate adaptation module determines the bitrate br i of v i from the playback statistics. To improve the overall QoE of users, we utilize reinforcement learning (RL) to conduct and optimize dynamic bitrate allocation.

1) BITRATE ALLOCATION WITH REINFORCEMENT LEARNING
In the standard RL setting, an agent interacts with an environment E over a number of discrete time steps. At each time step t_i, the agent receives an observation o_i, takes an action a_i according to its policy π, and receives a scalar reward r_i. The discounted return R_i = Σ_{j=i}^{N} γ^{j−i} r_j is the total accumulated reward from time step t_i with discount factor γ ∈ (0, 1]. The goal of reinforcement learning is to maximize the expected return by optimizing the policy. Apparently, bitrate allocation fits the RL framework. Specifically, the client requesting the playback can be seen as the agent, and everything else can be regarded as the environment. When the client is about to allocate a bitrate for video v_i, it collects the playback statistics as the observation o_i. The client updates the previous state s_{i−1} with o_i to obtain the current state s_i. Then, it chooses a bitrate (action) a_i = br_i from the discrete bitrate set BR based on the learnable allocation policy π_θ(a_i|s_i) (parameterized by θ) and receives the user QoE as the reward r_i = QoE_i. The goal of bitrate allocation here is to learn the optimal policy π*_θ that maximizes the expectation of the total discounted QoE:

π*_θ = arg max_θ E_{π_θ}[Q^{π_θ}(s_i, a_i)],

where Q^{π_θ}(s_i, a_i) = E_{π_θ}[R_i | s_i, a_i] is the action-value function, i.e., the expected return from state s_i and action a_i under policy π_θ. Note that Eqn. (19) is a transformation of Problem 1. Following the above process, we can embed the bitrate allocation into the framework of reinforcement learning. From the theoretical proof in [26], the optimal policy in RL is deterministic. Thus, after training the model with an RL algorithm, i.e., Q Actor-Critic [27] in RecDASH, the client only needs to pick the bitrate with the highest probability during playback.
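The discounted return R_i = Σ_{j≥i} γ^{j−i} r_j can be computed for every step with a single backward pass, a standard trick worth making explicit:

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute R_i = sum_{j>=i} gamma^{j-i} * r_j for every step i."""
    R = 0.0
    out = []
    for r in reversed(rewards):
        R = r + gamma * R   # backward recursion: R_i = r_i + gamma * R_{i+1}
        out.append(R)
    return out[::-1]        # restore forward (time) order
```

For instance, rewards [1, 1] with γ = 0.5 yield returns [1.5, 1.0]: the earlier step accumulates the discounted later reward.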

2) BITRATE ALLOCATION MODEL
In RecDASH, the client collects the video id i, the buffer state B_i, and the video length l_i from the playback. The client also records the bandwidth of the last 10 seconds before t_i, i.e., N_i = (N(t_i − 9), N(t_i − 8), . . . , N(t_i)), to capture the bandwidth dynamics, where N(t_i) denotes the bandwidth at timestamp t_i. To account for playback fluctuation, the client should also be aware of the previous bitrate allocation, i.e., br_{i−1}. Besides, to enable preference-aware allocation, the preference prediction I_i as well as the outputs of the GRU encoder g_i and the attentive encoder c_i are also added to the observation. Thus, the observation o_i can be described as the following feature collection:

o_i = (i, B_i, l_i, N_i, br_{i−1}, I_i, g_i, c_i).

To combine the historical observations, we utilize another GRU H_η. The state s_i can then be calculated by:

s_i = H_η(s_{i−1}, o_i),

where s_0 is set to zero. Note that the observations are the consequences of previous allocations, so the actions are implicitly included in the state. We conduct the bitrate allocation based on the policy π_θ(a_i|s_i). Concretely, π_θ uses a linear projection and a softmax layer to convert the state s_i into a discrete action distribution. Besides, as RecDASH uses the Q Actor-Critic algorithm to optimize the model, an extra Q-network (realized by a fully-connected layer) is applied to the state to generate the estimate of the action-value function Q_w(s_i, a_i) for each bitrate a_i.
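The policy head described above (linear projection followed by a softmax over the candidate bitrates) can be sketched directly. The function name and the toy dimensions are hypothetical; biases are omitted for brevity:

```python
import math

def policy_probs(state, theta):
    """Linear projection + softmax turning a state into a bitrate distribution.

    state -- list of state features s_i
    theta -- one weight row per candidate bitrate (the linear projection)
    """
    logits = [sum(w * s for w, s in zip(row, state)) for row in theta]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    Z = sum(exps)
    return [e / Z for e in exps]         # pi_theta(a | s): sums to 1
```

After training, the client simply picks the arg-max bitrate of this distribution, matching the deterministic-policy argument cited in the text.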

3) Q ACTOR-CRITIC BASED MODEL TRAINING
RecDASH exploits Q Actor-Critic algorithm [27] to train the bitrate adaptation module for optimizing the QoE objectives. The Q Actor-Critic algorithm is fully online and incremental, which can be updated during playback.
Specifically, at each iteration i, the client receives the tuple (o_i, s_i, a_i, r_i, o_{i+1}, s_{i+1}) and computes the Q-target:

y_i = r_i + γ Q_w(s_{i+1}, a_{i+1}).

Then, the gradient of w is calculated by minimizing the L2 loss between y_i and Q_w(s_i, a_i):

∇_w (y_i − Q_w(s_i, a_i))^2.

Note that the Q-target y_i and Q_w(s_i, a_i) are both estimates of the action-value function, where the Q-target is a more precise estimate due to the one-step look-ahead. Thus, the Q-network gets closer to the actual action-value function after the parameter update. After that, the client calculates the Q-error:

δ_i = y_i − Q_w(s_i, a_i),

which is an estimate of the advantage. Based on δ_i, the gradient of θ can be deduced as:

δ_i ∇_θ log π_θ(a_i|s_i).

It is worth mentioning that the Q-error equals the gradient of the L2 loss between y_i and Q_w(s_i, a_i) with respect to the Q function, while the update of θ amounts to gradient ascent on the expected value of Q^{π_θ}(s_i, a_i). Besides, the GRU H_η is also updated when Q_w(s_i, a_i) and π_θ(a_i|s_i) are updated. The whole training methodology of RecDASH is shown in Algorithm 1. It first trains the parameters of the preference prediction module based on the user behaviors (line 2). Afterward, the model repeatedly simulates the video playback events and optimizes the bitrate allocation model (lines 3-17). The parameters are trained with the Q Actor-Critic algorithm, which eventually returns the optimal policy π*_θ.
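One Q Actor-Critic step can be sketched on a tabular toy: a table for the critic Q and per-state softmax logits for the actor. This is an illustrative simplification of the update rules above (in RecDASH both are neural networks), with hypothetical names and learning rates:

```python
import math

def q_ac_update(Q, logits, s, a, r, s_next, a_next,
                gamma=0.9, lr_q=0.5, lr_pi=0.1):
    """One Q Actor-Critic step on tabular Q and per-state softmax logits.

    Q[s][a]      -- critic estimate Q_w(s, a)
    logits[s][a] -- actor parameters; pi(a|s) = softmax(logits[s])
    """
    y = r + gamma * Q[s_next][a_next]   # Q-target: one-step look-ahead
    delta = y - Q[s][a]                 # Q-error, an advantage estimate
    Q[s][a] += lr_q * delta             # critic: descend the L2 loss toward y
    # actor: ascend delta * grad log pi(a|s); for softmax, the gradient of
    # log pi(a|s) w.r.t. logit b is 1{b == a} - pi(b|s)
    m = max(logits[s])
    exps = [math.exp(l - m) for l in logits[s]]
    Z = sum(exps)
    for b in range(len(logits[s])):
        grad = (1.0 if b == a else 0.0) - exps[b] / Z
        logits[s][b] += lr_pi * delta * grad
    return delta
```

A positive δ both raises Q(s, a) and shifts probability mass toward the taken action, which is exactly the coupling between critic and actor described in the text.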

4) COMPLEXITY ANALYSIS
Denote L_s as the sequence length, I as the input size of the GRU-based RNN (e.g., the embedding size of the video feature), and H and A_L as the hidden size of the GRU and the attention vector length, respectively.
1) Space Complexity. Note that one GRU has three gate weight blocks, so its parameter count is N_gru = 3(I · H + H^2 + H). The size of the attention parameters is N_att = H · A_L. So the space complexity of the preference prediction module is:

O(N_gru + N_att) = O((I + H + A_L) · H).

The space complexity of the RL model consists of the parameters of the Actor-Critic network and its GRU.
Denote H′ as the hidden size of the GRU in the bitrate adaptation module, O_s as the size of the observation, and A as the size of the fully-connected layer. Its space complexity can be calculated as:

O(3(O_s · H′ + H′^2) + H′ · A).

Typically H and H′ differ only slightly, and A = O(H).
To simplify the notation, we use O(H) to replace H′ and A. The overall space complexity is then:

O((I + H + A_L + O_s) · H).

The model would only occupy up to several megabytes.
2) Time Complexity. We use FLOPs (floating point operations) to measure the time consumption. The time to determine the bitrate allocation for one video is:

O(L_s · (I + H + A_L + O_s) · H).

Thus, this design yields a lightweight model that can be conveniently deployed on mobile devices.
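The "several megabytes" claim is easy to sanity-check by counting GRU parameters. The dimensions below (I = 32, H = 64, etc.) are made-up placeholders, not the paper's configuration; the GRU count follows the 3-gate-block layout with separate input/hidden biases as in common implementations such as PyTorch's `nn.GRU`:

```python
def gru_param_count(input_size, hidden_size, bias=True):
    """Parameters of one GRU layer: three weight blocks of (I*H + H*H), plus biases."""
    per_block = input_size * hidden_size + hidden_size * hidden_size
    params = 3 * per_block
    if bias:
        params += 3 * 2 * hidden_size   # separate input-side and hidden-side biases
    return params

def recdash_size_estimate(I=32, H=64, A_L=32, O_s=16, n_bitrates=6):
    """Rough parameter count for the two modules under assumed dimensions."""
    pred = gru_param_count(I, H) + H * A_L                # encoder GRU + attention
    rl = gru_param_count(O_s, H) + 2 * H * n_bitrates      # state GRU + actor & critic heads
    return pred + rl
```

Under these toy dimensions the total stays in the tens of thousands of parameters, i.e., well under a megabyte even in 32-bit floats, consistent with the lightweight-model argument.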

V. IMPLEMENTATION
The architecture of the system implementation is illustrated in Fig. 4, including the server modules and the client modules.

5) SERVER
The server modules are implemented on a cluster with Gigabit NICs and contain three types of modules. The first is a list-wise recommender system that provides a list of short videos according to the clients' interactions with the previous list. We develop an interface that any list-wise recommender system can implement for online operation; for the emulation, we simply make recommendations based on the dataset. The second is an offline module, which prepares the short-form videos at all bitrates and stores them in the short-video database. The third is an online module, i.e., the request handler, which provides the interface for requests from clients. For each request, the server retrieves the videos at the requested bitrates from the short-form video database and sends them to the client over an HTTP/2 connection.
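The core of the request handler can be reduced to a lookup from (video ID, requested bitrate) to the pre-encoded file. The bitrate ladder below comes from the dataset description in Section VI; the directory layout is a hypothetical illustration, and the real system serves the resulting bytes over HTTP/2.

```python
# Map each candidate bitrate (Mbps) to its rendition name (ladder from the paper).
BITRATE_LADDER = {2: "360p", 8: "720p", 17: "1080p", 31: "2k", 50: "4k"}

def handle_request(video_id: int, bitrate_mbps: int) -> str:
    """Resolve a client request to the pre-encoded file in the video database."""
    if bitrate_mbps not in BITRATE_LADDER:
        raise ValueError(f"unsupported bitrate: {bitrate_mbps} Mbps")
    rendition = BITRATE_LADDER[bitrate_mbps]
    # The offline module is assumed to have stored one file per rendition
    # under this (hypothetical) path scheme.
    return f"/data/short_videos/{video_id}/{rendition}.mp4"
```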

6) CLIENT
We embed the preference prediction module and the RL-based bitrate adaptation module of RecDASH into a client application with PyTorch, which can be easily deployed on mobile devices. At the beginning of playback, the client downloads a configuration file describing the current watch list. During playback, the client collects playback statistics and dynamically allocates a bitrate via both the bitrate adaptation module and the preference prediction module at each time step. The selected bitrate is injected into the request for the next video, which is then inserted into the request queue of the download module. The download module sends the requests one by one. The downloaded videos are fed into the video decoder and then stored in the video buffer, waiting for access from the video player.

7) EMULATION PLATFORM
An emulation platform can not only evaluate the performance of specific streaming strategies, but also help the RL module train on the provided dataset. Given a bandwidth trace and a sequence of short videos with watch durations, the whole playback procedure can be simulated quickly. We therefore develop an emulation platform in Python, with RecDASH implemented in PyTorch. We can then monitor detailed playback statistics, which both help evaluate performance and guide the training of RecDASH. When training is finished, the model is exported to mobile devices and executed in real-world short-video streaming applications.
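A toy version of such an emulator is sketched below: it steps through a watch list, "downloads" each video at its allocated bitrate over a bandwidth trace, and accounts for rebuffering when the buffer drains. The buffer dynamics, the round-robin trace handling, and the skip accounting are simplifying assumptions, not the platform's exact model.

```python
def emulate(videos, bitrates, trace_mbps, buf_cap=30.0):
    """videos: list of (duration_s, watched_s); bitrates: Mbps per video.

    Returns the total rebuffering time in seconds.
    """
    buffer_s, rebuf_s, step = 0.0, 0.0, 0
    for (duration, watched), br in zip(videos, bitrates):
        bw = trace_mbps[step % len(trace_mbps)]      # round-robin over the trace
        step += 1
        dl_time = br * duration / bw                 # seconds to fetch the whole clip
        # Playback drains the buffer while downloading; a shortfall rebuffers.
        if dl_time > buffer_s:
            rebuf_s += dl_time - buffer_s
            buffer_s = 0.0
        else:
            buffer_s -= dl_time
        buffer_s = min(buffer_s + duration, buf_cap)  # enqueue the new clip
        # The user skips after `watched` seconds; the remainder never plays.
        buffer_s = max(buffer_s - (duration - watched), 0.0)
    return rebuf_s
```

For example, a single 10-second video at 8 Mbps over a 16 Mbps link takes 5 s to download, all of which is rebuffering when the buffer starts empty.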

VI. EVALUATION
In this section, we evaluate the performance of RecDASH over several real-world scenarios and compare it with some state-of-the-art algorithms.
A. SETTINGS

1) DATASET GENERATION
We randomly select watch lists and bandwidth traces from public datasets and combine them as the final dataset for the evaluation.
• Videos. We collect ten movies encoded in 4K and randomly segment them into 2000 short-form videos of 10 to 20 seconds in length. We assume there are five candidate bitrates in total and encode these short-form videos with FFmpeg (x264) under the average bitrate (ABR) mode, where each video is encoded at 2 Mbps (360p), 8 Mbps (720p), 17 Mbps (1080p), 31 Mbps (2K), and 50 Mbps (4K).
• Watch lists. We select 3000 watch lists of length 20 from the real-world Avazu CTR prediction dataset. We pair the IDs in the CTR dataset with the randomly generated short-form videos, so each short-form video is assigned a unique ID. A ''0'' in the dataset indicates that the user is not interested in the short-form video, and we assume the user skips this video after a time following a truncated exponential distribution. A ''1'' indicates that the user is interested in the short-form video and will watch it fully. We plot the distribution of the proportion of user-interested videos within a single watch list in Fig. 5, where we find that most users (nearly 90%) do not watch more than 60% of the short videos in a list. For the evaluation, 10% of the watch lists are selected as the test set and are guaranteed not to appear in the training procedure.
• Bandwidth traces. We choose 100 bandwidth traces with various bandwidth patterns from public datasets [28], [29] to simulate different network bandwidth conditions. The average bandwidth (available bandwidth for short-form video streaming) ranges from 8Mbps to 20Mbps, and 80% of bandwidth falls between 10Mbps and 16Mbps. Similar to watch lists, we randomly select 10 bandwidth traces as the test set for evaluation.
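The watch-duration labeling described above can be sketched as follows: "1" entries are fully watched, while "0" entries are skipped after a time drawn from a truncated exponential distribution. The scale parameter is an assumption for illustration; the paper does not report it.

```python
import random

def watch_seconds(label, duration_s, scale=2.0, rng=random):
    """Return how long the user watches a clip of length duration_s."""
    if label == 1:
        return duration_s                      # interested: fully watched
    # Truncated exponential on (0, duration_s): resample until in range.
    while True:
        t = rng.expovariate(1.0 / scale)
        if t < duration_s:
            return t
```

Rejection sampling is the simplest way to truncate the distribution; inverse-transform sampling would avoid the loop but is less transparent for a sketch.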

2) IMPLEMENTATION DETAILS OF RecDASH
In the preference prediction module, the size of the input embedding is set to 96, i.e., a video ID embedding of 32 units and video features of 64 units, and the hidden size of the GRU is set to 64. Different from the preference prediction module, since the features in bitrate allocation have various scales, we conduct feature scaling on all features except the outputs of the GRU encoder and the attentive encoder. The size of each observation is thus 142, and we set the hidden size of the GRU in the bitrate allocation module to 128. We utilize the Adam optimizer [30] to train both modules, with additional linear learning-rate decay and gradient clipping applied to prevent overfitting. The buffer capacity b_f is set to 30 seconds of playback time, a common setting in real-world applications. Besides, we download one short-form video into the buffer before playback to overcome the cold-start problem.
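The feature scaling noted above can be sketched as min-max normalization of each raw playback feature, with the encoder outputs passed through unchanged. The feature names and normalization bounds below are assumptions, not the exact observation layout of RecDASH.

```python
FEATURE_BOUNDS = {                 # assumed (min, max) per raw feature
    "bandwidth_mbps": (0.0, 50.0),
    "buffer_s": (0.0, 30.0),
    "last_bitrate_mbps": (2.0, 50.0),
}

def scale_observation(raw: dict, encoded: list) -> list:
    """Clamp and min-max scale raw features to [0, 1], then append encoder outputs."""
    scaled = []
    for name, (lo, hi) in FEATURE_BOUNDS.items():
        x = min(max(raw[name], lo), hi)        # clamp to the assumed range
        scaled.append((x - lo) / (hi - lo))    # min-max scale to [0, 1]
    return scaled + list(encoded)              # GRU/attentive outputs left unscaled
```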

3) BASELINE METHODS
We adopt five baseline strategies for comparison.
• Bandwidth based strategy (BB): BB evaluates only the bandwidth condition and allocates the highest bitrate that the predicted bandwidth can sustain.
• Popularity based strategy (POP): POP assumes that videos preferred by most users in previous playback traces are likely to be popular with other users in the future. It thus collects the CTR for each video; videos with higher CTRs, i.e., within the top 20%, are allocated a higher bitrate, while the other videos are allocated a low bitrate.
• Preference-based strategy (PB): PB favors videos with a higher predicted user preference. Specifically, it allocates a higher bitrate to videos whose predicted preference exceeds 0.5, while the other videos are allocated a low bitrate.
• Smoothness based strategy (SB): SB aims to avoid temporal fluctuations between adjacent videos by allocating a bitrate similar to that of the previous video.
• Pensieve: Pensieve was originally designed for RL-based bitrate adaptation when streaming long-form videos. We modify it for the short-form video scenario by treating each short-form video as a video chunk. Note that Pensieve does not consider whether the user will fully view a specific short-form video; it only takes bandwidth conditions and playback statistics into account, as conventional DASH streaming systems do.
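The rule-based baselines above reduce to a few lines each over the paper's bitrate ladder. The sketch below follows the rules as stated (BB picks the highest sustainable bitrate, PB thresholds the predicted preference at 0.5, SB matches the previous bitrate); POP is omitted since it needs aggregated CTR data, and tie-breaking details are assumptions.

```python
LADDER = [2, 8, 17, 31, 50]   # candidate bitrates in Mbps (from the dataset setup)

def bb(predicted_bw_mbps):
    """BB: highest bitrate the predicted bandwidth can sustain."""
    feasible = [b for b in LADDER if b <= predicted_bw_mbps]
    return feasible[-1] if feasible else LADDER[0]

def pb(predicted_pref, high=LADDER[-1], low=LADDER[0]):
    """PB: high bitrate when predicted preference exceeds 0.5, else low."""
    return high if predicted_pref > 0.5 else low

def sb(prev_bitrate):
    """SB: pick the ladder bitrate closest to the previous video's."""
    return min(LADDER, key=lambda b: abs(b - prev_bitrate))
```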

4) PERFORMANCE METRICS
We select three sets of parameters µ to represent three QoE objectives with various preferences.

B. PERFORMANCE OF PREFERENCE PREDICTION
First, we examine the performance of the prediction model for user preference. We check the Receiver Operating Characteristic (ROC) of each method to evaluate prediction accuracy, where TPR denotes the true positive rate and FPR the false positive rate. As shown in Fig. 6, our preference prediction model outperforms the POP algorithm (i.e., recommending the most popular short-form videos to users), indicating that the proposed algorithm is qualified for this task.
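The ROC evaluation can be reproduced with a small dependency-free helper that turns predicted preference scores and 0/1 watch labels into (FPR, TPR) points and a trapezoidal AUC. The scores and labels in the test are illustrative, and tied scores are handled naively.

```python
def roc_points(labels, scores):
    """Return the ROC curve as a list of (FPR, TPR) points."""
    pairs = sorted(zip(scores, labels), reverse=True)   # descending by score
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / N, tp / P))
    return pts

def auc(pts):
    """Trapezoidal area under the (FPR, TPR) curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```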

C. INTEGRATION OF RECOMMENDATION WITH RL
In this section, we investigate whether the designed RL structure in RecDASH can take full advantage of the preference prediction with the recurrent neural network. To analyze the significance of the preference prediction module, we compare the performance of Pensieve and RecDASH. Further, to show the effectiveness of the user preference features, we drop the outputs of the GRU encoder and attentive encoder from the observations of RecDASH, yielding another algorithm for comparison, 01-DASH. The comparison among the three methods under three QoE objectives is plotted in Fig. 7. The results reveal that preference prediction greatly improves the QoE (i.e., both 01-DASH and RecDASH outperform Pensieve). Moreover, since RecDASH appropriately integrates the user preference features into the state, it achieves a higher QoE than 01-DASH.

D. COMPARISON WITH EXISTING APPROACHES
We compare the proposed RecDASH model with the baseline and other state-of-the-art algorithms on all QoE objectives. Fig. 8 shows the detailed CDF results under the three QoE objectives for all algorithms. RecDASH outperforms the compared algorithms on all QoE metrics over the test traces, obtaining about 10%-20% improvement on average. In particular, compared to state-of-the-art streaming systems originally designed for long-form videos, RecDASH improves the average video quality by 5%-15% when the QoE objective emphasizes average quality, i.e., (2, 0.2, 0.1). This improvement implies that RecDASH is aware of the QoE objective and thus allocates high bitrates to the preferred videos and lower bitrates to the unpreferred ones. When the QoE objective emphasizes bitrate temporal variation, i.e., (1.8, 0.4, 0.1), RecDASH harmonizes the priorities among the three metrics. In addition, RecDASH incurs less rebuffering time than the compared approaches when the objective focuses more on rebuffering, i.e., (2.7, 0.1, 0.45). Beyond that, we also examine the distribution of the bitrates allocated to the preferred videos, depicted in Fig. 9. We conclude that RecDASH allocates higher bitrates to the preferred short-form videos than both the preference-aware methods (i.e., PB and POP) and the preference-unaware methods (i.e., BB, SB, and Pensieve).

E. DISCUSSION OF RecDASH
We now analyze why RecDASH works well on the short-form video feed streaming task. There are three main observations from the above experiments. First, RecDASH improves the quality of the user-interested videos much more than that of the videos against user interests. Since users do not watch many videos in a given list (as shown in Fig. 5), RecDASH can allocate high bitrates only to the user-interested short-form videos, resulting in high QoE. Second, since different QoE objectives inherently require different streaming strategies, SB, BB, and POP, which employ fixed control strategies, struggle to optimize across QoE objectives. In contrast, RL-based methods such as the proposed RecDASH can adapt to any QoE objective and thus outperform the other methods no matter how the objective changes. Third, although the Pensieve and PB algorithms can adapt to various QoE objectives, they do not consider user preferences and therefore cannot utilize the bandwidth resources as efficiently as the proposed RecDASH. Based on the above analysis, RecDASH integrates the user preference into the RL-based bitrate allocation, overcoming the problems of the other compared methods.

VII. CONCLUSION
In this paper, we present an RL-based preference-aware short-form video feed streaming system that focuses not only on efficiently utilizing limited bandwidth resources to improve the QoE of users, but also on making full use of the playback history through recommendation. The proposed RecDASH system has a video-content-independent model, which can accommodate dynamically changing features in the environment and is thus capable of optimizing various QoE objectives. The model leverages a GRU-based RNN with the attention mechanism to predict the user's preference for each video, and an RL-based bitrate adaptation module to dynamically allocate bitrates to short-form videos based on the generated preference predictions. Trace-driven evaluations show that RecDASH outperforms state-of-the-art algorithms on three QoE metrics, indicating the importance of combining user preference with dynamic bitrate allocation in short-form video feeds.