QoS-Aware Co-Scheduling for Distributed Long-Running Applications on Shared Clusters

To achieve a high degree of resource utilization, production clusters need to co-schedule diverse workloads, including both batch analytic jobs with short-lived tasks and long-running applications (LRAs) that execute for a long time frame from hours to months, onto shared resources. Microservice architecture has given rise to distributed LRAs (DLRAs), which comprise multiple interconnected microservices that execute in long-lived distributed containers and serve massive user requests. Detecting and mitigating QoS violations becomes even more intractable due to network uncertainties and latency propagation across dependent microservices. However, current resource managers are only responsible for resource allocation among applications/jobs and are agnostic to runtime QoS, such as latency, at the application level. State-of-the-art QoS-aware scheduling approaches are designed for monolithic applications and do not consider the spatio-temporal performance variability across distributed microservices. In this paper, we present TOPOSCH, a new scheduling and execution framework that prioritizes the QoS of DLRAs whilst balancing the performance of batch jobs and maintaining high cluster utilization through harvesting idle resources. TOPOSCH tracks the footprints of every single request across microservices and uses critical path analysis, based on the end-to-end latency graph, to identify microservices that have a high risk of QoS violation. Based on microservice-level and node-level risk assessment, we intervene in batch scheduling by adaptively reducing the resources visible to batch tasks, thus delaying their execution to give way to DLRAs. We propose a prediction-based vertical resource auto-scaling mechanism, aided by resource-performance modeling and fine-grained resource inference and access control, for prompt recovery from QoS violation. A cost-effective task preemption mechanism keeps the cost of resource reclamation low during auto-scaling. TOPOSCH is integrated with Apache YARN, and experiments show that TOPOSCH outperforms other baselines in terms of the performance guarantee of DLRAs, at an acceptable cost of batch job slowdown. On average, the tail latency of DLRAs in TOPOSCH is merely 1.12x that of executing alone, with a 26% increase in the job completion time (JCT) of Spark analytic jobs.

Index Terms: Resource scheduling, cluster management, QoS, tail latency, datacenters

INTRODUCTION
Production clusters are increasingly consumed by various workloads, mainly batch jobs for data analytics [1], [2], [3], [4] and long-running applications (LRAs) for online cloud services (e.g., Storm [5], Flink [6], HBase [7], MongoDB [8], Tensorflow [9]) that perform transaction analytics, stream processing, and data storage and querying. By co-managing diverse workloads on the same host server, workload co-location has become a common practice for improving resource utilization and cost efficiency. As opposed to batch analytic jobs, which usually consist of a large number of short-lived tasks and are measured by end-to-end job completion time, LRAs have now become another mainstream workload in production clusters (Google [10], Microsoft [11], Alibaba [12]). LRAs are latency-critical: stringent quality-of-service (QoS) requirements such as response latency and throughput are of the utmost importance and must be met to deliver the business promise in the face of network jitter or load spikes. For example, the 95th percentile of requests needs to complete within a latency threshold.
Microservice architecture is an approach to constructing a single application as a set of small interconnected services. Each microservice runs independently and communicates with the others mostly through remote procedure calls (RPC) [13]. In this context, a Distributed Long-Running Application (DLRA) refers to a microservice-based application whose microservices execute in long-lived distributed containers. Compared to monolithic applications, request latency in a DLRA is vulnerable to any network turbulence, which affects the massive volume of communication among its microservices. Pinpointing a QoS violation (e.g., mean or tail latency over a threshold) is increasingly intricate because latency in a single microservice can promptly propagate across all dependent microservices and ultimately slow down the entire application [13].
However, traditional cluster managers [3], [14], [15], [16] were originally designed for short-running batch jobs. The central resource manager (RM) is only responsible for resource allocation among applications/jobs, and leaves all application-specific logic to application managers (AMs). This means that the RM is completely unaware of the runtime QoS requirements of interactive and latency-sensitive applications. Other workload co-location solutions either diminish performance interference through resource partitioning and isolation [17], [18], [19] or minimize performance interference when co-locating different workloads [20], [21], [22]. Nevertheless, they are exclusively devised for monolithic applications and cannot be directly applied to tackle the sophisticated component dependencies and latency variations that arise when substantial and dynamic requests reach the constituent microservices of a DLRA.
In this paper, we present TOPOSCH, a QoS-centric resource management and runtime execution framework that prioritizes the QoS of DLRAs whilst balancing the performance of batch jobs and maintaining high cluster utilization. TOPOSCH encompasses two coherent stages to tackle QoS violation: (i) In the QoS violation containment phase, we first exploit instrumentation to trace the footprints of each request across different microservices to localize the QoS violation. We take into account timing information, including the sojourn time on each individual microservice and the transmission time between microservices, to establish a latency graph, and periodically perform critical path analysis to ascertain the chain of invocations with the longest end-to-end latency. The microservices on the critical path are recognized as the victim microservices with higher risks of QoS violation. Based on microservice-level and node-level risk assessment, a risk-aware mechanism is proposed for adjusting the resource reservation for DLRAs and the resource visibility to batch tasks. We can therefore intervene in the scheduling of batch tasks by preventing excessive batch tasks from being packed onto saturated nodes, so that the QoS violation of DLRAs is not exacerbated. (ii) In the QoS violation mitigation phase, we perform prediction-based vertical auto-scaling by learning the QoS sensitivity of long-running containers (particularly those of risky microservices such as core databases or data streaming components in the DLRAs) to multiple resources, and by devising low-cost task preemption and resource reclamation. We infer the proper amount of resource to be vertically scaled based on the QoS-resource model to reach the targeted QoS of the victim microservices. Multi-dimensional resource isolation (CPU cores, caches, main memory, memory bandwidth, etc.) is enforced to precisely control resource binding and runtime usage. As opposed to mandatory kill-based preemption, which leads to substantial termination of running tasks, we propose a new task preemption mechanism for gradual resource reclamation from low-priority opportunistic batch tasks and leverage multiple pluggable preemption strategies to determine the tasks to be preempted.
TOPOSCH is integrated with the Resource Manager and Node Manager of Hadoop YARN. Experiments show that TOPOSCH outperforms other baselines in QoS assurance. On average, the tail latency of DLRAs when co-located with Spark-based batch jobs is merely 1.12x that of executing alone, and batch jobs experience a 26% JCT increase compared with running in native YARN. If the QoS-driven auto-scaling mechanism is disabled, the tail latency of the variant TOPOSCH-n is 1.27x, i.e., with less QoS assurance, but the JCT is only increased by 17%. This indicates the performance balance struck when the proposed auto-scaling design comes into effect. Additionally, the proposed gradual preemption schemes can reduce the JCT by 26.3% and 15.1% compared with the kill-based scheme and the least-preempted scheme, respectively.
This paper makes the following contributions: (i) proposing a mechanism for QoS violation assessment based on critical path analysis, which is conducted upon the breakdown of end-to-end request latency among the constituent microservices of DLRAs; (ii) devising an adaptive co-scheduling approach that delays the scheduling of batch tasks according to the runtime risk of QoS violation; and (iii) developing a new mechanism for mitigating QoS violation through prediction-based resource inference and cost-effective auto-scaling of key microservices.
We expand upon our previous work [23], which only focused on the basic scheduling and QoS protection strategy in the containment phase, by (1) redesigning the scheduling framework to underpin co-scheduling (centralized and decentralized) of DLRAs, batch tasks and opportunistic tasks; (2) significantly augmenting the scheduling framework to support a prediction-based and on-demand mitigation phase, with the particular aid of an enhanced node agent for precise QoS prediction and runtime multi-resource inference and management, and a new resource autoscaler in the DLRA Master for cost-effective QoS recovery and resource reclamation; and (3) conducting a more comprehensive experimental study with an additional set of workload co-location compositions, different preemption strategies, and state-of-the-art approaches for comparison.
Organization. Background and solution overview are presented in Section 2 and Section 3. We show the technical details in Section 4 and Section 5 before the evaluation in Section 6. We discuss related work in Section 7 and conclude the paper in Section 8.

Resource Management for Shared Clusters
Cluster scheduling systems typically separate the resource management layer from the job-level logical execution plans. YARN [24] and Fuxi [3] share the following components:
Resource Manager (RM) is the centralized resource manager, which tracks resource usage and node liveness and enforces resource quotas among tenants through either capacity or fairness control.
Application Master (AM) is an application-level scheduler that coordinates the logical plan of a single job by requesting resources from the RM, generating a plan from the received resources, and coordinating task execution.
Node Manager (NM) is a daemon process on each cluster node that is responsible for managing the task life-cycle and monitoring node information.
Traditional workloads in clusters include batch data analytic jobs (abbr. batch jobs) [1], [2], [3], [4], with short-lived tasks typically in the order of seconds, and long-running applications (LRAs). LRAs are instantiated by long-standing containers or executors to enable iterative in-memory computation or unceasing request-response handling. Examples of LRAs include applications using stream processing frameworks (Storm [5], Flink [6], Kafka Streams [25]), latency-sensitive database applications (HBase [7] and MongoDB [8]), and data-intensive in-memory computing frameworks (Tensorflow [9]). Response latency and throughput are the key performance indicators, and applications must meet strict QoS requirements.
Batch workloads typically fall into two classes: regular jobs/tasks and opportunistic jobs/tasks (a.k.a. best-effort or speculative in other systems [10], [16], [26], [27]). Regular tasks are submitted and managed by the centralized resource scheduler, while opportunistic tasks are managed in a decentralized manner and used for resource oversubscription and high resource utilization: they are submitted to fill in the slack left by LRAs and regular tasks.

Distributed Long-Running Applications (DLRAs)
By nature of component decoupling and distributed execution, a DLRA typically comprises multiple microservices, which are deployed on multiple nodes subject to their resource requirements. Transactions within a DLRA have strong dependencies across multiple microservices. However, spatio-temporal load variability manifests over time and across nodes [28], [29], [30], [31]. A user request (e.g., an application request, a database query, a file access operation) traverses a collection of microservices before a response is returned. Therefore, end-to-end (E2E) response latency is broadly used to indicate the time any operation takes to complete. Fig. 1 exemplifies a typical DLRA for an online e-commerce store [13], which consists of nine business microservices (ranging from account-related services to order management services) and seven data warehouse microservices. Each arrow represents a calling dependency. After logging in to the system, users can browse the inventory through the catalogue or add items to the cart before finishing an order. The shipping service is also connected with the order service so that one can check the shipping status of a given order. All information needs to be queried and fetched from the underlying database services.

End-to-End (E2E) Latency in DLRAs
We use PiggyMetrics [32], a financial advisor app built upon microservice-based architecture, to showcase how an increase of end-to-end latency can break down into, and attribute to, the individual microservices.
Motivating Example. Fig. 2a shows a test case that covers the microservices associated with account and statistics. The detailed calling chain is as follows: a user first launches a request to the system, and the request is then routed to the Account-service (AcS) via the Gateway (GW). AcS largely depends upon the Authentication-service (AuS) to complete the account verification. To obtain the relevant account information, it needs to access the local database service Account-mongodb (ADb). Once logged in, the user can obtain the required statistics by initiating another request to the Statistics-service (SS), which queries the back-end database Statistics-mongodb (SDb).
Latency Increase and Its Breakdown. Failing to handle spikes of users and requests is one of the common root causes of latency increase. To emulate this scenario, we conduct a case study by ramping up the number of users. We track the whole request processing chain and measure the increase ratio of the 95th-percentile latency of each individual microservice. As depicted in Fig. 2b, different microservices exhibit different sensitivity to the growing number of users. Noticeably, the two database microservices are the dominating factor in the E2E latency, while GW and SS scale well and are less sensitive to changing system loads.
The result implies that tracking and analyzing the response latency is of great importance for QoS assurance of long-running services and applications. Unawareness of such application-level latency at runtime could lead to higher performance interference among co-located workloads. It is thus imperative to localize such key components and take the necessary actions to restrict and mitigate performance degradation.
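The per-microservice measure used in the case study, the increase ratio of the 95th-percentile latency, can be sketched as follows. This is a minimal illustration rather than the measurement code used in the study: the nearest-rank percentile definition and the function names `percentile` and `p95_increase_ratio` are assumptions made for this example.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile (an assumed definition): the smallest sample
    such that at least q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def p95_increase_ratio(baseline_ms, loaded_ms):
    """Ratio of the 95th-percentile latency under load to the baseline one."""
    return percentile(loaded_ms, 95) / percentile(baseline_ms, 95)
```

A ratio close to 1 for a microservice suggests it is insensitive to the added load, whereas a large ratio flags a candidate hotspot.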

QoS-Aware Co-scheduling
We enforce two distinct QoS management stages onto the cluster scheduling in the face of QoS violation:
Containment. A QoS violation of a single microservice may propagate and lead to cascading violations across the entire system. We therefore locally restrict such propagation once the QoS measures are observably degraded within a compute node. We then delay the scheduling of more batch tasks onto the node to maintain the current level of co-location and thus give way to the existing DLRAs. This intervention aims to contain the spectrum of influence and diminish the aggravation of the QoS violation.
Mitigation. As opposed to the delay-execution policy used in the containment stage, it is also desirable to proactively and dynamically adjust the existing resource allocations (a.k.a. vertical auto-scaling), most notably for latency-sensitive core DLRA components such as databases or data stream operators, in the event of transient but severe QoS degradation. The resources harvested by best-effort tasks need to be properly reclaimed.
To fulfill the two-stage QoS management for diverse workload co-location, we need to answer the following research questions:
[Q1] How to localize the performance hotspots of DLRAs and identify the most vulnerable microservices?
[Q2] How to isolate the victim microservices from the co-located batch tasks?
[Q3] How to effectively auto-scale the containers of the risky microservices with a proper resource adjustment to ensure the required performance recovery?
[Q4] How to minimize the cost of preempting the running batch tasks during the auto-scaling?

System Architecture
Overview. TOPOSCH is built on the state-of-the-art open-source resource management platform YARN [24] to co-schedule both latency-sensitive containers of DLRAs and tasks of batch jobs. TOPOSCH encompasses both a centralized resource manager, for high-quality resource allocation with fairness and capacity guarantees, and a decentralized scheduler with distributed resource oversubscription, extended from [16], [27], to support high job throughput and high cluster utilization. Fig. 3 describes the overall architecture of TOPOSCH, which comprises three main components: the central resource scheduler Resource Manager (RM), the per-DLRA manager DLRA Master (DLRAM), and the per-node agent TOPOSCH-PAG, a co-resident module with the native Node Manager (NM). Furthermore, batch jobs in TOPOSCH are managed separately according to their priority. A regular batch job is managed by a native per-job Job Master (JM). The JMs and DLRAMs are responsible for negotiating resources with the centralized RM, i.e., submitting resource requests and coordinating the resource allocation after obtaining the resource response from the RM. By contrast, opportunistic jobs are submitted directly to improve utilization, and their opportunistic tasks are executed on the nodes without the need for a resource grant from the RM.
DLRA Master (DLRAM). To align with the design of the AM in YARN, we devise a specific programming framework to launch a DLRA consisting of microservices, request resources from the central RM, and provide standard functionalities of performance tracing and inter-component communication (e.g., RPC). The working mechanism is similar to the AM of DAG jobs; users can outline the topological relationships among microservices and specify the resource amount in the configuration file. At the core of the DLRAM are the QoS Analyzer and the Resource Autoscaler.
QoS Analyzer is the key component that tracks the request footprints generated within a certain time frame and builds a weighted DAG that depicts the calling relationships. To tackle [Q1], TOPOSCH exploits instrumentation to trace the footprints of all requests through each microservice. We can then monitor, extract and calculate the key measures: the average sojourn (processing) time on each individual microservice and the average transmission time. TOPOSCH periodically constructs a request calling graph based on the microservice dependencies and localizes the vulnerable microservices through critical path analysis (Section 4.1). Those components are regarded as QoS victims and have higher risks of further slowdown and failures. The risk information is passed on to the RM to perform preventive delay scheduling of batch tasks (Section 4.2).
Resource Autoscaler is the controller that infers and vertically adjusts the resource allocation of each microservice container on demand to keep up with the varying QoS. In response to [Q3], the aim is to work out a proper (just-enough) slice of resources that dynamically rescues the degraded performance whilst minimizing the impact on neighboring jobs. To conduct the resource inference, we need a predictor that captures the sensitivity of the DLRA container to multiple resources, i.e., the relationship between the resource allocation and the resultant QoS. TOPOSCH pre-trains an initial predictor offline; its parameters are periodically synchronized to the autoscaler as on-the-fly resource usage is leveraged to tweak and update the model. We take as inputs the current resource allocation, system loads and target performance, and yield a new resource plan that can deliver a specific performance recovery. The resource change is used for notifying the corresponding node agent and determining the detailed plans of task preemption (Section 5.2).
Resource Manager (RM). To raise the awareness of DLRA-level latency, the RM differentiates the available nodes by the level of co-resident victim microservices. To cope with [Q2], once a node's risk of performance degradation becomes perceptible, TOPOSCH recalculates and throttles the resource amount visible to the YARN capacity scheduler, according to the current risk assessment on a per-node basis, so that only a fraction of the actually available resources can be assigned to batch tasks, thereby delaying their execution (Section 4.2).
Node Agent (TOPOSCH-PAG). We inherit the main functionalities of the default NM for container management and status update. We devise a multi-resource manager to control the access of different containers to a variety of resources such as CPU, memory, last-level cache (LLC) and memory bandwidth (MBW). We employ Docker containers to provide an isolated execution environment for the tasks. Upon receiving a request for launching a new task or DLRA container, the container executor launches a Docker container (Section 5.1). In addition, as opportunistic tasks are submitted and executed in a distributed manner, such queueable tasks are allocated by the YARN distributed schedulers [33], without going through the central RM, and managed by the local queue manager of each node. In response to [Q4], the Preemption Manager is devised to determine which opportunistic tasks are to be preempted and to perform a graceful resource reclamation upon receiving the updated allocation changes from the Resource Autoscaler (Section 5.3).

QOS-AWARE WORKLOAD CO-SCHEDULING
This section presents how we co-schedule the microservices of DLRAs and batch tasks by pinpointing the vulnerable microservices (Section 4.1) and intervening in the scheduling of low-priority batch tasks based on risk assessment (Section 4.2).

Pinpointing Vulnerable Microservices
Request Instrumentation and End-to-End Latency Tracing. To obtain as many footprints as possible, we aim to record per-request and per-microservice latency at RPC granularity. We instrument the incoming requests and outgoing responses by tracking information including the destination endpoint, the inbound/outbound timestamps and the request status. We use a set of identifiers to describe each RPC call, including url, requestID, serviceID, eventType, nextServiceID, timestamp and statusCode (see Table 1). From these we can infer the elapsed latency of a specific request within a microservice. The traces are aggregated into a centralized database, e.g., Redis (https://redis.io). TOPOSCH integrates the database with the DLRA's AM to ensure effective data access whilst reducing the memory consumption of the RM.
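As an illustration of how such trace records can be turned into per-microservice latencies, the following minimal sketch pairs inbound and outbound events per request. It is not the TOPOSCH implementation. It assumes a simple chain model in which each request produces exactly one inbound and one outbound event per microservice; the `eventType` values `"IN"` and `"OUT"`, the dictionary-based record layout, and the function name `aggregate_latencies` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate_latencies(events):
    """Return (mean sojourn time per service, mean transmission time per link).

    Each event is a dict with the trace fields requestID, serviceID,
    eventType ("IN" or "OUT", assumed values), nextServiceID and timestamp.
    """
    inbound, outbound = {}, {}
    for event in events:
        key = (event["requestID"], event["serviceID"])
        if event["eventType"] == "IN":
            inbound[key] = event
        else:
            outbound[key] = event

    sojourn = defaultdict(list)       # serviceID -> per-request sojourn times
    transmission = defaultdict(list)  # (from, to) -> per-request link times
    for (request_id, service_id), out in outbound.items():
        arrived = inbound.get((request_id, service_id))
        if arrived is not None:
            sojourn[service_id].append(out["timestamp"] - arrived["timestamp"])
        next_id = out.get("nextServiceID")
        received = inbound.get((request_id, next_id))
        if next_id is not None and received is not None:
            transmission[(service_id, next_id)].append(
                received["timestamp"] - out["timestamp"])

    mean_sojourn = {s: mean(v) for s, v in sojourn.items()}
    mean_transmission = {link: mean(v) for link, v in transmission.items()}
    return mean_sojourn, mean_transmission
```

The two returned dictionaries correspond to the per-microservice and per-link averages from which a latency graph can be weighted.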
The aggregated requests/responses over a period of time constitute the latency trace graph (LTG). Formally, $LTG = (V, E, f)$ comprises a set of microservice vertices $V$ and a set of edges $E$ denoting the interconnection links between microservices, where the incidence function $f : E \to \{(s_i, s_j) \mid (s_i, s_j) \in V^2 \wedge s_i \neq s_j\}$ maps each edge to an ordered pair of distinct microservices. There are a number of hierarchical execution entities in the system. A microservice provides multiple access points via RPC or RESTful APIs. TOPOSCH estimates the average sojourn time per request on microservices and the transmission time between microservices.
Critical Path Analysis on the LTG. To be precise, the Mean Sojourn Time (MST) is the amount of time that a user request spends on average in each microservice; the length of MST is equal to the mean waiting time plus the mean service time. As a microservice may provide its clients multiple APIs, hundreds of thousands of requests are performed and aggregated through the API gateway before routing to specific microservices.
Through latency instrumentation and tracing, we can easily obtain the entry and exit timestamps of a given request at a microservice. Let $t^{in}$ and $t^{out}$ represent the inbound and outbound timestamps. The sojourn latency $ST_k^{(i)}$ of a request $i$ within microservice $s_k$ and the transmission latency $TT_{k,l}^{(j)}$ of a request $j$ between microservices $s_k$ and $s_l$ can each be measured from two adjacent timestamps:

$$ST_k^{(i)} = t^{out}_{i,k} - t^{in}_{i,k}, \qquad TT_{k,l}^{(j)} = t^{in}_{j,l} - t^{out}_{j,k}.$$

The core of generating the LTG is setting the weights of vertices and edges. We assign the edge weight as the mean transmission latency $TT_{k,l}$ among all requests:

$$TT_{k,l} = \frac{1}{|G_{k,l}|} \sum_{j \in G_{k,l}} TT_{k,l}^{(j)},$$

where $G_{k,l}$ is the set of requests between microservices $s_k$ and $s_l$, and its size is denoted by $|G_{k,l}|$. Notably, we do not differentiate the latency among different endpoints here, based on the assumption of uniform RPC communication between two microservices. (It is common practice to adopt only one standard RPC library, such as gRPC, Apache Dubbo or Apache Thrift, rather than multiple RPC libraries. All requests within a DLRA therefore use the same underlying RPC library, and the latency graph can simply depend upon the mean transmission time without involving the variation caused by the RPC frameworks of different DLRAs or their cross-language performance.) Similarly, we assign the weight of a single vertex as the mean sojourn latency of all requests passing through microservice $s_k$:

$$ST_k = \frac{1}{|G_k|} \sum_{i \in G_k} ST_k^{(i)},$$

where $G_k$ is the set of requests to microservice $s_k$. We then divide the vertices into two distinct categories, functional vertices and auxiliary vertices, to embed the mean sojourn latency $ST_k$ and the mean transmission latency $TT_{k,l}$, respectively. To facilitate the graph algorithms, we retain the main attributes, including the service_id, the relevant microservices upstream_id/downstream_id, and the timing information. We exploit Bellman-Ford [34] to find the longest path of the LTG as the critical path. For clarity, notations used in this paper are summarized in Table 2.
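The critical path search can be illustrated with a short sketch. It is a simplified stand-in, not the authors' implementation: instead of materializing auxiliary vertices for the links, it adds each edge weight (mean transmission latency) and each vertex weight (mean sojourn latency) directly during a Bellman-Ford-style relaxation that maximizes rather than minimizes the distance. It assumes the LTG is acyclic; the function name `critical_path`, the dictionary layout, and all latency values in the usage note are hypothetical.

```python
def critical_path(vertex_weight, edge_weight, source):
    """Longest latency path from `source` in an acyclic latency graph.

    vertex_weight: {service: mean sojourn latency}
    edge_weight:   {(upstream, downstream): mean transmission latency}
    Returns (list of services on the path, total latency).
    """
    dist = {v: float("-inf") for v in vertex_weight}
    pred = {v: None for v in vertex_weight}
    dist[source] = vertex_weight[source]

    # Bellman-Ford-style rounds: at most |V| - 1 relaxation passes.
    for _ in range(len(vertex_weight) - 1):
        changed = False
        for (u, v), transmit in edge_weight.items():
            candidate = dist[u] + transmit + vertex_weight[v]
            if candidate > dist[v]:
                dist[v], pred[v] = candidate, u
                changed = True
        if not changed:
            break

    # A further improvement means a cycle, for which no longest path exists.
    for (u, v), transmit in edge_weight.items():
        if dist[u] + transmit + vertex_weight[v] > dist[v]:
            raise ValueError("graph contains a cycle reachable from the source")

    end = max(dist, key=dist.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = pred[node]
    return path[::-1], dist[end]
```

With made-up weights for the PiggyMetrics chain, e.g. `critical_path({"GW": 2, "AcS": 5, "AuS": 4, "ADb": 12, "SS": 3, "SDb": 9}, {("GW", "AcS"): 1, ("AcS", "AuS"): 1, ("AcS", "ADb"): 2, ("GW", "SS"): 1, ("SS", "SDb"): 2}, "GW")`, the path through the account database is the longest one.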

Risk Assessment of QoS Violation
Microservice-Level Risk Assessment. The goal of microservice-level risk assessment is to quantify the risk of the victim microservices on the critical path. We mainly take into account the following factors:
Request sojourn time. Longer request latency indicates that the microservice is prone to QoS violation, as the increased latency from the microservice would be amplified and cascaded along the whole critical path.
API call frequency. Any QoS violation in a microservice with a higher API call frequency involves more requests and intrinsically influences a wider range of users.
Request failure rate. A higher failure rate indicates reduced reliability of request handling in the microservice. Without further resource adjustment, those microservices have higher risks of QoS violation.
To combine the first two factors, we consider both inter-API and intra-API request sojourn time. We calculate the weighted average sojourn time among different APIs because of the unbalanced number of requests arriving at different APIs:

$$\widetilde{ST}_k = \sum_{u} \omega_u \, \widetilde{ST}_k^{(u)},$$

where $\omega_u$ is the proportion of the total requests taken up by the $u$th API url of microservice $k$, and $\widetilde{ST}_k^{(u)}$ denotes the intra-API average of the $u$th API url. In particular, we use the geometric mean of all requests pertaining to the url to mitigate the impact of outliers and smooth the average calculation. We then calculate the weighted sojourn proportion (WSP) to indicate the proportion and importance of the targeted microservice in the whole critical path:

$$WSP_k = \frac{\widetilde{ST}_k}{\sum_{s_j \in S} \widetilde{ST}_j},$$

where $S$ is the microservice collection on the critical path. We involve the request failure rate in the risk assessment through the weighted request failure proportion (WFP), which is calculated from the ratio of error requests, i.e., the error requests $E_k$ among all requests $G_k$ sent to $s_k$. We integrate the two into the risk score $r_k$ by setting a configurable weight $\alpha$, which balances sojourn latency against failure rate.
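The following sketch shows one way the microservice-level measures could be combined. It is illustrative only and rests on two labeled assumptions that the text above does not confirm: (a) the failure ratio of each microservice is normalized over the critical path in the same way as the sojourn proportion, and (b) the two proportions are blended linearly as `alpha * WSP + (1 - alpha) * WFP`. All function and parameter names are hypothetical.

```python
from statistics import geometric_mean


def weighted_sojourn(api_samples):
    """Weighted average sojourn time of one microservice.

    api_samples maps each API url to its per-request sojourn times; each url
    is weighted by its share of requests, and the intra-API average is the
    geometric mean.
    """
    total = sum(len(samples) for samples in api_samples.values())
    return sum(len(samples) / total * geometric_mean(samples)
               for samples in api_samples.values())


def proportions(values):
    """Normalize a {service: value} mapping so that the values sum to 1."""
    total = sum(values.values())
    if total == 0:
        return {k: 0.0 for k in values}
    return {k: v / total for k, v in values.items()}


def microservice_risk(sojourn, errors, requests, alpha=0.5):
    """Risk score per microservice on the critical path.

    ASSUMPTION: WFP is the failure ratio normalized over the critical path,
    and the score is the linear blend alpha * WSP + (1 - alpha) * WFP.
    """
    wsp = proportions(sojourn)
    wfp = proportions({k: errors[k] / requests[k] for k in sojourn})
    return {k: alpha * wsp[k] + (1 - alpha) * wfp[k] for k in sojourn}
```

A larger `alpha` makes the score follow latency more closely, while a smaller one emphasizes failures.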
Node-Level Risk Assessment. TOPOSCH infers the risk level of QoS violation on a per-node basis. We therefore aggregate the risk scores of the microservices on each node, i.e., $\sum_{s_k \in M_n} r_k$, where $M_n$ is the set of microservices running on node $n$, and then form the node-level risk $R_n$ by normalizing the overall risk level (e.g., using min-max normalization) among all running nodes. The node risk measures over a fixed time frame are maintained within the RM. The RM then transforms the obtained risk information into a dynamic resource adjustment, in terms of both the resources available to batch tasks and the resources reserved exclusively for DLRAs.
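The node-level aggregation can be sketched in a few lines. This is a minimal example, assuming min-max normalization (which the text mentions only as one option) and assuming that all nodes receive a risk of 0 when their raw scores are identical; the function name `node_risks` and the input layout are hypothetical.

```python
def node_risks(placement, micro_risk):
    """Sum microservice risks per node and min-max normalize across nodes.

    placement:  {node: [microservices running on that node]}
    micro_risk: {microservice: risk score r_k}
    """
    raw = {node: sum(micro_risk[k] for k in services)
           for node, services in placement.items()}
    low, high = min(raw.values()), max(raw.values())
    if high == low:
        # ASSUMPTION: with identical raw scores no node stands out.
        return {node: 0.0 for node in raw}
    return {node: (score - low) / (high - low) for node, score in raw.items()}
```

Because of the normalization, the riskiest node always receives a value of 1 and the least risky one a value of 0 within the same time frame.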

Resource Reservation and Scheduling Intervention
Risk-Aware Slack Resource Reservation for DLRAs. TOPOSCH aims to achieve a dynamic and healthy co-existence of DLRAs and batch jobs with balanced performance, trading the performance of batch jobs to some extent for prioritizing the runtime latency of interactive DLRAs. Intuitively, a node with a higher risk level needs to reserve more slack resource for DLRAs from its available resource pool. In this context, this slack resource is only visible to DLRAs and cannot be used by batch tasks for a period of time. Namely, the resource visible to batch tasks adapts to the on-the-fly risk level, according to the estimation based on Eq. 5 to Eq. 7. This intervention mechanism avoids unnecessary batch task placement onto the node, thereby reducing the performance interference between the two kinds of workloads. In practice, we use a simple yet effective linear model with a reservation coefficient $\nu$ to determine the resource reservation for DLRAs on a specific node. $\nu$ represents the relationship between the risk level and the resource reservation. A higher value indicates that the amount of resource reservation is more sensitive to the change of risk level, and vice versa. The extreme case of $\nu = 0$ means no dedicated slack resource for the DLRAs, i.e., TOPOSCH is completely switched off and the default YARN scheduler is in effect. Correspondingly, the ratio of resources visible to batch tasks can be calculated by $1 - \nu R_n$. To keep the value valid, $\nu$ is set to ensure that $\nu R_n$ lies between 0 and 1.

Table 2 (partial). Notations
$G_{k,l}$: the set of requests between $s_k$ and $s_l$
$G_k$: the set of requests sent to $s_k$
$E_k$: the set of error requests sent to $s_k$
$M_n$: the set of microservices running on node $n$
$r_k$, $R_n$: the risk of a microservice $s_k$ and a node $n$
[symbol missing in source]: the mini step size of resource reclamation

Batch Scheduling Intervention. Algorithm 4.2.2 describes the procedure of resource allocation for task scheduling. 
YARN uses the Container (footnote 3) as the basic unit of resource allocation in the scheduler of the Resource Manager, and then as the resource lease under which a task runs. A Container is reclaimed when its task completes or is killed. Unsatisfied Containers, which represent the resource requests of pending tasks, wait in the scheduler's queue.
We select Containers from the waiting queue in descending order of waiting time and filter out a node list $N$ in which each node has sufficient capacity to meet the task's requirement (Lines 1-4). If the default QoS violation policy is enabled (the zero-violation policy is discussed below), the scheduler goes through all candidate nodes and calculates each node's visible available resource $R^{vis}_n$ from its real available resource $R^{real}_n$ according to the risk-aware reservation for DLRAs (Lines 10-12).
Only if the visible resource is large enough to cover the requested amount is the current Container assigned to the node, by reusing Assign(), the default scheduling procedure of native YARN (Lines 14-16). Otherwise, we hold the Container back from scheduling for a given number of times (e.g., setting maxRetryTime to 1 means the delay occurs only once). This design reflects a performance trade-off: we can prioritize QoS protection without delaying batch task execution too much. Once a task has petitioned for resources more than maxRetryTime times, TOPOSCH attempts to allocate resources to its Container as soon as possible. In this case, a Container with a data locality requirement is placed directly, even though this may temporarily aggravate the QoS violation (Lines 18-21). For a Container without a locality requirement, TOPOSCH can relax the scope of node selection: the scheduler chooses the node with the lowest risk level to reduce the impact of co-location on latency (Lines 22-26).
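The placement decision described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the YARN integration itself: the `Node` class, the `place` function, and the single scalar resource are hypothetical simplifications, `theta` stands for the reservation coefficient, and falling back to the lowest-risk node when no candidate holds the data locally is our own assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    real_available: float         # real available resource on the node
    risk: float                   # normalized node risk in [0, 1]
    theta: float = 1.0            # reservation coefficient (assumed name)
    has_local_data: bool = False  # whether the task's input data is local


def visible_available(node: Node) -> float:
    """Resource that batch tasks may see: real * (1 - theta * risk)."""
    ratio = min(1.0, max(0.0, 1.0 - node.theta * node.risk))
    return node.real_available * ratio


def place(request: float, retries: int, nodes: List[Node],
          needs_locality: bool, max_retry: int = 1) -> Optional[str]:
    """Return the chosen node name, or None to delay the Container."""
    candidates = [n for n in nodes if n.real_available >= request]
    for node in candidates:
        if visible_available(node) >= request:
            return node.name          # normal path: visible resource suffices
    if retries < max_retry or not candidates:
        return None                   # hold the Container for another round
    if needs_locality:
        local = [n for n in candidates if n.has_local_data]
        if local:
            return local[0].name      # place directly despite the risk
    # No usable locality constraint: pick the least risky candidate node.
    return min(candidates, key=lambda n: n.risk).name
```

A Container that cannot be served from visible resources is delayed once with the default `max_retry=1`, and only then placed against the reservation.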
Parameter Setting. Finding a suitable system parameter configuration is a non-trivial task. One common practice, based on our large-scale engineering experience, is to start with a conservative reservation coefficient and validate it in a small-scale test system with the same hardware configuration before deploying to larger-scale production. This procedure helps significantly in understanding system behaviors in a controlled manner. We can set a starting point, such as 1.0, and gradually relax the parameter in steps of 0.1 to allow more co-located batch tasks, while observing the latency variations (e.g., slowdowns or failures) through daily regression tests. In this way, the configuration is revised in small steps until all regression tests deliver stable outputs and both latency-sensitive applications and batch jobs reach an acceptable performance level. Recent advances in reinforcement learning can facilitate automatic parameter tuning, which is beyond the scope of this paper and is left for future work.
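One possible reading of this manual procedure is the loop below, given as a hedged Python sketch. The callback `passes_regression` is hypothetical and stands for whatever daily regression suite judges a coefficient value acceptable; the stopping rule (keep the most relaxed value that still passes) is our interpretation rather than a prescribed policy.

```python
from typing import Callable


def tune_coefficient(passes_regression: Callable[[float], bool],
                     start: float = 1.0, step: float = 0.1) -> float:
    """Relax the reservation coefficient stepwise while tests still pass."""
    steps = round(start / step)   # iterate over integer steps to avoid float drift
    best = start                  # the conservative starting point is the fallback
    for i in range(steps, -1, -1):
        value = round(i * step, 10)
        if not passes_regression(value):
            break                 # latency regressed: stop relaxing
        best = value
    return best
```

In practice each call to `passes_regression` corresponds to a full test cycle, so the loop runs over days rather than seconds.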
Note that we also allow application-specific decision making to achieve a customized performance trade-off. A stricter violation policy, e.g., zero violation, can be applied to disallow any further batch execution and avoid worsening the QoS of the existing DLRA components. This is easily implemented by a global binary flag in the configuration file, which lets cluster administrators select the targeted scenario. If zero violation is enabled (Algorithm 4.2.2, Lines 6-8), all available resources on a node remain entirely invisible to batch tasks until the QoS of all targeted DLRAs has recovered.

QOS-AWARE AUTO-SCALING
This section addresses how to manage multi-dimensional resources and isolate resources for a given task (Section 5.1), how many resources to revoke for auto-scaling (Section 5.2), and which batch tasks to preempt in a cost-effective manner (Section 5.3).

Multi-dimensional Resource Control
TOPOSCH-PAG mainly uses Linux control groups (cgroups) and Intel RDT technology to achieve fine-grained, software-programmable control over the amount of resources allocated to different tasks.
We use the cgroup cpuset subsystem for CPU isolation: we set cpuset.cpus to indicate the CPU affinity of each process group and allocate, as far as feasible, logical cores of the same CPU slot to a given microservice or batch task container. This avoids frequent switches between CPU cores and cache contention under hyper-threading. We exploit the cgroup memory subsystem to limit the amount of memory available to the LRA by setting memory.limit_in_bytes. Fig. 4 outlines how the TOPOSCH agent manages CPU and memory with the group hierarchy.
3. In YARN's resource model, the resource scheduler responds to a resource request by granting a Container. A Container is the logical bundle of resources that grants a Job Master the right to use a specific amount of resources (e.g., 1 CPU core, 2GB RAM) on a specific node.
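As an illustration of the cpuset and memory settings described above, the following Python sketch writes the corresponding cgroup v1 control files. The function name, the group naming, and the directory layout under `cgroup_root` are assumptions for the example; on a real host the root would typically be the mounted cgroup hierarchy and the call would need sufficient privileges.

```python
import os


def apply_cgroup_limits(cgroup_root: str, group: str, cpus: str,
                        mems: str, memory_limit_bytes: int) -> None:
    """Write cpuset and memory limits for one container group (cgroup v1)."""
    cpuset_dir = os.path.join(cgroup_root, "cpuset", group)
    memory_dir = os.path.join(cgroup_root, "memory", group)
    os.makedirs(cpuset_dir, exist_ok=True)
    os.makedirs(memory_dir, exist_ok=True)
    settings = {
        # Logical cores the group may run on, e.g. "0-3" or "0-3,16-19".
        os.path.join(cpuset_dir, "cpuset.cpus"): cpus,
        # NUMA memory nodes; cgroup v1 needs this set before tasks are attached.
        os.path.join(cpuset_dir, "cpuset.mems"): mems,
        # Upper bound on the memory usable by the group.
        os.path.join(memory_dir, "memory.limit_in_bytes"): str(memory_limit_bytes),
    }
    for path, value in settings.items():
        with open(path, "w") as handle:
            handle.write(value)
```

Passing the root directory as a parameter keeps the sketch testable against a temporary directory instead of the live cgroup filesystem.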
We adopt Intel RDT to monitor and control access to LLC ways and MBW, in order to avoid resource starvation and the consequent performance degradation. We leverage Cache Allocation Technology (CAT) to group different DLRAs and batch jobs into different classes of service (CLOS), which act as resource control tags, and then assign a capacity bitmask (CBM) to each CLOS to indicate the amount of LLC available to it. Similarly, we use Memory Bandwidth Allocation (MBA) to specify the portion of MBW that each CLOS can access. TOPOSCH-PAG predicts the required cache ways according to the result of runtime resource re-allocation. For example, assume there are currently two resource control tags, CLOS1 and CLOS2. If the required cache ways are estimated to be 4 and 8, respectively, TOPOSCH sets 0x000f for CLOS1 and 0x0ff0 for CLOS2.
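The bitmask arithmetic in this example can be reproduced with a short Python sketch. It assumes that each CLOS receives a contiguous, non-overlapping run of cache ways packed from the lowest bit upward; the function name `allocate_cbms` is hypothetical, and the sketch only computes the masks without programming the hardware.

```python
from typing import Dict, List, Tuple


def allocate_cbms(ways_needed: List[Tuple[str, int]],
                  total_ways: int) -> Dict[str, int]:
    """Give each CLOS a contiguous, non-overlapping capacity bitmask."""
    masks: Dict[str, int] = {}
    offset = 0
    for clos, ways in ways_needed:
        if ways < 1 or offset + ways > total_ways:
            raise ValueError(f"cannot fit {ways} ways for {clos}")
        # `ways` consecutive one-bits, shifted past the ways already assigned.
        masks[clos] = ((1 << ways) - 1) << offset
        offset += ways
    return masks
```

For the case in the text, `allocate_cbms([("CLOS1", 4), ("CLOS2", 8)], 12)` yields the masks `0x00f` and `0xff0`.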

QoS Prediction Engine
We investigate the relationship between multi-dimensional resources and the QoS through systematic profiling and a prediction model. We can then use the model to infer how much a QoS violation could be mitigated by a given resource re-allocation plan. We use Million Instructions Per Second (MIPS) as the QoS indicator to ensure measurement accuracy. Compared with Instructions Per Cycle (IPC) or Cycles Per Instruction (CPI) [27], [35], MIPS depends less on the measured CPU frequency and the number of clock cycles when frequency scaling or over-clocking is in effect, and is thus more accurate when an application experiences I/O interrupts.
Formally, the prediction engine takes as input the normalized multi-dimensional vector $R = (R_{CPU}, R_{mem}, R_{LLC}, R_{MBW}, i)$ of the existing resource allocation for the profiled microservice $i$, and estimates the targeted QoS $Q$ (the MIPS value). Let $F$ be the regression function trained and fitted on the resource allocations and the resultant QoS. As $F$ is microservice-specific, each key component of a DLRA is profiled by the DLRA Master. We pre-train the prediction model in an offline training stage, similarly to existing approaches [18], [36], [37], based on a set of workload benchmarking and profiling runs, and update the model parameters periodically according to the on-the-fly resource usage.
More specifically, we enumerate all possible resource vectors by going through the available range of each individual resource with a given step size. For example, the memory allocation ranges from 256MB to 4GB, while the number of LLC cache ways increases by one at each step. To exemplify the procedure, we showcase how prediction models are trained for a MongoDB microservice. Diverse regressors are applied in the model training, including Linear Regression, k-Nearest Neighbor (KNN), AdaBoost, ElasticNet, and Gradient Boost Regression Tree (GBRT). Model accuracy is determined through the Root Mean Square Error (RMSE), an established measure of regression accuracy that magnifies large errors such as severe under-predictions. We also evaluate the Mean Absolute Error (MAE) and $R^2$ (coefficient of determination) to indicate the measurement effectiveness. Table 3 shows that GBRT has the smallest RMSE and the highest $R^2$, indicating minimal prediction error. We also observe a stable prediction effectiveness in GBRT, with an RMSE deviation of merely 1.2. This is not surprising given the ensemble nature of GBRT, which combines several base models into one predictive model. Fig. 5 shows an example of the resources-QoS model.
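The profiling sweep and the three accuracy metrics can be sketched in plain Python as below. The resource ranges passed to `resource_grid` are illustrative values, not the profiled settings, and the regressor itself is left out: any of the models named above would be fitted on the sampled vectors before `regression_metrics` is applied to its predictions.

```python
import itertools
import math
from typing import Iterator, List, Sequence, Tuple


def resource_grid(cpu_cores: Sequence[int], memory_mb: Sequence[int],
                  llc_ways: Sequence[int], mbw_percent: Sequence[int]
                  ) -> Iterator[Tuple[int, int, int, int]]:
    """Enumerate every combination of the per-resource profiling steps."""
    return itertools.product(cpu_cores, memory_mb, llc_ways, mbw_percent)


def regression_metrics(actual: List[float],
                       predicted: List[float]) -> Tuple[float, float, float]:
    """Return (RMSE, MAE, R^2) for paired observations and predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean = sum(actual) / n
    # Assumes the observations are not all identical (ss_tot > 0).
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return rmse, mae, 1.0 - ss_res / ss_tot
```

Because RMSE squares each error before averaging, a single large miss raises it more than the same total error spread over many samples, which is why it is reported alongside MAE.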
The learnt model is periodically synchronized to the corresponding DLRAM, which uses it to derive a resource re-allocation plan that brings the victim component back to the targeted QoS. Assume $R$ is the current resource allocation vector and $r$ is the re-allocation to be enforced. Our goal is to find $r$ such that the subsequent QoS reaches the targeted QoS as closely as possible, i.e., $F(R + r) \ge (1 - \varepsilon) Q_{tgt}$, where $\varepsilon$ is a small number, e.g., 0.01 or 0.05. Practically, $r$ can be determined by first setting the CPU steps and then fine-tuning the memory allocation. This stems from the fact that reclaiming CPU is a much easier and dominant step: it can effectively throttle disk reads and thus speed up the memory reclamation [38]. Subsequently, the memory, LLC, and MBW components of the vector can be finalized to achieve the approximated QoS.
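A greedy search in this spirit is sketched below in Python. The `model` callable stands in for the fitted function $F$, and the resource names, step sizes, and per-resource limits are hypothetical inputs; the strict ordering (CPU, then memory, LLC, MBW) is a simplification of the procedure described above.

```python
from typing import Callable, Dict

Allocation = Dict[str, float]


def plan_reallocation(model: Callable[[Allocation], float],
                      current: Allocation, target_qos: float,
                      steps: Allocation, limits: Allocation,
                      epsilon: float = 0.05) -> Allocation:
    """Greedily grow r until model(R + r) >= (1 - epsilon) * target_qos."""
    extra = {name: 0.0 for name in current}

    def predicted() -> float:
        return model({k: current[k] + extra[k] for k in current})

    # CPU is adjusted first; memory, LLC and MBW are fine-tuned afterwards.
    for name in ("cpu", "mem", "llc", "mbw"):
        while (predicted() < (1 - epsilon) * target_qos
               and current[name] + extra[name] + steps[name] <= limits[name]):
            extra[name] += steps[name]
    # The caller should re-check the prediction: limits may be hit first.
    return extra
```

The returned vector is the additional allocation per resource; it may fall short of the target when every resource reaches its limit, so the caller should verify the predicted QoS before enforcing the plan.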

Low-cost Task Preemption
Key Idea. Evicting running tasks is particularly expensive. Many existing solutions, such as the default YARN capacity or fair scheduler, forcibly kill the preempted containers without saving the task context, which incurs substantial repeated task failover and re-submission. This inevitably results in non-negligible system cost and delays job completion. We aim to minimize the cost of task preemption by progressively reclaiming resources from opportunistic tasks, keeping task containers alive instead of interrupting them directly, without introducing noticeable performance degradation. We preempt resources uniformly from multiple tasks, a simple yet effective means to amortize the reclamation among tasks and affect each task as gently as possible. This avoids excessive resource withdrawal from a single task, which may lead to dramatic execution slowdown or failure. While characterizing batch tasks and formalizing the preemption as an optimization problem may help to find the optimal preemption plan, it comes with a prohibitive implementation cost for instrumentation and profiling and is not generally applicable (it is job-dependent and the number of tasks is huge).
The core of the resource reclamation is to re-throttle the resource upper limit. Reclaiming CPU can be achieved simply by revoking CPU time slices and pinning them to other tasks. We adopt pageable memory mechanisms for assigning memory to applications. We use memory.limit_in_bytes to reduce the upper memory limit and then memory.memsw.limit_in_bytes to move the memory parts beyond the limit into the swap space on disk, without terminating the tasks.

Algorithm 2. Low-Cost Task Preemption
Input: m: the targeted microservice;
       T: opportunistic tasks queued on the node;
       Q_tgt(m): targeted QoS (MIPS) of m;
       c: a pre-defined amount of resource to preempt from each task;
       w: a mini step of resource reclamation for each task
 1  while Q_tgt(m) is unsatisfied do
 2      // get the resources to be preempted, from the autoscaler
 3      r ← InferPreemptedResource()
        // determine the number of preemptions
 4      K ← ⌈r/c⌉
        // pick up K tasks to be preempted
 5      T ← GetKPreemptedTasks(T, B)
        for t in T do in parallel
 6          // initialize the preemption plan for each task
 7          s ← c
            // reclaim resource in mini-steps
 8          while s > 0 do
 9              // incrementally reclaim resource
10              s ← s − w
                // reclaim the basic step size from the preempted task
11              r_t ← r_t − w
                // task preemption with reduced runtime resource
12              Preempt(t, w)
                // check the task aliveness
Typically, memory management follows either a static policy (page-locked/pinned memory allocation) or a dynamic policy (pageable/unpinned memory allocation), each with inherent advantages and limitations. While page-locked memory achieves more efficient memory read/write operations because it never communicates with the hard drive, developers must take responsibility for allocating and freeing memory, which brings additional management overhead and potential performance uncertainty due to misuse. Pageable memory, on the other hand, is more widely adopted in modern operating systems to virtually enlarge the memory capacity. It swaps pageable segments between memory and the hard drive based on page replacement algorithms; it may, however, lead to performance jitter due to variation in swap availability. We adopt swapping-based dynamic allocation but leave the option of pinned memory to developers, who can decide whether to transfer data from pageable segments to pinned memory based on application-specific requirements, e.g., read/write frequency.
QoS-Driven Gradual Resource Reclamation. Algorithm 2 depicts the procedure of low-cost task preemption. Upon receiving the auto-scaling request, together with the resource preemption update ($r$), from the Autoscaler of the corresponding DLRAM, the Preemption Manager launches an iteration of task preemption: it chooses $K$ opportunistic tasks from the node's queue according to a given preemption strategy and then reclaims resources from the selected task containers evenly and simultaneously (Lines 1-5). We introduce several pluggable algorithms to implement GetKPreemptedTasks() (detailed below). For each individual task, we revoke the pre-defined amount of resource $c$ in multiple mini-steps to reduce the noticeable performance degradation of the preempted task. Specifically, each step of the preemption deprives the task of only the mini step $w$ at once in Preempt() (Lines 8-11). The value of $w$ is tunable and should be set moderately: a large step ensures rapid performance recovery for the DLRA but may lead to unexpected slowdown, or even failure, of the opportunistic tasks. In contrast, a smaller value delays the performance rescue and is thus not ideal for real-world settings.
To minimize the risk of task failover, we introduce an aliveness checking process, AlivenessCheck(), to keep the affected task alive as far as possible. Once a task is detected to have lost its heartbeat or to have hung due to memory shortage, we immediately stop reclaiming resources from it and add it to the blacklist to avoid any further preemption (Lines 13-15). The Autoscaler in the DLRAM then measures whether the MIPS drop has been mitigated, i.e., whether the targeted QoS is satisfied. If not, another round of preemption is launched: the Preemption Manager asks the Autoscaler to infer the amount of resource to be reclaimed, and the aforementioned procedure repeats.
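One iteration of the gradual reclamation can be sketched in Python as follows. This is a simplified stand-in rather than the Preemption Manager itself: tasks are processed sequentially instead of in parallel, victims are taken in queue order instead of through a pluggable strategy, and `is_alive` is a hypothetical callback playing the role of the aliveness check.

```python
import math
from typing import Callable, Dict, List, Set


def preempt_round(needed: float, per_task: float, step: float,
                  tasks: List[str], allocation: Dict[str, float],
                  is_alive: Callable[[str], bool],
                  blacklist: Set[str]) -> float:
    """Reclaim up to `per_task` from K tasks in mini-steps; return the total."""
    k = math.ceil(needed / per_task)            # number of tasks to preempt
    victims = [t for t in tasks if t not in blacklist][:k]
    reclaimed = 0.0
    for task in victims:                        # sequential stand-in for parallel
        remaining = per_task
        while remaining > 0:
            delta = min(step, remaining, allocation[task])
            if delta <= 0:
                break                           # nothing left to take from this task
            allocation[task] -= delta           # shrink the task's runtime resource
            remaining -= delta
            reclaimed += delta
            if not is_alive(task):
                blacklist.add(task)             # stop and protect the task
                break
    return reclaimed
```

The caller would repeat such rounds, re-querying the required amount each time, until the targeted QoS is met.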
Pluggable Preemption Strategies. The following pluggable preemption strategies are configured in TOPOSCH-PAG:
Random Based Scheme (RB): Opportunistic tasks are randomly selected for preemption.
Longest Tasks First (LTF): Opportunistic tasks with the longest execution time are most likely to be preempted. This policy is based on the assumption that the longest task is likely to be a straggler [39], [40] compared with its peer tasks. Reclaiming resources from a task that is already slow may not incur substantial further slowdown and may even accelerate the straggler handling.
Newest Tasks First (NTF): The latest tasks are most likely to be preempted. The intuition is that reclaiming partial resources has limited impact on the execution progress at an early execution stage.
Non-locality Tasks First (NLTF): Tasks whose required data is not local are most likely to be preempted. This policy assumes that such tasks may resume and execute faster on other nodes that hold the data to be processed.
To analyze the impact of preemptive scheduling on the execution efficiency of co-located jobs, we also introduce a preemption scheme that works against the even distribution of the reclaimed resources among tasks:
Least Preempted Scheme (LP): This policy selects the minimal number of tasks that can satisfy the resource reclamation requirement. It is equivalent to Most Resources First (MRF) [38], where the tasks with the most allocatable resources are preempted. The intuition behind this scheme is to reclaim resources as fast as possible and reduce the scope of the affected tasks.
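The strategies above can be expressed as interchangeable selection functions, as in the following Python sketch. The `Task` fields and function names are hypothetical, and "most likely to be preempted" is simplified here to a deterministic ordering, which is an assumption on our part.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: str
    runtime_s: float       # elapsed execution time
    has_local_data: bool   # whether the input data resides on this node
    allocated: float       # resource currently allocated to the task


def random_based(tasks: List[Task], k: int,
                 rng: Optional[random.Random] = None) -> List[Task]:
    rng = rng or random.Random()
    return rng.sample(tasks, min(k, len(tasks)))


def longest_first(tasks: List[Task], k: int) -> List[Task]:
    return sorted(tasks, key=lambda t: t.runtime_s, reverse=True)[:k]


def newest_first(tasks: List[Task], k: int) -> List[Task]:
    return sorted(tasks, key=lambda t: t.runtime_s)[:k]


def non_local_first(tasks: List[Task], k: int) -> List[Task]:
    # False sorts before True, so tasks without local data come first.
    return sorted(tasks, key=lambda t: t.has_local_data)[:k]


def least_preempted(tasks: List[Task], needed: float) -> List[Task]:
    """LP: fewest tasks whose allocations cover `needed`, largest first."""
    chosen: List[Task] = []
    covered = 0.0
    for task in sorted(tasks, key=lambda t: t.allocated, reverse=True):
        if covered >= needed:
            break
        chosen.append(task)
        covered += task.allocated
    return chosen
```

Note that LP takes the amount to reclaim rather than a task count, which reflects its goal of minimizing the number of affected tasks instead of spreading the reclamation evenly.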

Experiment Setup
Hardware and Software. TOPOSCH was deployed on a 12-machine cluster, with each machine containing two 16-core (32 logical cores) Intel Xeon(R) Silver 4110 CPUs @ 2.10GHz, 187GB RAM, 11MB LLC, and a 10 Gb Ethernet network. Each node was installed with Debian 4.9.82. We implemented TOPOSCH in 5k+ lines of Java and fully integrated it with YARN 3.0-Beta1. The prediction engine is written in Python and operates as a separate container. To submit a DLRA, the topology of microservices is specified in a configuration file, DAG_SERVICE.xml, and all requests are tracked and recorded in a Redis key-value database. Each DLRAM periodically calculates the microservices' risk levels at a fixed time interval, such as 60s or 120s.
Workloads. We emulate a mixture of realistic workloads in cloud datacenters.
DLRAs. We adopt PiggyMetrics [32], a financial management application based on a microservice architecture, as the representative DLRA in our experiments. It consists of 12 components, each encapsulated in a Docker image. We embed the instrumentation and tracing mechanisms detailed in Section 4.1 into each component. We use JMeter [41] to generate workloads for PiggyMetrics and emulate user behaviors via TPC-W [42]. Each PiggyMetrics instance has two latency-critical components: MongoDB serves as the primary database for each microservice, while Kafka supports the publish-subscribe (pub-sub) model and the messaging system among different microservices.
Batch jobs. We employ HiBench [43] to generate batch jobs using Spark-2.4.6. They include 8 ML workloads: logistic regression (lr), random forest (rf), Bayesian classification (bayes), singular value decomposition (svd), principal component analysis (pca), gradient boosted trees (gbt), alternating least squares (als), and kmeans. The default configuration for each job is: spark.driver.memory = 512M, spark.executor.memory = 6G, yarn.executor.cores = 4, map.parallelism = 12, shuffle.parallelism = 8, hibench.yarn.executor.num = 60, based on the profiling of internal traces and daily practice in Alibaba's testing clusters.
Metrics. We measure the following metrics:
Comparative Approaches and Methodology. To validate the effect of QoS assurance, we generate and compare two variants of TOPOSCH as an ablation study, by switching the procedure of performance prediction and auto-scaling on or off, and compare them against the following two baselines:
YARN: The native capacity scheduler of Apache YARN, used for default co-location [24].
Run-Alone: The run-alone case, where PiggyMetrics or batch jobs are executed independently in an isolated environment without the related interference.
TOPOSCH-p: TOPOSCH with auto-scaling enabled and performance-driven task preemption. Opportunistic tasks are throttled to prioritize the latency-sensitive components, driven by the performance modeling and prediction engine.
TOPOSCH-n: TOPOSCH with auto-scaling disabled, without performance modeling and opportunistic preemption.
We also compare our approach with other baselines, namely the state-of-the-art performance-aware scheduling strategies for co-locating LRAs with batch jobs in shared clusters. For a fair comparison, we adapt their algorithms to the YARN setting and apply their scheduling and QoS control schemes at the scheduler level:
Quasar: A scheduling approach that uses collaborative filtering to predict the performance of monolithic workloads. We implemented it to guide the placement of batch tasks and microservices [44].
ROSE: A performance-aware scheduling approach that harvests idle resources with opportunistic tasks and guarantees the QoS of long-running applications by tracking application-specific performance counters such as CPI and MPKI [27].
Kube-auto: Autoscaling [45] is an industry standard for elastically scaling allocations to acquire resources on demand. We implement a utilization-based autoscaling policy adopted by Kubernetes, one of the most appealing container management systems. It triggers pod auto-scaling based on CPU or memory utilization.
We mainly evaluate TOPOSCH in terms of the overall effectiveness of workload co-scheduling, the effectiveness of auto-scaling, and the individual contribution of each system component. Specifically, the experiments are three-fold:
We evaluate the performance balance between DLRAs and batch jobs. We compare the two variants of TOPOSCH with the baseline approaches and with the ground truth obtained when running the DLRAs alone (Section 6.2).
We examine the effectiveness of auto-scaling with different preemption strategies. For comparison, we evaluate our proposed schemes against the killing-based mechanism adopted by native YARN (Kill) and the Least Preempted scheme (LP) (Section 6.3).
We perform several micro-benchmarks to demonstrate the performance gains and system overhead. We first evaluate the impact of multi-dimensional resource control and isolation, particularly on the key microservices. We mainly compare the proposed method against the default isolation mechanism in the native YARN Node Manager (YARN) and the isolation mechanism provided by the cpu subsystem without LLC and MBW control and isolation (CPU-SBS), which is typically adopted by cluster management systems [3], [16], [26], [27], [36]. We then analyze the parameter sensitivity, the time consumption of critical path analysis, and the overall system overhead (Section 6.4).
Result Report. To minimize noise, we repeat each experiment 10 times independently and compute the average running time or performance.

Overall Scheduling Effectiveness
To emulate realistic production-level workloads, we submit 100 Spark ML jobs in several rounds. 30 of them are opportunistic jobs, consisting of approximately 400 opportunistic tasks, to improve the cluster utilization. 3 PiggyMetrics application instances are initially launched. To further investigate the impact of different workloads on the effectiveness, we increase the submitted number of PiggyMetrics instances with varying resource requirements, concurrent users, and request distributions. Specifically, we measure the tail latency of three types of requests including Log on, View, and Update operations to the Account service as the performance indicator of DLRAs.
Performance of DLRAs. Fig. 6 shows the tail latency increase ratio against Run-Alone when the DLRAs are co-scheduled with different Spark jobs. Overall, TOPOSCH-p outperforms all baselines in all cases, and native YARN is the least effective in assuring QoS. For example, over all co-location scenarios, the tail latency of TOPOSCH-p is merely 1.12x on average (1.05x to 1.19x) that of Run-Alone, and is reduced by 47% on average compared with native YARN. This observation derives from the synergetic effect of the batch intervention mechanism and the elastic auto-scaling mechanism in prioritizing the QoS of latency-critical workloads over other Spark jobs. When the auto-scaling mechanism is disabled, the tail latency of TOPOSCH-n increases to 1.27x that of Run-Alone on average (1.1x to 1.36x), because scheduling intervention is then the single source of QoS protection.
Regarding the other baselines, Quasar achieves the second-lowest tail latency, between TOPOSCH-p and TOPOSCH-n, due to its elaborate mechanism for profiling and modeling the performance of co-located workloads. However, it is designed for monolithic applications and thus lacks fine-grained end-to-end tracking of distributed components and timely adjustment of resource allocation and task scheduling at runtime. We will also demonstrate its inferior effectiveness in terms of batch JCTs and its inflexibility in handling task re-scheduling. Compared with TOPOSCH and Quasar, Kube-auto has higher tail latency due to the low accuracy of using a straightforward threshold-based control scheme to trigger auto-scaling. ROSE relies on CPI and MPKI, which are high-level and fluctuating performance counters, to throttle batch tasks for monolithic long-running applications without an auto-scaling mechanism. This drawback limits the accuracy of QoS assurance, leading to less competitive results than the auto-scaling based approaches. Fig. 7 depicts the corresponding cumulative distribution function (CDF) of the absolute tail latency for the three types of requests, separately, when co-scheduling with all these Spark jobs. In line with the observations in Fig. 6, the curve of TOPOSCH-p is the closest to Run-Alone, followed by Quasar and TOPOSCH-n.
Performance of Batch Jobs. Fig. 8 illustrates the JCT of the Spark jobs when co-located with the DLRA, normalized against the case where the jobs are executed alone. Overall, YARN and ROSE unsurprisingly have the shortest JCT, due to their native focus on batch job scheduling. Nevertheless, their capability of QoS assurance is insufficient, and they are thus not ideal for the co-location of DLRAs and batch jobs. Compared with native YARN, adopting TOPOSCH-n and TOPOSCH-p results in an average JCT increase of 17% and 26%, respectively. This conforms to the expectation of compromising the performance of batch jobs for the QoS assurance of DLRAs. By contrast, Quasar and Kube-auto have longer average JCTs because they lack low-cost resource reclamation when making room for DLRAs.
The result shows the trade-off achieved in our design; considering the characteristics of offline processing, such an execution delay could be acceptable. Note that one can flexibly tweak the performance balance between DLRA and batch jobs by fine-tuning the parameter setting of the resource visibility in the containment phase and re-setting up a moderate QoS model in the mitigation phase.
Impact of Different Sizes of Workloads. We increase the number of DLRA instances from 3 to 12 when co-scheduling with Spark lr jobs. For generality, the DLRA instances are submitted with different resource requirements. Fig. 9 shows the increase of the 95th percentile latency against the case where the DLRA runs alone. All three types of requests, without exception, experience an upward trend. Our approach consistently outperforms the other baselines as the number of instances grows. This indicates that the performance gain of our approach does not vary much and is not particularly sensitive to workload instances with different characteristics.

Autoscaling and Preemption Effectiveness
Effectiveness of Autoscaling. Our experimental study shows that, as opposed to other microservices in the DLRA, the database microservices, e.g., statistics-mongo-service, usually exhibit more latency fluctuations, particularly in the event of load spikes, i.e., a surge in user access. To illustrate this, we emulate different numbers of concurrent users, varying from 50 to 500, and measure the performance of the database and its co-resident batch neighbors. Fig. 10 presents the relationship between the growth of concurrent users, the observed OPS of the database component, and the corresponding JCT of jobs on the same node. In TOPOSCH-n, where the auto-scaling mechanism is disabled, the OPS growth starts to slow down when user concurrency reaches 200, and the OPS drops gradually once the concurrency reaches 400; meanwhile, the JCT also climbs promptly from the point of 200 concurrent users. By contrast, when auto-scaling is enabled, TOPOSCH-p can ensure that more resources are re-allocated to the key database microservice and retain a high service level. As a result, the OPS growth is maintained in proportion to the increasing demand of user access, without any performance degradation. Naturally, the JCT of co-resident Spark jobs is larger than under TOPOSCH-n, simply because more resources are taken away to prioritize the QoS of DLRAs.
Comparison of Different Preemption Schemes. We investigate how different preemption schemes perform in a controlled execution environment. We create two representative co-location settings with distinct system load, roughly 80% (heavy load) and 40% (light load) utilization, by placing different numbers of lr opportunistic jobs on the same node as MongoDB. 100 users are created in PiggyMetrics and concurrently access the internal microservices, particularly the MongoDB service. Fig. 11 and Table 4 show an overall increase of JCT, with more tasks involved in the preemption, in the heavy load environment compared with the light load environment. This is because DLRAs experience fiercer resource contention and need to take more resources from batch tasks to recover the QoS target. We also observe larger deviations of JCTs in the heavy load cases, simply because a growing task-level execution delay or rescheduling caused by resource reclamation affects the job-level progress in a more stochastic manner. Among all compared schemes, the gradual preemption based schemes (LTF, NTF, RB and NLTF) significantly outperform LP and the Kill-based scheme. For instance, the JCT of NTF is reduced by 15.1% and 26.3% compared with LP and the Kill-based scheme, respectively. This is because the gradual preemption mechanism reclaims resources from multiple tasks, and the mini-steps of resource reclamation reduce the perceived performance degradation compared with LP. Although fewer task containers are preempted in LP than under uniform preemption among different tasks, the MRF policy in LP can cause mandatory failover: the low CPU occupation or memory allocation sometimes breaks the heartbeat communication between the running containers and the RM, which eventually leads to substantial container restarts. The Kill-based scheme directly evicts and restarts all relevant tasks, and therefore has the longest JCT.
While gradual preemption based schemes have similar JCTs, NTF consistently outperforms others in both light and heavy load scenarios. This is because the impact on each individual task in NTF will be limited although more tasks are involved in the preemption in the heavy load scenario. In fact, reclaiming a thin piece of resource, particularly the CPU, from an early-stage task will have negligible impact on the overall execution. Considering the CPU slack or over-claiming is the norm rather than the exception in cluster management, the residual resource is sufficient for underpinning the task initialization and enabling the execution progress. By contrast, LTF and NLTF identify the longest tasks or the tasks without local data. However, the resource reclaim slows down those tasks further and the system-level straggler mitigation and task rescheduling will be triggered, resulting in longer JCT than NTF.

Micro-benchmarking
Performance of Key Latency-Sensitive Microservices. In this experiment, we evaluate how the key microservices in the DLRA perform in the co-located environment when different loads are imposed on the application. We specifically measure the QoS of the key database, MongoDB, and the key messaging microservice, Kafka. We use ycsb-mongo to stress the database. Both the record count and the operation count are set to 100 million, and the records take up roughly 82GB. We generate 75 million messages to Kafka, each occupying 1KB. 4 lr opportunistic jobs with 80 opportunistic tasks are placed onto the same node that executes the containers of these microservices.
As shown in Fig. 12, the proposed TOPOSCH-p outperforms the other approaches in ensuring the QoS of both the MongoDB and Kafka microservices. For instance, MongoDB's OPS under TOPOSCH-p is 1.78x and 1.49x that of native YARN and the CPU-SBS-only approach, respectively. This is primarily due to the combination of adaptive delay scheduling of batch tasks, effective isolation over multiple resources, and the QoS assurance in the auto-scaling mechanism. In effect, regular batch tasks give way to the latency-sensitive DLRA components by leaving enough room when a rising risk has been detected. Meanwhile, opportunistic tasks will be moderately
Performance Balance Between DLRAs and Batch Jobs. As discussed in Section 4.2, the reservation coefficient n is leveraged to tune the impact of node-level risk on the amount of resources reserved for the microservices of DLRAs. We gradually increase its value and examine the resulting performance of DLRAs and Spark jobs. Fig. 13 shows an increasing trend in the JCT of all batch jobs as n ramps up. For a given node risk, a larger n reserves more resources for the DLRA, and thus trades more batch performance for a lower DLRA latency. Specifically, the average JCT at n = 1 is 1.53x higher than that at n = 0, where no QoS assurance is given, i.e., native YARN. Kmeans jobs and lr jobs experience a 23.4% and 18.2% increase, respectively. Tasks without data locality requirements, such as the Pi tasks, can be delayed for a longer time. This is because they are more likely to be throttled or evicted to yield sufficient resources for the victim microservices. Tasks with data locality requirements, such as the tasks of Kmeans jobs and lr jobs, on the other hand, are launched directly from the second retry for rapid task startup, even if the node is detected as risky (as depicted in Algorithm 4.2.2). Accordingly, this leads to a slightly increased latency of the co-existing microservices.
System Overhead. We analyze the per-AM overhead of the DLRA Analyzer in terms of time and memory consumption. (i) Time Consumption. As shown in Fig. 14, the time cost increases linearly with the number of traces, and its growth slows once the number of traces reaches 30,000. The maximal measured time is no more than 1.6 seconds. Considering the overall time consumption of resource allocation, the incurred increase in scheduling latency is less than 1% compared with native YARN. (ii) Memory Cost. The additional memory used for fast data access through Redis is roughly 126MB, an increase of less than 2% over native YARN. Given the intrinsic diversity in request number and arrival pattern, the number of traces used for tracking latency in TOPOSCH over a given period can be customized in the AM to balance scheduling precision against the incurred overhead. It is worth noting that the overhead analysis is on a per-AM basis but extends naturally to cases of multiple DLRAs. In such cases, the memory cost multiplies with the number of DLRAs, because Redis is instantiated to support multi-tenancy and the AM of each DLRA independently stores its own request tracing information. Each AM is encapsulated in a Docker container, so that it runs separately with stringent resource isolation and negligible interference.
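The trade-off between the number of retained traces and the incurred overhead can be illustrated with a minimal sketch. It assumes that the AM keeps only the most recent end-to-end latencies in a bounded buffer and derives the tail latency from them; the class `TraceWindow`, its parameter `max_traces`, and the nearest-rank percentile are hypothetical choices for illustration, not the data structures used in TOPOSCH.

```python
from collections import deque
import math


class TraceWindow:
    """Bounded buffer of recent end-to-end latencies (hypothetical helper)."""

    def __init__(self, max_traces: int):
        if max_traces <= 0:
            raise ValueError("max_traces must be positive")
        # A deque with maxlen drops the oldest sample once the buffer is full,
        # so memory stays bounded by max_traces.
        self._latencies = deque(maxlen=max_traces)

    def record(self, latency_ms: float) -> None:
        """Store the end-to-end latency of one traced request."""
        self._latencies.append(latency_ms)

    def __len__(self) -> int:
        return len(self._latencies)

    def tail_latency(self, percentile: float = 99.0) -> float:
        """Nearest-rank percentile over the retained traces."""
        if not self._latencies:
            raise ValueError("no traces recorded")
        ordered = sorted(self._latencies)
        rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
        return ordered[rank - 1]


if __name__ == "__main__":
    window = TraceWindow(max_traces=3)   # assumed, deliberately tiny window
    for latency in (10.0, 20.0, 30.0, 40.0):
        window.record(latency)           # the oldest sample (10.0) is evicted
    print(len(window), window.tail_latency(99.0))
```

A larger `max_traces` gives a more stable tail estimate at the price of more memory and a longer sort, which mirrors the precision-versus-overhead balance described above.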

RELATED WORK
Resource Managers in Shared Clusters. Cluster resource management frameworks, such as YARN [14], Mesos [15], Fuxi [3], and Borg [10], are based on two-level centralized scheduling. They decouple inter-job resource sharing from intra-job task scheduling; the job managers negotiate with the centralized manager and then take charge of the job execution. Capacity Scheduling [46] and Fairness Scheduling [47] have been proposed to achieve efficient quota-based resource sharing among multiple jobs. The objective is the enforcement of scheduling invariants for heterogeneous applications, with policing/security utilized to prevent excessive resource occupation. To further improve cluster utilization and system throughput, many other systems are based on a fully decentralized design, such as Apollo [48] and Omega [49], or a hybrid design, such as Mercury [26] and ROSE [16]. However, all these systems are devised for scheduling batch analytic jobs. TOPOSCH is built upon YARN 3.0 and follows a hybrid scheduling design: our key modules are integrated with the centralized resource management framework, while the opportunistic tasks are managed in a decentralized manner. The proposed mechanisms are designed to be complementary to, and can be implemented upon, the existing protocols of any two-level resource management system.
Performance Tracing and Diagnostics. Many prior works are devoted to anomaly diagnosis and behavior analysis of large-scale distributed applications. They can be classified into two categories. (i) Black-box approaches use external application states to infer and analyze problems. [29], [50] rely on a tremendous number of log files to extract performance information and infer the dependency models. [51] trains models to predict and localize latent errors in microservices based on log information comprising a set of predefined features. [52] uses fault injection to measure the execution and data flows of distributed applications and to diagnose the bottlenecks. (ii) White-box approaches monitor causality within microservices instead of relying on inference through statistical analysis. [53], [54] infer the execution path of the application based on static analysis and symbolic execution. [55], [56] provide developers with tracing frameworks to add trace-points within the application to collect runtime footprints. In comparison, TOPOSCH uses a white-box methodology to track and trace requests over the whole DLRA and avoids over-dependence on the prior diagnosis conditions that are typically pre-defined in black-box approaches. Instead of using the existing fine-grained tracking instrumentation, TOPOSCH adopts a light-weight tracking method to trace DLRA-level latency data, thereby significantly reducing the per-DLRA runtime overhead.
QoS-Aware Workload Co-Location. The ability to co-locate jobs (i.e., to execute them on the same CPU or GPU) has been identified as a means to address the under-utilization problem. Understanding and achieving high resource utilization or high energy efficiency for heterogeneous workloads in cloud computing is an important topic [27], [37], [44], [57], [58]. Existing work on QoS management when co-locating heterogeneous workloads falls into two distinct categories: (i) reducing the probability of resource contention, primarily for the runtime QoS of LRAs, by either granting isolated execution environments to LRAs [49], [59] or adjusting task placement to reduce the resource contention on a certain node [11], [60]; (ii) reducing the performance interference caused by resource contention through performance prediction and resource inference, prioritizing the resource requests of latency-sensitive LRAs [17], [18], [19], [57], [60], [61]. Many of them apply machine learning to precisely characterize behavioral patterns. For instance, [62], [63] leverage various ML methods such as support vector regression, random forest and extreme gradient boosting tree to predict workloads or system load changes. [64], [65] employ neural networks to estimate JCT and load fluctuation. However, they can hardly take runtime information into consideration and thus fail to provide sufficient insight for timely calibration of the runtime QoS. [36], [44] use complicated multi-variable statistical classifiers to predict the expected interference among applications. They perform preparatory small-scale interference tests with varied levels of background applications. [18], [19] use a performance index to depict contention at the time of resource allocation and conduct offline studies of the relationship between multiple resources and the resulting performance.
However, they are designed to guarantee performance for monolithic applications and are not directly applicable to the scheduling problem when there is tempo-spatial latency fluctuation within DLRAs. Nevertheless, the key techniques are orthogonal to our QoS prediction engine and can be modified for profiling the QoS of key microservices. By contrast, TOPOSCH leverages distributed tracing to pinpoint the risky microservices and intervene in batch scheduling; meanwhile, TOPOSCH adopts prediction-based auto-scaling to reclaim the most suitable resources from batch tasks and minimize the cost of task preemption.

CONCLUSION
Balancing cluster utilization and applications' QoS is a nontrivial task. Microservice architecture advances the manifestation of distributed LRAs (DLRAs), comprising multiple interconnected microservices that are executed in long-lived distributed containers and serve massive user requests. Detecting and mitigating QoS violation becomes even more intractable due to the network uncertainties and latency propagation across dependent microservices.
In this paper, we present TOPOSCH, a scheduling system to adaptively co-schedule and co-locate latency-sensitive applications and batch jobs. TOPOSCH periodically identifies the risk of QoS violation for the running microservices by tracing and analyzing the critical path based on substantial requests and the consequential end-to-end latency graph. We then propose an effective delay scheduling mechanism in the scheduler that intervenes in the upcoming task placement to prioritize the QoS assurance of DLRAs. A vertical auto-scaling mechanism, with the aid of resource-performance modeling and fine-grained resource access control, is proposed for promptly mitigating the QoS violation of key microservices in the DLRAs. Graceful task preemption is leveraged to keep the cost of preemption and resource reclamation low during auto-scaling.
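As a rough illustration of the critical-path step summarized above, the following sketch finds the slowest call chain in a small latency graph. It assumes that the graph is acyclic, that calls along a chain are sequential, and that each service contributes a single latency value; the function `critical_path` and the service names are hypothetical and do not reproduce the graph construction used in TOPOSCH.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(latency_ms, calls):
    """Return (total_latency, path) of the slowest call chain.

    latency_ms: dict mapping each service to the latency spent inside it
    calls:      dict mapping a caller to the list of services it calls
    """
    # TopologicalSorter expects node -> predecessors. Passing caller -> callees
    # makes every callee appear before its callers in the resulting order.
    graph = {svc: calls.get(svc, []) for svc in latency_ms}
    best = {}      # service -> latency of the slowest chain starting there
    next_hop = {}  # service -> next service on that chain (None at a leaf)
    for svc in TopologicalSorter(graph).static_order():
        callees = calls.get(svc, [])
        slowest = max(callees, key=lambda c: best[c], default=None)
        downstream = best[slowest] if slowest is not None else 0.0
        best[svc] = latency_ms[svc] + downstream
        next_hop[svc] = slowest
    # The critical path starts at the service with the largest chain latency.
    start = max(best, key=best.get)
    path, node = [], start
    while node is not None:
        path.append(node)
        node = next_hop[node]
    return best[start], path


if __name__ == "__main__":
    # Hypothetical per-service latencies in milliseconds.
    latency = {"gateway": 5.0, "orders": 20.0, "mongodb": 40.0, "kafka": 15.0}
    calls = {"gateway": ["orders", "kafka"], "orders": ["mongodb"]}
    print(critical_path(latency, calls))
```

Under these assumptions, the services on the returned path are the candidates whose latency most directly determines the end-to-end latency, and hence the ones worth examining first for QoS risk.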
It is intricate but imperative to understand the end-to-end and tail latency in a dynamic, highly-concurrent distributed system at Internet scale. An overt observation is that cloud-based LRAs have now become another main type of workload, even more important than the conventional batch jobs. This particularly boosts the requirement for strict QoS guarantees when diverse workloads are mixed. The investigated holistic approach at both the cluster level and the node level has potential implications for workload co-location in many real-world domains and is thus apt for adoption in Cloud and HPC schedulers.
In the future, we plan to examine the proposed mechanism over more microservices in production environments and investigate their QoS sensitivity to fine-grained resources at large scale. We also plan to auto-learn the parameter settings by using reinforcement learning.
Tianyu Wo (Member, IEEE) received the BEng and PhD degrees in computer science from Beihang University, in 2001 and 2008, respectively. He is an Associate Professor with the School of Software, Beihang University. His current research interests include distributed systems, network operation systems and IoV systems.
Chunming Hu received the PhD degree from Beihang University, in 2006. He is a professor and dean of the School of Software, Beihang University. His current research interests include distributed systems, system virtualization, data management and processing systems.
Hao Peng is currently an Assistant Professor with Beijing Advanced Innovation Center for Big Data and Brain Computing in Beihang University, and School of Cyber Science and Technology in Beihang University. His research interests include representation learning, text mining and social network mining.
Junqing Xiao received the MSc degree from Beihang University, in 2018. He is currently a software engineer with Alibaba Group. His research interests include distributed systems and data center resource management.
Albert Y. Zomaya (Fellow, IEEE) is the Peter Nicol Russell Chair professor of Computer Science in the School of Computer Science, Sydney University, and serves as the director of the Centre for Distributed and High-Performance Computing. He has published more than 700 scientific papers and articles and is author, co-author or editor of more than 30 books. He is the editor in chief of the ACM Computing Surveys and serves as an associate editor for several leading journals. He is a decorated scholar with numerous accolades including Fellowship of the IEEE, AAAS, and the IET. Also, he is a fellow of the Australian Academy of Science, Fellow of the Royal Society of New South Wales, Foreign Member of Academia Europaea, and Member of the European Academy of Sciences and Arts. His research interests are in the areas of parallel and distributed computing, networking, and complex systems.
Jie Xu (Member, IEEE) is the chair professor of Computing with the University of Leeds, the leader for a Research Peak of Excellence at Leeds, director of UK EPSRC WRG e-Science Centre, Executive Board Member of UK Computing Research Committee (UKCRC), and Chief Scientist of BDBC, Beihang University, China. He has worked in the field of dependable distributed computing for more than 30 years. He is a steering/executive committee member for numerous IEEE conferences including SRDS, ISORC, HASE, SOSE and is a co-founder of IEEE IC2E, DAPPS, JCC, etc. He has led or co-led many research projects to the value of over $30M, and published in excess of 400 academic papers, book chapters and edited books. He is a Turing Fellow of the Alan Turing Institute.