Abstract:
Long-running containerized workloads (e.g., machine learning), which typically show time-varying patterns, are increasingly prevailing in shared production clusters. To i...Show MoreMetadata
Abstract:
Long-running containerized workloads (e.g., machine learning), which typically show time-varying patterns, are increasingly prevailing in shared production clusters. To improve workload performance, current schedulers mainly focus on optimizing short-term benefits of cluster load balancing or initial container placement on servers. However, this would inevitably bring many invalid migrations (i.e., containers are migrated back and forth among servers over a short time window), leading to significant service level objective (SLO) violations. This paper introduces Tetris, a model predictive control (MPC)-based container scheduling strategy to proactively migrate long-running workloads for cluster load balancing. Specifically, we first build a discrete-time dynamic model for long-term optimization of container scheduling. To solve such an optimization problem, Tetris then employs two main components: (1) a container resource predictor, which leverages time-series analysis approaches to accurately predict the container resource consumption; (2) an MPC-based container scheduler that jointly optimizes the cluster load balancing and container migration cost over a certain sliding time window. We implement and open source a prototype of Tetris based on K8s. Extensive prototype experiments and trace-driven simulations demonstrate that Tetris can improve the cluster load balancing degree by up to 77.8% without incurring any SLO violations, compared to the state-of-the-art container scheduling strategies.
Published in: IEEE Transactions on Services Computing ( Volume: 17, Issue: 5, Sept.-Oct. 2024)