Blind GB-PANDAS: A Blind Throughput-Optimal Load Balancing Algorithm for Affinity Scheduling

Dynamic affinity load balancing of multi-type tasks on multi-skilled servers, where the service rate of each task type can differ from server to server, has been an open problem for over three decades. The goal is to assign tasks to servers in real time so that the system is stable, meaning that queue lengths do not diverge to infinity in steady state (throughput optimality), and so that the mean task completion time is minimized (delay optimality). The fluid model planning, Max-Weight, and c-$\mu$-rule algorithms have theoretical guarantees on optimality in some respects for the affinity problem, but they rely on a complicated queueing structure and require the task arrival rates, the service rates of tasks on servers, or both. In many cases discussed in the introduction, both task arrival rates and the service rates of different task types on different servers are unknown. In this work, the Blind GB-PANDAS algorithm is proposed, which is completely blind to task arrival rates and service rates. Blind GB-PANDAS uses an exploration-exploitation approach for load balancing. We prove that Blind GB-PANDAS is throughput optimal under arbitrary and unknown distributions for the service times of different task types on different servers and unknown task arrival rates. Blind GB-PANDAS aims to route an incoming task to the server with the minimum weighted workload, but since the service rates are unknown, such routing is not guaranteed, which makes the throughput optimality analysis more involved than in the case where service rates are known. Our extensive experimental results reveal that Blind GB-PANDAS significantly outperforms existing methods in terms of mean task completion time at high loads.


INTRODUCTION
Affinity load balancing refers to the allocation of computing tasks to computing nodes in an efficient way that minimizes a cost function, for example the mean task completion time [1]. Because different task types can have different processing (service) rates on different computing nodes (servers), a dilemma between throughput and delay optimality emerges, which has kept optimal affinity load balancing an open problem for more than three decades when the task arrival rates are unknown. If both the task arrival rates and the service rates of different task types on different servers are known, the fluid model planning algorithm by Harrison and Lopez [2], [3] and Bell and Williams [4], [5] is a delay optimal load balancing algorithm: it solves a linear programming problem and uses the result to assign tasks to servers. The fluid model planning algorithm needs as many queues as there are task types, so its queueing structure is fixed by the number of task types and does not capture the complexity of the system model, namely how heterogeneous the service rates of task types on different servers are. As an example given in [6] and [7], if task types have only three different processing rates on $M$ servers (referred to as three levels of data locality in the data center context), the fluid model planning algorithm requires on the order of $M^3$ queues, while Xie et al. [6] propose a delay optimal algorithm that uses $3M$ queues. As another extreme example, if the service rates of $N_T$ task types are the same on all servers, the fluid model planning algorithm still maintains $N_T$ queues, while the First-Come-First-Served (FCFS) algorithm uses a single queue and is both throughput and delay optimal.
It is true that in the last example all task types can be considered a single type, but this example illustrates the reasoning behind the queueing structure for GB-PANDAS (Generalized Balanced Priority Algorithm for Near Data Scheduling) presented in Section 2.1.
In the absence of knowledge of task arrival rates, the Max-Weight [8] and c-µ-rule [9] algorithms can stabilize the system knowing only the service rates of task types on different servers. Neither of these two algorithms is delay optimal, though. The c-µ-rule is in fact cost optimal: it assumes a convex delay cost associated with each task type and minimizes the total cost incurred by the system. Since the cost functions must be strictly convex, and thus cannot be linear, the c-µ-rule does not minimize the mean task completion time. Since these two algorithms do not use the task arrival rates and still stabilize the system, they are robust to any change in the task arrival rates as long as the rates remain inside the capacity region of the system. Both Max-Weight and the c-µ-rule share the issue of the fluid model planning algorithm of maintaining one queue per task type, which can make the queueing structure complicated, as discussed in [6]. Note that Wang et al. [10] and Xie et al. [6] study the load balancing problem for the special cases of two and three levels of data locality, respectively. In the former, delay optimality is analyzed for a special traffic scenario; in the latter, delay optimality is analyzed for a general traffic scenario; and in both cases there is no issue with the number of queues, but, as mentioned, these two algorithms only cover the special cases of two and three levels of data locality. Hence, a unified algorithm that captures the trade-off between the complexity of the queueing structure and the complexity of the system model is missing from the literature. Yekkehkhany et al. [11] implicitly mention this trade-off in data center applications, but the generalization to the affinity setup is not spelled out there; it is summarized in this work as a complementary note on the Balanced-PANDAS algorithm.
The affinity scheduling problem appears in a range of applications: data centers and modern processing networks consisting of heterogeneous servers, where data-intensive analytics frameworks like MapReduce, Hadoop, and Dryad run; supermarket models; or even patient assignment to surgeons in large, busy hospitals, among many others. The lack of dependable estimates of system parameters, including task arrival rates and especially the service rates of task types on different servers, is a major challenge in constructing an optimal load balancing algorithm for such networks [12]. All the algorithms mentioned above require at least the knowledge of the service rates of task types on different servers. In the absence of prior knowledge of service rates, such algorithms can be fragile and perform poorly, resulting in a huge waste of resources. To address this issue, we propose a robust policy called Blind GB-PANDAS that is totally blind to all system parameters: it is robust to task arrival rate changes and learns the service rates of task types on different servers, so it is robust to service rate changes as well. It is natural that, due to traffic load changes in data centers, the service rate of tasks on remote servers changes over time. In such cases, Blind GB-PANDAS is capable of updating its parameter estimates and acting accordingly. Blind GB-PANDAS uses an exploration-exploitation approach to stabilize the system without any knowledge of the task arrival rates and the processing rates: in the exploration phase it takes actions that make the estimates of the system parameters more accurate, and in the exploitation phase it uses the estimated parameters to perform load balancing that is optimal with respect to those estimates. Note that only the processing rates of task types on different servers are estimated; the task arrival rates are not.
The reason is that task arrival rates change frequently, so there is little point in estimating them, whereas the service rates do not change rapidly. Since Blind GB-PANDAS uses an estimate of the processing rates, an incoming task is not necessarily routed to the server with the minimum weighted workload in the exploitation phase, which complicates the throughput optimality proof of Blind GB-PANDAS via the Lyapunov-based method. The throughput optimality result is proved under arbitrary and unknown service time distributions that need not have the memoryless property, but must have bounded means.
As discussed in Section 2.1, the queueing structure used for Blind GB-PANDAS reflects the trade-off between the heterogeneity of the processing rates in the underlying system model and the complexity of the Blind GB-PANDAS queueing structure. Blind GB-PANDAS can also use a one-queue-per-server structure, where the workload of each server is tracked instead of the queue lengths, but for an easier exposition of the algorithm we use multiple symbolic sub-queues for each server. The Blind GB-PANDAS algorithm is compared to the FCFS, Max-Weight, and c-µ-rule algorithms in terms of average task completion time through simulations, where the same exploration-exploitation approach as in Blind GB-PANDAS is used for Max-Weight and the c-µ-rule. Our extensive simulations show that Blind GB-PANDAS outperforms the other algorithms at high loads by a considerable margin.
The rest of the paper is structured as follows. Section 2 describes the system model, GB-PANDAS, and the queueing structure of GB-PANDAS, in addition to deriving the capacity region of the system. Section 3 presents the Blind GB-PANDAS algorithm and the queueing dynamics under it. Section 4 starts with some preliminary results and lemmas and concludes with the throughput optimality proof for Blind GB-PANDAS. Section 5 evaluates the performance of Blind GB-PANDAS versus the Max-Weight, c-µ-rule, and FCFS algorithms in terms of mean task completion time. Section 6 discusses related work, and finally Section 7 concludes the paper with a discussion of opportunities for future work.

SYSTEM MODEL
Consider M unit-rate multi-skilled servers and $N_T$ task types, as depicted in Figure 1. The sets of servers and task types are denoted by $\mathcal{M} = \{1, 2, \cdots, M\}$ and $\mathcal{L} = \{1, 2, \cdots, N_T\}$, respectively. Each task can be processed by any of the M servers, but with possibly different rates. Service is assumed to be non-preemptive, and service times are discrete valued with unknown distributions. Non-preemptive service means that the central load balancing algorithm cannot interrupt an in-service task, i.e. no other task is scheduled on a server until the server completely processes the task currently receiving service. The analysis can be extended to continuous service times by approximating continuous distributions with discrete ones. In this discrete-time model, time is indexed by $t \in \mathbb{N}$. In the following, the service time distributions and the task arrival process are discussed, both of which are unknown to the central scheduler. Service time distribution: The service time offered by server $m \in \mathcal{M}$ to task type $i \in \mathcal{L}$ is a discrete random variable with cumulative distribution function (CDF) $F_{i,m}$ with mean $1/\mu_{i,m}$, or correspondingly with rate $\mu_{i,m} > 0$. The service time distribution is not required to be memoryless. We further assume that the support of the CDF $F_{i,m}$ is bounded, which is a realistic assumption and reduces unnecessary complexity in the proofs, especially in Lemma 4. This bounded-support assumption can be relaxed with more care in the analysis. Note that the completion time of a task is its waiting time until it is scheduled on a server plus its service time on that server. The waiting time depends on the servers' status, the queue lengths (more specifically, the other tasks that are in the system or may arrive later), and the load balancing algorithm in use, while the service time has the distribution above.
Task arrival: The number of incoming tasks of type $i \in \mathcal{L}$ at the beginning of time slot t is a random variable over the non-negative integers denoted by $A_i(t)$; the arrivals are identically distributed across time slots and independent of each other. Denote the arrival rate of task type i by $\lambda_i$, i.e. $E[A_i(t)] = \lambda_i$. In the stability proof of Blind GB-PANDAS we need $\lambda_i$ to be strictly positive, so without loss of generality we exclude task types with zero arrival rate from $\mathcal{L}$. Furthermore, we assume that the number of incoming tasks of each type is bounded by a constant $C_A$ and is zero with positive probability, i.e. $P(A_i(t) < C_A) = 1$ and $P(A_i(t) = 0) > 0$ for any $i \in \mathcal{L}$. The set of arrival rates of all task types is denoted by the vector $\lambda = (\lambda_i : i \in \mathcal{L})$.
The affinity scheduling problem refers to load balancing for the system described above. The fluid model planning algorithm [3], Max-Weight [8], and the c-µ-rule [9] are the baseline algorithms for affinity scheduling. All these algorithms, as well as GB-PANDAS, use the service rates rather than the full CDFs. Hence, the system model can be summarized as an $N_T \times M$ service rate matrix $B_\mu$, whose element $(i, m)$ is the processing rate $\mu_{i,m}$ of task type i on server m. If both the set of arrival rates $\lambda = (\lambda_i : i \in \mathcal{L})$ and the service rate matrix $B_\mu$ are known, the fluid model planning algorithm [3] achieves delay optimal load balancing by solving a linear program. However, if the arrival rates of task types are not known, delay optimal load balancing becomes an open problem, unsolved for more than three decades. Max-Weight [8] and the c-µ-rule [9] can be used for different objectives when the arrival rates are unknown, but neither is delay optimal. In this work, we assume that we lack knowledge of not only the arrival rates $\lambda$ but also the service rate matrix $B_\mu$. We take an exploration-exploitation approach to make our estimate of the underlying model, the service rate matrix, more accurate and to keep the system stable.

Queueing Structure for GB-PANDAS
Every algorithm has its own queueing structure. For example, there is a single central queue for the First-Come-First-Served (FCFS) algorithm, but there are $N_T$ queues when using fluid model planning, Max-Weight, or the c-µ-rule. In the following, we present the queueing structure used for GB-PANDAS, which captures the trade-off between the complexity of the system model and the complexity of the queueing structure. By the complexity of the system model we mean the heterogeneity of the service rate matrix; e.g. if all the elements of this matrix are the same, the system is less complex than when each element differs from the others. The heterogeneity of the system from the perspective of server m is captured in the $m$-th column of the service rate matrix. Suppose the $m$-th column has $N_m$ distinct values, where $N_m$ can be any number from 1 to $N_T$. Clearly, all task types with the same service (processing) rate on server m look identical from the perspective of that server. Denote the $N_m$ distinct values of the $m$-th column of $B_\mu$ by $\{\alpha_m^1, \alpha_m^2, \cdots, \alpha_m^{N_m}\}$ and, without loss of generality, assume $\alpha_m^1 > \alpha_m^2 > \cdots > \alpha_m^{N_m}$. We call all task types with processing rate $\alpha_m^n$ on the $m$-th server the n-local tasks to that server, and denote them by $L_m^n = \{i \in \mathcal{L} : \mu_{i,m} = \alpha_m^n\}$. For ease of notation, we use $\mu_{i,m}$ and $\alpha_m^n$ interchangeably throughout the paper; they capture the same quantity under different interpretations. Note that the n-local tasks to server m could be called (n, m)-local tasks to emphasize the pair (n, m); the n-local tasks to server m are not necessarily the same as the n-local tasks to another server $m'$. We allocate $N_m$ queues to server m, where the $n$-th queue of server m holds all tasks that are routed to this server and are n-local to it.
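The grouping of task types into locality levels described above can be sketched in a few lines of Python; the function name and the data layout (`rate_matrix[i][m]` holding $\mu_{i,m}$) are illustrative assumptions, not the paper's notation:

```python
def locality_levels(rate_matrix):
    """For each server m, group task types by their distinct service rates on m,
    ordered from fastest to slowest (alpha_m^1 > alpha_m^2 > ... > alpha_m^{N_m}).

    rate_matrix[i][m] is the rate mu_{i,m} of task type i on server m.
    Returns, per server, a list of (rate, set of task types) pairs,
    one pair per locality level n = 1, ..., N_m.
    """
    num_types = len(rate_matrix)
    num_servers = len(rate_matrix[0])
    levels = []
    for m in range(num_servers):
        column = [rate_matrix[i][m] for i in range(num_types)]
        distinct = sorted(set(column), reverse=True)  # alpha_m^1 > ... > alpha_m^{N_m}
        levels.append([(a, {i for i in range(num_types) if column[i] == a})
                       for a in distinct])
    return levels

# Example: 3 task types, 2 servers. Server 0 sees two distinct rates
# (N_0 = 2 sub-queues); server 1 sees all types at the same rate (N_1 = 1).
rates = [[1.0, 0.5],
         [0.2, 0.5],
         [1.0, 0.5]]
```

This mirrors how $N_m$ and the sets $L_m^n$ are read off column by column from $B_\mu$.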
As depicted in Figure 2, different servers can have different numbers of queues, since the heterogeneity of the system model can differ from the perspectives of different servers. We use queue and sub-queue interchangeably to refer to the $n$-th queue (sub-queue) of the $m$-th server. The $N_m$ sub-queues of the $m$-th server are denoted by $Q_m^1, Q_m^2, \cdots, Q_m^{N_m}$, and the lengths of these sub-queues, defined as the numbers of tasks in them at time slot t, are denoted by $Q_m^1(t), Q_m^2(t), \cdots, Q_m^{N_m}(t)$. Remark. The GB-PANDAS algorithm can also be implemented with a single queue per server. Since the workloads of the servers are used for load balancing, instead of tracking all sub-queue lengths, the central load balancing algorithm can track the workloads of the servers as defined in (2). At the arrival of an n-local task to server m, the server's workload increases by $1/\alpha_m^n$ (and the workload decreases at the departure of a task), so all tasks routed to server m can be kept in one queue associated with that server. However, GB-PANDAS is easier to grasp with the queueing structure depicted in Figure 2 than with a one-queue-per-server structure, so we stick to the model with multiple sub-queues. Sub-queues also provide the leverage to use priority scheduling.
In the next subsection, the GB-PANDAS algorithm is presented for the case where the service rate matrix $B_\mu$ is known. Balanced-PANDAS for a data center with three levels of data locality was proposed in [6]; here we propose the Generalized Balanced-PANDAS algorithm from another perspective, which is of independent interest.

GB-PANDAS Algorithm with Known Service Rate Matrix $B_\mu$
Before getting into the GB-PANDAS algorithm, we need to define the workload of server m. Definition 1. The average time needed for server m to process all tasks queued in its $N_m$ sub-queues at time slot t is defined as the workload of the server: $W_m(t) = \sum_{n=1}^{N_m} \frac{Q_m^n(t)}{\alpha_m^n}$. A load balancing algorithm consists of two parts, routing and scheduling. The routing policy determines the queue in which an incoming task is stored until it is assigned to a server for service. When a server becomes idle, the scheduling policy determines the next task to receive service on the idle server. The routing and scheduling policies of the GB-PANDAS algorithm are as follows. GB-PANDAS Routing Policy: An incoming task of type i is routed to the corresponding sub-queue of the server with the minimum weighted workload, where ties are broken arbitrarily. The server $m^*$ with the minimum weighted workload is defined as $m^* \in \arg\min_{m \in \mathcal{M}} \frac{W_m(t)}{\mu_{i,m}}$. The corresponding sub-queue of server $m^*$ for a task of type i is n if $\mu_{i,m^*} = \alpha_{m^*}^n$. GB-PANDAS Scheduling Policy: An idle server m at time slot t is scheduled to process a task of sub-queue $Q_m^1$ if there is any. If $Q_m^1(t) = 0$, a task of sub-queue $Q_m^2$ is scheduled on the server, and so on. It is a common assumption that servers do not have the option of processing tasks queued in front of other servers, so a server remains idle if all its sub-queues are empty. Note that the routing policy performs a kind of weighted water-filling of workloads, so the probability that a server becomes idle goes to zero as the load increases in the heavy traffic regime. Recall that the tasks in sub-queue $Q_m^1$ are the fastest task types for server m, the tasks in sub-queue $Q_m^2$ the second fastest, and so on. Under this priority scheduling, the faster tasks in the $N_m$ sub-queues of server m are processed first. Given the minimum weighted workload routing policy, priority scheduling is optimal in that it minimizes the mean completion time of the tasks in the $N_m$ sub-queues of server m.
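As a concrete illustration, the workload of Definition 1 and the GB-PANDAS routing and priority-scheduling rules can be sketched as below. This is a minimal sketch under the assumption that the weighted workload of server m for a type-i task takes the form $W_m(t)/\mu_{i,m}$; the function and variable names are our own:

```python
def workload(queues, alphas):
    """W_m(t) = sum_n Q_m^n(t) / alpha_m^n (Definition 1): average time
    server m needs to drain all of its sub-queues."""
    return sum(q / a for q, a in zip(queues, alphas))

def route(task_type, sub_queues, alphas, mu):
    """Route a type-i task to the server minimizing the weighted workload
    W_m(t) / mu_{i,m} (assumed form); ties broken by lowest index.
    Returns (m*, n*) with alpha_{m*}^{n*} = mu_{i,m*}."""
    weighted = [workload(sub_queues[m], alphas[m]) / mu[task_type][m]
                for m in range(len(sub_queues))]
    m_star = min(range(len(weighted)), key=weighted.__getitem__)
    n_star = alphas[m_star].index(mu[task_type][m_star])
    return m_star, n_star

def schedule(sub_queues_m):
    """Priority scheduling: serve the fastest non-empty sub-queue
    (Q_m^1 first, then Q_m^2, ...); None means the server stays idle."""
    for n, q in enumerate(sub_queues_m):
        if q > 0:
            return n
    return None
```

For instance, with `alphas = [[1.0, 0.25], [0.5]]` and sub-queue lengths `[[2, 1], [3]]`, server 0 has workload $2/1.0 + 1/0.25 = 6$ time slots and server 1 has $3/0.5 = 6$.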
In the following, the Max-Weight and c-µ-rule algorithms are discussed for the sake of completeness.

Max-Weight and c-µ-Rule Algorithms with Known Service Rate Matrix $B_\mu$
The queueing structure used for Max-Weight and the c-µ-rule is as depicted in Figure 1, with a separate queue for each task type. Denote the $N_T$ queues by $Q_1, Q_2, \cdots, Q_{N_T}$, and their queue lengths at time slot t by $Q_1(t), Q_2(t), \cdots, Q_{N_T}(t)$. A downside of the Max-Weight and c-µ-rule algorithms is that they require many queues, one per task type, while GB-PANDAS can exploit the symmetry of real-world applications to decrease the number of queues needed. For example, for servers with rack structure, where Hadoop is used for MapReduce data placement with three locality levels, Max-Weight and the c-µ-rule require on the order of $M^3$ queues, whereas GB-PANDAS uses only $3M$. The routing policy of both algorithms is simply to route an incoming task to the queue of its type. The scheduling policies of these two algorithms are as follows. Max-Weight Scheduling Policy: An idle server m at time slot t is scheduled to process a task of type j from $Q_j$, if there is any, such that $j \in \arg\max_{i \in \mathcal{L}} Q_i(t)\,\mu_{i,m}$. C-µ-rule Scheduling Policy: Suppose queue $Q_i$ incurs a cost of $C_i(Q_i(t))$ at time slot t, where $C_i(\cdot)$ is increasing and strictly convex. The c-µ-rule maximizes the rate of decrease of the instantaneous cost at each time slot by the following scheduling policy. An idle server m at time slot t is scheduled to process a task of type j from $Q_j$, if there is any, such that $j \in \arg\max_{i \in \mathcal{L}} C_i'(Q_i(t))\,\mu_{i,m}$, where $C_i'(\cdot)$ is the first derivative of the cost function. The c-µ-rule minimizes both instantaneous and cumulative queueing costs, asymptotically. The mean task completion time corresponds to linear cost functions for all task types, so the c-µ-rule cannot minimize the mean task completion time and, as a result, is neither delay nor heavy-traffic optimal.
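The two scheduling rules can be sketched as follows; this is a hedged illustration (the function names and the `c_prime` list of derivative callbacks are our own), not the paper's implementation:

```python
def max_weight_pick(queue_lengths, mu, m):
    """Max-Weight: idle server m serves the type j maximizing Q_j(t) * mu_{j,m}
    among non-empty queues; returns None if all queues are empty."""
    best, best_val = None, 0.0
    for j, q in enumerate(queue_lengths):
        val = q * mu[j][m]
        if q > 0 and val > best_val:
            best, best_val = j, val
    return best

def c_mu_pick(queue_lengths, mu, m, c_prime):
    """c-mu-rule: idle server m serves the type j maximizing
    C_j'(Q_j(t)) * mu_{j,m}, where C_j is strictly convex increasing and
    c_prime[j] is its derivative."""
    nonempty = [j for j, q in enumerate(queue_lengths) if q > 0]
    if not nonempty:
        return None
    return max(nonempty, key=lambda j: c_prime[j](queue_lengths[j]) * mu[j][m])
```

With quadratic costs $C_j(q) = q^2$ (so $C_j'(q) = 2q$), the c-µ-rule reduces to a queue-length-weighted variant of Max-Weight, which makes the structural similarity of the two rules explicit.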

Capacity Region of Affinity Scheduling Setup
Denote the arrival rate of tasks of type i that are processed by server m by $\lambda_{i,m}$. Hence $\lambda_i = \sum_{m=1}^{M} \lambda_{i,m}$, i.e. the arrival rate of task type i is decomposed into $(\lambda_{i,m} : m \in \mathcal{M})$. Using the fluid model planning algorithm, the affinity queueing system can be stabilized under a given arrival rate vector $\lambda$ as long as the necessary condition holds that the total 1-local, 2-local, ..., and $N_m$-local load on server m is strictly less than one for every server m: $\sum_{n=1}^{N_m} \sum_{i \in L_m^n} \frac{\lambda_{i,m}}{\alpha_m^n} < 1, \ \forall m \in \mathcal{M}$. (3) Hence, the capacity region of the affinity problem is the set of all arrival rate vectors $\lambda$ that admit a decomposition $(\lambda_{i,m} \ge 0 : i \in \mathcal{L}, m \in \mathcal{M})$ satisfying (3): $\Lambda = \big\{\lambda : \exists (\lambda_{i,m} \ge 0) \text{ with } \lambda_i = \sum_{m=1}^{M} \lambda_{i,m}, \ \forall i \in \mathcal{L}, \text{ satisfying (3)}\big\}$. (4) A linear program can be solved to check membership in the capacity region $\Lambda$. The GB-PANDAS algorithm stabilizes the system for any arrival rate vector inside the capacity region when the service rate matrix is known. It is proved in Section 4 that the Blind GB-PANDAS algorithm is throughput optimal without knowledge of the service rate matrix $B_\mu$.
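Condition (3) can be checked numerically for a candidate decomposition. Note that summing over locality levels n and then over $i \in L_m^n$ is the same as summing $\lambda_{i,m}/\mu_{i,m}$ over all task types i, which the sketch below exploits (the names are illustrative):

```python
def stabilizable(decomposition, mu, margin=0.0):
    """Check condition (3) for a candidate load decomposition:
    for every server m, the load sum_i lambda_{i,m} / mu_{i,m} must be
    strictly below 1 (below 1 - margin for an interior point, cf. Lemma 2).

    decomposition[i][m] = lambda_{i,m}; mu[i][m] = mu_{i,m}.
    """
    num_types, num_servers = len(mu), len(mu[0])
    for m in range(num_servers):
        load = sum(decomposition[i][m] / mu[i][m] for i in range(num_types))
        if load >= 1.0 - margin:
            return False
    return True
```

A full membership test for $\Lambda$ would search over decompositions with a linear program; this helper only verifies one given decomposition.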

THE BLIND GB-PANDAS ALGORITHM
The GB-PANDAS and Max-Weight algorithms need to know the precise value of the service rate matrix, but this requirement is not realistic in real applications. Furthermore, the service rate matrix can change over time, which misleads a load balancing algorithm that uses a fixed given service rate matrix. In the blind version of GB-PANDAS, the service rate matrix is initialized randomly and updated while the system is running. More specifically, an exploration-exploitation framework is combined with GB-PANDAS. In the exploration phase, routing and scheduling are performed so as to make the estimates of the system parameters more precise, and in the exploitation phase, routing and scheduling are performed based on the current estimate of the service rate matrix so as to stabilize the system. Here we assume that the ordering of the locality levels of each server is known, which can be inferred from prior knowledge of the structure of the system. This assumption is not necessary for the proof, but it makes the intuition behind Blind GB-PANDAS clearer. As mentioned before, a single queue per server can be used with Blind GB-PANDAS, in which case there is no need to know the ordering of the service rates offered by the servers to different task types.
We first present the update rule used for the service rate matrix before getting into the routing and scheduling policies of the Blind GB-PANDAS algorithm. The estimated service rate matrix at time slot t is denoted by $\hat{B}_\mu(t)$. Note that $\hat{\alpha}_m^1(t), \hat{\alpha}_m^2(t), \cdots, \hat{\alpha}_m^{N_m}(t), \forall m \in \mathcal{M}$, the estimates of $\alpha_m^1, \alpha_m^2, \cdots, \alpha_m^{N_m}, \forall m \in \mathcal{M}$, at time slot t, are nothing but the distinct values of the elements of the estimated service rate matrix. More specifically, it is the $\hat{\alpha}_m^n, \forall m \in \mathcal{M}, \forall n \in \{1, 2, \cdots, N_m\}$, that are updated and then mapped to their corresponding elements of the service rate matrix to form $\hat{B}_\mu(t)$ in (5), as described in Section 2.1. Consider a random initialization $\hat{\alpha}_m^n(0) > 0, \forall m \in \mathcal{M}, \forall n \in \{1, 2, \cdots, N_m\}$, at time slot $t = 0$. If server m has processed $k - 1$ tasks that are n-local to it by time slot $t_1$, the estimate of $\alpha_m^n$ at that time slot is $\hat{\alpha}_m^n(t_1)$, and a new observation of the service time of an n-local task on server m is made at time slot $t_2 > t_1$ as $T_m^n(t_2)$, then $\hat{\alpha}_m^n(t) = \hat{\alpha}_m^n(t_1)$ for $t_1 \le t < t_2$, and the update of this parameter at time slot $t_2$ is $\hat{\alpha}_m^n(t_2) = \frac{k-1}{k}\,\hat{\alpha}_m^n(t_1) + \frac{1}{k}\cdot\frac{1}{T_m^n(t_2)}$. (6)
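The update in (6) amounts to keeping a running average of the observed rates $1/T_m^n$. A minimal sketch, assuming this running-average form (the class and its names are our own):

```python
class RateEstimator:
    """Running estimate of a service rate alpha_m^n from observed service
    times, following the running-average form assumed for update (6):
    after k observations, the estimate is the empirical mean of the
    observed rates 1/T."""

    def __init__(self, init_rate=1.0):
        self.estimate = init_rate  # random positive initialization alpha_hat(0) > 0
        self.count = 0             # completed n-local tasks observed so far

    def observe(self, service_time):
        """Fold in a new service-time observation T_m^n(t); returns the
        updated rate estimate. The first observation replaces the
        arbitrary initialization entirely (k = 1 gives weight 0 to it)."""
        self.count += 1
        k = self.count
        self.estimate = (k - 1) / k * self.estimate + (1.0 / service_time) / k
        return self.estimate
```

One such estimator would be kept per (server, locality level) pair, i.e. per $\hat{\alpha}_m^n$.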
Note that $\alpha_m^n$ is the service rate, not the mean service time, which is why $1/T_m^n(t_2)$ is used in the update above. In the following, the routing and scheduling policies of Blind GB-PANDAS are presented, where the exploration rate is chosen such that infinitely many n-local tasks are scheduled for service on server m for every $m \in \mathcal{M}$ and every $n \in \{1, 2, \cdots, N_m\}$, so that by the strong law of large numbers the parameter estimates in (6) converge to their true values almost surely. Blind GB-PANDAS Routing Policy: The estimated workload of server m at time slot t is defined based on the parameter estimates in (6) as $\hat{W}_m(t) = \sum_{n=1}^{N_m} \frac{Q_m^n(t)}{\hat{\alpha}_m^n(t)}$. (7) The routing of an incoming task follows the exploitation policy below with probability $p_e = \max(1 - p(t), 0)$, and the exploration policy otherwise, where $p(t) \to 0$ as $t \to \infty$ and $\sum_{t=0}^{\infty} p(t) = \infty$; e.g. the exploitation probability can be chosen as $p_e = 1 - \frac{1}{t^c}$ for $0 < c \le 1$. • Exploitation phase: An incoming task of type i is routed to the corresponding sub-queue of the server with the minimum estimated weighted workload, where ties are broken arbitrarily. The server $m^*$ with the minimum estimated weighted workload for a task of type i is defined as $m^* \in \arg\min_{m \in \mathcal{M}} \frac{\hat{W}_m(t)}{\hat{\mu}_{i,m}(t)}$. The corresponding sub-queue of server $m^*$ for a task of type i is n if $\hat{\mu}_{i,m^*}(t) = \hat{\alpha}_{m^*}^n(t)$. • Exploration phase: An incoming task of type i is routed to the corresponding sub-queue of a server chosen uniformly at random from $\{1, 2, \cdots, M\}$. Blind GB-PANDAS Scheduling Policy: The scheduling of an idle server follows the exploitation policy below with probability $p_e$, and the exploration policy otherwise. • Exploitation phase: An idle server m is scheduled to process a task from its highest-priority non-empty sub-queue, i.e. a task of $Q_m^1$ if there is any, otherwise a task of $Q_m^2$, and so on.
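The decaying exploration schedule and the resulting routing decision can be sketched as follows, assuming $p(t) = 1/t^c$ (one of the choices suggested above); all names are illustrative:

```python
import random

def exploration_probability(t, c=0.5):
    """p(t) = 1 / t**c with 0 < c <= 1: p(t) -> 0 while sum_t p(t) = infinity,
    so every (server, locality level) pair keeps being explored infinitely
    often (Borel-Cantelli) and the estimates in (6) converge almost surely."""
    return 1.0 if t == 0 else min(1.0, t ** (-c))

def choose_server(task_type, t, est_workloads, est_mu, num_servers, rng=random):
    """Blind GB-PANDAS routing sketch: explore a uniformly random server with
    probability p(t); otherwise exploit the estimates by picking the server
    with the minimum estimated weighted workload W_hat_m(t) / mu_hat_{i,m}(t)."""
    if rng.random() < exploration_probability(t):
        return rng.randrange(num_servers)  # exploration phase
    return min(range(num_servers),         # exploitation phase
               key=lambda m: est_workloads[m] / est_mu[task_type][m])
```

The `rng` parameter is injected only so the decision can be made deterministic in tests; in use, the module-level `random` suffices.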
• Exploration phase: An idle server is scheduled to process a task from one of its non-empty sub-queues chosen uniformly at random, and stays idle if all its sub-queues are empty.
Since the arrival rate of every task type is strictly positive, infinitely many tasks of each type arrive at the system, and given that the exploration probability in both the routing and scheduling policies decays with $\sum_{t=0}^{\infty} p(t) = \infty$, the Borel-Cantelli lemma (zero-one law) implies that n-local tasks are scheduled on server m infinitely often for every locality level and every server, so $\hat{B}_\mu(t) \to B_\mu$ as $t \to \infty$ under the updates in (6).

Queue Dynamics under the Blind GB-PANDAS Algorithm
Recall from Section 2.4 that $\lambda_{i,m}$ denotes the arrival rate of tasks of type i that are processed by server m. Denote the number of such incoming tasks at the beginning of time slot t by $A_{i,m}(t)$, i.e. $E[A_{i,m}(t)] = \lambda_{i,m}$. Then, denoting the number of incoming n-local tasks to server m that are routed to $Q_m^n$ at the beginning of time slot t by $A_m^n(t)$, we have $A_m^n(t) = \sum_{i \in L_m^n} A_{i,m}(t)$. (8) Denote the number of n-local tasks that finish processing on server m at the end of time slot t by $S_m^n(t)$; such tasks depart from $Q_m^n$. Then the queue dynamics for any $m \in \mathcal{M}$ and $n \in \{1, 2, \cdots, N_m\}$ are $Q_m^n(t+1) = Q_m^n(t) + A_m^n(t) - S_m^n(t) + U_m^n(t)$, (9) where $U_m^n(t)$ is the unused service offered by server m at time slot t.
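For a single sub-queue, the dynamics in (9) can be sketched as below, under the standard assumption that the unused service $U_m^n(t)$ is exactly the part of the offered service that finds the sub-queue empty:

```python
def step_queue(q, arrivals, services):
    """One slot of the sub-queue dynamics in (9):
    Q(t+1) = Q(t) + A(t) - S(t) + U(t),
    with unused service U(t) = max(S(t) - Q(t) - A(t), 0), i.e. the part of
    the offered service that finds no task to serve, so that
    Q(t+1) = max(Q(t) + A(t) - S(t), 0) stays non-negative."""
    unused = max(services - q - arrivals, 0)
    return q + arrivals - services + unused
```

For example, a sub-queue of length 3 with 2 arrivals and 1 completed task ends the slot at length 4, while an empty sub-queue can never go negative no matter how much service is offered.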
Denote the queue length vector at time slot t by $Q(t) = \big(Q_1^1(t), Q_1^2(t), \cdots, Q_1^{N_1}(t), \cdots, Q_M^{N_M}(t)\big)$. For the following reasons, $\{Q(t), t \ge 0\}$ does not form a Markov chain, i.e. $Q(t+1)$ is not conditionally independent of $Q(t-1)$ given $Q(t)$. First, the service time distributions do not necessarily have the memoryless property; second, nothing can be said about the locality of an in-service task on a server from the queue lengths alone. Hence, in order to use the Foster-Lyapunov theorem to prove the positive recurrence of a Markov chain, we need to track two further measurements of the servers' status in addition to the queue lengths, as follows.
• The number of time slots, as of the beginning of time slot t, that server m has already spent on its current in-service task is denoted by $\Psi_m(t)$. This parameter is reset to zero when server m finishes processing a task. Let $\Psi(t) = \big(\Psi_1(t), \Psi_2(t), \cdots, \Psi_M(t)\big)$. • The working status of the servers is denoted by the vector $f(t) = \big(f_1(t), f_2(t), \cdots, f_M(t)\big)$. Let $\eta_m(t)$ denote the scheduling decision for server m at time slot t. As long as server m is not idle, we have $f_m(t) = \eta_m(t)$. When the in-service task on server m is completely processed and departs the system at time slot $t - 1$, we have $f_m(t^-) = -1$, and the central load balancing algorithm schedules another task on the idle server m by determining $\eta_m(t)$. Defining $\eta(t) = \big(\eta_1(t), \eta_2(t), \cdots, \eta_M(t)\big)$, we have the following lemma.

THROUGHPUT OPTIMALITY OF THE BLIND GB-PANDAS ALGORITHM
Before getting into the main throughput optimality theorem for the Blind GB-PANDAS algorithm, we present some preliminaries: the workload dynamics of the servers, the ideal workload of the servers, several lemmas, and an extended version of the Foster-Lyapunov theorem.

Preliminary Materials and Lemmas
The workload of server m evolves as $W_m(t+1) \overset{(a)}{=} \sum_{n=1}^{N_m} \frac{Q_m^n(t+1)}{\alpha_m^n} \overset{(b)}{=} W_m(t) + A_m(t) - S_m(t) + U_m(t)$, where (a) holds by the queue dynamics in (9) and (b) follows from defining the pseudo task arrival, service, and unused service of server m as $A_m(t) = \sum_{n=1}^{N_m} \frac{A_m^n(t)}{\alpha_m^n}, \ S_m(t) = \sum_{n=1}^{N_m} \frac{S_m^n(t)}{\alpha_m^n}, \ U_m(t) = \sum_{n=1}^{N_m} \frac{U_m^n(t)}{\alpha_m^n}$. (10) Defining the pseudo task arrival, service, and unused service processes as $A(t) = \big(A_1(t), A_2(t), \cdots, A_M(t)\big)$, $S(t) = \big(S_1(t), S_2(t), \cdots, S_M(t)\big)$, and $U(t) = \big(U_1(t), U_2(t), \cdots, U_M(t)\big)$, respectively, the vector of server workloads $W = (W_1, W_2, \cdots, W_M)$ evolves as $W(t+1) = W(t) + A(t) - S(t) + U(t)$. (11) Lemma 2. For any arrival rate vector inside the capacity region, $\lambda \in \Lambda$, there exist a load decomposition $\{\lambda_{i,m}\}$ and $\delta > 0$ such that $\sum_{n=1}^{N_m} \sum_{i \in L_m^n} \frac{\lambda_{i,m}}{\alpha_m^n} \le 1 - \delta, \ \forall m \in \mathcal{M}$. (12) The fluid model planning algorithm uses the load decomposition $\{\lambda_{i,m}\}$ for load balancing across the M servers. In other words, this load decomposition is one way of assigning tasks to servers so that the system is stable.
Lemma 2 is used in the proofs of Lemmas 4 and 5.

Definition 2.
The ideal workload of server m corresponding to the load decomposition $\{\lambda_{i,m}\}$ of Lemma 2 is defined as $w_m = \sum_{n=1}^{N_m} \sum_{i \in L_m^n} \frac{\lambda_{i,m}}{\alpha_m^n}$. (13) Let $w = (w_1, w_2, \cdots, w_M)$. The vector of the servers' ideal workloads is used as an intermediary term in Lemmas 4 and 5, which are later used in the throughput optimality proof of the Blind GB-PANDAS algorithm.

Lemma 4.
Under the exploration-exploitation routing policy of the Blind GB-PANDAS algorithm, for any arrival rate vector inside the capacity region, $\lambda \in \Lambda$, the corresponding ideal workload vector $w$ defined in (13), and any arbitrarily small $\theta_0 > 0$, there exists $T_0$ such that the following holds for any $t_0 \ge 0$ and $T > T_0$, where the constants $\theta_0, c_0 > 0$ are independent of $Z(t_0)$. We emphasize that $\theta_0$ can be made arbitrarily small, as is evident from the proof, which is used in the throughput optimality proof of Blind GB-PANDAS, Theorem 1. Throughout this paper, $\|\cdot\|$ and $\|\cdot\|_1$ denote the $L_2$-norm and the $L_1$-norm, respectively.

Lemma 5.
Under the exploration-exploitation scheduling policy of the Blind GB-PANDAS algorithm, for any arrival rate vector inside the capacity region, $\lambda \in \Lambda$, and the corresponding ideal workload vector $w$ in (13), there exists $T_1 > 0$ such that the following holds for any $T > T_1$, where the constants $\theta_1, c_1 > 0$ are independent of $Z(t_0)$.

Lemma 6. Under the exploration-exploitation load balancing
of the Blind GB-PANDAS algorithm, for any arrival rate vector inside the capacity region, λ ∈ Λ, and for any θ_2 > 0, there exists T_2 > 0 such that for any T > T_2 and for any t_0 ≥ 0, we have:

We restate Theorem 3.3.8 in [13], an extended version of the Foster-Lyapunov theorem: Consider an irreducible Markov chain {Z(t)}, where t ∈ ℕ, with a state space S. If there exist a function V : S → ℝ_+, a positive integer T ≥ 1, and a finite set P ⊆ S satisfying the following condition for some θ > 0 and C < ∞:
$$\mathbb{E}\left[V(Z(t_0+T)) - V(Z(t_0)) \,\middle|\, Z(t_0) = z\right] \leq \begin{cases} -\theta, & z \notin \mathcal{P}, \\ C, & z \in \mathcal{P}, \end{cases}$$
then the irreducible Markov chain {Z(t)} is positive recurrent.
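A toy empirical illustration of the T-step negative-drift condition, on a single-server queue standing in for the chain Z(t); all names and parameters below are illustrative assumptions, not part of the paper's model.

```python
import random

def step(q, p_arrival=0.3, p_service=0.5):
    """One slot: a task arrives w.p. p_arrival; one departs w.p. p_service."""
    if random.random() < p_arrival:
        q += 1
    if q > 0 and random.random() < p_service:
        q -= 1
    return q

def drift_estimate(q0, T=50, runs=2000, V=lambda q: q):
    """Monte-Carlo estimate of E[V(Z(T)) - V(Z(0)) | Z(0) = q0]."""
    total = 0.0
    for _ in range(runs):
        q = q0
        for _ in range(T):
            q = step(q)
        total += V(q) - V(q0)
    return total / runs

# For states far from any finite set P (large q0), the T-step drift is
# strictly negative (about (p_arrival - p_service) * T here), which is
# exactly the negative-drift condition the theorem requires.
```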

Throughput Optimality Theorem and Proof
Theorem 1. The Blind GB-PANDAS algorithm is throughput-optimal for a system with the affinity setup discussed in Section 2, with general service time distributions, without any prior knowledge of the service rate matrix µ and the arrival rate vector λ.

Proof:
We use the Foster-Lyapunov theorem to prove that the irreducible and aperiodic Markov chain Z(t) = (Q(t), η(t), Ψ(t)), t ≥ 0 (Lemma 1), is positive recurrent under the Blind GB-PANDAS algorithm as long as the arrival rate vector is inside the capacity region, λ ∈ Λ. Positive recurrence means that as time goes to infinity, the distribution of Z(t) converges to its stationary distribution, which implies that the system is stable and Blind GB-PANDAS is throughput-optimal. To this end, we choose the following Lyapunov function V : S → ℝ_+ and use Lemmas 2, 3, 4, 5, and 6 to derive its drift: By choosing θ_0 in Lemma 4 to be less than θ_1 in Lemma 5, θ_0 < θ_1, we get T_0 from Lemma 4, which is used in the drift of the Lyapunov function in Lemma 7.
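The drift derivation can be sketched as follows, assuming the quadratic Lyapunov function V(Z(t)) = ‖W(t)‖² (our assumption; the inner-product terms bounded in Lemmas 4 and 5 arise naturally from this choice):

```latex
\begin{align*}
V(Z(t+1)) - V(Z(t))
  &= \|W(t) + A(t) - S(t) + U(t)\|^2 - \|W(t)\|^2 \\
  &= 2\langle W(t),\, A(t) - S(t)\rangle
     + 2\langle W(t),\, U(t)\rangle
     + \|A(t) - S(t) + U(t)\|^2 .
\end{align*}
```

Lemma 3 shows the unused-service cross term vanishes, Lemmas 4 and 5 bound the arrival and service inner products, and the boundedness of the arrival and service processes bounds the quadratic term, yielding negative drift for large workloads.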

Lemma 7.
For any t_0 ≤ T_0 < T, specifically the T_0 from Lemma 4 dictated by choosing θ_0 < θ_1, the drift of the Lyapunov function in (16) satisfies the following, where T_0 is used in the first summation after the inequality: By choosing $T > \max\{T_0, T_1, T_2, \frac{\theta_2 + c_2}{2(\theta_1 - \theta_0)}\}$, where θ_2 > 0 is the one in Lemma 6, and bounding the terms on the right-hand side of the Lyapunov function drift (17) in Lemma 7 by the corresponding inequalities in Lemmas 4, 5, and 6, we have: where c = 2c_0 + 2c_1 + c_3 + MT. Let $\mathcal{P} = \{Z = (Q, \eta, \Psi) \in S : \|Q\|_1 + \|\Psi\|_1 \leq \tilde{c} + c/\theta_2\}$ for any positive constant $\tilde{c} > 0$; P is a finite subset of the state space. Together with the Lyapunov function in (16), all the conditions of the Foster-Lyapunov theorem are satisfied, which completes the throughput optimality proof for the Blind GB-PANDAS algorithm.
Note that the priority scheduling in the exploitation phase of the Blind GB-PANDAS algorithm is not used in the throughput optimality proof, since the expected workload of a server decreases at the same rate no matter which locality level receives service from the server. As long as an idle server continuously serves one of the tasks in its sub-queues, the system is stable. Given the routing policy, priority scheduling is used in the exploitation phase to minimize the mean task completion time in that phase.

SIMULATION RESULTS
In this section, the simulated performance of the Blind GB-PANDAS algorithm is compared with the FCFS, Max-Weight, and c-µ-rule algorithms. FCFS does not use system parameters for load balancing, but Max-Weight and c-µ-rule use the same exploration-exploitation approach as Blind GB-PANDAS. Convex cost functions $C_i(Q_i) = Q_i^{1.01}$ for i ∈ {1, 2, 3} are used for the c-µ-rule algorithm. Since the objective is to minimize the mean task completion time, the cost functions are chosen to be close to linear for small values of Q_i. Three types of tasks and a computing cluster of three servers are considered, with the processing rates depicted in Figure 3, which are not known to the load balancing algorithms. Note that this affinity structure does not have the rack structure mentioned in [6]: judging from the processing rates of task type 2 on the three servers, servers 1 and 2 are in the same rack as server 3, but judging from the processing rates of task type 3, the second server is in the same rack as the third server while the first server is not. Hence, this affinity setup is more complicated than one with a rack structure. Refer to LINK-to-codes for more details on the simulation setup.
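As a concrete illustration of the exploration-exploitation routing compared in these simulations, the sketch below routes an arriving task either to a uniformly random server (exploration) or to the server with the minimum estimated weighted workload (exploitation). The decay schedule 1/t^δ and all names are our assumptions, not the exact choices of the simulation code.

```python
import random

def route(t, est_workloads, delta=0.3):
    """Return the index of the server an arriving task joins at slot t >= 1.

    est_workloads[m] is the estimated weighted workload of server m.
    """
    p_explore = min(1.0, 1.0 / max(t, 1) ** delta)   # assumed decay schedule
    if random.random() < p_explore:
        return random.randrange(len(est_workloads))  # exploration: uniform
    return min(range(len(est_workloads)),
               key=lambda m: est_workloads[m])       # exploitation: min workload
```

For large t the exploration probability is negligible, so almost every task joins the server with the smallest estimated weighted workload.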
A related direction of work on scheduling for data centers with multi-level data locality, which is a direct application of affinity scheduling, is to efficiently replicate data on servers in the MapReduce framework to increase availability. Increased availability translates to increased service rates in the context of this article, which enlarges the capacity region and reduces the mean task completion time. For more information on data replication algorithms, refer to the Google File System [23], the Hadoop Distributed File System [24], Scarlett [25], and DARE [26]. Data replication algorithms are complementary and orthogonal to the throughput and delay optimality studied in this article.
Fairness is an issue in most scheduling problems and conflicts with delay optimality. A delay optimal load balancing algorithm can nonetheless cooperate with fair scheduling strategies by compromising on delay optimality to partly achieve fairness. Refer to Delay Scheduling [16], Quincy [15], and the references therein for more details on fair scheduling.

CONCLUSION AND FUTURE WORK
The Blind GB-PANDAS algorithm is proposed for the affinity load balancing problem where no knowledge of the task arrival rates or the service rate matrix is available. An exploration-exploitation approach is proposed for load balancing. The system is proven to be stable under Blind GB-PANDAS and is shown empirically, through simulations, to have better delay performance than the Max-Weight, c-µ-rule, and FCFS algorithms. Investigating the subspace of the capacity region in which GB-PANDAS is delay optimal is a promising direction for future work. Note that both the GB-PANDAS and Max-Weight algorithms have high routing and scheduling computation complexity, which can be alleviated using power-of-d-choices [27] or join-idle-queue [28] algorithms; these are interesting directions to study as well.

APPENDIX
.1 Proof of Lemma 1
Consider $Z(0) = \left(\mathbf{0}_{(\sum_{m \in \mathcal{M}} N_m) \times 1}, \sum_{m \in \mathcal{M}} N_m, \mathbf{0}_{M \times 1}\right)$ as the initial state of the Markov chain Z(t).
Irreducible: Since F_{i,m} is increasing for any task-server pair, we can find an integer τ > 0 such that F_{i,m}(τ) > 0 for any 1 ≤ i ≤ N_m and m ∈ M. Furthermore, the probability of zero task arrivals is positive in each time slot. Hence, for any state Z = (Q, η, Ψ), there is a positive probability that each task receives service within τ time slots and no new task arrives at the system in $\tau \sum_{m \in \mathcal{M}} \sum_{n=1}^{N_m} Q_m^n$ time slots. Accordingly, the initial state of the Markov chain is reachable from any state of the system. Conversely, using the same approach, it is easy to see that any state of the system is reachable from the initial state Z(0). Consequently, the Markov chain Z(t) is irreducible.

Aperiodic: Since the Markov chain Z(t) is irreducible, in order to show that it is also aperiodic, it suffices to show that there is a positive probability of transition from some state to itself. Due to the fact that there is a positive probability that zero tasks arrive at the system, the Markov chain stays at the initial state with a positive probability. Hence, the Markov chain Z(t) is aperiodic.

.2 Proof of Lemma 2
The capacity region Λ is an open set, so for any λ ∈ Λ there exists δ > 0 such that (1 + δ)λ = λ′ ∈ Λ. On that account, (4) follows from $\sum_{i \in \mathcal{L}} \frac{(1+\delta)\lambda_{i,m}}{\mu_{i,m}} = \sum_{i \in \mathcal{L}} \frac{\lambda'_{i,m}}{\mu_{i,m}} < 1, \ \forall m \in \mathcal{M}$, which completes the proof.

.3 Proof of Lemma 3
If the unused service of server m is zero, U_m(t) = 0, the corresponding term for server m in the above summation is zero. Alternatively, the unused service of server m is positive if and only if all N_m sub-queues of the server are empty, which again makes the corresponding term for server m in the summation equal to zero.
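A minimal numeric instance of Lemma 2, with made-up rates: exhibit a load decomposition and read off a feasible δ > 0.

```python
# Hypothetical service-rate matrix mu[i][m] (type i on server m) and a
# load decomposition lam_im of the arrival rates lam = (0.7, 1.3);
# all numbers are illustrative.
mu = [[2.0, 1.0],
      [1.0, 4.0]]
lam_im = [[0.5, 0.2],
          [0.3, 1.0]]              # rows sum to the arrival rates (0.7, 1.3)

# Load of server m under the decomposition: sum_i lam_im / mu_im
loads = [sum(lam_im[i][m] / mu[i][m] for i in range(2)) for m in range(2)]
delta = 1.0 / max(loads) - 1.0     # the largest delta this decomposition allows
# loads = [0.55, 0.45] and delta > 0, so every server satisfies
# loads[m] <= 1/(1+delta) < 1, as (4) requires.
```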

.4 Proof of Lemma 4
By our choice of the exploration rate for Blind GB-PANDAS, any task that is n-local to server m is scheduled on this server infinitely many times in the interval [t_0, ∞). The processing time of an n-local task on server m has an unknown CDF F_{i,m} with a finite mean. By the strong law of large numbers, using the update rule (6) for the elements of the service rate matrix, we have: ∀t > T_0, ∀m ∈ M, ∀n ∈ {1, 2, ···, N_m}. (18) By the above choice of ε, for t > T_0, the different locality levels are distinct from each other with probability at least 1 − δ.
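The strong-law argument above relies on an empirical-mean estimate of each service rate. A minimal sketch of such an estimator follows; the incremental-mean update is an illustrative stand-in for the paper's update rule (6), and the class and attribute names are ours.

```python
class RateEstimator:
    """Blind estimate of one service rate mu_{i,m} from observed completions."""

    def __init__(self, mu_init=1.0):
        self.count = 0
        self.mean_service_time = 1.0 / mu_init  # initialization before any data

    def observe(self, service_time):
        """Fold one completed task's service time into the running mean."""
        self.count += 1
        self.mean_service_time += (service_time - self.mean_service_time) / self.count

    @property
    def mu_hat(self):
        """Estimated rate: inverse of the empirical mean service time."""
        return 1.0 / self.mean_service_time
```

By the strong law of large numbers, as the count grows, mean_service_time converges to the true mean and mu_hat to the true rate, which is what (18) quantifies.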
For an incoming task of type i ∈ L at time slot t, define the exact (but unknown) and estimated minimum weighted workloads as where W_m(t) and Ŵ_m(t) are defined in (2) and (7), respectively. W_m(t) and Ŵ_m(t) are related to each other as follows: hence, using (18), for any t > T_0 and any m ∈ M, we have: Using (19) and (20), we have the following for any m ∈ M: Using the conditional independence of W(t) and A(t) from Z(t_0) given Z(t), we have the following for T_0 ≤ t ≤ t_0 + T − 1: where (a) and (b) follow directly from the definitions of the pseudo task arrival process in (10) and A_m^n(t) in (8), respectively. The order of summations is changed in (c). Using (18) and (20), (d) is true, and (e) follows from the routing policy of Blind GB-PANDAS, where an incoming task at the beginning of time slot t is routed to the corresponding sub-queue of the server with the minimum estimated weighted workload with probability p_e = max(1 − p(t), 0) and is routed to the corresponding sub-queue of a server chosen uniformly at random with probability 1 − p_e. Also note that the number of arriving tasks in a time slot is assumed to be upper bounded by C_A. The last step, (f), is true by using (18), upper bounding the exploration probability 1 − p_e by 1/t^δ given that δ > 0 is a constant, and doing simple calculations, where c_0 and c_0′ are constants independent of Z(t_0). Note that the minimum value of the estimated service rates, min_{i,m}{μ̂_{i,m}(t)}, is lower bounded for any t ≥ t_0 by a constant, namely the minimum of the initialization of the service rates and the inverse of the maximum support of the CDFs F_{i,m}. On the other hand, where (a) is true by the definition of the ideal workload of a server in (13), the order of summations is changed in (b), and (c) follows from (21).
Putting (22) and (23) together, we have the following for T > T_0 > T_0′: where (a) follows by upper bounding 1 − δ, 1/((1−ε)²(1+ε)²), and 1/t^δ by 1, 16/9, and 1/T_0^δ, respectively, and (b) is true by the facts that the number of arriving tasks is bounded by C_A, the number of task types is N_T, and the maximum arrival rate of the task types, max_i{λ_i}, is bounded by the number of servers. Inequality (c) follows by simple calculations and the fact that min_{i,m}{μ̂_{i,m}(t)} is lower bounded by a constant for any t ≥ t_0, as discussed in step (f) of (22). Remark. θ_0 can be made arbitrarily small by choosing ε and δ small and T_0 large enough.

.5 Proof of Lemma 5
The proof is similar to the proof of Lemma 4 in [11] and is presented for the sake of completeness. By the assumption on boundedness of the arrival and service processes, there exists a constant C_A such that for any t_0, t, and T with t_0 ≤ t ≤ t_0 + T, we have the following for all m ∈ M: (24) On the other hand, by Lemma 2, the ideal workload of a server defined in (13) can be bounded as follows: Hence, where (a) is true by bringing the inner summation over m out of the expectation and using the boundedness property of the workload in equation (24), and (b) is true by equation (25). Before investigating the second term on the left-hand side of equation (14), E[Σ_{t=t_0}^{t_0+T−1} ⟨W(t), S(t)⟩ | Z(t_0)], we state the following lemma, which is used in lower bounding this term.

Lemma 8. For any server m ∈ M and any t_0, we have the following:

We then have the following: where (a) follows by bringing the inner summation over m out of the expectation and using the boundedness property of the workload in equation (24).
Using Lemma 8, for any $0 < \epsilon_0 < \frac{\delta}{1+\delta}$, there exists T_1 such that for any T ≥ T_1, we have the following for any server m ∈ M:

.6 Proof of Lemma 6
This proof is the same as the proof of Lemma 5 in [11] and is presented for the sake of completeness. For any server m ∈ M, let t*_m be the first time slot at or after t_0 at which the server is available; that is, where it is obvious that Ψ_m(t*_m) = 0. Note that for any t ≥ t_0, we have Ψ_m(t + 1) ≤ Ψ_m(t) + 1, which follows from the definition of Ψ_m(t) as the number of time slots that server m has spent on the task currently in service. From time slot t to t + 1, if a new task enters service, then Ψ_m(t + 1) = 0, which satisfies Ψ_m(t + 1) ≤ Ψ_m(t) + 1; otherwise, if server m continues serving the same task, then Ψ_m(t + 1) = Ψ_m(t) + 1. Thus, if t*_m ≤ t_0 + T, it follows that Ψ_m(t_0 + T) ≤ t_0 + T − t*_m ≤ T. In the following, we use t*_m to bound E[Ψ_m(t_0 + T) − Ψ_m(t_0) | Z(t_0)]: where (a) is true since, given t*_m ≤ t_0 + T, we found Ψ_m(t_0 + T) ≤ T, so Ψ_m(t_0 + T) − Ψ_m(t_0) ≤ T − Ψ_m(t_0); and given t*_m > t_0 + T, server m serves the same task over the whole interval [t_0, t_0 + T], which results in Ψ_m(t_0 + T) − Ψ_m(t_0) = T.
Since the service time of an n-local task on server m has CDF F_m^n with a finite mean, we have
$$\lim_{T \to \infty} \mathbb{P}\left(t^*_m \leq t_0 + T \,\middle|\, Z(t_0)\right) = 1, \quad \forall m \in \mathcal{M},$$
so for any θ_2 ∈ (0, 1) there exists T_2 such that for any T ≥ T_2, we have $\mathbb{P}\left(t^*_m \leq t_0 + T \,\middle|\, Z(t_0)\right) \geq \theta_2$ for any m ∈ M.