Task-Oriented Data Compression for Multi-Agent Communications Over Bit-Budgeted Channels

Various applications for inter-machine communications are on the rise. Whether it is for autonomous driving vehicles or the internet of everything, machines are more connected than ever to improve their performance in fulfilling a given task. While in traditional communications the goal has often been to reconstruct the underlying message, under the emerging task-oriented paradigm, the goal of communication is to enable the receiving end to make more informed decisions or more precise estimates/computations. Motivated by these recent developments, in this paper, we perform an indirect design of the communications in a multi-agent system (MAS) in which agents cooperate to maximize the averaged sum of discounted one-stage rewards of a collaborative task. Due to the bit-budgeted communications between the agents, each agent should efficiently represent its local observation and communicate an abstracted version of the observations to improve the collaborative task performance. We first show that this problem can be approximated as a form of data-quantization problem which we call task-oriented data compression (TODC). We then introduce the state-aggregation for information compression algorithm (SAIC) to solve the formulated TODC problem. It is shown that SAIC is able to achieve near-optimal performance in terms of the achieved sum of discounted rewards. The proposed algorithm is applied to a geometric consensus problem and its performance is compared with several benchmarks. Numerical experiments confirm the promise of this indirect design approach for task-oriented multi-agent communications.


I. INTRODUCTION
The design of traditional communication systems has often been carried out according to task-agnostic principles. Information and coding theories drive the major analytical and design techniques, where the former sets upper bounds on system capacity, and the latter focuses on techniques for approaching those bounds with infinitesimal error probabilities. Accordingly, digital communications have made astonishing strides in terms of performance, enabling robust information transmission even under adverse channel conditions. However, in the era of cyber-physical systems, the effectiveness of communications is not solely dictated by the traditional performance indicators (e.g., bit rate, latency, jitter, fairness), but most importantly by the efficient completion of the task at hand, e.g., remotely controlling a robot, automating a production line, or collaboratively sensing/communicating through a drone swarm. Machine-to-machine communications occur because the received signals help the receiving end make more informed decisions or more precise estimates/computations. In this context, the reliability of the communications is not essential beyond serving the specific needs of the control/estimation/computation task that the receiving machine is trying to accomplish. This calls for a fresh look at the design of communication systems that have been engineered with reliability as one of their ultimate goals. The emerging literature on semantic communications as well as goal/task-oriented communications is taking the first steps towards the above-mentioned goal, i.e., incorporating the semantics as well as the goal/usefulness of the message exchange into the design of communication systems [1]-[3]. By jointly analyzing the features of the collaborative task and the constraints on the underlying communication infrastructure, the communication strategies can be adapted or tailored so that they are specifically effective for the task.
This paper attempts to take the first steps towards an indirect task-effective data compression theory. While the data compression algorithm proposed in this paper is designed in an indirect¹ fashion, i.e., not for a specific task, we demonstrate its applicability on a specific task: a geometric consensus problem under finite observability [6]. As attested by [7], "a unified framework to support various tasks is still missing in multi-user semantic communications." Unlike earlier task-oriented quantization techniques that tailor a quantization scheme to a certain application [8], this work proposes an indirect design for its task-oriented quantization scheme, SAIC. The indirect design is carried out in such a fashion that it never benefits from explicit domain knowledge about any specific task, e.g., geometric consensus problems. Accordingly, the indirect design allows the algorithms to be applied beyond geometric consensus problems, to a much wider range of tasks. The framework can be applied wherever a major communication bottleneck is in place between multiple cooperative decision makers. This bottleneck can occur for a multitude of reasons, e.g., (i) the limited energy lifetime of the communicating agents, as in UAV/LEO satellite communications, which forces agents to communicate at low rates.

¹By using the word indirect here we are not referring to the concept of indirect access to the source of information [4] - that usage of the word falls in the nomenclature of source coding and information theory. Rather, we are referring to the concept introduced by the control theory nomenclature, in which an indirect design is generic enough to be used for unmodelled system dynamics and not a certain dynamic [5]. Thus, schemes such as SAIC, which enjoy an indirect design, can be applied to all (or at least a much wider range of) tasks.
In contrast to indirect schemes, "the direct schemes aim at guaranteeing or improving the performance of the cyber-physical system at a particular task by designing a task-tailored communication strategy" [1].
Due to the bit-budgeted communications between the agents, it is necessary for agents to compactly represent their observations in communication messages. As we ultimately measure the performance of the MAS in terms of the expected return, the loss of information caused by the compact representation of the agents' observations needs to be managed in such a way that it minimally affects the obtained return [15], [16]. As such, in this form of compression scheme, which we call task-oriented data compression, the goal of abstraction differs from conventional compression schemes whose ultimate aim is to reduce the distortion between the original signal and the decoded/reconstructed signal [17]; see [8], [18], where a similar task-based notion is introduced, and Table I, which compares those works with ours.

B. Literature Review
As we study the joint communication and control design of a MAS, the topic of this paper falls under the general category of multi-agent communications [19]. In contrast to many other cooperative multi-agent systems [20], the full state and action information is not available here to each agent. Accordingly, agents are required to communicate to overcome these barriers [19]. Earlier works addressed the coordination of multiple agents through a noise-free communication channel, where the agents follow an engineered communication strategy [21]-[25]. Later, the impact of stochastic delays in multi-agent communication on multi-agent coordination was considered in [24], while [25] considers event-triggered local communications. Deep reinforcement learning with communication of the gradients of the agents' objective function was proposed in [26] to learn the communication among multiple agents. In contrast to the above-mentioned works, the presence of noise in the inter-agent communication channel was first studied in [27], where exact reinforcement learning was used to design the inter-agent communications. Later, the authors of [16] proposed a deep reinforcement learning approach to address a similar problem. Papers [8], [16], [18], [27], [28] and [29] have contributed to the rapidly emerging literature on task-oriented communications [1]. Noteworthy are also the novel metrics introduced in [30] to measure positive signaling and positive listening amongst agents which learn how to communicate [26], [27], [29].
The current work can also be seen as designing a state aggregation algorithm. In this paper, state aggregation enables each agent to compactly represent its observations in communication messages while maintaining its performance in the collaborative task. Classical state aggregation algorithms, however, have been used to reduce the complexity of dynamic programming problems over MDPs [31]-[34] as well as partially observable MDPs [35]. One similar work is [36], which studies a task-based quantization problem. In contrast to our work, the assumption there is that the parameter to be quantized can only be measured and cannot be controlled; in our problem, agents' observations stem from a generative process with memory, an MDP. Similarly, in [37], the authors have introduced a gated mechanism so that reinforcement learning-aided agents reduce the rate of their communication by removing messages which are not beneficial for the team. However, their proposed approach mostly relies on numerical experiments. In contrast, this paper relies on analytical studies to design a multi-agent communication policy which efficiently coordinates agents over a bit-budgeted channel; the benefits of our analytical approach are briefly explained in the contributions section I-C. State aggregation algorithms are often developed for single-agent scenarios and are used to reduce the complexity of MDPs. To the best of our knowledge, we are the first to design a TODC algorithm using state-aggregation schemes. In particular, we use state aggregation to design a data compression scheme that compactly represents the observation process of each agent in a multi-agent system.
Conventionally, the communication system design is disjoint from the distributed decision-making design [21]-[24], [26], [38]. The current work can also be interpreted as a demonstration of the potential of the joint design of data compression/quantization and control policies. Determining the existence of a quantizer operating at a certain bit-budget that achieves a given figure of expected return is known to be an "intriguing open problem" [15], even for single-agent scenarios. Here we establish a non-closed-form upper bound on the expected-return performance of the multi-agent system given a quantization data rate, i.e., the finite size of the discrete alphabet of the quantizer. We show, via Theorem 1, how this joint quantization and control design problem is connected to minimizing an absolute-error distortion measure. A similar interpretation of the TODC problem can also be seen in [39]. While relevant, their setup differs from ours, as they consider two distortion criteria for the rate-distortion problem.
We will show in section II-B that, in fact, the decentralized problem we target can be translated as the joint constrained design of the control policies as well as the observation function of a Dec-POMDP to maximize the expected return. While in classic Dec-POMDP problems the observation function is considered to be a fixed function [40], by a constrained design of the observation function our problem setting offers more flexibility in designing a multi-agent system. The design of the observation function helps to filter out the non-useful observation information of each agent while meeting the problem's constraint, i.e., the communication bit-budget. The mathematical framework used here is neither a classic MDP, as we have the issue of partial observability, nor a partially observable MDP (POMDP) [41], as the action vector is not jointly selected at a single entity. Our problem setting is differentiated from Dec-POMDPs by the fact that in Dec-POMDPs the partial observability is accepted as is, whereas in our setting we design the lens through which the agents acquire a partial observation/perception of the environment.
Nevertheless, a similar class of problems - often referred to as task-oriented, goal-oriented, or efficient communication approaches - has recently received significant attention from the communications community; see, e.g., the extensive surveys on similar problems in [1]-[3]. Table I positions the current work against some of the closely related recent research. To date, we are not aware of any work in the literature that provides an analytical approach to the design of task-based communications for the coordination of multiple cooperative agents.

C. Contributions
The contributions of this paper are as follows. Firstly, we develop a general cooperative multi-agent framework in which agents interact over an underlying MDP environment. Unlike existing works which assume perfect communication links [26], [29], [38], [42], we assume practical bit-budgeted communications between the agents. We formulate a multi-agent cooperative problem where agents interact over an underlying MDP and can communicate over a bit-budgeted channel. Our goal is to derive the optimal control and communication strategies that maximize the expected return. We will show in section II-B that an underlying difference between our setting and the Dec-POMDP is that here we carry out a constrained design of each agent's perception function, which is also referred to as the observation function in the Dec-POMDP literature [43]. The constraints of this design are dictated by the bit-budget of the inter-agent communication channels.
Secondly, Theorem 1, in section III, derives the interconnection between the joint control and communication/quantization problem and a generalized version of the data quantization problem: the TODC problem. In fact, the TODC problem distils all the relevant features of the control task and takes them into account in a novel, non-conventional communication design problem. This is the underlying reason behind the effectiveness of the designed communications and is one of the contributions of this work differentiating it from existing works in [8], [15], [16], [18], [26], [27], [30], [44]. Our analytical studies show how the value function - the function that estimates the expected return of the system given the current observation - can be considered a proper indirect measure of the usefulness of the data to be compressed. Thus, Theorem 1 shows how the usefulness of the (observation) data can be incorporated into the design of the TODC policy.
Thirdly, we propose a novel algorithm, SAIC, a multi-agent state-aggregation algorithm which designs indirect task-effective communication strategies by solving (an approximated version of) the TODC problem. As a result, the performance of SAIC in terms of the system's expected return is on par with the jointly optimal strategies. To the best of our knowledge, this is the first use of state-aggregation algorithms for data-compression applications (in multi-agent systems), according to which our work differs from the classic state-aggregation literature [31]-[34] as well as recent advancements in the multi-agent communication literature [26], [30].
Moreover, we extend the existing results in the single-agent state-aggregation literature [33] on the gap between the optimal control and the state-aggregated control schemes - where the former has access to the true state of the environment and the latter has access to an aggregated state of the environment, introduced to reduce the computational complexity. We quantify the same gap for a multi-agent system in Theorem 8. In our work, however, the gap is due to the bit-budget introduced on the inter-agent communication channels, whereas in the classic state-aggregation literature the gap was a consequence of constraints on the computational complexity. In addition, our theoretical results show that if our proposed method, SAIC, is applied, the expected return of the multi-agent communication system - with the bit-budget in place - can stay in close proximity to the optimal expected return obtained under jointly optimal strategies.
Last but not least, numerical experiments are carried out on a geometric consensus problem to compare the performance of SAIC with several benchmark schemes in terms of the optimality of the expected return, for a multi-agent scenario². It is shown that when communication bit-budgets are in place, SAIC holds a significant advantage over the benchmarks. In particular, we observe a very small gap between the performance of SAIC and the optimal control strategy, where only the latter runs over perfect communication channels while the former runs over bit-budgeted channels.

D. Organization
Section II describes the system model for a cooperative multi-agent task with rate-constrained inter-agent communications. Section III proposes a scheme for the joint design of communication and control policies that takes the value of information into account to perform data compression. We also provide analytical results on how distant the result of this algorithm can be from the optimal centralized solution. The numerical results and discussions are provided in section IV. Finally, section V concludes the paper.

E. Notation
For the reader's convenience, a summary of the notation followed in this paper is given in Table II. Bold font is used for matrices or scalars which are random, while their realizations are written in regular font.
II. SYSTEM MODEL

In the multi-agent system, comprised of n agents, at any time step t each agent i ∈ N makes a local observation o_i(t) ∈ Ω of the environment, while the true state of the environment is a member of S = Ω^n. The alphabets Ω and S define the observation space and state space, respectively. This particular structure of the agents' observations is referred to as collective observability in the literature [19]. Under collective observability, the individual observation of an agent provides it with only partial information about the current state of the environment; however, knowledge of the collective observations acquired by all of the agents is sufficient to recover the true state of the environment - eq. (1). The columns of the state vector are orthogonal to each other. Note that even in the case of collective observability, for agent i to be able to observe the true state of the environment at all times, it needs to have access to the observations of the other agents j ∈ N_{-i} = N - {i} through communications at all times.
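As a concrete illustration of collective observability, the minimal sketch below (all names are ours, not the paper's) shows that pooling the n local observations recovers the true state s(t) ∈ Ω^n, while any single observation carries only partial information:

```python
def assemble_state(observations):
    """Under collective observability, pooling the n local observations
    o_1(t), ..., o_n(t) recovers the true state s(t) in S = Omega^n."""
    return tuple(observations)

# hypothetical example: n = 3 agents on a line, Omega = {0, ..., 9}
local = [2, 5, 7]                 # o_1(t), o_2(t), o_3(t)
s = assemble_state(local)         # true state, one entry per agent
# agent 0 alone sees only local[0] = 2: partial information about s
print(s)  # (2, 5, 7)
```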
The true state of the environment s(t) is controlled by the joint actions m(t) = (m_1(t), ..., m_n(t)), following the state transition probability function

T(s(t+1) | s(t), m(t)) = p(s(t+1) | s(t), m(t)),   (2)

which is unknown to the agents. T(·) : Ω^{2n} × M^n → [0, 1] determines the future state of the environment s(t+1) given its current state s(t) and the joint actions m(t). We recall that each agent i's domain-level action m_i(t) can, for instance, be a movement or acceleration in a particular direction, or any other type of action depending on the domain of the cooperative task.
A deterministic reward function r(·) : Ω^n × M^n → R indicates the reward of all agents at time step t, where the arguments of the reward function are the joint observations s(t) and the domain-level joint actions m(t) of all agents. We assume that the underlying environment over which agents interact can be defined in terms of an MDP³ determined by the tuple ⟨Ω^n, M^n, r(·), γ, T(·)⟩, where Ω and M are discrete alphabets, r(·) is a function, T(·) is defined in (2), and the scalar γ ∈ [0, 1] is the discount factor. The focus of this paper is on scenarios in which the agents are unaware of the state transition probability function T(·) and of the closed form of the function r(·). However, following the reinforcement learning literature [45], we assume that a realization of the function r(s(t), m(t)) is accessible to all agents at some time steps. Since the tuple ⟨Ω^n, M^n, r(·), γ, T(·)⟩ is an MDP and the state process s(t) is jointly observable by the agents, the system model of this cooperative multi-agent setting, under perfect communications, is also referred to as a multi-agent MDP (MAMDP or MMDP) in the literature of multi-agent decision making [14], [46], [47].
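The MDP tuple above can be captured in a small container; the sketch below is purely illustrative (the field names are ours) and reflects the paper's assumption that agents can sample r(s(t), m(t)) but do not know T(·) or the closed form of r(·):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MAMDP:
    """The tuple (Omega^n, M^n, r, gamma, T); T and the closed form of r
    are unknown to the agents, who only observe sampled rewards."""
    observation_alphabet: Sequence   # Omega (discrete)
    action_alphabet: Sequence        # M (discrete)
    n_agents: int                    # n
    gamma: float                     # discount factor, in [0, 1]
    sample_reward: Callable          # a realization of r(s, m)
    sample_transition: Callable      # draws s(t+1) ~ T(. | s, m)

# hypothetical two-agent environment with trivial stand-in dynamics
env = MAMDP(observation_alphabet=range(5), action_alphabet=range(4),
            n_agents=2, gamma=0.9,
            sample_reward=lambda s, m: 0.0,
            sample_transition=lambda s, m: s)
```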
In what follows, two problems regarding the above-mentioned setup are detailed, i.e., the centralized and decentralized control problems. The main intention of this paper is to address decentralized control which also incorporates inter-agent communications for a system of multiple agents. The centralized control problem, however, is also formalized in subsection II-A, as the optimal expected return obtained for the centralized problem serves as an upper bound for the decentralized scheme. Moreover, the simpler nature and mathematical notation of the centralized problem allow the reader a smoother transition to the decentralized problem, which is of a more complex nature.

A. Centralized Control
We consider a scenario in which a central controller has instant access to the observations o_1(t), ..., o_n(t) of all agents through a free (with no cost on the objective function) and reliable communication channel. From the central controller's point of view, the environment is the same as the underlying MDP that governs the system, ⟨Ω^n, M^n, r(·), γ, T(·)⟩. The goal of the centralized controller is to maximize the expected sum of discounted rewards (3). The expectation is computed over the joint PMF of the whole system trajectory s(1), m(1), ..., s(M), m(M) from time t = 1 to t = M, where this joint probability mass function (PMF) is generated when agents follow policy π(·), eq. (4), for their action selections at all times and the initial state s(1) ∈ S is randomly drawn from the initial distribution s(1) ∼ α_s. For the sake of a more compact notation for the system trajectory, hereafter we represent the realization of the system trajectory at time t by tr(t), which corresponds to the tuple (o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)), and the realization of the whole system trajectory by {tr(t)}_{t=1}^{t=M}. Accordingly, the problem boils down to a single-agent problem, where the policy π can be expressed as a CMF and p_π(s(t+1) | s(t)) is the probability of transitioning from s(t) to s(t+1) when the joint action policy π(·) is executed by the central controller. Similarly, p_π({tr(t)}_{t=1}^{t=M}) is the joint PMF of tr(1), tr(2), ..., tr(M) when the joint action policy π(·) is followed by the central controller.
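For intuition, the objective in (3) averages the realized return of sampled trajectories; one realization of the discounted sum can be computed as below (a minimal sketch, where `rewards` is a hypothetical list of one-stage rewards r(s(t), m(t)) along a trajectory):

```python
def discounted_return(rewards, gamma):
    """Return of one sampled trajectory: sum_t gamma^(t-1) * r_t,
    matching the discounted-sum objective in eq. (3)."""
    g = 0.0
    for t, r in enumerate(rewards):   # t = 0 corresponds to time step 1
        g += (gamma ** t) * r
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```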
On one hand, problem (3) can be solved using single-agent Q-learning [45], and the solution π*(·) obtained by Q-learning is guaranteed to be the optimal control policy, given some non-restricting conditions [48]. On the other hand, the use cases of the centralized approach are limited to applications in which there is a permanent communication link with an unlimited bit-budget between the agents and the controller. These conditions are not met in many remote applications, where there is no communication infrastructure to connect the agents to the central controller.
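A minimal sketch of the centralized approach described above, using tabular epsilon-greedy Q-learning over joint states and joint actions (the environment interface `env_step` and all hyper-parameters are hypothetical; the paper's problem (3) is more general than this toy loop):

```python
import random
from collections import defaultdict

def centralized_q_learning(env_step, states, joint_actions, gamma=0.9,
                           alpha=0.1, eps=0.1, episodes=500, horizon=20):
    """Single-agent Q-learning on the joint problem: the central
    controller sees the full state s(t) and picks the joint action m(t).
    env_step(s, m) -> (s_next, reward) is a hypothetical simulator."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)            # s(1) ~ alpha_s (uniform here)
        for _ in range(horizon):
            if random.random() < eps:        # epsilon-greedy exploration
                m = random.choice(joint_actions)
            else:
                m = max(joint_actions, key=lambda a: Q[(s, a)])
            s_next, r = env_step(s, m)
            best_next = max(Q[(s_next, a)] for a in joint_actions)
            # standard temporal-difference update toward r + gamma * max Q
            Q[(s, m)] += alpha * (r + gamma * best_next - Q[(s, m)])
            s = s_next
    return Q
```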
Given sufficient training time, and channels with a sufficient rate of communication between the agents and the central controller, the centralized algorithm provides a performance upper bound for maximizing the objective function (3). Perfect communication between the central controller and distributed agents, however, may not exist due to the resource limitations of the communication network. Thus, the aim of this paper is to introduce decentralized approaches which run over practical bit-budgeted communication channels, yet show comparable performance levels. In the distributed scenario, the agents do not communicate with a central controller; rather, the bit-budgeted communications are performed for inter-agent message exchange. The centralized problem can be represented by an MDP and solved efficiently by a single-agent reinforcement learning algorithm. As explained in section I-C, the decentralized problem is a more complicated/general form of Dec-POMDP, and a Dec-POMDP is already much more complex to solve than an MDP [43]; for further insights on the significance and applications of the decentralized problem see, e.g., [1].

B. Problem Statement
Here we consider a scenario in which the same objective function, explained in eq. (3), needs to be maximized by the multi-agent system in a decentralized fashion, Fig. 1. Namely, agents with partial observability can only select their own actions. To prevail over the limitations imposed by local observability, agents are allowed to have direct (explicit) communications, and not indirect (implicit) communications [44], [49]. However, the communication is done through a bit-budgeted but reliable channel. The bit-budget of the channel is R bits per time step. Equivalently, each agent i at every time step t produces and transmits a single communication symbol, i.e., the size of the codebook C is the same for all agents and is at most 2^R. The communication message c_i(t) produced by agent i is broadcast and received by every agent j ∈ N_{-i}. It should be noted that the design of the channel coding is beyond the scope of this paper; the main focus is on the compression of the agents' observations. In particular, we consider R to be time-invariant, so that

|C| ≤ 2^R.   (5)

The above-mentioned information constraint, which will be in place throughout this paper, together with the observation structure assumed in eq. (1), are among the aspects that distinguish our work from many of the related works in the literature of multi-agent communications [16], [27]. Now let the function g(t') denote the system's return. Note that g(t') is a random variable and a function of t' as well as the trajectory {tr(t)}_{t=t'}^{t=M}; due to the lack of space, here we drop part of the arguments of this function. In contrast to the centralized problem, the goal of the decentralized problem is to jointly design the communication/quantization policies π_i^c(·) as well as the control policies π_i^m(·) for each agent i ∈ N to maximize the average return of the system. In the control policy, c_{-i}(t) ∈ C^{n-1} is a vector that includes all communication messages c_j(t), ∀j ∈ N_{-i}.
The communication policy π_i^c : Ω × C^{n-1} → C of each agent i is a deterministic data quantization (many-to-one) function, which has a discrete domain Ω × C^{n-1}, making the quantizer a discrete quantizer. The joint control policy π^m is a tuple of n elements with its i-th element being π_i^m(·). Similarly, the joint communication policy π^c is another tuple with its i-th element being π_i^c(·). According to the above definitions, the decentralized joint control and communication design problem is formalized in (10), where the expectation is taken over p_{π^m, π^c}({tr(t)}_{t=1}^{t=M}), which is the joint PMF of tr(1), tr(2), ..., tr(M) when each agent i ∈ N follows the action policy π_i^m(·) and the communication policy π_i^c(·), and the initial state s(1) ∈ S is randomly drawn from the initial distribution s(1) ∼ α_s. Given the communication policies π_i^c(·), ∀i ∈ N, we now define the perception function h_i(·) : S → C^{n-1} × Ω of agent i, which is the lens through which agent i perceives the state s(t) of the environment.
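To make the deterministic quantizer concrete, the sketch below builds a many-to-one communication policy from a partition of the observation alphabet (for brevity we drop the received-message argument of π_i^c; `cells` and `make_quantizer` are our own illustrative names):

```python
def make_quantizer(cells):
    """Deterministic many-to-one map pi_i^c: each observation in Omega is
    sent to the index of its cell; the codebook size |C| = len(cells)
    must respect the bit-budget, len(cells) <= 2**R."""
    lookup = {o: idx for idx, cell in enumerate(cells) for o in cell}
    def pi_c(o):
        return lookup[o]            # transmitted symbol c_i(t)
    return pi_c

# R = 1 bit per time step -> at most 2**1 = 2 codewords
cells = [{0, 1, 2}, {3, 4, 5}]      # a partition of Omega = {0, ..., 5}
pi_c = make_quantizer(cells)
print(pi_c(1), pi_c(4))  # 0 1
```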
h_i(s(t)) = (o_i(t), π_1^c(o_1(t)), ..., π_{i-1}^c(o_{i-1}(t)), π_{i+1}^c(o_{i+1}(t)), ..., π_n^c(o_n(t)))

Agent i's perception of the environment is characterized by the communication policy π_j^c(·) of each agent j ∈ N_{-i}. Accordingly, agent i uses its sensory signal o_i(t) together with the received communication signals c_{-i}(t) to acquire its perception of the environment. While the perception function defined here plays a role very similar to the observation function in Dec-POMDPs [40], the main difference is that here we design the communication policies such that they directly affect the agents' perception of the environment; in contrast, in the case of Dec-POMDPs, the observation function is given. The communication policies π_j^c(·), ∀j ∈ N_{-i}, partially define the perception function of agent i.
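A sketch of the perception function h_i, assuming for simplicity that each π_j^c takes only the local observation as input (all names are illustrative, not the paper's):

```python
def perception(i, observations, comm_policies):
    """h_i: agent i perceives its own observation o_i(t) together with
    the quantized messages c_{-i}(t) produced by all other agents."""
    c_minus_i = tuple(comm_policies[j](observations[j])
                      for j in range(len(observations)) if j != i)
    return (observations[i],) + c_minus_i

# hypothetical 1-bit quantizers: each agent reports the parity of o_j(t)
policies = [lambda o: o % 2, lambda o: o % 2, lambda o: o % 2]
print(perception(0, [3, 4, 7], policies))  # (3, 0, 1)
```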
To make the problem more concrete, further to (8) and (9), here we assume the presence of instantaneous and synchronous communications between agents, contrasting with delayed [27], [50] and sequential communication models.

Figure 3. Here we show how we approached solving the joint control and communication problem for a distributed multi-agent system in a sequence of steps. According to the legend, one can see which policies are known and unknown at the end of each step. a. This step solves problem (3) for a centralized multi-agent system, where the objective is to design one centralized control strategy. b. This step solves problem (13) for a distributed multi-agent system, where the objective is to design the communication policies of all agents. c. This step solves the problem for a distributed multi-agent system, where the objective is to design the control policies of all agents.
Here we assume that the communication resources are split evenly amongst the agents, by considering the bit-budget of all communication channels to be equal to R. As such, each agent i ∈ N encodes its observation o_i(t) to c_i(t) using a codebook C of the same size |C|, with the constraint (5) in place.

III. STATE AGGREGATION FOR INFORMATION COMPRESSION (SAIC) IN MULTI-AGENT COORDINATION TASKS
The main result of this section, provided by Theorem 1, is to show that finding the quantization policy in the joint control and quantization problem (10) can be approximated by a TODC problem. The goal of this problem is to quantize the observations of all agents according to how valuable these observations are within any specific task, where the value of observations is measured by the value function V*(·) - eq. (25). Lemma 2 approximates the TODC problem as a k-median clustering of the observations according to their values, while Lemma 4 computes the value function of each agent's observation. The concluding remarks of this section study the convergence and the optimality of the decentralized control policies. Fig. 3 demonstrates the chronological order in which the joint communication and quantization problem is solved by SAIC. Our proposed scheme, SAIC, breaks the joint communication and quantization problem down into smaller problems that are feasible to solve. In this section, the subsections are organized according to the logical order in which these smaller problems are encountered: (A) In section III-A, we address the communication design of multi-agent communications by transforming the primary joint control and quantization problem (10) into a novel problem (12) called TODC - step "b" of Fig. 3. (B) Since solving the TODC problem relies on the knowledge of the value function V*(·), it is necessary to obtain V*(·) prior to solving the TODC problem. In section III-B, the optimal value function V*(·) is obtained via a centralized training phase - step "a" of Fig. 3. Given the knowledge of the value function V*(·), the TODC problem incorporates the features of the specific control task into the communication design problem. Accordingly, we can separately solve the communication problem with very little compromise on the optimality of the system's expected return.
(C) As the final step, in section III-C, the decentralized training phase is carried out to distributively design the control policy of each agent, given the communication/quantization policy obtained by solving the TODC problem. Decentralized training is shown in step "c" of Fig. 3. Since we follow standard methods to carry out the centralized training (step "a" of Fig. 3), we will mainly focus on deriving and solving the TODC problem and on providing guarantees on the performance of the MAS in the decentralized training phase (steps "b" and "c" of Fig. 3, respectively). Fig. 4 illustrates how SAIC performs data compression while maintaining the performance of the multi-agent system in its task.
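The three-step pipeline above can be summarised as follows (a high-level sketch with trivially stubbed stand-ins for each sub-problem; in the paper the steps are solved by centralized Q-learning, k-median clustering, and decentralized training, respectively):

```python
def saic_pipeline(observations, k):
    """Chronological order of SAIC (steps a, b, c of Fig. 3), with
    one-line stubs standing in for each sub-problem."""
    # step a: centralized training yields the optimal value function V*
    v_star = lambda o: float(o)                 # stub for V*(o)

    # step b: solve TODC -- group observations of similar value into
    # k contiguous cells (a crude stand-in for k-median clustering)
    ranked = sorted(observations, key=v_star)
    size = -(-len(ranked) // k)                 # ceiling division
    cells = [set(ranked[i:i + size]) for i in range(0, len(ranked), size)]

    # step c: decentralized training of control policies given the
    # fixed quantizer induced by `cells` (omitted in this stub)
    return cells

cells = saic_pipeline([0, 1, 2, 9, 10, 11], k=2)
print(cells)  # two value-contiguous cells
```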

A. Task-Oriented Data Compression Problem
The main result of this section is provided by Theorem 1, which departs from the joint communication/quantization and control problem and arrives at the task-oriented data compression problem (12).
Theorem 1. The design of the communication policy in problem (10) can be approximated as a generalized data quantization problem in which the measure of distortion is the absolute difference of the value functions V*(s(t)) and V*(h_i(s(1))), with the source of information s(t) ∈ Ω^n being a Markovian stochastic process. The function V*(h_i(s(1))) measures the optimal value of the perceived state h_i(s(1)) from agent i's perspective.
Proof. Appendix A.
In Appendix A-C, we provide more details on how to obtain the value V*(h_i(s(1))) of the perceived state from agent i's point of view via Lemma 12. This value function allows us to indirectly quantify the usefulness of agent i's observation. With this interpretation in mind, in the TODC problem (12), unlike conventional quantization problems, we are not minimizing the absolute difference between the original signal s(1) and its quantized version h_i(s(1)). Instead, we are minimizing the distance between how useful/valuable the original signal s(1) is and how useful its quantized version h_i(s(1)) is for the task at hand. This is in line with what many believe to be the mission of goal-oriented/task-oriented communications. Let us recall that the value function here is an indirect measure of usefulness, as it can be obtained for any task that can be expressed via Markov decision processes, making it a measure of usefulness applicable to a plethora of scenarios [1], [11].
The significance of the result obtained by Theorem 1 is multi-fold: (i) Multi-dimensional observations are transformed into the one-dimensional output space of the value function, reducing the complexity of the clustering algorithm. (ii) It can be shown that the observation points become linearly separable when clustered according to problem (12). (iii) It is widely accepted that the mission of goal-oriented communications is to incorporate the usefulness/value of the data for the task when designing task-effective communications. The result of Theorem 1, in which the design of the quantizer relies on the value/usefulness of observations, resonates well with this purpose of goal-oriented communications. (iv) It is known that the value of observations starts to grow as we get closer to the ultimate target of the task at hand. With this interpretation of "target" in mind, the finding of Theorem 1 is in line with adaptive quantization schemes, which stretch the quantization intervals when the observations are far from the target and sharpen the quantization when the observations are closer to the target [13], [51]. This interpretation is also confirmed by our numerical results in section V, Fig. 8.
To solve a quantization problem such as (12) using non-variational techniques, it is customary to approximate/convert the quantization problem to a clustering problem [52], [53]. Lemma 2 approximates the quantization problem (12) by a clustering problem.
Lemma 2. The quantization problem (12) can be approximated by a clustering problem in which μ_k is the centroid of the k-th cluster P_{i,k} and P_i = {P_{i,1}, ..., P_{i,|C|}} is a partition of the observation space Ω. Like any other quantization function, the quantizer π_i^c(·) can be uniquely described by the partition P_i together with C.
Proof. Appendix B provides proof and discussions.
Problem (13) can be solved via k-median clustering. In order to do so, one can first perform k-median clustering on the observation values by solving
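To make the clustering step concrete, the sketch below performs a one-dimensional k-median over the scalar values V*(o) and derives a communication policy from the resulting centroids. This is a minimal illustration, not the paper's implementation; the helper names (`k_median_1d`, `make_quantizer`) and the alternating assignment/median-update loop are our assumptions.

```python
import random

def k_median_1d(values, k, iters=100, seed=0):
    """Cluster scalar values into k clusters by alternating nearest-centroid
    assignment and median updates (1-D k-median)."""
    rng = random.Random(seed)
    centroids = rng.sample(sorted(set(values)), k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[j].append(v)
        new = [sorted(c)[len(c) // 2] if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids

def make_quantizer(value_of, centroids):
    """Communication policy pi_c: observation -> index of the nearest
    centroid in value space (the R-bit message)."""
    def pi_c(o):
        v = value_of[o]
        return min(range(len(centroids)),
                   key=lambda c: abs(v - centroids[c]))
    return pi_c
```

Note that observations with similar values land in the same cluster regardless of how far apart they are in the original observation space, which is exactly the behaviour Theorem 1 calls for.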

B. Centralized Training Phase
While solving the TODC problem provides a task-effective design of the quantization policy, to solve (12) we need to know the value of observations under the optimal centralized control policy. By solving the centralized problem (3), the value Q*(s(t), m(t)) of joint observations and actions can be obtained. Let us recall that the centralized training phase will only yield an optimal policy if the environment is jointly observable, as described by Condition 3.
Accordingly, following Lemma 4, we can compute the value V*(o_i(1)) of each agent's observations. Before Lemma 4, however, let us first give an intuitive meaning to centralized training and distributed execution. In task-oriented communication design, our goal is to take into account the usefulness/value of the data for the task at hand; thus we first need to be able to measure the usefulness/value of the data to be transmitted. The centralized training phase is needed to come up with a precise measure of usefulness for the specific task at hand. We have already shown in Theorem 1 that this measure of usefulness is nothing but the value of observations V*(o_i(1)); yet the exact values of this function can be known only after the centralized training phase. During the centralized training phase, we assume perfect communication between all agents and a central controller; this is common practice in the literature on multi-agent communications and coordination [26], [54]. In contrast, in the decentralized training phase (step "c" of Fig. 3) as well as in the execution phase, we assume bit-budgeted communications. That is, all the results reported for SAIC in section V are obtained via bit-budgeted communications.
Proof. Appendix C.
Based on (15), V*(o_i(1)) can be computed both analytically (if the transition probabilities of the environment are available) and numerically. As detailed in Algorithm 1, SAIC first solves a centralized control problem to compute the value V*(o) for all o ∈ Ω; this is equivalent to step "a" of Fig. 3 and subplot (b) of Fig. 4. Afterwards, SAIC solves the approximated TODC problem (12) by converting it to a k-median clustering (13), leading to an observation aggregation/quantization function for each agent i determined by π_i^c(·); this is equivalent to step "b" of Fig. 3 and subplot (c) of Fig. 4. By following this aggregation function, the observations o_i(t) ∈ Ω are aggregated/quantized such that the performance of the multi-agent system, in terms of the objective function it attains, is optimized. As SAIC uses a deterministic mapping of the observation o_i to produce the communication message c_i, SAIC is guaranteed to have positive signalling [30].
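Since eq. (15) is not reproduced in this excerpt, the sketch below shows one plausible way such a per-observation value could be extracted numerically from a centralized two-agent Q-table: averaging, over the other agent's observation, the optimal joint value. The function name `local_values` and the specific marginalization are assumptions, not the paper's exact formula.

```python
from itertools import product

def local_values(Q, obs_space, actions):
    """Marginalize a centralized two-agent Q-table into per-observation
    values. Hypothetical reading of eq. (15): V*(o_i) is the mean over the
    other agent's observation o_j of max_{m_i, m_j} Q*(o_i, o_j, m_i, m_j)."""
    V = {}
    for o_i in obs_space:
        totals = [max(Q[(o_i, o_j, m_i, m_j)]
                      for m_i, m_j in product(actions, actions))
                  for o_j in obs_space]
        V[o_i] = sum(totals) / len(totals)
    return V
```

The resulting dictionary V is exactly the scalar input that the subsequent k-median clustering step operates on.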

C. Obtaining Decentralized Control Policies via a Decentralized Training Phase
Upon the availability of π_i^c(·), ∀i ∈ N, obtained by solving problem (13), we need to find control policies for all agents corresponding to the communication policies π_i^c(·), i ∈ N. That is, we now solve problem (10) by plugging the exact communication policy π_i^c(·), ∀i ∈ N, into it. Within this training phase, referred to as the decentralized training phase, the control Q-tables Q_i^m(·), ∀i ∈ N, are obtained (step "c" of Fig. 3). This training phase, as well as the execution phase of the algorithm, can be carried out distributively, with agents communicating over bit-budgeted channels using the communication policies obtained in section III-A. The following remarks characterize the performance of SAIC in the decentralized training phase.
We now first define the concept of lumpability, according to which we will then set a condition (Condition 6) for the correctness of Remarks 3 and 4.
Definition 5. Lumpability of an MDP: Let α_s be the probability distribution of the initial state of an MDP. The MDP is called (strongly) lumpable with respect to the perception function h_i(·) if the transitions between all the perceived states h_i(s(t)), i.e., the states perceived through the lens of h_i(·), follow the Markov rule for every probability distribution α_s of the initial state of the original MDP [34].
Condition 6. Let the environment as perceived from the perspective of agent i within the decentralized training phase be called an aggregated MDP, denoted by ⟨Ω × C^{n−1}, M, r(·), γ, T(·)⟩, where the state space Ω × C^{n−1} of the aggregated MDP is the image of Ω^n under the perception function h_i(·). Given Definition 5, assuming the lumpability of the underlying MDP ⟨Ω^n, M^n, r(·), γ, T(·)⟩ with respect to h_i(·) is equivalent to assuming that the aggregated ⟨Ω × C^{n−1}, M, r(·), γ, T(·)⟩ is an MDP under every possible α_s. This assumption is in place for the correctness of Remarks 3 and 4.
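For finite state spaces, Definition 5 can be checked with the classical strong-lumpability condition for Markov chains: the aggregated process is Markov for every initial distribution if and only if, for each pair of blocks, the total probability of jumping into a block is identical from every state of the other block. The sketch below (function name `is_strongly_lumpable` is ours) illustrates the test on a transition matrix stored as a dict of dicts.

```python
def is_strongly_lumpable(P, partition):
    """Strong-lumpability test for a finite Markov chain.
    P: dict-of-dicts transition matrix, P[s][t] = Pr(s -> t).
    partition: list of disjoint sets of states (the blocks).
    Lumpable iff, for each pair of blocks (A, B), the probability of
    jumping from s into B is the same for every s in A."""
    for A in partition:
        for B in partition:
            # rounded to absorb floating-point noise in the row sums
            probs = {round(sum(P[s].get(t, 0.0) for t in B), 12) for s in A}
            if len(probs) > 1:
                return False
    return True
```

In the SAIC setting, the chain would be the joint-state process and the partition the preimages of the perception function h_i(·); here the test is shown on a toy three-state chain.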
Remark 1: The optimal policy π*(·) is achievable by the centralized training phase. Assuming Condition 3 holds, the environment is fully observable to the central controller, and the central controller possesses the ability to jointly select the actions of all agents. The problem thus reduces to single-agent Q-learning applied to an MDP, with asymptotic convergence to the optimal policy π*(·).

Algorithm 1 (excerpt):
4: Initialize an all-zero Q-table Q(o_i(t), o_j(t), m_i(t), m_j(t)).
5: Obtain π*(·) and Q*(·) by solving (3) using Q-learning [45].
6: Compute V*(o_i(t)) following eq. (15), for all o_i(t) ∈ Ω.
7: Solve problem (13) by applying k-median clustering to obtain π_i^c(·), for i ∈ N.
8: for each episode k = 1 : K do
9: Randomly initialize the local observation o_i(t = 1), for i ∈ N.
10: for t_k = 1 : M do
11: Select c_i(t) following π_i^c(·), for i ∈ N.
12: Obtain the message c_{−i}(t), for i ∈ N.
13: Select m_i(t) following π_i^m(m_i(t) | o_i(t), c_{−i}(t)) with a greedy policy, for i ∈ N.

Remark 2: Each agent i, whose underlying MDP is denoted by ⟨Ω^n, M^n, r(·), γ, T(·)⟩, views an aggregated form of the original MDP, denoted by ⟨Ω × C^{n−1}, M, r(·), γ, T(·)⟩. The aggregated MDP experienced by agent i will be an MDP itself if Conditions 3 and 6 hold.

Remark 3:
The MAS, during the decentralized training phase, is composed of n different MDPs with identical state space Ω × C^{n−1}, action space M, and reward signal. The resulting multi-agent environment is, by definition, a multi-agent MDP (MMDP) [47].

Remark 4:
Within the decentralized training phase, distributed Q-learning is applied to a deterministic MMDP^4, which leads to an asymptotically optimal control policy [14]^5. For this remark to hold, Conditions 3 and 6 must be satisfied.
Note that the control policy π_i^{m,SAIC}(·) obtained within the decentralized training phase of SAIC is optimal for the given communication policy π^{c,SAIC}(·), which was obtained within the centralized training phase. Therefore, π_i^{m,SAIC}(·) is not necessarily an optimal solution to problem (10). In section IV, however, we derive an upper bound on the possible loss in the expected return of the system due to the joint selection of π_i^{m,SAIC}(·) and π^{c,SAIC}(·).
^4 The definition of MMDP in [47] is identical to the definition of cooperative MAMDP used in [14].
^5 This training phase can result in an asymptotically optimal control policy for all agents for non-deterministic MMDPs. This, however, requires n additional centralized training phases prior to the decentralized training phase, where n is the number of agents.

IV. CHARACTERIZING THE ERROR BOUND OF SAIC
As discussed in section III, SAIC uses two approximations to solve the original joint quantization and control problem. We have not yet explained, however, how these approximations impact the performance of SAIC in terms of the system's average return. By extending the results of [33] to a multi-agent scenario, we characterize the performance gap of the SAIC scheme proposed in section III. Instead of measuring the difference between the average return obtained by SAIC and that of the jointly optimal policies for problem (10), in Theorem 8 we measure the performance gap between the average return attained by SAIC and that of the centralized controller, where the latter has access to perfect communications as well as full observability of the environment. The measured gap is, indeed, larger than the performance gap between SAIC and a hypothetical jointly optimal solution to (10), as in the case of the central controller there is no communication/observation limitation in place. The performance gap between SAIC and the centralized solution provided by Theorem 8 is expressed in terms of the discount factor γ of the task and a positive scalar ε. Definition 7 details the notion of ε-cost uniformity, and Lemma 9 is proposed to compute the value of ε for SAIC.
Definition 7. Given a positive number ε, a subset P_{i,k} ⊂ Ω is said to be ε-cost-uniform with respect to the policy π(·) if the following conditions hold for two arbitrary observations o', o'' ∈ P_{i,k}: (16) where M_π(o') = {m ∈ M : π(m|o') > 0}.
Theorem 8. Consider a multi-agent system in which agents are subject to local observability and local action selection. If agents are allowed to communicate through channels with a bit-budget of R bits at each time step, the maximum achievable expected return of the multi-agent system following the SAIC algorithm will be in a small neighbourhood of that of the same MAS controlled by a centralized unit under perfect communications: where γ is the discount factor and ε should be computed according to Lemma 9, conditioned on the lumpability of the original MDP (Condition 6).
Proof. Appendix D.
Lemma 9. Given the partition P_i = {P_{i,1}, ..., P_{i,2^R}} obtained by solving eq. (38) during the centralized training phase, all subsets P_{i,k} for k ∈ {1, 2, ..., 2^R} are ε-cost-uniform with respect to the optimal joint policy π*(·), where ε can be obtained by the following.
Proof. Following Definition 7 and eq. (13), the proof is straightforward.
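One natural reading of Lemma 9 (the paper's exact expression is in eq. (38) and is not reproduced in this excerpt) is that the smallest ε for which every cluster is ε-cost-uniform is the largest intra-cluster spread of the optimal values. A hedged sketch, with the function name ours:

```python
def epsilon_cost_uniformity(V, partition):
    """Smallest epsilon such that every cluster P_{i,k} is
    epsilon-cost-uniform: the maximum intra-cluster spread of V*(o).
    This is a plausible reading of Lemma 9, not the paper's eq. (38)."""
    return max((max(V[o] for o in cluster) - min(V[o] for o in cluster))
               for cluster in partition)
```

Intuitively, a finer partition (larger bit-budget R) shrinks each cluster's value spread and hence ε, tightening the bound of Theorem 8.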
V. PERFORMANCE EVALUATION
In this section, we evaluate our proposed schemes via numerical results for a particular geometric consensus problem with finite observability called the rendezvous problem. Geometric consensus problems arise in numerous emerging applications such as UAV/vehicle platooning, making them a meaningful application area for the framework proposed in this paper [6]. The numerical results achieved by SAIC demonstrate the suitability of the proposed framework as a potential enabling technology for vehicle/UAV platooning under limited communications.
The rendezvous problem, a sub-category of geometric consensus, has been previously investigated in the literature [42], [55], whereas in our case the inter-agent communication channel has a limited bit-budget. The rendezvous problem is of particular interest to us also because it allows us to consider a cooperative MAS comprising multiple agents that are required to communicate for their coordination task. In particular, as detailed in subsection V-A, if the communication between agents is not efficient, at any time step t each agent i will only have access to its local observation o_i(t), which is its own location in the rendezvous problem. This information alone is insufficient for an agent to attain the larger reward C_2, but is sufficient to attain the smaller reward C_1. Accordingly, compared with cases in which no communication between agents is present, in the setup of the rendezvous problem, efficient communication policies can increase the attained objective function of the MAS up to sixfold, as will be seen in Fig. 4. The system operates in discrete time, with agents taking actions and communicating in each time step t = 1, 2, .... We consider a variety of grid worlds with different size values N and different locations for the goal point ω_T. We compare the proposed SAIC and LBIC with (i) the centralized Q-learning scheme and (ii) the conventional information compression (CIC) scheme, which is explained in subsection V-B. Changing the reward function can also create new scenarios. For example, a reward function that encourages the agents to come as close together as possible without colliding can emulate a vehicle platooning scenario. While useful, investigating the response of the multi-agent system to different reward schemes is outside the scope of our work.
Note that, according to Theorem 1, regardless of the definition of the reward function, the geometric consensus problem (or, in general, the joint quantization and control problem) can be solved by SAIC if the necessary Conditions 3 and 6 are met and the centralized training phase is feasible. As the number of agents n increases, Q-learning for the centralized training phase becomes increasingly demanding in terms of computational complexity; this is where SAIC's bottleneck lies. A larger reward C_2 > C_1 is given to both agents when they enter the goal point at the same time, whereas C_1 is the reward accrued when only one agent enters the goal position [27].

A. Rendezvous Problem
As illustrated in Fig. 5, in a rendezvous problem, multiple agents operate on an N × N grid world and aim at arriving at the goal point on the grid at the same time. Each agent i ∈ N at any time step t can only observe its own location o_i(t) ∈ Ω on the grid, where the observation space is Ω = {0, 1, ..., N^2 − 1}. Each episode terminates as soon as one or more agents visit the goal point, denoted by ω_T ∈ Ω. That is, at any time step t at which the observation of an agent i ∈ N equals ω_T, the episode is terminated, so the time horizon M is non-deterministic. The subset S_T ⊂ S defines all state realizations in which one or more agents are at the goal location. We also define the subset S_{T_{n'}} ⊂ S_T that includes all the terminal states in which exactly n' agents have arrived at the goal location, where N' ⊆ N is a subset of all agents with size |N'| = n'. Following the same definition, the subset S_{T_n} is the set of all terminal states in which all agents are at the goal location. At time t = 1, the initial position of each agent i ∈ N is selected uniformly at random amongst the non-goal states, i.e., o_i(1) ∈ Ω \ {ω_T}. At any time step t = 1, 2, ..., each agent i observes its position, or environment state, and acquires information about the positions of the other agents by receiving a communication message vector c_{−i}(t) sent by the other agents j ∈ N_{−i} at time step t. Based on this information, agent i selects its environment action m_i(t) from the set M = {Right, Left, Up, Down, Stop}, where an action m_i(t) ∈ M represents the horizontal/vertical move of agent i on the grid at time step t. For instance, if an agent i is on a grid world as depicted in Fig. 5(a) and observes o_i(t) = 4 and selects "Up" as its action, the agent's observation at the next time step will be o_i(t + 1) = 8.
If the position to which the agent should be moved is outside the grid, the environment is assumed to keep the agent in its current position. We assume that all these deterministic state transitions are captured by T(o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)), which determines the observations of the agents at the next time step t + 1 following (o_1(t + 1), ..., o_n(t + 1)) = T(o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)).
Accordingly, given observations o_1(t), ..., o_n(t) and actions m_1(t), ..., m_n(t), all agents receive a single team reward r(o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)), equal to C_1 if proposition P_1 holds, C_2 if proposition P_2 holds, and 0 otherwise, where C_1 < C_2 and the propositions P_1 and P_2 are defined as P_1: T(o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)) ∈ S_T − S_{T_n} and P_2: T(o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)) ∈ S_{T_n}. When only a subset N' of agents, with |N'| = n' < n, arrives at the target point ω_T, the episode is terminated with the smaller reward C_1 being obtained, while the larger reward C_2 is attained only when all agents visit the goal point at the same time. Note that this reward signal encourages coordination between agents, which in turn can benefit from inter-agent communications.
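The grid dynamics and team reward described above can be sketched compactly. The sketch below follows the text (row-wise cell numbering, off-grid moves keep the agent in place, C_1 for a partial arrival, C_2 for a joint arrival); the specific grid width and goal cell are illustrative choices, not values from the paper's experiments.

```python
N = 4                      # illustrative grid width; cells numbered row-wise from 0
GOAL = 9                   # illustrative goal cell omega_T
C1, C2 = 1, 10             # team rewards, C1 < C2

MOVES = {"Right": 1, "Left": -1, "Up": N, "Down": -N, "Stop": 0}

def step_one(o, m):
    """Move a single agent; off-grid moves keep it in its current cell."""
    if m == "Right" and o % N == N - 1:
        return o
    if m == "Left" and o % N == 0:
        return o
    nxt = o + MOVES[m]
    return nxt if 0 <= nxt < N * N else o

def transition(obs, acts):
    """Joint deterministic transition T(o_1..o_n, m_1..m_n)."""
    return tuple(step_one(o, m) for o, m in zip(obs, acts))

def reward(obs, acts):
    """Team reward: C2 if all agents reach the goal together,
    C1 if only a strict subset does, 0 otherwise."""
    nxt = transition(obs, acts)
    arrived = sum(o == GOAL for o in nxt)
    if arrived == len(nxt):
        return C2
    if arrived > 0:
        return C1
    return 0
```

With N = 4, the text's example holds: an agent at cell 4 choosing "Up" moves to cell 8.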
Furthermore, at each time step t, agents choose a communication message to send to the other agents by selecting a communication action c_i(t) ∈ C = {0, 1}^R of R bits, where R (bits per channel use, i.e., per time step) is the fixed bit-budget of all inter-agent communication channels. The goal of the MAS is to maximize the average return by solving problem (10).

B. Conventional Information Compression In multi-agent Coordination Tasks
As a baseline, we consider a conventional scheme that selects communications and actions separately. For communication, each agent i sends its observation o_i(t) to the other agents by following a policy π_i^c(·). According to this policy, the agent's observation o_i(t) is mapped to a binary bit sequence c_i(t) using an injective (and not necessarily surjective) mapping f_1: Ω → {0, 1}^R; consequently, the communication policy π_i^c becomes deterministic. Agent i obtains an estimate of the observation of every other agent j ∈ N_{−i} by having access to a quantized version c_j(t) of o_j(t). This estimate is used to define an environment state-action value function, which is updated using Q-learning and the UCB policy in a manner similar to Algorithm 1, with no communication policy to be learned.
This communication strategy is proven to be optimal [19] if the inter-agent communication does not impose any cost on the cooperative objective function, the communication channel is noise-free, and the bit-budget of the communication channels is larger than the entropy rate of the observation process, R ≥ H(o_i). Under these conditions, and when the dynamics of the environment are deterministic, each agent i can distributively learn the optimal policy π_i^m(·) using value iteration or its model-free variants, e.g., Q-learning [14]. While this communication policy is optimal only with a channel bit-budget R ≥ H(o_j), in this paper we focus on scenarios with R ≤ H(o_j). Therefore, due to the bit-budget of the communication channel, a form of TODC is required.
Note that compression before a converged action policy is not possible, since all observations are a priori equally likely. Thus, we first train the CIC on a communication channel with unlimited capacity. Afterwards, when a probability distribution for the observations has been obtained, we apply Lloyd's algorithm [52] to define an equivalence relation on the observation space Ω with 2^R equivalence classes Q_1, ..., Q_{2^R}. According to the equivalence relation defined by Lloyd's algorithm, we can uniquely define the mapping f_1: Ω → {0, 1}^R that maps each agent i's observation o_i(t) to a communication message c_i(t). The inverse f_1^{−1}(·) of the quantization mapping, which maps agent j's quantized observation c_j(t) to an estimated observation, is no longer an injective mapping. That is, upon receiving the communication message c_j(t) ∈ Q_k ⊂ C, agent i cannot retrieve o_j(t) but understands that the observation of agent j has been a member of Q_k. Note that the CIC algorithm has a limitation, as it requires the first round of training to be done over communication channels with unlimited capacity.
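CIC's quantization step can be sketched with a classical Lloyd scalar quantizer run over the empirical observation samples gathered in the first training round. This is a simplified illustration (applying Lloyd directly to scalar cell indices, with helper names `lloyd_quantizer` and `make_f1` that are ours), not the paper's exact procedure.

```python
def lloyd_quantizer(samples, levels, iters=50):
    """Classic Lloyd scalar quantizer: alternate nearest-representative
    assignment and conditional-mean updates over the observed samples."""
    xs = sorted(samples)
    # initialize representatives at evenly spaced sample quantiles
    reps = [xs[(2 * j + 1) * len(xs) // (2 * levels)] for j in range(levels)]
    for _ in range(iters):
        cells = [[] for _ in range(levels)]
        for x in xs:
            j = min(range(levels), key=lambda k: abs(x - reps[k]))
            cells[j].append(x)
        new = [sum(c) / len(c) if c else reps[j] for j, c in enumerate(cells)]
        if new == reps:
            break
        reps = new
    return reps

def make_f1(reps, R):
    """Mapping f1: Omega -> {0,1}^R via the index of the nearest
    representative, rendered as an R-bit string."""
    def f1(o):
        j = min(range(len(reps)), key=lambda k: abs(o - reps[k]))
        return format(j, f"0{R}b")
    return f1
```

Note the contrast with SAIC: Lloyd minimizes distortion in the observation space itself, ignoring how valuable each observation is for the task.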

C. Results
To perform our numerical experiments, the rewards of the rendezvous problem are selected as C_1 = 1 and C_2 = 10, while the discount factor is γ = 0.9. A constant learning rate α = 0.07 is applied, and the UCB exploration rate is c = 12.5. In any figure in which the performance of a scheme is reported in terms of the averaged discounted cumulative rewards, the attained rewards throughout training iterations are smoothed using a moving average filter with memory equal to 10% of the experiment iterations. We use the terms "value of the collaborative objective function", "value of the objective function" and "average return" interchangeably throughout this section. Regardless of the grid world's size and goal location, the grid cells are numbered row-wise starting from the bottom-left, as shown in Fig. 5(a). Apart from Fig. 7, which illustrates the results for a rendezvous problem in a three-agent system, all figures have been obtained in a two-agent environment. Fig. 6 illustrates the performance of the proposed SAIC as well as six other benchmark schemes:
• Centralized Q-learning under perfect communications.
• Learning-based information compression (LBIC), a different indirect scheme for the design of task-oriented communications, which performs the joint design of communication and control policies through reinforcement learning following an algorithm similar to the one proposed in [27].
• CIC; see the details of CIC in subsection V-B.
• The heuristic non-communicative (HNC) algorithm, a direct heuristic scheme that exploits its designer's domain knowledge about the rendezvous task, making it inapplicable to any task other than the rendezvous problem. The domain knowledge is utilized to design a control policy where no communication is present. In HNC, agents approach the goal point and wait next to it for a large enough number of time steps to make sure the other agent has also arrived there. Only after that do they enter the goal point.
Note that this scheme requires communication/coordination between agents prior to the starting point of the task.
• The heuristic optimal communication (HOC) algorithm, a direct heuristic scheme that exploits its designer's domain knowledge about the rendezvous task, making it inapplicable to any task other than the rendezvous problem. The domain knowledge is utilized to design jointly optimal communication and control policies. In HOC, agents approach the goal point and wait next to it until they hear from the other agent that it has also arrived there. Only after that do they enter the goal point. Note that this scheme requires communication/coordination between agents prior to the starting point of the task.
• The hybrid scheme uses the abstract representation of the agents' observations according to SAIC with R = 2 bits and feeds these latent observations to a centralized controller. The central controller learns the joint action selection of both agents using Q-learning.
It is imperative to recall that not all the schemes evaluated in Fig. 6 benefit from indirect designs, making some of them not sufficiently general to be applied to other multi-agent communication problems with rate-limited inter-agent channels. Regardless of their effectiveness, SAIC, LBIC, CIC and Hybrid are indirect schemes potentially applicable to any other task-oriented compression problem, whereas HNC and HOC are tailor-made for the rendezvous problem. In other words, the knowledge that we have about the rendezvous task is already embedded in HNC and HOC to enable the most effective communication/control strategies. HNC and HOC, however, allow us to understand how effective the other, indirect, approaches are even when no knowledge about the specific rendezvous task is embedded in them.
The performance is measured in terms of the expected sum of discounted rewards in a rendezvous problem. The grid world is of size N = 8 and its goal location is ω_T = 22. The bit-budget of the channel between the two agents is R = 2 bits per time step. Since centralized Q-learning is not affected by the limitation on the channel's bit-budget, it achieves optimal performance after sufficient training, 160k iterations. The CIC, due to the insufficient bit-budget of the communication channel, never achieves the optimal solution. The LBIC, however, is seen to outperform the CIC, although it is trained and executed fully distributedly. While enjoying fast convergence, SAIC is observed to achieve optimal performance to within a gap of less than 1%, whereas the performance gaps for the LBIC and CIC are much more pronounced, ranging from 20% to 30%. The yellow curve, showing the performance of the CIC with no communication between agents, represents the best performance of distributed reinforcement learning that can be achieved without inter-agent communication and without the domain knowledge that is present in HOC and HNC. In fact, the better performance of any scheme compared with the yellow curve is a sign that the scheme is benefiting either from some effective communication between agents or from some domain knowledge. Note that, when inter-agent communication is unavailable, i.e., R = 0 bits per time step, there would be no difference between the performance of the CIC, SAIC or LBIC, as all of them use the same algorithm to find the action policy π_i^m(·). We also recall that both the CIC and SAIC require a separate training phase which is not captured by Fig. 5: SAIC requires a centralized training phase, to perform the computations shown in line 5 of Algorithm 1, and CIC a distributed training phase with unlimited capacity of the inter-agent communication channels.
The performance of these two algorithms in Fig. 5 is plotted after the first phase of training.
Similar to Fig. 6, the performance of SAIC is illustrated in Fig. 7, this time in an n = 3 three-agent system. In this case, the grid world is of size N = 3 and its goal location is ω_T = 9. The bit-budget of the inter-agent communication channels is set to R = 1 bit per time step. The shaded area around the curve corresponding to SAIC shows the standard deviation of SAIC in the training as well as the execution phases: at any given training episode k, the width of the shaded area is equal to the standard deviation of SAIC's return from training episode k to episode k − 1000. This figure illustrates the very robust performance of SAIC in a three-agent scenario. For this particular experiment we used decaying ε-greedy policies with a starting value of ε = 1 and an ending value of ε = 0.03. To overcome the issue of credit assignment in multi-agent systems (see, e.g., [54] for an introduction to the concept), here we trained the agents with a different reward function. Accordingly, given observations o_1(t), ..., o_n(t) and actions m_1(t), ..., m_n(t), all agents receive a single team reward r(o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)) = C_2^{n'−1} when the proposition P_3 holds, where P_3: T(o_1(t), ..., o_n(t), m_1(t), ..., m_n(t)) ∈ S_{T_{n'}}. When a subset N' of agents, with |N'| = n' ≤ n, arrives at the target point ω_T, the episode is terminated with the reward C_2^{n'−1} being obtained, while the largest reward C_2^{n−1} is attained only when all agents visit the goal point at the same time. Note that this reward signal encourages coordination between agents, which in turn can benefit from inter-agent communications.
To explain the underlying reasons for the remarkable performance of SAIC, Fig. 8 shows the equivalence classes {P_{i,1}, ..., P_{i,2^R}} computed by SAIC: all locations of the grid shaded with the same colour belong to the same ε-cost-uniform equivalence class. SAIC is extremely efficient in performing state aggregation, such that the loss of observation information barely incurs any loss on the achievable sum of discounted rewards, as also depicted in Fig. 5. Fig. 8(a) illustrates the state aggregation adopted by SAIC, for which the average return is illustrated in Fig. 4: SAIC performs observation compression with ratio R_c = 3 : 1 while incurring nearly no performance loss for the collaborative task of the MAS. Here, the compression ratio is defined as R_c = H(o_i(t)) / H(c_i(t)). It is observed in Fig. 8 that the observation clusters identified by SAIC are not linearly separable under their original representation. In contrast, when clustered according to their values, as seen in Fig. 9, the observation points become linearly separable. Fig. 9 also allows us to see how precise the approximation of the value function is. We also investigate the impact of the channel bit-budget R on the average return achieved by the LBIC, SAIC and CIC in Fig. 10. In this figure, the normalized average return achieved by each scheme at any given R is shown. As per (22), the average return for the scheme of interest is computed under the policies π_i^m(·) and π_i^c(·) obtained by that scheme after solving (10) with a given value of R. The average return is then normalized by dividing it by the average return obtained by the optimal centralized policy π*(·), i.e., the optimal solution to (3) under no communication constraints.
Accordingly, when the normalized objective function of a particular scheme is close to 1, the scheme has compressed the observation information with almost zero loss with respect to the achieved objective function. On the one hand, SAIC achieves the optimal performance with only 2 bits of inter-agent communications, while CIC needs at least R = 4 bits to reach even a sub-optimal value of the objective function. On the other hand, LBIC provides more than 10% performance gain at very low communication rates R ∈ {1, 2, 3} bits per time step compared with CIC, and 20% performance gain compared with SAIC at R = 1 bit per time step. Fig. 11 studies the normalized objective functions attained by LBIC, SAIC and CIC under different compression ratios R_c. A remarkable 40% performance gain is acquired by SAIC, in comparison to CIC, at the high compression ratio R_c = 3 : 1. This is equivalent to a 66% saving in the bit-budget with no performance drop with respect to the collaborative objective function. SAIC, however, underperforms LBIC and CIC at the very high compression ratio of R_c = 6 : 1. This is because the condition mentioned in Remark 2 is not met at this high rate of compression. Moreover, the CIC scheme does not achieve the optimal performance even at the compression ratio of R_c = 6 : 5, since by exceeding the compression ratio R_c = 1 : 1 each agent i may lose some information about the observation o_j(t) of the other agent that could be helpful in taking the optimal action decision. As demonstrated through a range of numerical experiments, the weakness of conventional schemes for compressing agents' observations is that they may lose or keep information regardless of how useful it is towards achieving the optimal objective function.
In contrast, the task-based compression schemes SAIC and LBIC, for communication bit-budgets (much) lower than the entropy of the observation process, manage to compress the observation information not to minimize the distortion but to maximize the achievable value of the objective function. Even though the numerical example provided in Section IV evaluates the performance of SAIC in a problem with a very low communication bit-budget, our theoretical results are applicable in scenarios with higher communication rates, as long as the processing unit deployed to solve problem (3) has sufficient computational resources to solve it within the desired time window.

VI. CONCLUSION
We have investigated the distributed joint design of communications and control for an MAS under bit-budgeted communications, with the ultimate goal of maximizing the system's expected return. Since we consider a limited bit-budget for the multi-agent communication channels, task-based compression of the agents' observations has been of the essence. Our proposed scheme, SAIC, which derives and solves the TODC problem, differs from conventional data quantization algorithms in that it does not aim at achieving the minimum possible distortion between the original signal and its reconstructed version, given a bit-budget for inter-agent communications. Instead, SAIC aims at achieving the minimum possible distortion between the (learned) usefulness/value of the original observation signal and the learned usefulness/value of the reconstructed observation signal, given a bit-budget for inter-agent communications. We have demonstrated the outstanding performance of SAIC compared with conventional data compression algorithms, with up to a remarkable 40% improvement in the achieved objective function under tight constraints on the communication bit-budget.
To maximize the system's expected return, we have shown analytically how one can disentangle the TODC problem from the control problem, given the possibility of a centralized training phase. Our analytical studies confirm that despite the separation of the TODC and control problems, we can ensure very little compromise on the MAS's average return compared with the jointly optimal control and quantization. Since the computational complexity of Q-learning in the centralized training phase is of order |Ω^n × M^n| [56], the addition of a single agent multiplies the complexity of the centralized training by |Ω × M|. Thus, the complexity of the centralized training phase becomes a hurdle for the scalability of SAIC to a large number of agents. Accordingly, improving the scalability of the algorithm, as well as extending the results to non-symmetric, variable bit-budgets, are useful avenues to improve the applicability of the proposed schemes.
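As a quick sanity check on this scaling, consider a hypothetical 9-cell grid with 5 actions per agent (the grid and action sizes here are illustrative, not the paper's exact setting):

```python
def q_table_size(n_obs, n_actions, n_agents):
    """Number of entries in the centralized Q-table, |Ω^n × M^n|."""
    return (n_obs * n_actions) ** n_agents

# Adding one agent multiplies the centralized table by |Ω × M| = 45:
two_agents = q_table_size(9, 5, 2)    # (9 * 5)**2 = 2025
three_agents = q_table_size(9, 5, 3)  # (9 * 5)**3 = 91125
```

The exponential growth in the number of agents is exactly why the centralized training phase, rather than the decentralized execution, limits how far SAIC scales.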

APPENDIX A PROOF OF THEOREM 1
To prove this theorem, we first introduce a definition in Subsection A-A, together with two lemmas and their proofs in Subsections A-B and A-C. Lastly, we complete the proof of Theorem 1 in Subsection A-D, leveraging the above.
A. Task-based information compression problem: a definition

Definition 10. [Task-based information compression (TBIC) problem] Let the higher-order function Π^{m*} be a map from the vector space K^c of all possible joint communication policies π^c = (π^c_1(·), ..., π^c_n(·)) to the vector space K^m of the corresponding optimal joint control policies π^{m*} = (π^{m*}_1(·), ..., π^{m*}_n(·)). Upon availability of Π^{m*}, plugging it into problem (10) yields a new problem in which we maximize the system's return only with respect to the joint communication policy π^c; the optimal joint control policies π^{m*}_1(·), ..., π^{m*}_n(·) are automatically computed by the mapping Π^{m*}(π^c_1(·), ..., π^c_n(·)). We refer to this problem as the TBIC problem.
B. Reformulating the objective function: a lemma

Lemma 11. The objective function of the decentralized problem (10) can be expressed as

E_{p_{π^m,π^c}({tr(t)}_{t=t'}^{M})}[g(t')] = E_{p_{π^m,π^c}(h_i(s(t')))}[ E_{p_{π^m,π^c}({tr(t)}_{t=t'}^{M} | h_i(s(t')))}[ g(t') | h_i(s(t')) ] ] = E_{p_{π^m,π^c}(h_i(s(t')))}[ V_{π^m,π^c}(h_i(s(t'))) ],

for all i ∈ N, where V_{π^m,π^c}(h_i(s(t'))) is the solution to the Bellman equation corresponding to the joint control and communication policies (π^m, π^c).
Proof. Considering the definition of the value function given in (25), the proof is immediately concluded by applying Adam's law to the expectation of the value function

V_{π^m,π^c}(h_i(s(t'))) = E_{p_{π^m,π^c}({tr(t)}_{t=t'+1}^{M})}[ g(t') | h_i(s(t')) ].   (25)

C. Value of the perceived state of the environment: a lemma

Lemma 12. Using the knowledge of the solution π*(·) to the centralized problem, we can find the optimal value of a perceived state, V*(h_i(s(t))), in terms of the value of the underlying state V*(s(t)). Quantization levels are disjoint sets P_{i,k} ⊂ Ω whose union ∪_{k=1}^{2^R} P_{i,k} covers the entire Ω. Each quantization level is represented by exactly one communication message c_j(t) = c_k ∈ C. Further to Lemma 12, the value V*(h_i(s(t))) can be computed by the empirical mean (26).
The quantization problem (36) becomes a k-median clustering problem, where P_i = {P_{i,1}, ..., P_{i,2^R}} is a partition of Ω, and the first summation Σ_{o_j(t)∈Ω, j∈N_{−i}} is a concatenation of n − 1 summations, each acting over o_j(t) ∈ Ω for j ∈ N_{−i}.
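Since the clustering acts on the scalar learned values V*(o), any one-dimensional k-median heuristic applies. A minimal alternating sketch (our own function names and a simple initialization; the exact problem would use the empirical values from the centralized phase, and an exact dynamic-programming solver could replace this heuristic):

```python
import statistics

def k_median_1d(values, k, iters=50):
    """Alternating k-median on scalar values: assign each value to its
    nearest representative, then move each representative to the median
    of its cluster. A heuristic sketch, not a guaranteed-optimal solver."""
    # Spread the initial representatives across the sorted values.
    reps = sorted(values)[:: max(len(values) // k, 1)][:k]
    for _ in range(iters):
        clusters = [[] for _ in reps]
        for v in values:
            j = min(range(len(reps)), key=lambda j: abs(v - reps[j]))
            clusters[j].append(v)
        # Median update: the k-median analogue of the k-means centroid step.
        reps = [statistics.median(c) if c else r
                for c, r in zip(clusters, reps)]
    return reps, clusters
```

Each resulting cluster plays the role of one equivalence class P_{i,k}, i.e., one of the 2^R messages that the agent can transmit.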
By taking the mean of V*(s(t)) over the empirical distribution of o_j(t), ∀j ∈ N_{−i}, we can also marginalize out o_j(t), ∀j ∈ N_{−i}. Again, this does not change the solution of the problem, and we obtain clusters in which µ_k, the representative value computed over o_j(t) ∈ P_{i,k}, approximates V*(c_i(t)).
To gain more insight into the meaning of this task-based information compression, it is useful to look at the conventional quantization problem, which is adapted to our problem setting in (39), where c_j = π^c_j(o_j(1)). In fact, the compression scheme applied in the CIC, explained in Subsection V-B, is obtained by solving the following problem, which can be solved optimally by Lloyd's algorithm [52].
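For contrast with the value-based clustering above, the distortion-minimizing quantizer underlying CIC can be sketched as a scalar Lloyd iteration (a generic textbook implementation under our own naming, not the paper's exact code):

```python
def lloyd_quantizer(samples, levels, iters=50):
    """Lloyd's algorithm: alternate nearest-level assignment and
    centroid (mean) updates. It minimizes squared reconstruction
    error, with no regard for how useful the retained information
    is to the downstream task."""
    reps = sorted(samples)[:: max(len(samples) // levels, 1)][:levels]
    for _ in range(iters):
        cells = [[] for _ in reps]
        for x in samples:
            j = min(range(len(reps)), key=lambda j: abs(x - reps[j]))
            cells[j].append(x)
        # Mean update: this is exactly where Lloyd differs from the
        # k-median step, and why it targets distortion, not task value.
        reps = [sum(c) / len(c) if c else r for c, r in zip(cells, reps)]
    return reps
```

The structural difference to the TBIC clustering is only the update rule and, crucially, the quantity being clustered: Lloyd operates on the raw observations, whereas SAIC clusters observations by their learned values.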

APPENDIX C PROOF OF LEMMA 4
Proof. Further to the law of iterated expectations, V*(o_i(t')) can be expressed as an expectation whose last term is the optimal value of the state s(t') = (o_i(t'), o_{−i}(t')) of the underlying MDP, V*(s(t')) = E_{π*}[ g(t') | o_i(t'), o_{−i}(t') ].
Using (40) and (42) we can simply compute V*(o_i(t')).

APPENDIX D PROOF OF THEOREM 8

Proof. Without loss of generality, we write the proof of this theorem for a two-agent scenario to improve readability; given the proof for the two-agent system, the extension to a multi-agent system is straightforward. According to [33, Lemma 1], the optimal state values of the aggregated MDPs (the environment as seen by one agent during the decentralized training phase of SAIC) are in a