Effective Communication With Dynamic Feature Compression

The remote wireless control of industrial systems is one of the major use cases for 5G and beyond systems: in these cases, the massive amounts of sensory information that need to be shared over the wireless medium may overload even high-capacity connections. Consequently, solving the effective communication problem by optimizing the transmission strategy to discard irrelevant information can provide a significant advantage, but is often a very complex task. In this work, we consider a prototypal system in which an observer must communicate its sensory data to a robot controlling a task (e.g., a mobile robot in a factory). We then model it as a remote Partially Observable Markov Decision Process (POMDP), considering the effect of adopting semantic and effective communication-oriented solutions on the overall system performance. We split the communication problem by considering an ensemble Vector Quantized Variational Autoencoder (VQ-VAE) encoding, and train a Deep Reinforcement Learning (DRL) agent to dynamically adapt the quantization level, considering both the current state of the environment and the memory of past messages. We tested the proposed approach on the well-known CartPole reference control problem, obtaining a significant performance increase over traditional approaches.


I. INTRODUCTION
THE main goal of classical communication theory is to build reliable systems for the accurate transmission of arbitrary data through a constrained communication channel while using as few symbols as possible. However, in the preface to Shannon's seminal work [1], Warren Weaver already envisioned two more complex Levels of communication beyond the simple transmission of bits. Classical communications are then included in Level A, or the technical problem, which concerns itself with the accurate and efficient transmission of arbitrary raw data. Level B, or the semantic problem, is to find the best way to convey the meaning of the message, even when irrelevant details are lost or misunderstood, while Level C, also called the effectiveness problem, deals with the resulting behavior of the receiver [2]: as long as the receiver takes the optimal decision, the effectiveness problem is solved, regardless of the quality of the received information. The Level B and C problems are tightly intertwined, as defining the meaning of a message is often related to the intentions of the receiver.
While the Level B and C problems attracted limited attention for decades, the explosion of Industrial Internet of Things (IIoT) systems has drawn the research and industrial communities toward semantic and effective communication [3], optimizing remote control processes under severe communication constraints beyond Shannon's limits on Level A performance [2]. In particular, the effectiveness problem is highly relevant to robotic applications, in which independent mobile robots, such as drones or rovers, must operate based on information from remote sensors. In this case, the sensors and the cameras act as the transmitter in a communication problem, while the robot is the receiver: by solving the Level C problem, the sensors can transmit the information that best directs the robot's actions toward the optimal policy [4]. We can also consider a case in which the robot is the transmitter, while the receiver is a remote controller, which must get the most relevant information to decide the control policy [5].
The rise of communication metrics that take the content of the message into account, such as the Value of Information (VoI) [6], represents an attempt to approach the problem in practical scenarios, and analytical studies have exploited information theory to define a semantic accuracy metric and minimize distortion [7]. In particular, information bottleneck theory [8] has been widely used to characterize Level B optimization [9]. However, translating a practical system model into a semantic space is a non-trivial issue, and the semantic problem is a subject of active research [10], [11]. The effectiveness problem is even more complex, as it implicitly depends on estimating the effect of communication distortion on the control policy and, consequently, on its performance [12]. While the effect of simple scheduling policies is relatively easy to compute [13], and linear control systems can be optimized explicitly [14], realistic control tasks are highly complex, complicating an analytical approach to the Level C problem. Pure learning-based solutions that consider communication as an action in a multi-agent DRL problem, such as emergent communication, also have limitations [15], as they can only deal with very simple scenarios due to significant convergence and training issues. In some cases, the information bottleneck approach can also be exploited to determine state importance [16], but the existing literature on optimizing Level C communication is very sparse, and limited to simpler scenarios [17]. Another possible approach to the remote tracking of Markov sources is addressed in zero-delay coding theory [18]. However, this theory considers the error on the hidden state estimate as the objective of the optimization, which is similar to what we could consider a semantic (or Level B) approach. In this work, we show that it is possible to optimize the system with respect to other metrics which cannot be explicitly derived, such as cumulative rewards in DRL: by accepting a higher distortion at the semantic Level when it is not relevant to the task, we can further reduce the required bitrate without sacrificing the control performance.
arXiv:2401.16236v1 [cs.LG] 29 Jan 2024
In this work, we consider a dual model which combines concepts from DRL and semantic source coding: we consider an ensemble of VQ-VAE models [19], each of which learns to represent observations using a different codebook. A DRL agent can then select the codebook to be used for each transmission, controlling the trade-off between accuracy and compression. Depending on the task of the receiver, the reward to the DRL agent can be tuned to solve the Level A, B, and C problems, optimizing the performance for each specific task. In order to test the performance of the proposed framework in a relevant example scenario, we consider the well-known CartPole problem, whose state can be easily converted into a semantic one, as its dynamics depend on a limited set of physical quantities. The problem we selected is purposefully simple, as this allows for better explainability and easier training, but the solution is not limited to the CartPole problem, and can be adapted to more complex tasks. The main contributions of this paper are the following:
• We model a remote-control system as a remote POMDP problem and present an efficient solution for learning effective communication through the dynamic compression of learnable features;
• We show that dynamic codebook selection outperforms static strategies for all three Levels, and that considering the Level C task can significantly improve the control performance without increasing the bitrate;
• We adopt an explainability framework to understand the choices of the agent in this simple problem, and verify that the Level C dynamic compression captures the receiver's uncertainty in the state estimation and its impact on the expected reward, transmitting only when necessary.
We remark that the dynamic codebook selection policy is not limited to the VQ-VAE ensemble we consider, but is a general technique that can be applied to any compression algorithm with adaptable quality parameters, helping deliver more accurate information when it is relevant to do so. The results and policy analysis lead to significant insights for the design of communication strategies for remote control. A partial version of this work was presented at the IEEE INFOCOM WiSARN 2023 workshop [20]. This paper includes a more complete theoretical characterization of the problem, as well as an updated learning architecture, additional results, and an in-depth analysis of the DRL-based dynamic compression policy. The rest of the paper is organized as follows: first, we analyze the state of the art on semantic and effective communication in Sec. II. Sec. III then presents the general system model and the three Levels of communication we consider. We then describe the dynamic feature compression solution in Sec. IV, which is evaluated by simulation in Sec. V. Finally, Sec. VI concludes the paper.

II. RELATED WORK
The specific requirements of distributed and remotely controlled systems have focused the research community's attention on communication systems that must provide updated information to enable real-time high-level tasks such as inference, tracking, or control. Although metrics such as Age of Information (AoI) [6] represent a major improvement with respect to latency and packet loss, they are still limited, as they assume that the quality of the information available at the receiver degrades deterministically with time, most commonly (but not necessarily [21]) in a linear fashion. However, more sophisticated systems can also take into consideration the current state of the system in order to decide whether and when to update the status of the receiver. Metrics such as Urgency of Information (UoI) [14] and VoI [22] incorporate state information in their definition and are thus aware of the intrinsic value of potential updates. Other context-aware indices to measure the nonlinear time-varying importance and the non-uniform context dependence of the status information have also been proposed [14]. The authors of [23] considered a system in which a transmitter monitors the status of a system and updates the controller, providing status information. A constrained Markov decision process (MDP) is then formulated to minimize the cost of actuation and simultaneously guarantee a target communication rate. Both works show significant improvements with respect to other metrics such as AoI.
At the same time, the development of learning-based coding schemes has allowed communication system designers to move beyond packet error as the key coding performance metric, exploiting semantic considerations. Joint source-channel coding for wireless image transmission is implemented in [24], [25], and the encoder-decoder pair is parameterized by a neural network (NN), whose architecture may vary. This approach can be used to maintain the semantic information contained in the transmitted data, while improving the compression performance.
Semantic information at the receiver can be used to solve different tasks. Effective communication [12] can be seen as an extension of this, in which the task involves the receiver taking actions and possibly altering the information that the transmitter is communicating. Effective communication differs from semantic communication mostly because the "semantic" content which has to be preserved in the communicated messages is not explicit. Moreover, control tasks have a temporal component that must be taken into account, as investigated in [12]. The scenario considered in that work is a two-agent POMDP in which one agent communicates and the other agent interacts with the environment, using DRL to solve the joint problem and encoding the information. A distributed perception scenario, in which multiple sensors communicate to a single robot, is considered in [26] and solved using multi-agent reinforcement learning (MARL), showing that joint training improves the performance of the system, particularly when communication is severely constrained. While past works aimed at specific scenarios and objectives, this paper proposes a novel DRL approach that combines status updates with an adaptive coding scheme and can be easily adapted to operate on any of the three Levels of communication (A, B, or C).

III. SYSTEM MODEL
The recent interest in semantic and effective communications from the research community has driven the development of a wide array of models and conceptualizations, as highlighted in the previous section. At the highest level of abstraction, our purpose is to define a model in which effective communication is meaningful, and the differences between the three problems in Weaver's formulation become clear.
Let us then consider a simple example: we have a remote actuator performing a control task, while a camera observes the results and transmits its observation through a wireless channel. The actuator might have its own sensors, but it relies on the video feed to improve its performance and maintain stable and efficient control. The classical, Level A approach to the problem would be to compress the video as efficiently as possible, minimizing the reconstruction error on frames by using an appropriate codec. The difference between Level A and Level B solutions is then obvious: the former encodes new frames so that the reconstruction fidelity is preserved, while the latter maps elements in the frame to their importance when estimating the physical state of the system. In some control applications, the state can be defined trivially, while in others it may be more complex, but in general, the translation of the video to the state space is unaffected by irrelevant information (e.g., movements in the background).
If we consider Level C, we target control performance directly, and thus further restrict the definition of relevant information: while Level B concerns itself with estimating the system state correctly, a Level C solution only considers errors in the state estimation when they cause performance drops. If the control action is the same over a wide set of states, accuracy then becomes unnecessary, as the actuator only needs a rough estimate of the state to decide what to do; the same happens if there are multiple actions with almost equivalent performance, i.e., if the optimality gap caused by imperfect information remains small. These natural observations represent the core concepts of effective communication, but implementing them in practical systems is often complex, as actions have long-term consequences, and state estimates are based on a history of observations, so that transmitting a message may affect future performance in complex ways. In the following, we provide an analytical framework using the remote POMDP approach to objectively evaluate these choices and implement a solution for effective communication in cyber-physical systems. We will denote random variables with capital letters, their possible values with lower-case letters, and sets with calligraphic or Greek capitals. Table I reports the main symbols we introduce in the following sections for the reader's convenience.

A. POMDP Definition and Solution
In the infinite horizon POMDP formulation [27], one agent needs to optimally control a stochastic process defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, P, \omega, R, \gamma \rangle$. The POMDP proceeds in discrete steps indexed by $t$: at each step $t$, the agent can infer the system state $s_t$ only from the partial information given by the history of stochastic observations $h_t = (o_t, o_{t-1}, \ldots, o_1) \in \mathcal{O}^t$. Based on these observations, and on its policy $\pi : \mathcal{O}^t \to \Phi(\mathcal{A})$, which outputs a probability distribution over the action space for each possible observation history, the agent interacts with the system by selecting an action $A_t \sim \pi(h_t)$. The sampled action $a_t$ is then performed in the real environment, whose hidden state $s_t$ is unknown to the agent, which then receives feedback from the environment in the form of a (potentially stochastic) reward $r_t$, with expected value $R_t = R(s_t, a_t)$. The goal for the agent is then to optimize its policy $\pi$ to maximize the expected cumulative discounted reward $G = \mathbb{E}\left[\sum_t \gamma^t R_t\right]$.
TABLE I (excerpt): $\xi^{\mathrm{pri},(r)}$, prior belief distribution at the robot; $\xi^{(r)}$, belief distribution at the robot.
Having an optimal policy is equivalent to knowing the optimal state-action values, also known as Q-values, and taking action $a_t^* = \arg\max_{a \in \mathcal{A}} Q^*(h_t, a)$, where $Q^*(h_t, a)$ is the expected cumulative reward starting from $(h_t, a)$.
However, considering the full history of observations makes the solution highly complex, as the length of $h_t$ is potentially unbounded. We then define an estimator $\xi : \mathcal{O}^t \to \Phi(\mathcal{S})$, which outputs the a posteriori belief distribution over the state space. We can then recast the original POMDP as a standard MDP, whose state space is $\mathcal{S}' = \Phi(\mathcal{S})$, i.e., the space of possible belief distributions. Solving the POMDP in this modified belief space has been proved to be optimal in [28].
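The policy over this modified MDP is then $\pi : \mathcal{S}' \to \mathcal{A}$, which can be optimized using standard tools [29]. We can also compute the new transition probability and expected reward as in [28]. Given $\xi_t(s) = \Pr(S_t = s \mid h_t)$, which represents the a posteriori probability of state $s$ given the whole history of observations $h_t$, and $A_t = a$, we define the a priori belief over the state at time $t+1$ as
$$\xi^{\mathrm{pri}}_{t+1}(s') = \sum_{s \in \mathcal{S}} \xi_t(s)\, P(s' \mid s, a),$$
where $P$ denotes the state transition probability. The a posteriori belief $\xi_{t+1}$, which also includes the new observation $o_{t+1}$, can then be obtained by performing a Bayesian update using the a priori belief as a prior:
$$\xi_{t+1}(s') = \frac{\omega(o_{t+1} \mid s')\, \xi^{\mathrm{pri}}_{t+1}(s')}{\sum_{s'' \in \mathcal{S}} \omega(o_{t+1} \mid s'')\, \xi^{\mathrm{pri}}_{t+1}(s'')}, \tag{1}$$
where $\omega$ denotes the observation probability. These update equations allow us to compute the modified transition probability matrix $P'$ for the belief MDP, which is then defined by the tuple $\langle \Phi(\mathcal{S}), \mathcal{A}, P', R, \gamma \rangle$.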
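As an illustration, the a priori prediction and the a posteriori Bayesian correction described above can be sketched for a small finite state space. This is a minimal sketch, not the implementation used in the paper: the nested-list layout of the transition model `P[a][s][s2]` and of the observation model `omega[s2][o]`, as well as the function name, are assumptions made for this example.

```python
def belief_update(belief, action, observation, P, omega):
    """One step of the discrete belief update (prediction, then correction).

    belief[s]    : current a posteriori probability of state s
    P[a][s][s2]  : assumed layout, probability of moving from s to s2 under a
    omega[s2][o] : assumed layout, probability of observing o in state s2
    Returns the pair (a priori belief, a posteriori belief) at the next step.
    """
    n = len(belief)
    # Prediction: propagate the current belief through the transition model.
    prior = [sum(belief[s] * P[action][s][s2] for s in range(n))
             for s2 in range(n)]
    # Correction: weight the prior by the likelihood of the new observation.
    unnormalized = [omega[s2][observation] * prior[s2] for s2 in range(n)]
    z = sum(unnormalized)
    if z == 0.0:
        raise ValueError("observation has zero probability under the prior")
    posterior = [u / z for u in unnormalized]
    return prior, posterior


# Toy example with two states, one action, and two possible observations.
P = [[[0.9, 0.1], [0.2, 0.8]]]
omega = [[0.8, 0.2], [0.3, 0.7]]
prior, posterior = belief_update([0.5, 0.5], 0, 0, P, omega)
```

In continuous settings such as the one considered later in the paper, this explicit update is not tractable, which is why a learned latent state is used instead.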

B. The Remote POMDP
In this paper, we consider a variant of the POMDP, which we call the remote POMDP, in which two agents are involved in the process. The first agent, i.e., the observer, receives observation $O_t \in \mathcal{O}$, and needs to convey such information to a second agent, i.e., the robot, through a constrained communication channel, which limits the number of bits the observer can send reliably. Consequently, the amount of information the observer can send to the robot is limited. The robot then chooses and takes an action in the physical environment. This system can formalize many control problems in future IIoT systems, as sensors and actuators may potentially be geographically distributed, and the amount of information they can exchange to accomplish a task is limited by the shared wireless medium, which has to be allocated to the many devices installed in the factory, as well as by the energy limitations of the sensors. Similar systems have been analyzed in [12], [23]. We will now analyze the problems for the two agents, considering a case in which communication and control are designed separately. Joint control and communication approaches [26] can outperform separate approaches by tuning the two agents' policies to each other, but they introduce additional training complexity, and will not be considered in this work. In the following, we will denote a variable $x$ related to the robot as $x^{(r)}$, while the corresponding variable on the observer side will be denoted by $x^{(o)}$.
1) The Robot-Side POMDP: We denote the message communicated to the robot at time step $t$ as $m_t \in \mathcal{M}$. The set of possible messages $\mathcal{M}$ forms the set of observations that are available to the robot, and the history of these observations is given by $h^{(r)}_t = \{m_t, \ldots, m_1\}$, which is the sequence of messages received up to time $t$. We can then see the robot as an agent with its own POMDP, in which the observations are filtered by both the partial knowledge of the observer and the further distortion produced by the fact that these observations are encoded and communicated through a constrained channel. The robot-side POMDP is then defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{M}, P, \pi^{(o)}, R, \gamma \rangle$, as observations depend on the observer's policy.
The message transmitted from the observer to the robot modifies the belief distribution over the next state as a Bayesian update. Let us define the distribution over the current state, given that the message $m_t$ has been received, as $\xi^{(r)}_t$. For example, if the communicated message $m_t$ contains the correct state $S_t = s$, the belief distribution becomes deterministic, i.e., $\xi(s' \mid m_t) = \delta_{s,s'}$, where $\delta_{m,n}$ is the Kronecker delta function, equal to 1 if the two arguments are the same and 0 otherwise. Ideally, an intelligent observer will allocate more communication resources and thus provide more precise messages if the a priori distribution of the robot is far from the one estimated by the observer. The modified MDP is then defined by the tuple $\langle \Phi(\mathcal{S}), \mathcal{A}, P^{(r)}, R, \gamma \rangle$, where $P^{(r)}$ represents the Bayesian update function. According to the previous notation, we can express the optimal action at time $t+1$ as $a^*_{t+1} = \arg\max_{a \in \mathcal{A}} Q^*\left(\xi^{(r)}_t, a\right)$, where $\xi^{(r)}_t$ is the current belief at the robot side (after message $m_t$ is received). The robot's reward is simply given as the reward of the original POMDP, i.e., the control performance in the environment. The optimal policy can be reached by using standard DRL tools.
2) The Observer-Side POMDP: On the other side, the observer needs to encode its belief $\xi^{(o)}_t$ in a message $m_t \in \mathcal{M}$ and transmit it. We can then consider the observer-side POMDP, in which the action set corresponds to the set of messages $\mathcal{M}$ and the state space is represented by the belief from the observed results. The tuple defining this POMDP is $\langle \mathcal{S}, \mathcal{M}, \mathcal{O}, P, \omega, R^{(o)}, \gamma \rangle$. We assume that the observer knows the robot's policy, i.e., it can know the actions that the robot takes in the environment and use them to improve its estimate of the state. This can also be accomplished if the robot transmits the actions it takes as feedback to the observer. As described above for the robot-side problem, we can transform this POMDP into the belief MDP given by $\langle \Phi(\mathcal{S}) \times \Phi(\mathcal{S}), \mathcal{M}, P^{(o)}, R^{(o)}, \gamma \rangle$, where $P^{(o)}$ is the Bayesian update given in (1). We highlight that the observer needs to keep track of both its own and the robot's belief, as the effectiveness of communication depends on the difference between the two, and the state of the observer is given by the pair of belief distributions, i.e., its own belief $\xi^{(o)}_t$ and the robot's belief.
The objective of the observer is to minimize channel usage, i.e., communicate as few bits as possible, while maintaining the highest possible performance in the control task: the expected reward $R^{(o)}$ depends on both components. We then consider a simple linear combination approach: if it transmits message $m$, whose length in bits is $\ell(m)$, the observer gets a penalty $\beta \ell(m)$, where $\beta \in \mathbb{R}^+$ is a cost parameter. In order to optimize its policy, the observer also needs to have a way to gauge the value of information, which is a complex problem: information theory, and in particular rate-distortion theory, have provided the fundamental limits when optimizing for the technical problem, i.e., Level A, where the goal is to reconstruct the source signals with the highest fidelity [30]. We will discuss the definition of VoI in the following sections. More complex modeling choices for the transmission cost are also possible, e.g., considering energy constraints for a sensor with energy harvesting capabilities, but are beyond the scope of this work.
As the complexity of the problem is massive, we restrict ourselves to a smaller action space by making a simplifying assumption, which allows us to separate the problem: the observer does not transmit the entire belief distribution, which may be implicit, but rather the observation $O_t$. We then consider the encoding function $\Lambda : \mathcal{O} \times \Phi(\mathcal{S}) \to \mathcal{M}$, which will generate a message to be sent to the robot at each step $t$.

C. Observer Reward in Remote POMDPs
The first and simplest way to solve the remote POMDP problem is to blindly apply standard Level A rate-distortion metrics to compress the sensor observations into messages to be sent to the agent. As an example, in the CartPole problem analyzed in this work (see Sec. V), one sensor observation is given by two consecutive 2D camera acquisitions. The observer's policy is then independent of the robot's task, and can be computed separately. The Level A reward function $R^{(o)}_A$ is then given as follows:
$$R^{(o)}_A = -d_A(o_t, \hat{o}_t) - \beta \ell(m_t), \tag{2}$$
where $d_A$ measures the distortion between the observation $o_t$ and its reconstruction $\hat{o}_t$ at the robot. In the CartPole case, a natural distortion metric is the image Peak Signal to Noise Ratio (PSNR), an image quality metric proportional to the logarithm of the normalized Mean Square Error (MSE) between the images. Naturally, encoding the observation with a higher precision will require more bits, as the set of messages needs to be bigger. The Level B problem considers the projection of the raw observations into a significantly smaller semantic space, over which we measure distortion using function $d_B$, explicitly capturing the error over the needed physical system information, e.g., the angular position and velocity of the pole in the CartPole problem. The Level B reward function $R^{(o)}_B$ is then given as follows:
$$R^{(o)}_B = -d_B\left(\hat{s}^{(o)}_t, \hat{s}^{(r)}_t\right) - \beta \ell(m_t). \tag{3}$$
In our CartPole case, $d_B$ may be simply represented by the MSE between the best estimates of the state at the transmitter, $\hat{s}^{(o)}_t$, and at the receiver, $\hat{s}^{(r)}_t$.
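The relation between the MSE and the PSNR mentioned above can be made concrete with a short sketch. It assumes 8-bit grayscale images flattened into lists of pixel values and uses the standard definition PSNR = 10 log10(MAX² / MSE); the function names are illustrative and not taken from the paper.

```python
import math


def mse(a, b):
    """Mean square error between two equally sized sequences of values."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB (higher means a better reconstruction)."""
    err = mse(original, reconstruction)
    if err == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy 4-pixel "images": a small error gives a high PSNR.
original = [0, 128, 255, 64]
reconstruction = [2, 126, 255, 64]
print(mse(original, reconstruction), psnr(original, reconstruction))
```

The same `mse` function can serve as a Level B distortion when applied to two estimates of the four-dimensional physical state instead of pixel values.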
Finally, we can consider the Level C system. In this case, the distortion metric is not needed, as the control performance can be used directly, and the reward $R^{(o)}_C$ is given by
$$R^{(o)}_C = R(s_t, a_t) - \beta \ell(m_t). \tag{4}$$
The VoI of message $m$, $V\left(\xi^{\mathrm{pri},(r)}_t, m\right)$, can then be given by the difference between the expected performance of the robot with and without this information. Thus, the optimal Level C observer policy $\pi^{(o)}_C$ will balance the trade-off between the performance at the receiver and the communication cost not only in the current time step but also in the long term. This foresighted behavior is essential when considering that the belief distributions incorporate the memory of previously received messages. Providing information that does not improve the expected reward in the next step might still be worth the cost if it allows the robot to improve its estimate, reducing the need for future communication.
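To make the trade-off between VoI and communication cost concrete, the following sketch implements a myopic (one-step) rule that picks the number of bits per feature maximizing the VoI minus the linear penalty βℓ(m). This is only an illustration: the policy discussed in the text is foresighted and learned with DRL, whereas this rule looks at a single step. The expected performance values, the number of features, and the function name are hypothetical.

```python
def myopic_codebook_choice(expected_perf, num_features, beta):
    """Pick the bits-per-feature value with the best one-step net benefit.

    expected_perf[v] : hypothetical expected robot performance if a message
                       with v bits per feature is sent (v = 0: no message)
    num_features     : number of quantized features per message (assumed)
    beta             : cost per transmitted bit
    """
    baseline = expected_perf[0]          # performance without the message
    best_v, best_score = 0, 0.0          # not transmitting has zero net benefit
    for v, perf in expected_perf.items():
        voi = perf - baseline            # value of information of the message
        score = voi - beta * num_features * v
        if score > best_score:
            best_v, best_score = v, score
    return best_v


# Hypothetical numbers: a coarse message is worth sending at a low bit cost,
# while staying silent is better when bits are expensive.
perf = {0: 10.0, 2: 12.0, 6: 12.5}
cheap = myopic_codebook_choice(perf, num_features=8, beta=0.1)
costly = myopic_codebook_choice(perf, num_features=8, beta=1.0)
```

A foresighted policy may deviate from this rule, for instance by sending an accurate message now to avoid several transmissions later.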

IV. PROPOSED SOLUTION
In this section, we introduce the architecture we used to represent $\Lambda$, the VQ-VAE, and discuss the remote POMDP solution. As the VQ-VAE model is not adaptive, we consider an ensemble model with different quantization levels, limiting the choice of the observer to which VQ-VAE model to use in the transmission. As we mentioned, directly learning the encoding is highly complex, with a vast action space, and techniques such as emergent communication that learn it explicitly are limited to scenarios with very simple tasks and immediate rewards. By restricting the problem to a smaller action space, we find a potentially sub-optimal solution, but we can deal with much more complex problems.

A. Deep VQ-VAE Encoding
In order to represent the encoding function $\Lambda$, and to restrict the observer-side POMDP to a more manageable action space, the observer exploits the VQ-VAE architecture introduced in [19]. The VQ-VAE is built on top of the more common Variational Autoencoder (VAE) model, with the additional feature of finding an optimal discrete representation of the latent space. The VAE is used to reduce the dimensionality of an input vector $X \in \mathbb{R}^I$, by mapping it into a stochastic latent representation $Z \in \mathbb{R}^L$, with $Z \sim q_\nu(Z \mid X)$ and $L < I$. The stochastic encoding function $q_\nu(Z \mid X)$ is a parameterized probability distribution represented by a neural network with parameter vector $\nu$. To find optimal latent representations $Z$, the VAE jointly optimizes a decoding function $p_\theta(\hat{X} \mid Z)$ that aims to reconstruct $X$ from a sample $\hat{X} \sim p_\theta(\hat{X} \mid Z)$. This way, the parameter vectors $\nu$ and $\theta$ are usually jointly optimized to minimize the distortion $d(X, \hat{X})$ between the input and its reconstruction, given the constraint on $Z$, while reducing the distance between $q_\nu(Z \mid X)$ and some prior $q(Z)$ [31] used to impose some structure or complexity budget.
However, in practical scenarios, one needs to digitally encode the input $X$ into a discrete latent representation. To do this, the VQ-VAE quantizes the latent space by using $N$ $K$-dimensional codewords $z_1, \ldots, z_N \in \mathbb{R}^K$, forming a dictionary with $N$ entries. Moreover, to better represent 3D inputs, the VQ-VAE quantizes the latent representation $Z$ using a set of $F$ blocks, each quantizing one feature $f(X)$ of the input, and chosen from a set of $N$ possible codewords. We denote the set containing all the $N^F$ possible concatenated blocks with $\mathcal{M}(N)$, as it represents the set of all possible messages the observer can use to convey to the robot the information on the observation $O$, by using $F$ discrete $N$-dimensional features. The peculiarity of the VQ-VAE architecture is that it jointly optimizes the codewords in $\mathcal{M}(N)$ together with the stochastic encoding and decoding functions $q_\nu$ and $p_\theta$, instead of simply applying fixed vector quantization on top of learned continuous latent variables $Z$. When the communication budget is fixed, i.e., the value of $L$ is constant, the protocol to solve the remote POMDP is rather simple: first, the observer trains the VQ-VAE with $N = 2^{F^{-1} L}$, so that the $F$ features together use $L$ bits, to minimize the technical, semantic, or effective distortion $d_\alpha$, depending on the problem; then, at each step $t$, the observer computes $\tilde{m} \sim q_\nu(\cdot \mid o_t)$, and finds $m_t = \arg\min_{m \in \mathcal{M}(N)} \lVert \tilde{m} - m \rVert_2$. The message $m_t$ is sent to the robot, which can optimize its decision accordingly.
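The nearest-codeword step at the end of this protocol can be sketched as follows. Since the squared Euclidean distance over the concatenated message is the sum of the per-block distances, the minimization over all concatenated blocks can be carried out independently for each of the F blocks. The toy codebook and latent values are made up for the example, and the function names are not taken from the paper or from any VQ-VAE library.

```python
import math


def quantize(latent_blocks, codebook):
    """Return, for each K-dimensional block, the index of its nearest codeword."""
    indices = []
    for block in latent_blocks:
        # Squared Euclidean distance from this block to every codeword.
        dists = [sum((x - c) ** 2 for x, c in zip(block, codeword))
                 for codeword in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def message_bits(num_blocks, codebook_size):
    """Length in bits of a message made of num_blocks codeword indices."""
    return num_blocks * math.ceil(math.log2(codebook_size))


# Toy example: N = 4 codewords of dimension K = 2, and F = 3 latent blocks.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
latent = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]
message = quantize(latent, codebook)
bits = message_bits(len(latent), len(codebook))
```

Only the indices need to be transmitted, as the robot holds a copy of the codebook and can look up the corresponding codewords.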

B. Dynamic Feature Compression
We can then consider the architecture shown in Fig. 1, consisting of a set of VQ-VAEs $\mathcal{V} = \{\zeta_\emptyset, \zeta_1, \ldots, \zeta_V\}$, where each VQ-VAE $\zeta_v$ compresses each feature using $v$ bits. We also include a null action $\zeta_\emptyset$, which corresponds to not transmitting anything. As we only consider the communication side of the problem, the actor is trained beforehand using the messages with the finest-grained quantization, which are compressed with the VQ-VAE $\zeta_V$ with the largest codebook. This choice ensures that the actor can deal with finer-grained inputs, while still being robust to lower-precision features. The robot can then perform three different tasks, corresponding to the three communication problems: it can decode the observation (Level A) with the highest possible accuracy, using the decoder part of the VQ-VAE architecture; it can estimate the hidden state (Level B) using a supervised learning solution; or it can perform a control action based on the received information and observe its effects (Level C).
In all three cases, the dynamic compression is performed by the observer, based on the feedback from the robot. The reward for the observer side of the remote POMDP is given in (2)-(4): as described in the previous section, the type of reward depends on the communication problem that the observer is trying to solve. At Levels A and B, the observer aims at minimizing distortion in the observation and semantic space, respectively. At Level C, the objective is to maximize the robot's reward. We remark that the only Level at which the decision of the receiver matters is Level C. The semantic estimate describes the physical process, which models the dynamics of the control process that the observer can sense. Optimizing the transmitter to minimize the reconstruction error in the semantic estimate does not consider the decision of the actor (receiver). In general, the physical state of the system can carry redundant or irrelevant information with respect to the agent's decision, so that minimizing the semantic distortion is not equivalent to Level C optimization.
In all three cases, memory is important: representing snapshots of the physical system in consecutive instants, subsequent observations have high correlations, and the robot can glean a significant amount of information from past messages.This is an important advantage of dynamic compression, as it can adapt messages to the estimated knowledge at the receiver side.
While the observer is adapting its transmissions to the robot's task, the robot's algorithms are fixed.They could themselves be adapted to the dynamic compression strategy, but this joint training is significantly more complex, and we consider it as a possible extension of this work.

C. RL implementation
There are two policies in the considered system, one for the observer and one for the robot, and in both cases the policies are learned through the Actor Critic algorithm. This means that an agent learns a parametric policy $\pi_\lambda$ and a Q-value estimator, both of which are represented by neural networks. In order to take into account the past observations, the two networks share a Long Short-Term Memory (LSTM) layer, which estimates a latent state that is then given as input to both the policy and the value estimator. This architecture avoids explicitly modeling the belief distribution, which may be complicated to treat in continuous settings like the one considered in this work. This practical choice also avoids decoding the latent features discovered by the VQ-VAE back into the observation space $\mathcal{O}$ or the physical state space $\mathcal{S}$, a step that would increase the potential for errors. Indeed, the quantized features communicated with the message $m_t$ contain a structured representation of the observation space which can be used effectively by an LSTM to estimate the true state.
The training algorithm is the standard Advantage Actor Critic (A2C), but the replay buffer is appropriately modified to take into account the history of previously received messages.

V. SIMULATION SETTINGS AND RESULTS
The underlying use case analyzed in this work is the well-known CartPole problem, as implemented in the OpenAI Gym library.² In this problem, a pole is installed on a cart, and the task is to control the cart position and velocity to keep the pole in equilibrium. The physical state of the system is fully described by the cart position $x_t$ and velocity $\dot{x}_t$, and the pole angle $\psi_t$ and angular velocity $\dot{\psi}_t$. Consequently, the true state of the system is $s_t = (x_t, \dot{x}_t, \psi_t, \dot{\psi}_t)$, and the semantic state space is $\mathcal{S} \subset \mathbb{R}^4$ (because of physical constraints, the range of each value does not actually span the whole real line). The main simulation parameters are reported in Table II.
At each step $t$, the observer senses the system by taking a black and white picture of the scene, which is in a space $\mathcal{P} = \{0, \ldots, 255\}^{180 \times 360}$. To take the temporal element into account, an observation $O_t$ includes two subsequent pictures, at times $t-1$ and $t$, so that the observation space is $\mathcal{O} = \mathcal{P} \times \mathcal{P}$. An example of the transmission process is given in Fig. 2, which shows the original version sensed by the observer (above) and the reconstructed version at the receiver (below) when using a trained VQ-VAE $\zeta_6$, i.e., an encoder trained with 6 bits per feature, the maximum we consider in this study.
In the CartPole problem, the action space $\mathcal{A}$ contains just two actions, Left and Right, which push the cart to the left or to the right, respectively. At the end of each step, depending on the true state $s_t$ and on the taken action $a_t$, the environment returns a deterministic reward $R_t$, defined in terms of the maximum cart position $x_{\max}$ and pole angle $\psi_{\max}$. The goal for the two agents is thus to maximize the cumulative discounted sum of the reward $R_t$, while limiting the communication cost.

A. The Coding and Decoding Functions
As mentioned in Sec. III-C, the observer can optimize its coding function $\Lambda$ according to different criteria depending on the considered communication problem. However, as we explained in Sec. IV, optimizing $\Lambda$ without any parameters is usually not feasible due to the curse of dimensionality on the action space. Consequently, we rely on a pre-trained set $\mathcal{V}$ of VQ-VAE models, whose codebooks are optimized to solve the technical problem, i.e., minimizing the distortion on the observation measured using the MSE. The training performance of the VQ-VAE with $\zeta_6$, i.e., using the maximum value $V$ of 6 bits per feature, is shown in Fig. 3: the encoder converges to a good reconstruction performance, which can be measured by its perplexity. The perplexity is simply $2^{H(p)}$, where $H(p)$ is the entropy of the codeword selection frequency, and a perplexity equal to the number of codewords is the theoretical limit, which is only reached if all codewords are selected with the same probability. The perplexity at convergence is 54.97, which is close to the theoretical limit ($2^6 = 64$ codewords for 6 bits per feature) for a real application.
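As an illustration of the perplexity metric described above, the following minimal sketch computes $2^{H(p)}$ from a batch of selected codeword indices. The function name `codebook_perplexity` and its arguments are hypothetical and are not taken from the simulation code; the entropy is assumed to be measured in bits.

```python
import numpy as np

def codebook_perplexity(indices, num_codewords):
    """Perplexity 2^H(p) of the empirical codeword selection frequencies."""
    counts = np.bincount(np.asarray(indices).ravel(), minlength=num_codewords)
    p = counts / counts.sum()
    nz = p[p > 0]  # terms with p = 0 contribute nothing to the entropy
    entropy_bits = -np.sum(nz * np.log2(nz))
    return float(2.0 ** entropy_bits)

# With 6 bits per feature there are 64 codewords; uniform usage reaches the limit.
uniform_usage = np.arange(64)
single_codeword = np.zeros(100, dtype=int)
print(codebook_perplexity(uniform_usage, 64))
print(codebook_perplexity(single_codeword, 64))
```

A perplexity well below the number of codewords indicates that part of the codebook is rarely or never selected.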
The observer then uses DRL to foresightedly choose the best VQ-VAE $\zeta_v$ at each time step, maximizing the expected long-term reward for each communication problem. We train the observer to solve each specific coding problem [...]
2) Level B (semantic problem): [...] $d_B(\hat{s}^{(o)}_t, \hat{s}^{(r)}_t)$, as part of the reward defined in (3), and the decoder needs to estimate the underlying physical state $s_t$ by minimizing the MSE, i.e., the distance between $\hat{s}^{(o)}_t$ and $\hat{s}^{(r)}_t$ in the semantic space. In our case, the estimator used to obtain the estimates is a pre-trained supervised LSTM neural network;
3) Level C (effective problem): In this case, there is no direct distortion metric, and the control performance is used directly as in (4). The policy $\pi^{(r)}$ is given by an actor-critic agent implementing an LSTM architecture, pre-trained using data with the highest available message quality (6 bits per feature). In this case, the task depends on all the semantic features contained in $S_t$. However, the 4 components of the state do not carry the same amount of information to the robot: depending on the system conditions, i.e., the state $S_t$, some pieces of information are more relevant than others.

B. Neural Network Architecture and Training
The VQ-VAE architecture is built from Convolutional Neural Network (CNN) layers that extract latent features, and it is trained separately, before the training of the control policy. To this end, a dataset of observations is collected through a random policy. We then train the encoding network, the vector quantization layer, and the decoder jointly, as in the standard VQ-VAE [19]. The first vector quantization layer learned contains the highest number of codewords. Finally, we fix the encoder and the decoder and train only the other vector quantization layers, obtaining multiple quantizers over the same latent space discovered by a common encoder. The hyperparameters used to train the VQ-VAE are reported in Table II. After obtaining the $V$ quantizers, we train the policy using the standard A2C algorithm. Table III shows the Encoder-Decoder layers of the VQ-VAE. In Table IV, the layers of the implemented Regressor and the Actor-Critic neural networks are reported. All the NNs are implemented through the PyTorch library. Once the robot policy has been obtained, we can train the observer policy. The observer learns a policy through the same A2C algorithm, but in this case the inputs to the policy are the features before quantization. A distinct observer policy is trained for each value of the trade-off parameter $\beta$ and for each communication Level.
For further details on the implementation, training and testing process, we refer to the publicly available simulation code.³ The additional computational cost of the architecture on the observer side is well within the computational capabilities of even relatively simple embedded devices [32], and even more complex problems can be dealt with by Edge devices. At the same time, training the actor with compressed representations actually reduces its computational burden, as the feature extraction is performed by the sender. However, if the sender is significantly computationally constrained, replacing the VQ-VAE with a classical compression scheme such as JPEG might be a good way to reduce the cost and still deliver the benefits of dynamic compression.
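The idea of several quantizers sharing one latent space can be sketched as follows. This is only an illustrative example under stated assumptions: the codebooks here are random rather than trained, the latent dimension of 4 is arbitrary, and the names `quantize`, `codebooks`, and `encode_message` are hypothetical. Only the 8 features per message and the 1 to 6 bits per feature come from the text.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codeword."""
    # features: (n_features, dim); codebook: (n_codewords, dim)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
dim, n_features = 4, 8  # dim is an arbitrary choice for this sketch

# One codebook per quantization level v (bits per feature), all defined over
# the same latent space. Random stand-ins for the trained codebooks.
codebooks = {v: rng.normal(size=(2 ** v, dim)) for v in range(1, 7)}
features = rng.normal(size=(n_features, dim))  # stand-in for the encoder output

def encode_message(features, v):
    """Return codeword indices and message length in bits (v = 0: no transmission)."""
    if v == 0:
        return np.empty(0, dtype=int), 0
    idx = quantize(features, codebooks[v])
    return idx, v * len(idx)

idx, n_bits = encode_message(features, 6)
print(n_bits)  # 8 features at 6 bits each
```

Because the encoder is shared, the observer only has to pick the level `v` at each step; the same latent features are quantized more or less finely depending on that choice.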

C. Results
We assess the performance of the three different tasks in the CartPole scenario by simulation, measuring the results over 1000 episodes after convergence. Fig. 4 shows the performance of the various schemes over the three problems, compared with a static VQ-VAE solution with a constant compression level. In the Level C evaluation, we also consider a static VQ-VAE solution in which the robot is not retrained for each $v$, but is only trained for $v = 6$ bits per feature (i.e., 48 bits per message, as the VQ-VAE considers 8 features), as for the dynamic scheme. We trained the dynamic schemes with different values of the communication cost $\beta$, so as to provide a full picture of the adaptation to the trade-off between performance and compression.
We also introduce the notion of Pareto dominance: an $n$-dimensional tuple $\eta = (\eta_1, \ldots, \eta_n)$ Pareto dominates another tuple $\eta'$, written $\eta \succ \eta'$, if it is at least as good as $\eta'$ on every component and strictly better on at least one. We can extend this to schemes with multiple possible configurations. The definition of Pareto dominance for schemes $x$ and $y$ is: $x \succ y \iff \forall \eta(y)\ \exists \eta(x) : \eta(x) \succ \eta(y)$, i.e., for each configuration of scheme $y$, there is a setting of $x$ that Pareto dominates it. In other words, we can always tune scheme $x$ so that it outperforms any configuration of scheme $y$ on all metrics.
We first consider the technical problem performance, shown in Fig. 4a: as expected, the Level A dynamic compression outperforms all other solutions, and its performance is Pareto dominant with respect to static compression. Interestingly, the Level B and Level C solutions perform worse than static compression: by concentrating on features in the semantic space or the task space, these solutions remove information that could be useful to reconstruct the full observation, but is meaningless for the specified task.
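The scheme-level definition of Pareto dominance can be checked numerically with a short sketch. The function names are hypothetical, and the example assumes that every metric is oriented so that higher is better (for instance, by using the negative bitrate); the numbers are invented for illustration and are not results from this work.

```python
def dominates(a, b):
    """True if tuple a Pareto dominates tuple b (higher is better on every metric)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better

def scheme_dominates(configs_x, configs_y):
    """True if every configuration of y is dominated by some configuration of x."""
    return all(any(dominates(ex, ey) for ex in configs_x) for ey in configs_y)

# Illustrative (negative bitrate, performance) pairs, not measured values.
scheme_x = [(-10, 0.90), (-30, 1.00)]
scheme_y = [(-12, 0.80), (-40, 0.95)]
print(scheme_dominates(scheme_x, scheme_y))
print(scheme_dominates(scheme_y, scheme_x))
```

Note that the relation is not symmetric and may fail in both directions, which is the situation described for schemes that are better only in part of the bitrate range.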
In the semantic problem, shown in Fig. 4b, a lower MSE on the reconstructed state is better, and the Level B solution is Pareto dominant with respect to all others.The Level A solution also Pareto dominates static compression, while the Level C solution only outperforms it for higher compression levels, i.e., on the left side of the graph.
Finally, Fig. 4c shows the performance of the effectiveness problem (Level C), summarized by how long the CartPole system manages to remain within the position and angle limits.The Level C solution significantly outperforms all others, but is not strictly Pareto dominant: when the communication constraint is very tight, setting a static compression and retraining the robot to deal with the specific VQ-VAE used may provide a slight performance advantage.In general, almost perfect control can be achieved with less than half of the average bitrate of the static compressor, which can only reach a similar performance at a much higher communication cost.We also note that, in this case, the Level B solution performs worst: choosing the solution that minimizes the semantic distortion is not always matched to the task, as it considers the state variables as having equal weight, while a higher precision may be required when the quantization error affects the robot's action.
Another analysis is conducted on the way the CartPole is controlled with the different communication policies. Fig. 5 shows the Angular Root Mean Squared Deviation (RMSD) (Fig. 5a) and the Position RMSD (Fig. 5b), defined as: $\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - x_{\text{target}})^2}$, where $x_{\text{target}}$ is the desired value of the controlled process, $x_i$ is the recorded value of the process at time step $i$, and $N$ is the number of recorded time steps.
Both RMSDs are computed with respect to the central and vertical position of the CartPole: $x_{\text{target}} = 0$ and $\psi_{\text{target}} = 0$. These results help to evaluate how well the control dynamics keep the CartPole near the optimal central position and to assess the smoothness of the resulting pole oscillations. In general, a higher rate makes it possible to keep the angular RMSD smaller. In particular, the values are the smallest in the Level C system. However, this comes at the cost of deviating more from the central position, as shown in the figure. The policy prioritizes the stabilization of the pole oscillations, though this requires deviating from the central position. This is because swings in the pole's angle are harder to control due to the instability of the inverted pendulum, and there is a significant risk that the pole might go out of the acceptable range, ending the episode.
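A minimal sketch of the RMSD computation is given below, assuming the usual root-mean-square definition over the recorded time steps of an episode. The function name `rmsd` and the sample trajectory are hypothetical.

```python
import numpy as np

def rmsd(values, target=0.0):
    """Root mean squared deviation of a recorded trajectory from a target value."""
    values = np.asarray(values, dtype=float)
    return float(np.sqrt(np.mean((values - target) ** 2)))

# Illustrative pole angles (rad) oscillating around the vertical position.
angles = [0.03, -0.03, 0.03, -0.03]
print(rmsd(angles, target=0.0))
```

The same function applies to the cart position by passing the recorded positions with a target of 0.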

D. Comparison with existing compression approaches
In this section, we compare the performance of our proposed approach to that of other methods. Specifically, we show how digital compression techniques, such as JPEG, perform in the same scenario. We also compare other NN-based compression models and evaluate their performance. For the digital compression, we use different sets of parameters combining image resizing, the quality parameter of the JPEG standard, and the number of grayscale levels. For learning-based compression, we used the CompressAI library, which implements the model proposed in [33].
Fig. 6 shows the performance of other methods with respect to the proposed dynamic feature compression method. The digital compression scheme does not allow the actor to effectively control the system, as its stability is low even though its updates are larger by two orders of magnitude. Obtaining a high control performance would require an extremely high bitrate. On the other hand, the neural compression scheme achieves a higher performance with a limited bitrate, but since the model is a general compression technique designed to compress a wide variety of images, it cannot reach extremely low bitrates. After retraining the scheme on CartPole pictures, it is possible to obtain lower bitrates while improving the resulting control performance. Our approach, which is directly trained on the CartPole task, outperforms all others; however, we do not claim that VQ-VAE is the best compression technique for all scenarios. The main contribution of this work is not in the specific compression scheme, but rather in the dynamic and goal-oriented adaptation of the compression parameters for each transmitted update, which could be directly applied to different NN architectures and even JPEG.

E. Analysis of the communication policy
We can then use an explainability approach to gain further insights on how effective communication operates. Fig. 7 shows the distribution of the quantization level selected by an observer trained for each communication Level (A, B, and C). We note that we adapted the scale of $\beta$ for the three Levels, so as to obtain comparable results: as the reward process takes values in different ranges (e.g., the PSNR is in dB while the reward of the MDP is between −1 and 1), using the same transmission cost would result in very different outcomes. We then chose to rescale the transmission cost parameter to have the full range of average bitrates at each communication Level. The similarity in the compression level distributions at the three Levels is striking. For higher values of $\beta$, the observer selects $\zeta_\emptyset$, which corresponds to no transmission, more often. As $\beta$ decreases, the communication cost becomes lower, and thus the observer chooses longer messages more often. Another common pattern is that quantizing features using 1, 2 or 3 bits is a rare choice. This shows that the memory implemented implicitly in the system through the LSTM is powerful enough to obtain adequate beliefs based on past messages, so that the observer can rely on it and not send anything. A roughly quantized update has a relatively low value, as its novelty is limited, and transmitting intermittent updates at a higher quality results in a better performance.
However, the real difference between the three policies is given by when they decide not to transmit. Therefore, we propose an analysis based on the visualization of the observer policy and the receiver policy. In Fig. 8, four colormaps show different policies projected in the same domain: the pole angle on the x-axis and the cart velocity on the y-axis. More [...] where $p(a)$ is empirically estimated by counting the number of times each action is chosen when the state is in the projected cell. Fig. 8c and Fig. 8d show the average length of the update packets, i.e., the number of Bytes transmitted in each cell when optimizing for Levels A and C, respectively. This can be seen as the average number of bits that the transmitter allocates for each projected slice of the state space. Fig. 9 shows the same results but for a different physical state projection, mapping the angle $\psi$ on the x-axis and the pole angular velocity $\dot{\psi}$ on the y-axis. In both figures, there is a strong correspondence between the states where the robot entropy is higher and the states where the Level C policy allocates a higher number of bits. This confirms that an effective observer policy manages to discriminate the uncertainty at the robot side. In regions of the state space where it is more difficult to retrieve the correct action, i.e., the action entropy is higher, the observer will provide the robot with more precise information by sending longer messages. There are regions where the robot action is always the same, e.g., whenever the cart is moving fast and the tip of the pole is pointing to the same side the cart is moving towards. In these cases, the entropy is extremely low, and the transmitter can avoid sending new updates to the robot. This is because, even if the estimated state at the receiver differs from the observed one, the action to perform remains the same: pushing the cart further to try to get the pole more vertical. Recalling (5), we note that if $Q\big(\xi^{(r)}_t, \pi^{(r)}(\xi^{(r)}_t)\big)$ is very sensitive to small variations in $\xi^{(r)}_t$, then the gap in (5) is going to be significant, leading the observer to choose to send precise information. In principle, a Level C transmitter could reduce the message length or even avoid transmission as long as the robot is able to choose the correct actions, even though its belief is incorrect. An optimal communication scheme approximately follows [...], which means that the message length is roughly proportional to the VoI. This concept might be used when defining a heuristic policy, which behaves similarly to the effective communication policy but is much simpler to design and implement. Note that this condition includes two separate cases in which a Level C observer chooses not to transmit, while Level A and B transmitters would send precise data:
• The action corresponding to the prior belief is the same as the one after the updating message. In this case, the VoI of the communicated message is low and thus we can lower $\ell_t$;
• The action is different after the communicated message, but the long-term rewards are close enough that the robot is not going to benefit too much from choosing the other action. Even in this case, sending less information is not going to affect the control performance significantly.
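The per-cell action entropy used in this kind of analysis can be estimated from logged trajectories, as in the sketch below. It assumes that states have already been discretized into cell indices of a 2D projection and that the entropy is measured in bits; the function name `action_entropy_map` and its arguments are hypothetical.

```python
import numpy as np

def action_entropy_map(x_idx, y_idx, actions, shape, n_actions=2):
    """Empirical action entropy (bits) in each cell of a 2D state projection."""
    x_idx, y_idx, actions = (np.asarray(a, dtype=int) for a in (x_idx, y_idx, actions))
    counts = np.zeros(tuple(shape) + (n_actions,))
    # Count how many times each action was chosen in each projected cell.
    np.add.at(counts, (x_idx, y_idx, actions), 1)
    totals = counts.sum(axis=-1, keepdims=True)
    # p(a) per cell; cells that were never visited keep p = 0 everywhere.
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    logp = np.log2(p, out=np.zeros_like(p), where=p > 0)
    return -(p * logp).sum(axis=-1)

# Cell (0, 0): both actions equally likely; cell (1, 0): always the same action.
H = action_entropy_map([0, 0, 1, 1], [0, 0, 0, 0], [0, 1, 1, 1], shape=(2, 1))
print(H)
```

Cells with entropy close to zero correspond to regions where the robot's action is always the same, which is where a Level C observer can skip transmissions.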
These cases cannot be taken into account in Levels A and B. Indeed, the Level A policy shown in Fig. 8c tends to allocate communication resources in the states where the picture is changing more rapidly, so that the memory available to the robot is less useful to estimate the current observation, regardless of the correct action.As the cart speed ẋ increases along the y-axis, the number of bits increases too.The same reasoning can be applied to the results in Fig. 9.
Another general principle that we can deduce for an effective policy is that it should be aware of variations of the value function with respect to the belief. If the value function is strongly affected by small perturbations of the belief, then the effective policy should communicate more information in order to reduce the discrepancy between $\xi^{(o)}_t$ and $\xi^{(r)}_t$. This reasoning can be intuitively understood by looking at the differential of the robot's value function $Q\big(\xi^{(r)}_t, \pi^{(r)}(\xi^{(r)}_t)\big)$ with respect to changes in its belief distribution $\xi^{(r)}_t$. When this value is big, an inaccurate estimation of the state would cause a poor estimation of the value function, which may in turn cause the robot to choose a low-quality action.
In Fig. 10, we provide an analysis of the communication strategy with respect to different AoI values. This shows how the memory of the robot and of the observer plays a crucial role in the communication decisions. In particular, we consider five values of the AoI: AoI = 0 indicates that a message of any length was transmitted in the previous time step, while AoI = $\Delta$ with $\Delta \in \{1, 2, 3, 4\}$ means that no messages have been received by the robot for $\Delta$ time steps since the last received message. This is a measure of how up to date the memory of the robot is, allowing us to evaluate the next choice of the observer for a given age. We then consider the distribution of the observer actions (y-axis) with respect to different ranges of the robot action entropy (x-axis). This means that, for each entropy interval, we count the number of times each action is performed, in order to obtain an empirical distribution. The columns are normalized so that each cell shows the probability that the observer chooses a specific $\ell_t$ whenever the robot action entropy falls within the corresponding interval, for different values of the AoI.
Fig. 10a clearly shows that, if there was a transmission in the previous time step (AoI = 0), it is very unlikely that the system is going to be updated again in the current time step. As remarked above, Figs. 10a-e show the case with $\beta = 0.15$, for which the observer almost always picks $\zeta_4$ to transmit. On the other hand, Figs. 10f-j show the case with $\beta = 0.1$, in which the agent sometimes selects other codebooks due to the lower transmission costs. The observer often chooses to communicate if AoI = 1, with an exception if the system is in a very low entropy state, in which case the probability of communicating using $\zeta_4$ is similar to the one corresponding to action $\zeta_\emptyset$. If we look at the behavior for higher values of the AoI, we can notice a general trend: communication is more likely to happen in higher entropy states than in lower entropy ones. This shows that the observer policy understands the cases where the state has to be precisely estimated by the robot to choose its action correctly. Additionally, the probability of transmitting an update actually decreases when the AoI reaches higher values. The observer will only skip several consecutive transmission opportunities in two cases: either the system state is highly predictable, and the actor can rely on its past knowledge to get a precise estimate of it, or the two actions are almost equivalent, e.g., if the pole is balanced vertically. The former case is more likely to be a low-entropy state, while the latter is high-entropy, but has a small difference between the rewards for the two actions. We can see that the trend holds for different values of $\beta$ by looking at Figs. 10f-j: if we decrease the value of $\beta$, the observer tends to transmit more often, and to use longer messages when it transmits, but the general tendency to transmit more whenever the robot action entropy is high clearly holds. This final analysis provides an easy heuristic for effective communication when the value function is not available or cannot be learned.
Fig. 11 shows that the Level A policy allocates communication resources without considering the entropy of the control actions. While Fig. 10 showed a clear monotonic trend in the probability of selecting $\zeta_4$, which increased as the entropy increased, the pattern is much weaker in this case. The value of $\beta$ for this case was chosen to get a similar overall bitrate (and, as we discussed, a similar overall action distribution) to the Level C case with $\beta = 0.15$. As we discussed above, there is a weak correlation between the action entropy and features such as the angular and cart velocities, but it is the latter that the Level A policy considers: as the difficulty of accurately reconstructing the image increases with the speed of the CartPole system, more unstable states have more frequent transmissions.
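The column-normalized distributions described for this analysis can be built from logged data as in the sketch below. It assumes one log entry per time step with the robot action entropy and the quantization level chosen by the observer (0 meaning no transmission); the function name, the bin edges, and the sample values are hypothetical. Conditioning on the AoI would amount to filtering the samples by age before calling the function.

```python
import numpy as np

def action_given_entropy(entropy, obs_action, bin_edges, n_levels):
    """P(observer action | entropy bin): rows are actions, columns are entropy bins."""
    entropy = np.asarray(entropy, dtype=float)
    obs_action = np.asarray(obs_action, dtype=int)
    # Inner edges only, so that every sample falls into one of the columns.
    bins = np.digitize(entropy, bin_edges[1:-1])
    counts = np.zeros((n_levels, len(bin_edges) - 1))
    np.add.at(counts, (obs_action, bins), 1)
    col = counts.sum(axis=0, keepdims=True)
    # Normalize each column; empty columns stay at zero.
    return np.divide(counts, col, out=np.zeros_like(counts), where=col > 0)

# Levels 0..6: 0 = no transmission, v = 1..6 bits per feature.
P = action_given_entropy(
    entropy=[0.1, 0.2, 0.9, 0.95],
    obs_action=[0, 0, 4, 0],
    bin_edges=[0.0, 0.5, 1.0],
    n_levels=7,
)
print(P[:, 1])
```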

VI. CONCLUSION
In this work, we presented a dynamic feature compression scheme that can exploit an ensemble VQ-VAE to solve the semantic and effective communication problems.The dynamic scheme outperforms fixed quantization, and can be trained automatically with limited feedback, unlike emergent communication models that are unable to deal with complex tasks.The choices made by the observer are clearly tied to the control policy of the robot it aims to help, significantly outperforming a simpler optimization that does not take into account the semantic and effective problems.We also analyzed the optimal policies to draw insights on their decisions, showing that the Level C optimization indeed considers the robot's policy.
A natural extension of this model is to consider more complex tasks and wider communication channels, corresponding to realistic control scenarios, or scenarios with multiple transmitters with partial information about each other and the robot.Considering a more realistic channel model, which has a loss probability and time-varying statistics in addition to the transmission cost, would also be an interesting direction, joining our model with Joint Source Channel Coding (JSCC) theory.Dynamically adapting JSCC parameters with the goal of helping a remote DRL agent would be a natural extension of our proposed approach for more realistic wireless scenarios.Another interesting direction for future work is to consider joint training of the robot and the observer, or cases with partial information available at both transmitter and receiver.
[...] sufficient statistic $i(s)$ of any given state $s \in \mathcal{S}$, which is enough to determine the robot's performance in that state. Denoting the number of bits required to represent a realization of random variable $X$ as $b(X)$, we consider a case in which:

b(i(S)) < b(S) < b(O).
Indeed, the observation may contain much more information than needed to estimate the state [30], and lossily compressing the message to preserve the relevant information, removing redundant or irrelevant details, can ease communication requirements without any performance loss. We can also observe that $i(S) \to S \to O$ is a Markov chain. The random quantity $i(S)$ represents the minimal description of the system with respect to the robot's task, i.e., no additional data computed from $S$ adds meaningful information for the robot's policy. The state $S$ may also include task-irrelevant physical information on the system. However, both $S$ and $i(S)$ are unknown quantities, as the observer only receives a noisy and high-dimensional representation of $S$ through $O$. This is a well-known issue in DRL: in the original paper presenting the Deep Q-Network (DQN) architecture [34], the agent could only observe the screen while playing classic arcade videogames, and did not have access to the much more compact and precise internal state representation of the game. Introducing communication and dynamic encoding adds another layer of complexity.
We can then consider the case in which communication is limited to a maximum length of $L$ bits, i.e., to $2^{L+1} - 1$ messages, considering all possible lengths lower than or equal to $L$, including no communication. Naturally, this assumes that the receiver has a way to discriminate between messages of different lengths, e.g., through a MAC layer header. The channel is ideal, i.e., instantaneous and error-free, but it includes a constant cost per bit as in the observer reward we gave in the previous section. Consequently, the problem introduces an information bottleneck between the observation $O_t$ and the estimate $\hat{o}_t$ that the robot can make, based on the message $M_t$ conveyed through the channel. If we define a distortion measure over the observation space $d_A : \mathcal{O}^2 \to \mathbb{R}^+$, any communication introduces a non-zero distortion $d_A(o, \hat{o})$ whenever $b(o) > L$, whose theoretical asymptotic limits are given by rate-distortion theory [30]. If we also consider memory, i.e., the use of past messages in the estimation of $\hat{o}$, the mutual information between $o$ and the previous messages can be used to reduce the distortion, improving the quality of the estimate.
In the semantic problem, the aim is to extrapolate the real physical state of the system $S_t$ from the compressed observation $M_t$, which can be a complex stochastic function. In general, the real state lies in a low-dimensional semantic space $\mathcal{S}$. The term semantic is motivated by the fact that, in this case, the observer is not just transmitting pure sensory data, but some meaningful piece of physical information about the system. Consequently, the distortion to be considered in this case can be represented by a measure $d_B : \mathcal{S}^2 \to \mathbb{R}$ over the semantic space, so that the distortion $d_B(\hat{s}^{(o)}_t, \hat{s}^{(r)}_t)$ is computed between the observer's best estimate of the state and the one performed by the robot based on $M_t$, and on its memory of past messages. Finally, to be even more efficient and specific with respect to the task, the observer may optimize the message $M_t$ to minimize a distortion measure $d_C\big(i(\xi^{(o)}_t), i(\xi^{(r)}_t)\big)$ between the effective representation of the observer's belief on the state, which contains only the task-specific information, and the knowledge available to the robot. Naturally, any message instance $m_t \in \mathcal{M}$ must be at most $L$ bits long, in order to respect the constraint. However, defining the sufficient statistic $i(\xi^{(o)}_t)$ may be highly complex and problem-dependent, and using the robot's reward as a direct performance measure is significantly more direct, with the same guarantees.
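The count of $2^{L+1} - 1$ possible messages follows from summing the number of binary strings of each length from 0 (no communication) to $L$, i.e., $\sum_{\ell=0}^{L} 2^\ell$. The short sketch below verifies this by brute-force enumeration; the function name `count_messages` is hypothetical.

```python
from itertools import product

def count_messages(max_len):
    """Count all binary strings of length 0..max_len (length 0 = no transmission)."""
    return sum(1 for n in range(max_len + 1) for _ in product("01", repeat=n))

for L in range(5):
    # Brute-force count next to the closed-form expression.
    print(L, count_messages(L), 2 ** (L + 1) - 1)
```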
[...] $\forall t$, where $x_{\max} = 4.8$ m and $\psi_{\max} = \frac{2\pi}{15}$ rad (equivalent to 24°) are the maximum values for the two quantities. If the angle or cart position goes outside the boundaries, the episode is over, and the agents do not accumulate any more reward. The goal for the two agents is [...]
² https://www.gymlibrary.dev/environments/classic_control/cart_pole/
(a) Original. (b) Reconstructed.

Fig. 2: Example of the original and reconstructed observation.

Fig. 4: Performance of the communication schemes on the three Levels of the remote POMDP.
Fig. 5: (a) Angular RMSD from the central position. (b) Position RMSD from the central position.

Fig. 8: Analysis of the transmission policy as a function of the pole angle $\psi$ and cart linear velocity $\dot{x}$. The bitrate is measured in Bytes per transmission.

Fig. 9: Analysis of the transmission policy as a function of the pole angle $\psi$ and angular velocity $\dot{\psi}$. The bitrate is measured in Bytes per transmission.
[...] $\xi^{(r)}_t$. This reasoning can be intuitively understood by looking at the differential of the robot's value function $Q\big(\xi^{(r)}_t, \pi^{(r)}(\xi^{(r)}_t)\big)$ [...]

Fig. 11: Level A observer action distribution for different robot action entropy levels with β = 1.

TABLE I: Main notation and definitions.

TABLE II: Simulation Parameters.