Blind Interactive Learning of Modulation Schemes: Multi-Agent Cooperation Without Co-Design

We examine the problem of learning to cooperate in the context of wireless communication. In our setting, two agents must learn modulation schemes that enable them to communicate across a power-constrained additive white Gaussian noise channel. We investigate whether learning is possible under different levels of information sharing between distributed agents which are not necessarily co-designed. We employ the “Echo” protocol, a “blind” interactive learning protocol where an agent hears, understands, and repeats (echoes) back the message received from another agent, simultaneously training itself to communicate. To capture the idea of cooperation between “not necessarily co-designed” agents we use two different populations of function approximators — neural networks and polynomials. We also include interactions between learning agents and non-learning agents with fixed modulation protocols such as QPSK and 16QAM. We verify the universality of the Echo learning approach, showing it succeeds independent of the inner workings of the agents. In addition to matching the communication expectations of others, we show that two learning agents can collaboratively invent a successful communication approach from independent random initializations. We complement our simulations with an implementation of the Echo protocol in software-defined radios. To explore the continuum of co-design, we study how learning is impacted by different levels of information sharing between agents, including sharing training symbols, losses, and full gradients. We find that co-design (increased information sharing) accelerates learning. Learning higher order modulation schemes is a more difficult task, and the beneficial effect of co-design becomes more pronounced as the task becomes harder.


I. INTRODUCTION
Machine learning is a technology and associated design paradigm that has recently seen a resurgence, largely due to advances in computational capabilities. Consequently, there has been increasingly active research in supervised and reinforcement learning, both in the underlying technology and in the development of design paradigms appropriate to using these technologies in diverse application contexts. This paper¹ asks whether machine learning paradigms can help us achieve interoperability in a wireless communication setting. The established paradigm for interoperation is that of standards: communication protocols are not only handcrafted by individual humans, but are also standardized and certified by authorized committees of people. Can we instead use machine learning techniques to learn how to communicate with minimal assumptions on shared information, and if so, how well can we learn?
Communication is a fundamentally cooperative activity between at least two agents. Consequently, communication itself can be viewed both as a special case of cooperation and as a building block that can be leveraged to permit more effective cooperation. The fundamental limits to learning how to cooperate with a stranger have been studied in an abstract theoretical setting in [2], [3], [4], [5], [6]. By asking how two intelligent agents might understand and help each other without a common language, these papers developed a basic theory of goal-oriented communication. The principal claim is that for two agents to robustly succeed in the task of learning to collaborate, the goal must be explicit, verifiable, and forgiving. However, the approach in these works took a fundamentally semantic perspective on cooperation. As Shannon pointed out in [7], the arguably simpler cooperative problem of communicating messages can be understood in a way that is decoupled from the issue of what the messages mean.
To see whether existing machine learning paradigms can be adapted to achieve cooperation with strangers, we consider the concrete problem where two agents learn to communicate in the presence of a noisy channel. Each agent consists of a modulator and a demodulator and must learn compatible modulation schemes to communicate with each other. This problem of learned communication has been tackled using learning techniques under different assumptions on the information that the agents are allowed to share and how tightly coordinated their interaction is. Early work in this area [8], [9], where gradients are shared among agents, demonstrated the success of training a channel auto-encoder using supervised learning when a stochastic model of the channel is known. Subsequent works relax the assumption of a known channel model, either by learning a stochastic channel model using GANs as in [10], [11], [12] or by approximating gradients across the channel and using them for training. However, these approaches cannot be said to represent communication with strangers; instead they represent a way of having co-designed systems learn to communicate. If, instead of sharing gradients, the agents can only share scalar loss values and have access to a shared preamble, then reinforcement-learning-style techniques must be used to train the system, as demonstrated in [13] and [14], since they can work without access to an explicit channel model.
Moving closer to minimal co-design, if we further restrict ourselves to the case where the two agents only have access to a shared preamble, the "Echo" protocol specified in [13], in which an agent hears, understands, and repeats (echoes) back the message received from the other agent, has been shown to work. By comparing the original message to the received echo, a learning agent can get feedback about how well the two agents understand each other². The work in [13] considered a neural-network-based modulator that was trained using reinforcement learning via policy gradients [15], but the demodulator was nearest-neighbor based and required no training: it used small-sample-based supervised learning. Our work in the present paper builds on this and studies the truly "blind" case where agents do not have access to a shared preamble.
We dub this "blind interactive learning" to acknowledge the motivational connection with the well-known and traditional problems of blind equalization and blind system identification, where we have to deal with a channel and implicitly learn a model for it without knowing the actual input to the channel. (See, for example, the book [16] for a survey of well-understood approaches.) Such blind approaches are fundamentally motivated by the desire for universality and the resulting robust modularity. Traditional blind approaches in signal processing are intellectually akin to what are called unsupervised learning approaches in machine learning. Reinforcement learning has always occupied a middle ground between supervised and unsupervised learning because, in a sense, it is self-supervised and carried out via interaction with an environment. For us, "blind interactive learning" involves agents interacting blindly via an environment, where the individual agents might be self-supervised but there is no explicit joint self-supervision. They are blind in the traditional sense of not really knowing what exactly went into the channel whose output they are observing and, in particular, not sharing a known training sequence.
It is important to differentiate here between our work on blind interactive modulation learning and automatic modulation recognition (AMR) as in [17]. AMR seeks to take an unknown signal present in the environment and classify it as one of a set of known modulation schemes for subsequent demodulation. No interaction with the signal source occurs; indeed, for surveillance applications interaction might destroy all value! By contrast, our work requires interaction to learn how to demodulate any possible signal, even ones never before seen or imagined, and further to learn to modulate in a similarly arbitrary way understandable by an agent on the other side of a communication channel. This interaction takes the form of round-trip training so that both the modulation and demodulation functions can be updated. Although AMR-based techniques could play a role in the demodulation half of this process by speeding up learning for known signals, they are insufficient on their own to complete the circle. Our work introduces a new problem, blindly learning an entire modulation and demodulation scheme, rather than introducing a new technique for classifying existing modulation schemes.
² Round-trip stability is not by itself a sufficient condition to guarantee mutual comprehension. After all, one agent might only be doing simple mimicry, repeating back the raw analog signal value received with no attempt to actually demodulate. However, in [13], the key insight was that intelligent agents, though strangers, are believed to be cooperative and so wish to actually understand and communicate with each other. They do not need to actually coordinate with another designer to realize that sheer mimicry would not necessarily advance their goal of cooperation. Consequently, the Echo protocol can rely on good intentions to eliminate the possibility of agents just mirroring what has been heard instead of trying to understand what was sent and repeating it back.
We also introduce the concept of "alienness" among agents. After all, if our goal is to understand learning of communication between strangers, we need to be able to test with strangers. A natural question is what it means for two agents to be alien. For now, we consider agents to be alien if they differ in their learning architecture or their hyperparameters. In this work we examine modulators and demodulators represented using different types of function approximators, such as neural networks and polynomials.
Our main contribution is to show how to make the Echo protocol blind and to investigate the extent to which it is universal, i.e., whether it allows two agents to learn to communicate irrespective of their inner workings. To do this we study what level of information sharing is necessary for successful learning. Although we do not yet have a formal proof of universality, we provide some empirical evidence by pairing up agents with different levels of "alienness" based on the hyperparameters, architectures, and techniques used in their modulators and demodulators. By doing so we wish to separate the effect of the agents' implementations, such as effects owed to specific function approximators, from the meta-protocols (specifically the Echo protocol) used to do the interactive learning. To both connect to the literature and explore the spectrum between complete co-design and cooperative learning among strangers, we look at different levels of information sharing: shared gradients, shared loss information, shared preamble, and finally the case where only the overall protocol is shared.
Machine learning scholarship is notorious for producing results that are not easily reproducible, and for failing to identify the source of, and explain the reasoning behind, performance gains [18]. Keeping this in mind, in order to evaluate the ease, speed, and robustness of the learning task under various levels of alienness and information sharing, we conduct repeated trials for each setting using different random seeds and multiple sets of hyperparameters. We report the fraction of trials that succeeded as a function of the number of symbols exchanged, as well as aggregate statistics about the bit error rate achieved at different signal-to-noise ratios (SNR) by the learned modulation schemes. We compare our experimental bit error rates to those achieved by optimal modulation schemes for AWGN channels to provide a baseline. The code used to generate the results is available in [19]. From our experiments we conclude that the Echo protocol does enable two agents to learn a modulation scheme even under minimal shared information, and that as we decrease the amount of shared information the learning task becomes harder, i.e., a lower fraction of trials succeed and the agents take longer to learn.
It appears that learning to communicate with "alien" agents is not necessarily more difficult than learning to communicate with agents of the same type. However, it is significantly easier to learn to communicate if one of the agents already uses a good modulation scheme, for example a hand-designed scheme like QPSK. Finally, as we increase the modulation order for communication the learning task becomes harder, especially so for settings with little information sharing.
Although the majority of the experiments reported in this paper were performed purely in simulation, we replicate our main results using USRP radios and observe similar behavior: two agents can learn to communicate in a decentralized fashion even using real hardware.

II. RELATED WORK
Deep learning has shown great success in tasks such as computer vision, natural language processing, and more recently robotics, which historically relied on multi-stage processing using a series of well-designed, hand-crafted features. Wireless communication is another area that historically uses hand-crafted schemes for various processing stages such as modulation, equalization, demodulation, and encoding and decoding using error correcting codes. Thus, as alluded to in [20] and [21], one might believe that bringing deep learning into wireless communication is a worthwhile endeavor. In fact, learning and deep learning have been present in the subfield of AMR since at least the 1980s. AMR has undergone a similar transition from hand-crafted features, such as phase difference and amplitude histograms as detailed in [17], to modern deep learning techniques [22], [23], [24], [25].
Beyond modulation recognition, the pioneering work in [8], [9] demonstrated the promise of the channel auto-encoder model by using supervised learning techniques to learn an end-to-end communication scheme, including both transmission and reception. This approach assumes knowledge of an analytical (differentiable) model of the channel and the ability to share gradient information between the receiver and transmitter. This approach was a natural first step given the known connection between auto-encoders and compression (see, e.g., [26]) as well as the well-known duality between source coding (compression) and channel coding (communication) [7].
Building on this foundation, other works deal with the case where the channel model is unknown, as is the case when we perform end-to-end training over the air. In [14], a stochastic model that allows backpropagation of gradients to approximate the channel is used with a two-phase training strategy. Phase one involves auto-encoder-style training using a stochastic channel model, whereas phase two involves supervised fine-tuning of the receiver part of the auto-encoder based on the labels of messages sent by the transmitter and the IQ samples recorded at the receiver. This approach relies on starting out with a good stochastic channel model. The use of Generative Adversarial Networks to learn such models is explored in [10], [11], [12]. In [27], instead of estimating the channel model, stochastic approximation techniques are used to calculate approximate gradients across the channel. The idea of approximating gradients at the transmitter has also been used in [28] to successfully perform end-to-end training.
In the absence of a known channel model, reinforcement learning can also be used to train the transmitter, as demonstrated in [13] and [14]. In [13], the Echo protocol (a learning protocol where an agent hears, understands, and repeats back the message received from the other agent) was used to obtain a scalar loss that was used to train the neural-network-based transmitter using policy gradients.
Here the receiver used a lightweight nearest-neighbor-based scheme that was trained afresh in each communication round. This work assumed that the agents have access to a shared preamble, so we dub it Echo with Shared Preamble (ESP). In [14] both the transmitter and receiver were neural-network based. The receiver was trained using supervised learning, whereas the transmitter was trained using policy gradients by passing scalar loss values obtained at the receiver back to the transmitter. Reinforcement learning techniques have the added advantage of being implementable in software-defined radios to perform end-to-end learning over the air. To do this one must tackle the issue of time synchronization between the transmitted and received symbols, as done in [29] and [30]. In [31], the general problem of synchronization in wireless networks is addressed via the use of attention models.
Other parts of the communication pipeline, such as channel equalization and error correcting code encoding and decoding, have also been studied using machine learning techniques. The use of neural networks for equalization is studied in [32] and [33]. Construction and decoding of error correcting codes is considered in [20], [34], [35], and [36]. Joint source-channel coding is an area where performance gains are possible through co-design, as demonstrated in [37] for wireless communication and in the application of wireless image transmission in [38]. End-to-end auto-encoder-style training continues to be an area of interest in wireless communication. Recent work in [39], [40], [41], and [42] demonstrates the success of convolutional-neural-network-based architectures and block-based schemes in this setting. This approach has also been used successfully in OFDM systems [43] to learn the symbols transmitted over the sub-carriers. Deep learning techniques and auto-encoder-style training have been used in the fields of fiber-optic [44], [45] and molecular communication [46], [47] to model the channel and to leverage the channel model to learn communication schemes that achieve low error rates.
A theoretical analysis of the general learning-to-cooperate problem is carried out in [2], [3], [4], [5], [6]. This body of work investigates the possibility for two intelligent beings to cooperate when a shared context is absent or limited. In particular, this work also does not presume a pre-existing communication protocol. In asking how two intelligent agents might understand each other without a common language, a theory of goal-oriented communication is developed. The principal claim is that for two agents to robustly succeed in the cooperative task, the goal must be explicit, verifiable, and forgiving. Agents should have feedback about whether the goal is achieved or not, and it should be possible for the agents to achieve the goal from any state that is reached after a finite set of actions. The works [48], [49], [50], [51] realize these ideas in a limited setting.
From a psychological perspective, developmental psychology [52] provides a rich account of how human infants learn to communicate. How do babies come to understand sounds, words, and meaning? It begins with the development of "categorical perception of sound", which creates discrete categories of sound perception, not unlike the task of demodulation. Later on, other tasks emerge, such as word segmentation, attributed to statistical learning, wherein the child grows increasingly aware of sounds and words that belong together. Soon after, the child engages in babbling as an exploration of language production, investigating rhythm, sound, intonation, and meaning, a task similar to modulation. Important to all the above processes is social interaction and exchange, most often between child and caretaker, which provides the rich information required for learning to be successful.

III. OVERVIEW
A. PROBLEM FORMULATION
We consider the setting where two agents communicate in the presence of a discrete-time additive white Gaussian noise (AWGN) channel. Each agent consists of an encoder (modulator) and a decoder (demodulator). We treat the modulator as an abstract (black box) object that converts bit symbols to complex numbers, i.e., we treat it as a mapping M : B → C, where B refers to the set of bit symbols and C refers to the set of complex numbers. Similarly, we treat the demodulator as an abstract object that converts complex numbers to bit symbols, i.e., a mapping D : C → B. The set of bit symbols B is specified by the modulation order (bits per symbol). For instance, when bits per symbol is 1, B = {0, 1}, and when bits per symbol is 2, B = {00, 01, 10, 11}. For the case where bits per symbol is 1, the classic³ BPSK (binary phase shift keying) modulation scheme maps the two bits to antipodal points on the real axis, and the corresponding demodulator recovers the bit from the sign of the real part of the received symbol.
In addition to agents that use fixed modulation schemes, we also consider "learning" agents. These agents use function approximators to learn the mappings performed by the modulator and demodulator, and we denote these as M(·; θ) and D(·; φ), where θ and φ denote the parameters of the underlying function approximators and are updated during training. The specifics of the learning agents and their update methods can be found in Appendix B.
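As a concrete illustration of the abstract mappings M : B → C and D : C → B, the following minimal sketch implements a fixed BPSK agent over a unit-gain AWGN channel using only the Python standard library. The sign convention (bit 0 maps to -1, bit 1 maps to +1), the SNR definition (unit signal power divided by total complex noise power), and all function names are assumptions made for this example rather than definitions taken from the paper.

```python
import math
import random

# Assumed convention for this sketch: bit 0 -> -1, bit 1 -> +1 on the real axis.
BPSK_POINTS = {0: complex(-1.0, 0.0), 1: complex(1.0, 0.0)}


def modulate(bit):
    """M : B -> C for one bit per symbol."""
    return BPSK_POINTS[bit]


def demodulate(symbol):
    """D : C -> B, deciding on the sign of the real part."""
    return 1 if symbol.real > 0 else 0


def awgn(symbol, snr_db, rng):
    """Unit-gain AWGN channel; the noise power is split evenly over I and Q."""
    noise_power = 10 ** (-snr_db / 10)  # assumes unit signal power
    sigma = math.sqrt(noise_power / 2)
    return symbol + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def bit_error_rate(n_bits, snr_db, seed=0):
    """Empirical half-trip BER of the fixed BPSK scheme."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_bits):
        bit = rng.randint(0, 1)
        errors += demodulate(awgn(modulate(bit), snr_db, rng)) != bit
    return errors / n_bits
```

A learning agent would replace the lookup table and the sign test with parameterized function approximators, while the channel model stays the same.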
The main focus of our work is on learning modulation schemes. Thus, to make it easier to conduct experimental simulations, we make the following simplifying assumptions:
1) There are at most two agents, and they engage in perfect turn-taking.
2) The two agents are separated by a unit-gain AWGN channel. There is no carrier frequency offset, timing offset, or phase offset.
3) Both agents encode and decode data using the same, fixed number of bits per symbol (i.e., the modulation order is preset). Section VI-A describes the modulation orders and their reference modulation schemes used in this paper.
4) The environment is stationary and non-adversarial during the learning process.

B. MOTIVATION AND APPROACH - ECHO WITH PRIVATE PREAMBLE PROTOCOL
The main objective of our work is to specify a robust communication-learning protocol that allows two independent agents to learn a modulation scheme under minimal assumptions on information sharing beyond shared knowledge of the learning protocol and the ability to take turns. No other information is shared a priori or via a side channel during training. We name this protocol Echo with Private Preamble (EPP). Details about the EPP protocol are provided in Section IV-A. The EPP protocol is a special variant of the Echo protocol described in Fig. 1.
The underlying premise of the Echo protocol is that an echo of the message, originating from one agent and repeated back to them by another agent, provides sufficient feedback for an agent to learn expressive modulation schemes. Under the Echo protocol, one agent (the "speaker") broadcasts a message (the preamble) and receives back an estimate of this message, an echo, from the other agent (the "echoer"). The passage of the original message from the speaker to the echoer and back to the speaker as an echo is denoted as a round-trip. (A half-trip goes only from speaker to echoer.) After a round-trip, the speaker compares the original message to the echo and trains its modulator and demodulator to minimize the difference (usually measured in bit errors) between the two messages. The two agents then switch roles and repeat. When the difference between the original message and the demodulated echo is small, we infer that the agents can communicate with one another.
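The round-trip described above can be sketched with two non-learning agents. The example below is an illustration only: both agents are assumed to use the same Gray-coded, unit-power QPSK constellation with nearest-neighbor demodulation, the noise level is set directly through a per-component standard deviation `sigma`, and every name is hypothetical. It shows how a speaker obtains a bit-error count from an echo without the echoer ever being told the preamble.

```python
import math
import random

# Gray-coded QPSK with unit average power (a standard constellation, assumed here).
S = 1 / math.sqrt(2)
QPSK = {
    (0, 0): complex(S, S),
    (0, 1): complex(-S, S),
    (1, 1): complex(-S, -S),
    (1, 0): complex(S, -S),
}


def modulate(bits):
    return QPSK[bits]


def demodulate(symbol):
    # Nearest-neighbor decision over the constellation points.
    return min(QPSK, key=lambda bits: abs(symbol - QPSK[bits]))


def channel(symbol, sigma, rng):
    # AWGN with standard deviation sigma on each of the I and Q components.
    return symbol + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))


def round_trip(preamble, sigma, rng):
    """Speaker -> echoer -> speaker; returns the speaker's estimate of the echo."""
    echo = []
    for bits in preamble:
        heard = demodulate(channel(modulate(bits), sigma, rng))   # half-trip at the echoer
        echo.append(demodulate(channel(modulate(heard), sigma, rng)))  # echo back at the speaker
    return echo


def bit_errors(sent, received):
    """Number of differing bits between the preamble and the demodulated echo."""
    return sum(a != b for s, r in zip(sent, received) for a, b in zip(s, r))
```

In a learning setting, the value returned by `bit_errors` is the only feedback the speaker has; the echoer's internal decisions (`heard`) are never revealed to it.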
The ideas here are similar to the approach to solving image-to-image mapping, or style transfer, popularized by CycleGAN [53]. Both works solve the problem of learning mappings between domains with only weak supervision by introducing a round-trip and defining "goodness" as how close the output of the round-trip is to the input. Having round-trip feedback from either the other radio agent or the other GAN crucially enables performance measurement, and hence training.
By contrast, in the Echo with Shared Preamble protocol from [13], the echo behavior is introduced only to train the modulator, and knowledge of a shared preamble between the two agents is assumed to facilitate direct supervised training of the demodulator after a half-trip exchange. In the EPP protocol, the agents do not have access to a shared preamble and must learn to demodulate blindly, without knowing what was actually sent by the other agent.
We believe that the EPP protocol minimizes the information sharing assumptions for learning modulation schemes for two reasons. First, some sort of feedback is required for learning, and the echo provides this feedback. Second, the EPP protocol treats the environment as a regenerative channel, i.e., a channel that provides helpful feedback without requiring assumptions about the nature of the other communicating agent. As long as the other agent is cooperative (in the sense that it echoes back what is heard), the environment behaves like a regenerative channel.
Next, we argue that the EPP protocol is a plausible mechanism for learning modulation schemes when the channel is regenerative by considering the case of a learning agent communicating with an agent that uses fixed, classic schemes. In this setting, even random exploration would eventually find a modulation scheme that successfully interfaces with the fixed agent. By using feedback to guide exploration, we expect the EPP protocol to perform much better than random guessing and quickly converge to a suitable modulation scheme. We can think of such a fixed, friendly regenerative channel as a "game" that the learner plays, where positive reward is achieved if what the channel echoes back can be decoded as what the learner encoded and sent in. Reinforcement learning is good at optimizing behaviors for simple games like this [54]. One of our main contributions is to show that the EPP protocol works not only with fixed communication partners, but even in the case where two agents are learning simultaneously.
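To make the "game" against a fixed regenerative channel concrete, the sketch below trains a toy modulator with a Gaussian policy and a REINFORCE-style update against a non-learning BPSK echoer. It is a deliberately reduced illustration and not the paper's algorithm: symbols are real-valued, there is one bit per symbol, the learner's own demodulator is fixed to a sign detector, the power constraint is approximated by clipping the policy mean to [-1, 1], and the hyperparameters and names are arbitrary choices for this example.

```python
import random


def fixed_echoer(symbol):
    """Non-learning BPSK partner: demodulate by sign, re-modulate as -1 or +1."""
    return 1.0 if symbol > 0 else -1.0


def learn_against_fixed_partner(iterations=4000, policy_std=0.5,
                                noise_std=0.1, lr=0.05, seed=0):
    rng = random.Random(seed)
    mean = [0.0, 0.0]      # modulator parameters: one policy mean per bit
    baseline = [0.5, 0.5]  # running average reward per bit (variance reduction)
    for _ in range(iterations):
        bit = rng.randint(0, 1)
        action = rng.gauss(mean[bit], policy_std)               # explore around the mean
        heard = action + rng.gauss(0.0, noise_std)              # forward channel
        echo = fixed_echoer(heard) + rng.gauss(0.0, noise_std)  # return channel
        decoded = 1 if echo > 0 else 0                          # learner's fixed demodulator
        reward = 1.0 if decoded == bit else 0.0
        # Gradient of log N(action; mean, std^2) with respect to the mean.
        grad_log_prob = (action - mean[bit]) / policy_std ** 2
        mean[bit] += lr * (reward - baseline[bit]) * grad_log_prob
        mean[bit] = max(-1.0, min(1.0, mean[bit]))              # crude power constraint
        baseline[bit] += 0.05 * (reward - baseline[bit])
    return mean
```

Starting from identical means for both bits, the round-trip reward alone is enough to push the two means toward opposite signs, which is the behavior the fixed partner can decode.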
To verify the universality of the EPP protocol and understand its performance relative to more structured or complex procedures, we run experiments with:
1) Different learning protocols based on varying amounts of information sharing, as described in Section IV.
2) Different levels of "alienness"⁴ among agents, as described in Section V.
3) Different modulation orders and levels of training SNR, as described in Section VI-A.

IV. LEVELS OF INFORMATION SHARING
The EPP protocol introduced in Section III-B is designed to be minimalist in the sense that we share as little information as possible. However, using less information usually comes at the cost of worse performance. To quantify the value of shared information, in this section we describe the following protocols that allow an increasing amount of shared information:
1) Echo with Shared Preamble (ESP) protocol: Agents have access to a shared preamble but can only get feedback via a round-trip during training;
2) Loss Passing (LP) protocol: Agents have access to a shared preamble and share scalar loss values directly (without using the channel) during training; and
3) Gradient Passing (GP) protocol: Agents have access to a shared preamble and share gradients directly (without using the channel) during training.
Note that by sharing gradients or loss information directly, bypassing the channel, it is possible to truncate the learning process at step C in Fig. 1 and still update the modulator of the speaker. Examples of algorithms which stop at this step are shown later in Figs. 5 and 6. In fact, traditional auto-encoder-style training is like the gradient passing protocol described above. A reader who is already familiar with this concept may wish to read about the protocols in the reverse order from how we present them.
Our purpose in studying the LP, GP, and ESP protocols is primarily to understand the effect of shared information on learning, since these are not new and have been studied independently in previous works such as [21], [14], and [13]. The LP and GP protocols are not implementable in real-world systems without a side channel to pass losses and gradients, i.e., they can be used in simulation at design time, but not at run time among distributed agents without depending on some existing communication infrastructure between them. ESP, however, is practical and can be implemented by mandating that every agent use a common fixed preamble. The major difference is that ESP requires agents to establish a shared preamble through some other mechanism before they can learn to communicate, whereas EPP removes this requirement. Section VII-A reports the results of our experiments comparing the performance of these protocols and quantifying the value of shared information.
The following subsections describe the learning protocols for EPP, ESP, LP, and GP in detail, highlighting the important differences between them.

A. ECHO PROTOCOL WITH PRIVATE PREAMBLE
The EPP protocol is the main contribution of our paper. It is described in detail in Alg. 1 and Fig. 2. The key details when comparing to ESP, LP, and GP are the natures of the modulator and demodulator updates. For EPP, the demodulator updates use supervised learning but have to rely on noisy labels: the speaker knows only the original preamble p, whereas the symbols its demodulator receives encode the echoer's (possibly incorrect) estimate of p. The modulator updates use reinforcement learning based on the round-trip feedback. Because the preamble is known only to the speaker, only the speaker's modulator and demodulator can be updated during a round-trip. The choice of when to terminate training is arbitrary, but we choose to halt training after a fixed number of training iterations. Other implementations might halt training after a BER target is reached.
One important consideration that is unique to the EPP protocol is that there is no way to ensure that the bit sequence sent by the modulator is interpreted as the same sequence after being demodulated. More formally, for a bit sequence b sent by Agent 1, there is no way to ensure that D2(M1(b)) = b. For example, Agent 1 might modulate the sequence 11 as some symbol c1, but Agent 2 might interpret c1 as 00. After a round-trip, however, any incorrect mappings from bit sequences to modulated symbols will be reversed if the agents have trained properly. That is, what we can guarantee is that D1(M2(D2(M1(b)))) = b. Fig. 3 demonstrates how this might happen. We address how we evaluate agents when this mapping ambiguity is present in Section VI-D. In general, it would require a protocol higher up the communication stack to disambiguate symbol mappings without access to a shared preamble: some way of symmetry breaking is required, and this presumably requires knowing more about the context of communication.
FIGURE 2. [...] Agent 1 then demodulates the echoed preamble and performs a policy update of its modulator using a bit loss between the original preamble p that it sent and its estimate of the echoed preamble, as well as a supervised gradient update of its demodulator with cross-entropy loss. Agent 1 and Agent 2 then switch roles so that Agent 2 is the speaker and Agent 1 is the echoing agent. All implementations for the modulator currently use a Gaussian policy with mean and variance estimated by a function approximator as described in Section VI-C.
FIGURE 3. An example modulation scheme learned by agents using the EPP protocol, demonstrating ambiguity of communication after a half-trip but coherence after a round-trip exchange (panel (c): Agent 2 modulation scheme). In this scheme Agent 1 maps the bit sequence '00' to the complex number 1 − 0.5j, i.e., M1('00') = 1 − 0.5j. Agent 2 demodulates this as the bit sequence '11', i.e., D2(M1('00')) = '11' ≠ '00'. However, this mismatch is reversed when the round-trip is completed: Agent 2 modulates '11' as 0 − 1j and Agent 1 demodulates this as '00'. Thus D1(M2(D2(M1('00')))) = '00'.
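The half-trip ambiguity and round-trip coherence can be checked on a small noise-free example. In the sketch below, only the two values M1('00') = 1 − 0.5j and M2('11') = 0 − 1j are taken from the example in the text; every other constellation point, the nearest-neighbor demodulators, and all names are hypothetical choices made so that the example is complete.

```python
# Modulators: bit sequence -> complex symbol.
M1 = {'00': 1 - 0.5j, '01': 0.5 + 1j, '10': -1 + 0.5j, '11': -0.5 - 1j}
M2 = {'11': 0 - 1j, '10': 1 + 0j, '00': 0 + 1j, '01': -1 + 0j}

# Demodulators, stored as label -> centroid and applied by nearest neighbor.
# Agent 2 has learned to separate Agent 1's symbols, but with permuted labels.
D2_CENTROIDS = {'11': 1 - 0.5j, '10': 0.5 + 1j, '00': -1 + 0.5j, '01': -0.5 - 1j}
# Agent 1 has learned labels that undo Agent 2's permutation on the way back.
D1_CENTROIDS = {'00': 0 - 1j, '01': 1 + 0j, '10': 0 + 1j, '11': -1 + 0j}


def nearest(centroids, symbol):
    """Nearest-neighbor demodulation over a label -> centroid table."""
    return min(centroids, key=lambda label: abs(symbol - centroids[label]))


def half_trip(bits):
    """D2(M1(b)): what Agent 2 believes Agent 1 sent."""
    return nearest(D2_CENTROIDS, M1[bits])


def round_trip(bits):
    """D1(M2(D2(M1(b)))): what Agent 1 recovers from the echo."""
    return nearest(D1_CENTROIDS, M2[half_trip(bits)])
```

Here every half-trip label is wrong from Agent 1's point of view, yet every round-trip returns the original bit sequence, which is why round-trip bit error alone cannot reveal the permutation.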

B. ECHO WITH SHARED PREAMBLE
The ESP protocol is described in Fig. 4 and Alg. 2. ESP was first explored in [13], where the modulator was neural-network based and trained using policy gradients, but where the demodulator used clustering methods⁵ trained via supervised learning using the shared preamble. In our work, we use the ESP protocol to train agents whose modulators and demodulators both use function approximators. (See Appendix B for more information.)
ESP is similar to the EPP protocol, but now both the speaker and echoer know the preamble p that is transmitted. This allows the echoer to update its demodulator after the first half-trip since it knows exactly what it was supposed to have received. This demodulator update is typically of higher quality than the updates in EPP, since those updates only have access to symbols based on the (possibly incorrect) estimate of the original preamble sent back by the echoer. The speaker agent does not bother to update its demodulator after the round-trip is complete, since it will receive higher quality feedback on the next training iteration after the speaker and echoer roles are switched.
Importantly, the speaker's modulator still requires a full round-trip before it can receive feedback and be updated. In Sections IV-C and IV-D this will no longer be the case. The consequence of round-trip feedback is that the speaker's modulator is actually optimizing for the performance of the speaker's demodulator, since that is the only loss it has access to. Our presumption is that improving the round-trip performance of the speaker's demodulator will indirectly improve the half-trip performance of the echoer's demodulator, since the half-trip BER limits the round-trip BER. The consequences of this indirection are illustrated in Section VII-A.

C. LOSS PASSING: HALF-TRIP
Now we remove the restriction that information can only be shared over the channel during training and allow the agents to magically pass losses back and forth. The loss passing (LP) protocol, as used in previous work such as [14], is detailed in Fig. 5 and Alg. 3. There is no longer a need for an echo from the second agent, since the speaker's modulator receives a loss value directly from the second agent's demodulator. This results in two major changes: a full training update can be completed after only a half-trip, and the speaker's modulator is optimizing for the echoer's demodulator performance directly.
Fig. 4. In the ESP protocol, the preamble p is modulated and sent from Agent 1 across the channel to Agent 2, which demodulates it to obtain an estimate of p. Using the shared preamble, Agent 2 performs a gradient update on its demodulator and also modulates and sends back an echo, its estimate of the preamble it received, through the channel to Agent 1. Agent 1 then demodulates the echo and does a policy update of its modulator using the bit loss between the original preamble p and its estimate of the echo. Agent 1 and Agent 2 then switch roles and repeat the process. All implementations for the modulator currently use a Gaussian policy with mean and variance estimated by a function approximator as described in Section VI-C. The echoer's demodulator updates, not the speaker's.
In the EPP and ESP protocols, the speaker's modulator has to optimize for the performance of the speaker's demodulator, only indirectly addressing the performance of the echoer's demodulator. The LP protocol allows the speaker's modulator to directly optimize for the performance of the echoer's demodulator since the speaker has access to the relevant loss values. Although the speaker's modulator still has to use reinforcement learning rather than supervised learning to perform parameter updates, we expect the agents to be able to train much faster when using loss passing.

D. GRADIENT PASSING: HALF-TRIP
If we further allow the agents to share gradients during training, the system can naturally be treated as an end-to-end autoencoder⁶ with channel noise introduced between the encoding and decoding sections. This method was employed successfully in [9]. Our version of such an autoencoder-based training protocol, which we call the GP protocol, is explained in detail in Fig. 6 and Alg. 4.
As in the LP protocol, the speaker's modulator can be trained after only a half-trip because it has access to feedback from the echoer's demodulator. Instead of using reinforcement learning to train a Gaussian policy, however, the speaker in GP trains its modulator to encode bits directly as complex numbers, and the gradients from the echoer's demodulator are used for supervised learning updates.

V. ALIENNESS OF AGENTS
How can we determine if the EPP protocol is universal? We need to determine if it allows us to learn to communicate with strangers. There are in principle three kinds of agents (strangers) that we might encounter with which we might wish to learn to communicate: 1) a fixed agent that knows how to communicate; 2) a learning agent that does not know how to communicate yet but is cooperative and willing to learn; or 3) an agent that does not know and will not learn how to communicate. The Classic agent uses a fixed modulation scheme known to be optimal for AWGN channels for the given modulation order, e.g., QPSK for 2 bits per symbol, 8PSK for 3 bits per symbol, and 16QAM for 4 bits per symbol [55]. This is an example of an agent of the first kind. An example of an agent of the second kind is one that uses a trainable function approximator for its modulator and demodulator. We consider Neural agents, which use neural networks as function approximators, and Poly agents, which use polynomials as function approximators. We ignore agents of the third kind since it is impossible to learn to communicate with such agents.
Note that there are several other examples of agents. A learning agent that has been pre-trained and frozen behaves like a fixed agent. We can in principle have learning agents with decision-tree or nearest-neighbor based function approximators. However, in this paper, we restrict ourselves to Classic, Neural, and Poly agents. Details about these agents, including the hyperparameters used and training methods employed, are provided in Appendices B and G.
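To make the notion of a fixed agent concrete, the following is a minimal sketch of a Classic QPSK agent with unit average power and a nearest-neighbor demodulator. The Gray-coded bit labels and the function names `modulate` and `demodulate` are assumptions made for illustration; the actual mapping used in the paper (Table I and Appendix E) may differ.

```python
import numpy as np

# Assumed Gray-coded labels; the paper's own bit-to-symbol mapping may differ.
QPSK = {
    "00": (1 + 1j) / np.sqrt(2),
    "01": (-1 + 1j) / np.sqrt(2),
    "11": (-1 - 1j) / np.sqrt(2),
    "10": (1 - 1j) / np.sqrt(2),
}
LABELS = list(QPSK.keys())
POINTS = np.array([QPSK[label] for label in LABELS])


def modulate(bit_seqs):
    """Map a list of 2-bit strings to complex constellation points."""
    return np.array([QPSK[b] for b in bit_seqs])


def demodulate(symbols):
    """Return the label of the nearest constellation point for each symbol."""
    symbols = np.asarray(symbols, dtype=complex)
    distances = np.abs(symbols[:, None] - POINTS[None, :])
    return [LABELS[i] for i in np.argmin(distances, axis=1)]
```

A learning agent paired with such a Classic agent only has to match this static mapping, which is the situation studied in Section VII-B1.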
We perform experiments by pairing two agents with different levels of alienness, where alienness is determined by:
1) Whether they are fixed agents or learning agents (e.g., a Neural-and-Classic matchup).
2) The class of function approximators used by the learning agents. We denote agents that use different classes of function approximators as "Aliens" (e.g., Neural-and-Poly).
3) The random initialization and hyperparameters used by two learning agents using the same class of function approximators. We denote agents that use the same class of function approximators but different random initialization and hyperparameters as "Self-Aliens" (e.g., Neural-and-Self-Alien).
4) The random initialization used by two learning agents using the same class of function approximators and the same hyperparameters. We denote agents that differ only in random initialization as "Clones" (e.g., Neural-and-Clone).
Results for these experiments that portray the effect of different levels of alienness on the performance of the EPP protocol are provided in Section VII-B.

VI. EXPERIMENTS
In addition to the effects of different levels of information sharing and alienness, modulation order, training SNR, and modulator constellation power constraints are other factors that affect the performance of our learning protocols.

A. MODULATION ORDER AND TRAINING SIGNAL TO NOISE RATIO
Modulation order, determined by the bits per symbol (bps) used, determines the number of unique symbols that can be sent and received. A bps of $b$ corresponds to $2^b$ unique symbols. For instance, for bps = 2, we have 4 unique symbols: '00', '01', '10', and '11'. We consider settings where bits per symbol is either 2, 3, or 4. For Classic agents, bps determines the fixed scheme, optimal for AWGN channels, used as a baseline. These are provided in Table I and visualized in Appendix E. For Neural and Poly agents, bps determines the size of the inputs and outputs of the modulator and demodulator. Details about this are provided in Appendix B. Since higher modulation orders have higher bit error rates (BERs) at the same SNR, we must determine an appropriate SNR to use for training and testing to provide a fair comparison between different modulation orders. We do this by selecting the SNR based on the round-trip BER achieved when using the baseline (classic) schemes. For most experiments we use a training SNR corresponding to a BER of 1%, and for all experiments we test on SNRs corresponding to BERs ranging from 0.001% to 10% as described in Table I. We explore the effect of modulation order and training SNR on the performance of the EPP protocol in Appendix C.

B. CONSTELLATION POWER CONSTRAINTS
As described in Section III-A, the modulator maps symbols (bits) into complex numbers, i.e., points on the complex plane. Due to the presence of the AWGN channel, it is optimal to place these points as far apart as possible to minimize the likelihood of an error. Thus, to get non-degenerate solutions we must impose a constraint on how far these points can be from the origin. Note that this is similar to a real-world constraint on power used by a radio system.
We introduce a hard power constraint by requiring that the modulator outputs have an average power of less than 1. We also experimented with soft power constraints by including a penalty term in the loss function based on the power used while training, but chose not to use them in the end. For simulations, we observed that a hard power constraint was sufficient, and more importantly did not require tuning the hyperparameter corresponding to the weight of the power penalty.
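One simple way to impose such a hard constraint is to rescale the constellation whenever its average power exceeds the limit. The sketch below assumes this rescaling approach and a hypothetical function name `enforce_average_power`; the paper does not state here how the constraint is enforced, so this should be read as one possible realization only.

```python
import numpy as np


def enforce_average_power(points, max_power=1.0):
    """Rescale constellation points so that their average power is at most max_power.

    Assumption: the constraint is enforced by uniform rescaling; constellations
    that already satisfy it are returned unchanged.
    """
    points = np.asarray(points, dtype=complex)
    avg_power = np.mean(np.abs(points) ** 2)
    if avg_power <= max_power:
        return points
    return points * np.sqrt(max_power / avg_power)
```

Because the rescaling is uniform, the relative geometry of the constellation is preserved and no penalty weight has to be tuned.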

C. TRAINING
For the Neural and Poly learning agents, the demodulator is trained using supervised learning with cross-entropy loss. In the GP protocol, the modulator output is equal to the output of the underlying function approximator and its parameters are updated using supervised learning. In the EPP, ESP, and LP protocols the modulator employs a Gaussian policy. The modulator output is sampled from a Gaussian distribution with mean and variance determined by the output of the underlying function approximator, whose parameters are updated using vanilla policy gradients. More details about the update procedure are provided in Appendix B.
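The Gaussian-policy update can be illustrated with a deliberately simplified sketch. Here a lookup table of mean I/Q values stands in for the neural or polynomial function approximator, the standard deviation is fixed rather than learned, and a mean-loss baseline is subtracted; all three are assumptions made for brevity, and the function names are hypothetical. The actual update procedure is the one given in Appendix B.

```python
import numpy as np


def sample_actions(means, std, rng):
    """Sample transmitted (I, Q) pairs around the policy means (exploration)."""
    return means + std * rng.standard_normal(means.shape)


def policy_gradient_step(table, symbol_idx, actions, bit_losses, std, lr):
    """One REINFORCE-style update of a lookup table of mean (I, Q) values.

    table:      (num_symbols, 2) mean I/Q per symbol (stand-in approximator)
    symbol_idx: (N,) indices of the symbols that were sent
    actions:    (N, 2) sampled I/Q values that were actually transmitted
    bit_losses: (N,) per-symbol loss, e.g. bit errors observed after the echo
    """
    symbol_idx = np.asarray(symbol_idx)
    bit_losses = np.asarray(bit_losses, dtype=float)
    means = table[symbol_idx]
    # Reward is negative loss; subtracting the batch mean is an assumed baseline.
    reward = -(bit_losses - bit_losses.mean())
    # Gradient of log N(action; mean, std^2) with respect to the mean.
    grad_logp = (np.asarray(actions, dtype=float) - means) / std**2
    new_table = table.copy()
    # np.add.at accumulates correctly when the same symbol appears repeatedly.
    np.add.at(new_table, symbol_idx, lr * reward[:, None] * grad_logp / len(symbol_idx))
    return new_table
```

Each mean moves toward sampled points that produced fewer bit errors than average and away from those that produced more, which is the mechanism by which exploration shapes the constellation.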
We conduct multiple trials using different random seeds for each experiment to accurately estimate the performance of our protocols and agents. An experiment fixes the learning protocol, the agent types, training SNR, and modulation order. Each trial is run for a maximum number of training iterations that we determine empirically for each experiment. Easier learning tasks are run for fewer iterations to speed up the simulations. Note that instead of measuring training iterations we can also measure the number of preamble symbols transmitted. These two measurements are related via the preamble length, the number of symbols in the preamble. For all our experiments we set the preamble length to 256 symbols in order to allow fair comparison across experiments. This also reduces the relative cost of overheads in the implementation on real hardware radios. Certain protocols and modulation orders require fewer transmitted symbols to achieve good performance. Details about the maximum iterations (and thus maximum number of preamble symbols transmitted) can be found in Table VI in Appendix F, and in the code itself at [19].

D. EVALUATION
How do we determine the metrics that should be used to measure the performance of a learning protocol? These metrics should allow for a fair comparison across different protocols (GP, LP, ESP, and EPP) and must be informative in determining the effect of different levels of information sharing, alienness, and modulation order on the learning task. We are primarily interested in quantifying 'efficiency', how long the protocol takes to learn a modulation scheme, and 'robustness', how reliably the protocol learns this scheme.
First, we must decide on a metric to determine if the learned modulation scheme is 'good'. Bit error rate (BER) is a natural choice in communication settings, but since we have two agents we must determine whether to measure cross-agent BER (half-trip BER) or round-trip BER. In the GP, LP, and ESP protocols both cross-agent and round-trip BER are indicative of performance. In the EPP protocol, since the two agents have no shared preamble, measuring cross-agent BER is not a good indicator of performance since the two agents may have different bit interpretations of the same modulated symbol, as described in Section IV-A. However, round-trip BER is a valid measure of performance in this case. Consequently, we choose the round-trip BER to allow for a fair comparison between different protocols. Note that when measuring the BER to evaluate performance of agent(s), instead of sampling from the Gaussian policy, the modulators deterministically use the mean of the Gaussian policy. This avoids introducing additional errors from "exploration."
Next, we must determine the SNR that we measure the round-trip BER at and whether the measured BER is indicative of good performance. As discussed in Section VI-A, we decide on the test SNRs based on the modulation order depending on the performance of the baseline. To determine if the measured BER is indicative of good performance we measure the metric 'dB off optimal', illustrated in Fig. 7. To compute this metric we first measure the test BER achieved by our protocol at the SNR where the corresponding baseline scheme achieves 1% BER. Then we compute the difference between this SNR (dB) and the minimum SNR (dB) required for the baseline scheme to achieve the measured BER. We measure this at different stages of the learning process corresponding to different numbers of preamble symbols transmitted.
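The 'dB off optimal' computation described above can be sketched as follows. The baseline curve used here is an assumed stand-in: the textbook BER of Gray-coded QPSK over AWGN, combined over two independent hops, rather than the paper's simulated baseline from Table I. The function names and the SNR grid are likewise hypothetical.

```python
import math

import numpy as np


def qpsk_round_trip_ber(snr_db):
    """Assumed stand-in baseline: Gray-coded QPSK over AWGN, two independent hops."""
    snr = 10.0 ** (snr_db / 10.0)
    p = 0.5 * math.erfc(math.sqrt(snr / 2.0))  # half-trip BER, Q(sqrt(SNR))
    return 2.0 * p * (1.0 - p)                 # bit flipped on exactly one hop


def db_off_optimal(measured_ber, baseline_ber, target_ber=0.01):
    """SNR (dB) where the baseline reaches target_ber, minus the SNR (dB)
    where the baseline reaches the BER measured for the learned scheme."""
    grid = np.linspace(-5.0, 20.0, 2501)
    log_ber = np.log10([baseline_ber(s) for s in grid])  # decreasing in SNR

    def snr_at(ber):
        # np.interp needs increasing x values, so reverse the curve.
        return np.interp(np.log10(ber), log_ber[::-1], grid[::-1])

    return float(snr_at(target_ber) - snr_at(measured_ber))
```

A learned scheme that matches the baseline gives a value of zero, while a scheme with a higher measured BER gives a positive value equal to the extra SNR it effectively wastes.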
Using the round-trip BER and dB off optimal metrics we look at the following two graphs:

1) Round-trip BER vs SNR
Here we plot order statistics of the round-trip BER achieved by the learning protocol after it has converged or reached the maximum number of iterations allowed in our setup alongside that achieved by the baseline. This graph measures the limit of BER performance of our protocols subject to the maximum number of training iterations (preamble symbols) we allow. Fig. 10a is a representative example.

2) Fraction of trials that are 3 dB off optimal
Here we plot the fraction of trials that achieve dB-off-optimal values of less than 3 vs the number of preamble symbols transmitted. This tells us how robustly and how fast we are learning the modulation scheme. Fig. 10b is a representative example.
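The convergence curve, and the summary number reported per experiment (symbols needed for more than 90% of trials to be within 3 dB off optimal), can be computed along the following lines. The array layout and the function names are assumptions made for illustration.

```python
import numpy as np


def fraction_converged(db_off, threshold_db=3.0):
    """Fraction of trials below the threshold at each checkpoint.

    db_off: (num_trials, num_checkpoints) array of dB-off-optimal values.
    """
    db_off = np.asarray(db_off, dtype=float)
    return np.mean(db_off < threshold_db, axis=0)


def symbols_to_converge(fractions, symbols_sent, level=0.9):
    """First checkpoint (in preamble symbols) where the fraction exceeds level."""
    for frac, n in zip(fractions, symbols_sent):
        if frac > level:
            return n
    return None  # never reached within the allowed number of symbols
```

Returning `None` rather than the last checkpoint keeps experiments that never reach the 90% level distinguishable from those that reach it late.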
Neither of these metrics is novel on its own; however, we are unaware of works which report both metrics. Our contribution is to combine these metrics to understand the performance of a learning protocol.

VII. RESULTS
In our experiments we consider the following agents: Classic, Neural-fast, Neural-slow, Poly-fast, and Poly-slow. The Neural-fast and Neural-slow agents (similarly Poly-fast and Poly-slow) are Self-Aliens; they share the same architecture but differ in learning rates and exploration parameters. They are named with -fast or -slow depending on their relative modulation scheme learning times using the EPP protocol when paired with a clone. To choose hyperparameters for our agents, we performed coarse hand-tuning for the EPP protocol in the Agent-and-Clone setting and used the same hyperparameters for the ESP and LP protocols. The hyperparameters chosen this way were sufficient to obtain performance close to the baseline in terms of BER for all of these protocols in the Agent-and-Clone setting. However, for the GP protocol we observed that using the same hyperparameters as EPP led to sub-optimal performance, and thus we tuned parameters separately for the GP case. While tuning hyperparameters, we further ensured that Poly-fast-and-Poly-fast had similar convergence times to Neural-slow-and-Neural-slow. It is known that hyperparameters matter, and finding good hyperparameters is hard (see for example [56]). However, we trust our conclusions about the relative performance of protocols in our experiments because we see order-of-magnitude differences across many trials. Hyperparameters for each agent are listed in Appendix G.
We categorize the different pairings based on the different levels of alienness introduced in Section V. For the experiments in this section we use 2 bits per symbol and train at 8.4 dB SNR, corresponding to 1% BER for the QPSK baseline. Tables II and III contain numerical results for our experiments on the effects of information sharing and alienness on the performance of modulation learning schemes. Sections VII-A and VII-B present additional figures and discuss the meaning of these results. Experiments on the effect of modulation order and training SNR on the performance of the ESP and EPP protocols can be found in Appendix C.
Table II. As expected, the results show learning using the GP and LP protocols to be fast (compared to ESP and EPP in Table III). Furthermore, the GP protocol converges faster than equivalent experiments using LP. Gradients can carry more information than scalar loss values, so it makes sense for GP to be a more effective learning protocol.

A. EFFECT OF INFORMATION SHARING
In this first set of experiments, we explore the effect of information sharing on our learning protocols, seeking to quantify the value of shared information and understand the performance trade-off incurred by reducing shared information in the ESP protocol. For this, we primarily consider the case of an agent learning to communicate with its clone. We choose this case because, in order to succeed at learning to communicate with others, one must first be able to communicate with (a copy of) oneself. Fig. 8 compares the performance of the GP, LP, ESP and EPP protocols for a Neural agent learning to communicate with its clone. From the round-trip BER curves in Fig. 8a, we observe that all protocols achieve similar values for the median BER and upper and lower percentiles. Furthermore, the median BER is close to the QPSK baseline. This is one of the main results of our work: EPP can perform as well as ESP, LP and GP and achieve performance close to an optimal baseline. From Fig. 8b we observe that all protocols are robust, with the fraction of trials that converge going to 1 after sufficient preamble symbols are exchanged. The EPP protocol needs the most preamble symbols to converge, followed by the ESP protocol, and both these protocols take a much larger number of preamble symbols to converge than the GP and LP protocols. Thus, we conclude that with a decreasing amount of information sharing it takes longer to learn to communicate, highlighting the value of shared information.
Tables II and III tabulate the number of preamble symbols that have to be exchanged for more than 90% of trials to converge within 3 dB-off-optimal for the different protocols. From these tables we see that there is an order of magnitude or more difference in the number of preamble symbols required between the protocols that use a side channel (GP and LP) and ones that don't (ESP and EPP). We performed similar experiments using Poly agents and observed the same behavior, as shown in Fig. 20. These results are included in Appendix C.
Fig. 8. The BER plot (a) shows that all protocols achieve BER close to that of the QPSK baseline. From the convergence plot (b), we observe that EPP is much slower than ESP, which in turn is an order of magnitude slower than LP and GP. Protocols with greater information sharing lead to faster convergence.
In the rest of the experiments, we determine the effect of alienness on the EPP protocol to address the universality of the Echo protocol.

B. EFFECT OF ALIENNESS
We explore the effects of alienness across several pairings of agents. We are primarily interested in answers to the following questions:
1) Is it possible to learn to communicate with self-aliens and alien agents using the EPP protocol?
2) Is it intrinsically more difficult to learn to communicate with aliens or self-aliens than learning with clones?
3) Can we say something about the performance of the EPP protocol with alien agents based on the individual performances when trained with clones? (E.g., can we say something about the performance of Neural-slow-and-Poly-fast by looking at the performances of Neural-slow-and-Neural-slow and Poly-fast-and-Poly-fast?)

1) Learning with Fixed Agents
We first address the question of whether it is possible for learning agents to learn to communicate with fixed agents. This is important, since we are likely to encounter agents that use fixed modulation schemes in the real world and our learning agent must be compatible with them. To do this, we run experiments with a Neural agent learning to communicate with a fixed Classic agent, Neural-fast-and-Classic. After confirming that learning agents can work with fixed Classic agents, we compare this to the case when a Neural agent trains with another learning agent. In particular, we examine whether learning to communicate with a clone is harder than learning to communicate with a fixed agent, and whether increasing alienness (self-alien and completely alien) further increases the difficulty of the task. Fig. 9 compares the performance of the GP, LP, ESP and EPP protocols for a Neural agent learning to communicate with a Classic agent. From the round-trip BER curves in Fig. 9a we observe that all protocols achieve round-trip BER close to the QPSK baseline. Fig. 9b shows the EPP and ESP protocols have similar convergence behavior but are an order of magnitude slower than the GP and LP protocols. All protocols lead to robust convergence. Furthermore, comparing against Fig. 8b, we see that learning to communicate with a Classic agent is much easier than learning to communicate with a clone learning agent. Table III shows a difference in convergence speed of up to 5.5× between these two cases when using the EPP and ESP protocols. This matches what we expect intuitively, since when both agents are learning each agent is trying to improve its own behavior and simultaneously track the behavior of the other agent. When one agent is fixed, the learning agent only has to match a static behavior. Graphical results for a Poly agent learning to communicate with a Classic agent can be found in Fig. 21 in Appendix C.
Fig. 9. Learning to communicate with a fixed agent, Neural-fast-and-Classic: The BER plot (a) shows that all protocols achieve BER close to that of the QPSK baseline. From the convergence plot (b), we observe that EPP and ESP have similar convergence behavior and are an order of magnitude slower than LP and GP. For the ESP and EPP protocols, convergence is much faster than when learning with a clone (Fig. 8).

2) Learning with Self-aliens
Fig. 10 compares learning with clones to learning with a self-alien for Neural agents using the EPP protocol. Here we have one "fast" agent and one "slow" agent, defined based on speed of convergence when paired with a clone. From the BER curves in Fig. 10a we see that all cases achieve round-trip accuracy close to the QPSK baseline. From the convergence plot in Fig. 10b we make an interesting observation. The fast agent helps the slower agent to learn more quickly, resulting in convergence times for the self-alien pairing in between those of the clone pairings. This is very encouraging since it suggests that not only is learning with self-aliens possible, it can also be faster than learning with clones for a slow agent. We repeated this experiment using Poly agents and found similar behavior, shown in Fig. 22 in Appendix C.

3) Learning with Aliens
Next we compare learning with aliens to learning with clones. Fig. 11 depicts the results for the case where a Neural agent and a Poly agent that show similar convergence behavior when learning with a clone are paired with each other. From the BER curve in Fig. 11a we see that the Neural agent when paired with a clone has slightly lower BER than the Poly agent paired with a clone, but more importantly the BER for the alien pairing is very similar to the others and close to the QPSK baseline. From the convergence plot in Fig. 11b we see that when the two clone pairings show similar convergence behavior the alien pairing does not deviate. This is another main result of our work. Learning to communicate with an alien agent using EPP is not intrinsically more difficult than learning with a clone.
Fig. 11. The BER plot (a) shows that in all cases, the BER achieved is close to the QPSK baseline. From the convergence plot (b), we observe that when both clone pairings show similar convergence behavior, the alien pairing also has the same behavior. Learning to communicate with an alien is not intrinsically more difficult than learning to communicate with a clone.
Fig. 12. The BER plot (a) shows that in all cases, the BER achieved is close to the QPSK baseline. From the convergence plot (b), we observe that when the clone pairings have vastly different convergence behavior, the alien pairing shows convergence behavior somewhere in between. All alien pairings converged to within 3 dB off optimal, even though the Poly-slow with clone failed at least once. The robustness of the Neural-fast learner may have helped the Poly-slow agent converge more reliably.
In the next experiment we investigate whether it is possible for an agent to learn to communicate with an alien agent when the two agent types have vastly different convergence behaviors while learning with clones. From the BER curve in Fig. 12a we see that the two clone pairings have round-trip error rates close to the QPSK baseline. Interestingly, Fig. 12b shows that the alien pairing always learned a good modulation scheme even though the Poly-slow-and-clone pairing occasionally failed. The alien pairing convergence speed lies in between the two clone pairings. These results suggest that a fast, reliable agent can help a slow agent, alien or not, learn more quickly and more robustly. This phenomenon is not unique to the EPP protocol, as can be seen from results with the ESP protocol in Fig. 23 in Appendix C. Another interesting result from this experiment, found in Table III, is that for ESP the difference between convergence times for Neural-and-Classic and Neural-and-Clone (and similarly for Poly-and-Classic and Poly-and-Clone) is smaller than for EPP. This could be partially because the hyperparameters were tuned for the EPP agent-and-clone setting, but also suggests that as we increase information sharing the performance gap between learning with fixed agents and other learning agents shrinks.
We can now provide answers to the questions we raised at the start of this subsection. It is possible to learn to communicate with self-aliens and aliens using the EPP protocol. In fact, neither of these tasks is intrinsically more difficult than learning with a clone. Furthermore, a self-alien/alien pairing shows convergence behavior in between the two clone pairings. A fast agent paired with a slower agent can help the slow agent learn faster. An interesting experiment would be to map out the range of conditions where these observations continue to hold. Can learning become impossible if the difference in convergence behavior of the two agents is large enough? Can two agents have convergence behaviors that are similar, but differ so fundamentally in the way they learn that they fail to learn in the alien setting? We leave this as an area for future research.

VIII. IMPLEMENTATION IN SOFTWARE DEFINED RADIOS
In order to corroborate our simulation results, we implement the ESP (Section IV-B) and EPP (Section IV-A) protocols on Ettus USRP software defined radios using GNU Radio [57]. The goal of this implementation is not to provide a real-time implementation of the Echo protocol, since in general the real-time components of radio communications are implemented in ASICs, and even software components are run in special real-time operating systems to achieve deterministic or bounded latencies. The focus of our work is to learn modulation schemes, so the primary goal of the GNU Radio implementation is to demonstrate that the learning protocols work not only in simulations but also when trained in real, physical systems. Other work such as [29] and [30] has also demonstrated that end-to-end learning of communication schemes is possible over the air in real radio systems. We plan to address other components of radio communications such as channel equalization and error correction coding in future works. Only after all of these processing components have been addressed will it be necessary to have real-time hardware implementations of components such as the modulation learning.

A. ADDITIONAL PROCESSING
The GNU Radio implementation attempts to abstract away the details of packet transmission, reception, and non-AWGN channels in order to provide as close an approximation as possible to the training environment of the previous sections. The implementation corrects for carrier frequency offset (CFO), multitap channels, and arbitrary packet arrival times using several algorithms implemented with NumPy [58]. We detect packets using correlation against a fixed prefix and constant false alarm rate detection [59]. CFO and channel effects are corrected using the same prefix. We perform coarse sample timing synchronization by upsampling to two samples per symbol for transmission, then downsampling after the start of the packet has been detected.
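The correlation step of packet detection can be sketched in a few lines of NumPy. This is a simplified illustration only: it picks the strongest correlation peak and omits the constant false alarm rate thresholding, CFO correction, and channel equalization used in the actual implementation. The function name `detect_packet_start` is hypothetical.

```python
import numpy as np


def detect_packet_start(rx, prefix):
    """Return the sample index where the known prefix best aligns with rx.

    np.correlate conjugates its second argument, so this is a matched filter
    against the prefix. No CFAR threshold is applied in this sketch.
    """
    corr = np.correlate(rx, prefix, mode="valid")
    return int(np.argmax(np.abs(corr)))
```

In a real receiver the peak would additionally be compared against a noise-dependent threshold so that noise-only captures are not reported as packets.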
The additional processing adds significant overhead to each round-trip training cycle. The results from a typical run with a 50-unit single hidden layer modulator and demodulator and 256 symbols per preamble are shown in Table IV. As shown in the table, the packet wrapper comprises more than one third of the execution time during a run. In addition to the computation time, the GNU Radio implementation introduces latency by sending data between packet processing blocks and modulator or demodulator blocks.

B. TRAINING PROCEDURE MODIFICATIONS
Constraints introduced by running on physical radios required several changes to the Neural agent training procedure before we could successfully train these agents. The constraints and the modifications necessary to overcome them are detailed in Sections VIII-B1 and VIII-B2.

1) Maximum Transmit Amplitude
Signals sent through USRP radios cannot exceed a maximum amplitude, and any signals sent to the radio which exceed this amplitude are silently clipped to the maximum amplitude. However, the EPP implementation in our simulations only restricts the average energy of a constellation. This means that any individual constellation point can have almost arbitrarily large amplitude, and exploration can drive the amplitude of a transmitted symbol even higher. It turns out that clipping a significant number of transmitted symbols breaks the training process for neural modulators, and they never converge to a reasonable constellation. In order to prevent clipping, we restrict the average power of a constellation during training to significantly less than the radio's cap, and rely on the vast majority of symbols which are not clipped to produce good training feedback.
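The effect of clipping on explored symbols can be examined offline with a sketch like the one below. It assumes that clipping limits the magnitude of each complex sample while preserving its phase; actual hardware may instead clip the I and Q components separately, so this is an approximation. Both function names are hypothetical.

```python
import numpy as np


def clip_amplitude(symbols, max_amplitude):
    """Limit the magnitude of each complex sample, preserving its phase.

    Assumption: magnitude clipping; real radios may clip I and Q separately.
    """
    symbols = np.asarray(symbols, dtype=complex)
    mag = np.abs(symbols)
    scale = np.minimum(1.0, max_amplitude / np.maximum(mag, 1e-12))
    return symbols * scale


def fraction_clipped(symbols, max_amplitude):
    """Fraction of samples whose magnitude exceeds the cap."""
    return float(np.mean(np.abs(np.asarray(symbols)) > max_amplitude))
```

Monitoring the clipped fraction for a given average power and exploration variance gives a rough way to choose how far below the cap the constellation power should be set.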
Because we control the environment for our tests, we can ensure that we train and test at the desired SNRs for any given constellation. However, in the real world a system may need to use all of its transmit power to achieve a usable SNR. In such a case, restricting the average power of a constellation to less than the maximum would prevent learning from taking place. We hope to address the problem of exploring out to a bounding box, without exceeding it, while maintaining training performance in future work.

2) DC Offset Correction
USRP radios use an adaptive DC offset canceler in the receive chain which causes the IQ samples that the demodulator eventually receives to be centered around the origin, regardless of the originally transmitted constellation. However, the base Echo implementation does not place any restriction on the mean of a constellation. The most energy efficient constellation possible is always centered at the origin, so the constellations achieved after training are approximately centered at the origin as well. Unfortunately, the constellation center commonly moves far from the origin during the training process before being forced back as the constellation is optimized. This causes a significant DC offset in the transmitted signal. The receive chain DC offset corrections change the round-trip feedback that a modulator receives significantly enough that neural modulators fail to train.
The adaptive DC offset cancellation can be disabled, but this would require a calibration period at the start of each run, or even after each received packet, to measure the true DC offset and set the DC offset canceler manually. Instead, we explored methods of forcing the constellations to be approximately centered while training. We settled on a loss term for the squared magnitude of the constellation center; this was done individually at each agent and so did not violate the spirit of the problem. See Appendix D for more details.
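A loss term of this kind can be sketched as follows. The function name, the `weight` parameter, and the example constellations are hypothetical; the weight actually used is among the tuned hyperparameters and is not reproduced here.

```python
def center_penalty(means, weight):
    """Penalty equal to weight * |center|^2, where center is the mean of the
    constellation points (complex numbers)."""
    center = sum(means) / len(means)
    return weight * abs(center) ** 2

# A centered QPSK-like constellation incurs no penalty.
qpsk = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]

# The same constellation shifted by 0.5 along the real axis is penalized.
shifted = [p + 0.5 for p in qpsk]

penalty_centered = center_penalty(qpsk, weight=2.0)
penalty_shifted = center_penalty(shifted, weight=2.0)
```

Each agent can evaluate this quantity from its own modulator outputs alone, so no additional information needs to be exchanged.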

C. EXPERIMENTS
The radio experiments were conducted using two Ettus USRP X310 software defined radios (SDRs) connected to each other with SMA cables as shown in Fig. 13. We added 75 dB of attenuation between the radios both to simulate path loss and to allow us to achieve desired SNRs with the available internal transmit and receive gains. We tuned hyperparameters for the radio experiments separately from the main simulation hyperparameters because of the extra hyperparameter introduced for DC offset correction. We use these same hyperparameters in simulations when comparing with the radio experiment results. After coarse hand-tuning we achieved performance similar to Neural-slow-and-clone. The hyperparameters are listed in Appendix H. Each experiment was run 20 times with random seeds at an SNR which resulted in 1% round-trip BER for two classic agents (Section B-A).
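To make the notion of round-trip BER for two classic agents concrete, the following is a minimal Monte Carlo sketch in simulation, not the GNU Radio implementation. It assumes Gray-coded QPSK with unit average power and defines SNR as symbol energy over complex noise variance; this convention and the function names are assumptions made for illustration.

```python
import math
import random

# Gray-coded QPSK with unit average power: bit pair -> constellation point.
QPSK = {(0, 0): complex(1, 1), (0, 1): complex(-1, 1),
        (1, 1): complex(-1, -1), (1, 0): complex(1, -1)}
QPSK = {bits: p / math.sqrt(2) for bits, p in QPSK.items()}

def awgn(symbol, snr_db, rng):
    """Add complex Gaussian noise; SNR is Es/N0 for unit-power symbols (assumed)."""
    n0 = 10 ** (-snr_db / 10)
    std = math.sqrt(n0 / 2)  # standard deviation per real dimension
    return symbol + complex(rng.gauss(0, std), rng.gauss(0, std))

def demodulate(received):
    """1-nearest-neighbor decision over the QPSK constellation."""
    return min(QPSK, key=lambda bits: abs(received - QPSK[bits]))

def round_trip_ber(snr_db, n_symbols, seed=0):
    """Bit error rate after speaker -> echoer -> speaker with classic agents."""
    rng = random.Random(seed)
    keys = list(QPSK)
    errors = 0
    for _ in range(n_symbols):
        sent = rng.choice(keys)
        heard = demodulate(awgn(QPSK[sent], snr_db, rng))    # speaker to echoer
        echoed = demodulate(awgn(QPSK[heard], snr_db, rng))  # echoer to speaker
        errors += sum(a != b for a, b in zip(sent, echoed))
    return errors / (2 * n_symbols)

ber = round_trip_ber(8.4, 20000)
```

Under these assumptions the estimate at 8.4 dB is on the order of 1%. The exact value depends on the SNR convention and on the random seed.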

1) Echo with Shared Preamble Comparison
Figs. 14 and 15 compare the performance of the GNU Radio implementation to our simulations for ESP neural-clone training. Fig. 14 shows that the additional processing required to handle channel equalization and CFO correction requires 2 dB additional empirical SNR to achieve the same baseline BER performance for classic agents. In Fig. 14, the agents trained on SDRs perform slightly worse relative to the baseline than agents trained in simulation. Fig. 15 shows that learning agents train at approximately the same rate on SDRs as in simulation. Although the simulation curve comes from sampling one set of seeds over time as they train, each data point on the software radio curve comes from a separate set of seeds trained for a given amount of time. There is some variance in how many seeds eventually converge, which causes the droop in the curve around 600000 symbols transmitted. For the ESP case with neural agents, the simulated performance is similar to that obtained while using SDRs. This is evidence in support of Echo style protocols being practically implementable procedures for learning to communicate.

2) Echo with Private Preamble Comparison
Figs. 16 and 17 compare the performance of the GNU Radio implementation to our simulations for EPP neural-clone training. Apart from the additional SNR required to achieve the same baseline performance, the trained neural agents show a similar spread in final BER performance across SNRs. This is another main result of our work: EPP is successful at learning modulation schemes over the wire while using software defined radios. Fig. 17 compares convergence over training time across many trials for the GNU Radio implementation and for simulation. Clearly it takes longer for the GNU Radio agents to converge to 3 dB off of optimal BER than the simulation agents, but the final proportion of successful trials is similar. We speculate that there may be more noise in the feedback given to agents during the GNU Radio training process than in the simulation training. This could slow down convergence by reducing the consistency of feedback without reducing its average quality, i.e. some very good feedback mixed with poor feedback. Over time the good feedback would prevail, since it will be self-consistent, whereas the poor feedback will not be consistent and will eventually be averaged away. We will address this discrepancy further in future work.

Figure caption (convergence plot): The GNU Radio agents were only trained at 1% BER SNR, equivalent to SNR_dB=8 among the simulation curves. Unlike the simulation curve, which is sampled over time for one batch of agents, the GNU Radio data points come from separate batches with different seeds. The dip in performance around 600000 symbols is a result of variance in how many seeds converge, not agents losing performance after they've initially reached the performance threshold.

IX. CONCLUSION
In this work we studied whether the Echo protocol enables two agents to learn modulation schemes with minimal information sharing. We proposed a variation of the generic Echo protocol, denoted EPP (Echo with private preamble), that assumes no shared knowledge apart from knowledge of the echo protocol and the ability to perform turn taking. To evaluate the cost of minimal information sharing, we explored a range of protocols varying in the amount of information shared. We observed that reduced information sharing comes at the cost of slower convergence, meaning more symbols need to be exchanged before a good modulation scheme is learned. A learning agent when paired with a clone can robustly learn a two bits per symbol modulation scheme in $2 \times 10^3$ symbols if we allow gradient passing and in $3 \times 10^3$ symbols if we allow loss passing. If we restrict information sharing further, the number of symbols required to learn a scheme robustly goes up exponentially. Allowing only sharing of preambles takes $2.5 \times 10^4$ symbols while the case without shared preambles takes $10^5$ symbols.
Despite the increase in sample complexity, we showed that even under these minimal assumptions, agents can learn to communicate. The EPP protocol is universal, in that it allows agents of diverse types to learn to communicate with each other, and also works when one of the agents uses a fixed communication scheme.
Our results suggest that learning with "alien" agents is not intrinsically more difficult than learning with agents of the same type. For instance, with the learning agents Neural-slow and Poly-fast we observed that the clone pairings (Neural-slow and Neural-slow, Poly-fast and Poly-fast) as well as the alien pairing (Neural-slow and Poly-fast) required a very similar number of training symbols of around $7 \times 10^5$ to robustly learn a modulation scheme. However, learning to communicate with an agent that uses a fixed modulation scheme is much easier, with Neural-fast and Classic requiring only $10^4$ symbols before a good scheme is learned.
In Appendix C-A we investigated performance of the learning protocols for higher modulation orders and noticed that the difficulty of the learning task increases substantially with modulation order, and the number of preamble symbols that must be transmitted before a good scheme is learned increases exponentially. Confirming the results of others ([9], [60]), we observed that moderate levels of noise have a regularizing effect and facilitate learning but too much noise can be detrimental to the learning process.
Overall, learning modulation schemes has a high up-front cost in complexity and some cost in loss of optimality for AWGN channels, relative to a designed optimal scheme. For simple known channels it is possible to design a scheme which is provably optimal and has no cost in time spent converging to a common method. However, our goal is to extend this learning protocol to more complex channels for which optimal schemes are not known. Only if it is able to achieve near-optimal performance for a simple channel can we hope that it will also perform well on a harder channel.
This work raises some intriguing questions and opens up several exciting new avenues of research. On the universality of the learning process, one might wonder if it is always possible for two alien agents to learn to communicate with each other when each has the ability to learn to communicate with a clone. What happens when these two agents have vastly different convergence behavior in terms of how fast they learn, measured in terms of number of preamble symbols transmitted? Is learning still possible? Or is there something fundamental to the learning process that determines whether two agents can learn when paired together, so that two agents with seemingly similar convergence behavior can fail to learn to communicate with each other because their inherent learning behavior is different? What would optimal blind learning look like?
Meta-learning techniques have shown promise for decreasing the sample complexity of learning tasks and have enabled few shot learning in several applications [61], [35]. Can we apply meta-learning techniques to initialize our learning agents in a favorable state that allows them to learn to communicate with others much faster than they would when initialized at a random state?
We also hope to relax some of the most restrictive assumptions of this paper. Although the EPP protocol aims to share as little information as possible, currently we assume a fixed and known number of bits per symbol. Removing this assumption would be a further step towards a complete learning protocol. Currently a single pair of agents take turns in perfect order, but in real-world environments there are likely to be many agents with imperfect turn-taking. We would like to explore how the Echo protocol works with multiple agents, and when agents do not always echo the most recent message or even echo at all.
Can other parts of the communication pipeline, such as equalization and error correcting codes, be integrated into the learning process? Can all these processing stages be learned end-to-end, and does that provide a benefit in terms of training time or communication performance? End-to-end training might allow us to discover new communication strategies for certain types of channels that beat the current best known strategies for the channel. All these research avenues are aimed at bringing us closer to a world where a machine learning-based communication "standard" can become a reality. Such a standard would be a minimal set of guidelines which, if followed by agents, would enable them to learn how best to communicate with each other based on the current channel conditions.

APPENDIX A CODE
Our code for the Echo protocol, simulation environment, and experiment runs can be found at https://github.com/ml4wireless/echo in the ieee-paper branch. Code for the GNU Radio implementation of the Echo protocol can be found at https://github.com/ml4wireless/gr-echo [19].

A. CLASSIC
Modulator - The modulator uses a fixed strategy known to be optimal for AWGN channels (e.g., Gray coded QPSK for 2 bits per symbol, 8PSK for 3 bits per symbol, and 16QAM for 4 bits per symbol) [55].
Demodulator - The demodulator uses the 1-nearest-neighbor method to return the closest neighbor from the constellation of the corresponding optimal modulator. Essentially, the demodulator partitions the complex plane into different regions and demodulates based on which region the input to the demodulator lies in. When using classic demodulation schemes for the GP protocol we require that the output be differentiable, and here we output probabilities for each symbol by taking a softmax of the squared distance of the point to each symbol from the optimal constellation.
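The two demodulation modes can be sketched as follows. This is an illustration rather than the code in our repository; in particular, the softmax here is taken over negative squared distances so that closer symbols receive higher probability, which is an assumption about the sign convention.

```python
import math

def hard_demodulate(received, constellation):
    """1-nearest-neighbor decision: index of the closest constellation point."""
    return min(range(len(constellation)),
               key=lambda i: abs(received - constellation[i]))

def soft_demodulate(received, constellation):
    """Per-symbol probabilities from a softmax over negative squared distances
    (sign convention assumed)."""
    scores = [-abs(received - p) ** 2 for p in constellation]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# QPSK with unit average power.
qpsk = [complex(x, y) / math.sqrt(2) for x, y in [(1, 1), (-1, 1), (-1, -1), (1, -1)]]

received = 0.6 + 0.8j
hard = hard_demodulate(received, qpsk)
probs = soft_demodulate(received, qpsk)
```

The most probable symbol under the soft output coincides with the nearest-neighbor decision, so the two modes agree on hard decisions while the soft output remains differentiable in the received point.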

B. NEURAL
First we describe parameter settings that are common to both modulators and demodulators:
Network architecture: We use one-layer networks with fully connected layers with the 'tanh' activation. Input and output sizes are different for the modulator and demodulator, as described below.
Initialization: The weights for each layer are initialized by sampling from the distribution where $n$ is the number of input units to the layer; the biases are initialized as 0.01.
Optimizer: We use the Adam optimizer [62].
Next we describe the modulator and demodulator specific parameters and details about their update methods. For the rest of the section let $b$ denote bits per symbol (equivalently the modulation order).

1) Modulator
Input width: $b$. We take in input in bit format (but treat these 0-1 values as floats).
Output width: 2. The output width is fixed since it represents a complex number to be sent over the channel.
Parameters: In addition to the network weights and biases, $\theta$, we also include a separate learned parameter $\sigma$, a scalar denoting the standard deviation of the Gaussian distribution we sample from for our policy.
Modulation procedure: The neural net outputs $\mu$, the mean of the Gaussian distribution that we sample from. Note that if the input is of size $[N, b]$, $\mu$ will have size $[N, 2]$ (first dimension corresponding to the real part of a complex number and the other corresponding to the imaginary part of the complex number). While training, the modulator outputs symbols $s$ sampled from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ ($\sigma$ is bounded by minimum and maximum values), i.e. $s \sim \mathcal{N}(\mu, \sigma^2 I)$.
Update procedure: Suppose for our given actions $s$ we receive the reward $r$, the negative of the number of incorrect bits (comparing the original bit sequence to the received echo). The log probability for each action is given by, for some constant $C$. The loss function we minimize is given by, In some settings we modify the reward $r$ to include penalty terms such as one for distance of average output from origin as detailed in Appendix D. We update our parameters as, where $\eta_\mu$ and $\eta_\sigma$ denote the separate learning rate parameters for the network parameters and the standard deviation $\sigma$.
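The modulator's policy-gradient quantities can be sketched in a few lines. This is a sketch under stated assumptions, not the implementation in our repository: for a two-dimensional isotropic Gaussian policy the log-density is $\log \pi(s) = -\lVert s - \mu \rVert^2 / (2\sigma^2) - 2\log\sigma + C$, and the loss below is assumed to take the standard REINFORCE form $-\frac{1}{N}\sum_i r_i \log \pi(s_i)$ with no baseline. All function names are hypothetical.

```python
import math
import random

def sample_symbol(mu, sigma, rng):
    """Draw s ~ N(mu, sigma^2 I), with mu given as a (real, imag) pair."""
    return (rng.gauss(mu[0], sigma), rng.gauss(mu[1], sigma))

def log_prob(s, mu, sigma):
    """Log-density of s under N(mu, sigma^2 I) in 2-D, omitting the constant
    term -log(2*pi)."""
    sq_dist = (s[0] - mu[0]) ** 2 + (s[1] - mu[1]) ** 2
    return -sq_dist / (2 * sigma ** 2) - 2 * math.log(sigma)

def reward(sent_bits, echoed_bits):
    """Negative of the number of bits that differ between original and echo."""
    return -sum(a != b for a, b in zip(sent_bits, echoed_bits))

def policy_loss(symbols, means, sigma, rewards):
    """Assumed REINFORCE-style loss: -(1/N) * sum_i r_i * log pi(s_i)."""
    n = len(symbols)
    return -sum(r * log_prob(s, m, sigma)
                for s, m, r in zip(symbols, means, rewards)) / n

rng = random.Random(0)
explored = sample_symbol((0.7, 0.7), 0.1, rng)
example_loss = policy_loss([(1.0, 0.0)], [(0.0, 0.0)], 1.0,
                           [reward((0, 1), (1, 0))])
```

Minimizing this loss with respect to $\theta$ and $\sigma$ lowers the probability of sampled symbols that led to bit errors. A framework with automatic differentiation would supply the gradients, which are not computed in this sketch.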

2) Demodulator
Update procedure: Suppose after applying the softmax layer we have probability $q_{i,c}$ corresponding to the true class label of symbol $i$, $i = 1, \ldots, N$. We compute the cross-entropy loss as We update our parameters as where $\eta_\phi$ is the learning rate parameter for the demodulator updates.
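A minimal sketch of the demodulator's decision rule and loss follows, assuming the usual mean cross-entropy $-\frac{1}{N}\sum_{i=1}^{N} \log q_{i,c}$. The function names are hypothetical and the gradient step itself is left to an automatic differentiation framework.

```python
import math

def softmax(logits):
    """Convert a list of logits into class probabilities."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def demodulate(logits):
    """Return the index of the class with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

def cross_entropy(batch_logits, true_classes):
    """Mean negative log-probability assigned to the true class of each symbol."""
    n = len(batch_logits)
    return -sum(math.log(softmax(logits)[c])
                for logits, c in zip(batch_logits, true_classes)) / n

# With b = 2 bits per symbol there are 2**2 = 4 classes.
decision = demodulate([0.1, 2.0, -1.0, 0.5])
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```

For uninformative logits the loss equals $\log 4$, the entropy of a uniform guess over four classes, and it decreases as probability mass moves onto the true class.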

C. POLYNOMIAL
First we describe parameter settings that are common to both modulators and demodulators:
Network architecture: The inputs to the network are used to form a polynomial of degree $d$. We use a single fully connected linear layer to connect the polynomial terms to the output. Input and output sizes are different for the modulator and demodulator, as described below.
Initialization: The weights for each layer are initialized by sampling from the distribution U where $n$ is the number of input units to the layer; we do not use biases for polynomial agents.
Optimizer: We use the Adam optimizer [62].
Next we describe the modulator and demodulator specific parameters and details about their update methods. For the rest of the section let $b$ denote bits per symbol and $d$ the degree of the polynomial.

1) Modulator
Input width: $b$. We take in input in bit format (but treat these 0-1 values as floats).
Output width: 2. The output width is fixed since it represents a complex number to be sent over the channel.
Parameters: Internally, the input bits are used to calculate all unique polynomial terms of order $d$. Since the bits $b_i$ are in $\{0, 1\}$, terms including $b_i^2, b_i^3, \ldots$ are redundant and omitted from our calculations, thus allowing us to determine a unique maximum-degree polynomial. The polynomial terms are fed into the single fully connected layer with parameters $\theta$. We also include a separate parameter $\sigma$, a scalar denoting the standard deviation of the Gaussian distribution we sample from for our policy.
Modulation procedure: The polynomial network outputs $\mu$, the mean of the Gaussian distribution that we sample from. Note that if the input is of size $[N, b]$, $\mu$ will have size $[N, 2]$ (first dimension corresponding to the real part of a complex number and the other corresponding to the imaginary part of the complex number). While training, the modulator outputs symbols $s$ sampled from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, i.e. $s \sim \mathcal{N}(\mu, \sigma^2 I)$.
Update procedure: The update procedure for polynomial modulators is identical to the procedure for neural modulators.
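Because $b_i^k = b_i$ for bits in $\{0, 1\}$, the unique terms are products of distinct bits. A minimal sketch of this feature expansion follows; whether a constant term is included is not specified above, so the `include_constant` flag and the function name are assumptions.

```python
from itertools import combinations
from math import prod

def bit_monomials(bits, degree, include_constant=True):
    """All products of at most `degree` distinct bits.

    Powers of a single bit are skipped because b**k == b for b in {0, 1}.
    """
    start = 0 if include_constant else 1
    features = []
    for k in range(start, degree + 1):
        for idx in combinations(range(len(bits)), k):
            # The empty product (k == 0) is the constant term 1.
            features.append(float(prod(bits[i] for i in idx)))
    return features

# b = 2 bits, degree 2: terms are 1, b0, b1, b0*b1.
features = bit_monomials([1, 0], degree=2)
```

The number of terms is $\sum_{k \le d} \binom{b}{k}$, so once $d \ge b$ the expansion has one term per subset of bits and a single linear layer on top can place each of the $2^b$ constellation points independently.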

2) Demodulator
Input width: 2.
Output width: $2^b$. The demodulator is a classifier which outputs logits for each class that, on application of the softmax layer, correspond to the probabilities of the classes. The classes are the set of possible bit sequences for the modulation order.
Parameters: Internally, the input symbol is used to calculate all unique polynomial terms of order $d$ containing the real part and imaginary part of the symbol. For example, The polynomial terms are fed into the single fully connected layer with parameters $\phi$.
Demodulation procedure: The demodulation procedure is the same as for the neural agent.
Update procedure: The update procedure is the same as for the neural agent, except for an L1 penalty added to the demodulator's loss term.
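The feature expansion for a received symbol can be sketched as below, writing $x$ and $y$ for the real and imaginary parts. The ordering of terms, the inclusion of a constant term, and the L1 coefficient are assumptions made for illustration.

```python
def iq_monomials(symbol, degree, include_constant=True):
    """All monomials x**i * y**j with i + j <= degree, where x and y are the
    real and imaginary parts of the received symbol."""
    x, y = symbol.real, symbol.imag
    start = 0 if include_constant else 1
    features = []
    for total in range(start, degree + 1):
        for i in range(total, -1, -1):
            features.append(x ** i * y ** (total - i))
    return features

def l1_penalty(weights, coeff):
    """L1 regularization term added to the demodulator loss (coeff is hypothetical)."""
    return coeff * sum(abs(w) for w in weights)

# Degree 2 gives the terms 1, x, y, x^2, x*y, y^2.
features = iq_monomials(2 + 3j, degree=2)
penalty = l1_penalty([0.5, -1.5, 2.0], coeff=0.1)
```

With the constant term included there are $(d+1)(d+2)/2$ features, and an L1 penalty on the linear layer encourages most of their weights to shrink toward zero.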

APPENDIX C ADDITIONAL RESULTS
This appendix contains additional experimental results which, although not required to support our primary conclusions, we believe are of interest to anyone who wants to replicate or build upon our work. Appendix C-A shows the effects of modulation order and training SNR on the performance of the ESP and EPP protocols with clone agents. Since any learning communications system in the wild will be exposed to multiple SNR conditions and desired signalling rates, understanding performance variation across SNR and modulation order will be crucial. Our results indicate that moderately high training SNR leads to the best performance, confirming observations by others ([9], [60]). Appendix C-B presents experiments with Poly clone and self-alien agents demonstrating similar behavior to Neural clone and self-alien agents.

A. EFFECT OF MODULATION ORDER AND TRAINING SIGNAL TO NOISE RATIO
In the experiments detailed in Sec. VII we learned to modulate with 2 bits per symbol. Here we explore whether the learning protocols continue to work for higher modulation orders, i.e. more bits per symbol. We conduct experiments using the EPP protocol for a Neural agent learning to communicate with a clone for 3 and 4 bits per symbol. We compare these cases, and the 2 bits per symbol case, in Fig. 18. From Fig. 18a we observe that, at higher modulation orders, there is a larger gap between the BER curves of the learned agents and the corresponding baselines. Although some agents continue to approach the baseline BERs, as evidenced by the error bars, the median agent no longer achieves near-optimal performance at high SNRs. Fig. 18b shows that, for higher modulation orders, fewer trials learn a good modulation scheme and it takes longer to learn good schemes. From Table V we see that the increase in convergence times is exponential, with EPP requiring 2× and 24× more symbols for convergence for 8PSK and 16QAM, respectively. ESP requires 1.5× and 6× more symbols for convergence. Still, even for the highest modulation order examined (16QAM), 96% of trials eventually converge to a good scheme. This phenomenon of performance degradation with increasing modulation order is expected since the modulation functions for higher order modulation schemes are more complex.
Next, we investigate the effect of training SNR. In all other experiments we trained our agents at the SNR corresponding to 1% BER for the baseline scheme of the given modulation order. Is this the optimal SNR to train at? Does the learning protocol work at lower SNRs? We explore answers to these questions by conducting experiments using the EPP protocol for the setting where a Neural agent learns to communicate with a clone at various training SNRs corresponding to 0.001%, 0.1%, and 10% BERs for the baseline modulation scheme. From our results in Fig. 19 we observe that in all 3 settings we achieve a BER close to the QPSK baseline. It takes longer to learn a modulation scheme at lower SNRs. However, not every trial converges when trained at high SNR. This can be explained by a regularization role that noise seems to have on our learning task. At very low SNRs, some trials fail to converge but those that do converge achieve similar BERs to agents trained at higher SNRs. It is possible that the agents are taking gradient steps which are too large and being forced into local minima by steps in poor directions caused by noisy feedback. This suggests the question of whether there is a "speed limit" to how fast agents can reliably (i.e. having all trials converge to within 3 dB off optimal) learn at a given SNR. We hope to answer this question in future work.

B. POLYNOMIAL AGENT EXPERIMENTS
Here we include results using Poly agents which demonstrate the same behaviors as the Neural agents in Section VII. Figs. 20 to 22 demonstrate the effects of information sharing and that the EPP protocol works with fixed Classic and self-alien agents using polynomial function approximators.
As with Neural agents, more information sharing leads to faster training. Similarly, Classic agents speed up training, and two self-alien agents converge at a rate in between their individual convergence speeds. Fig. 23 shows the performance of several combinations of Classic, clone, and self-alien agents using the ESP protocol. The relative ordering of performance is the same as when using EPP, even though each combination trains faster with ESP.

FIGURE 18(a). Round-trip median BER curves for a Neural agent learning with a clone at modulation orders (bits per symbol) 2, 3, and 4 using the EPP protocol at training SNRs corresponding to 1% BER. Alongside the BER curves of the learned modulation schemes are the baselines QPSK (order 2), 8PSK (order 3), and 16QAM (order 4). In all cases, modulation constellations are normalized to constrain the average signal power.

FIGURE 19. Neural agents learning with a clone using EPP for different training signal to noise ratios. The BER plot (a) shows that training at higher SNR leads to lower BERs across all SNRs. From the convergence plot (b), we observe that limited noise has a regularizing effect, helping more trials to converge. Too much noise, however, has a detrimental effect and slows down convergence. Training at higher SNRs helps agents to converge more quickly, although not every trial converges.

APPENDIX D UNSUCCESSFUL CONSTELLATION CENTERING METHODS
We investigated several methods for forcing modulator constellations to be centered to avoid the training problems caused by DC offset correction in the USRP radios. Because not all of them worked, it is important to report the results for scientific integrity. The first method we investigated was to add a function, implemented in PyTorch, which calculated the center of the means output by the Gaussian policy and subtracted that value from the modulated symbols.
where $\mu_i$ are the means of each possible constellation point and $s$ is the set of complex symbols modulated by the current policy.
The rate of successfully trained trials improved while using this centering method but we discovered that QPSK constellations were often unable to split out from a pseudo-BPSK constellation, where two pairs of constellation points existed in nearly the same location. Fig. 24 shows an example pseudo-BPSK constellation reached during one training run. We hypothesize that this hard centering required two constellation points to split out in tandem, which is difficult using noisy feedback.
As an alternative to the hard centering method, we applied 'soft' centering by adding a term for the constellation center's distance from the origin to the loss function. With this soft centering we were able to achieve successful training rates similar to the baseline simulation results. We verified that setting the weight of the constellation center location loss term to infinity reproduced the behavior seen in the hard centering method above, namely that modulators reached a pseudo-BPSK constellation but were unable to split into a true QPSK constellation. Similarly, reducing the weight of the loss term to zero produced the results seen in the baseline method, where DC offset correction caused the modulators to be unable to train.
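The hard centering step can be sketched as subtracting the mean of the policy means, i.e. $s - \frac{1}{2^b}\sum_i \mu_i$, from every modulated symbol. This form is inferred from the description above and the function names are hypothetical; the original was implemented in PyTorch, whereas the sketch uses plain Python complex numbers.

```python
def constellation_center(means):
    """Mean of the policy means, one complex value per constellation point."""
    return sum(means) / len(means)

def hard_center(symbols, means):
    """Subtract the constellation center from every modulated symbol."""
    center = constellation_center(means)
    return [s - center for s in symbols]

# An off-center constellation: a QPSK-like square shifted by 0.5.
means = [1.5 + 1j, -0.5 + 1j, -0.5 - 1j, 1.5 - 1j]
centered_means = hard_center(means, means)

# A pseudo-BPSK constellation: two pairs of co-located points.
pseudo_bpsk = [1 + 0j, 1 + 0j, -1 + 0j, -1 + 0j]
unchanged = hard_center(pseudo_bpsk, pseudo_bpsk)
```

The pseudo-BPSK constellation is already centered, so the centering step leaves it untouched. Moving a single point out of a co-located pair shifts the center, and the subtraction then displaces every other point as well, which is consistent with the hypothesis that points must split in tandem.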

APPENDIX E CLASSIC MODULATION SCHEMES
Figure 25 illustrates the fixed modulation schemes used by Classic agents. These schemes are known to be optimal for AWGN channels [55].

APPENDIX F SIMULATION SETTINGS
Unless specified otherwise, the training SNR values default to the values in Table I. Testing SNR values default to those corresponding to 0.001%, 0.01%, 0.1%, 1%, and 10% BER from Table I.

FIGURE 1. Visualization of the Echo protocol. (A) Speaker Agent (A1) modulates a bit sequence and (B) sends it across an (AWGN) channel. (C) Echoer Agent (A2) receives the sequence and demodulates it. (D) A2 then modulates the recovered sequence and (E) sends it back over the channel. (F) A1 receives this echoed version of its original sequence and demodulates it. (G,H) Then A1 uses the received echo to update its modulator and demodulator. The agents switch roles and repeat until convergence. Details of the protocol are elaborated in Fig. 2 and Sec. IV-A.

FIGURE 7. Example of dB-off-optimal calculation used to determine convergence for round-trip exchange.

(a) Round-trip median BER. The error bars reflect the 10th to 90th percentiles across 50 trials. All agents are evaluated at the same SNR but error bars have been dithered for readability. Legend: EPP Neural-fast and Neural-fast; ESP Neural-fast and Neural-fast; GP Neural and Neural; LP Neural-fast and Neural-fast; QPSK Baseline.
(b) Convergence of 50 trials to be within 3 dB at testing SNR 8.4 dB. Axis label: Fraction of Trials within 3 dB.

FIGURE 8. Effect of Information Sharing, Neural-fast-and-Neural-fast: The BER plot (a) shows that all protocols achieve BER close to that of QPSK baseline.From the convergence plot (b), we observe that EPP is much slower than ESP, which in turn is an order of magnitude slower than LP and GP.Protocols with greater information sharing lead to faster convergence.

1) Learning with fixed: An agent learning to communicate with a fixed agent. (Neural-fast, Neural-slow, Poly-fast, Poly-slow and Classic.)
2) Learning with clone: An agent learning to communicate with its clone. (Neural-fast and Neural-fast, Neural-slow and Neural-slow, Poly-fast and Poly-fast, Poly-slow and Poly-slow.)
3) Learning with self-alien: An agent learning to communicate with its self-alien. (Neural-fast and Neural-slow, Poly-fast and Poly-slow.)
4) Learning with alien: An agent learning to communicate with an alien. (Neural-slow and Poly-fast, Neural-fast and Poly-slow.)

(a) Round-trip median BER. The error bars reflect the 10th to 90th percentiles across 50 trials. All agents are evaluated at the same SNR but error bars have been dithered for readability.
(b) Convergence of 50 trials to be within 3 dB at testing SNR 8.4 dB. Legend: EPP Neural-fast and Classic; ESP Neural-fast and Classic; GP Neural and Classic; LP Neural-fast and Classic. Axis label: Fraction of Trials within 3 dB.

FIGURE 9. Learning to communicate with a fixed agent, Neural-fast-and-Classic: The BER plot (a) shows that all protocols achieve BER close to that of the QPSK baseline. From the convergence plot (b), we observe that EPP and ESP have similar convergence behavior and are an order of magnitude slower than LP and GP. For the ESP and EPP protocols, convergence is much faster than when learning with a clone (Fig. 8).

(a) Round-trip median BER. The error bars reflect the 10th to 90th percentiles across 50 trials. All agents are evaluated at the same SNR but error bars have been dithered for readability. Legend: Neural-fast and Neural-fast; Neural-fast and Neural-slow; Neural-slow and Neural-slow; QPSK Baseline.
(b) Convergence of 50 trials to be within 3 dB at testing SNR 8.4 dB. Axis label: Fraction of Trials within 3 dB.

FIGURE 10. Learning with clones (Neural-fast-and-Neural-fast, Neural-slow-and-Neural-slow) compared to learning with self-alien (Neural-fast-and-Neural-slow) using the EPP protocol. The BER plot (a) shows that the round-trip BER for Neural-slow-and-Neural-slow is slightly lower than the others but all BERs are close to the QPSK baseline. From the convergence plot (b), we observe that Neural-fast-and-Neural-fast is much faster than Neural-slow-and-Neural-slow. However, pairing Neural-fast with Neural-slow helps the latter learn faster, resulting in a convergence time for Neural-fast-and-Neural-slow between those of the clone pairings.

(a) Round-trip median BER. The error bars reflect the 10th to 90th percentiles across 50 trials. All agents are evaluated at the same SNR but error bars have been dithered for readability. Legend: Neural-slow and Neural-slow; Neural-slow and Poly-fast; Poly-fast and Poly-fast; QPSK Baseline.
(b) Convergence of 50 trials to be within 3 dB at testing SNR 8.4 dB. Axis label: Fraction of Trials within 3 dB.

FIGURE 11. Learning with clones (Neural-slow-and-Neural-slow, Poly-fast-and-Poly-fast) compared to learning with an alien (Neural-slow-and-Poly-fast) using the EPP protocol. Here the Neural-slow and Poly-fast agents have similar convergence behavior when paired with a clone. The BER plot (a) shows that in all cases, the BER achieved is close to the QPSK baseline. From the convergence plot (b), we observe that when both clone pairings show similar convergence behavior, the alien pairing also has the same behavior. Learning to communicate with an alien is not intrinsically more difficult than learning to communicate with a clone.

(a) Round-trip median BER. The error bars reflect the 10th to 90th percentiles across 50 trials. All agents are evaluated at the same SNR but error bars have been dithered for readability. Legend: Neural-fast and Neural-fast; Neural-fast and Poly-slow; Poly-slow and Poly-slow; QPSK Baseline.
(b) Convergence of 50 trials to be within 3 dB at testing SNR 8.4 dB. Axis label: Fraction of Trials within 3 dB.

FIGURE 12. Learning with clones (Neural-fast-and-Neural-fast, Poly-slow-and-Poly-slow) compared to learning with an alien (Neural-fast-and-Poly-slow) using the EPP protocol. Here the Neural-fast and Poly-slow agents have vastly different convergence behavior when paired with a clone. The BER plot (a) shows that in all cases, the BER achieved is close to the QPSK baseline. From the convergence plot (b), we observe that when the clone pairings have vastly different convergence behavior, the alien pairing shows convergence behavior somewhere in between. All alien pairings converged to within 3 dB off optimal, even though the Poly-slow with clone failed at least once. The robustness of the Neural-fast learner may have helped the Poly-slow agent converge more reliably.

FIGURE 14. Round-trip median bit error curves for Neural-and-Clone Python simulation and GNU Radio agents learning QPSK under the ESP protocol at training SNRs corresponding to 1% BER. Alongside the bit error curves of the learned modulation schemes is the baseline. In all cases, modulation constellations are constrained as detailed in Section VIII-B. Although an extra 2 dB of SNR is required to achieve the same baseline performance due to processing losses in the GNU Radio implementation, the trained agents show only slightly greater loss in performance against the baseline than the pure simulation agents.

FIGURE 16. Round-trip median bit error curves for Neural-and-Clone Python simulation and GNU Radio agents learning QPSK under the EPP protocol at training SNRs corresponding to 1% BER. Alongside the bit error curves of the learned modulation schemes are the baselines. In all cases, the average signal power of the modulation constellations is constrained as detailed in Section VIII-B. Although an extra 2 dB of SNR is required to achieve the same baseline performance due to processing losses in the GNU Radio implementation, the trained agents show similar loss in BER performance compared to the baseline.

Output width: $2^b$. The demodulator is a classifier that outputs logits for each class; after the softmax layer is applied, these correspond to the probabilities of the classes. The classes are the set of possible bit sequences for the modulation order.

Parameters: The network weights and biases, denoted as φ.

Demodulation procedure: Given input of size $[N, 2]$, the neural net outputs logits of shape $[N, 2^b]$. After the softmax operation is applied, these correspond to a probability distribution over classes. The demodulated symbols, $\hat{p}$, are computed by choosing the class with the highest probability, $\hat{p} = \arg\max(\mathrm{softmax}(\mathrm{logits}))$.
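The demodulation procedure above can be illustrated with a minimal sketch. It uses only the Python standard library and assumes the logits are already available as a list of N rows with $2^b$ entries each; the function names `softmax`, `demodulate`, and `class_to_bits`, and the MSB-first mapping from class index to bit sequence, are illustrative assumptions and are not taken from the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def demodulate(logits_batch):
    """Return, for each row of 2**b logits, the index of the most probable class."""
    symbols = []
    for row in logits_batch:
        probs = softmax(row)
        symbols.append(max(range(len(probs)), key=probs.__getitem__))
    return symbols

def class_to_bits(index, b):
    """Hypothetical labelling: write the class index as b bits, MSB first."""
    return [(index >> (b - 1 - k)) & 1 for k in range(b)]

# Example with b = 2 (four classes) and N = 3 received symbols.
logits = [[2.0, 0.1, -1.0, 0.3],
          [0.0, 0.5, 3.2, 0.1],
          [-0.4, -0.2, -0.3, 1.5]]
p_hat = demodulate(logits)
bits = [class_to_bits(i, 2) for i in p_hat]
```

Because softmax is monotonic, taking the arg max of the raw logits would select the same class; the softmax is kept here only to mirror the description.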
(b) Convergence of 50 trials to within 3 dB of the corresponding baseline (at the testing SNR corresponding to 1% BER) for Neural-fast-and-Neural-fast EPP trials trained at the SNR corresponding to 1% BER, for increasing modulation order (QPSK, 8PSK, 16QAM). The vertical axis is the fraction of trials within 3 dB. 16QAM, with the highest modulation order (4), takes much longer to converge than QPSK (order 2) and 8PSK (order 3).

FIGURE 18. Neural agents learning with a clone using the EPP protocol for different modulation orders. The BER plot (a) shows that the gap between the median BER of the learned scheme and the corresponding baseline increases for higher modulation orders. From the convergence plot (b), we observe that higher modulation orders take exponentially longer to converge to a good strategy.
(a) Round-trip median BER curves for a Neural agent learning with a clone using the EPP protocol at training SNRs of 13.0, 8.4, and 4.2 dB, corresponding to 0.001%, 0.1%, and 10% BERs for the baseline. The error bars reflect the 10th to 90th percentiles across 50 trials. All agents are evaluated at the same SNR, but error bars have been dithered for readability.

(b) Convergence of 50 trials to within 3 dB at a testing SNR of 8.4 dB for training SNRs of 13.0, 8.4, and 4.2 dB. The vertical axis is the fraction of trials within 3 dB. Training at higher SNR reduces the number of symbols required for most trials to converge.

FIGURE 19. Neural agents learning with a clone using EPP for different training signal-to-noise ratios. The BER plot (a) shows that training at higher SNR leads to lower BERs across all SNRs. From the convergence plot (b), we observe that a limited amount of noise has a regularizing effect, helping more trials to converge. Too much noise, however, has a detrimental effect and slows down convergence. Training at higher SNRs helps agents to converge more quickly, although not every trial converges.

FIGURE 23. Learning under the ESP protocol. The BER plot (a) shows that all agent pairings achieve a BER close to that of the QPSK baseline. From the convergence plot (b), we observe that learning with fixed agents is faster than learning with clones for both Neural and Poly agents. The alien pairing, Neural-fast and Poly-fast, has convergence times in between those of the individual clone pairings. The Neural agent helps the Poly agent learn faster when the two are paired together.

FIGURE 24. An example of a pseudo-BPSK constellation reached during one training run with the GNU Radio EPP implementation. Two pairs of constellation points sit antipodally, as in a BPSK constellation, but there is not enough separation between the constellation points within each pair to reliably demodulate the correct bit sequences.

FIGURE 25. Figures (a) through (c) show the fixed, optimal modulation used by Classic models; (d) through (f) show the corresponding demodulation boundaries.
In the GP protocol, the preamble $p$ is modulated and sent from Agent 1 through the channel to Agent 2, where it is demodulated as $\hat{p}$. Using the shared preamble, the modulator of Agent 1 and the demodulator of Agent 2 are updated using the cross-entropy loss.
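The loss used in this update can be sketched as follows. This is a minimal illustration, assuming the preamble is stored as integer class indices in the range 0 to $2^b - 1$ and that the demodulator of Agent 2 produces one row of $2^b$ logits per received symbol. The function name `cross_entropy` and the example values are hypothetical, and the gradient step that would follow is omitted.

```python
import math

def cross_entropy(logits_batch, targets):
    """Mean cross-entropy between softmax(logits) and integer class targets."""
    total = 0.0
    for row, t in zip(logits_batch, targets):
        m = max(row)
        # log of the softmax normaliser, computed stably (log-sum-exp)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # negative log-probability assigned to the true class
        total += log_z - row[t]
    return total / len(targets)

# Shared preamble (known to both agents) as class indices, b = 2.
preamble = [0, 2, 3, 1]

# Logits a demodulator might produce for the four received symbols.
confident = [[5.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 5.0, 0.0],
             [0.0, 0.0, 0.0, 5.0],
             [0.0, 5.0, 0.0, 0.0]]
uninformed = [[0.0, 0.0, 0.0, 0.0] for _ in preamble]

loss_confident = cross_entropy(confident, preamble)
loss_uninformed = cross_entropy(uninformed, preamble)
```

A demodulator that assigns equal logits to every class incurs a loss of $\ln 2^b$ per symbol, so a loss below that value indicates that the modulator and demodulator pair has started to convey information about the preamble.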

TABLE I. SNRs corresponding to round-trip BER values for the modulation orders we investigate. The SNR-to-BER mappings are used to set test and train SNRs for performance measurements. The SNR corresponding to 1% BER (shaded column) is the default training SNR for our experiments.
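As a rough illustration of how such an SNR-to-BER mapping could be computed for QPSK, the sketch below assumes Gray-coded QPSK with ideal coherent detection, SNR defined as $E_s/N_0$, and independent bit errors on the two legs of the round trip, so that a bit is wrong after the round trip only if it is flipped on exactly one leg. These are illustrative assumptions and may not match the exact definitions used to build the table, so the resulting values may differ slightly from the tabulated ones. All function names are hypothetical.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def qpsk_one_way_ber(snr_db):
    """Bit error rate of Gray-coded QPSK over AWGN, with SNR taken as Es/N0 (assumed)."""
    snr = 10.0 ** (snr_db / 10.0)
    return q_function(math.sqrt(snr))

def qpsk_round_trip_ber(snr_db):
    """Probability that a bit is flipped on exactly one of two independent legs."""
    p = qpsk_one_way_ber(snr_db)
    return 2.0 * p * (1.0 - p)

def snr_for_round_trip_ber(target_ber, lo_db=-10.0, hi_db=30.0, tol=1e-6):
    """Bisection search for the SNR (in dB) giving the target round-trip BER."""
    while hi_db - lo_db > tol:
        mid = 0.5 * (lo_db + hi_db)
        if qpsk_round_trip_ber(mid) > target_ber:
            lo_db = mid   # too many errors: move to higher SNR
        else:
            hi_db = mid
    return 0.5 * (lo_db + hi_db)

# SNR at which the assumed model gives a 1% round-trip BER.
snr_1pct = snr_for_round_trip_ber(0.01)
```

The bisection relies on the round-trip BER decreasing monotonically with SNR, which holds under this model because the one-way BER stays below 0.5.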

TABLE II. Number of symbols exchanged before ≥ 90% of trials reached 3 dB off of optimal BER at 8.4 dB test SNR for the GP and LP protocols with various combinations of modulator and demodulator type and 2 bits per symbol.

TABLE III. Number of symbols exchanged before ≥ 90% of trials reached 3 dB off of optimal BER at 8.4 dB test SNR for the ESP and EPP protocols with various agent types and 2 bits per symbol. An order of magnitude more symbols have to be exchanged before the learning agents converge compared to the GP and LP protocols in Table II. Learning with a fixed agent (Classic) is much easier than learning with a clone, with convergence happening 1.3 to 3.3× faster for ESP and 3.8 to 6× faster for EPP. The extra shared information in ESP seems to compensate for the increased difficulty of learning with another learning agent.

TABLE IV. Average execution times for Neural agent training update and GNU Radio wrapper processing. The additional processing required for transmission over USRP radios is about 1/3 of the total execution time. These times do not account for the additional latency of moving data between components of the GNU Radio processing chain.

TABLE V. Number of symbols exchanged before ≥ 90% of trials reached 3 dB off of optimal BER for the ESP and EPP protocols with Neural agents and varying bits per symbol (BPS) and SNR. The results show the increased difficulty of learning at higher modulation orders. The EPP protocol is impacted much more than ESP by high modulation order, requiring 24× more symbols for 16QAM than for QPSK, compared to 6× for ESP. (* The experiments comparing performance with training SNR were conducted using a different set of hyperparameters, found in Table XIV in Appendix G.) Lower training SNRs take longer to converge.
Table VI describes other simulation settings such as the number of iterations, preamble length, and testing frequency.

TABLE VI. Experiment settings for the different protocols and modulation orders. (* Because Neural-fast-and-Neural-fast converged so fast, we only trained for 500 iterations in order to adequately sample the convergence curve.)