Toward More Human-Like AI Communication: A Review of Emergent Communication Research

In the recent shift towards human-centric AI, the need for machines to accurately use natural language has become increasingly important. While a common approach to achieve this is to train large language models, this method presents a form of learning misalignment where the model may not capture the underlying structure and reasoning humans employ in using natural language, potentially leading to unexpected or unreliable behavior. Emergent communication (EmCom) is a field of research that has seen a growing number of publications in recent years, aiming to develop artificial agents capable of using natural language in a way that goes beyond simple discriminative tasks and can effectively communicate and learn new concepts. In this review, we present EmCom under two aspects. Firstly, we delineate all the common properties we find across the literature and how they relate to human interactions. Secondly, we identify two subcategories and highlight their characteristics and open challenges. We encourage researchers to work together by demonstrating that different methods can be viewed as diverse solutions to a common problem and emphasize the importance of including diverse perspectives and expertise in the field. We believe a deeper understanding of human communication and human-AI trust dynamics is crucial to developing machines that can accurately use natural language in human-machine interactions.


Introduction
In the initial phase of AI research following the second AI winter, the focus was on identifying new areas where AI could outperform humans, with famous examples including chess [Silver et al., 2018], Go [Silver et al., 2016], and StarCraft [Vinyals et al., 2019]. While this was a limited application to games, it set the tone for research to prioritize building AI agents with superhuman capabilities. However, over the last decade, the research community has witnessed a shift towards a human-centric approach that aims to leverage AI to aid humans in everyday tasks and relieve them of repetitive duties [Xu, 2019, Riedl, 2019, Shneiderman, 2021].
The interaction between humans and machines is a crucial aspect of human-centric AI [Mikolov et al., 2016], and it should take place in domains with which humans are already familiar and that require little to no training. Therefore, applications that involve niche practices, such as coding and mathematics, should be avoided in favor of language-based applications. In particular, human-machine communication should be grounded in natural language, which presents the challenge of teaching artificial agents to communicate in multiple languages. Recent advances in natural language processing (NLP) have led to the emergence of the transformer architecture [Vaswani et al., 2017], which has become the preferred approach for language-based applications, as exemplified by Language Models (LMs) such as GPT-3 [Brown et al., 2020], LLaMA [Touvron et al., 2023], and LaMDA [Thoppilan et al., 2022].
One of the challenges for language model architectures is their focus on predicting the next word in a sentence rather than comprehending the broader context and purpose of language usage. While humans use language as a tool for coordination and communication to thrive in a shared environment, artificial intelligence may struggle to understand the subtleties and complexities of language fully.

Emergent Communication
Linzen [2020] investigated the phenomenon of learning misalignment in addressing challenges associated with natural language processing. This phenomenon has been a driving force behind the development of the Emergent Communication (EmCom) field [Wagner et al., 2003]. To explain EmCom, we refer to King [2009], where the author defines an emergent communication strategy as a communication construct derived from the interaction between reader/hearer response, situated context, and discursive patterns. As the author states, the definition is derived from a plethora of works spanning from business strategy to emergence in organizational complexity theory and other communication theories. Building on the previous definition, we define EmCom as the research area involving learning to communicate by interacting with other agents to solve collaborative tasks in complex and diverse environments.
General description
All these fields focus on the interaction between agents, whether artificial or human. These interactions typically occur in a specific context, referred to as an environment, defined by a set of rules that mimic certain aspects of the real world. Agents can perceive the environment through observation and alter their states by performing specific actions.
Two fields heavily influence this pipeline in artificial intelligence: Multi-Agent Systems (MAS) and Reinforcement Learning (RL), which have shown promise in mimicking aspects of human society and mind in AI systems [Yang et al., 2018]. MAS techniques can model social interactions and coordination, while RL captures human learning and decision-making aspects. By merging these two approaches, researchers in EmCom aim to create environments that more effectively replicate the complexity and adaptability of human-like behavior.
These environments often incorporate game mechanics, which have been shown to significantly impact the learning process in the animal kingdom [Spinka et al., 2001], the development of social skills in children [Tahmores, 2011], and educational outcomes [Kirriemuir and McFarlane, 2004]. Furthermore, RL's primary application is in games, making EmCom heavily reliant on game frameworks.
Objectives
As mentioned, Emergent Communication (EmCom) intersects with numerous other fields, as illustrated in Figure 1. Consequently, various objectives may arise depending on the specific research question being addressed. In this regard, EmCom can be considered more as a framework than a field in itself, and as with many other scientific disciplines, it is challenging to delineate precise boundaries. Given the involvement of multiple fields and the predominance of Deep Neural Network (DNN) works in the literature [Lazaridou and Baroni, 2020], there are numerous potential research questions. However, one general overarching objective can be identified as the development of an artificial language that evolves to resemble human language within an artificial setting [Lazaridou et al., 2016]. Despite this broad objective, several questions emerge, such as: How should the similarity between artificial and human languages be defined? Which characteristics of human language do we want to develop? And how should the artificial setting be structured?

In this work, we attempt to address these questions and more by concentrating on a sub-objective of EmCom, namely human-machine interaction. Indeed, when the broader goal is to create artificial agents that communicate using a language similar to humans, these agents can be employed for interactions with humans. As a result, this literature review primarily focuses on this perspective, although we occasionally cover other aspects and goals of EmCom as well.

Methodology
In conducting this review, we employed a two-stage methodology for selecting the papers included in our analysis. During the initial phase of our study, we adopted an exploratory approach, navigating through the emergent communication literature by following the references cited in the papers we read and identifying prominent names in the field. While this approach may not adhere to a strict methodology, it facilitated a broad understanding of the state of the art.
To complement and refine our initial selection, we used Connected Papers [Eitan et al., 2020] during the second stage of our methodology. This tool allowed us to visualize the network of references and connections between the works we had already covered, identifying the most interconnected and influential papers. Moreover, it enabled us to spot additional influential publications not encountered in the first phase, ensuring a more comprehensive and robust field review.

Contribution
While the following review aims to be accessible to a broad audience, it is worth noting that a background in reinforcement learning and natural language processing is recommended to comprehend the technical details fully.
With that being said, our goal is twofold:
• First, we address new researchers interested in this new and exciting field. Our work can be seen as an introduction to EmCom, where we include relevant literature from the past years and give a general overview of challenges and methodologies.
• Second, we address researchers already involved in the field. We connect several pieces of work apparently unrelated to each other under the broader umbrella of human-machine interaction. We advocate that EmCom would benefit from more interconnection with other fields such as linguistics, cognitive science, and sociology (Figure 1). By all means, this benefit would also be reciprocal.

Paper Structure
This review is structured as follows. In Section 2, we identify four common characteristics in the EmCom literature and point out their parallels to human interactions: the game environment, Sec. 2.1; the learning paradigm, Sec. 2.2, analyzing the different learning methodologies that can be found in EmCom; interaction types, Sec. 2.3, defining the possible configurations between agents in a shared environment; and Theory of Mind, Sec. 2.4, where agents are aware of other intelligent entities in the environment and actively try to model their cognitive states.
Next, we introduce two main categories of EmCom: Machine-centered EmCom, in Section 3.1, dealing with artificial emergent languages (AELs) through disentangled pre-defined representations, and Human-centered EmCom, in Section 3.2, whose characteristic is to use Human Natural Language (HNL), e.g. English, in artificial settings. These sections also provide a large collection of related works from various researchers and an analysis of the state of the art.
Finally, in Section 4, we provide a brief summary and a complete table of referenced papers, Table 2, which includes the categories each paper falls under, corresponding to the properties discussed throughout this work.

Common Properties
In Section 1, we mentioned how EmCom spans across multiple fields with different characteristics, each concerned with a different aspect of human communication. In this section, we define a set of intrinsic properties of the examined literature and point out the connections to diverse aspects of human interactions.

Game environment
Game design is an essential aspect of emergent communication research, as it defines the environment in which agents interact and communicate with each other. The literature presents a multitude of game environments, each tailored to address specific research questions. In this review, we identify two fundamental properties that feature prominently in the design of game environments used in emergent communication studies.
Firstly, the role of communication is a crucial aspect of game design and can take on varying degrees of importance depending on the research objectives. Communication can either serve as the game's primary objective or as an auxiliary tool to help agents achieve other goals. For instance, in some studies, the focus is on the emergence of a communication protocol between agents, while in others, agents are tasked with coordinating their actions to accomplish a shared goal, and communication merely aids in their cooperation and coordination.
Secondly, the choice of input representation is another critical aspect of game design that can significantly impact the emergence of language. In particular, how information is represented and presented to agents can influence the type of language that emerges from their communication. For example, representing images as raw pixels or as a bag of attributes can affect the level of ambiguity in the emergent language and the agents' performance in completing the task.

Role of Communication
Communication can be either the primary goal or a supporting feature in EmCom, which we categorize as either Communication-focused or Communication-assisted. It is worth noting that Communication-focused games are a subset of Communication-assisted settings, as the latter require non-communicative actions in addition to communicative ones.

Communication-focused
In the first category of communication games, communication functions as both the method employed by agents and the ultimate goal of the study. Although this may seem like a simplified version of human language, it is observed in the animal world, as seen in the referential gestures exhibited by ravens [Allee et al., 1949].

Referential Games
The most well-known game in this category is the discrimination Referential game (Ref), Figure 2, first introduced by Lewis [1969]. This game involves a sender and a receiver with distinct roles, each presented with a set of images. The sender must generate a message to describe a given image, which the receiver must then identify from a pool of images. The first instance in the RL community can be found in Das et al. [2017], which compares a synthetic world made up of primitive geometries to a real-world image using a visual dialog system. Although there are several variations of the discrimination referential game, as illustrated in Table 1, the primary focus of research has been analyzing the language that emerges from it. This has been demonstrated in various studies, including [Yuan et al., 2020, Rodriguez et al., 2019, Graesser et al., 2019, Li and Bowling, 2019, Dagan et al., 2020, Chaabouni et al., 2020, Havrylov and Titov, 2017, Lazaridou et al., 2017, Wang et al., 2021].

Table 1: Different types of referential games.
Type of Ref. game | Description
Discrimination | The receiver has to discriminate between a set of stimuli, comprised of the target stimulus observed by the sender/speaker and some additional distractor stimuli, and find the target.
Generative | The receiver has to generate an output, which can, for instance, be the task of reconstructing the target stimulus itself or some of its (symbolic) attributes.
Multi-step | The game's rules are the same as the discrimination game, but the receiver can choose to ask the sender for more information, which will start another step in the game. The game ends when the receiver chooses a target or after a fixed number of steps.
Multi-modal | The sender and receiver have access to different modalities: usually either vision or textual information.
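To make the sender/receiver protocol of the discrimination game concrete, the following is a minimal sketch with hand-written (not learned) agents. The stimulus set, the attribute-index message code, and all function names are illustrative assumptions rather than the setup of any cited work; in the literature both agents are typically neural networks trained on the game reward.

```python
import random

# Hypothetical symbolic stimuli: every (shape, color) combination.
SHAPES = ("circle", "square", "triangle")
COLORS = ("red", "green", "blue")
STIMULI = [(s, c) for s in SHAPES for c in COLORS]

def informative_sender(target):
    """Encode the target as one symbol per attribute (assumed code)."""
    shape, color = target
    return (SHAPES.index(shape), COLORS.index(color))

def constant_sender(target):
    """Baseline that ignores the target and always sends the same message."""
    return (0, 0)

def receiver(message, candidates):
    """Decode the message and pick the matching candidate, else the first one."""
    decoded = (SHAPES[message[0]], COLORS[message[1]])
    return decoded if decoded in candidates else candidates[0]

def play_round(sender, n_distractors, rng):
    """One round: the sender sees the target, the receiver sees message + candidates."""
    target = rng.choice(STIMULI)
    distractors = rng.sample([s for s in STIMULI if s != target], n_distractors)
    candidates = distractors + [target]
    rng.shuffle(candidates)  # hide the target position from the receiver
    message = sender(target)
    return receiver(message, candidates) == target

def accuracy(sender, rounds=500, n_distractors=3, seed=0):
    """Fraction of rounds in which the receiver identifies the target."""
    rng = random.Random(seed)
    return sum(play_round(sender, n_distractors, rng) for _ in range(rounds)) / rounds
```

Calling `accuracy(informative_sender)` and `accuracy(constant_sender)` contrasts a fully informative code with an uninformative one; the gap between the two is the kind of signal that task success provides during training.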
Task and Talk
In contrast, Kottur et al. [2017] developed a question-answering game called Task and Talk (TnT) where a sender bot is given an object unseen by the receiver. The receiver must ask the sender questions to determine two of its attributes. Unlike Ref, the TnT architecture is more dynamic and iterative, with a question-and-answer format that more closely resembles human communication. However, this architecture also introduces complications in the training procedure, such as agents retaining memory of previous conversations [Sally, 1995]. Another difference is that Kottur et al. and others [Liang et al., 2020, Cogswell et al., 2019] focus on objects defined by predetermined properties rather than perceptually realistic inputs, distinguishing between symbolic and realistic inputs.
Communication-assisted
Unlike the previous Communication-focused games, this second category utilizes communication as a means to achieve a goal different from the communicative act itself. The agents' action space includes both communication signals and other game-dependent actions, which range from physics simulators [Grover et al., 2018, Mordatch and Abbeel, 2018], navigation tasks [Das et al., 2019, Lowe et al., 2017, Eccles et al., 2019, Chaabouni et al., 2019a, Zhu et al., 2021], and negotiation settings [Bachrach et al., 2020, Cao et al., 2018, Chen et al., 2020] to social deduction games [Brandizzi et al., 2021, Nakamura et al., 2016, Jaques et al., 2018]. These games aim to recreate the environment in which language emerged, emphasizing the view that human language did not emerge as a goal itself but rather as a means of coordinating actions between humans. While Communication-assisted games may be considered closer to real human interaction, recent works typically rely on one-shot communication signals rather than dialogue systems.

Input representation
Input representation is a crucial aspect of language emergence in artificial communication systems. How information is encoded and presented to agents can significantly affect the type of language that emerges from their communication. In particular, differences in input representations can profoundly impact the learnability and generalizability of emergent languages. In this section, we explore several studies investigating the effects of different input representations on the emergence of language in artificial communication systems.
The impact of input type on emergent communication in a referential game is significant, as demonstrated by Lazaridou et al. [2018]. The authors conducted two referential games with different input types: one using symbolic representation, where objects were represented as a bag of attributes, and the other using raw pixels. They observed a high level of ambiguity in the raw pixel input game due to the difficulty of the exploration task. To address this, they developed an experiment where distractors were selected from a target-specific context distribution reflecting normalized object co-occurrence statistics. The results showed an above-random performance in generalization, tied to language compositionality, with a high topographic measure suggesting that similar objects received similar messages. In the raw pixel input experiment, the authors investigated which image attributes were primarily captured by the agent to complete the task. They found that the encoder-decoder architecture overfits the dataset, resulting in an unstable protocol.
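The topographic measure mentioned above is commonly computed as the Spearman correlation between pairwise distances in the input space and pairwise distances in the message space. The sketch below is one way to compute it with the standard library only; the use of Hamming distance on both sides, the fixed-length messages, and the toy languages are assumptions made for illustration, not the exact protocol of the cited study.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def average_ranks(values):
    """1-based ranks, with tied values sharing the average rank of their group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / (vx * vy) ** 0.5

def topographic_similarity(meanings, messages):
    """Spearman correlation between meaning distances and message distances."""
    pairs = list(combinations(range(len(meanings)), 2))
    d_meaning = [hamming(meanings[i], meanings[j]) for i, j in pairs]
    d_message = [hamming(messages[i], messages[j]) for i, j in pairs]
    return pearson(average_ranks(d_meaning), average_ranks(d_message))

# Toy data: 3x3 attribute grid, one symbol per attribute.
meanings = [(shape, color) for shape in range(3) for color in range(3)]
compositional = list(meanings)                 # message mirrors the attributes
permutation = [3, 7, 1, 8, 0, 5, 2, 6, 4]      # arbitrary reassignment
holistic = [meanings[p] for p in permutation]  # same messages, scrambled
```

With these toy languages, `topographic_similarity(meanings, compositional)` is at its maximum, while the scrambled assignment scores far lower, which is the sense in which a high value indicates that similar objects receive similar messages.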
Furthermore, other studies have also examined the impact of input representation on emergent communication. For example, Guo et al. [2019] investigate the effect of different input types, such as image representation, a concatenation of one-hot vectors representing the count of each object type, and a bag of one-hot vectors denoting the quantity of different object types. They demonstrate a significant relationship between input design and language learnability, revealing the emergence of compositionality in the first two cases. Similarly, Yuan et al. [2020] apply their setups to two different datasets, a symbolic dataset called a number set and a 3D Object dataset, encoded with one-hot vectors and a convolutional neural network, respectively. The latter showed easier learning and faster convergence, suggesting a reduced possibility space compared to the symbolic dataset.
Additionally, Denamganaï and Walker [2020] examine the correlation between the structure of input and generalization abilities in a referential game. They experiment with the number of attributes in the dSprites dataset [Higgins et al., 2017], providing results for 2, 3, and 4 attributes and visual representations. Their results support the hypothesis advanced by Chaabouni et al. [2020], that generalization occurs naturally when the input space is large, although they do not experiment with multiple input structures simultaneously.
While human communication in a referential game appears to rely solely on visual information, the human mind can access additional tools when referring to objects in the real world. Human categorization [Anderson, 1991, Wierzbicka, 1984] plays a crucial role in the semantic structure of language and has been linked to hierarchical relationship attribution between super-ordinate and lower-level categories [Merriman, 1991]. Although the works mentioned above examine the differences between inputs in the form of visual or categorical representations, no effort is made to utilize more input representations at once.

Information bottleneck
The input space plays a significant role in the emergence of language, as demonstrated by previous studies. When the input space is large enough, agents tend to develop a more generalized language. This aspect relates to the human mind and language, as discussed by Zaslavsky et al. [2018]. They argue that language is used to compress ideas into words efficiently, and this compression involves a trade-off between lexical complexity and accuracy. The authors conducted a color-naming game across human participants and demonstrated that languages achieve near-optimal efficiency based on the information bottleneck (IB) principle [Tishby et al., 2000]. The information bottleneck effect has been studied in multi-agent communication with message pruning [Mao et al., 2019], limited bandwidth [Wang et al., 2019], and message entropy [Kharitonov et al., 2020]; however, few studies have taken into account its evolutionary advantage.
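For reference, the IB principle of Tishby et al. [2000] is usually stated as the optimization below. The symbol names are a notational assumption on our part: $X$ is the source variable, $T$ its compressed representation, $Y$ the variable whose information should be preserved, and $\beta \geq 0$ sets the trade-off.

```latex
\min_{q(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

Read in the lexical setting described above, the first term plays the role of complexity (how much of the input the words retain) and the second that of accuracy (how well the intended meaning can be recovered), so increasing $\beta$ favors more informative but more complex vocabularies.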
A study by Kirby et al. [2015] explores the same line of work on human participants, where simulated rational learners were tested to validate the trade-off between expressiveness and compressibility under different constraints. Simulating cultural evolution, they showed how a hierarchical organization of language emerges where learners experience pressure on both the learning and communication side.
Similarly, Kottur et al. [2017] carried out a TnT reference game involving two agents and investigated the necessary constraints for a generalized language to emerge. They discovered that a limited vocabulary size and memory-less models fostered the development of a language in which individual symbols were grounded in attributes.
According to Resnick et al. [2020], there is a connection between learnability, capacity, bandwidth, and the use of structured language for language learning. They hypothesize that learning compositional communication requires less capacity than learning a non-compositional code, shedding new light on the problem of artificial language learning. Unlike other works that advocate for deeper and larger architectures to emulate the human brain, Resnick et al. identify the problem as the overpowering ability of machines to memorize input spaces.
In conclusion, the input space and the trade-off between informativeness and complexity are essential for language emergence. A recent study by Tucker et al. [2022] introduces the Vector-Quantized Variational Information Bottleneck (VQ-VIB) method, which combines task-specific utility maximization with general communicative constraints. VQ-VIB agents can adapt to changing communicative needs, develop meaningful embedding spaces, and demonstrate improved utility and faster convergence rates. This framework offers new insights into human language evolution and artificial emergent communication, paving the way for future research in complex domains and human-agent interactions.

Open Challenges
Drawing from the studies discussed thus far, particularly those by Resnick et al. [2020] and Kottur et al. [2017], we observe that artificial agents possess exceptional memory capabilities. These capacities enable agents to discover communicative shortcuts, resulting in degenerate languages that are more akin to holistic (one-to-one mapping-like) languages rather than compositional ones. A common solution emerging in the literature is to increase the environmental complexity, i.e., the input space, towards more human-like environments. Indeed, the majority of approaches discussed in this and subsequent sections primarily focus on implementing more challenging environments rather than examining sub-optimal architectures. Nevertheless, enhancing the environmental complexity and limiting an agent's communication capacity are complementary solutions to deter machines from excessive memorization. Although the former is a valid approach, it introduces supplementary variables into the system, which must be considered when analyzing the resultant language. We argue that researchers in this field should emphasize preventing memorization from both perspectives, giving increased attention to the utilization of smaller, more manageable neural networks that necessitate generalization in order to solve the task.
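The difference between a holistic and a compositional language can be illustrated with a deliberately tiny example. The objects, symbols, and function names below are hypothetical; the point is only that a memorized one-to-one table has nothing to say about an unseen attribute combination, whereas a code built from per-attribute symbols does.

```python
# Hypothetical per-attribute symbols for a compositional code.
SHAPE_SYMBOL = {"circle": "a", "square": "b"}
COLOR_SYMBOL = {"red": "x", "blue": "y"}

# Three of the four possible objects are seen during training.
TRAIN_OBJECTS = [("circle", "red"), ("circle", "blue"), ("square", "red")]
HELD_OUT = ("square", "blue")

def compositional_message(obj):
    """Build the message from one symbol per attribute."""
    shape, color = obj
    return SHAPE_SYMBOL[shape] + COLOR_SYMBOL[color]

# A holistic code memorizes one arbitrary message per training object.
HOLISTIC_TABLE = {obj: f"m{i}" for i, obj in enumerate(TRAIN_OBJECTS)}

def holistic_message(obj):
    """Look the object up; returns None for anything that was not memorized."""
    return HOLISTIC_TABLE.get(obj)
```

Here `compositional_message(HELD_OUT)` still yields a well-formed message, while `holistic_message(HELD_OUT)` returns `None`. An agent with enough capacity can solve the training task with either code, which is why task success alone does not reveal which one emerged.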
Furthermore, as mentioned in the introduction, this review focuses on the improvement of human-machine communication. To achieve this goal, developing machines with similar generalization requirements could be beneficial. An intriguing research question arises: how can we design learning paradigms and neural networks that inherently favor generalization over memorization?
From this inquiry, we can also consider whether merely expanding the input size is sufficient to achieve generalization. This may be explored by drawing parallels with human learning in relatively simple tasks: do humans resort to memorization when possible, or is generalization an inherent aspect of human learning? Addressing these questions may help researchers devise strategies to encourage generalization over memorization in artificial agents, leading to the emergence of more structured and organized languages that better reflect human-like learning processes.

Learning paradigm
In EmCom research, exploring the vast space of possible communication utterances, or state space, can be too complex for simple networks to handle. Deep neural networks have become a popular choice for training agents on a discrete set of symbols for communication. To train DNNs, multiple learning frameworks can be deployed, the main ones being reinforcement and supervised learning.

Reinforcement Learning
Reinforcement learning is a crucial approach in the field of emergent communication for two main reasons. First, it allows artificial agents to learn from interactions with a game environment, which mirrors the human ability to adapt to changing circumstances. In this context, it is essential to differentiate between single-agent and multi-agent reinforcement learning, as the latter introduces additional complexities arising from the need for coordination and communication among multiple agents.
The second motivation for utilizing reinforcement learning in EmCom is its ability to facilitate backpropagation through symbols. Indeed, training agents to communicate through a set of discrete symbols presents a challenge: backpropagation is difficult due to the non-differentiability of the variables involved. This problem arises specifically because researchers aim to model human language on a discrete, predominantly word-based level, which captures the essential structure and characteristics of natural language.
To overcome this, various techniques have been developed, including the reparameterization trick (such as VQ-VIB [Tucker et al., 2022]), semantic hashing [Salakhutdinov and Hinton, 2009, Kaiser and Bengio, 2018], Gumbel Softmax [Jang et al., 2017, Maddison et al., 2017] and the REINFORCE algorithm. These methods enable backpropagation through non-differentiable variables, allowing for effective training of communication networks.
REINFORCE
The REINFORCE algorithm [Williams, 1992] is a well-known method for estimating loss function gradients with respect to stochastic policy parameters, and it has been used in many works in the field of EmCom to backpropagate through symbols, as shown in Table 2. Its simplicity and effectiveness have made it a popular choice, but it can suffer from high variance and instability during training. To mitigate these issues, various modifications have been proposed, such as the use of baselines [Mnih and Gregor, 2014, Gu et al., 2016].
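As a rough illustration of how REINFORCE sidesteps the non-differentiable sampling step, the sketch below trains a single softmax policy over discrete symbols with the score-function update and a running-average baseline. The one-symbol task, the reward function, the learning rate, and the baseline decay are all illustrative assumptions; EmCom systems apply the same estimator to neural sender and receiver networks.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, reward_fn, baseline, lr, rng):
    """Sample a symbol, observe its reward, and apply one REINFORCE update."""
    probs = softmax(logits)
    symbol = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(symbol)
    advantage = reward - baseline  # subtracting a baseline reduces variance
    # Gradient of log pi(symbol) w.r.t. the logits is onehot(symbol) - probs.
    new_logits = [
        l + lr * advantage * ((1.0 if i == symbol else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, reward

def train(n_symbols=5, good_symbol=3, steps=2000, lr=0.5, seed=0):
    """Learn to emit the one rewarded symbol (hypothetical toy task)."""
    rng = random.Random(seed)
    logits = [0.0] * n_symbols
    baseline = 0.0
    reward_fn = lambda s: 1.0 if s == good_symbol else 0.0
    for _ in range(steps):
        logits, reward = reinforce_step(logits, reward_fn, baseline, lr, rng)
        baseline = 0.9 * baseline + 0.1 * reward  # running average of rewards
    return softmax(logits)
```

Note that no gradient ever flows through the sampled symbol itself: the update only needs the log-probability of the sample and a scalar reward, which is what makes the method applicable to discrete messages and also what makes it noisy.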
Gumbel Soft-max & Concrete distribution
As noted earlier, previous research on stochastic gradient estimation has primarily focused on addressing the high variance issues of the REINFORCE algorithm by augmenting it with Monte Carlo variance reduction techniques or biased path derivative estimators for Bernoulli variables [Bengio et al., 2013]. Until the recent introduction of the Gumbel Soft-max and Concrete distributions [Jang et al., 2017, Maddison et al., 2017], no gradient estimator had been specifically designed for categorical variables to facilitate backpropagation through symbols. Unlike one-hot encoding of categories, which does not provide a gradient, the Gumbel Soft-max and Concrete distributions provide a continuous relaxation of the categorical distribution, with a noise component, as shown in Figure 3. Additionally, the relaxation intensity can be regulated with the parameter λ, where the Concrete distribution converges to the categorical distribution as λ approaches zero.
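The relaxation can be sketched in a few lines: Gumbel noise is added to the logits and the result is passed through a softmax whose temperature corresponds to λ. This is a standard-library illustration of the sampling step only (no automatic differentiation), and the function names and example logits are assumptions made for the example.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample; `temperature` plays the role of lambda."""
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]

def argmax_frequencies(logits, temperature, n_samples, seed=0):
    """Empirical distribution of the most likely symbol across samples."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n_samples):
        sample = gumbel_softmax_sample(logits, temperature, rng)
        counts[max(range(len(sample)), key=lambda i: sample[i])] += 1
    return [c / n_samples for c in counts]

# Logits of a categorical distribution with probabilities 0.7, 0.2, 0.1.
LOGITS = [math.log(0.7), math.log(0.2), math.log(0.1)]
```

Each sample is a point on the probability simplex rather than a hard symbol, so it is differentiable with respect to the logits, while `argmax_frequencies(LOGITS, 0.1, 5000)` stays close to the underlying categorical probabilities. Lower temperatures push samples towards one-hot vectors at the cost of higher-variance gradients.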
On one hand, this kind of relaxation has been reported to achieve better performance than REINFORCE in several works [Havrylov and Titov, 2017, Omidshafiei et al., 2019, Chaabouni et al., 2020, Zhu et al., 2021, Rita et al., 2022, Chen et al., 2020, Kharitonov et al., 2020]. For instance, Havrylov and Titov [2017] demonstrated a positive correlation between message length and communication protocol convergence speed, not observed in REINFORCE-like algorithms. They also combined REINFORCE and Gumbel-softmax, and showed improved loss function gradient estimation and game setting differentiability, resulting in a structured, hierarchical encoding scheme with faster convergence than standard RL frameworks.
On the other hand, one could question whether allowing the gradient to flow directly through the communication channel bears any resemblance to human cognition. Studies in cognitive science and neuroscience have suggested that human communication and learning processes might not rely on such direct optimization methods, but rather on more intricate processes, such as episodic memory [Buzsáki and Tingley, 2018], analogy [Gentner et al., 2001], and social learning [Rumjaun and Narod, 2020]. As a result, the extent to which these relaxation techniques can inform our understanding of human language evolution and cognition remains an open question.

Supervised Learning
Supervised learning (SL) is another common learning framework used in EmCom. Unlike RL, SL uses labeled data. In communication games with HNL (usually English), supervised learning can be used for tasks such as language pre-training [Das et al., 2017, Cogswell et al., 2020, Li et al., 2020], distribution shift mitigation [Lu et al., 2020a, Hawkins et al., 2020, Lu et al., 2020b], and visual classification [Lazaridou et al., 2017, 2020]. However, in games with symbolic languages, SL is rarely used, with some exceptions such as the modified reference game presented in Graesser et al. [2019], where RL is used to coordinate agents through a communication signal, while SL is used to estimate a probability distribution for captions.
Supervised learning is also used in works that incorporate Theory of Mind, Section 2.4, which equips agents with a prediction module to estimate other agents' beliefs and future actions. In these cases, SL can be used to predict actions given current observations [Jaques et al., 2018, Raileanu et al., 2018, Jaques et al., 2019, Rodriguez et al., 2019] or coupled with the obverter technique to influence policy based on an agent's own understanding [Choi et al., 2018, Bogin et al., 2018].

Open challenges
The field of EmCom involves a diverse range of learning techniques. Given the interactive and game-like nature of the environment, reinforcement learning plays a central role, as can be seen from Table 2. RL enables the emergence of novel behaviors by exploring and exploiting the system's intrinsic dynamics. Consequently, EmCom draws heavily on human experience, specifically on our understanding of dynamic real-world systems and our ability to adapt to them. Supervised learning is also utilized in certain cases, particularly for auxiliary tasks, as it mirrors the human ability to learn through demonstration. The capacity of SL to learn from demonstration complements RL, which can build upon knowledge and experiences gained by previous generations without always starting from scratch (see Iterative Learning in Section 2.3.2). A common challenge in these cases is striking the right balance between supervised language tasks and reinforcement-based referential games, as discussed in Section 3.2.2.
To improve human-machine communication, bridging the gap between human and machine learning paradigms may be a promising approach. This requires the integration of multiple learning aspects within a single framework as well as exploring new learning paradigms. However, in the literature analyzed for this review, a surprisingly limited number of studies explore alternative learning approaches, such as unsupervised learning [Grover et al., 2018, Mao et al., 2019], evolution strategies [Dagan et al., 2020], stochastic computational graphs [Kharitonov et al., 2020] and self-supervised learning [Dessì et al., 2021].
Moving forward, future research should prioritize investigating these alternative learning methods and their potential synergies with existing paradigms. By expanding the scope of learning techniques employed in the field, researchers may uncover novel strategies that more effectively mimic human learning processes, ultimately leading to enhanced human-machine communication and collaboration. Additionally, interdisciplinary approaches that draw from cognitive science, sociology, and linguistics may provide valuable insights to inform the development of more human-like learning algorithms.

Interaction type
In this section, we categorize the types of interactions among agents and study their contributions.
We differentiate between inner and outer interaction types. Given a set $A_N$ of $N$ agents, we define a team $T_m \subseteq A_N$ as a subset of agents, and we denote by $\{T_1, \dots, T_k\}$ a partition of all the agents into $k$ disjoint teams, with $\sum_{i=1,\dots,k} |T_i| = N$.
Given a partition of agents (i.e., a set of teams), we define a game to be cooperative when any negotiation between teams leads to a sub-optimal outcome $S_{so}$ for all the parties involved. While each team actively tries to reach an optimal state $S_o$, through negotiation it can avoid the worst (pessimistic) outcome $S_p$, where $S_o < S_{so} < S_p$. On the other hand, a competitive, zero-sum game always implies the worst outcome for all teams but one. In the latter, negotiation becomes ineffective, and each team must prevail over the others.
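The team structure defined above can be checked mechanically. The following is a minimal sketch, assuming agents are represented by string identifiers and teams by Python sets; the names `is_partition`, `agents`, and `teams` are illustrative and do not come from the surveyed works.

```python
from itertools import chain

def is_partition(agents, teams):
    """Return True if `teams` are non-empty, pairwise disjoint, and cover `agents`."""
    members = list(chain.from_iterable(teams))
    no_empty = all(len(team) > 0 for team in teams)
    disjoint = len(members) == len(set(members))   # no agent appears in two teams
    covers = set(members) == set(agents)           # every agent belongs to a team
    return no_empty and disjoint and covers

agents = {"a1", "a2", "a3", "a4"}          # hypothetical agent set with N = 4
teams = [{"a1", "a2"}, {"a3"}, {"a4"}]     # k = 3 disjoint teams

valid = is_partition(agents, teams)
total = sum(len(team) for team in teams)   # equals N whenever the partition is valid
```

With a single team containing every agent, the setting reduces to pure inner cooperation; with several teams, outer cooperation or competition becomes possible.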
In the following section, we introduce three categories that focus on the spatial aspect of human interaction, distinguished by the number of agents or teams involved. However, humans also engage in interactions across the temporal dimension, not only through the transfer of information and experiences from generation to generation but also in turn-taking during conversations. This temporal aspect leads to a type of interaction known as iterative learning, which contrasts with the spatially-oriented categories. To visually represent these differences, Figure 4 illustrates the various interaction types across both spatial and temporal dimensions, emphasizing the need to consider both aspects in EmCom research.

Space-oriented
We define inner cooperation as the setting where all the agents of one team are cooperative with one another. On the other hand, outer cooperation is defined as two or more disjoint teams of agents sharing a common goal, thus being incentivized to cooperate. In contrast, competition arises in zero-sum games, where one team's victory implies a pessimal outcome for all other teams. In these cases, negotiation is not useful.

Inner Cooperation
Cooperation is crucial for communication to arise [Smith, 2010, Nowak and Krakauer, 1999]. As seen in Table 2, most works focus on inner cooperation. For instance, Lazaridou et al. [2017] highlight the importance of cooperation in referential games for successful communication. Moreover, Cao et al. [2018] study how pro-social agents favor cheap talk and achieve better results than selfish ones. Along the same lines, Graesser et al. [2019] suggest that intricate language evolution can emerge from simple social interactions between agents.
Outer Cooperation
Outer cooperation in emergent communication can occur in two different ways. The first is the standard instance of outer cooperation where disjoint teams collaborate by interacting with the environment and each other, as explored in various works [Evtimova et al., 2018, Lowe et al., 2017, Bachrach et al., 2020, Chen et al., 2020].
A distinctive characteristic of this method is that every agent is initialized at the same time and has no prior experience of the environment. In contrast, population learning involves training teams together before introducing them to other teams.
This setting is influenced by linguistics and sociology, specifically by the study of language development in different cultures [Briscoe, 2002, Kirby et al., 2014]. Studies have shown that a population of agents generalizes better than a pair [Tieleman et al., 2019] and that the amount of group connectivity determines the evolution of mutually intelligible languages [Graesser et al., 2019]. Moreover, the interaction between populations of agents can lead to the emergence of a language that is easier to teach and understand [Li and Bowling, 2019, Lowe et al., 2019a, Fitzgerald, 2019, Zhu et al., 2021, Wang et al., 2021, Cogswell et al., 2019].
Competition
Competition can lead to improved communication protocols and general performance, as shown by Liang et al. [2020]. They suggest that competition among agents can prioritize compositionality, performance, and convergence in communication protocols. Similarly, Nakamura et al. [2016] set up a social deduction game where agents must infer the trustworthiness of others through interactions and hard-coded communication actions. Brandizzi et al. [2021] also study this setup, but in a canonical EmCom environment without communication constraints. Despite highly adverse scenarios, such as incomplete information and strategic deception, the authors show that communicating agents can still defeat opponents through effective communication and collaboration.

Time-oriented
Notably, the first three categories focus on the spatial component of human connection, using the number of agents/teams to characterize their differences. However, humans exist in a temporal dimension where information and experiences are passed from generation to generation, leading to a type of interaction called iterative learning, which is diametrically opposed to those previously addressed.
Iterative Learning
The concept of iterative (or iterated) learning is closely linked to population learning and is heavily inspired by the field of language development and evolution. Iterative learning occurs when a population of agents passes on its knowledge to a new one, repeating indefinitely.
In linguistics, iterative learning is compared to a bottleneck in the learning system, which enables generalization [Kirby, 2001, Nowak and Krakauer, 1999, Scott-Phillips and Kirby, 2010]. Studies such as Kirby et al. [2014] and Graesser et al. [2019] have explored iterative learning's role in the emergence of natural language, demonstrating that it can shift languages towards consistency with prior biases and that, with linguistic contact over time, languages converge to a lower-complexity majority protocol.
The transmission of language across generations requires older agents to teach younger ones. To investigate learnability and generalization in referential games, several works have focused on the aspect of teaching. Li and Bowling [2019] implement referential games in which old agents are periodically swapped with new ones and receivers have their parameters reset periodically. Ren et al. [2020] train new agents on data generated by older agents.
Both of these approaches have shown a strong correlation between ease of teaching and the speed of convergence to a generalized language.
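The generational transmission described above can be illustrated with a toy loop. This is a minimal sketch under strong simplifying assumptions, not the training procedure of any cited work: agents are lookup tables rather than neural networks, the older agent shows only `bottleneck` meanings per generation, and a fresh agent guesses random messages for meanings it never observed. All names (`teach`, `learn`, `iterate`) are hypothetical.

```python
import random
from collections import Counter, defaultdict

def teach(language, bottleneck, rng):
    """Older agent emits (meaning, message) pairs for a random subset of meanings."""
    shown = rng.sample(sorted(language), bottleneck)
    return [(meaning, language[meaning]) for meaning in shown]

def learn(data, meanings, alphabet, length, rng):
    """Fresh agent: copy the most frequent observed message, guess for unseen meanings."""
    counts = defaultdict(Counter)
    for meaning, message in data:
        counts[meaning][message] += 1
    language = {}
    for meaning in meanings:
        if counts[meaning]:
            language[meaning] = counts[meaning].most_common(1)[0][0]
        else:
            language[meaning] = "".join(rng.choice(alphabet) for _ in range(length))
    return language

def iterate(language, generations, bottleneck, alphabet="ab", length=2, seed=0):
    """Pass a language through successive generations of freshly initialized agents."""
    rng = random.Random(seed)
    meanings = sorted(language)
    history = [language]
    for _ in range(generations):
        data = teach(history[-1], bottleneck, rng)
        history.append(learn(data, meanings, alphabet, length, rng))
    return history

initial = {"red circle": "aa", "red square": "ab",
           "blue circle": "ba", "blue square": "bb"}
history = iterate(initial, generations=5, bottleneck=3)
```

In this sketch, meanings that pass through the bottleneck are copied faithfully, while the rest are re-invented. A learner with an inductive bias would instead generalize from the observed pairs, which is where the pressure towards learnable languages is usually argued to come from.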
Interestingly, Zhou et al. [2022] draw a connection between parameter reset and the lottery ticket hypothesis [Frankle and Carbin, 2018], where the authors regard forgetting as a valuable part of iterative learning and point out its usefulness in language evolution [Barrett and Zollman, 2009].
Focusing more on the teaching aspect, Omidshafiei et al. [2019] define explicit teacher-student roles and demonstrate how their teaching agents not only learn significantly faster but also learn to coordinate in tasks where existing methods fail. Interestingly, there is a parallel between teaching in multi-agent communication and multi-agent transfer learning [da Silva et al., 2020], as suggested by Omidshafiei et al. [2019]. This connection highlights the potential for cross-fertilization between these fields and the importance of further exploring their interconnections. Indeed, reinforcement learning is a commonly used training paradigm for iterative learning, with Self Play [Tesauro, 1994] being the preferred approach [Lowe et al., 2020, Graesser et al., 2019, Gupta et al., 2019]. Other approaches include seeded iterative learning [Lu et al., 2020b,a] and evolution strategies [Dagan et al., 2020].
In conclusion, these works demonstrate how language structures are transmitted across generations of learning agents and refined with each subsequent iteration, resulting in more efficient, communicable, and learnable languages [Ren et al., 2020, Tieleman et al., 2019, Chaabouni et al., 2019a, Cogswell et al., 2019].

Open challenges
As evident from the literature, there is a considerable focus on iterative and population learning, as they appear to impose the necessary constraints for the emergence of easily learnable languages. However, there is limited work on investigating the effects of competition and, more specifically, on balancing cooperation and competition. Exploring the optimal balance between cooperation and competition in mixed human-robot teams presents a significant challenge that could improve the overall performance and effectiveness of such teams.
Furthermore, non-verbal communication is fundamental in human-human communication, and although slightly beyond the scope of emergent communication, developing artificial agents that can both interpret and produce non-verbal cues remains an open challenge for enhancing human-machine communication.

Theory of Mind
The concept of Theory of Mind (ToM) is a crucial aspect of human behavior that has been modeled in the EmCom field. It refers to our ability to form beliefs about how others might react to certain stimuli and update them with new observations, as shown in various studies [Gopnik and Wellman, 1992, Premack and Woodruff, 1978]. More recently, Rabinowitz et al. [2018] applied ToM to let artificial agents build a model of other agents' observations and behavior.

Figure 5: Illustration of Theory of Mind in artificial agents, with two panels (Modeling Agents, Influencing Others): Agent 2 must choose between pizza and gelato. In the modeling agents approach, Agent 1 predicts Agent 2's choice based on their preferences or past behavior. In the influencing others approach, Agent 1 takes action to influence Agent 2 to select a specific option.
In EmCom, there are two main approaches to augment agents with ToM: (i) modeling agents, where artificial agents actively model others' behavior to some extent; and (ii) influencing others, an extension of the former in which agents manipulate other agents' behavior based on their own objectives (see Figure 5).

Modeling Agents
The concept of modeling other agents is well-established in the field of multi-agent reinforcement learning. However, machines require specific formulations to approximate the same behavior as humans do.
One way to approach this is to leverage the similarities between agents' belief systems. This approach, called the obverter technique, has been shown to be effective for the emergence of compositional languages [Choi et al., 2018, Bogin et al., 2018]. Interestingly, the obverter technique bears a striking resemblance to the Rational Speech Act (RSA) [Frank and Goodman, 2012, Goodman and Frank, 2016], a prominent linguistic framework that models how speakers and listeners use reasoning to communicate effectively. This parallel between the obverter technique and RSA further validates the interdisciplinary nature of EmCom, as it demonstrates the potential for cross-pollination between computer science and linguistics.
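To make the recursive reasoning in RSA concrete, the sketch below runs one round of the standard listener-speaker-listener recursion on a toy reference game. The objects, utterances, uniform prior over objects, rationality parameter of 1, and absence of utterance costs are all illustrative assumptions, and the code is not an implementation of the obverter technique.

```python
def normalize(scores):
    """Rescale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}

# Hypothetical reference game: three objects, four single-word utterances.
objects = ["blue square", "blue circle", "green square"]
utterances = ["blue", "green", "square", "circle"]

def is_true(utterance, obj):
    """Literal semantics: an utterance is true of an object if it names one of its features."""
    return utterance in obj.split()

# Literal listener L0: uniform over the objects an utterance is true of.
L0 = {u: normalize({o: float(is_true(u, o)) for o in objects}) for u in utterances}

# Pragmatic speaker S1: favors utterances under which L0 recovers the intended object.
S1 = {o: normalize({u: L0[u][o] for u in utterances}) for o in objects}

# Pragmatic listener L1: Bayesian inversion of S1 under a uniform prior over objects.
L1 = {u: normalize({o: S1[o][u] for o in objects}) for u in utterances}
```

Under these assumptions, the literal listener splits "blue" evenly between the two blue objects, whereas the pragmatic listener assigns 0.6 to the blue square: a speaker who meant the blue circle would more likely have said "circle". The speaker here reasons through a model of the listener, which is the point of contact with modeling another agent's beliefs.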
Alternatively, mental models can also be based on other agents' actions and perceptions without assuming similar belief systems. For instance, Raileanu et al. [2018] augment agents' policy with predictions of other agents' behavior and demonstrate that agents can learn better policies using their estimates of other players' goals in cooperative and competitive situations. However, this work does not consider environments where communication is present.
Several studies focusing on communication in artificial agents model their mental states and adjust communication protocols accordingly. For example, Lowe et al. [2017] describe how agents adapt to each other when trained in conjunction, and this finding led to the study of agents who can reason about other agents and adjust the communication protocol accordingly [Andreas and Klein, 2016, Hawkins et al., 2020, Zhu et al., 2021].

Rodriguez et al. [2019] let agents model the conceptual understanding of others by switching partners with different properties. Grover et al. [2018] split the representation learning into two parts: a generative embedding simulates an agent's policy, while a discriminative one distinguishes one agent from another. These works demonstrate how agent modeling allows communication to quickly adapt and specialize to the task at hand, but this specialization can lead to languages that are difficult to interpret by humans (see Section 3.2.3).
The identity of the recipient in a multi-agent environment can be just as important as the message being conveyed. For instance, Das et al. [2019] introduce TarMAC, a targeted multi-agent communication architecture that enables agents to choose their communication target using soft attention. This method assigns high attention weights when both the sender and receiver predict similar signature and query vectors. The authors evaluate their approach in four environments, including cooperative and competitive settings, and show improved performance and faster convergence across all scenarios. This finding opens new possibilities for increasing multi-agent system performance without requiring signal sharing between each agent.
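The signature-query matching described above can be sketched as scaled dot-product attention. The following is an illustrative approximation and not the TarMAC implementation: vectors are hand-picked rather than predicted by a network, the scaling by the square root of the signature dimension is an assumption, and `targeted_aggregate` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def targeted_aggregate(signatures, values, queries):
    """For each receiver, weight every sender's value by softmax(query . signature)."""
    scale = math.sqrt(len(signatures[0]))
    weights, inbox = [], []
    for query in queries:
        alpha = softmax([dot(query, sig) / scale for sig in signatures])
        weights.append(alpha)
        # Aggregated message: attention-weighted sum of the senders' value vectors.
        inbox.append([sum(a * val[d] for a, val in zip(alpha, values))
                      for d in range(len(values[0]))])
    return weights, inbox

# Two senders and two receivers; each receiver's query matches one sender's signature.
signatures = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
queries = [[4.0, 0.0], [0.0, 4.0]]
weights, inbox = targeted_aggregate(signatures, values, queries)
```

Because each query aligns with exactly one signature, each receiver's attention concentrates almost entirely on the matching sender, so the message is effectively addressed even though every sender broadcasts.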

Influencing Others
Foerster et al. [2018] take the next step by introducing the Learning with Opponent Learning Awareness (LOLA) framework, which models the opponent's policy and attempts to actively influence it. As a result, higher-order LOLA emerges, in which agents are aware that opponents are trying to influence them, resulting in computationally expensive third-order derivatives.
A number of agent modeling efforts focus on leveraging ToM to steer the behavior of other agents. For example, Zhu et al. [2021] present a referential game where a speaker interacts with a population of agents with different linguistic abilities and uses model-agnostic meta-learning [Finn et al., 2017] to improve the prediction accuracy of the listeners' actions. Similarly, Hawkins et al. [2020] formulate the problem as a continual learning framework and test the ToM model in real-time interactions with humans, finding a significant increase in the probability of a correct response with successive repetitions.
In addition, Jaques et al. [2018] investigate how agents with intrinsic social motivation, equipped with a ToM framework, can steer the decisions of other agents in two sequential social dilemmas. They report above state-of-the-art performance when their agent is equipped with a ToM model and influences the behavior of other agents, leading to effective emergent communication protocols. Furthermore, Xie et al. [2020] build an RL environment where the agent employs an encoder-decoder architecture to model the actions of a human in a mixed human-robot setting and uses it to approximate the human policy to maximize the total discounted reward.

Open challenges
While the majority of the literature in the EmCom field focuses on Multi-Agent Systems (MAS) with artificial agents only, there are a few notable papers that explore mixed human-robot teams [Jaques et al., 2019, Hawkins et al., 2020, Xie et al., 2020]. To enhance human-machine communication, it is crucial to involve humans in the loop and investigate mixed teams' dynamics further [Brandizzi and Iocchi, 2022]. The presence of a human in the system can provide valuable insights into how artificial agents can better adapt to human behavior, preferences, and expectations. The above-mentioned papers serve as pioneering examples in this direction, and future research should build upon these foundations by expanding the investigation into various domains and settings. This may include, for instance, exploring different communication modalities, incorporating diverse human characteristics, and adapting to changing human-agent team compositions.
Finally, as artificial agents become more capable of understanding and influencing human behavior, ethical and privacy concerns will arise. It is crucial for the research community to consider these aspects when developing new methodologies and frameworks. For example, ensuring that artificial agents do not exploit vulnerabilities in human decision-making or manipulate human users for unintended purposes is vital to maintaining trust and safety.

Dichotomy of Emergent Communication
In Section 2, we outlined the common properties of EmCom literature, which underscored the similarities between works and how they relate to human interaction. In contrast, in this section, we aim to define two distinct categories of works that investigate language from opposite perspectives.
The first category is Machine-centered EmCom (Mac-EmCom), where Artificial Emergent Languages (AELs) without pre-defined (linguistic) structures are considered. The goal of Mac-EmCom is to identify the environmental, architectural, and structural factors required for natural language properties to emerge. We define this approach as bottom-up, meaning that it begins with an emergent artificial language and gradually develops it into a human-like natural language over time.
The second category, Human-centered EmCom (Hum-EmCom), emphasizes the use of natural language. In this approach, agents are provided with knowledge of a human natural language (HNL), typically English, which they then learn to apply in dynamic environments. We define Hum-EmCom to be a top-down approach, where agents begin with the necessary knowledge to speak a language and learn how to use it for cooperative behavior in a multi-agent system.
Both methodologies are essential for understanding the dynamics that govern the emergence of language in artificial environments. By examining the strengths and weaknesses of each category, we can gain a more comprehensive understanding of what conditions and factors are necessary for human language to emerge. In fact, many papers cover both methodologies simultaneously, demonstrating their complementary nature.

Machine-centered EmCom
This section is concerned with exploring the properties of emergent languages, rather than their interpretability by non-expert humans. The literature discussed in this section is focused on AELs without any direct mapping to HNL. This step is essential in understanding the differences in structure and learning between human and machine languages.
Although there may be some similarities with Section 3.2, we will highlight the differences between the two sub-fields to provide a clear distinction between them.

Characteristics
In Machine-centered EmCom, languages are composed of symbols and numerical vectors, without any direct correspondence to HNL. As a consequence, there is no requirement for a direct mapping between a symbol and a meaning. Therefore, this approach provides greater flexibility in selecting symbols and facilitates the examination of structural and learning differences between human and machine languages.
Regarding communication channels, similarity to human communication is not a prerequisite. Utterances are introduced as a discrete set of symbols that the agents must map to other modalities, such as visual input. However, some research has investigated the use of continuous communication channels, offering an interesting approach to examining what a non-bottleneck form of interaction might look like. [..., 2018], DIAL takes advantage of the artificial setting by allowing gradients to flow through the communication channel. The authors show that DIAL is capable of achieving faster convergence than RIAL, demonstrating that gradients provide a more robust and richer source of information. Similar results are also reported in [Mahaut et al., 2023, Sukhbaatar et al., 2016, Kong et al., 2017].

Non-verbal communication
In addition to verbal communication, humans also make use of non-verbal communication strategies, such as gestures and signs, to convey meaning. While most Machine-centered EmCom research focuses on symbolic signaling, some studies have explored the role of non-verbal communication in the emergent language.
For instance, Bullard et al. [2020] investigated emergent non-verbal communication in embodied agents within high-dimensional simulated environments. They designed a referential game in which agents produced a sequence of limb motions in a simulated 3D world. By providing explicit latent features, such as an energy-based structure, the agents were able to generalize to novel patterns.
Similarly, Mordatch and Abbeel [2018] explore the emergence of language in agents embodied in a physics simulator and notice how non-verbal communication, such as pushing, pointing, and guiding, arises as a by-product.
In contrast, Mihai and Hare [2021] focused on sketching as a form of non-verbal communication. By leveraging the differentiability of the drawing procedure, they developed a referential game and demonstrated how agents could communicate effectively. With the appropriate inductive bias, the drawings became interpretable by humans, although the authors were unsure if this was due to the visual pretrained network bias or if the agents captured some fundamental generalization of visual perception.

In a similar vein, Qiu et al. [2021] created a referential game where agents used sketches as a medium of communication.
Unlike Mihai and Hare [2021], Qiu et al. [2021] used a framework more akin to Task and Talk, where the sender continuously improved the sketch until the receiver was ready for prediction. The authors reported successful communication and developed a set of evaluation metrics inspired by cognitive science. They showed how mutual adaptation and sequential decision-making could encourage symbolicity, defined as the consistent separability of drawings in high-level visual embeddings, which facilitated easy categorization of drawings by new communication participants.
Although this research is not directly related to Mac-EmCom, it can provide new insights into communication in general, given that a significant portion of human communication (70% to 93%) is non-verbal [Mehrabian et al., 1971, Mehrabian, 2017]. Furthermore, non-verbal communication is essential to human-robot interaction [Vasconez et al., 2019, Bacim et al., 2012], although such interactions are often programmed manually.

Hunt for Generalization
The field of Mac-EmCom, as well as EmCom more broadly, strives to achieve a set of desirable features when emergent language is developed. These features include learning meaningful token representations [Tucker et al., 2021] and achieving non-trivial compositionality [Steinert-Threlkeld, 2020], which ultimately enables the language to generalize to new concepts and ideas without having encountered them before. In this field, generalization is often studied in conjunction with compositionality, the idea that the meaning of a complex expression is determined by the meanings of its constituents and the rules governing their combination [Frege, 1892]. However, it is important to note that achieving these features, especially compositionality, can be challenging and requires careful interpretation of the results.
Study of compositionality
Compositionality is a highly desirable property in both human and artificial languages [Baroni, 2020], as it allows for the generalization of concepts and ideas [Pelletier, 1994, Janssen and Partee, 1997]. However, defining compositionality can be challenging [Korbak et al., 2020, Andreas, 2019], and its necessity for generalization has been debated in the literature [Kharitonov and Baroni, 2020]. In their study, Chaabouni et al. [2020] investigate input reconstruction in a simplified signaling game. Their findings suggest that although compositionality is not a strict requirement for generalization, its presence considerably enhances the learning speed and accuracy of newly introduced agents.
Other works, such as Choi et al. [2018] and Kottur et al. [2017], attribute the emergence of compositionality to environmental constraints rather than specific model architecture, highlighting the importance of constructing appropriate settings for language development. For example, Korbak et al. [2019] employ curriculum learning, gradually increasing the difficulty of referential games, and report an emergence of compositionality as measured by topographic similarity and zero-shot generalization accuracy. Similarly, the introduction of iterative learning (see Section 2.3) in the language development process has been shown to lead to increasingly compositional languages with each generation [Cogswell et al., 2019, Ren et al., 2020, Tieleman et al., 2019, Chaabouni et al., 2019a]. The complex, dynamic, and chaotic environments in which languages can emerge offer a rich foundation for language development [Larsen-Freeman, 1997]. Consequently, a significant portion of the research efforts is focused on creating settings that closely resemble real-world conditions.
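Topographic similarity, mentioned above as a proxy for compositionality, is commonly computed as the rank correlation between pairwise distances in meaning space and in message space. The sketch below is one possible implementation under stated assumptions: meanings are attribute tuples compared by Hamming distance, messages are strings compared by edit distance, and the correlation is Spearman's. Individual papers may choose other distances, and the toy language is hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Number of attribute positions in which two meanings differ."""
    return sum(x != y for x, y in zip(a, b))

def edit_distance(s, t):
    """Levenshtein distance via single-row dynamic programming."""
    row = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        prev, row[0] = row[0], i
        for j, ct in enumerate(t, start=1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (cs != ct))
    return row[-1]

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def topographic_similarity(language):
    """Spearman correlation between meaning distances and message distances."""
    pairs = list(combinations(language.items(), 2))
    meaning_d = [hamming(m1, m2) for (m1, _), (m2, _) in pairs]
    message_d = [edit_distance(w1, w2) for (_, w1), (_, w2) in pairs]
    return pearson(ranks(meaning_d), ranks(message_d))

# Hypothetical compositional language: one symbol per attribute value.
compositional = {("red", "circle"): "ac", ("red", "square"): "ad",
                 ("blue", "circle"): "bc", ("blue", "square"): "bd"}
topsim = topographic_similarity(compositional)
```

For this perfectly compositional toy language, meaning and message distances coincide pair by pair, so the score is at its maximum of 1. Shuffling which message goes with which meaning lowers it.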
However, it is important to note that while compositionality is a desirable property, defining it can be challenging, and care should be taken when interpreting results that rely on it. Furthermore, recent studies highlight the importance of inductive biases on both the training framework and the data for the development of compositional communication [Bullard et al., 2020, Mihai and Hare, 2021]. Kuciński et al. [2021] theoretically and experimentally demonstrate that inductive biases on the training framework and the data are necessary for the development of compositional communication and that a noisy communication channel can promote compositionality in signaling games.
These studies suggest that inductive biases should be carefully considered when evaluating and designing models for language development. This opens up new and interesting research paths for investigating the relationship between inductive biases and language learning in both humans and machines. By understanding the biases that influence language development, researchers can design more effective models and evaluation metrics for artificial language learning, as well as gain insights into the fundamental mechanisms underlying natural language evolution.

Evaluating performance
Recognizing properties such as compositionality, verbal agreement, and deception can be challenging when dealing with symbolic languages. Therefore, several metrics have been proposed to evaluate the effectiveness of such languages. The importance of this analysis lies in the fact that artificial learning differs significantly from natural learning.
Figure 6: Hierarchy of evaluation metrics, organized into four main types: (1) reward/task completion, (2) message mutual information, (3) embedding analysis, and (4) similarity measures. Each type is further divided into specific metrics used to assess different aspects of emergent communication.
Cheating behaviors
The evaluation problem has been acknowledged by Lowe et al. [2019a], who highlighted that most research in this area focuses on enhancing task performance rather than examining the semantics of the language. The authors examined how a neural network's capacity affects its ability to learn compositional languages. In a related work by the same authors, Resnick et al. [2020] investigated the relationship between the size of the language space, denoted by $|L|$, and the number of bits required to solve a task, which is given by $\log(|L|)$. They found that sufficiently large neural networks were able to memorize the environment, which allowed the agents to solve the task without actually using the language, effectively bypassing the intended communication requirement.
Similar cheating behavior has been reported by Bouchacourt and Baroni [2018] in a referential game, where the agents achieve perfect results by communicating low-level details of the image rather than conceptual properties. The authors validate this result by providing the agents with noise images and observing that the performance is not adversely affected by such inputs.
Evaluation metrics
While concepts such as compositionality and generalization have clear definitions in linguistic contexts, there is a lack of formal measurement implementations in experimental settings. As a result, researchers have developed various metrics to evaluate emerging languages. In this section, we present 14 metrics divided into four categories. Figure 6 shows a hierarchical view of the metrics, with four main types of evaluation metrics and their respective subcategories. Furthermore, we provide a detailed analysis of the most commonly used evaluation metrics in the surveyed literature. It is noteworthy that we found a strong preference for five specific metrics among the studies we reviewed, as illustrated in Figure 7. These metrics provide a useful framework for evaluating the effectiveness of emergent communication models. However, it is important to note that there is no single best metric, and the choice of metric(s) should depend on the research question and specific context. Further research can be done to develop new metrics or refine existing ones to better capture the complex nature of emergent communication.

Reward and task completion
In Machine-centered EmCom (and EmCom in general), the focus is on reinforcement learning in game-like environments (as discussed in Section 2.1). As a result, metrics for evaluating agents' learning behavior, such as performance and game score, are essential. The most intuitive evaluation metric is task success, which measures the agent's final performance. All of the papers examined so far have used this metric to demonstrate the advantages of their methods.
To address the problem of catastrophic forgetting that RL agents may experience, Graesser et al. [2019] introduce a metric called mutual intelligibility, which estimates each agent's ability to play against itself. According to the authors: 'if a shared communication protocol has emerged, the agent would not have any trouble playing a game with itself during test time'.
However, reward and task completion metrics do not account for novel, unseen stimuli. Therefore, zero-shot performance is mentioned as a measure of generalization in [Choi et al., 2018, Mordatch and Abbeel, 2018, Lazaridou et al., 2018, Cogswell et al., 2019, Bouchacourt and Baroni, 2019]. For this metric, the authors either remove specific samples from the input space or generate unseen distributions during training and evaluation. Once the model is ready for testing, these samples are reintroduced, and the performance on these previously unseen instances is reported to assess the model's generalization capabilities.
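The held-out protocol described above can be sketched with a toy attribute-value input space. This is an illustrative example rather than the evaluation code of any cited paper: the attributes, the withheld combinations, and both stand-in "agents" (a speaker-listener pair collapsed into a single function) are assumptions.

```python
from itertools import product

colors = ["red", "green", "blue"]
shapes = ["circle", "square", "triangle"]
held_out = {("red", "square"), ("blue", "circle")}   # combinations withheld from training

all_inputs = list(product(colors, shapes))
train_inputs = [x for x in all_inputs if x not in held_out]
test_inputs = [x for x in all_inputs if x in held_out]

def accuracy(predict, inputs):
    """Fraction of inputs that the speaker-listener pair reconstructs exactly."""
    return sum(predict(x) == x for x in inputs) / len(inputs)

# Stand-in compositional pair: one symbol per attribute value, decoded symbol by symbol.
symbol = {"red": "a", "green": "b", "blue": "c",
          "circle": "x", "square": "y", "triangle": "z"}
word = {v: k for k, v in symbol.items()}

def compositional_agent(x):
    message = symbol[x[0]] + symbol[x[1]]
    return (word[message[0]], word[message[1]])

# Stand-in memorizing pair: only knows the exact inputs seen during training.
lookup = {x: x for x in train_inputs}

def memorizing_agent(x):
    return lookup.get(x)

zero_shot_compositional = accuracy(compositional_agent, test_inputs)
zero_shot_memorizer = accuracy(memorizing_agent, test_inputs)
```

Both stand-ins score perfectly on the training inputs, so task success alone cannot tell them apart. Only the held-out combinations separate the pair that composes attribute symbols from the pair that memorized whole inputs.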
Despite their usefulness in emphasizing the adaptation capabilities of models in a game environment, these metrics do not provide insights into the characteristics of the emerged language or how it is influencing agents' behavior in the game environment [Lowe et al., 2019b].

Message mutual information
To evaluate the language content and its influence on agent behavior, some works analyze the relationship between the message content, speaker, listener, and context. One such metric is speaker consistency [Jaques et al., 2018], which measures the alignment between an agent's message and its future action, delivering a normalized score. This measure shows how consistently a speaker agent emits a particular symbol when it takes a particular action and vice versa. Although this method is reported in [Liang et al., 2020, Chaabouni et al., 2019a, Eccles et al., 2019] as a reliable metric, it fails to capture the listener's behavior.
To account for this, the same authors [Jaques et al., 2018] introduced instantaneous coordination, a similar measure of mutual information between the speaker's message and the listener's next action. Unsurprisingly, the authors who used speaker consistency above also reported this metric [Liang et al., 2020, Eccles et al., 2019, Bouchacourt and Baroni, 2019, Chaabouni et al., 2019a].
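Both measures reduce to an empirical mutual information between two discrete variables, which can be sketched as follows. The normalization by the smaller of the two entropies is an assumption made here for illustration; the cited works may normalize differently.

```python
import math
from collections import Counter


def entropy(xs):
    """Empirical entropy H(X) in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (count / n) * math.log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


def normalized_mi(xs, ys):
    """I(X; Y) divided by min(H(X), H(Y)); returns 0.0 if either is constant."""
    denom = min(entropy(xs), entropy(ys))
    return mutual_information(xs, ys) / denom if denom > 0 else 0.0


# Toy episode log: one message and one action per step.
messages = [0, 0, 1, 1]
speaker_actions = [0, 0, 1, 1]   # speaker consistency: message vs. own action
listener_actions = [0, 1, 0, 1]  # instantaneous coordination: message vs. listener action

speaker_consistency = normalized_mi(messages, speaker_actions)
instantaneous_coordination = normalized_mi(messages, listener_actions)
```

In this toy log the speaker's messages fully determine its own actions, while the listener's actions are independent of them, which is exactly the situation that speaker consistency alone cannot reveal.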
Lastly, context independence [Bogin et al., 2018] measures the alignment between an agent's message and the task concept, such as the number of objects or colors in a categorical feature. While this formulation provides an interesting point of view for the correlation between objects and concepts, and it is used frequently in the literature [Choi et al., 2018, Mordatch and Abbeel, 2018, Korbak et al., 2019, Cogswell et al., 2019, Chaabouni et al., 2020, Dessì et al., 2019], it requires the dataset to be feature-based.
Similarly, the message distribution's entropy is reported in [Choi et al., 2018, Graesser et al., 2019, Liang et al., 2020, Lazaridou et al., 2018, Chaabouni et al., 2019a, Dagan et al., 2020, Bouchacourt and Baroni, 2019, Chaabouni et al., 2019b, Wang et al., 2019, Kharitonov et al., 2020] as a measure of the correlation between the speaker's input and the message used to describe it. When the entropy is low, the speaker is consistently using the same message to describe that input, thus showing some kind of communication protocol.
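One way to make this concrete is the conditional entropy of the message given the input, sketched below. Treating the reported quantity as H(message | input) is an assumption of this example; individual papers may report the marginal message entropy instead.

```python
import math
from collections import Counter, defaultdict


def conditional_message_entropy(inputs, messages):
    """Empirical H(M | X) in bits: average uncertainty about the message
    once the speaker's input is known. Low values indicate that each input
    is described by a consistent message."""
    by_input = defaultdict(list)
    for x, m in zip(inputs, messages):
        by_input[x].append(m)

    n = len(inputs)
    total = 0.0
    for msgs in by_input.values():
        k = len(msgs)
        h = -sum((c / k) * math.log2(c / k) for c in Counter(msgs).values())
        total += (k / n) * h  # weight each input by its empirical probability
    return total


inputs = ["apple", "apple", "pear", "pear"]
consistent = [(1, 2), (1, 2), (3, 4), (3, 4)]    # same message per input
inconsistent = [(1, 2), (2, 1), (3, 4), (4, 3)]  # two messages per input
```

Here `conditional_message_entropy(inputs, consistent)` is zero, whereas the inconsistent speaker yields one bit per input.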

Embedding Analysis
Symbolic languages are typically represented as discrete or one-hot vectors. While this is necessary for computational modeling, it also enables the use of statistical analysis and clustering techniques developed in machine learning research. Dimensionality reduction techniques such as Principal Component Analysis (PCA) [Pearson, 1901, Hotelling, 1933] and t-SNE [Van der Maaten and Hinton, 2008] are often used in Mac-EmCom [Cao et al., 2018, Lazaridou et al., 2017, Sukhbaatar et al., 2016, Denamganaï and Walker, 2020] to identify meaningful clusters of data with respect to symbolic messages. For example, Figure 8 shows a t-SNE projection of object vectors color-coded by majority symbols, revealing a cluster of fruits in blue on the bottom right and demonstrating how the sender relates symbols and features.
While other clustering techniques can be used for the same purpose, Cao et al. [2018] employ the encoder/decoder architecture of an LSTM to analyze the correlation between messages and agents' decisions. In their study, the authors trained an LSTM on generated messages and used it to predict agents' actions, assuming that the two must be correlated for the LSTM to function effectively. With this metric, they were able to identify the intention of an agent inside the message that was being generated.

Comparative Measures
While artificial settings allow for direct measurement of beliefs and intentions, studying HNL requires evaluating their syntactical, grammatical, and semantic components. Frequency analysis is a common statistical method that considers the frequency of symbols or sequences of symbols. N-grams are particularly useful for this purpose, as they count the frequency of contiguous sequences of symbols and can be used to derive meaningful distributions, e.g., word lengths. For example, Zipf's law [Zipf, 2013] is a mathematical distribution often found in human natural languages, which states that the frequency $f(r)$ of the $r$-th most frequent word is proportional to $1/r^{\alpha}$, so that for $\alpha \approx 1$ the most common word occurs twice as often as the second most frequent word, three times as often as the subsequent word, and so on:
$$f(r) \propto \frac{1}{r^{\alpha}}$$
Through this simple frequency analysis, many works [Choi et al., 2018, Graesser et al., 2019, Liang et al., 2020, Lazaridou et al., 2018, Chaabouni et al., 2019a, Dagan et al., 2020, Bouchacourt and Baroni, 2019, Chaabouni et al., 2019b] try to verify under which constraints an artificial language follows Zipf's law or other significant distributions.
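As a rough illustration of such a frequency analysis, the exponent of a Zipf-like distribution can be estimated by a least-squares fit of log-frequency against log-rank. This is a simple sketch, not the estimation procedure of any cited work; more careful estimators (for example, maximum likelihood) exist.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate alpha in f(r) ~ 1 / r**alpha from a token stream.

    Fits a straight line to (log rank, log frequency) by ordinary least
    squares and returns the negated slope. Needs at least two distinct tokens.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic symbol stream whose counts are 60, 30, 20, 15, 12 (exactly 60 / rank).
symbols = [r for r in range(1, 6) for _ in range(60 // r)]
alpha = zipf_exponent(symbols)
```

For the synthetic stream the fitted exponent is 1 up to floating-point error; on emergent messages, the same function would be applied to the symbols or n-grams emitted by the speaker.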
Along the same lines, topographic similarity was introduced in [Brighton and Kirby, 2006] to study the correlation between the distances of all possible pairs of meanings and the distances of the corresponding pairs of signals. It is the most referenced evaluation metric in the literature [Li and Bowling, 2019, Lazaridou et al., 2018, Guo et al., 2019, Korbak et al., 2019, Ren et al., 2020, Chaabouni et al., 2020, Dagan et al., 2020, Chaabouni et al., 2022, Bouchacourt and Baroni, 2019]; it correlates positively with compositionality and indicates the close relationship between EmCom and the computational linguistic analysis of human languages.
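A minimal sketch of topographic similarity follows, assuming meanings are attribute tuples compared by Hamming distance, messages are symbol sequences compared by edit distance, and the correlation is Spearman's rank correlation. These distance choices are common but are assumptions here, and the helper names are illustrative.

```python
import itertools


def hamming(a, b):
    """Number of attribute positions in which two meanings differ."""
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def topographic_similarity(meanings, messages):
    """Spearman correlation between pairwise meaning and message distances."""
    pairs = list(itertools.combinations(range(len(meanings)), 2))
    meaning_d = [hamming(meanings[i], meanings[j]) for i, j in pairs]
    message_d = [edit_distance(messages[i], messages[j]) for i, j in pairs]
    return pearson(average_ranks(meaning_d), average_ranks(message_d))


meanings = [(0, 0), (0, 1), (1, 0), (1, 1)]
compositional = [(0, 0), (0, 1), (1, 0), (1, 1)]  # one symbol per attribute
holistic = [(0, 0), (1, 1), (0, 1), (1, 0)]       # arbitrary assignment
```

On this toy language the compositional mapping scores 1, while the arbitrary mapping scores lower, which matches the intuition that nearby meanings should receive nearby messages.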
Topographic similarity is not the only measure taken from an area outside of artificial intelligence. Indeed, researchers such as Bouchacourt and Baroni [2018] and Tieleman et al. [2019] have borrowed representational similarity analysis (RSA) [Kriegeskorte et al., 2008] from computational neuroscience to compare the similarity structure of the input in the speaker and listener spaces. RSA is typically used to compare the similarity between evoked fMRI responses in selected brain regions. However, researchers in EmCom have found it useful to measure similarity between different kinds of input representation or to build a test set on which to analyze the generalization capabilities of emerging languages [Tieleman et al., 2019].
Finally, positional disentanglement (posdis) and bag-of-symbols disentanglement (bosdis) are metrics introduced by Chaabouni et al. [2020] to evaluate the compositional structure of emerging languages. Posdis measures whether symbols in specific positions tend to univocally refer to the values of a specific attribute, capturing the intuition that each position of the message should only be informative about a single attribute. In contrast, bosdis captures the intuition of a permutation-invariant language, where only symbol counts are informative, and symbols univocally refer to distinct input elements independently of where they occur. These metrics offer supplementary perspectives on the structure and compositionality of artificial languages, as demonstrated by Korbak et al. [2020] and Kuciński et al. [2021], who use them to examine various aspects of emerging communication systems.
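The intuition behind posdis can be sketched as a mutual-information gap per message position. The code below assumes at least two attributes and fixed-length messages, and follows our reading of the metric (gap between the two most informative attributes, normalized by the position's entropy, averaged over non-constant positions); it should be checked against the original definition before being used for reported results.

```python
import math
from collections import Counter


def entropy(xs):
    """Empirical entropy H(X) in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


def posdis(attributes, messages):
    """Positional disentanglement sketch.

    attributes: list of tuples, one value per attribute (at least two attributes).
    messages:   list of equal-length tuples of symbols.
    """
    n_attrs = len(attributes[0])
    score, used = 0.0, 0
    for j in range(len(messages[0])):
        symbols = [m[j] for m in messages]
        h = entropy(symbols)
        if h <= 0:
            continue  # a constant position carries no information; skip it
        mis = sorted(
            (
                mutual_information(symbols, [a[i] for a in attributes])
                for i in range(n_attrs)
            ),
            reverse=True,
        )
        # Gap between the most and second most informative attribute.
        score += (mis[0] - mis[1]) / h
        used += 1
    return score / used if used else 0.0


attributes = [(a, b) for a in range(3) for b in range(3)]
positional = list(attributes)  # position 0 encodes a, position 1 encodes b
entangled = [((a + b) % 3, (a + 2 * b) % 3) for a, b in attributes]
```

In the toy example each position of the `positional` language is informative about exactly one attribute, whereas in the `entangled` language no single position is informative about either attribute on its own. A bosdis analogue would replace per-position symbols with per-symbol counts.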

Open challenges
Despite the challenges and limitations, Mac-EmCom has emerged as a fascinating and rapidly evolving sub-field of research with the potential to shed light on the fundamental mechanisms of human communication and language evolution. As we have seen, researchers have proposed a variety of metrics to evaluate the performance of artificial communication systems and to analyze the properties of the emergent languages they produce. However, there is still much work to be done in identifying the most relevant aspects of human natural language to emulate in artificial systems, and in establishing the corresponding evaluation metrics. To achieve this, interdisciplinary approaches should be adopted, drawing on insights from fields such as linguistics, cognitive science, and anthropology.
One possible future direction for research is to explore how Mac-EmCom can be integrated with more traditional approaches to NLP, such as rule-based or statistical methods. For example, researchers could investigate how Mac-EmCom could be used as a pretraining regime for large language models, allowing them to learn faster and with less data, similar to what has been done by [Lowe et al., 2020, Yao et al., 2022, Dessì et al., 2023].
Although not relevant to human-machine interaction, another promising area of research is to explore how Mac-EmCom could be used to develop new communication protocols for multi-agent systems. By allowing agents to communicate in an emergent language, it may be possible to develop more efficient and effective communication strategies than those currently used in multi-agent systems. This could have important implications for a wide range of applications, from robotics and automation to online gaming and social media.
Finally, another interesting path of research is to explore how Mac-EmCom could be used to study the evolution of language itself. By creating artificial communication systems that mirror the basic principles of human language evolution, researchers may be able to gain new insights into the origins and development of language in humans, as well as the factors that contribute to the diversity of human languages around the world. This research direction can significantly benefit the fields of human language evolution and development, where artificial settings are frequently employed.
In conclusion, the study of Mac-EmCom represents an exciting and rapidly evolving area of research that holds great promise for advancing our understanding of language acquisition and communication. By continuing to develop new evaluation metrics, explore new applications, and integrate Mac-EmCom with other fields of research, we may be able to unlock new insights into the nature of language and the ways in which it evolves over time.

Human-centered EmCom
In this section, we shift our focus to a new framework based on HNL, which we define as Human-centered EmCom (Hum-EmCom). While some references have already been cited in the previous section on Machine-centered Emergent Communication, in this section we provide a fresh perspective on these works by examining their contributions to the Hum-EmCom sub-field.
To incorporate human natural language into the EmCom pipeline, researchers utilize datasets with human captions, such as COCO [Lin et al., 2014] or the Abstract Scenes Dataset [Zitnick and Parikh, 2013], often in conjunction with pretrained language models. These datasets equip artificial agents with prior knowledge about human language during training. Notably, some works also investigate the possibility of training these models within the pipeline itself.
Through the exploration of Human-centered EmCom, we aim to understand the ways in which the incorporation of natural language differs from that of symbolic language, and how it poses unique challenges to the field. Our examination of this emerging sub-field includes a review of its key works and an analysis of the various techniques and models used to overcome the obstacles presented by the complexity and variability of human language.
Human-centered EmCom and Image Captioning. Before introducing Hum-EmCom, we should point out how it differs from the field of Image Captioning (IC). Both utilize datasets with human captions and/or pretrained language models and aim to develop agents capable of perceiving multi-modal settings, such as vision and language, and reasoning about them using natural language. However, there are subtle differences in their respective methodologies. Human-centered EmCom research is developed in game settings and thus employs reinforcement learning, whereas IC predominantly uses supervised learning. While Hum-EmCom explicitly models the interaction among multiple agents with a shared goal, IC is focused on architectures capable of mimicking the human ability to use language in a visual context, which may not align with human understanding [Dessì et al., 2022]. As both fields aim to refine artificial languages to better resemble human-like ones, they should be regarded as complementary components of the broader challenge of achieving this goal.

Characteristics
The first instance of Hum-EmCom can be traced back to the work of Andreas and Klein [2016]. The authors developed a reference game where a speaker generated pragmatic captions. Utilizing a reasoning speaker (see Theory of Mind in Section 2.1) that employed multi-modal representation, the authors evaluated their approach using two metrics: accuracy, measured by the success rate of the game, and fluency, measured by showing isolated sentences to human evaluators and asking them to rate their language quality. The authors introduced a trade-off parameter λ, which allowed for weighting the joint probability of a sentence uttered by the speaker and correctly interpreted by the listener. They found that small λ values led to highly specific utterances with low fluency, while increasing λ caused the captions to become more generic.
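The effect of such a trade-off parameter can be illustrated with a small reranking sketch. It assumes the score is a log-linear interpolation λ·log p_speaker + (1 − λ)·log p_listener; the exact formulation in Andreas and Klein [2016] may differ, and the candidate sentences and probabilities below are invented purely for illustration.

```python
import math


def rerank(candidates, lam):
    """Pick the caption maximizing lam * log p_speaker + (1 - lam) * log p_listener.

    candidates: list of (sentence, p_speaker, p_listener), where p_speaker is a
    fluency-oriented speaker probability and p_listener is the probability that
    the listener identifies the target from the sentence (both hypothetical).
    """
    def score(candidate):
        _, p_speaker, p_listener = candidate
        return lam * math.log(p_speaker) + (1 - lam) * math.log(p_listener)

    return max(candidates, key=score)[0]


# Hypothetical candidates: a fluent but generic caption and a terse, specific one.
candidates = [
    ("there is an owl in the tree", 0.30, 0.55),
    ("owl left tree sun behind", 0.02, 0.95),
]

specific = rerank(candidates, lam=0.1)  # listener success dominates
generic = rerank(candidates, lam=0.9)   # speaker fluency dominates
```

With a small λ the terse but discriminative caption wins, and with a large λ the fluent, more generic caption is preferred, mirroring the qualitative finding reported above.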
While Andreas and Klein [2016] focused on grounding the problem around a natural language captioned dataset, Lazaridou et al. [2017] concentrated on porting a Mac-EmCom pipeline to Hum-EmCom using a model pretrained on an image classification dataset. They developed a pipeline that allowed the sender to switch between two tasks equiprobably: a referential game and a supervised language captioning task. This approach aimed to ground the sender in HNL while simultaneously teaching it to communicate using that grounding. As a result, the relationship between images and captions made the mapping between pairs of images and supervised categories humanly interpretable. For a follow-up experiment, they used one dataset for supervised image captioning and another for referential games (ReferItGame [Kazemzadeh et al., 2014]) and asked human evaluators to determine which image a sender caption referred to. They reported an accuracy rate of 68% for the latter task, concluding that supervised learning can provide a foundation for communication with humans that is generalizable beyond the distribution of image caption datasets.
While both Andreas and Klein [2016] and Lazaridou et al. [2017] aimed to create artificial agents capable of reasoning in a multimodal environment, the methods used to achieve this goal differed, as previously noted. Lazaridou et al. [2017] started from a Mac-EmCom setup and expanded it to include HNL, an approach that we identify as Hum-EmCom.
In this regard, we report related works that follow a similar approach [Das et al., 2017, Havrylov and Titov, 2017, Lu et al., 2020a,b, Lazaridou et al., 2020, Lowe et al., 2020]. Conversely, Andreas and Klein [2016] approached the task following the classical IC pipeline, a methodology often used in the literature [Hawkins et al., 2020, Zhu et al., 2021, Wang et al., 2021]. These seemingly dual designs can blend to form numerous alternatives: Lee et al. [2018] developed a referential game where two agents were pretrained on different languages and must evolve a common interpretation to solve the task; Cogswell et al. [2020] used pretrained language models but then extended the game to a Task and Talk setting, which is more similar to Andreas and Klein [2016]; and Li et al. [2020] used Mac-EmCom as a pretraining framework for machine translation.

Balancing Supervised and Reinforcement Learning
As previously noted, Hum-EmCom involves the use of both supervised and reinforcement learning, which introduces distinct challenges, particularly regarding the balance between the two.
To address this balance, Lazaridou et al. [2020] distinguish between functional learning, where agents are focused on maximizing a task-specific reward, and structural learning, where the aim is to keep the language correct and fluent. They propose various training techniques, including reward finetuning, multi-task learning, and reward-learned rerankers.
In Evtimova et al. [2018], a multi-modal, multi-step referential game was set up to simulate more realistic settings, with the sender accessing the visual portion of the game and the receiver accessing the textual portion. The authors showed that a robust and efficient communication protocol emerges, and their work demonstrated a positive correlation between the length of conversation and the receiver's prediction confidence. They observed that sender entropy increases as the receiver asks more specific questions and investigated how agents' communication protocols become highly specialized with limited bandwidths.
Lowe et al. [2020] explore emergent communication with respect to supervised learning and self-play. They define two test cases, one inspired by the Lewis signaling game and the other from Lee et al. [2018], and they investigate what kind of scheduled learning achieves the best performance and generalization. Their study concludes that the initial supervision phase helps overcome the discovery problem, while starting the learning process with self-play leads to inconsistent language drift. Furthermore, population-based learning was found to outperform the previous method with respect to both language drift and accuracy.
Similarly, Lowe et al. [2019a] employed pre-trained agents, each trained on different communication protocols, as a foundation for training a meta-learning agent. This meta-learning agent was then further trained through collaboration with other agents as they learned to communicate. The resulting meta-learner could be introduced to new populations of agents with different languages and adapt more efficiently. In practice, the authors developed a dataset with a diverse range of communication paradigms, using it to train an agent based on a standard structure for general communication.
These studies suggest that Hum-EmCom puts more emphasis on balancing the competing learning paradigms of supervised language tasks and reinforcement-based referential games than on the environmental constraints of emerging compositional languages, which were the main focus of Mac-EmCom research. This shift in focus may indicate that Mac-EmCom researchers could encounter similar difficulties in their efforts to create artificial agents that can generalize to new concepts through learned communication protocols. Therefore, the lessons learned from Hum-EmCom research could potentially benefit Mac-EmCom research in the future.

Language drift
The phenomenon of human language drift (LD) has been the subject of study since the early 20th century. In his work, Sapir [2014] expresses his fascination with the apparent paradox of dialect variation and analyzes how drifts are formed as a historical product, drawing similarities with the iterative learning process discussed in Section 2.1. However, Sapir also highlights an important aspect of LD, namely the cumulative shifting in some special direction, and emphasizes that these shifts are not ultimately random, of course, only relatively so. Lakoff [1972] similarly argues that LD is not accidental but an inherent part of human linguistic ability.
Language Drift in machines. The phenomenon of language drift in machines shares a fundamental aspect with human language: the co-evolution and adaptation to conventional agreements between speakers, leading to an interest in computational linguistics for studying such phenomena in artificial settings [Hamilton et al., 2016]. However, language drifts are also seen as a misalignment between emergent communication and human language, leading to recent studies exploring how to reduce language drift by finding learning constraints. These drifts, identified as behaviors that degrade a learned language's syntactic and semantic performance, arise when a supervised language task is coupled with a reinforcement learning one. Some of the recent works exploring this topic include [Lee et al., 2019, Lu et al., 2020a,b, Cogswell et al., 2020, Lazaridou et al., 2020, Li et al., 2020, Lowe et al., 2020, Wang et al., 2021].
Drift Detection. To accurately evaluate and mitigate language drift in artificial settings, clear evaluation metrics are necessary, and the majority of research in Hum-EmCom has relied on metrics such as BLEU [Lee et al., 2019, Lu et al., 2020b, Li et al., 2020], Negative Log-Likelihood [Lu et al., 2020b,a], uncertainty [Wang et al., 2021, Cogswell et al., 2020], cosine similarity [Lu et al., 2020a], and machine translation [Lee et al., 2018, Li et al., 2020].
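To give a feel for likelihood-based drift detection, the sketch below scores agent messages under a reference language model. The add-one-smoothed unigram model is a deliberately simple stand-in chosen for this example; the cited works rely on pretrained neural language models, and the toy corpus is invented.

```python
import math
from collections import Counter


def unigram_nll(reference_corpus, message):
    """Average per-token negative log-likelihood (in nats) of a message under
    an add-one-smoothed unigram model estimated from a reference corpus.

    One extra vocabulary slot is reserved for tokens never seen in the
    reference corpus. Higher values suggest the message has drifted away
    from the reference language.
    """
    counts = Counter(token for sentence in reference_corpus for token in sentence)
    total = sum(counts.values()) + len(counts) + 1  # +1 for the unknown slot
    log_probs = [math.log((counts[token] + 1) / total) for token in message]
    return -sum(log_probs) / len(message)


reference = [["a", "red", "ball"], ["a", "blue", "ball"]]
fluent = ["a", "red", "ball"]
drifted = ["ball", "ball", "zzz"]  # repeated and out-of-vocabulary tokens

nll_fluent = unigram_nll(reference, fluent)
nll_drifted = unigram_nll(reference, drifted)
```

Tracking such a score over the course of reinforcement learning gives a cheap signal of drift, although a unigram model ignores word order and therefore cannot detect purely syntactic degradation.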
Furthermore, Lee et al. [2019] introduce a way to estimate syntactic and semantic drifts and propose constraints to mitigate them. On one hand, syntactic constraints (LM) are expressed at the level of the pretrained language model with an auxiliary loss to measure the "Englishness" of the message. On the other hand, semantic constraints (G) are implemented on a visual ground and capture how much of the message is based on the original semantic content [Kiela et al., 2017]. Figure 9 illustrates how the communication becomes more human-interpretable and grounded in visual context when both constraints are applied (LM+G).
Additionally, Lazaridou et al. [2020] introduce an automatic method for detecting the canonical language drifts, structural and semantic, as well as a third type of drift, pragmatic drift, arising from the divergence between the human interpretation of a message and the interpretation assumed by the speaker agent due to the co-adaptation of the speaker and listener agents.
Iterative and population learning. As discussed in Section 2.1, iterative learning is a framework that promotes the emergence of desirable properties, such as transmissibility, efficiency, and ease of teaching, by passing knowledge down to generations of agents. In Hum-EmCom, the use of iterative learning frameworks is aimed at achieving these properties in artificial languages and mitigating the negative effects of language drift. For instance, Gupta et al. [2019] identify multiple algorithms dealing with communication in multi-agent environments and supervised datasets and introduce the term supervised self-play (S2P). The S2P loss encourages agents to maximize task completion while staying close to the initial distribution of languages by playing with past versions of themselves. Various combinations are tested, and S2P is shown to act as a regularizer between the two learning tasks.
In contrast, Lu et al. [2020a] propose what they call Seeded Iterated Learning (SIL), a setting more similar to human experience, where a teacher-student architecture is coupled with imitation learning to play a simple referential game. According to their findings, BLEU scores increased when SIL was applied compared to S2P, although no human evaluation was conducted.
Based on the previous two works, Lu et al. [2020b] propose a mix called Supervised Seeded Iterated Learning (SSIL), where agents in a SIL pipeline are trained using the S2P loss on a translation game. In comparison with other baselines, the authors demonstrate that their architecture is robust to distributional shifts using BLEU and NLL.
Machine Translation. Incorporating language tasks, such as machine translation, into reinforcement learning environments offers a promising solution to mitigate language drifts in Hum-EmCom. For instance, Lee et al. [2018] employ two agents, grounded in different languages, to play a visual referential game, resulting in a translation module based on emergent communication. Although this approach relies on supervised learning for its labeling task, it shows a significant increase in learning speed in an experiment with a multilingual community of agents. To address the need for unsupervised learning in machine translation, Li et al. [2020] propose a three-step process consisting of training two agents on an unlabeled referential game, fine-tuning the resulting model on a linguistic dataset, and regularizing the model parameters. The authors report significant increases in BLEU scores on the machine translation task using this architecture. While challenges remain regarding unsupervised learning and data availability, these successes demonstrate the potential for machine translation to contribute to the development of more robust and adaptive artificial languages.

Open Challenges
Human-centered EmCom (Hum-EmCom) aims to develop artificial agents capable of using human natural language (HNL) in a way that goes beyond simple prediction and can effectively communicate and learn new concepts. To achieve this, researchers balance supervised language tasks and reinforcement-based referential games, creating a unique challenge. Recent studies propose various training techniques, including reward finetuning and multi-task learning. The phenomenon of language drift is a common issue in Hum-EmCom research, leading to the exploration of various ways to reduce drift, such as Seeded Iterated Learning and supervised self-play. While Hum-EmCom research has made significant progress in creating artificial agents capable of using natural language, there are still many open challenges that need to be addressed.
Long-term understanding. Most Hum-EmCom studies have focused on simple referential games, but real-world communication is much more complex. To effectively communicate in HNL, agents need to understand context, infer intentions, and reason about long-term goals. Developing techniques that enable agents to learn these skills will be a crucial step toward achieving HNL communication.
Ethics. As Hum-EmCom research moves towards more complex communication and decision-making, there is a growing concern about the ethical implications of these technologies. Developing ethical guidelines and incorporating ethical considerations into the design of Hum-EmCom agents will be necessary to ensure that they are used in a responsible and beneficial manner.
Human interaction. Ultimately, one goal of Hum-EmCom research is to develop agents that can effectively communicate with humans. To achieve this goal, it will be necessary to explore how humans interact with artificial agents and how to design agents accordingly.
In conclusion, Hum-EmCom research has made significant progress in creating artificial agents capable of using natural language. However, there are still many challenges that need to be addressed to achieve robust and adaptive communication protocols that can handle a wide range of situations.

Conclusion
The present review provides an analysis of the state of the emergent communication (EmCom) literature. Our aim is to establish a link between specific characteristics of this field and human interactions, by drawing parallels with various fields including linguistics, cognitive science, computer science, and sociology, as shown in Figure 1.

To achieve this objective, we begin by examining the common properties that are prevalent in the literature, as outlined in Section 2. Our analysis identifies four key components that frequently arise in real-world interactions and we investigate their parallels within EmCom.
In Section 2.1, we delve into the role of environment design. Specifically, we examine the distinction between communication as the primary objective (Communication-focused) versus communication as a tool to achieve other tasks (Communication-assisted), as discussed in Section 2.1.1. Additionally, we explore the influence of input representation on EmCom, as outlined in Section 2.1.2.
For newcomers to the field, the question of how to train artificial agents in the EmCom pipeline can be overwhelming and hinder understanding. To address this issue, in Section 2.2, we identify the most common learning paradigms employed in EmCom research. Specifically, we focus on two popular methods: reinforcement learning (Section 2.2.1) and supervised learning (Section 2.2.2), while drawing comparisons to human learning capabilities.
The design of the environment in emergent communication research is influenced not only by task-oriented goals but also by the type of interaction that occurs between agents. To further explore this aspect, Section 2.3 focuses on identifying two types of interactions that can take place. The first type, grounded in a spatial component, is discussed in Section 2.3.1. This section introduces the concept of both inner and outer interactions, which can be cooperative or competitive and occur between agents belonging to the same or different teams. While these interactions are rooted in a spatial context, we also emphasize the importance of the temporal aspect by introducing the concept of iterative learning, discussed in Section 2.3.2.
Researchers in the field of EmCom have drawn inspiration from various disciplines, including cognitive science. The presence of multiple agents in the system has led to the adoption of ideas from cognitive science, such as the Theory of Mind (ToM). Section 2.4 examines how ToM enables artificial agents to model other intelligent entities as distinct individuals separate from their environments, as discussed in Section 2.4.1. Furthermore, this modeling naturally extends to the concept of influencing other agents, which is explored in Section 2.4.2.
In our analysis of EmCom, we distinguish between two primary sub-fields that vary in their approach, as outlined in Section 3. The first, Machine-centered EmCom (Mac-EmCom), presented in Section 3.1, predominantly employs symbolic languages represented as numerical vectors, with an emphasis on discovering the appropriate constraints to observe common properties of natural languages, such as generalization and compositionality. Our discussion in Section 3.1.2 revolves around the quest for generalization and the various techniques used to detect it within the field, as explored in Section 3.1.3.
Conversely, Human-centered EmCom (Hum-EmCom), as presented in Section 3.2, encompasses works that utilize natural language in their settings, with a focus on balancing task-oriented learning and language supervision. In Section 3.2.1, we delineate the characteristics of this sub-field, and in Section 3.2.2, we outline its objectives. Lastly, in Section 3.2.3, we discuss the issue of language drift, which is often regarded as the most significant challenge in this sub-field.

Implications
This review has made two significant contributions to the field of emergent communication. First, we have provided an extensive review of the relevant literature, distinguishing the commonalities and differences among various approaches.
The list of references presented in Table 2 serves as a valuable resource for researchers interested in this dynamic field.
Second, we have emphasized the robust connection between emergent communication and human-machine interaction.
Although the majority of the analyzed literature has concentrated on multi-agent systems comprising solely artificial agents, we propose that incorporating a human-in-the-loop approach, particularly in mixed human-robot teams, offers great potential for future research. This approach facilitates a more realistic approximation of human communication in real-world settings, allowing researchers to more accurately model and analyze the intricate interplay of language, cognition, and social interaction.
In conclusion, our examination of the emergent communication literature demonstrates that this field holds great potential for enhancing our understanding of human communication and for developing more robust and trustworthy artificial communication systems. By underscoring the connection between emergent communication and human-machine interaction, we aim to inspire future research that considers the rich complexity of human communication and interaction.

Figure 1: Exploring the multidisciplinary nature of Emergent Communication: A Venn Diagram showcasing the intersections between Linguistics, Cognitive Science, Computer Science, and Sociology. Each field contributes unique characteristics to the study of EmCom (shown in the figure as encompassing the other fields), with some commonalities across multiple fields. At the center of our analysis lies the crucial area of Human-Machine Interaction.

Figure 2: General pipeline for a discriminative referential game. The sender is shown a target image (a pencil) and is tasked to generate a message. The receiver sees a pool of images (distractors) containing the target and must choose the correct one based on the message.
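
The pipeline described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the setup of any particular paper: images are stood in for by string labels, the sender and receiver are hypothetical hand-written functions sharing a fixed codebook (no learning takes place), and the names `play_round`, `codebook_sender`, `codebook_receiver`, and `average_reward` are invented for this example.

```python
import random

def play_round(sender, receiver, images, n_distractors, rng):
    """Play one round of a discriminative referential game; return the shared reward."""
    target = rng.choice(images)
    others = [img for img in images if img != target]
    pool = rng.sample(others, n_distractors) + [target]
    rng.shuffle(pool)  # hide the target's position from the receiver
    message = sender(target)         # the sender sees only the target
    guess = receiver(message, pool)  # the receiver sees only the message and the pool
    return 1.0 if guess == target else 0.0

# Assumption: string labels stand in for images, and both agents share a
# fixed codebook instead of learning a protocol.
IMAGES = ["pencil", "apple", "chair", "boat"]
CODEBOOK = {img: idx for idx, img in enumerate(IMAGES)}

def codebook_sender(target):
    """Emit the single discrete symbol assigned to the target."""
    return CODEBOOK[target]

def codebook_receiver(message, pool):
    """Pick the candidate whose symbol matches the message."""
    for img in pool:
        if CODEBOOK[img] == message:
            return img
    return pool[0]  # fallback if no candidate matches

def average_reward(n_rounds=100, seed=0):
    """Average task reward over several rounds with two distractors each."""
    rng = random.Random(seed)
    total = sum(
        play_round(codebook_sender, codebook_receiver, IMAGES, 2, rng)
        for _ in range(n_rounds)
    )
    return total / n_rounds
```

In an actual EmCom experiment the two functions would be replaced by trainable agents, and the reward returned by `play_round` would serve as the learning signal.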

Figure 4: Visualization of interaction types in Emergent Communication, with a horizontal space dimension and a vertical time dimension. The horizontal dimension is split into three parts for inner, outer cooperation, and outer competition. Teams are represented by squares, and their interconnections are indicated by arrows of different colors: green for cooperative, red for competitive, and gray dotted lines for time.

Figure 6: Hierarchical view of evaluation metrics used in Emergent Communication literature, divided into four types: (1) reward/task completion, (2) message mutual information, (3) embedding analysis, and (4) similarity measures. Each type is further divided into specific metrics used to assess different aspects of emergent communication.

Figure 7: Bar plot showing the frequency of use of different evaluation metrics in emergent communication research, with the height of each bar representing the number of papers the metric has been used in.

Figure 8: t-SNE plots of object fc vectors color-coded by majority symbols assigned to them by informed sender.Image and Caption taken from Lazaridou et al. [2017].

Table 2: This table contains all the relevant literature cited in this review, sorted by year. A first row reports the broad categories mentioned in Section 2 and Section 3, whereas a second row describes their specific features. The interaction type category has the following abbreviations as mentioned in Section 2.3: inner Most columns are marked with either a cross or a white space, indicating whether or not the corresponding feature is present. Three categories, however, contain some abbreviations. The game environment is divided into two features: communication-focused (Com-focus) and communication-assisted