Introduction
A central challenge of multi-agent systems (MASs) is coordinating the actions of multiple autonomous agents in time and space to accomplish cooperative tasks and achieve joint goals [1], [2]. Developing successful MASs requires addressing controllability challenges [3], [4] and dealing with synchronization control [5], formation control [6], task allocation [7], and consensus formation [8], [9], [10].
Research in cognitive science may provide guiding principles to address the above challenges, by identifying the cognitive strategies that groups of individuals use to successfully interact with each other and to make collective decisions [11], [12], [13], [14]. An extensive body of research studied how two or more people coordinate their actions in time and space during cooperative (human–human) joint actions, such as when performing team sports, dancing or lifting something together [15], [16]. These studies have shown that successful joint actions engage various cognitive mechanisms, whose level of sophistication plausibly depends on task complexity. The simplest forms of coordination and imitation in pairs or groups of individuals, such as the joint execution of rhythmic patterns, might not require sophisticated cognitive processing, but could use simple mechanisms of behavioral synchronization—perhaps based on coupled dynamical systems, analogous to the synchronization of coupled pendulums [17]. However, more sophisticated types of joint actions go beyond the mere alignment of behavior. For example, some joint actions require making decisions together, e.g., the decision about where to place a table that we are lifting together. These sophisticated forms of joint actions and joint decisions might benefit from cognitive mechanisms for mutual prediction, mental state inference, sensorimotor communication and shared task representations [16], [18], [19]. The cognitive mechanisms supporting joint action have been probed by numerous experiments [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], sometimes with the aid of conceptual [31], computational [32], [33], [34], [35], [36], [37], [38], [39], [40], and robotic [41], [42], [43], [44] models. Despite this progress, there is a paucity of models that implement advanced cognitive abilities, such as the inference of others’ plans and the alignment of task knowledge across group members. Furthermore, we still lack a formal theory that explains the cognitive mechanisms of joint actions from first principles; for example, from the perspective of a generic inference or optimization scheme.
We advance an innovative framework for cooperative joint action and consensus in MASs, inspired by the cognitive framework of active inference. Active inference is a normative theory that describes the brain as a prediction machine, which learns an internal (generative) model of the statistical regularities of the environment—including the statistics of social interactions—and uses it to generate predictions that guide perceptual processing and action planning [45]. Here, we use the predictive and inferential mechanisms of active inference to implement sophisticated forms of joint action in dyads of interacting agents. The model presented here formalizes joint action as a process of interactive inference based on shared task knowledge between the agents [2], [46].
The main contribution of this article is showing that effective cooperative behavior and sensorimotor communication can emerge in dyads of active inference agents that jointly optimize their beliefs about the joint goal and their plans about how to achieve it. We exemplify the functioning of the model in a “joint maze” task. In the task, two agents have to navigate in a maze, to reach and press together either a red or a blue button. Each agent has probabilistic beliefs about the joint task that the dyad is performing, which cover both his own and the other agent’s contributions (e.g., Should we both press a red or a blue button?). Each agent continuously infers what the joint task is, based on his (stronger or weaker) prior belief and the observation of the other agent’s movements toward one of the two buttons. Then, he selects an action (red or blue button press), in a way that simultaneously fulfils a pragmatic (i.e., utility maximization) and an epistemic (i.e., uncertainty minimization) objective. Here, the pragmatic objective prioritizes policies that achieve the joint task efficiently (e.g., by following the shortest route to reach the to-be-pressed button). In contrast, the epistemic objective prioritizes policies that help the other agent infer what the joint goal is (e.g., by selecting a longer route that the other agent can easily associate with the goal of pressing the red button).
The next sections are organized as follows. First, we introduce the consensus problem (called “joint maze”) we will use throughout this article to explain and validate our approach. Next, we illustrate the main tenets of the interactive inference model of joint action. Then, we present two simulations that illustrate the functioning of the interactive inference model. The first simulation shows that over time, interactive inference aligns the joint task representations of the two agents and their behavior, as observed empirically in several joint action studies [18], [24], [47], [48], [49], [50]. In turn, this form of “interactive alignment” (or “generalized synchrony”) optimizes the performance of the dyad. The second simulation shows that when agents have asymmetric information about the joint task, the more knowledgeable agent (or “leader”) systematically modifies his behavior, to reduce the uncertainty of the less knowledgeable agent (or “follower”), as observed empirically in studies of sensorimotor communication [16], [18]. This social epistemic action ensures the success of joint actions despite incomplete information. Finally, we discuss how our model of interactive inference could help better understand various facets of (“leaderless” and “leader–follower”) human joint actions, by providing a coherent formal explanation of their dynamics at both brain and behavioral levels.
Problem Formulation and Scenario
To illustrate the mechanisms of the interactive inference model, we focus on the consensus problem called the “joint maze” task, which closely mimics the setting used in a previous human joint action study [34], see Fig. 1. In this task, two agents (represented as a gray hand and a white hand) have to “navigate” in a grid-like maze, reach the location in which the red or blue button is located, and then press it together. The task is completed successfully when both agents “press” the same button, whatever its color (unless stated otherwise).
Schematic illustration of the “joint maze” task. The two (gray and white) agents are represented as two hands. Their initial positions are L3 (gray) and L19 (white). Their possible goal locations are in blue (L12) and red (L10). The agents can navigate in the maze by following the open paths, but cannot go through walls.
At the beginning of each simulation, each agent is equipped with some prior knowledge (or preference) about the goal of the task. This prior knowledge is represented as a probabilistic belief, i.e., a probability distribution over four possible task states; these are “both agents will press red,” “both agents will press blue,” “the white agent will press red and the gray agent will press blue” and “the white agent will press blue and the gray agent will press red.” Importantly, in different simulations, the prior knowledge of the two agents can be congruent (if both assign the highest probability to the same state) or incongruent (if they assign the highest probability to different states); certain (if the probability mass is peaked in one state) or uncertain (if the probability mass is spread across all the states). This creates a variety of coordination problems, which span from easy (e.g., if the beliefs of the two agents are congruent and certain) to difficult problems (e.g., if the beliefs are incongruent or uncertain). Each simulation includes several trials, during which each agent follows a perception-action cycle. First, the agent receives an observation about his own position and the position of the other agent. Then, he updates his knowledge about the goal of the task (task goal inference) and forms a plan about how to reach it (plan inference). Finally, he makes one movement in the maze (by sampling it probabilistically from the plan that he formed). Then, a new perception-action cycle starts.
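To make this setup concrete, the following sketch shows one way such priors over the four task states could be encoded as categorical distributions and checked for congruence and certainty; the function names, the 0.7 threshold, and the state ordering are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

# Order of the four task states (as in the text):
# 0: both press red, 1: both press blue,
# 2: white presses red / gray presses blue, 3: white presses blue / gray presses red
TASK_STATES = ["red-red", "blue-blue", "red-blue", "blue-red"]

def make_prior(p_red_red, p_blue_blue, p_red_blue=0.0, p_blue_red=0.0):
    """Return a normalized categorical prior over the four task states."""
    prior = np.array([p_red_red, p_blue_blue, p_red_blue, p_blue_red], dtype=float)
    return prior / prior.sum()

def congruent(prior_white, prior_gray):
    """Priors are congruent if both agents assign the highest probability to the same state."""
    return int(np.argmax(prior_white)) == int(np.argmax(prior_gray))

def certain(prior, threshold=0.7):
    """A prior is (loosely) 'certain' if most of its mass is peaked on a single state."""
    return float(np.max(prior)) >= threshold

# Example: an uncertain but congruent dyad (as at the start of Simulation 1)
white = make_prior(0.5, 0.5)
gray = make_prior(0.5, 0.5)
print(congruent(white, gray), certain(white))  # True, False
```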
Methods
Here, we provide a brief introduction to the active inference framework for single agents (see [45] for details) and then we illustrate the novel interactive inference model developed here to address multi-agent, cooperative joint actions.
A. Active Inference
Active Inference agents implement perception and action planning through the minimization of variational free energy [45]. To minimize free energy, the agents use a generative model of the causes of their perceptions, which encodes the joint probability of the stochastic variables illustrated in Fig. 2, using the formalism of probabilistic graphical models [51]. The agent’s generative model is defined as follows:
\begin{align*}
&P\left(o_{0:T},s_{0:T},u_{1:T},\gamma \mid \boldsymbol{\Theta}\right) \\
&\quad = P\left(\gamma \mid \boldsymbol{\Theta}\right) P\left(\pi \mid \gamma,\boldsymbol{\Theta}\right) P\left(s_{0}\mid \boldsymbol{\Theta}\right) \prod_{t=0}^{T} P\left(o_{t}\mid s_{t},\boldsymbol{\Theta}\right) P\left(s_{t+1}\mid s_{t},\pi_{t},\boldsymbol{\Theta}\right) \tag{1}
\end{align*}
Generative model for multi-agent active inference. The circles denote stochastic variables. The filled circles denote observed variables and the unfilled circles denote hidden variables that are not observed and need to be inferred. The squares indicate the categorical probability distributions that parameterize the generative model. See the main text for an explanation of the symbols A, B, C, D, G, S, and
The set
An active inference agent implements the perception-action loop by applying the above matrices to hidden states and observations. In this perspective, perception corresponds to estimating hidden states on the basis of observations and of previous hidden states. Inference is performed by optimizing an approximate posterior over the hidden variables, which factorizes as
\begin{equation*}
Q\left(s_{0:T},u_{1:T},\gamma\right)=Q(\pi)\,Q(\gamma)\prod_{t=0}^{T} Q\left(s_{t}\mid \pi_{t}\right) \tag{2}
\end{equation*}
The sufficient statistics of this posterior are updated as follows:
\begin{align*}
\boldsymbol{s}_{t}^{\boldsymbol{\pi}} &\approx \sigma\left(\ln \mathbf{A}\cdot o_{t}+\ln\left(\mathbf{B}\left(\pi_{t-1}\right)\cdot \boldsymbol{s}_{t-1}^{\boldsymbol{\pi}}\right)\right) \tag{3a}\\
\boldsymbol{\pi} &= \sigma\left(\ln \mathbf{E}-\boldsymbol{\gamma}\cdot \mathbf{G}\left(\pi_{t}\right)\right) \tag{3b}\\
\boldsymbol{\gamma} &= \frac{\alpha}{\beta-\boldsymbol{G}\left(\boldsymbol{\pi}\right)} \tag{3c}
\end{align*}
where the expected free energy (EFE) of a policy is
\begin{equation*}
\mathbf{G}\left(\pi_{t}\right) = \sum_{\tau=t+1}^{T}\left\{D_{KL}\left[Q\left(o_{\tau}\mid \pi\right)\,\|\, P\left(o_{\tau}\right)\right] +\mathbb{E}_{\tilde{Q}}\left[H\left[P\left(o_{\tau}\mid s_{\tau}\right)\right]\right]\right\} \tag{4}
\end{equation*}
The EFE can be used as a quality score for the policies and has two terms. The first term of (4) is the Kullback–Leibler divergence between the (approximate) posterior and prior over the outcomes and it constitutes the pragmatic (or utility-maximizing) component of the quality score. This term favors the policies that entail low risk and minimize the difference between predicted and preferred outcomes. The second term of (4) is the expected ambiguity of outcomes given hidden states; it constitutes the epistemic (or uncertainty-minimizing) component of the quality score and favors policies that are expected to resolve uncertainty.
After scoring all the policies using the EFE, action selection is performed by drawing from the action posterior expectations derived from the sufficient statistics $\boldsymbol{\pi}$ of the policy posterior.
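As an illustration of this perception-action loop, the sketch below implements the state update (3a), the EFE score (4), the policy posterior (3b), and action sampling on a deliberately tiny toy model; the three-state state space, the matrices, and the fixed precision value are illustrative assumptions and do not reproduce the maze model used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-16

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def update_state(A, B_u, s_prev, obs_idx):
    """Perceptual update under a policy, cf. (3a):
    s_t ~ softmax(ln A . o_t + ln(B(u) . s_{t-1}))."""
    return softmax(np.log(A[obs_idx, :] + EPS) + np.log(B_u @ s_prev + EPS))

def expected_free_energy(A, B, C, s, policy):
    """EFE of a policy, cf. (4): risk (divergence of predicted from preferred outcomes)
    plus ambiguity (expected entropy of the likelihood)."""
    G = 0.0
    for u in policy:
        s = B[u] @ s                               # predicted states under the policy
        o_pred = A @ s                             # predicted outcomes
        risk = np.sum(o_pred * (np.log(o_pred + EPS) - np.log(C + EPS)))
        ambiguity = -np.sum((A * np.log(A + EPS)).sum(axis=0) * s)
        G += risk + ambiguity
    return G

# Toy model: 3 states, 3 outcomes, 2 actions (purely illustrative)
A = np.eye(3)                                      # likelihood P(o|s): states observed directly
B = [np.roll(np.eye(3), 1, axis=0), np.eye(3)]     # action 0: advance to the next state; action 1: stay
C = softmax(np.array([0.0, 0.0, 4.0]))             # preferred outcomes (outcome 2 is preferred)
s = np.array([1.0, 0.0, 0.0])                      # current belief about the state
policies = [(0, 0), (1, 1), (0, 1)]                # candidate action sequences
gamma = 4.0                                        # precision, held fixed here (cf. (3c))

G = np.array([expected_free_energy(A, B, C, s, p) for p in policies])
pi_post = softmax(-gamma * G)                      # policy posterior, cf. (3b) with a flat prior E
u0 = policies[rng.choice(len(policies), p=pi_post)][0]   # sample the first action
obs_idx = int(np.argmax(B[u0] @ s))                # toy observation after acting
s = update_state(A, B[u0], s, obs_idx)             # perceptual update, cf. (3a)
print(pi_post, u0, s)
```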
B. Multi-Agent Active Inference
The key advancement of this article is extending the active inference framework to a multi-agent setting [1], in which multiple agents (here, two) perform a joint task that consists of navigating in a “joint maze” (Fig. 1) to simultaneously reach either the red or the blue location. The “joint maze” of Fig. 1 includes 21 locations, L1, L2, …, L21.
For each trial, each agent can choose among 25 action sequences, or policies.
Each agent has a separate generative model, whose structure is shown in Fig. 2. In Simulation 1, the two agents are equipped with identical generative models, except for a different estimate of their starting locations, L3 or L19. In Simulation 2, the two generative models also differ in some of their components, as detailed below.
When the two generative models are considered together, they can be defined as
The observations
The set of tensors
The first factor
The salience of the agent’s location with respect to the blue goal is defined as
\begin{equation*}
v_{b}^{\mathrm{D}^{i}}\left(s_{1}\right)=\left(\frac{d\left(s_{1},\mathrm{L}10\right)}{d\left(s_{1},\mathrm{L}10\right)+d\left(s_{1},\mathrm{L}12\right)}\right)\cdot \left(1-\left(\max\left(\mathrm{D}_{3}^{i}\right)-0.5\right)\right). \tag{5}
\end{equation*}
Equation (5) depends on two terms. The first term implies that the smaller the Euclidean distance between the agent location and the blue goal location (L12 in Fig. 1), the greater the evidence that the agent is pursuing the blue goal (note that we could have used a more complex measure that also considers, for example, direction of movement, but we found that Euclidean distance is sufficient in this setting). The second term implies that the more peaked the mode of $\mathrm{D}_{3}^{i}$ (the prior belief over task goals), the smaller this term and hence the lower the salience assigned to the location.
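For illustration, equation (5) could be transcribed as follows; the 2-D grid coordinates assigned to L10 and L12 and the helper names are assumptions made for this sketch, since the maze layout is only shown in Fig. 1.

```python
import numpy as np

# Illustrative 2-D coordinates for the red (L10) and blue (L12) goal locations
COORDS = {"L10": np.array([4.0, 1.0]), "L12": np.array([4.0, 3.0])}

def salience_blue(agent_xy, D3):
    """Salience of a location w.r.t. the blue goal, cf. (5): relative closeness to L12,
    discounted by how peaked the prior D3 over task goals is."""
    d_red = np.linalg.norm(agent_xy - COORDS["L10"])
    d_blue = np.linalg.norm(agent_xy - COORDS["L12"])
    closeness = d_red / (d_red + d_blue)   # 1 at the blue goal, 0 at the red goal
    discount = 1.0 - (np.max(D3) - 0.5)    # 1 if the prior is split between the goals, 0.5 if fully peaked
    return closeness * discount

# Example: an agent close to the blue goal, holding an uncertain prior over the task goals
print(salience_blue(np.array([4.0, 2.5]), D3=np.array([0.5, 0.5, 0.0, 0.0])))
```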
The tensor
We calculate the (absolute) difference between the salience of one’s own and the other agent’s location, with respect to the blue goal $v_{b}^{\mathrm{D}^{i}}(s_{k})$ and the red goal $v_{r}^{\mathrm{D}^{i}}(s_{k})$, as $\Delta v^{i}_{k}=|v_{b}^{\mathrm{D}^{i}}(s_{k})-v_{r}^{\mathrm{D}^{i}}(s_{k})|$, where $s_{k}$ is $s_{1}$ when considering one’s location and $s_{2}$ when considering the other’s location. We define the likelihood $p(o_{2}\mid s_{1},s_{2}) = \mathrm{sig}(\Delta v_{1}^{i}\cdot \Delta v_{2}^{i})$, where $\mathrm{sig}(x)=1/(1+\sigma \cdot e^{-\rho x})$ is the parametric logistic function. We assume that $p(o_{2}\mid s_{1},s_{2})$ ranges in the interval (0.75, 1), which we obtain by fixing the parameters of the logistic function as $\sigma=10$, $\rho=4$.
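The corresponding likelihood could be computed from the salience differences as in the sketch below; the parameter values (sigma = 10, rho = 4) follow the text, while the example inputs are arbitrary.

```python
import numpy as np

def sig(x, sigma=10.0, rho=4.0):
    """Parametric logistic function sig(x) = 1 / (1 + sigma * exp(-rho * x))."""
    return 1.0 / (1.0 + sigma * np.exp(-rho * x))

def likelihood_o2(dv_own, dv_other):
    """p(o2 | s1, s2) = sig(dv_own * dv_other), where each dv_k = |v_b(s_k) - v_r(s_k)|
    is the absolute salience difference for one agent's location."""
    return sig(dv_own * dv_other)

# Example: a pair of diagnostic locations (large salience differences) versus an ambiguous pair
print(likelihood_o2(0.8, 0.9), likelihood_o2(0.05, 0.1))
```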
The tensor $\mathrm{A}_{3}^{i}$ maps the two agents’ locations ($s_{1}$, $s_{2}$) and the task goal ($s_{3}$) into the corresponding observation and is defined as
\begin{equation*}
\mathrm{A}_{3\,1,2,3}^{i}=p\left(o_{3}\mid s_{1},s_{2},s_{3}\right)\equiv v_{s_{3}}^{\mathrm{D}^{i}}\left(s_{1},s_{2}\right)=v_{s_{3}}^{\mathrm{D}^{i}}\left(s_{1}\right)\cdot v_{s_{3}}^{\mathrm{D}^{i}}\left(s_{2}\right).
\end{equation*}
The tensor
Tensor B encodes a deterministic mapping between hidden states, given the control state u. Note that here the control state u corresponds to a joint action, not to the action of a single agent; hence it is specified as the tensorial product between the vector of the five possible movements of one agent (“up,” “down,” “left,” “right,” and “wait”) and the vector of the same five movements of the other agent. The tensor B describes how the spatial locations of the two agents change, deterministically, as a function of the selected joint action.
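The joint control space described above could be constructed as in the sketch below, where the 25 joint actions are the product of the five single-agent moves and each joint action deterministically updates both agents' locations; the grid size, the wall set, and the helper names are illustrative assumptions rather than the actual maze of Fig. 1.

```python
import itertools

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "wait": (0, 0)}

# The 25 joint actions: one move for the gray agent combined with one move for the white agent
JOINT_ACTIONS = list(itertools.product(MOVES, MOVES))

def step(pos, move, walls, shape=(5, 5)):
    """Apply a single-agent move deterministically; bumping into a wall or the border means staying put."""
    r, c = pos
    dr, dc = MOVES[move]
    new = (r + dr, c + dc)
    if not (0 <= new[0] < shape[0] and 0 <= new[1] < shape[1]) or new in walls:
        return pos
    return new

def joint_step(pos_gray, pos_white, joint_action, walls):
    """Deterministic transition of the joint hidden state (both agents' locations) under a
    joint action, i.e., the kind of mapping encoded by tensor B."""
    move_gray, move_white = joint_action
    return step(pos_gray, move_gray, walls), step(pos_white, move_white, walls)

# Example on a 5x5 grid with a single illustrative wall cell
walls = {(2, 2)}
print(len(JOINT_ACTIONS))                                  # 25
print(joint_step((0, 2), (4, 2), ("down", "up"), walls))   # both agents move toward the center
```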
The action-perception cycle of the multi-agent active inference model is the same as in single-agent active inference (see Fig. 2), except that the two agents exchange observations between them. Specifically, each agent observes the current location of the other agent at every time step and uses it for his own inference.
The two key mechanisms of the model are Task goal inference and Plan inference. Task goal inference corresponds to inferring what the goal of the joint task is, i.e., updating the belief about the four possible task goals (blue-blue, blue-red, red-blue, and red-red). As the task goal is specified at the level of the dyad, in order to infer it, each agent needs to consider both his prior knowledge about the task goal and the movements of the other agent, which are informative about the other agent’s task knowledge. Specifically, task goal inference follows a principle of rational action; namely, the expectation that the other agent will act efficiently to achieve his goals [55]. Put simply, if an agent observes the other agent moving toward the red (or blue) goal, he updates his belief about the joint goal context by increasing the probability that the goal is red (or blue). Furthermore, both agents update their beliefs about the task goal at the end of each trial, when they receive feedback about success (“win” observation) or failure (“lose” observation).
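Schematically, this update of the task-goal belief can be written as a Bayesian update in which the observed movement of the other agent is scored under each candidate goal (the rational-action assumption); the likelihood values below are illustrative placeholders, not the model's tensors.

```python
import numpy as np

TASK_STATES = ["red-red", "blue-blue", "red-blue", "blue-red"]

def update_task_goal(prior, movement_likelihood):
    """Bayesian update of the belief over the four task goals, given the likelihood of the other
    agent's observed movement under each goal (rational-action assumption)."""
    posterior = prior * movement_likelihood
    return posterior / posterior.sum()

# Example: the white agent is seen moving toward the red button; this movement is more likely
# under the goals in which the white agent presses red (red-red and red-blue).
prior = np.array([0.5, 0.5, 0.0, 0.0])        # uncertain between red-red and blue-blue
lik = np.array([0.8, 0.2, 0.8, 0.2])          # illustrative likelihood of the observed movement
print(update_task_goal(prior, lik))           # the belief shifts toward red-red
```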
Plan inference corresponds to inferring the course of action (or plan) that maximizes task success, on the basis of the inferred joint task. In this model, each agent infers both his own and the other agent’s plan—although, of course, he can only execute his own plan. The inference about one’s own and the other agent’s plans needs to consider the utility of following different routes (which privileges the shortest route) and the uncertainty about the goal (which prompts “epistemic” behavior and the selection of informative routes). The balance between utilitarian and epistemic components of planning will become important in Simulation 2, as discussed below.
A key thing to notice is that the perception-action cycles of the two agents—and their inferential processes—are mutually interdependent, as the movements of one agent determine the observations of the other agent at the next time step. Our simulations will show that this interactive inference naturally leads to the alignment of belief states and behavioral patterns of the two agents, analogous to the synchronization of neuronal activity and kinematics in socially interacting dyads [15], [30], [49]. Furthermore, the simulations will show that “social epistemic actions” that aim at reducing the uncertainty of the other agent increase the alignment and task success, especially in tasks with asymmetric knowledge.
Simulations of Interactive Inference
We present two simulations of interactive inference using the “joint maze” task of Fig. 1. The first simulation illustrates the case in which both agents know that the goal can be either the blue or the red button and that the same goal should be reached by both of them simultaneously. This simulation illustrates that even if the two agents start with an uncertain prior over the joint goal context, over time their beliefs (and behavior) gradually align, which permits the agents to successfully complete the task most of the time. The second simulation illustrates the case of two agents that initially have asymmetric task knowledge. One of them (the leader) knows whether the blue or the red button is the goal. The other (follower) agent does not know this, but knows that the same goal should be achieved with the leader simultaneously. This simulation illustrates sensorimotor communication—and the importance for the leader to select (epistemic) actions that reduce the follower’s uncertainty, in order to complete the task successfully. Note that in these simulations, the agent’s beliefs about hidden states
A. Simulation 1 (Leaderless Interaction)
The goal of Simulation 1 is testing whether and how interactive alignment favors the alignment of behavior and belief states of two agents engaged in the “joint maze” task. This simulation comprises 100 trials. For each trial, two identical agents (apart from their prior on the joint goal context, see later) start from two opposite locations of the “joint maze”: the gray agent starts in location L3 and the white agent starts in location L19. They can move one step at a time, or wait (i.e., remain in the same location), until they reach one of the locations that include colored goals (red in L10, blue in L12). There are multiple sequences of actions (aka “policies”) that each agent can take to reach the goal locations, which correspond to shorter or longer paths, with or without “wait” actions, etc. The 25 policies used in the simulation are specified in Section III. What is most important here is that irrespective of the selected policy, a trial is only successful if both agents reach the same goal/button location, red (L10) or blue (L12). Specifically, if at the end of the trial both agents are in the red (L10) or the blue (L12) button location, then the trial is successful and the agents receive the preferred observation (“win”). Otherwise, if the two agents fail to reach the same button location simultaneously (e.g., one is in L10 and the other is in L12), the trial is unsuccessful and the agents receive an undesirable observation (“lose”).
Fig. 3 shows the results of one example simulation. At trial 1, the agents start with the same prior on the joint goal context. This uncertain belief assigns 0.5 to “both agents will press red” (in short, red-red), 0.5 to “both agents will press blue” (in short, blue-blue) and zero to the two other possible states (red-blue: “the white agent will press red and the gray agent will press blue” and blue-red: “the white agent will press blue and the gray agent will press red”).
Results of Simulation 1. The first two panels show the prior beliefs of the white (first panel) and gray (second panel) agents at the beginning of each trial. The vertical bars indicate moments in which we manually change the mind of the white agent: we invert his belief about the joint goal context, assigning a higher value (0.7) to blue-blue (if his prior assigned higher probability to red-red) or to red-red (if his prior assigned higher probability to blue-blue). The third panel shows the outcome of the trials. These include successful trials in which the agents press the blue button (blue bars) or the red button (red bars) and failures (black bars).
The first two panels of Fig. 3 show the prior beliefs about the joint goal context of the white and gray agents, respectively, at the beginning of each trial, from 1 to 100. In this and the subsequent simulations, the agents’ prior beliefs for the first trial are set manually, as discussed above. Then, the prior beliefs are updated within trials, as a result of inference. Furthermore, they are updated across trials: they are the posterior beliefs at the end of the previous trial (as usual in Bayesian inference), but multiplied by a fixed (volatility) factor. This ensures that the prior probability of red-red or blue-blue cannot be higher than 0.7. This is because in many trials, the posterior beliefs reach the value of 1 for red-red or blue-blue, so the agent is sure about the shared task goal. If this posterior value were used as the prior value for subsequent trials, there would be little room for changes of mind. Introducing the fixed factor amounts to assuming that the agents are not fully sure that the joint task goal will remain the same across trials—or in other words, believe that the environment has some volatility. From time to time (vertical bars) we manually “change the mind” of agent 1 (the white agent) from red-red to blue-blue or vice versa, to introduce variability in the simulation.
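The across-trial carryover of beliefs could be implemented as in the sketch below; the 0.7 cap follows the text, while the specific mixing rule (softening the posterior toward a uniform distribution just enough to respect the cap) is an assumption made for illustration.

```python
import numpy as np

def carry_over_prior(posterior, cap=0.7):
    """Turn the end-of-trial posterior into the next trial's prior, softening it (volatility)
    so that no task goal exceeds the cap. The exact rule is an assumption; the text only
    states that a fixed factor keeps the prior of red-red or blue-blue at or below 0.7."""
    posterior = np.asarray(posterior, dtype=float)
    n = len(posterior)
    uniform = np.full(n, 1.0 / n)
    p_max = float(np.max(posterior))
    if p_max <= cap:
        return posterior.copy()
    # Smallest mixing weight w such that (1 - w) * p_max + w / n == cap
    w = (p_max - cap) / (p_max - 1.0 / n)
    return (1.0 - w) * posterior + w * uniform

# Example: a fully confident end-of-trial posterior is softened before the next trial
print(carry_over_prior(np.array([1.0, 0.0, 0.0, 0.0])))   # [0.7, 0.1, 0.1, 0.1]
```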
The third panel of Fig. 3 shows whether the agents completed the trial successfully by pressing the same button (the blue bars indicate that both pressed the blue button, whereas the red bars indicate that both pressed the red button) or unsuccessfully (black bars). Finally, the gray vertical bars show trials in which the white agent “changes mind” about the goal (e.g., from blue-blue to red-red or vice versa). Following a “change of mind,” the dyad usually requires one or a few trials before realigning on the new joint task goal.
Fig. 3 shows that the two agents end most trials with aligned belief states, except in the first trials (in which they started with uncertain beliefs) and immediately after the changes of mind (vertical bars). Furthermore, the two agents are successful during most of the trials in which their beliefs are aligned and unsuccessful when their beliefs are not aligned. As shown in Fig. 3, the errors occur in the very first trials, immediately after the white agent changes his mind, and in one trial afterward. The errors in the first trials may occur because the agents are uncertain about what to do and they assign the same (EFE) “score” to the two policies that go straight to the red button and the blue button; see Section III for an explanation of EFE and Fig. S3 in the supplementary material for an illustration of the EFE of the policies of the white agent at the beginning of the first trials. When the two agents are very uncertain, there are two possible behaviors:
Both agents may select their task goals randomly, which might or might not result in an error (see Fig. S4 in the supplementary material for an illustration of the results of 100 replications of the same experiment, without changes of mind).
One agent might simply follow the other and be successful. This “follower effect” is particularly apparent when the agents’ prior beliefs are weak, as in the first trials.
To better quantify the interactive alignment of belief states between the agents across trials and its effects on performance, we executed 100 runs of Simulation 1 and plotted a measure of the belief alignment of the dyad—the KL divergence between the joint goal contexts—and their performance, during the first 15 trials. The top panel of Fig. 4 shows that at the beginning of the simulation, the KL divergence between the prior beliefs of the two agents is small, as they are both equally uncertain about their joint task. While their beliefs are apparently aligned, the alignment regards an uncertain state—and this is why their performance is initially low (see the bottom panel of Fig. 4). During the next few trials, the agents consider different hypotheses about the joint task goal. This leads to a transient increase in the KL divergence between their beliefs (and its variance). After a few trials, the two agents converge to a shared belief about the joint goal and hence the KL divergence decreases. During this process, the success rate increases from 50% (random) to above 90%, within a few trials, see the bottom panel of Fig. 4.
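The alignment measure is the standard KL divergence between the two agents' categorical beliefs over the joint goal context; a minimal computation is sketched below (the one-directional form is shown here).

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two categorical beliefs over the task goals."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Aligned beliefs yield a small divergence; conflicting beliefs yield a large one
print(kl_divergence([0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]))   # ~0
print(kl_divergence([0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]))   # large
```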
Average results of 100 runs, with the same parameters as Simulation 1, for 15 trials. The top panel shows a measure of belief (dis)alignment of the agents: the KL divergence between their joint goal contexts. The plot shows the mean value of KL and the standard deviation of the mean (note that we removed outliers whose KL fell outside a confidence interval of 95%). The bottom panel shows a histogram of mean success rate.
Note that in this simulation the initial choice of a particular joint goal context (red-red or blue-blue) is random, but its persistence across trials depends on a process of interactive belief alignment between the agents. The alignment of behavior and of beliefs about task goals might occur in two ways:
It might occur thanks to interactive inference within trials, namely, because each agent monitors the movements of the other agent and uses this information to update his estimate about the joint task goal and the other agent’s plan, following a principle of rational action (i.e., the expectation that the other agent will act efficiently to achieve his goals [55]).
The alignment might occur because at the end of each trial, the agents receive feedback about their success (“win” observation) or failure (“lose” observation) and use this feedback to update their beliefs. This second form of alignment might be the byproduct of a standard reinforcement learning approach that learns behavior by trial and error, without interactive inference within trials (but note that using reinforcement learning would require updating the model parameters, whereas we keep them fixed in our simulations).
To understand whether the first mechanism based on interactive inference is actually useful for alignment and task success, we replicated the same experiment, but prevented interactive inference from taking place. We did this by removing any useful information from the likelihood matrix that maps the others’ positions into task goals (i.e., by making the corresponding likelihood entries uniform). As reported in Figs. S5 and S6 in the supplementary material, without interactive inference within trials the agents align their beliefs more slowly and achieve a lower success rate.
In sum, Simulation 1 shows that two agents that engage in interactive inference can align both their joint goal contexts and their plans to achieve the joint task goal, forming shared task knowledge [18], [56], [57]. The alignment at both the belief and behavioral levels is made possible by a process of interactive, reciprocal inference of goals and plans. The two agents initially have weak beliefs about the goal identity and therefore they can “follow each other” until they settle on some joint goal—and subsequently stick to it.
B. Simulation 2 (Asymmetric Leader–Follower Interaction)
The goal of Simulation 2 is testing the emergence of “leader–follower” dynamics observed in human studies using the “joint maze” setup [34] and other related studies in which the agents have asymmetric preferences (or information) about the joint task goal [25], [58], [59], [60]. This simulation is similar to Simulation 1, but the two agents differ in their prior beliefs about the task goal [25], [34], [58], [59], [60]. Specifically, the white agent (the “leader”) knows the task to be accomplished—for example, red-red—whereas the gray agent (the “follower”) does not. In other words, while in Simulation 1 both agents had initially weak beliefs (or preferences) about the joint goal and can therefore be considered two “followers,” in Simulation 2 one of the two agents is a “leader” and has a strong initial preference about the joint task goal.
The generative model of the follower is identical to the one used in Simulation 1, whereas the generative model of the leader differs from it in two ways. First, the (likelihood) tensor
Several studies [25], [34], [58], [59], [60] showed that when leaders and followers have asymmetric information, the leaders modify their movement kinematics to signal their intentions and reduce the uncertainty of the followers [16], [35]. For example, consider that in the scenario of Fig. 1 the leader (white agent) has a choice between two kinds of action sequences, or policies, to reach the red goal location. The first, “pragmatic policies,” follow the shortest and hence most efficient path to the goal: L15, L11, and L10. However, if the leader selects a pragmatic policy, he does not offer any cue to the follower about the joint task goal until the last action (to L10). This is because passing through L15 and L11 is equally likely if the intended goal is red or blue and hence does not provide diagnostic information about the goal location. The second, “social epistemic policies,” follow the route through L18, L17, L14, L9, and L10, which, despite being longer, provides the follower with early information about the intended goal location. This is because passing through L18, L17, L14, and L9 is rational only if the goal is the red button—and hence it provides diagnostic evidence that the to-be-pressed button is red. The above studies [25], [34], [58], [59], [60] show that leaders often select “social epistemic policies”: they sacrifice efficiency to reduce the follower’s uncertainty.
The tradeoff between pragmatic and epistemic components of policy selection is automatic in active inference, because the EFE functional used in active inference to score policies includes two components: 1) a “pragmatic component” that maximizes utility and prioritizes the shortest paths to the goal and 2) an “epistemic component” that minimizes uncertainty (see Section III). We therefore expected the leaders to select “social epistemic policies” most often when the followers were uncertain—and to select “pragmatic policies” when uncertainty was reduced.
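The sketch below makes the expected tradeoff concrete by scoring two stylized policies (a short pragmatic route and a longer, informative route) with and without the epistemic term, mimicking the control condition discussed later; the numerical costs and information gains are invented for illustration and are not taken from the model.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def policy_probabilities(path_cost, info_gain, use_epistemic=True, gamma=1.0):
    """Stylized policy scoring: negative EFE = -(pragmatic cost) + (epistemic value).
    Setting use_epistemic=False mimics the control condition in which only the pragmatic
    component is retained."""
    neg_efe = -np.asarray(path_cost, dtype=float)
    if use_epistemic:
        neg_efe = neg_efe + np.asarray(info_gain, dtype=float)
    return softmax(gamma * neg_efe)

# Two stylized policies: [pragmatic (short) route, social epistemic (long) route]
path_cost = [3.0, 5.0]   # e.g., number of steps to the goal
info_gain = [0.0, 3.0]   # expected reduction of the follower's uncertainty (illustrative)
print(policy_probabilities(path_cost, info_gain, use_epistemic=True))    # favors the long, informative route
print(policy_probabilities(path_cost, info_gain, use_epistemic=False))   # favors the short route
```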
The results of an example leader–follower simulation lasting 30 trials are shown in Fig. 5. The first and third panels of Fig. 5 show the prior beliefs of the leader (white agent) and the follower (gray agent), respectively, at the beginning of each trial. These are largely aligned, except in the very first trials. The second and fourth panels show the policies selected by the leader and the follower, respectively. As discussed above, we divided policies into two categories: 1) “pragmatic policies” (S: shorter red bars) that follow the shortest path to the goal and 2) “social epistemic policies” (L: longer blue bars) that follow longer but more informative paths. The second panel of Fig. 5 shows that the leader tends to select “social epistemic policies” in the first trials, to reduce the follower’s uncertainty (see also Fig. S7 in the supplementary material for a visualization of the EFE of the leader’s policies). In contrast, the follower gains no benefit from selecting epistemic policies and selects pragmatic policies in almost all trials. Finally, the bottom panel of Fig. 5 shows that in all but the first two trials (short black bars), the agents successfully achieve the red-red goal (long red bars).
Results of Simulation 2: example of leader–follower dyadic interaction, for 30 trials. The first two panels show the prior beliefs of the leader at the end of each trial and the policy he selects (shorter red bar: pragmatic policy that follows the shortest path to the goal; longer blue bar: social epistemic policy that follows the longer but more informative path to the goal). The third and fourth panels show the prior beliefs of the follower at the end of each trial and the policy he selects. The fifth panel shows the outcome of the trials. These include successful trials in which the agents press the red button (red bars) and failures (black bars).
Fig. 6 shows the results of 100 repetitions of the same simulation (see also Fig. S8 in the supplementary material). The first panel shows that the beliefs of the agents, measured as the KL divergence between their prior beliefs about the task goal, align over time. The second panel shows that the average performance of the dyads, measured as the number of times they select the correct red-red goal, increases over time. The third panel shows the percentage of “social epistemic policies” selected by the leaders. Initially, the leaders have a strong tendency to select “social epistemic policies,” but this tendency decreases significantly across trials, as the followers become increasingly certain about the joint task goal, as shown empirically [25], [34], [58], [59], [60]. This result emerges because in the EFE (used in active inference to score policies), the decrease of uncertainty lowers the epistemic value of policies, hence lowering the probability that they will be selected [61].
Average results of 100 runs, with the same parameters as Simulation 2. The format of the first two plots is the same as in Fig. 4. The third plot shows the percentage of policies selected by the leader that we label “social epistemic actions” and that prescribe signaling behavior. See the main text for an explanation. For example, if the gray agent is the leader, he can select an epistemic policy that passes through L3, L2, L1, L6, L9, and L10 (to reach the red button) or through L3, L4, L5, L8, L13, and L12 (to reach the blue button).
Fig. 7 permits appreciating how the leader balances epistemic and pragmatic policies over time. The first panel shows the negative EFE (averaged across 100 repetitions) of the most useful social epistemic policy (red line) and of the most useful pragmatic policy (green line). It shows that the social epistemic policy has a very high probability of being selected during the first five trials; then its probability decreases until the pragmatic policy becomes the most likely, starting from trial 10. The second panel shows the probability (averaged across 100 repetitions) that the leader selects a social epistemic policy as a function of his uncertainty about the task goal, quantified as the entropy of his belief about the joint goal context (recall that this is a shared representation that encodes both the leader’s and the follower’s contributions). The entropy over this variable reflects an estimate of the follower’s uncertainty, not of the leader’s uncertainty (as the leader knows the goal) and decreases over time, as the follower becomes less uncertain. Notably, the results shown in the bottom panel of Fig. 7 closely correspond to the findings of a study that uses the “joint maze” setting [34]. That study reports that the probability that a (human) participant selects a pragmatic policy is high only when (his or her estimate of) the follower’s uncertainty is very low (see [34, Fig. 8A]), which is in good agreement with the pattern shown in the bottom panel of our Fig. 7.
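The uncertainty measure used in this analysis is the entropy of the categorical belief over the joint goal context; a minimal computation is sketched below.

```python
import numpy as np

def entropy(belief, eps=1e-12):
    """Shannon entropy (in nats) of a categorical belief over the joint goal context."""
    b = np.asarray(belief, dtype=float) + eps
    b = b / b.sum()
    return float(-np.sum(b * np.log(b)))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # maximal uncertainty (~1.39 nats)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # near-certain belief (~0.17 nats)
```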
How the leader balances epistemic and pragmatic policies in Simulation 2. Top panel: negative EFE of the most useful epistemic (red) and pragmatic policy (green), in the first ten trials. Bottom panel: frequency of the most useful epistemic policy as a function of the entropy of the leader’s joint goal context. This entropy provides a measure of the (leader’s estimate of the) follower’s uncertainty.
In sum, Simulation 2 shows that in leader–follower interactions with asymmetric knowledge, leaders select “social epistemic policies”—and therefore sacrifice some efficiency in their choice of movements—to signal their intended goals to the followers and reduce their uncertainty. This signaling behavior is progressively reduced when the followers become more certain about the joint action goal. This form of signaling was shown in previous computational models that used goal and plan inference, but those models used ad-hoc formulations to promote social epistemic actions [16], [35]. In contrast, social epistemic actions emerge naturally in our model, for two reasons. First, the EFE functional used to score policies automatically balances pragmatic and epistemic components. This means that when uncertainty resolution is necessary, the EFE functional automatically promotes epistemic behavior [45]. To illustrate this point, we performed a control simulation (Fig. S9 in the supplementary material) that is the same as Simulation 2, except that we removed the “epistemic component” from the EFE (see Section III). In the control simulation, the leader selects significantly fewer social epistemic policies, the behavioral alignment process is slower, and the success rate grows more slowly compared to the case in which the full EFE is used. This control simulation shows that the EFE affords social epistemic actions and that these promote leader–follower interactions.
The second reason why social epistemic behavior emerges in our model is that the leader’s generative model includes beliefs about the shared task goal. When scoring his policies (via the EFE functional), the leader considers the uncertainty (or the entropy) of the shared task goal and assigns high probability to “epistemic” policies, regardless of whether they lower his own uncertainty (as shown in previous studies) or the follower’s uncertainty [34]. This illustrates that active inference agents endowed with shared representations behave in socially oriented ways, even without ad-hoc incentives.
Conclusion
Joint actions are ubiquitous in our lives and engage several cognitive abilities, such as mutual prediction, mental state inference, sensorimotor communication, and shared task representation. However, we still lack a comprehensive formal model that explains these abilities from first principles. Here, we proposed a computational model of interactive inference, in which two active inference agents coordinate around a joint goal (pressing together either a red or a blue button) that they do not know in advance (Simulation 1) or that only one of them knows in advance (Simulation 2).
Our results show that the interactive inference model can successfully reproduce key behavioral and neural signatures of dyadic interactions. Simulation 1 shows that when two agents have the same (uncertain) knowledge about the joint task to be performed, they spontaneously coordinate around a joint goal and align their behavior and their task knowledge (here, their beliefs about the joint goal) over time. This result is in keeping with a large number of studies showing the synchronization of neuronal activity and kinematics during joint actions, perhaps as a way to enhance coordination and the sense of joint agency [15], [30], [49], [62]. Furthermore, interactive inference is robust to sudden changes of mind of one of the agents, as indexed by the fact that the alignment of behavior and task knowledge is recovered quickly. While simple joint tasks such as the “joint maze” adopted here could in principle be learned by trial and error and without inference, our control simulation shows that interactive inference within trials promotes better performance and alignment of behavior and of belief states (see Figs. S5 and S6 in the supplementary material).
Simulation 2 shows that during dyadic interactions in which only one agent (the “leader”) knows the task to be performed but the other agent (the “follower”) does not, the leader systematically selects “social epistemic policies” in early trials. The social epistemic policies sacrifice some path efficiency to give the follower early cues about the task goal, hence reducing his/her uncertainty and contributing to optimize the joint action. The results of this simulation are in keeping with a large number of studies of sensorimotor communication during dyadic interactions with asymmetric information [25], [34], [58], [59], [60]. Specifically, our model reproduces two key phenomena of leader–follower interactions:
In all these tasks, leaders select an apparently less efficient path, which however provides early information about the intended task goal.
The selection of these more informative (or social epistemic) policies is dependent on the follower’s uncertainty and it is abolished when the follower is (estimated to be) no longer uncertain, as reported in a study that uses our “joint maze” setup [34] and other studies in which the uncertainty of the follower varies across trials [25], [60].
Different from previous models, here the leader’s social epistemic behavior does not require ad-hoc mechanisms [16], [35]. Rather, it is a necessary consequence of the fact that the agents have shared task knowledge and select actions using the EFE functional, which considers epistemic actions on an equal footing with pragmatic actions. In other words, active inference agents who cooperate in uncertain conditions and have beliefs about their shared goal can natively select “epistemic” policies that reduce their own uncertainty (as shown in [45]) and the uncertainty of the other agents (as shown in our simulations).
Another important feature of our model is its flexibility. Simulations 1 and 2 use exactly the same computational model, except for the fact that in Simulation 2 only the “leader” knows the goal. This implies that active inference is flexible enough to reproduce various aspects of joint action dynamics, without ad-hoc changes of the model. In our simulations, the differences between standard, “leaderless” (Simulation 1) and “leader–follower” (Simulation 2) dynamics emerge as an effect of the strength (and the precision) of the agents’ beliefs about the joint goal to be performed. When the agents’ beliefs are uncertain, as in Simulation 1, they tend to follow each other to optimize the joint goal—and update (and align) their beliefs afterward. In this case, the joint outcome (e.g., red-red or blue-blue) can be initially stochastic, but is subsequently stabilized thanks to the interactive inference. This setting therefore exemplifies a “peer-to-peer” or a “follower–follower” interaction. Yet, it is possible to observe some “leader–follower” dynamics, in the sense that one of the two agents drives the choice of one particular joint task goal. However, in Simulation 1, the role of the leader is not predefined, but rather emerges during the task, as one of the joint goals is stochastically selected during the interaction—and then the two agents stick to it (note however that the situation is different during changes of mind, because the goal is predefined by us rather than being stochastically selected during the interaction). In contrast, Simulation 2 exemplifies the case of a “leader–follower” setup in which the role of the leader is predefined—because the leader has a strong preference for one of the goals. The comparison of Simulations 1 and 2 shows that what defines leaders and followers is simply the strength of the prior about the joint goal context (and of its associated outcomes). Our results in Simulation 2 are in keeping with previous active inference models that showed the emergence of behavior synchronization and leader–follower dynamics in joint singing [63] and robotic dyadic interactions [41]. These studies nicely illustrate that several facets of joint actions emerge when two agents infer each other’s goals and plans. However, the results reported here go beyond the above studies, by demonstrating the emergence of sensorimotor communication and social epistemic actions when the agents have asymmetric information.
In sum, our simulations provide a proof-of-concept that interactive inference can reproduce key empirical results of joint action studies, such as the interactive alignment and synchronization of behavior and neuronal activity (which, in our model, correspond to the belief dynamics) during standard joint actions [15], [30], [49] and the “sensorimotor communication” during dyadic leader–follower joint actions with asymmetric information [25], [34], [58], [59], [60]. An open objective for future research is extending the empirical validation of this framework by applying it to model more cases of joint action, beyond the “joint maze” scenario of [34]. Another objective for future research is exploiting this framework to design more effective agents that exploit sensorimotor communication to enhance human–robot joint actions. The ease of human–human collaboration rests on our advanced abilities to infer intentions and plans, align representations, and select movements that are easily legible and interpretable by other agents [16]. Endowing robots with similar advanced cognitive abilities would permit them to achieve unprecedented levels of success in human–robot collaboration and plausibly increase the trust in robotic agents [44], [64], [65], [66]. Scaling up the approach from the current grid-world simulation to noisy, continuous-space robotic experiments would require more effective methods to learn sophisticated generative models (using, for example, deep learning methods [67], [68], [69]) and to plan in large state spaces (using, for example, tree search methods [70], [71]). Another challenge is scaling up the approach to groups of agents. In principle, each agent could maintain beliefs about each group member. However, a more parsimonious alternative is maintaining beliefs over the “average group member’s mind”; this latter approach remains to be evaluated in our setting [72]. Finally, a key challenge for future research is extending this framework beyond cooperative joint actions, to also cover competitive and mixed cooperative-competitive interactions, which are frequent in multi-agent settings [73].
ACKNOWLEDGMENT
The GEFORCE Quadro RTX6000 and Titan GPU cards used for this research were donated by the NVIDIA Corporation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
NOTE
Open Access funding provided by ‘Consiglio Nazionale delle Ricerche-CARI-CARE-ITALY’ within the CRUI CARE Agreement