Leading or Following? Dyadic Robot Imitative Interaction Using the Active Inference Framework

This study investigated how social interaction among robotic agents changes dynamically depending on the individual belief of action intention. In a set of simulation studies, we examine dyadic imitative interactions of robots using a variational recurrent neural network model. The model is based on the free energy principle such that a pair of interacting robots find themselves in a loop, attempting to predict and infer each other's actions using active inference. We examined how regulating the complexity term to minimize free energy determines the dynamic characteristics of networks and interactions. When one robot trained with tighter regulation and another trained with looser regulation interact, the latter tends to lead the interaction by exerting stronger action intention, while the former tends to follow by adapting to its observations. The study confirms that the dyadic imitative interaction becomes successful by achieving a high synchronization rate when a leader and a follower are determined by developing action intentions with strong belief and weak belief, respectively.


I. INTRODUCTION
SOCIAL interaction is considered an essential cognitive behavior. In both empirical studies and synthetic modeling, researchers have investigated underlying cognitive, psychological, and neuronal mechanisms accounting for various aspects of social cognitive behaviors. This study investigates mechanisms underlying synchronized imitation as a representative social cognitive act, by formulating the problem using the free energy principle (FEP) [1], [2]. In simulation experiments of dyadic robot imitative interaction, we examine how a leader and follower can be determined in conflicting situations by investigating the underlying network dynamics.
Numerous robotic studies have investigated imitative interaction. In the 1990s, imitation was identified as an indispensable human competency required in early development of cognitive behaviors [3], [4], [5], [6]. Rizzolatti and colleagues [7] showed that the mirror neuron system uses observations of an action to generate the same action. Arbib and Oztop [8], [9] indicated that mirror neurons may participate in imitative behaviors. Building on these findings, several research groups proposed computational mirror neuron models for imitation using Hidden Markov Models [10] and neural network models [11], [12], [13], [14].
An essential unsolved question in modeling of imitative interaction is how a leader, who initiates an action, and a follower, who imitates this action, can be determined when multiple choices of actions are possible among a set of well-habituated ones.
This work was sponsored by the Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan 904-0302.
Recent theories on predictive coding (PC) and active inference (AIF) based on the free energy principle (FEP) [2], [15] show that "action intention" and its "belief" can be formulated as a predictive model. "Action intention" is considered a top-down prediction of action outcomes and "belief" as an estimated precision of this prediction or the strength of intention (as described in [2], [15]). Analogous to PC and AIF, Ito and Tani showed that imitative interaction can be performed using an RNN model by minimizing the prediction error instead of free energy in order to update deterministic latent variables [13]. However, this deterministic model does not account for the belief of action intention because the precision of prediction cannot be estimated. On a related topic, Ahmadi and Tani developed predictive-coding inspired variational RNN (PV-RNN) [16]. Their model was used to investigate how the strength of top-down intention in predicting fluctuating temporal patterns was modulated, depending on learning conditions in the model. In the learning process, free energy represented by the weighted sum of the accuracy term and the complexity term is minimized. Ahmadi and Tani found that softer regulation of the complexity term during network training develops strong top-down intention. Predictions are more deterministic by self-organizing deterministic dynamics with the initial sensitivity characteristics in the network. Likewise, tighter regulation of the complexity term results in weaker intention and increased stochasticity. Compared to other neural network models based on the FEP [17], [18], [19], [20], PV-RNN has advantages when applied to problems in robotics. It can cope with temporal structure by using recurrence associated with stochastic latent variables and by hierarchical abstraction through a multiple timescale structure [21].
Our research group further investigated human-robot imitative interaction using PV-RNN. Chame and Tani [22] showed that a humanoid robot with force feedback control tends to lead or follow the human counterpart in imitative interaction when its PV-RNN is set to softer or tighter regulation, respectively. However, the result is preliminary, merely showing a one-shot experimental result without any quantitative analysis. In a similar experimental setup, Ohata and Tani [23] showed that this tendency can also be observed when regulation of the complexity term is modulated during the interaction phase, rather than during the prior learning phase. The study investigated pseudo-imitative interaction between a simulated robot and a human counterpart. This study, however, lacks genuine interaction between the simulated robot and the human counterpart because the outputs of the counterpart were replaced with static output sequences prepared in advance.
The main contribution of the current study is to clarify the underlying mechanism of how a leader and a follower can be determined in dyadic synchronized imitative interaction using the framework of AIF. This study is distinct from the authors' aforementioned past studies because genuine interaction between two robots using the same model is examined and results are analysed both quantitatively and qualitatively. An advantage of performing a robot-robot interaction experiment is that the internal dynamics of the two robots can be compared directly.
The interaction experiment considers two robotic agents that are trained to generate a set of movement primitive sequences. Movements are generated by following a probabilistic finite state machine whose transition probabilities differ between the two robots. After each robot learns the given probabilistic transition structure for a sequence, the experimental design allows us to investigate how two robots generate movement primitives in the synchronised imitative interaction. In particular, we examine conflicting situations in which each robot prefers to generate different movement patterns, depending on its learned experience. Do they synchronize to generate the same movement pattern, with one robot leading and the other following by adapting its intention? Or do they desynchronize by generating different movement patterns, ignoring their counterparts by following their own action intentions? The current study hypothesizes that these dyadic interaction outcomes depend on the relative strength of the intention between the robots as a result of regulating FEP complexity.

A. Predictive Coding and Active Inference
The current study applies the concepts of PC and AIF based on FEP [1]. PC considers perception as the interplay between a prior expectation of a sensation and a posterior inference about a sensory outcome. Expectation of the sensation can be modeled by a generative model that maps the prior of the latent state to the expectation of sensation. The posterior inference of the observed sensation can be achieved by jointly minimizing the error computed between the expected sensation and its outcome, i.e. the accuracy, plus the Kullback-Leibler (KL) divergence between the posterior and the prior distributions, i.e. the complexity. Posterior and prior are both represented by Gaussian probability distributions using means and variances. This is to minimize free energy or, equivalently, to maximize the lower bound of the logarithm of the marginal likelihood:

$$\log p_\theta(X) \geq \mathbb{E}_{q_\phi(z|X)}\left[\log p_\theta(X|z)\right] - D_{KL}\left[q_\phi(z|X) \,\|\, p_\theta(z)\right] \quad (1)$$

where $z$, $X$, $p_\theta$, and $q_\phi$ denote the latent state, the observation, the prior distribution, and the approximate posterior, respectively. $\theta$ and $\phi$ are the parameters of the generative and inference model. In maximizing the lower bound, the interplay between accuracy and complexity characterizes the model performance in learning, prediction, and inference. Consistent with AIF, actions are generated so that the error between the expected action outcome and the actual outcome is minimized. In robotic applications, this is equivalent to determining how expected proprioception in terms of robot joint angles can be achieved by generating adequate motor torque. A simple solution is to use a PID controller, in which adequate motor torque to minimize errors between expected joint angles and actual angles can be obtained by means of error feedback schemes. Finally, perception by predictive coding and action generation by active inference are deployed simultaneously, thereby closing the loop of action and perception.
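The error-feedback scheme mentioned above can be illustrated with a textbook discrete-time PID controller. This is only a minimal sketch: the gains, the time step, and the toy velocity-controlled joint in `track` are hypothetical choices for illustration and are not taken from the experiments reported here.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete-time PID controller for a single joint."""
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, expected_angle: float, actual_angle: float) -> float:
        # Proprioceptive prediction error: expected minus sensed joint angle.
        error = expected_angle - actual_angle
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Motor command that pushes the joint toward the expected angle.
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track(expected_angle: float, steps: int = 500) -> float:
    """Drive a toy joint toward the expected angle and return the final angle."""
    pid = PID(kp=2.0, ki=0.1, kd=0.0, dt=0.01)  # hypothetical gains
    angle = 0.0
    for _ in range(steps):
        command = pid.step(expected_angle, angle)
        angle += command * pid.dt  # hypothetical plant: velocity-controlled joint
    return angle
```

In this reading, the expected angle plays the role of the predicted proprioception, and the controller reduces the prediction error by acting on the joint rather than by revising the prediction.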

B. Overview of PV-RNN
The PV-RNN model is designed to predict future sensation by means of prior generation, while reflecting the past by means of posterior inference based on learning (see Fig. 1). One essential element of the model is the introduction of a parameter $w$, the so-called meta-prior, which regulates the complexity term in free energy. Different settings of $w$ alter the estimated precision in predicting the sensation, as described later for prior generation (see section III-C). The model is also characterised by employing an architecture of multiple timescale RNN (MTRNN) [21]. The whole network comprises multiple layers of RNNs wherein the dynamics of each layer are governed by different time constant parameters $\tau$. This scheme supports development of hierarchical information processing by adequately setting the timescale of each layer [21], [14]. This approach is considered analogous to [24], [25].
The following briefly describes the two essential parts, a generative model which is used for prior generation to make future predictions, and an inference model, which is used for posterior inference about the past. For further details, refer to [16], [23].
1) Generative Model: The stochastic generative model is used for prior generation, as illustrated in the future prediction part (after time step 4) in Fig. 1. PV-RNN is comprised of deterministic variables $d$ and random variables $z$. An approximate posterior distribution $q_\phi$ is inferred based on the prior distribution $p_\theta$ by means of error minimization on the generated prediction $X$. The generative model can be factorized as:

$$p_\theta(X_{1:T}, d_{1:T}, z_{1:T} \mid d_0) = \prod_{t=1}^{T} p_\theta(X_t \mid d_t)\, p_\theta(d_t \mid d_{t-1}, z_t)\, p_\theta(z_t \mid d_{t-1})$$

Although $d$ is a deterministic variable, it can be considered to have a Dirac delta distribution centered on $\tilde{d}$ as $\delta(d - \tilde{d})$.
$X$ is conditioned directly on $z$ through $\tilde{d}$. At the initial time step, $\tilde{d}$ is set to 0. Otherwise, $\tilde{d}$ is recursively computed, for which the internal state before activation is denoted by $h$. This internal state $h$ is a vector, calculated as the sum of the internal states of the current level $l$ and its connecting layers at the previous time step $t-1$ plus the latent $z$ in the same layer at the current time step $t$. $\tau_l$ denotes the layer-specific time constant. With a larger value for $\tau_l$, slower timescale dynamics develop, whereas with a smaller value, faster timescale dynamics dominate. $W$ represents connectivity weight matrices between layers and their deterministic and stochastic units. The output with size $N_x$ is computed as a mapping from $\tilde{d}^1$. The prior distribution $p_\theta(z_t)$ is a Gaussian distribution represented with mean $\mu^p_t$ and standard deviation $\sigma^p_t$. The prior depends on $\tilde{d}_{t-1}$ by following the idea of a sequence prior [26], except at $t = 1$ where it follows a unit Gaussian distribution.
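The role of the time constant can be illustrated with the leaky-integrator update commonly used in multiple timescale RNNs. The sketch below assumes that form, $h_t = (1 - 1/\tau) h_{t-1} + (1/\tau) u_t$ with $d_t = \tanh(h_t)$, where `inputs` is a hypothetical stand-in for the weighted sum of recurrent, inter-layer, and latent contributions; the exact update used in PV-RNN is given in [16], [23] and may differ in detail.

```python
import math


def mtrnn_step(h_prev, inputs, tau):
    """One leaky-integrator step: returns the new internal state h and activation d."""
    h = [(1.0 - 1.0 / tau) * hp + (1.0 / tau) * u for hp, u in zip(h_prev, inputs)]
    d = [math.tanh(v) for v in h]
    return h, d


def response(tau, steps=10):
    """Activation of a single unit after `steps` updates under a constant input of 1."""
    h = [0.0]
    d = [0.0]
    for _ in range(steps):
        h, d = mtrnn_step(h, [1.0], tau)
    return d[0]
```

Under a constant input, a unit with a small time constant approaches its steady state within a few steps, whereas a unit with a large time constant changes slowly, which is the property that lets higher layers represent slower structure.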
Based on the work on variational autoencoders, we use the reparameterization trick to formulate the latent prior of z t as mean µ p t and standard deviation σ p t . The reparameterization trick was introduced by Kingma and Welling [27] to make random variables differentiable for backpropagating errors through the network for learning. The same consideration is taken for the posterior of z t in the inference model as well (cf. below Eq. 7).
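As a minimal illustration of the reparameterization trick, a latent sample can be written as a deterministic function of the mean, the standard deviation, and an independent unit Gaussian noise term, $z = \mu + \sigma \epsilon$. The function name and the list-based representation are illustrative assumptions, not part of the PV-RNN implementation.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

Because the randomness enters only through the noise term, gradients with respect to the mean and standard deviation can pass through the sampling step in an automatic differentiation framework.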
2) Inference Model: Posterior inference is performed during learning and afterward, during action and perception. Fig. 1 illustrates information flow in the posterior inference in a time window from time step 2 to time step 3. The inference model for the posterior is described in terms of $e_t$, which denotes the error between the target $\bar{X}_t$ and the predicted output $X_t$. Like the prior $p_\theta$, the posterior $q_\phi$ is also a Gaussian distribution with mean $\mu^q_t$ and standard deviation $\sigma^q_t$, defined over the whole sequence $z_{1:T}$. Since computing the true posterior is intractable, an approximate posterior $q_\phi$ is inferred by maximizing the lower bound, analogous to Eq. (1). Here, the adaptation variable $A_{1:T}$ forces the parameters $\phi$ of the inference model to represent meaningful information. The lower bound of PV-RNN consists of two terms, where the first term is the accuracy and the second term is the complexity (for details, refer to [16]). $N_x$ and $N^l_z$ are the number of sensory dimensions and the number of the latent random variables at the $l$-th layer, respectively. $w^l$ serves as a weighting parameter for the complexity term in layer $l$ and is referred to as the meta-prior [16]. The meta-prior represents the strength for regulating the closeness between the posterior and the prior distributions. At $t = 1$, $w^l_1$ is set to 1.0. $w^l_{2:T}$ is set to a specific value when the sequence prior [26] is used after time step 1. In the posterior inference, all learning-related network parameters $\theta$, $\phi$, and the adaptive variable $A$ are updated to maximize the lower bound by back-propagating the error from time step $T$ back to time step 1 [28].
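How the meta-prior enters the objective can be sketched as follows. The closed-form KL divergence between diagonal Gaussians is standard; the function `weighted_free_energy` is a simplified, hypothetical stand-in for the PV-RNN objective that omits the normalization by $N_x$ and $N^l_z$ and the sum over time steps described in [16].

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[q || p] between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def weighted_free_energy(prediction_error, kl_per_layer, meta_priors):
    """Prediction error plus the complexity of each layer weighted by its meta-prior."""
    complexity = sum(w * kl for w, kl in zip(meta_priors, kl_per_layer))
    return prediction_error + complexity
```

For a fixed posterior-prior discrepancy, a larger meta-prior makes the same KL divergence more costly, so minimizing the objective pulls posterior and prior more strongly toward each other.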
3) PV-RNN in Dyadic Robot Interaction: Two robots equipped with the PV-RNN model interact during synchronized imitation. In the interaction, the robots predict proprioception $X^{pr}_{t+1}$ and exteroception $X^{ex}_{t+1}$ for the next time step. The predicted $X^{pr}_{t+1}$ regulates joint angle movements of a robot by means of a PID controller. This movement $X^{pr}_{t+1}$ can then be sensed by the other robot in terms of exteroception $X^{ex}_{t+1}$. This is provided through the kinematic transformation of joint angles $X^{pr}_{t+1}$ (cf. Fig. 2). While in the training phase, the error signal is taken from the proprioceptive $\bar{X}^{pr}$ as well as the exteroceptive $\bar{X}^{ex}$ target sequences, in the interaction

III. ROBOT EXPERIMENTS
To investigate how the interaction of two robots changes with tighter and looser regulation of complexity, each robot was trained and tested individually, as described in III-B and in III-C, respectively. Finally, two robots were examined during a dyadic interaction (III-D).

A. Task Design
Robotic agents are trained with three movement primitives A, B, and C (Fig. 3a). Each primitive is 40 time steps in length. A human experimenter generated the primitive data via a master control of a humanoid OP2. The experimenter controlled six joints in the upper body of one humanoid, $\bar{X}^{pr}$. The exteroceptive trajectory $\bar{X}^{ex}$ is generated by mirroring the robot's own movement $\bar{X}^{pr}$ and transforming it into the xy-coordinate positions of the left-hand and right-hand tips of the robot (Fig. 3b). $\bar{X}^{pr}$ and $\bar{X}^{ex}$ have six and four dimensions, respectively. Individual movement primitives are sampled and combined to form a continuous pattern of 400 time steps that follows a probabilistic sequence (analogous to [23]). Two probabilistic patterns were generated, A20%B80%C and A80%B20%C, as shown in the form of a probabilistic finite state machine (P-FSM) (Fig. 3c). The difference between these two probabilistic patterns is that C is biased and comes more often (80%) than B (20%) after A in the former, and vice versa for the latter. A point of interest is the interaction phase after the learning phase. It is expected that both robots can generate A synchronously, since it is a deterministic state. This could be different when generating B or C, as the two robots learned different preferences in terms of transition probabilities. One robot may lead so as to generate B or C while the other may just follow it. Alternatively, both robots may generate their own biased movements and thus desynchronize their behavior. The current study hypothesizes that whether B or C is generated synchronously or asynchronously between the two robots depends on the complexity regulation of each robot.
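The probabilistic sequence structure can be sketched as a small sampler. This sketch assumes that every A is followed by either B or C and that the sequence then returns to A; the actual P-FSM is defined in Fig. 3c, and the function names are hypothetical. Only primitive labels are generated here, not joint trajectories.

```python
import random


def sample_primitive_sequence(p_b, n_primitives, rng):
    """Sample labels A, then B with probability p_b (else C), then A again, and so on."""
    seq = []
    while len(seq) < n_primitives:
        seq.append("A")
        if len(seq) < n_primitives:
            seq.append("B" if rng.random() < p_b else "C")
    return seq


def expand_to_time_steps(seq, steps_per_primitive=40):
    """Repeat each primitive label for the number of time steps it lasts."""
    return [label for label in seq for _ in range(steps_per_primitive)]
```

With ten primitives of 40 time steps each, one sampled sequence covers the 400 time steps of a training pattern; `p_b = 0.2` corresponds to A20%B80%C and `p_b = 0.8` to A80%B20%C.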

B. Robot Training
The PV-RNN was trained with 20 data samples on a set of different parameters (TABLE I). All network-specific parameters were fixed during training. To explore the influence of the meta-prior $w$, only this parameter was varied across networks, and training was repeated with different random seeds to ensure reproducibility. Networks were trained for 80,000 epochs using the Adam optimizer and back-propagation through time (BPTT) [28] with a learning rate of 0.001. After training, network performance was first analysed in standalone robot experiments (subsection III-C). Thereafter, dyadic robot interaction was studied using networks trained with $w$ set for the two representatives of tight and loose regulation of FEP complexity (subsection III-D).

C. Preparatory Analysis of Training Results
To investigate how the model learns the probabilistic structure of the training data, we conducted a first analysis in the form of prior regeneration. For prior regeneration, we choose one training sample and use two time steps of the adaptation variable $A^{\bar{X}}_{1:2}$ to initialize the prior distribution $p(z_{1:2})$ in the PV-RNN. Thereafter, the future prediction $X_{3:400}$ for the remaining training sample length can be calculated (cf. prior generation in Fig. 1). Using this scheme, we generated 20 sequences for each meta-prior $w$. This was repeated for each network trained with that parameter, for all random seeds. For brevity, training analysis is reported only for the network that was trained on the probabilistic sequence A20%B80%C. Training of A80%B20%C showed comparable results. An Echo State Network for multivariate time series classification [29] with reservoir size $N = 45$, 25% connectivity, and 60% leakage was used for classification of movement primitives. Movement patterns that fell below a 55% threshold were labeled as not classified.
1) Analysis of Probabilistic Transition: A robot that is trained with A20%B80%C will first generate an A movement, and then transition to B with 20% probability and to C with 80% probability. We found that networks trained with smaller $w$ are less stable in reproducing the probabilistic structure of the training data. Their BC-ratio deviated in either direction from the trained 20% for B and 80% for C. Networks trained with larger meta-priors become more reliable in regenerating the probabilistic training sequence (BC-ratio in Table II). In addition to the capacity of learning the probability distribution of the training data, we found that smaller meta-priors show noisier pattern generation. The share of non-classified movements was as high as 22 ± 4% with $w = 0.01$ and decreased to 6 ± 0.6% with $w = 3.4$.
2) Divergence Analysis: Repeatability in generating sequences in prior generation was examined by conducting a divergence analysis. Sequences are considered diverged when the per-time-step difference in $X^{pr}$ exceeds a threshold. Out of 20 regenerated sequences, we randomly select one as a reference and calculate the average divergence step of the other sequences relative to this reference. Out of 400 time steps of prior generation, sequences diverged from the reference around time step 43 for networks trained with smaller $w$. With increasing $w$, repeatability of the trajectories increased, with a divergence step of around 139 (cf. divergence step t in Table II).
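The divergence analysis can be sketched as below. The distance measure (maximum absolute difference over joint dimensions), the threshold value passed by the caller, and the zero-based time index are assumptions made for illustration; the criterion actually used is only summarized in the text above.

```python
def divergence_step(reference, other, threshold):
    """First time step at which any dimension differs by more than the threshold.

    Returns the sequence length if the two sequences never diverge.
    """
    for t, (ref_t, oth_t) in enumerate(zip(reference, other)):
        if max(abs(r - o) for r, o in zip(ref_t, oth_t)) > threshold:
            return t
    return len(reference)


def mean_divergence_step(sequences, reference_index, threshold):
    """Average divergence step of all sequences relative to one reference sequence."""
    reference = sequences[reference_index]
    steps = [
        divergence_step(reference, seq, threshold)
        for i, seq in enumerate(sequences)
        if i != reference_index
    ]
    return sum(steps) / len(steps)
```

A larger mean divergence step indicates that regenerated trajectories stay close to the reference for longer, i.e. that prior generation is more repeatable.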
3) Summary of Preparatory Analysis: Tight regulation of the complexity term (small $w$) results in noisier, less repeatable prior generation performance. Also, the learned probability for transition to either B or C is not accurate. This observation changes with increasing meta-prior. The larger $w$, the more accurate the learned transition probability becomes. Also, prior generation becomes more repeatable by developing more deterministic dynamics with the initial sensitivity characteristics (i.e., the sequence is generated solely depending on the latent state in the initial time step). For subsequent dyadic robot interaction experiments, we empirically select the meta-prior settings $w = 0.005$ and $w = 3.4$ as two representatives of tight and loose regulation of the FEP complexity, respectively.

D. Dyadic Robot Interaction Experiments
In the following experiments, robots are trained with either $w = 0.005$ or $w = 3.4$. For readability, we denote the two robots as $R^1_w$ and $R^2_w$, with subscripts giving the respective meta-prior $w$. In the dyadic interaction, we present the network of each robot with observations of movements of the counterpart robot $\bar{X}^{ex}$ as the target and perform posterior inference in a regression window with size $win_{size} = 70$. Inference is performed from the current time step $t$ back to $t - win_{size}$, or to time step 1 in case $t - win_{size} \leq 1$. After 200 epochs of iteration to maximize the lower bound, the time window is shifted one time step forward. Note that all experiments were conducted in simulation due to the difficulty of real-time posterior inference computation.
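The bounds of the regression window described above can be written down directly. The function names are hypothetical, and the 200 optimization epochs performed inside each window are not shown.

```python
def inference_window(t, win_size=70):
    """Return (start, end) of the inference window ending at the current time step t."""
    start = max(1, t - win_size)  # clamp to the first time step early in the sequence
    return start, t


def sliding_windows(first_t, last_t, win_size=70):
    """Windows obtained by shifting the current time step forward one step at a time."""
    return [inference_window(t, win_size) for t in range(first_t, last_t + 1)]
```

For example, at the current time step 199 the window spans time steps 129 to 199, while at time step 50 it is truncated to start at time step 1.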
We investigated how two robots interact in three different dyadic conditions (TABLE III). We then analysed whether the robots trained with A80%B20%C maintained the learned preference between B and C or adapted to their counterparts that were trained with A20%B80%C. We also calculated the so-called BC-synchronization rate during the interaction. If at any time step $t$, one of the robots generated B or C and the other robot generated the same movement primitive, the interaction was considered synchronized. Note that time steps in which movement patterns were identified as not classified by the Echo State Network (cf. subsection III-C) were excluded from the computation. TABLE III shows the summary of the analysis for all three experiments. To better understand effects of loose and tight regulation of FEP complexity, exemplar plots of robot movement patterns, as well as corresponding network dynamics, are shown (cf. Fig. 4 and Fig. 5).
1) Experiment 1: $R^1_{0.005}$ vs. $R^2_{3.4}$: $R^1_{0.005}$ adapts to the probabilistic transition of $R^2_{3.4}$ by increasing the probability of performing B from 22% in the standalone condition to 70% in the dyad (TABLE III, Experiment 1). Both robots perform more B than C, with a BC-synchronization rate of 56 ± 23%, which is significantly higher than the chance rate of 32% (footnote 4). Fig. 5 shows an example of how prediction of the future and posterior inference of the past proceed as time passes from time step 199 to 229 to 259 for both robots. We observe that the intended future behavior (the prior generation) of $R^1_{0.005}$ is not consistent with the actually performed actions after posterior inference. On the other hand, in the case of $R^2_{3.4}$, the performed action complies with its prediction. This behavior can be explained by comparing exemplar priors $\mu^{l,p}_i$ and posteriors $\mu^{l,q}_i$ for layer $l$ and neuron $i$ between the two robots. In layer 1, the selected posterior network dynamics $\mu^{1,q}_1$ and $\mu^{1,q}_2$ deviate from the prior dynamics $\mu^{1,p}_1$ and $\mu^{1,p}_2$ for robot $R^1_{0.005}$, whereas the dynamics of $R^2_{3.4}$ are mostly overlapping (cf. Fig. 4a and supplementary movie). More specifically, the average KL divergence $e_z$ of $R^1_{0.005}$ is larger for all layers ($(e_{z,1}, e_{z,2}, e_{z,3}) = (109.1, 1.4, 0.06)$) than for $R^2_{3.4}$ ($(e_{z,1}, e_{z,2}, e_{z,3}) = (0.4, 0.0003, 0.00001)$). This means that $R^2_{3.4}$ tends to behave as intended because the posterior is attracted by the prior. On the other hand, $R^1_{0.005}$ tends to adapt to $R^2_{3.4}$ since the posterior is attracted by the observation rather than by the weaker prior belief.
Note that $\mu^{3,q}_1$ and $\mu^{3,p}_1$ in layer 3 change only slowly with time. This indicates that these latent variables represent how movement primitives transition from deterministic states to nondeterministic states using their slower timescale properties characterized by $\tau_3$.
2) Experiment 2: $R^1_{3.4}$ vs. $R^2_{3.4}$: When two robots with loose complexity regulation interact, both robots maintain their learned preferences in terms of probability in generating either B or C. $R^1_{3.4}$, which learns a 76% transition to C in a standalone situation, shows its preference for C in the dyad with a probability of 83%. $R^2_{3.4}$, which in a standalone condition would maintain its preference for B with a probability of 75%, shows a 61% transition to B in the interaction. The BC-synchronization rate turns out to be as low as 31 ± 24%, which is almost equal to the chance rate. Examining the network dynamics of the prior and posterior distributions shows that the robots executed movements based upon their prior action intention without adapting their posteriors to observations of the other robot's movement (cf. Fig. 4b and supplementary movie).
Footnote 4: We assume that B and C are independent probabilistic events. Then we can consider the probabilities for a robot $R$ to perform either a B movement as $P_R(B)$ or a C movement as $P_R(C)$. The actual BC-synchronization chance level can then be calculated from these probabilities.
3) Experiment 3: $R^1_{0.005}$ vs. $R^2_{0.005}$: When two robots with tight regulation of complexity interact, both try to adapt their own action to the one demonstrated by the other. Indeed, Fig. 4c shows that the prior and posterior do not comply, but deviate. Whether trained with the probabilistic transition of A20%B80%C or A80%B20%C, both robots significantly reduce the tendency to perform their own intended behavior C or B, respectively. This is evidenced by changes of the BC-ratio from the standalone to the dyadic setting (TABLE III, Experiment 3). The BC-synchronization rate is 42 ± 20%, which is higher than the chance rate, but not significantly. The interaction becomes noisier compared to the results of Experiments 1 and 2 (cf. Fig. 4c and supplementary movie), which indicates that tight regulation makes robots more sensitive to temporal fluctuations in observations of their counterparts.
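The BC-synchronization rate and a chance level under the independence assumption can be sketched as follows. Two points are assumptions of this sketch rather than statements of the procedure used above: the rate is taken over time steps at which at least one robot generates B or C, and the chance level is computed as $P_{R^1}(B) P_{R^2}(B) + P_{R^1}(C) P_{R^2}(C)$. Non-classified time steps are represented by `None`.

```python
def bc_synchronization_rate(labels_1, labels_2):
    """Fraction of B/C time steps at which both robots generate the same primitive."""
    considered = 0
    synchronized = 0
    for a, b in zip(labels_1, labels_2):
        if a is None or b is None:  # non-classified time steps are excluded
            continue
        if a in ("B", "C") or b in ("B", "C"):
            considered += 1
            if a == b:
                synchronized += 1
    return synchronized / considered if considered else 0.0


def chance_level(p1_b, p1_c, p2_b, p2_c):
    """Probability that two independent robots pick the same primitive among B and C."""
    return p1_b * p2_b + p1_c * p2_c
```

Comparing the measured rate against such a chance level separates genuine mutual adaptation from coincidental agreement between two robots that simply follow their own learned preferences.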

IV. DISCUSSION
The current study examined how social interaction in robotic agents dynamically changes depending on how the complexity in the free energy is regulated. For this purpose, we conducted simulation experiments on dyadic imitative interactions using humanoid robots equipped with PV-RNN architectures. PV-RNN is a hierarchically organized variational RNN model that employs a framework of predictive coding and active inference based on the free energy principle. In a preparatory analysis we showed that PV-RNNs trained with looser regulation of complexity develop stronger action intentions by self-organizing more deterministic dynamics with strong initial sensitivity. Networks trained with tighter regulation develop weaker intentions by self-organizing more stochastic dynamics.
Our experiments revealed different types of interactions between robots. In the experiment where a robot having looser regulation interacts with a robot with tighter regulation, the former tends to lead the interaction by exerting action intention with stronger belief, while the latter tends to follow the other. The following robot adapts its posterior to its observations of the leading robot. In this setting, the synchronization of movement B and C (BC-synchronization rate) between the two robots was significantly higher than the chance rate. When two robots with looser regulation, i.e. intentions with stronger belief, interact, each robot tends to generate its own intended movements. Finally, in case both robots have tighter regulation, a fluctuating dyadic interaction develops where each robot attempts to adapt to the counterpart with an intention with weaker belief. In the last two cases, the BC-synchronization rate was not significantly higher than the chance rate. It can be summarized that the dyadic imitative interaction, including situations where the other's movements are unpredictable, tends to be synchronized successfully when a dedicated leader and follower are determined; a leader develops action intentions with strong belief whereas a follower develops action intentions with weak belief.
Fig. 5. Interaction of $R^1_{0.005}$ (upper) and $R^2_{3.4}$ (lower) in terms of $X^{pr}$. The first, second, and third rows show $X^{pr}$ after the posterior inference in the inference window with size $win_{size} = 70$, as well as its future prior generation, with current time steps of 199, 229, and 259, respectively.
Readers may ask why tighter or looser regulation of the complexity term results in development of weaker or stronger belief of action intention for each robot. Let us consider a situation in which the PV-RNN learns to predict probabilistic sequences $\bar{X}_{1:T}$ with meta-prior $w$ set either to a large value (loose regulation) or a smaller one (tight regulation). The learning process infers the posterior mean $\mu^q_t$ and standard deviation $\sigma^q_t$ at each time step $t$. In order to minimize the error $e$ in the accuracy term, $\mu^q_t$ is fitted with an arbitrary value, where $\sigma^q_t$ will be minimized, in both cases. Notably, when the data $\bar{X}_t$ is observed as random, the corresponding posterior $\mu^q_t$ also becomes random. Let us consider the two cases when the meta-prior $w$ is set either large or small. In case $w$ is set large, the KL divergence between the posterior and the prior is strongly minimized. Thus, $\mu^p_t$ and $\sigma^p_t$ of the prior latent state become close to $\mu^q_t$ and $\sigma^q_t$ of the posterior. By this, $\sigma^p_t$ in the prior is forced to take a minimal value close to 0; therefore, the prior generation becomes deterministic. Since $\mu^p_{1:T}$ should be reconstructed as close to the sequence $\mu^q_{1:T}$ inferred with randomness, the prior generative model is forced to develop strongly nonlinear deterministic dynamics with the initial sensitivity through learning. On the other hand, if $w$ is set to a small value, the KL divergence is only weakly minimized. In this case, the prior $\mu^p_t$ and $\sigma^p_t$ can diverge from the posterior ones; therefore, the learning becomes "relaxed". As a result, the prior generative model develops stochastic dynamics with only weak nonlinearity, wherein $\mu^p_t$ takes an average of $\mu^q_t$ over all occurrences and $\sigma^p_t$ takes their distribution at each time step. Consequently, with larger $w$, the generative model develops action intention with stronger belief (i.e. smaller $\sigma^p$), whereas in the case of tighter regulation using a smaller $w$, the generative model develops action intention with weaker belief (i.e. larger $\sigma^p$).
The current experiments consider a fixed meta-prior setting only. Since the meta-prior is the essential network parameter to guide the strength of action intention in the proposed framework, future studies should target meta-learning of the meta-prior in developmental processes or through autonomous adaptation within dyadic contexts. This could provide further understanding of more complex social interaction phenomena, including turn-taking in the context of adaptive regulation of the complexity term in free energy.