Imitation Learning With Time-Varying Synergy for Compact Representation of Spatiotemporal Structures

Imitation learning is a promising approach for robots to learn complex motor skills. Recent techniques allow robots to learn long-term movements comprising multiple sub-behaviors. However, learning the temporal structures of movements from a demonstration is challenging, particularly when sub-behaviors overlap and are not labeled in advance. This study applied time-varying synergies, which are representations of spatial and temporal structures in human behavior in neuroscience, to imitation learning. The proposed method extracts time-varying synergies from human demonstrations, with neural networks that learn their activation patterns. Because time-varying synergies can decompose demonstrations into linear combinations of primitives while allowing overlapping, neural networks can learn demonstrations efficiently. This would make the model compact and improve its generalization ability. The proposed method was evaluated with the task of cursive letter writing requiring overlapping sub-behaviors. Consequently, the proposed method allows a neural network to generate new movements with a higher success rate and fewer parameters than those without the proposed method. Moreover, the neural network worked robustly against control deviations and disturbances in an actual robot.


I. INTRODUCTION
Imitation learning allows robots to learn motor skills from human demonstrations without requiring expert knowledge of robot control. The structure of demonstrated movements should be captured to learn complex motor skills through imitation. Daily human movements involve multiple short-term movements [1]. For example, a simple activity such as pick-and-place can be decomposed into sub-behaviors: reaching the hand to the object's position, grasping, moving it to the destination, and releasing the hand. Moreover, sub-behaviors often overlap and combine over time. Capturing such structures would be beneficial for learning and generalizing using a small number of demonstrations. Additionally, because collecting large numbers of demonstrations is labor-intensive, it is necessary to learn them from small numbers of demonstrations and generalize to new behaviors.

The associate editor coordinating the review of this manuscript and approving it for publication was Thomas Canhao Xu.
To learn efficiently from human demonstrations, an approach based on the principle of human motor control is effective. Neuroscience has revealed that human movements can be accounted for by a few primitive components, referred to as synergies [2], [3], [4]. Synergies efficiently represent the spatial and temporal coordination patterns of movements using a linear combination. The central nervous system can simplify the control of complex and redundant bodies by linearly combining several synergies [5], [6]. Therefore, synergies are expected to adequately represent the structures of human movement, even in a compact model. Synergies have also been used in robotics to address joint redundancy, particularly in robotic hands [7], [8], [9], [10].
However, synergies are yet to be used to capture the temporal structures of demonstrations in robotic learning. Time-varying synergies [11], [12] describe movement primitives that are activated at different onset times and may overlap; however, they have not been applied to imitation learning.
This study proposes an imitation-learning method based on time-varying synergies that encode spatial and temporal coordination in human movements. The proposed method extracts time-varying synergies from human demonstrations for neural networks (NNs) to learn their activation patterns, as illustrated in Fig. 1. Because synergy extraction decomposes human demonstrations into short-term primitives, NNs can focus on learning the spatiotemporal structures of movements. The contributions of this study are as follows.

Recently, various studies have focused on learning movements comprising multiple short-term movements [1]. A common method involves using hierarchical architectures [13], [14]. These typically consist of low-level policies that learn primitive behaviors and high-level policies that plan sequences of low-level policies. Konidaris  proposed learning from play, which learns long-term movements [18]. In most hierarchical approaches, sub-movements are completely separated (no overlap) in time. Additionally, approaches to learning high- and low-level policies simultaneously tend to be unstable, as observed in hierarchical reinforcement learning. In contrast, our method can automatically extract primitive movements in advance and combine them with nearly arbitrary timing (allowing overlap).

Embedding short-term sequences into single latent variables [19], [20] can be useful when providing low-level behaviors explicitly or in advance. Long-term movements can be generated by combining single latent variables with model predictive control [21]. Furthermore, switching can be made between multiple skills based on observations [22] or external signals [23]. Our method also acquires low-level behaviors in advance, but does not require them to be labeled or separated manually; instead, they are extracted directly from demonstrations. Methods that generate movements by dividing them by a fixed time [24], [25] do not capture the spatiotemporal structures of the task.

B. SYNERGIES
Synergies extracted from muscle activities or kinematic observations [26], [27] are task-specific movement primitives that account for human movements. They provide biological and neuroscientific support, suggesting the benefits of representing human demonstrations in imitation learning. Moreover, they exhibit linearity, which is advantageous for engineering applications. Additionally, synergies emerge during motor learning tasks of gait [28] and reaching [29], [30], [31].
A feature of synergies is that they can be generalized to other movements. The same synergies are often observed in different movements [32], [33], suggesting that synergies can be reused in new situations. Furthermore, studies reported that even a few synergies can represent diverse movements [30], [33], [34]. Therefore, synergies are expected to improve the generalization ability of NNs for generating new movements.

VOLUME 11, 2023 34151 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

C. SYNERGIES IN IMITATION LEARNING
Synergies have also been applied to imitation learning. A common application is the control of robotic hands, with many degrees of freedom (DoF) [7], [8], [9], [10]. Chai et al. used synergies in locomotion tasks to reduce the movement dimensionality [35]. Reducing the dimensionality of the control inputs with synergies is effective for learning.
Although approaches using spatial coordination patterns are common, applications of temporal coordination patterns are rare. In particular, their application to online movement generation is yet to be attempted. For example, Rückert and d'Avella used time-varying synergies to represent spatial and temporal patterns, but the activation patterns were determined before the task [36]. Chen and Qiao developed a synergy-based musculoskeletal control system, where the temporal parameters were determined at the beginning of the task [37]. In contrast, this study determines how synergies are used online, during the task.
Synergies are similar to dynamic movement primitives (DMPs) [38], [39], which encode movements with small numbers of primitives. DMPs are often used in imitation learning because of their linearity and ability to reduce dimensionality [40], [41]. Synergies possess these features along with other unique features such as task-specificity and representing shifts in time. These features allow encoding temporal structures of human demonstrations with a smaller number of primitives compared to DMPs, making them advantageous for this study.

A. OVERVIEW
The proposed method extracts time-varying synergies and performs supervised learning of synergy activities. An overview of this process is shown in Fig. 1.
We considered the displacement of the positions at each time step to extract synergies:

$$x[t] = p[t+1] - p[t], \tag{1}$$

where $x[t]$ and $p[t]$ indicate the displacement and position at time $t$, respectively.
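As a minimal sketch of this preprocessing step, the snippet below converts a position trajectory into per-step displacements and integrates them back. The forward-difference convention, the function names `to_displacements` and `to_positions`, and the toy trajectory are assumptions made for illustration only.

```python
import numpy as np

def to_displacements(p):
    """Per-step displacements of a (T, D) position array (forward differences, assumed)."""
    p = np.asarray(p, dtype=float)
    return p[1:] - p[:-1]

def to_positions(x, p0):
    """Rebuild positions by cumulatively summing displacements from the start point p0."""
    x = np.asarray(x, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    return np.vstack([p0, p0 + np.cumsum(x, axis=0)])

# Toy 2-D pen trajectory (hypothetical values).
p = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 0.5], [3.5, -1.0]])
x = to_displacements(p)
p_rebuilt = to_positions(x, p[0])
```

Working on displacements makes the representation independent of where on the paper the movement starts, since only the start point `p0` carries absolute position.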

B. TIME-VARYING SYNERGIES
Time-varying synergies represent movements by superimposing primitives while varying the onset times and amplitudes, as follows:

$$x[t] = \sum_{i} c_i \, w_{J(i)}[t - t_i], \tag{2}$$

where $J(i) \in \{1, \ldots, N\}$ and $N$ is the number of synergies. Using $J(i)$ instead of $i$ allows multiple uses of the same synergy in a movement. $c_i$ and $t_i$ indicate the activation amplitude and onset time of $w_{J(i)}$, respectively. An overview of the time-varying synergies is shown in Fig. 2. Time-varying synergies (hereinafter, synergies) are common to a set of movements; however, the activation amplitudes and onset times are different for each movement. Therefore, diverse movements can be represented by the same repertory of synergies by varying the activation amplitudes and onset times.
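The superposition described above can be sketched as follows. This is an illustrative implementation under stated assumptions: synergies are stored as an array of shape (N, K, D), each activation is a hypothetical tuple `(index, amplitude, onset)`, indices are zero-based, and a synergy that runs past the end of the movement is truncated.

```python
import numpy as np

def superimpose(synergies, events, T):
    """Sum amplitude-scaled, time-shifted synergies into a (T, D) movement.

    synergies: (N, K, D) array; events: iterable of (j, c, t_onset).
    """
    N, K, D = synergies.shape
    x = np.zeros((T, D))
    for j, c, t0 in events:
        if not 0 <= t0 < T:
            raise ValueError("onset time outside the movement")
        end = min(t0 + K, T)  # truncate a synergy that overruns the movement
        x[t0:end] += c * synergies[j, :end - t0]
    return x

# Two 1-D synergies of length K = 3 (toy values).
W = np.array([[[1.0], [2.0], [1.0]],
              [[0.0], [1.0], [0.0]]])
# Synergy 0 is used twice, and the activations overlap in time.
events = [(0, 1.0, 0), (1, 2.0, 2), (0, 0.5, 4)]
x = superimpose(W, events, T=6)
```

Note that the same synergy (index 0) appears twice with different amplitudes and onsets, which is what the mapping J(i) permits.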

C. SYNERGY EXTRACTION
The synergies were extracted using the algorithm in [32]. This algorithm can only be applied to non-negative signals; therefore, we first convert the original movements $x[t]$ into non-negative variables $\tilde{x}[t] = \varphi(x[t])$, where $\varphi$ is a conversion function. The inverse conversion (6) recovers the original movements: $x[t] = \varphi^{-1}(\tilde{x}[t])$. We then initialize $N$ synergies with a length of $K$ time steps to random positive values, where $N$ and $K$ are determined by the users. The synergies were optimized using the following algorithm [32]:
1) Determine a set of onset times $t_i$ and synergy indices $J(i)$ using the matching pursuit procedure [42]. This procedure repeatedly selects the synergy having the highest cross-correlation with the signals $\tilde{x}$ until the cross-correlation is less than the threshold value.
2) Calculate the activation amplitude $c_i$ based on the cross-correlation.
3) Update the synergies by gradient descent to reduce the reconstruction error; a maximum operation with zero ensures that the elements of the synergies remain zero or more.
4) Return to 1) and repeat the steps until convergence is achieved.
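The steps above can be sketched as one iteration of the extraction loop. This is not the implementation of [32]; it is a simplified illustration. The amplitude formula (cross-correlation divided by the synergy's squared norm), the squared-error objective, the step size `lr`, and all function names are assumptions.

```python
import numpy as np

def matching_pursuit(x, W, threshold, max_events=50):
    """Greedily pick (synergy, amplitude, onset) triples for a non-negative signal.

    x: (T, C) signal; W: (N, K, C) synergies. Returns the events and the residual.
    """
    N, K, C = W.shape
    T = x.shape[0]
    residual = x.astype(float).copy()
    norms = np.sum(W ** 2, axis=(1, 2))
    events = []
    for _ in range(max_events):
        best = None
        for n in range(N):
            if norms[n] == 0.0:
                continue
            for t0 in range(T - K + 1):
                corr = float(np.sum(residual[t0:t0 + K] * W[n]))
                if best is None or corr > best[0]:
                    best = (corr, n, t0)
        if best is None or best[0] < threshold:
            break  # step 1 stops once the best cross-correlation is too small
        corr, n, t0 = best
        c = corr / norms[n]  # step 2 (assumed): least-squares amplitude
        residual[t0:t0 + K] -= c * W[n]
        events.append((n, c, t0))
    return events, residual

def update_synergies(W, events, residual, lr):
    """Step 3: one gradient step on the squared error, then clip at zero."""
    grad = np.zeros_like(W)
    K = W.shape[1]
    for n, c, t0 in events:
        grad[n] += -2.0 * c * residual[t0:t0 + K]
    return np.maximum(W - lr * grad, 0.0)

# Toy check: a signal built from one synergy is decomposed exactly.
W = np.array([[[1.0], [2.0], [1.0]]])
x = np.array([[0.0], [2.0], [4.0], [2.0], [0.0], [0.0]])
events, residual = matching_pursuit(x, W, threshold=1e-6)
W_new = update_synergies(W, events, residual, lr=0.01)
```

In a full extraction, the two functions would be called alternately on all demonstrations until the reconstruction quality stops improving.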

D. LEARNING ACTIVITIES OF SYNERGIES
After decomposing human demonstrations into synergies and their activation patterns, an NN was trained to learn the activation patterns. Here, we consider an NN that receives the current observation and generates current synergy activities, as illustrated in Fig. 1. Training was performed using supervised learning.
To handle synergies with NNs online, we define the variables $\gamma[t]$ that represent synergy activity, instead of $c_i$ and $t_i$. The $n$-th component $\gamma_n[t]$ represents the activity of the $n$-th synergy $w_n$ as follows:

$$\gamma_n[t] = \begin{cases} c_i & \text{if } J(i) = n \text{ and } t_i = t \text{ for some } i, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

The NNs receive the current observation $o[t]$ and predict the synergy activity $\gamma[t]$. The synergy activities are then decoded as follows:

$$\hat{\tilde{x}}[t] = \sum_{n=1}^{N} (w_n * \gamma_n)[t] = \sum_{n=1}^{N} \sum_{\tau} w_n[\tau] \, \gamma_n[t - \tau], \tag{10}$$

where $*$ denotes the convolution operation. Finally, the non-negative $\hat{\tilde{x}}$ is converted to the real-valued $\hat{x}$ using (6). Furthermore, the positions were reconstructed by integrating the displacements (11). By substituting $t - \tau = t_i$ into (10) and (9), (10) corresponds to (2).

NNs were trained using the loss function (12), which consists of two terms; $\alpha$ is the hyper-parameter that weights the second term. The first term allows the NNs to reproduce the demonstrated movements. The second term is a regularization term that encourages the synergy activities to be sparse (the smallest amplitude possible). This regularization term is expected to drive the NN to reproduce the ground-truth synergy activities, because in the demonstrations each synergy is mostly inactive and is activated only at isolated single time steps.
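The decoding path and the two-term loss can be sketched as below. Several parts are assumptions made for illustration because their exact form is not given here: the inverse conversion is taken to be "positive-part channels minus negative-part channels", the reconstruction term is a mean squared error on displacements, and the sparsity term is a mean absolute value of the activities.

```python
import numpy as np

def decode_activities(gamma, W):
    """Convolve each activity channel with its synergy and sum over synergies.

    gamma: (T, N) synergy activities; W: (N, K, C) synergies.
    Returns the non-negative signal of shape (T, C).
    """
    T, N = gamma.shape
    C = W.shape[2]
    x_tilde = np.zeros((T, C))
    for n in range(N):
        for ch in range(C):
            # Full convolution has length T + K - 1; keep the first T samples.
            x_tilde[:, ch] += np.convolve(gamma[:, n], W[n, :, ch])[:T]
    return x_tilde

def inverse_conversion(x_tilde):
    """ASSUMED inverse conversion: positive-part channels minus negative-part channels."""
    D = x_tilde.shape[1] // 2
    return x_tilde[:, :D] - x_tilde[:, D:]

def loss(x_demo, x_hat, gamma, alpha):
    """ASSUMED loss: squared reconstruction error plus an L1 sparsity penalty."""
    return float(np.mean((x_demo - x_hat) ** 2) + alpha * np.mean(np.abs(gamma)))

# One synergy of length K = 2 with one positive and one negative channel.
W = np.array([[[1.0, 0.0], [0.0, 1.0]]])
gamma = np.array([[0.0], [2.0], [0.0], [0.0], [0.0]])  # a single impulse at t = 1
x_hat = inverse_conversion(decode_activities(gamma, W))
value = loss(x_hat, x_hat, gamma, alpha=0.1)
```

A single impulse in `gamma` reproduces one scaled copy of the synergy starting at the impulse time, which is the sense in which the impulse representation matches the amplitude and onset-time representation.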

IV. EXPERIMENTS
To evaluate the proposed method, we applied it to writing cursive-script letters. This specific task has several features that are common to other tasks.
1) The task consisted of sub-behaviors, that is, movements of writing individual letters or basic curves.
2) The order of the sub-behaviors (i.e., the order of letters) varies.
3) The boundaries of the sub-behaviors are uncertain, and the sub-behaviors often overlap.
Therefore, evaluating this task would demonstrate the potential for various imitation-learning applications.

A. TASK SETUP
The robot grasps the pen and moves it to write letters j, q, and k. We chose these letters because their shapes were significantly different, making it easy to determine successful completion of the task. The robot writes the three letters in one stroke based on a task signal, as explained in Section IV-C.

B. DATA COLLECTION AND SYNERGY EXTRACTION
A motion capture system collected the demonstrations, as shown in Fig. 3. A marker attached to the pen recorded the human movements, another marker on the paper identified the paper's position and pose, and a third marker specified the pattern currently being measured. An Intel RealSense D455 captured the movements at 30 Hz.
We collected all combinations of three letters (27 patterns in total) with three trials for each, that is, 81 demonstrations in total. Fig. 4 shows the collected demonstrations. The time series length is 9.57 s on average, with a standard deviation of 1.69 s. Sixteen out of 27 patterns used for synergy extraction and NN training are given in Table 1.
We extracted synergies from the displacements of the positions p, rather than from the positions themselves, to capture writing movements regardless of where they were written. The trajectories of p were low-pass filtered at 1.1 Hz and down-sampled to 20 Hz. Subsequently, synergies were extracted using the method described in Section III-C.

C. NEURAL-NETWORK SETUP
The single-layer NN architecture is illustrated in Fig. 5, where φ⁻¹ is defined in (6). During the training, we used the loss function described in (12). Task signal z represents the pattern of letters to be written. It is a three-dimensional variable; the n-th component is −1, 0, or +1 if the n-th letter is j, q, or k, respectively. For example, pattern j-q-k is expressed by [−1, 0, +1], and pattern q-k-q is expressed by [0, +1, 0]. It is expected that NNs can learn the correspondence between task signals and demonstrated trajectories.
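The task-signal encoding is simple enough to state as code. The sketch below follows the mapping given in the text; the names `LETTER_CODE`, `task_signal`, and `ALL_PATTERNS` are hypothetical.

```python
from itertools import product

# Mapping stated in the text: j -> -1, q -> 0, k -> +1.
LETTER_CODE = {"j": -1, "q": 0, "k": 1}

def task_signal(pattern):
    """Encode a pattern such as 'j-q-k' as a three-dimensional task signal z."""
    letters = pattern.split("-")
    if len(letters) != 3:
        raise ValueError("a pattern must contain exactly three letters")
    return [LETTER_CODE[letter] for letter in letters]

# All ordered combinations of three letters, as collected in the demonstrations.
ALL_PATTERNS = ["-".join(combo) for combo in product("jqk", repeat=3)]
```

Enumerating the patterns this way gives the 27 combinations mentioned in Section IV-B.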
We additionally trained two baseline models, baseline-A and baseline-B, for comparison with the proposed model. The baseline-A model consisted of a single layer with 256 LSTM units, followed by a linear output layer. This architecture is nearly identical to that of the proposed model except for the synergy decoder. The baseline-B model consisted of three layers with 256 LSTM units each, followed by a linear output layer. This model has more parameters than the proposed and baseline-A models. We used this model because the baseline-A model resulted in unsatisfactory performance when used in the robot experiments, as described in Section IV-E. Both baseline models directly generated two-dimensional displacement commands of the hand, p[t], instead of synergy activities. The generated commands are then converted to position commands as described in (15), and passed to the robot controller. The hyperparameters used for training are listed in Table 2. The code is publicly available.

The main objective of the experiments was to compare the performance with and without synergies. Accordingly, we employed LSTM as a general method for time-series generation, which is widely used in various fields, including robotics.

D. ROBOT SETUP
We used a five-DoF robot with a gripper, as shown in Fig. 6. This robotic arm has series elastic actuators provided by HEBI Robotics, Inc. The robot was connected via Ethernet to a desktop computer with an Intel Core i7-10700 CPU, which ran the control system and neural networks.
The robot is velocity controlled using the control system illustrated in Fig. 7, and the joint angle θ is observed. The controller calculates the angular velocity command to the actuators, ω_ref, using the Jacobian matrix of the robot, J, the gain matrix K_p = diag[40, 50, 40, 20, 20], and the positional command p_cmd. The x-y components of the position were computed based on p̂, and the z-axis component was set to a constant value to maintain contact with the paper. The pitch and yaw components of the posture were set to ensure that the pen was perpendicular to the paper. The NNs were run at 20 Hz and the controller was run at 1 kHz; first-order interpolation was applied to p̂ before passing it to the controller to account for the difference in computation intervals and to smooth the position command values.
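To make the data flow concrete, the following sketch shows one common way to build such a controller. The control law used here (a resolved-rate law that solves J ω = K_p (p_cmd − p)) is an assumption and may differ from the controller in Fig. 7; only the gain values and the 20 Hz and 1 kHz rates come from the text. The function names are hypothetical.

```python
import numpy as np

# Gain values from the text; the control law below is an assumed resolved-rate form.
KP = np.diag([40.0, 50.0, 40.0, 20.0, 20.0])

def velocity_command(J, pose_cmd, pose):
    """Joint velocity command solving J @ omega = KP @ (pose_cmd - pose)."""
    err = np.asarray(pose_cmd, dtype=float) - np.asarray(pose, dtype=float)
    return np.linalg.solve(J, KP @ err)

def upsample_commands(cmds, f_in=20.0, f_out=1000.0):
    """First-order (linear) interpolation of NN outputs up to the controller rate."""
    cmds = np.asarray(cmds, dtype=float)
    t_in = np.arange(len(cmds)) / f_in
    n_out = int(round(t_in[-1] * f_out)) + 1
    t_out = np.arange(n_out) / f_out
    cols = [np.interp(t_out, t_in, cmds[:, d]) for d in range(cmds.shape[1])]
    return np.column_stack(cols)

# Toy usage: identity Jacobian and three x-y commands produced at 20 Hz.
omega = velocity_command(np.eye(5), [0.1, 0.0, 0.0, 0.0, 0.0], np.zeros(5))
fine = upsample_commands([[0.0, 0.0], [1.0, 2.0], [1.0, 0.0]])
```

With a 20 Hz network and a 1 kHz controller, each network output is spread over 50 controller steps, so the interpolation removes the step changes that the controller would otherwise see.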

1) EXTRACTED SYNERGIES AND THEIR ACTIVITIES
First, we show how synergies encode demonstrations. We also demonstrate that synergies can represent new movement patterns. Figure 8 shows the extracted synergies and examples of their activities in the demonstrations. The convergence criterion was based on the variance accounted for (VAF), denoted by $R^2$:

$$R^2 = 1 - \frac{\sum_m \sum_t \lVert \tilde{x}_m[t] - \hat{\tilde{x}}_m[t] \rVert^2}{\sum_m \sum_t \lVert \tilde{x}_m[t] - \tilde{\mu}_m \rVert^2}, \tag{17}$$

where $m$ is the index of the demonstrations, $\hat{\tilde{x}}_m$ is the reconstruction of $\tilde{x}_m$ with the synergies, and $\tilde{\mu}_m$ is the mean vector of $\tilde{x}_m$ over the demonstrations. The value ranges from 0 to 1; the higher this value, the more accurate the reconstruction with the synergies. The synergy extraction is completed when the change in $R^2$ is less than $10^{-5}$.

Table 3 lists the reconstruction performance using the VAF defined in (17) and validates how well the synergies could account for the demonstrations. For N = 4, K = 25 and N = 6, K = 10, the VAFs were almost identical in the learned and unlearned patterns, with more than 85% on average. For N = 3, K = 80, the VAFs were below 80%. Although we cannot immediately state whether these values are sufficient, in most cases, synergies could encode sub-behaviors accurately and even represent new movements.
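A VAF of this kind can be computed as in the sketch below. It assumes the usual definition (one minus the ratio of the squared reconstruction error to the squared deviation from each demonstration's mean over time); the function names and the toy data are hypothetical.

```python
import numpy as np

def vaf(demos, recons):
    """Variance accounted for over a set of demonstrations.

    demos, recons: lists of (T_m, C) arrays holding signals and their reconstructions.
    """
    num = 0.0
    den = 0.0
    for x, x_hat in zip(demos, recons):
        mu = x.mean(axis=0)  # mean vector of this demonstration (assumed: over time)
        num += np.sum((x - x_hat) ** 2)
        den += np.sum((x - mu) ** 2)
    return 1.0 - num / den

def has_converged(r2_now, r2_prev, tol=1e-5):
    """Stopping rule from the text: the change in R^2 falls below 1e-5."""
    return abs(r2_now - r2_prev) < tol

demo = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 4.0]])
perfect = vaf([demo], [demo.copy()])
mean_only = vaf([demo], [np.tile(demo.mean(axis=0), (3, 1))])
```

A perfect reconstruction gives a VAF of 1, and a reconstruction that only reproduces the mean gives 0, which matches the stated range.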

2) AUTOREGRESSIVE TRAJECTORY GENERATION
Here, we test the three NN models described in Section IV-C. To evaluate the performance of the models without robots, we let the models generate trajectories in an autoregressive manner. Instead of receiving responses p[t] from robots, the NNs received their previous position outputs, p̂[t]. This setup corresponded to a case in which robots were ideally controlled without any control delays or disturbances, which was impossible in actual experiments. The initial position was selected based on demonstrations. Figures 9 and 10 show the letters generated by the NNs. The success rates computed using demonstrations are summarized in Table 4. A trajectory is labeled a success if more than half of the two most similar demonstrations belong to the commanded pattern. The similarity was measured using the mean squared error; when the two trajectories had different lengths, we truncated the longer one to match the shorter one. In the baseline-A model, the trajectories were significantly unstable in the learned and unlearned patterns. In contrast, trajectories were generated stably in the baseline-B and proposed models, despite the proposed model having a significantly smaller NN size. Moreover, the success rates of the proposed models were higher than that of the baseline-B model, except for the case of N = 3, K = 80, suggesting that synergies influence learning.
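The success criterion can be sketched as a nearest-demonstration vote. This is one reading of the rule described above, not the authors' evaluation code; the function names, the label strings, and the toy data are assumptions.

```python
import numpy as np

def truncated_mse(a, b):
    """Mean squared error after truncating both trajectories to the shorter length."""
    n = min(len(a), len(b))
    return float(np.mean((a[:n] - b[:n]) ** 2))

def is_success(traj, demos, labels, target, k=2):
    """Success if more than half of the k most similar demonstrations match the target."""
    errors = [truncated_mse(traj, d) for d in demos]
    nearest = np.argsort(errors)[:k]
    hits = sum(labels[i] == target for i in nearest)
    return hits > k / 2

# Toy data: two demonstrations of one pattern and one of another.
demos = [np.zeros((5, 2)), np.full((4, 2), 0.1), np.ones((5, 2))]
labels = ["j-q-k", "j-q-k", "k-k-k"]
traj = np.full((6, 2), 0.05)
ok = is_success(traj, demos, labels, "j-q-k")
wrong = is_success(traj, demos, labels, "k-k-k")
```

With k = 2, "more than half" means that both of the nearest demonstrations must carry the commanded pattern.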

3) APPLICATION TO ROBOTS
Finally, we evaluate the performance of the proposed method using an actual robot. Additionally, we assessed the generalization ability by commanding the model to generate unlearned patterns (orange patterns in Fig. 4). We deployed the proposed and baseline-B models to the robot and not the baseline-A model, which could not generate stable movements, even in the autoregressive scenario, as shown in Fig. 9A. Figure 11 shows the hand-tip trajectories of the robot. Furthermore, Fig. 12 shows examples of actual letters written on paper. Figure 13 shows snapshots of the trial using the proposed method.
The baseline-B model generated oscillating behaviors and could not reproduce the letters correctly, despite generating the correct trajectories in the autoregressive situation. In addition, most trials were halted during the task because of strong oscillations that could damage the surrounding environment. This was mainly due to the control deviations of the robot, which did not appear in training. Moreover, friction between the pen and paper is difficult to model and compensate for perfectly. In contrast, the proposed models, except for N = 3, K = 80, could write letters in the correct order in most trials, although not in perfect shapes. In addition, these models could write letters in new orders. Even when the contact states occasionally fluctuated owing to the stick-slip phenomenon, as observed in

The success rates are summarized in Table 4. The baseline-B model could not achieve the task in most trials owing to the oscillation. However, the proposed models could write letters, although the success rate was slightly reduced for unlearned patterns. In addition, the shapes of the written letters barely oscillated. The proposed models with N = 4, K = 25 and N = 6, K = 10 achieved success rates almost consistent with those in the autoregressive scenario, whereas those with N = 3, K = 80 performed poorly.
The proposed model failed in some unlearned patterns; for example, for N = 6, K = 10, the generated trajectories resembled q-q-k for the task signal k-j-k. Because the model could generate nearly correct trajectories in the autoregressive scenario (cf. Fig. 10), the failure was probably caused by the control deviations during the experiment. However, even in such failure cases, the proposed model did not behave destructively (e.g., with vibratory behaviors); instead, the mistakes were at an abstract level (i.e., writing a different letter). Figure 14 shows that the proposed model activated synergies in patterns similar to those of the demonstrations, although, unlike in the demonstrations, the synergies were occasionally activated for several consecutive time steps.

A. REPRESENTATIONS OF EXTRACTED SYNERGIES
As shown in Fig. 8, the extracted synergies represent the spatial features of the task at different scales depending on N and K . In summary, the longer the length of the synergies, the more global the features they represent. In addition, when the number of synergies increases, some may not be used.
In N = 4, K = 25 (Fig. 8A), the synergies represented complex curves, as is often observed with cursive letters. Additionally, most synergies move the pen to the right (+x direction), whereas none moves the pen to the left (−x direction). This observation makes logical sense because the movement is to the right when writing Latin letters in a cursive style. The synergies indicate smooth acceleration and deceleration patterns, as observed in human writing movements, suggesting that they can be reused in different orders from the demonstrations.
Moreover, their activation patterns correspond to the temporal structures of the task (Fig. 8B). First, the activation patterns are nearly regular and correspond to the letters. For example, when writing the letter j, the first and second synergies are activated in a specific order. The third and fourth synergies were predominantly activated when writing q and k, respectively. Moreover, the activation orders within individual letters were nearly consistent, even when varying the letter orders. However, certain modifications could be observed depending on the previous and next letters. The same tendency is observed for the pattern j-q-k, which was not used in synergy extraction.
Synergies represent different types of features when using different numbers and lengths. In N = 6, K = 10 (Fig. 8C), the synergies represent simpler curves than those for N = 4, K = 25. The first, second, fourth, and sixth synergies represent straight movements toward the −y, +x, +y, and −x directions, respectively. In contrast, because linear combinations of these synergies can express movement in any direction, the third and fifth synergies represent meaningless movements that remain at the origin. Their activation patterns correspond to the temporal structures of the letter orders, although they look complicated (Fig. 8D).
Each synergy clearly represents an individual letter in N = 3, K = 80 (Fig. 8E). This captures the structure of the demonstrations. However, the activation patterns in Fig. 8F did not adequately correspond to the temporal structures. For example, in j-j-j, the third synergy is expected to be used thrice; however, it was used only once at the beginning of the task. Moreover, the first synergy, which represents the shape of k, was used initially, although the commanded pattern did not include the letter k. Similarly, the activation patterns did not correspond to the letter orders in j-q-j and j-q-k. A possible reason for these failures is that the synergies could not represent the deviations in letter shapes. Even the same letter varies in shape each time. The shape also varies depending on the letters immediately before and after it. However, such variations cannot be represented when only a single synergy exists for each letter. Consequently, an undesirable synergy activity was obtained by attempting to fit variations in letter shapes, as observed in Fig. 8F.

B. PERFORMANCE OF MOVEMENT GENERATION
Time-varying synergies improved the performance of the neural networks with fewer parameters, as shown in Figs. 9-11 and Table 4. The time-varying synergies assisted NNs in learning movements by providing the spatiotemporal structures of the task. The proposed method decomposes demonstrations into movement primitives and combination patterns. Using these primitive patterns (synergies), NNs could generate unlearned patterns more easily than when learning demonstrations directly. However, without synergies, NNs would need more layers, parameters, and training data to capture these structures.
Moreover, the proposed models succeeded in the task on an actual robot despite control deviations introduced by the controller and disturbances caused by friction. This could also be due to synergies. Because the synergies deal with short-term movements, NNs could have focused on longer-term movements and become robust against small deviations in the input-output relationship. Additionally, the small number of parameters and layers in the proposed model, which was itself made possible by the synergies, contributed to stable movement generation.

C. LIMITATIONS AND FUTURE WORKS
Although time-varying synergies can help imitation learning, there are some limitations. First, there are currently no guidelines for choosing N and K for synergy extraction. As shown in Fig. 11 and Table 4, the performance varied depending on these hyperparameters. Although the performance was not very sensitive to them, a technique to find the best values for a given task would be useful. Second, the proposed method does not have a function to halt synergy activation once it starts, causing delays in reacting to unexpected disturbances. However, it has been reported that time-varying synergies can account for human reaching movements when the target positions suddenly change during a task [45]. Therefore, we expect a certain amount of adaptability to be realized without a synergy-activity interruption function. This evaluation is a topic for future research. Third, extending the proposed method to contact-rich tasks, such as assembly tasks, by combining force and impedance controls is desirable.

VI. CONCLUSION
This study proposes an imitation-learning method using time-varying synergies to decompose human demonstrations into linear combinations of a few primitives activated at various times. Owing to the spatiotemporal representation ability of the synergies, even compact NNs could learn movements comprising several sub-behaviors.
We evaluated the proposed method with the task of writing cursive-script letters. Consequently, the NN with the proposed method could generate movements even for new patterns using a small number of model parameters. In addition, the proposed method generated movements even when applied to a robot. The proposed model focuses on temporal structures (synergy activities) because synergies encode primitive temporal patterns, making it relatively easy for NNs to generate new movements by varying the onset times and amplitudes of the synergies. This contrasts with the baseline models, which need to vary the entire behavior without knowing the temporal structures of the demonstrations. Therefore, the time-varying synergy is a promising representation method for improving the efficiency and generalization ability of imitation learning.