Adaptive Robot-Human Handovers With Preference Learning

This letter proposes an adaptive method for robot-to-human handovers under different scenarios. The method combines Dynamic Movement Primitives (DMP) with Preference Learning (PL) to generate online trajectories that are reactive to human motion, modulating the speed of the robot. PL allows for tuning the coupling parameters of the DMP, tailoring the interaction to each participant personally and allowing for qualitative analysis of user preferences. A simulation of an interaction-constrained learning task with different optimization techniques is performed to determine an appropriate learning approach for a handover task. The validity of the approach is demonstrated through experiments with participants on two handover tasks, with results indicating that the proposed method leads to seamless and pleasurable interactions.

Users place particular importance on temporal precision [2]. Temporal coordination tends to be more challenging in practical HRC settings, where the handover can be susceptible to interruptions or perturbations due to changing dynamics in human motion (Fig. 1). These perturbations can arise from scene misinterpretation on the side of the high-level controller (e.g. the robot initiating a handover while the human is not ready to receive the object), unexpected disturbances on the human side, or simply the user disengaging from the handover. Furthermore, users might have different preferences for how the robotic partner should move, depending on the setting or their personal comfort with the robot. Thus, it can be beneficial to develop a system that allows participants to tune the behavior of the robot intuitively.
To this end, we propose an adaptive method based on Dynamic Movement Primitives (DMP) [3] with Preference Learning (PL) for dynamic handovers. DMP allow for online trajectory generation which is reactive to human motion, modulating the speed of the robot [4]. Further, PL is used to tune the parameters of the DMP (and thus the adaptive capabilities of the robot) from interactive user feedback. In doing so, generated trajectories can be more coordinated, responsive, and robust to perturbations, thus ensuring seamless and pleasurable interaction. The proposed approach enables tailoring the interaction to each participant personally, and thus allows for qualitative analysis of user preferences between changing handover scenarios.

A. Trajectory Generation for Handovers
A handover represents a joint action between two partners, cooperating both spatially and temporally [5]. As such, adaptable behavior is crucial to accommodate changes in the environment and in partner behavior. Pre-planned methods may only be effective if all the constraints are known, which is not the case in the present study. To this end, several approaches have been used in the literature, including DMP [6], [7], Probabilistic Movement Primitives [8], and Interaction Primitives [9]. DMP are a popular choice for trajectory generation in robotics due to their ability to generate smooth and continuous trajectories while handling perturbations and noise.
In our previous work [4], we focused on a DMP-based approach that coupled the evolution of the robot trajectory to the human hand trajectory. In that work [4], as in the present work, we consider adapting to unmodeled perturbations given the permanence of the handover intention on the giver's side. However, in our previous study, we hand-picked the relevant DMP parameters, which may not be optimal for all users or situations. In this work, we aim to enable humans to optimize toward preferred behavior, thus resulting in more natural and intuitive handovers, across two different handover scenarios.

Fig. 1. Experimental setup (left) and protocol (right) for the two handover scenarios. In the Straightforward Handover (yellow trajectory), the user, starting from the start position (A), is tasked with reaching the handover exchange location (H). In the Perturbed Handover (red trajectory), the user, starting from the start position (A), is tasked with taking the box at location (B), transferring it to location (C), and then reaching the handover exchange location (H). The robot always starts in the robot start location (R) and reaches (H). 16 participants are divided into two groups of eight and follow the outlined protocol. A board with a relative scale is placed behind the participants for them to give their preferences after the relevant interactions. In the displayed example, upon obtaining the preference on the interaction labeled "4", the comparisons added to D are θ_4 ≻ θ_2 and θ_3 ≻ θ_4. As interactions "1" and "4" are rated as equal, no comparison is constructed between them.

B. Coordination, Perturbations, and Preferences in Handovers
Koene et al. [2] have demonstrated that temporal precision is a major determinant of user perception of the handover. Considering a handover with a predetermined exchange location allows the controller to adapt more easily to changes in the speed of the human hand, making it easier to examine the effects of temporal coordination between the human and the robot.
While the literature on robotic handovers seldom considers interactions in which the handover action could be disrupted by another task or by unexpected perturbations, Huang et al. [10] propose an adaptive controller that takes into account the availability of the human receiver. The aforementioned approach is based on a finite state machine (FSM) consisting of multiple robot states, and as such addresses a similar control problem at a higher level.
Using human preferences to directly shape the generation of the trajectory is often limited by the high dimensionality of the problem and the difficulty of gathering significant feedback. In [11], the authors propose a method to learn user preferences from real-world interaction using contextual policy search. However, this method relies on absolute feedback, which can introduce the problems of drift, change of scale, or forgetting [12], [13]. Handover tasks can also be successfully learned from demonstrations, as shown by Wu et al. [7], who similarly employ DMP to generate the trajectory. The focus of the work in [7] is on the spatial coordination between partners, and on human-to-robot handovers. In both of these works [7], [11], handovers are represented as straightforward interactions that are not susceptible to possible perturbations.
To address the issues introduced by absolute feedback, the preferred behavior can also be learned from user preferences expressed as binary comparisons [12], [13]. A notable line of work on pairwise comparisons applied to robotic learning is demonstrated in research by Tucker et al. [14], [15]. In this work, comparisons are supplemented by coactive feedback to improve the sample efficiency of the model optimizing the gait of exoskeletons and bipedal robots [14], [15]. Such improvements could benefit the proposed method in the future as well.
It is worth noting that a more elaborate method for employing preference feedback could be set up by learning a reward which some reinforcement-learning algorithm could exploit [16], [17]. However, in this case, additional challenges could be introduced, both computationally and in terms of the necessary training data. Furthermore, the authors of [18] have shown that DMP can be adapted by learning on discrete human feedback. As the handover is relatively simple in terms of task complexity, we aim to use preference learning to adapt the dynamics of the system by changing critical parameters of the trajectory generation module.

A. Dynamic Movement Primitives
DMP can be used to generate trajectories online as an evolution of a virtual dynamical system. The original framework [3] is commonly extended to learn and reproduce both periodic and non-periodic trajectories.
A one-dimensional trajectory can be represented with one degree of freedom (DOF) x(t), with initial state x(t_0) = x_0 and desired final goal x(t_f) = g. The DMP models the evolution of this trajectory as a second-order dynamical system:

τ ż = α_x (β_x (g − x) − z) + f(s),   τ ẋ = z,

where α_x and β_x are positive parameters, and τ a time constant.
With β_x = α_x/4, a critically damped system is obtained. f(s) represents a forcing term that can be used to mold the evolution of the trajectory and is defined as:

f(s) = f_nl(s) s (g − x_0),

where f_nl is an arbitrary non-linear function approximator and (g − x_0) is a scaling factor. As the system evolves, a phase variable s decreases monotonically from 1 to 0, and x(t) converges stably toward g, as the effect of the forcing term vanishes. The phase s typically follows first-order dynamics of the form:

τ ṡ = −α_s s,

with τ and α_s again as the time constant of the system and a positive parameter, respectively. The system can be extended to multiple DOFs by defining one transformation system for each DOF and a shared canonical system. By doing so, the different DOFs are coordinated by the single common phase variable.
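As a concrete illustration, the following minimal sketch integrates the transformation and canonical systems above with an explicit Euler scheme. The gain values (α_x = 25, α_s = 4), the time step, and the zero forcing term in the usage example are illustrative assumptions, not the values used in the experiments (those are reported in Table I).

```python
def dmp_step(x, z, s, g, x0, f_nl, dt, tau=1.0, alpha_x=25.0, alpha_s=4.0):
    """One explicit-Euler step of a single-DOF DMP.
    x: position, z: scaled velocity, s: phase in (0, 1]."""
    beta_x = alpha_x / 4.0                  # critically damped choice
    f = f_nl(s) * s * (g - x0)              # forcing term; vanishes as s -> 0
    z_dot = (alpha_x * (beta_x * (g - x) - z) + f) / tau
    x_dot = z / tau
    s_dot = -alpha_s * s / tau              # canonical (phase) system
    return x + dt * x_dot, z + dt * z_dot, s + dt * s_dot

# Roll out toward g = 0.3 with a zero forcing term (pure point attractor).
x, z, s = 0.0, 0.0, 1.0
for _ in range(2000):
    x, z, s = dmp_step(x, z, s, g=0.3, x0=0.0, f_nl=lambda s: 0.0, dt=0.002)
```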

B. Coupling Terms
By adding spatial or temporal coupling terms, the DMP framework can be extended to produce elaborate behavior. The coupling terms are integrated as:

τ ż = α_x (β_x (g − x) − z) + f(s) + C_s,   τ ṡ = −α_s s + C_t,

where C_s represents the spatial and C_t the temporal coupling term.
To coordinate the trajectory with the behavior of the user, this work uses the method proposed in [4].
Given an estimate of the handover location g, d is defined as the distance of the human hand from it, with an initial value d(t_0) = d_0 at the start of the interaction. A second-order low-pass filter is applied to the measured distance d, yielding the filtered distance d̂ and its rate of change ḋ̂. The coupling terms are then defined as:

C_t = k_t σ_ḋ(ḋ̂) σ_d(d̂),   C_s = k_s σ_ḋ(ḋ̂) σ_d(d̂),

where k_t and k_s are positive gains and σ_i(y) = 1/(1 + e^(−a_i(y − δ_i))) is a sigmoid function with x-axis offset δ_i and steepness coefficient a_i. The sigmoid σ_ḋ is mainly responsible for the adaptation to human motion, while σ_d reduces its influence once the hand reaches the final position. As the main role of σ_d is as a filter against practical edge cases [4], the values of a_d and δ_d are here fixed to 13.0 and −0.35, respectively, producing a quasi-linear response, as in [4].
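The sketch below computes these coupling terms. The product composition of the two sigmoids matches the gating role described above but is an assumption about the exact form used in [4], and the helper names are illustrative; the shared gain k_t = k_s = k anticipates Section III-C.

```python
import math

def sigmoid(y, a, delta):
    # sigma_i(y) with steepness a_i and x-axis offset delta_i
    return 1.0 / (1.0 + math.exp(-a * (y - delta)))

def coupling_terms(d_hat, d_hat_dot, k, a_dd, delta_dd,
                   a_d=13.0, delta_d=-0.35):
    """Temporal and spatial coupling terms from the filtered hand distance
    d_hat and its rate of change d_hat_dot (k_t = k_s = k)."""
    react = sigmoid(d_hat_dot, a_dd, delta_dd)  # adapts to human hand motion
    gate = sigmoid(d_hat, a_d, delta_d)         # quasi-linear edge-case filter
    c = react * gate
    return k * c, k * c                         # (C_t, C_s)
```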

C. Preferential Learning With Refreshing Scale
Learning from human feedback is difficult due to inherent problems in human evaluations. Absolute human feedback is usually noisy and unreliable, suffering from drift (the scale shifting over time) and anchoring (early interactions being deemed more important) [12], [13]. Moreover, different users can have very different internal scales. To address these issues, participants can be asked to give a relative evaluation, stating how the most recent interaction compared to the previous one [12], [13]. To this end, a probit model can be used to infer the utility function of human preferences u from binary feedback [12], [13]. In the proposed approach, we assume that u is a latent value given by the user's perception of the interaction relative to the changing robot trajectory. The feedback is collected in a data set of ranked pairs:

D = {θ_r^i ≻ θ_c^i}, i = 1, . . ., m,

where θ_r^i, θ_c^i ∈ Θ are instances of points in the parameter space. After collecting the data, a zero-mean non-parametric Gaussian process (GP) prior can be fitted as:

p(u) ∝ exp(−(1/2) uᵀ K⁻¹ u),

where u = [u(θ_1), u(θ_2), . . ., u(θ_n)]ᵀ is the utility of user choice at the sampled points and K is the n × n covariance matrix (n is the number of instances) [12]. To estimate the posterior distribution of u given D, a probit model is fit [12], [13]:

p(u | D) ∝ p(u) ∏_{i=1}^{m} Φ((u(θ_r^i) − u(θ_c^i)) / (√2 σ_p)),

where Φ is the standard normal cumulative distribution function and σ_p captures the noise in the comparisons. This problem can be treated as an optimization of an expensive-to-estimate black-box function, a setting where Bayesian Optimization (BO) can be employed effectively [13]. Most commonly in BO settings, the Expected Improvement (EI) acquisition function is used to efficiently sample the next set of parameters. Given the best observed value of the latent function u(θ*), a new point is queried by maximizing the EI:

EI(θ) = E[max(u(θ) − u(θ*), 0)].

However, the performance of BO methods can be susceptible to noise, especially since human evaluations represent high-noise feedback. A method proposed by Letham et al. [19] for batch Monte Carlo approximation of Expected Improvement under Noisy observations and constraints (qNEI) can alleviate these hindrances. While EI-based acquisition functions are often regarded as greedy, in HRC tasks this property can be somewhat desirable, as collecting data tends to be expensive and the number of interactions is limited. The exploration-exploitation trade-off in an optimization problem with a limited number of interactions and scaling noise is thus considered in Section IV.
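In practice, this preferential BO loop maps closely onto BoTorch [22], which the experiments use. A minimal sketch follows, assuming a recent BoTorch version and the unit-normalized, three-dimensional parameter space of Section III-C; the data tensors are placeholders.

```python
import torch
from botorch.models.pairwise_gp import (
    PairwiseGP, PairwiseLaplaceMarginalLogLikelihood)
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qNoisyExpectedImprovement
from botorch.optim import optimize_acqf

# Visited parameter points (n x 3, normalized) and comparisons as index
# pairs [winner, loser] extracted from the relative scale.
theta = torch.rand(6, 3, dtype=torch.double)
comps = torch.tensor([[1, 0], [2, 1], [3, 2], [4, 2]])

model = PairwiseGP(theta, comps)   # GP over latent utility, probit likelihood
mll = PairwiseLaplaceMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_mll(mll)

# Query the next parameter point with qNEI over the unit cube.
acqf = qNoisyExpectedImprovement(model=model, X_baseline=theta)
bounds = torch.tensor([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], dtype=torch.double)
candidate, _ = optimize_acqf(acqf, bounds=bounds, q=1,
                             num_restarts=10, raw_samples=256)
```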
While preferential feedback alleviates the variance introduced by noise in human evaluation, preferences themselves are not highly informative. In a classic setting [13], two points are sampled and displayed, and the user gives a preference between them. Similarly, in virtual environments it is possible to simultaneously provide a gallery of options from which the user selects the preferred one [13]. However, this is not possible in practical HRC scenarios due to their interactive nature. To alleviate these disadvantages, we propose a relative scale with a periodic refresh. Interactions are performed in batches of q sampled points (q = 5 in the experiments with participants outlined in Section V). The relative scale is presented as an arrow ranging from "Strongly Worse" to "Strongly Better". This design decision follows the pilot study [20], where some participants found it difficult to express their preferences on a seven-point scale (e.g. when asked to rate a later interaction as in-between two previous interactions already placed in consecutive bins). This representation is employed to reduce the strain on participants when giving their preferences. It is explained to participants that it is only important to order the preferences relative to each other, and that the visual representation and the distance between the placed preferences carry no absolute value and do not affect the model.
After each interaction, pairwise comparisons are extracted between the most recent set of parameters and the previous sets within the batch (Fig. 1) and added to the data set D; a new prior is obtained, and a new parameter point θ is sampled. Thus, comparisons are generated between up to q points, instead of in a two-by-two fashion, increasing the information gained from each interaction. After each batch, the scale is refreshed, removing the user's feedback so far. Then, the user is presented with the previous best-observed point θ*, and a new batch of sampled interactions begins. The benefits of the refresh are twofold: first, the hindrances of drift and anchoring are removed, as the scale does not have an absolute value; second, the strain on participants' memory is reduced, and they can focus on the relation between the few most recent interactions. The proposed method aims to learn the parameters which shape the reactiveness of the generated trajectory: a_ḋ, δ_ḋ, and k_t = k_s = k. This approach differs slightly from the pilot study [20], where the temporal and spatial gains were decoupled. However, decoupled gains could cause instabilities in the virtual dynamical system when the difference between the gains was significant, while for sufficiently close values the differences in trajectory were not noticeable. The continuous parameter space Θ of the PL algorithm was set to a_ḋ ∈ [1.0, 10.0], δ_ḋ ∈ [−1.0, 1.0], and k ∈ [0.01, 15], with θ = (a_ḋ, δ_ḋ, k). This again results from findings in the pilot study [20], where the original bounds were too broad and certain parameter settings could lead to unsafe behavior (i.e. the robot accelerating excessively, resulting in high inertia). The values of the DMP parameters are reported in Table I.
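The comparison-extraction step can be illustrated as follows. The positions on the scale are read as ordinal values only, and the helper below, with its names and equality tolerance, is a hypothetical sketch.

```python
def comparisons_from_scale(positions, eps=1e-3):
    """Pairwise comparisons between the latest interaction in the batch and
    every earlier one, from their placements on the relative scale
    (higher = better). Returns (winner, loser) index pairs; placements
    within eps of each other ("equal") yield no comparison."""
    last = len(positions) - 1
    comps = []
    for i in range(last):
        if positions[last] > positions[i] + eps:
            comps.append((last, i))
        elif positions[last] < positions[i] - eps:
            comps.append((i, last))
    return comps

# Fig. 1 example: after interaction "4" is placed, theta_4 > theta_2 and
# theta_3 > theta_4 are added; interactions "1" and "4" are equal, so no pair.
print(comparisons_from_scale([0.40, 0.10, 0.75, 0.40]))  # [(3, 1), (2, 3)]
```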
The complete block diagram of the proposed system is demonstrated in Fig. 2.

IV. SIMULATION OF AN INTERACTION CONSTRAINED LEARNING TASK
As it would be exceedingly time-consuming to examine the performance of the BO method under different hyperparameter settings in a real-world robotic scenario, a simulation is performed. The task is set up as reaching a randomly sampled point θ_t in a three-dimensional space, with the latent function evaluated through the l_2 norm:

u(θ) = ∥θ − θ_t∥_2.

Observations are collected as:

y(θ_j) = u(θ_j) + j ε,   ε ∼ N(0, σ²),

where N(0, σ²) is Gaussian noise and j is the index of the interaction within the batch (j = 1, . . ., q). Comparisons are then reconstructed as θ_r ≻ θ_c if y(θ_r) < y(θ_c). By doing so, observations with scaling noise are simulated, representing possible participant forgetting as more interactions are performed within the batch. Two of the most critical hyperparameters for the proposed PL loop are the acquisition function and the size of the batch q. To better evaluate the performance of the BO approach under different acquisition functions, an additional acquisition function is considered: given two queries, the Expected Utility of Best Option (EUBO) aims to increase the information gained by maximizing the utility obtained from these queries [21]. Thus, two acquisition functions, qNEI and EUBO, are evaluated. The influence of the batch size q is evaluated by considering qNEI approaches with q = 2 and q = 5. These conditions are tested in the noiseless (σ = 0) and noisy (σ = 0.05) settings. The GP is implemented with a Radial Basis Function kernel (as in [12], [13], [22]).
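A minimal sketch of this simulated oracle is given below; the seed and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_t = rng.uniform(size=3)                # randomly sampled target point

def latent(theta):
    return np.linalg.norm(theta - theta_t)   # u(theta) = ||theta - theta_t||_2

def observe(theta, j, sigma=0.05):
    """Noisy observation; the noise scales with the within-batch index j
    (j = 1, ..., q) to mimic participant forgetting."""
    return latent(theta) + j * rng.normal(0.0, sigma)

def prefer(theta_r, theta_c, j_r, j_c, sigma=0.05):
    # theta_r wins the comparison if its noisy observation is lower
    return observe(theta_r, j_r, sigma) < observe(theta_c, j_c, sigma)
```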
The model is initialized by sampling 5 random parameter points. To simulate the constraints of real-world interactions, we only consider 25 unique queries in total for each approach. It is worth noting that, in practice, the two-by-two experiment (q = 2) would result in 45 interactions, while with q = 5 it would result in 30 interactions. The experiment is run for a total of 500 trials for each test condition. To evaluate the performance and sample efficiency of the different BO approaches, the mean distance to the target per interaction and the success rate (u(θ) < 0.1) per interaction are reported in Fig. 3.
From the top plot in Fig. 3, it is apparent that EUBO is the more elaborate acquisition function, as the utility function is better approximated. However, in an HRC scenario, we might be more interested in finding some set of parameters that leads to satisfactory performance within the limited interaction time users have with the robot. The bottom plot in Fig. 3 thus shows the merit of using qNEI in a practical scenario with a reduced number of trials: due to the "greediness" of qNEI, satisfactory solutions can be reached more easily. This is highly dependent on the scenario, but it can be a reasonable approximation in HRC settings where the goal is to have a pleasurable interaction. Further, the performance of the batched approach (q = 5) is noteworthy, even in the presence of scaling noise. While the noise in evaluations depends on both the task and the user, the information gained, and thus the potential reduction in real-world training time, should not be overlooked. Therefore, it might be beneficial to perform interactions in refreshing batches, provided the users experience no difficulty remembering the interactions.

V. EXPERIMENTS WITH PARTICIPANTS
In Section III we presented a method to optimize, from human preferences, toward a preferred set of adaptive parameters for robot trajectories within a limited number of handover interactions. An experiment is run to first learn the preferred set of parameters from each participant's feedback, collected over a limited number of interactions. Second, we evaluate the hypothesis that participants prefer the set of parameters obtained by BO over a set of parameters picked with quasi-random Sobol sampling. Hence, the controlled variable is the set of parameters selected with PL, and the dependent variable is the percentage of BO-selected sets evaluated as preferred over the sampled ones.

A. Handover Scenarios
The straightforward handover (yellow arrow in Fig. 1) consists of the user reaching for and grasping the object (an empty 0.5 l water bottle) from the robot giver. The robot and the participant are given the "Go" signal simultaneously. Furthermore, they are positioned so that both the robot and the human complete their respective trajectories at approximately the same time, should the robot move at top speed. This can be considered a standard handover, commonly discussed in the literature, without any perturbations.
In the second scenario, the perturbed handover (red arrow in Fig. 1), the participant is tasked with performing a secondary task before engaging in the handover. Again, both the robot and the participant are given the "Go" signal simultaneously. In this scenario, the participant first has to reach for a box placed in the vicinity (∼25 cm) of the handover location. Second, they have to place the box on a nearby desk. Finally, after placing the box, the participant reaches for the final location to receive the object from the robotic giver, and places it in the aforementioned box.

B. Setup
To facilitate the robotic handover, a Universal Robots UR5 CB-series manipulator equipped with an IH2 Azzurra hand (Prensilia SRL) [23] is used. An OnRobot HEX-70-XE six-axis force-torque sensor is mounted between the wrist and the hand to enable a force-threshold release of the object. A Vicon motion capture system with 6 Bonita cameras (Vicon Ltd) is used to track the human hand at 100 Hz. A speaker is used to give the participants the auditory "Go" signal. The PL algorithm was implemented using BoTorch [22] and run on a separate machine.

C. Experiment Protocol
Sixteen participants (right-handed, 8 female and 8 male, aged 23-40) took part in the experiment. Eight participants were assigned to each scenario. Informed consent for voluntary participation was obtained in accordance with the Declaration of Helsinki.
First, motion capture markers were fitted to the participant's hand. Then, the assigned task was explained. It was also explained to the participants that the robot has different coordination capabilities which might vary from trial to trial. Participants were asked to give their relative preference after each interaction, rating it as better, worse, or equal to the previous ones within the batch; an "equal" evaluation meant that the participant had no clear preference between the interactions rated as such. Fig. 1 reports an example of how pairwise comparisons are constructed after each preference. It was made clear that the robot behavior might change as a result of these preferences, and how the relative scale functions. Further, it was emphasized to the participants that the release of the object would always be the same and that they should only give preferences in relation to the timing of the robot's trajectory.
To start the experiment, 5 interactions are performed with randomly sampled parameters to initialize the model. Then, 4 learning batches of 5 interactions each are performed, with the scale refreshing after each batch. Finally, 1 more interaction is performed with the best set of parameters learned, and a subjective questionnaire is given to the participants:

- Q1: It was easy to remember the interactions within the batch. (5 points from "Not at all" to "Perfectly")
- Q2: Did you feel that the robot was constantly improving? (Yes/No/Maybe)
- Q3: Relative to the last set of parameters tried, how much did you perceive that the robot was coordinating with you? (5 points from "Not at all" to "Perfectly")

Then, participants were asked to perform 4 more batches of 2 trials each. These served as validation trials, where, unbeknownst to participants, the learned set of parameters was compared to randomly sampled parameters, in a randomly sampled order. Participants again had the choice to rate each interaction as better, worse, or equal to the previous one. During these validation trials, human and robot trajectories were recorded for qualitative analysis. In total, the experimental procedure lasted approximately 40 minutes per participant. This study was approved by the local ethical committee of the Scuola Superiore Sant'Anna, Pisa, Italy (approval number 21/2022).

A. Validation Results
The results of the validation trials are presented in Table II for the straightforward handover and in Table III for the perturbed scenario. The percentages of preferences toward the set of parameters selected with BO were significantly above the reference threshold of 50% (sign test, N = 16, p < 0.001). The reference threshold was chosen as the median value of the possible outcomes.
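For reference, such a sign test against the 50% threshold reduces to a one-sided binomial test; the sketch below uses hypothetical counts (the actual counts are those reported in Tables II and III).

```python
from scipy.stats import binomtest

# Hypothetical pooled counts: validation comparisons in which the
# BO-selected parameters were preferred, out of all decided comparisons.
n_pref, n_total = 58, 64
res = binomtest(n_pref, n_total, p=0.5, alternative="greater")
print(f"sign test p-value: {res.pvalue:.2e}")
```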

B. Survey Results
Table IV reports the results of the subjective surveys in the straightforward and perturbed handover for the questionnaire described in Section V-C. Answers to Q1 were above the threshold of 3 (sign test, p < 0.001). This result points to the relative ease of remembering the interactions within the batch; in general, participants did not report any difficulties. Moreover, there was no significant difference between the answers in the straightforward and perturbed handover (Mann-Whitney U test), suggesting the different tasks did not affect the ability to remember the interactions in the batch. No significant difference between the straightforward and perturbed handover was found for Q3 either (Mann-Whitney U test), underlining that a similar level of coordination was perceived in the two conditions. Q2 was posed as a precaution: a trend of participants answering "Yes" to this question could indicate a bias toward learning robots, which could hinder the PL process. No statistical analysis is performed on this question, as it would require a significantly higher number of respondents.

C. Qualitative Metrics
To qualitatively assess the preferred robot behavior, human and robot trajectories are recorded. While it is challenging to define a direct optimization metric when unmodeled perturbations might arise, comparing the respective trajectories given the preferred parameters allows for a qualitative assessment of controller performance. The trajectory plots are represented in Fig. 4.

VII. DISCUSSION
Two handover scenarios are set up to investigate the correlation and contrast in participant preferences between different handover tasks. The motivation behind the task set-up in the perturbed handover is not only to present a realistic, less structured interaction, but also to present a worst-case scenario for the controller. By placing the secondary object (the box) close to the final handover location, higher precision is required from the controller in interpreting human motion. Thus, learning the parameters is made more difficult, as we hypothesize that users might prefer more nuanced reactive behavior in such a scenario. This is largely because many parameter settings lead to non-reactive robot behavior.
Considering the validation results in Section VI-A, it can be seen that the proposed PL approach has a high success rate. Out of all the trials, the only optimization which could be deemed suboptimal is the one performed by Participant 9. In this case, the PL algorithm likely over-fitted to a local optimum representing a slightly-damped, non-reactive controller, whereas the validation trials made it clear that the participant might have preferred more reactive behavior. This might be due to the correlation between speed and coordination: even if the robot was not reactive, it was still performing the trajectory at the right speed to be perceived as coordinated. Nevertheless, this exemplifies the difficulty of learning from direct human feedback, as participants might have to weigh many different correlated characteristics when giving their preferences. Furthermore, it is worth noting that the challenges that come with noise in human evaluation affect the validation process as well. For example, in Table II, in the single missed validation trial marked with *, the participant was presented with two (identical) non-reactive controllers but preferred one to the other, rather than rating them as the same. These findings might indicate that the assumptions from Section IV are valid, as it is very challenging to estimate a "global" optimum of the latent utility function of human preference. For the same reason, and due to limited experimental time with human participants, it is challenging to estimate the absolute number of interactions required for finding the optimal parameters. For the proposed scenarios and the required parameter space, 25 interactions can be viewed as a conservative estimate to reach satisfactory performance. Should the dimension of the required parameter space be significantly higher, the noisiness of evaluations would likely diminish the effectiveness of the BO approach. Considering all the factors, the PL approach seems to overcome these challenges in the proposed setting and consistently produce satisfactory controllers within the limited number of interactions.
The questionnaire was given to participants to better understand their subjective experience. Mainly, it was of interest to verify that participants in general did not have trouble remembering the interactions within the batch. The handover is not a long task, so batching a small number of interactions can lead to more effective data usage; indeed, the participants did not exhibit difficulty remembering the trials within a batch of five interactions. Q2 was posed to investigate whether there was some bias toward a learning robotic agent, i.e. participants rating later interactions more favorably due to the expectation of a robot that is constantly improving. This might represent a challenge, as BO tends to query points with high variance, leading to points of varying perceived value. As mentioned, the number of participants does not allow for a statistical analysis of this bias. However, further studies into biases introduced by interaction with learning robots are warranted, as this might improve the design of learning methods. From Q3, no statistically significant difference can be reported in the perception of robot coordination between the scenarios, even though the dynamics of the two tasks differ. However, the somewhat lower scores on this question might be due to a misunderstanding: some participants later noted that they believed a "set of parameters" referred to the last learning batch together with the converged parameters, instead of the converged parameters separately (as intended).
The qualitative metrics give insight into pleasurable robot behavior across scenarios. In the straightforward scenario, 6 out of 8 participants converged to a completely non-reactive set of parameters, with the robot moving at top speed (leading to identical robot trajectories). The two remaining participants converged to highly reactive parameters with high accelerations. From the Participant 7 and Participant 8 plots in Fig. 4, it can be observed that the robot initiates its movement after the human. In the perturbed scenario, reactive behavior is more appreciated, as 5 out of 8 participants converged to reactive controllers. There were varying types of reactive parameters between participants: for example, Participant 14 preferred a slightly-damped reactive controller resulting in smooth accelerations, while Participant 16 preferred a highly reactive controller, with the robot moving only after the participant had placed the box on the desk (Fig. 1). It is worth mentioning Participant 9 again, as they might not have converged to their global optimum, but would instead have preferred a reactive controller as well. From these plots across the different scenarios, it can be concluded that participants placed high importance on coordination (arriving at the same time) as opposed to simply speed or reactiveness. Across all the learning and validation trials, participants consistently rated "slow" controllers (the robot arriving late) negatively.

VIII. CONCLUSION
A combination of the DMP framework for online trajectory generation and BO-based PL from direct user feedback allows for adaptive, responsive, and user-tailored robotic handovers. Following the results from Sections IV and VI, batching relative user preferences with a refreshing scale can be highly beneficial in short interactions. From the qualitative analysis of preferred robot behaviors, it is apparent that users deem temporal coordination (partners arriving at the same time) the key factor, as opposed to other factors such as speed or reactiveness.
Practical robotic implementations would benefit from a combination of the proposed method with high-level FSM controllers (for example querying preference on interactions when some specific perturbation is detected). Furthermore, PL data could be used across different interactions with different participants to construct better priors, which might greatly benefit the BO approach.