A Tele-Operated Display With a Predictive Display Algorithm

Tele-operated display systems with head mounted displays (HMDs) are becoming popular as visual feedback systems for tele-operation. However, users suffer from time-varying bidirectional delays caused by the latency and limited bandwidth of wireless communication networks. Here, we develop a tele-operated display system and a predictive display algorithm that allow operators of tele-operation systems to use HMDs comfortably. Inspired by the kinematic model of the human head-neck complex, we built a robot neck-camera system that captures the field of view in any desired orientation. To reduce the negative effects of the time-varying bidirectional communication delay and the operation delay of the robot neck, we developed a predictive display algorithm based on a kinematic model of the human/robot neck-camera system and a geometrical model of a camera. Experimental results showed that the system provides predicted images to the user at a high frame rate.


I. INTRODUCTION
Disaster sites that are difficult for humans to access, such as those of natural disasters, radioactive accidents and chemical accidents, are increasing [2], [3]. Tele-operation systems have been researched so that robots can work at such sites under the guidance of a human user's intelligence [4]-[7]. In tele-operated systems, the user typically obtains information about the site from two-dimensional images transmitted to monitors by cameras attached to the tele-operated robot. Visual data is the most intuitive form of information when observing environments. However, the performance of a tele-operated system is limited if three-dimensional (3D) stereoscopic images are not provided, because the user cannot gauge distances between objects.
Recently, head mounted displays (HMDs) affording immersive 3D visual feedback have been used to observe environments and control tele-operated robots to perform manipulations [8]-[10]. Most HMDs feature integrated inertial measurement units (IMUs) that measure the orientation of the user's head. The head orientation is used to capture the field of view (FOV) in the direction in which the user looks by employing a robot neck-camera system or a panoramic camera on the tele-operated robot. However, it is difficult to provide real-time images in such tele-operated situations; the latency and limited bandwidth of the tele-communication network, together with the large size of stereoscopic images, impose delays and packet loss. Such delays and data loss, induced by the unstable nature of wireless tele-communication networks, trigger time-varying delays in image presentation, causing simulator sickness [11]-[13]. The operation delays imposed by the physical limitations of the robot neck-camera system increase the delay further.
The associate editor coordinating the review of this manuscript and approving it for publication was Zheng Chen.
Delays in tele-operation systems have been researched for decades, especially to ensure the stability and transparency of bilateral haptic systems. Many control algorithms have been researched, such as wave-variable based passivity control algorithms [14], [15] and modified algorithms that address constant [16], [17] and time-varying delays [18]-[20]. Adaptive and robust control algorithms have been studied to deal with time-delayed systems with nonlinearities and uncertainties [21]-[24]. Despite these achievements, it is difficult to apply such methods to visual feedback systems because of the different nature of the problem.
The use of point cloud data or images to construct virtual worlds (rather than images recently captured by the robot camera) may mitigate the effects of delay [9], [25], [26]. However, such methods impose large computational burdens or require expensive sensors. The delay effect could be reduced if the user's head motion were predicted and the remotely controlled robot moved ahead of the user. Prediction of the user's head motion has been researched to reduce the delay of HMDs displaying virtual reality [27], [28]. However, it is difficult to apply such predictions, because large delays and packet loss are expected in communications under disaster situations. Model-based predictions have been researched to mitigate the effect of delays in tele-operation systems. The position and orientation of robot manipulators were predicted by kinematic models of the robots and reflected in the user-side display, to avoid the long task times induced by the "move and wait" strategy [29], [30] or to increase manipulation precision [31]. However, the user's head motion was not involved in these systems, because the display systems consisted of ordinary monitors. A predictive display method was suggested in [32] to provide the image in the actual direction of a tele-operated vehicle by utilizing the dynamics of the vehicle, but it was not implemented in a real system.
Here, we developed a tele-operated display system and a practical predictive display algorithm that compensates for the bidirectional network and operation delays. We constructed a robot neck-camera system based on a kinematic model of the human head-neck complex. The display algorithm predicts the image in the direction of the user's current head orientation from a delayed image, by analyzing the difference between the delayed robot neck orientation and the current user head orientation. The bidirectional communication and operation delays are compensated by translating and rotating the delayed images using kinematic models of the human neck and the robot neck-camera system, and a geometrical model of the camera.
The remainder of this paper is organized as follows. Section II provides an overview of our tele-operated display system. Section III contains a detailed analysis of the camera and the predictive display algorithm. The experimental setup and the results afforded by the proposed algorithm are shown in Section IV. Section V presents the conclusions and describes planned future work.

II. OVERVIEW OF THE DISPLAY SYSTEM
We developed an intuitive display system for tele-operated robots. As shown in Figure 1, the human head-neck complex was modeled as a simplified serial linkage featuring three revolute joints that represent the rotational motion of the head. The HMD [33] attached to the user's head measures the head orientation and presents 3D stereoscopic images. The robot neck has the same kinematic structure as the human model. Two cameras [34] were placed on top of the robot neck to capture a stereoscopic view of the work site. To prevent user sickness caused by time differences between the stereoscopic cameras, the cameras were synchronized by digital signals so that both capture images at the same moment. The detailed specifications of the tele-operated display system are listed in Table 1.
The images can be delivered directly to the user via the HMD. However, as shown in Figure 2, the system is subject to three delays: the delay of the IMU measurement $\Delta_{IMU}$, the bidirectional communication delay $\Delta_{CM}$, and the operation delay of the robot neck $\Delta_{OP}$. If the images are provided directly to the user, the user may suffer because of these delays. In particular, the bidirectional communication delay is time-varying in nature and can trigger simulator sickness [11]-[13]. Also, random loss of image data, which is frequent because of the size of stereoscopic images, increases the variation in $\Delta_{CM}$. Such delays and image loss are inevitable in wireless networks, unless the communication is exceptionally well-controlled. $\Delta_{OP}$ is also time-varying, as it depends on the user's head motion. $\Delta_{CM}$ and $\Delta_{OP}$ are the predominant delays; $\Delta_{IMU}$ is both relatively small and constant. To reduce the undesirable effects of $\Delta_{CM}$ and $\Delta_{OP}$ (such as sickness [11]-[13]), we develop a predictive compensation algorithm. The algorithm modifies the delivered images before presenting them to the user. In this way, the image that should appear in the direction of the user's head is predicted; thus, the provided image instantly reflects the user's head motion despite the existence of time-varying delays.
The algorithm compensates for $\Delta_{CM}$ and $\Delta_{OP}$, which are the dominant time-varying delays of the entire system and contribute to user sickness.

III. THE PREDICTIVE DISPLAY ALGORITHM
A. DERIVATION
We first analyzed camera geometry. In this analysis, we assumed that the camera is a pinhole camera without lens distortion. Also, the image sensor was assumed to be square.
The camera was analyzed by dividing robot neck motions into yaw and pitch motions ($\theta_1$ and $\theta_2$ in Figure 1, respectively) and roll motion ($\theta_3$ in Figure 1). Figure 3 shows the simplified geometry; the camera features an aperture and an image sensor. Figure 3a shows image formation on the image sensor and the position change of the formed image caused by yaw or pitch motion of the robot neck. If a subject is placed in the direction of $\theta$ from the line perpendicular to the sensor, the distance from the center of the image sensor to the formed subject image, $\overline{op}$, can be approximated as follows:
$$\overline{op} = d\tan\theta \approx d\theta \quad (1)$$
where $d$ is the distance between the aperture and the camera image sensor. If the camera is rotated by $\Delta\theta$, the distance changes to $\overline{op'}$, as follows:
$$\overline{op'} = d\tan(\theta + \Delta\theta) \approx d(\theta + \Delta\theta) \quad (2)$$
Using (1) and (2), the positional change in a formed image, $\overline{pp'}$, can be calculated as:
$$\overline{pp'} = \overline{op'} - \overline{op} \approx d\,\Delta\theta \quad (3)$$
Note that $\overline{pp'}$ is proportional to the translation distance of the subject in the captured image (in pixels, $\delta_{pixel}$) as follows:
$$\delta_{pixel} = \alpha\,\overline{pp'} = \alpha d\,\Delta\theta \quad (4)$$
where $\alpha$ is a conversion factor used to transform the distance change of the subject image on the image sensor to that in the captured image in pixels. Figure 3b shows image formation of a subject on the image sensor and its position change due to roll motion by $\Delta\theta$. In such a case, the image sensor also rotates by $\Delta\theta$, as does the subject in the captured image.
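The proportionality between camera rotation and pixel shift described above can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, angles are taken in degrees, and `alpha_d` is the combined camera parameter in pixel/deg. The second function keeps the tangent instead of the small-angle approximation so that the two can be compared.

```python
import math


def pixel_shift(delta_theta_deg, alpha_d):
    """Small-angle model: pixel shift is proportional to the rotation.

    delta_theta_deg: yaw or pitch change of the camera in degrees.
    alpha_d: combined camera parameter in pixel/deg (hypothetical name).
    """
    return alpha_d * delta_theta_deg


def pixel_shift_exact(theta_deg, delta_theta_deg, alpha_d):
    """Same shift without the small-angle approximation, for comparison.

    alpha_d is given in pixel/deg, so it is converted to pixel/rad before
    being multiplied by the difference of tangents.
    """
    alpha_d_rad = alpha_d * 180.0 / math.pi
    t0 = math.radians(theta_deg)
    t1 = math.radians(theta_deg + delta_theta_deg)
    return alpha_d_rad * (math.tan(t1) - math.tan(t0))
```

For small rotations of a subject near the optical axis the two functions agree closely; the gap grows with the rotation angle and with the angular offset of the subject, which is one source of the modeling uncertainty in the pinhole approximation.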
The relationship between changes in camera orientation and positional change of the subject in captured images can be used to predict future images based on current images. If the camera joints corresponding to $\theta_1$, $\theta_2$ and $\theta_3$ of the human kinematic model are rotated by $\Delta\theta_1$, $\Delta\theta_2$ and $\Delta\theta_3$, respectively, the future image can be predicted by translating the current image by $\alpha d\,\Delta\theta_1$ in the horizontal direction and $\alpha d\,\Delta\theta_2$ in the vertical direction, and by rotating the image through $\Delta\theta_3$. Similarly, the scene in the direction of the current user head orientation can be predicted using that current orientation together with the delayed image and the robot neck orientation at which the image was captured.
The image prediction can be implemented by introducing a cropped area, and translating or rotating it by the orientation difference between the robot and human for yaw, pitch and roll motion, respectively. An example of image manipulation in the case of yaw motion (rotation of $\theta_1$) is shown in Figure 4a. Here, a single-axis rotation case is considered for clarity of explanation. The camera captures an image in the direction of its current orientation, and the image is delivered to the user side together with that orientation. A cropped area, inset from the image edges by the horizontal and vertical margins $M_h$ and $M_v$, is then shifted horizontally according to the yaw difference between the current user head orientation and the delayed robot neck orientation, and presented to the user.
Similarly, the algorithm can deal with pitch and roll motion of the user head by applying the manipulation sequentially (Figure 4b). The cropped areas are translated horizontally and vertically by $\alpha d\,\Delta\theta_1$ and $\alpha d\,\Delta\theta_2$, respectively, and rotated by $\Delta\theta_3$, where $\Delta\theta_1$, $\Delta\theta_2$ and $\Delta\theta_3$ are the yaw, pitch and roll components of the difference between $o_H(t - \Delta_{IMU})$ and $o_R(t - \Delta_{CM})$. As the algorithm predicts images in the direction of the current user head orientation $o_H(t - \Delta_{IMU})$, the effects of $\Delta_{CM}$ and $\Delta_{OP}$ are compensated. Thus, the image changes immediately following user head rotation, with a delay of only $\Delta_{IMU}$, which is both constant and much smaller than the time-varying delays $\Delta_{CM}$ and $\Delta_{OP}$. Therefore, the predictive display algorithm can reduce the discomfort or sickness caused by delays in the display system.
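The crop, translate and rotate manipulation described above can be sketched in Python with NumPy. This is a hedged illustration rather than the authors' code: the function name `predict_view`, the nearest-neighbor sampling, and the sign conventions (positive yaw moves the cropped area to the right, positive pitch moves it down in array coordinates) are all assumptions made for the example.

```python
import numpy as np


def predict_view(image, d_yaw, d_pitch, d_roll,
                 alpha_d_yaw, alpha_d_pitch, margin_h, margin_v):
    """Cut a translated and rotated cropped area out of a delayed image.

    image: H x W (or H x W x C) array holding the delayed camera image.
    d_yaw, d_pitch, d_roll: head-minus-robot orientation difference in degrees.
    alpha_d_yaw, alpha_d_pitch: camera parameters in pixel/deg.
    margin_h, margin_v: margins between cropped area and image edges in pixels.
    Returns the predicted view and a flag telling whether the cropped area
    stayed completely inside the delayed image.
    """
    h, w = image.shape[:2]
    out_h, out_w = h - 2 * margin_v, w - 2 * margin_h
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0

    # Coordinates of the output pixels relative to the crop center.
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    u = xs - (out_w - 1) / 2.0
    v = ys - (out_h - 1) / 2.0

    # Rotate by the roll difference, then shift by the yaw/pitch difference.
    c, s = np.cos(np.radians(d_roll)), np.sin(np.radians(d_roll))
    src_x = cx + alpha_d_yaw * d_yaw + c * u - s * v
    src_y = cy + alpha_d_pitch * d_pitch + s * u + c * v

    # Nearest-neighbor lookup; pixels outside the delayed image stay zero.
    xi = np.rint(src_x).astype(int)
    yi = np.rint(src_y).astype(int)
    inside = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros((out_h, out_w) + image.shape[2:], dtype=image.dtype)
    out[inside] = image[yi[inside], xi[inside]]
    return out, bool(inside.all())
```

With a zero orientation difference the function returns the centered crop unchanged. The returned flag turns false when the cropped area leaves the delayed image, which is the situation the margin analysis is meant to prevent.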
A delay compensation algorithm with a similar approach was proposed in Edwards' patent [35]. However, that method yields an image smaller than or equal to the FOV of the HMD, with negative or zero $M_h$ and $M_v$ values. Thus, the image shown in the HMD looks like a scene viewed through a window that is translated and rotated when the image is updated. This may negatively affect user immersion by limiting the user-side FOV. Although the concept is similar to ours, the patent provides no analysis of the camera geometry and image position change; our analyses support the validity of our algorithm. Also, the extent of the margins was not considered; we discuss these margins in Section III.B.

B. MARGIN ANALYSIS
The margins must be set considering the rotational speed of the user's head and the image delays caused by $\Delta_{CM}$ and $\Delta_{OP}$. If the cropped area overlaps with the region outside the image, the user no longer receives a complete image, but rather a clipped one, because no information is available outside the image. This compromises the task performance of the tele-operation system by reducing user immersion. Thus, the margins $M_h$ and $M_v$ must be large enough that the cropped area does not reach the image edges. However, large margins reduce the image FOV, in turn reducing the information imparted. Given this trade-off, the margins must be as small as possible but sufficiently large to prevent the cropped area from overlapping with the region outside the image.
The required margins can be calculated from the image delay and the rotational speeds of the user's head in yaw ($\theta_1$), pitch ($\theta_2$) and roll ($\theta_3$). Figure 5 shows the translation and rotation of the cropped area. The four corners are denoted A, B, C, and D. As one of these points will be the first to reach the edge of the image when the cropped area begins to overlap with an area outside the image, it is necessary to ensure that all four points stay within the image. Margin analysis commences at point A with initial coordinates $(x_A, y_A)$. The distances between A and the closest horizontal and vertical image edges are denoted $M_{h,A}$ and $M_{v,A}$, respectively. The coordinates of point A prior to translation and rotation of the cropped area are expressed in terms of $l$, the distance from the center of the image to A, and $\theta_i$, the angle between the horizontal line and OA; both are calculated from the margins and the image size. Before manipulation of the cropped area, $M_{h,A}$ and $M_{v,A}$ are identical to $M_h$ and $M_v$. As the area is translated and rotated, the coordinates of point A change. Assuming an image delay of $dt$ and user rotational speeds of $\dot\theta_1$, $\dot\theta_2$ and $\dot\theta_3$ (yaw, pitch and roll axes, respectively), the positional change of point A caused by user motion during the delay time $dt$ can be calculated. This positional change alters the margins $M_{h,A}$ and $M_{v,A}$, which must remain non-negative to ensure that the cropped area does not reach the edges of the image. Similarly, the coordinates and margins of the other three points after manipulation of the cropped area are obtained, and they must satisfy the same kind of criteria so that the cropped area does not reach the edge of the image. Using these criteria, the minimal required margins can be calculated by applying the maximum angular velocities of the user's head ($\dot\theta_{1,max}$, $\dot\theta_{2,max}$ and $\dot\theta_{3,max}$) and the maximum image delay ($dt_{max}$) to (11)-(12) and (16)-(21).
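One way to write out the margin relations for point A is sketched below. The symbols $W$ and $H$ (image width and height in pixels), the choice of A as the corner in the first quadrant of a frame centered on the image center O, and the sign conventions are assumptions made for illustration; they are not taken from the original equations, and the expressions are left unnumbered for that reason. The roll change enters the trigonometric functions as an angle, while $\alpha d$ converts the yaw and pitch changes into pixels.

```latex
\begin{align*}
l &= \sqrt{\left(\tfrac{W}{2}-M_h\right)^{2}+\left(\tfrac{H}{2}-M_v\right)^{2}},
&\theta_i &= \arctan\frac{H/2-M_v}{W/2-M_h},\\
x_A &= l\cos\theta_i, & y_A &= l\sin\theta_i,\\
\Delta\theta_k &= \dot\theta_k\,dt \quad (k=1,2,3),\\
x_A' &= l\cos(\theta_i+\Delta\theta_3)+\alpha d\,\Delta\theta_1,
& y_A' &= l\sin(\theta_i+\Delta\theta_3)+\alpha d\,\Delta\theta_2,\\
M_{h,A} &= \tfrac{W}{2}-x_A', & M_{v,A} &= \tfrac{H}{2}-y_A',\\
M_{h,A} &\ge 0, & M_{v,A} &\ge 0.
\end{align*}
```

As a consistency check, setting $dt = 0$ gives $x_A' = W/2 - M_h$ and $y_A' = H/2 - M_v$, so that $M_{h,A} = M_h$ and $M_{v,A} = M_v$, in agreement with the statement that the corner margins equal the nominal margins before manipulation. The other three corners follow by replacing $\theta_i$ with the corresponding corner angle and measuring the distance to the nearest edge on their side of the image.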
Conversely, the delay that a given set of margins can accommodate can be obtained by assigning the margins to (7)-(8) and calculating the $dt$ that satisfies (15) and (28).
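The two directions of the margin analysis (checking margins for a given delay, and finding the delay a set of margins tolerates) can be sketched in Python. This is a minimal sketch under assumptions: the function names, the image-centered coordinate frame, and the simple scan over the delay in fixed steps are illustrative choices and not the authors' procedure.

```python
import math


def remaining_margins(width, height, m_h, m_v, rates, dt, alpha_d):
    """Smallest horizontal and vertical margins left at the four corners.

    width, height: image size in pixels; m_h, m_v: nominal margins in pixels.
    rates: (yaw, pitch, roll) head speeds in deg/s; dt: image delay in s.
    alpha_d: (yaw, pitch) camera parameters in pixel/deg.
    A negative value means the cropped area has left the image.
    """
    half_w, half_h = width / 2.0 - m_h, height / 2.0 - m_v
    shift_x = alpha_d[0] * rates[0] * dt
    shift_y = alpha_d[1] * rates[1] * dt
    roll = math.radians(rates[2] * dt)
    c, s = math.cos(roll), math.sin(roll)
    min_h = min_v = float("inf")
    # Visit the four corners of the cropped area in turn.
    for sx, sy in ((1, 1), (-1, 1), (-1, -1), (1, -1)):
        x0, y0 = sx * half_w, sy * half_h
        x = c * x0 - s * y0 + shift_x
        y = s * x0 + c * y0 + shift_y
        min_h = min(min_h, width / 2.0 - abs(x))
        min_v = min(min_v, height / 2.0 - abs(y))
    return min_h, min_v


def max_tolerable_delay(width, height, m_h, m_v, rates, alpha_d,
                        step=0.001, limit=2.0):
    """Largest delay, scanned in `step` increments up to `limit` seconds,
    for which both remaining margins stay non-negative."""
    n = 0
    while (n + 1) * step <= limit:
        mh, mv = remaining_margins(width, height, m_h, m_v, rates,
                                   (n + 1) * step, alpha_d)
        if mh < 0 or mv < 0:
            break
        n += 1
    return n * step
```

A call such as `remaining_margins(554, 413, 50, 40, (40.0, 0.0, 0.0), 0.2, (5.5, 5.9))` evaluates the margins left after a 0.2 s delay at a yaw speed of 40 deg/s. Because the signs of the three speeds interact, a worst-case check would repeat the call over the sign combinations of the maximum speeds.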
The problem here is that the image delay changes significantly as tele-communication network conditions vary. To address this problem, the margins can be calculated and applied in real time, because our predictive display algorithm does not impose a large computational burden. However, margin variation also changes the image FOV, which may disturb the user. Also, the margins are limited, since the image has a limited size. If the user's head motion is fast enough to escape the margins within the given network delay, the cropped area unavoidably reaches the region outside the image. Thus, we determined the margins empirically in the following performance tests. The margins were first set to zero and then increased until the cropped area no longer reached the edge of the image during the experiments, so that the margin criteria were satisfied under the delay and user rotational speed of a test operation.

IV. EXPERIMENTS
A. PARAMETER IDENTIFICATION
To apply the predictive display algorithm, the camera parameter $\alpha d$ of (4) was identified experimentally. Figure 6 shows the experimental setup. The robot neck-camera module was fixed, and a subject was placed in front of the camera at a distance of 1 m, occupying the center of the image when the neck of the robot was in the initial position. The robot neck was rotated (in yaw or pitch) from −30° to 30° in 1° steps with the camera operating. The distance between the subject in the captured image and the image center was recorded in pixels; the parameter $\alpha d$ was identified by fitting (4) to the recorded positions and rotational angles using the least squares method. Thus, $\alpha d$ was identified as 5.5 pixel/deg for yaw and 5.9 pixel/deg for pitch, with root mean square (RMS) errors of 3.3 pixels and 4.1 pixels, respectively. Figure 7a and Figure 7b show the measured positions of the subject and the linearly fitted models for yaw and pitch of the robot neck, respectively. As the graphs show, there was no significant outlier, and the measured data and the camera model of (4) were in good agreement. The parameters differed slightly between yaw and pitch motion, which is attributed to lens distortion and/or image sensor asymmetry. The parameters depend on the physical specifications of the lens and the image sensor of the camera, such as the refractive index of the lens and the size and position of the image sensor, which do not vary with time. Thus, pre-identified camera parameters were used in the implementation of the predictive display algorithm.
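The identification step can be sketched as a least-squares fit of a line through the origin, since (4) has no intercept term. The sketch below is illustrative only: the function name `fit_alpha_d` is hypothetical, and whether the original fit included an intercept is not stated, so the through-origin form is an assumption.

```python
import numpy as np


def fit_alpha_d(angles_deg, offsets_px):
    """Least-squares slope of offset = alpha_d * angle (no intercept).

    angles_deg: neck rotation angles in degrees.
    offsets_px: measured subject offsets from the image center in pixels.
    Returns the fitted alpha_d in pixel/deg and the RMS residual in pixels.
    """
    a = np.asarray(angles_deg, dtype=float)
    p = np.asarray(offsets_px, dtype=float)
    alpha_d = float(a @ p / (a @ a))  # closed-form least-squares slope
    rms = float(np.sqrt(np.mean((p - alpha_d * a) ** 2)))
    return alpha_d, rms
```

Applied to angles from −30° to 30° in 1° steps and the corresponding recorded offsets, the function returns one parameter per axis, so it would be called once with the yaw data and once with the pitch data.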

B. IMPLEMENTATION OF THE PREDICTIVE DISPLAY ALGORITHM
The predictive display algorithm was tested in a real-world tele-operation experiment. The experimental setup was the same as that of Figure 2. The user wore the HMD and rotated the head; the head orientation was measured and delivered to the robot neck-camera module through a wireless network; the robot neck followed the user's motion while a stereo camera captured environmental images at a resolution of 554 × 413 pixels. The captured images and the robot neck orientations at the moments the images were taken were delivered to the user side. A random delay was intentionally added to the wireless network to reflect network conditions in practical tele-operated applications: the captured images and robot neck orientations were sent to the user side after a random delay with a minimum of 70 ms and a maximum of 230 ms, representing the communication delay in the tele-operated display system. Both sets of images (i.e., those subjected to and not subjected to the predictive display algorithm) were recorded. The margins of the predictive display algorithm were set to appropriate values and, for comparison, to negative values ($M_h = -90$, $M_v = -50$) as suggested in Edwards' patent [35]. Given that the frame rate of most digital videos is 30 frames per second (FPS), image prediction proceeded at 30 Hz.
FIGURE 9. Captured images in the tele-operation experiment. Focus on changes of image edges (e.g., the robot on the right side), or the moving subject (i.e., the human). Check the attached video file for more details.
The recorded results are attached as a supplementary video file. The bidirectional network delay is shown in Figure 8. The minimum, average, maximum, and standard deviation of the delay were 145, 245, 361, and 48 ms, respectively. Figure 9 shows sets of updated images from scenes (1) to (3) without/with the predictive algorithm operating under the same experimental conditions with appropriate margins. In the absence of the predictive display algorithm, the captured images do not change until new images arrive. However, the images modified by the predictive algorithm change continuously as the user's head rotates. As shown in the video file, the delay-compensated images are much smoother and more natural than those presented in the absence of the algorithm. The desired frame rate (30 FPS) was maintained even in the presence of unpredictable time-varying delays. The results imply that the proposed algorithm allows operators of tele-operation systems to comfortably use their HMDs.
A feature analysis was performed to verify the performance of the proposed algorithm. In this analysis, the coordinate differences of keypoints between the images before and after the arrival of a new image were compared, as shown in Figure 11a. A fast library for approximate nearest neighbors (FLANN) matcher in OpenCV was used in the analysis. The average values of the coordinate differences are plotted in Figure 11b. The average difference was 11.0 pixels without the predictive display algorithm and 1.9 pixels with the algorithm. The coordinate difference was significantly reduced when the predictive display algorithm was applied; i.e., the video with the proposed algorithm is more continuous, because the predicted image is similar to the new image that replaces it.
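The averaging step of the feature analysis can be sketched as follows. The sketch starts from keypoint coordinates that have already been matched between the two images (for example by a FLANN-based matcher), and it assumes that the coordinate difference is the Euclidean distance between matched keypoints; both the function name and that choice of distance are assumptions, as the exact metric is not stated above.

```python
import numpy as np


def mean_keypoint_shift(pts_before, pts_after):
    """Average displacement, in pixels, between matched keypoints.

    pts_before, pts_after: sequences of (x, y) coordinates in which the
    i-th entries of both sequences refer to the same matched keypoint.
    """
    before = np.asarray(pts_before, dtype=float)
    after = np.asarray(pts_after, dtype=float)
    # Euclidean distance per matched pair, then the mean over all pairs.
    return float(np.mean(np.linalg.norm(after - before, axis=1)))
```

A small value indicates that the image shown just before an update already resembles the newly arrived image, which is the behavior expected from a well-predicted view.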
However, the algorithm cannot deliver changes in the environment instantly, as reflected in Figure 9c and the video with a moving subject. The motion of a moving subject is not reflected until the image is updated in (3). This is a fundamental limitation of the algorithm, because it uses delayed images for the prediction. Also, inappropriate margins negatively affect user immersion by limiting the user-side FOV, as discussed in Section III and as shown in the last part of the supplementary video and in Figure 10. Thus, the margins must be selected to satisfy the criteria introduced in Section III.

V. CONCLUSION
In this research, a display system with a predictive display algorithm was developed. A robot neck-camera module was designed and manufactured based on a kinematic model of the human head-neck complex. The image delay during tele-operation, which may trigger simulator sickness, was analyzed. To deal with this issue, a predictive display algorithm was developed based on the camera geometry and the human and robot kinematics. The image for the current user head orientation was predicted and provided to the user by exploiting the difference between the current user head orientation and the delayed camera orientation. Thus, the time-varying delays in communication and operation were compensated, and continuous images were provided to the user. The proposed algorithm is simple to implement in real-time tele-operation systems without expensive sensors or a large computational burden, yet effective, as demonstrated in the experiments. The feature analysis shows that the proposed algorithm is effective in predicting images.
However, some challenges remain. The modeling uncertainties of the camera geometry, such as the tangent approximation and certain aspects of the pinhole camera model, should be reflected in the algorithm. The strategy for deciding the cropping margins requires refinement for use under various telecommunication network conditions. If the user's motion is fast enough to escape the margins, the proposed algorithm cannot provide a properly predicted image, as in [35]. Also, the predictive display algorithm does not instantly reflect a change in the environment, because it uses delayed images. Nevertheless, the proposed algorithm has significant practical utility because it is simple and imposes a low computational burden. User evaluation in various situations is essential; we will quantitatively evaluate improvements in user comfort and employ these data to guide our future research.
KYUTAEK HAN received the B.S. degree in mechanical system design engineering from the Seoul National University of Science and Technology, Seoul, South Korea, in 2016, and the M.S. degree in mechanical engineering from the Ulsan National Institute of Science and Technology (UNIST), Ulsan, South Korea, in 2019. He is currently a Researcher with the Korea Culture Technology Institute. His current research interests include imaging technology and platforms for immersive experience.