VR-Based Immersive Service Management in B5G Mobile Systems: A UAV Command and Control Use Case

The management of remote services, such as remote surgery, remote sensing, or remote driving, has become increasingly important, especially with the emerging 5G and Beyond 5G technologies. However, the strict network requirements of these remote services represent one of the major challenges that hinder their fast and large-scale deployment in critical infrastructures. This article addresses certain issues inherent in remote and immersive control of virtual reality (VR)-based unmanned aerial vehicles (UAVs), whereby a user remotely controls UAVs, equipped with 360° cameras, using their head-mounted devices (HMD) and their respective controllers. Remote and immersive control services, using 360° video streams, require very low latency and high throughput for true immersion and high service reliability. To assess and analyze these requirements, this article introduces a real-life testbed system that leverages different technologies (e.g., VR, 360° video streaming over 4G/5G, and edge computing). In the performance evaluation, three latency types are considered: 1) glass-to-glass latency between the 360° camera of a remote UAV and the HMD display; 2) the user/pilot's reaction latency; and 3) the command/execution latency. The obtained results indicate that the responsiveness (dubbed Glass-to-Reaction-to-Execution, or GRE, latency) of a pilot, using our system, to a sudden event is within an acceptable range, i.e., around 900 ms.


INTRODUCTION
Nowadays, we are witnessing very rapid developments towards innovative technologies in the field of telecommunications. Thanks to 4G mobile networks, which have been widely used in recent years, and the great efforts devoted to the development of the Internet of Things (IoT), different sectors have been positively impacted [1] [2]. These sectors include health, education, smart cities [3], and agriculture [4], to name a few. In this vein, 5G networks are expected to continue paving the way for different IoT applications [5] [6], particularly in the case of remote immersive services and/or Virtual Reality (VR)-based applications, such as tele-surgery, self-driving, and UAV-based applications.
VR technology has recently received increasing interest in both academia and industry [7], [8]. It is rapidly progressing towards consumer use, bringing an immersive experience to a variety of applications. It is worth noting that remote and highly mobile services, particularly those using UAVs, have gained great interest and will play an important role as an enabling technology in IoT and next-generation networks [9] [10] [11].
However, one key challenge for current networks in supporting these immersive services lies in their particular requirements [12], [13]. These services require high throughput, varying from 1.0 Gbps up to 1.0 Tbps, and ultra-low latency of less than 10 ms [12] [14]. Short response times are crucial for new latency-sensitive applications in the IoT, VR and Augmented Reality (AR), autonomous vehicular, and UAV remote control applications. For example, to remotely control a process under changing and critical conditions, very short-term information exchange is required between sensors, controllers, and actuators to achieve real-time, high Quality of Control (QoC). Another potential challenge lies in the intensive computing required for capturing and encoding the volumetric scene of a remote location. Among other things, one of the encountered problems is the choice of devices that a UAV can carry. In order to effectively control the UAVs, the weight of the onboard equipment must be taken into account. Therefore, a trade-off between performance and weight must be considered judiciously.
Advanced 360° cameras are too heavy for a UAV to take off or be controlled accurately. Lightweight 360° cameras, on the other hand, come with limited computing resources and capabilities. Thus, enabling edge computing will alleviate the computational burden of these lightweight devices. One of the many improvements that 5G promises, in addition to enhancing New Radio (NR), Millimeter Wave (mmWave), and energy efficiency [15], is to incorporate network automation and a micro-services-based architecture that could improve data rates, latency, and reliability [16]. Such an architecture will offer network elasticity and softwarization, which will ease the deployment of edge services and enhance their portability in the sense that they are capable of running on any server [17] [18].
Previous studies provide only solutions for immersive UAV tele-existence, using either a stereo camera [19] or an array of cameras [20], without considering the option of remotely controlling the device. Moreover, these solutions consider only how to efficiently control the viewpoint remotely from a HMD using non-360° cameras, whereas the service's reliability and delay analysis are not considered.
Furthermore, much of the research up to now deals with immersive 3D video streaming and more specifically point cloud video delivery that provides a 6-DoF (Degree of Freedom) view experience [21] [22] [23] [24].
Most of the studies on immersive services focus only on leveraging viewport-adaptive streaming to reduce bandwidth consumption, with no consideration of streaming latency. In tile-based solutions, tiles within the user's FoV are assigned high priority [25]. Moreover, based on HMD predictions, several versions of the video can be created; an optimal solution should therefore be able to deliver the version with the highest quality [26].
To increase the quality of experience, the work in [27] created dynamic heatmaps representing the user's view probability of a navigated 360° video while taking network resources into account. Similarly, the experimental setup in [28] assessed the influence of variations in 360° video geometric layouts, with an analysis of how to incorporate the solution into MPEG DASH streaming [29]. As future work, the authors stated that they would investigate how to apply the solution to live video streaming, because generating different representations of the 360° video is extremely time-consuming and difficult to implement on the fly. All of the above works are very promising for minimizing bandwidth while increasing video quality, except that they are intended for Video on Demand, which is not subject to ultra-low-latency video delivery requirements.
Nevertheless, few articles have addressed the End-to-End (E2E) latency of immersive real-time live media streaming from 360° video cameras, and there has been little quantitative analysis of E2E delay measurements. In addition, very little consideration has been paid to the Glass-to-Glass (G2G) latency. For instance, the work in [30] proposed a live omnidirectional video streaming system leveraging viewport prediction and offloading the encoding to reduce bandwidth and latency. However, viewport prediction is time-consuming and its E2E latency exceeds 2 s. The work in [31] presented an efficient transcoding method for a 360° CCTV system; however, the chosen protocols and transcoding parameters lead to a delay reaching 7 s, which is not suitable for real-time surveillance. The work in [32] proposed a rate-distortion scalable multicast system for live 360° streaming based on machine learning for viewport prediction, yet the system was not analyzed in terms of E2E latency, which is an important parameter for live video. On the other hand, the work in [33] analyzed the delays of a proposed 360° testbed that comprised an omnidirectional camera prototype. The reported E2E delay was quite good, around 800 ms, but only at the low resolution of 1080p, owing to the use of HTTP-based streaming at the client side.
To sum up, 360° video streaming is bandwidth- and compute-intensive and latency-sensitive. Additionally, many articles have studied commercial platforms that support 360° video streaming, such as Facebook and YouTube, in terms of E2E latency [34] [35]. Most of these studies state that current commercial platforms suffer from frequent re-buffering and a very high delay of 25 s to 60 s [35].
In contrast, in the system proposed in this paper, a full analysis of E2E VR-based remote UAV control from a communication perspective is considered, with an average video streaming delay of 700 ms. The aim of this paper is to study the behavior of VR-based remote UAV control under different network access technologies, and to provide a suitable software and hardware architecture as well as recommendations to achieve high reliability and full immersiveness. In this study, an initial investigation was carried out to choose the software and hardware that allow an immersive service to be achieved. Then, the architecture was defined with different HW and SW components for remotely controlling the UAV using either wireless or mobile networks. Furthermore, a practical method based on Video Multimethod Assessment Fusion (VMAF) [36] is provided to measure the objective video quality of the 360° video stream.
Moreover, a particular focus is given to E2E UAV command and control by developing a hardware-based measurement tool. This tool measures the video streaming and control command latency, which is referred to in this paper as "Glass-to-Reaction-to-Execution latency" (GRE). GRE is the time from the moment an event (typically a UAV approaching an object) is captured by the UAV camera until the user's reaction has been executed by the UAV to avoid hitting that object. Note that this latency information is essential when navigating, especially when controlling the UAV remotely in areas requiring maximum responsiveness, such as cramped areas and/or areas with many obstacles. In such cases, the UAV approaching an obstacle must be highly responsive, and its control must account for the latency and dynamic information of the UAV in order to avoid colliding with that object.
The core contributions of this paper are summarized as follows:
• Design a system that allows the remote control of a UAV using a suitable real-life hardware- and software-based architecture for maximum immersion, while allowing optimal and reliable control of UAVs.
• Provide a detailed analysis of the GRE latency under different conditions, namely, i) network access technologies such as LTE, 5G, and WiFi, ii) user reactions, and iii) video qualities. Furthermore, to measure the GRE latency and its component delays precisely, a hardware-based technique is proposed (in contrast to hardware-based techniques, software-based techniques affect the GRE latency itself).
• Conduct the experiments in a real-life testbed, allowing exhaustive testing and validation. It is worth noting that our solution for remotely controlling a UAV with controllers and body movements, based on 360° video feedback from a camera on the UAV, has been successfully implemented and tested in real life.
The rest of this paper is organized as follows. In Section 2, an overview of remote service use cases and their requirements is presented. In Section 3, the system architecture is described in detail. In Section 4, extensive experiments are conducted and analyzed. Finally, Section 5 concludes the paper by highlighting the research challenges and future research directions.

REMOTE SERVICES: AN OVERVIEW
Remote services are increasingly important enabling technologies for 5G/B5G networks. In fact, they provide the ability to remotely control, with a high degree of safety, heavy machinery and precision-critical equipment, for instance in surgery, medical services, or vehicles. A fundamental challenge of these services, especially for those that need the strongest guarantees, lies in their strict connectivity requirements. Streaming of the remote location is needed to control the remote device, and upon receiving this stream, a user can control the remote device by sending the needed commands (e.g., hand gestures). These requirements tolerate no late or lost packets, as such an event could lead to a disruption or failure in the planned service. The required network latency is less than 5 ms, with a network reliability of 99.999% [37] [38].
The objective of remote services is to create a sense of presence between the user and the remote location. These remote services can adopt a three-dimensional or a traditional 2D interactive video stream environment that is rendered in real time. Although the term VR is nowadays associated with immersive VR, VR can also be non-immersive, allowing the user to view a remote environment on a desktop or projection screen [39]. The degree of immersiveness depends on how much the user is isolated from the physical environment [40].

Non-immersive remote services
Non-immersive remote services are concretely deployed using fully or partially traditional interaction devices. In the medical field, a robot-based vocal cord telesurgery was recently achieved using 5G and a Multi-access Edge Computing (MEC)-based core network from a distance of 15 km to the actual site of the surgery [41]. Since VR is now so widely used in daily life, non-immersive virtual environments are often overlooked as part of the VR category. This technology creates a computer-generated world while allowing the user to remain conscious of their physical surroundings and monitor them. VR systems that are not fully interactive rely on a computer or video game console, a monitor, and input devices such as keyboards.
Although non-immersive services are more readily accepted by users since they cause less motion sickness, deploying them offers only weak interactivity through 2D end-user devices (i.e., equipment with a screen/monitor and simple input devices such as a keyboard and mouse). This results in an increase in latency caused by many extra functionalities such as pose redirection [19]. Alternatively, VR applications provide full immersion of the users in the remote location. 360° video streaming delivers a more immersive viewing experience to end users. However, it faces tremendous challenges because of the high resolution and short response time requirements.

Immersive remote services
Immersive services provide an intuitive way to control remote robots, transporting the user to a virtual remote space where they can control these robots via natural hand movements or through virtual interactions with them. Immersive technology integrates virtual content into a physical environment so that the user naturally interacts with mixed reality.
Nevertheless, high-quality immersive services require high-throughput, low-latency, and reliable connectivity. One of the benefits of 5G is that it provides a higher uplink capacity than 4G, which partially addresses the quality and latency issues encountered in high-resolution 360° streaming. Therefore, 5G helps ensure reliable remote control of devices, as well as increased efficiency and lower risk in hazardous environments. Undoubtedly, immersive services are now thriving in many industries. The following subsections present the different use cases of remote immersive services and their requirements.

Use cases
Nowadays, VR and AR are thriving within some sectors as follows.
Manufacturing: Manufacturing is one of the most promising fields expected to benefit from VR. Industry 4.0 is going to automate traditional practices, and IoT and machine-to-machine communications will improve communication and monitoring. Automotive companies use VR to improve the design of their vehicles. Previously, creating physical samples of cars to improve and test them wasted significant time and money. Now, thanks to VR, manufacturers can make adjustments virtually and in real time. In addition, researchers are developing VR-based systems for remote manufacturing inspection and monitoring. For instance, in [42], a Cyber-Physical System (CPS) architecture that uses WebVR was developed to display real-time visual 3D models based on data from IoT sensors and actuators, leading to optimized production and assistance. In [43], a novel concept was proposed for remote inspection and anomaly detection in manufacturing machines. It relies on 360° video streams with virtual embedded information. However, it uses a low-resolution stream (i.e., 720p), and both the camera and the HMD were connected to the same computer. Moreover, no analysis of the video stream or interaction latency was provided.
Training and Education: VR and AR technologies can help in education, especially after the outbreak of COVID-19 and the massive reliance on distance learning. This unexpected situation shows the need for interactive distance learning, which VR/AR can provide. Other examples include learning to drive before trying a real car, or visiting distant places and interacting with their objects and surroundings. Some experiences, such as ocean exploration, are difficult to have in real life, and others, such as traveling across half the globe or exploring other worlds and galaxies, are impractical. This area is expected to be one of the most affected by VR since it reproduces the user's natural spatial abilities. In [44], the authors presented the challenges and scenarios of integrating 360° video streaming in higher education for e-learning environments. One of the challenges they encountered is that HMDs cannot be offered to all viewers and, therefore, the solution should instead be web-based, using WebVR.
Smart Healthcare: VR also has a major impact on the smart healthcare sector. This interactive environment gives doctors the means to create 3D models of human anatomy, based on scans, to practice critical medical procedures. This is expected to improve the performance of operations and the well-being of patients [45]. In [46], a low-cost VR system for next-generation rehabilitation was developed. This system recognizes body motions, allowing users to see their movements, and provides health status using worn medical sensors. In [47], a proof-of-concept for a tele-medicine technology platform was presented. It provides an interactive experience through 360° video in an augmented virtual world. However, this concept is not suitable for ultra-low latency services, as one of the top priorities stated in their perspectives is to reduce video delay.
UAV and vehicle control: Multimedia and VR technology can also be used to help first responders in natural and man-made disasters, where conditions often make it extremely difficult for humans to reach the disaster area [48]. In such cases, UAVs can also help to measure damage, establish communications, and provide better visibility to the rescue teams. In [49], the authors investigated the possibility of navigating and manipulating a UAV indirectly from an exocentric perspective using drone-augmented human vision. However, their solution used low-resolution streamed images (640×480) and performed all computations onboard the UAV.

Requirements and challenges in 360° video streaming
360° applications are bandwidth-intensive, and therefore optimizing their E2E latency through the network is very challenging [12], [13]. Furthermore, E2E video latency depends strongly on the encoder, the target resolution, the bit rate, and most importantly the streaming protocol, such as the Real Time Messaging Protocol (RTMP), the Real Time Streaming Protocol (RTSP), or HTTP Live Streaming (HLS), to name a few that are supported by most 360° cameras [50], [51], [52]. Moreover, compared to standard streaming systems, interactive VR applications will push connectivity requirements to their limits due to their strict requirements in terms of capacity, low latency (less than 50 ms), and consistent Quality of Service (QoS).
Consequently, the design of the system presented in this paper has largely been influenced by current implementation technologies related to immersive services and UAVs. Compared to lightweight cameras, advanced 360° cameras have extra functionalities (high frame rate, 8K streaming, internal RTMP server) but at the cost of high weight. Moreover, UAVs are only able to carry light cameras, and these light cameras are not well suited to such immersive services for various reasons, such as i) protocol dependencies, ii) hardware (HW) and software (SW) limitations, and iii) user-centric considerations. Furthermore, to obtain our optimized 360° video streaming architecture, described in Section 3, many experiments were carried out at the beginning of this study to test and analyze various streaming protocols (using different types of media servers, as presented in Table 1). These protocols were tested on 2D and 360° cameras. The experiments were carried out over an ideal-bandwidth link by connecting the server and the video client to the same network [53], [54].
Table 1 shows the results of these experiments. Overall, the WebRTC protocol exhibits much lower G2G latency than the other protocols. However, current implementations of 360° cameras do not support streaming using this protocol. Given these results, our design should follow the setup that supports 360° cameras and provides the lowest latency. Therefore, Experiment 6 represents the best option for building such a system, with the following considerations:
Protocol dependencies: Unfortunately, most 360° cameras only support streaming via RTMP. This protocol was designed for broadcast purposes and requires the deployment of an RTMP server in the public/edge cloud continuum.
User-centric considerations: The high mobility of UAVs and the effect of wind may cause instability in the video and, consequently, dizziness for the user of the VR HMD. Therefore, in such cases, a HW or SW image stabilizer should be used to provide a more reliable video stream. However, using a SW stabilizer may add to the G2G latency.
HW and SW limitations: UAVs can only carry lightweight 360° cameras. However, these cameras come with limited resources and provide streaming through a smartphone using WiFi Direct. They provide low 4K streaming rates, at a maximum of 30 Frames Per Second (FPS). Their battery lasts for 1 h and they heat up while streaming, which adds some delay.
Access control: The issues of access control, security, and privacy are always a major concern for users. Recently, blockchain technology has been introduced as a viable solution for access control problems in decentralized systems [55].
Bandwidth: The current generation of 4K 360° videos requires 20-100 Mbps of upload and download bandwidth, which is approximately 6x the bandwidth of a regular 2D video streamed at a similar resolution [56]. Although 4G LTE can deliver this bandwidth, for High Definition (HD) video, LTE needs to provide higher throughput, latency below 10 ms, and 1 ms jitter per user [57]. Moreover, most of the approaches widely used in the literature to reduce the download bandwidth are analogous to the recent Omnidirectional MediA Format (OMAF) [58], established by the Moving Picture Experts Group (MPEG) [59], [60], [61], [62]. However, OMAF is not tailored to low latency: the first MPEG-OMAF standard-compliant end-to-end chain for live VR360 [63] showed a high E2E delay of 6 s for a 6K x 4K resolution video. The work in [64] proposed a tile-based viewport-dependent solution compliant with MPEG-OMAF and showed an end-to-end delay of more than 1 s.
Latency: One of the most critical requirements of immersive services is the G2G or motion-to-photon (MTP) latency [53], [54]. VR/AR applications are the most latency-demanding applications, as the latency between the physical movement of a person's head and the updated photons from the Head Mounted Device (HMD) display reaching the person's eyes must be very small. This is because human sensory systems can detect very small delays. To avoid this motion-to-photon issue, rendering and display latency at the HMD must be less than 15-20 ms. Furthermore, the overall E2E video delay must be less than 1 s for the interaction with the information to feel smooth [65]. In this work, we focus more on the E2E video delay.
Frame rate: In order to have a smooth immersive experience, the frame rate must be high. When input video sources have a fixed resolution (4K/8K), the video output seen by the viewer must be within the appropriate range [66].

VR-BASED UAV REMOTE CONTROLLING
It is worth noting that our system is a real-life testbed that allows remote control of a UAV using a VR HMD. As will be shown, the testbed's latency results for the video and UAV control overcome most of the challenges discussed earlier.

Description
This section describes the functionalities of the proposed architecture for the VR-based remote control of a UAV, with the aim of providing fully immersive services. Each component is presented alongside its interactions with the other components. Two key architectures are presented, a global and a detailed architecture (shown in Figures 1 and 7, respectively). The global architecture in Figure 1 shows the overall components of the system, whereas the detailed architecture in Figure 7 provides the measured delays and the complete testbed, demonstrating the components' interactions in chronological order.
Our architecture, which is a real-life implementation, includes micro-services-based components at the edge server. Components are containerized in order to provide portability and scalability. The system aims at providing end users with the ability to control remote UAVs based on a 360° stream and sensed data from the remote location. At the remote location, the UAV is equipped with a 360° camera that live streams to the user's HMD. On the user side, upon receiving the camera's stream, the user can control the remote UAV using body movements and HMD controllers. The components of the system are described as follows:
Remote UAV: The remote UAV is equipped with i) a Single-Board Computer (SBC), ii) a flight controller, iii) IoT sensors, and iv) a 360° camera. Both the SBC and the camera communicate with the system through WiFi, 5G, or 4G. The SBC controls the UAV by receiving the user commands from the edge applications and transmitting them via USB to the flight controller. The 360° camera live streams an RTMP video to the VR edge application streaming module (FFmpeg RTMP server [67]) through the SBC.
VR users: Upon receiving the 360° video stream, the user views the real-time stream through a Web platform using their HMD and controls the UAV remotely using the HMD's right-side controller (as shown in Figure 2(A)). The HMD sends these commands to the UAV edge application. The user can control the Field of View (FoV) with full 6-DoF, using head and body movements (Figure 2(C)) or the HMD's left-side controller (Figure 2(B)). Furthermore, with body movements, the user can control the altitude of the UAV, which descends or climbs when the user crouches or stretches, respectively, as shown in Figure 3.
Edge Computing: Edge computing is used to reduce the latency and alleviate the processing burden on the UAV and HMD. The latter is achieved by offloading heavy tasks. Two edge applications are deployed near the UAV, namely the streaming module, which relays video data, and the control and monitoring module, which transmits IoT and UAV command data, as shown in Figure 1.
Streaming module: The first edge application deployed near the HMD is the streaming module. This application is composed of an RTMP server and a Web Real-Time Communication (WebRTC) proxy, allowing the real-time video stream to be transmitted to the web application with the lowest possible latency. The streaming module converts the RTMP stream into WebRTC to reduce video latency. We have chosen WebRTC since it is well known for its ultra-low latency and web support [43]; hence, it provides 360° video to any device able to access the web.
Web server: The web server serves the WebVR application. It is the user's interface for viewing the status information of the UAV and the 360° video stream through an HTML5 video player [68] adapted to play 360° video. It manages the WebSocket video stream from the WebRTC server and synchronizes the different video inputs (UAV video streams) with the outputs (video players requesting a given stream). The 360° video can be viewed on any device able to run a web browser. WebVR was chosen mainly because it allows an immersive view on any device that has access to a web browser, from a simple cardboard viewer to a HMD.
Control and monitoring module: It is composed of a Message Queuing Telemetry Transport (MQTT) broker in charge of two functions: i) forwarding the user control commands from the web application to the flight controller module via one of the most popular communication protocols for UAVs, namely the Micro Air Vehicle Link (MAVLink) protocol [69], and ii) updating the user with the sensor information of the UAV, such as altitude, latitude, longitude, and speed, as well as LTE- and 5G-relevant information from the dongle connected to the UAV. This information is integrated within the 360° immersive view of the HMD. It is visualized by clicking on virtual elements within the immersive view, as shown in Figure 4.
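To make the command path of the control and monitoring module more concrete, the following is a minimal sketch of how a control message arriving from the broker might be decoded on the SBC before being handed to the flight controller. The JSON schema, the action names, and the velocity values are hypothetical and are not taken from the testbed; the MQTT transport and the MAVLink encoding are deliberately left out.

```python
import json

# Hypothetical action table: controller action -> (vx, vy, vz, yaw_rate).
# Velocities in m/s (body frame, z pointing down), yaw rate in rad/s.
# These names and values are illustrative assumptions only.
ACTIONS = {
    "forward": (1.0, 0.0, 0.0, 0.0),
    "backward": (-1.0, 0.0, 0.0, 0.0),
    "left": (0.0, -1.0, 0.0, 0.0),
    "right": (0.0, 1.0, 0.0, 0.0),
    "up": (0.0, 0.0, -1.0, 0.0),
    "down": (0.0, 0.0, 1.0, 0.0),
    "hover": (0.0, 0.0, 0.0, 0.0),
}


def decode_command(payload: bytes, speed_scale: float = 1.0) -> tuple:
    """Turn a JSON control message into a velocity setpoint tuple.

    The payload is assumed to look like b'{"action": "forward"}'.
    Unknown actions raise ValueError so they are never forwarded.
    """
    message = json.loads(payload.decode("utf-8"))
    action = message.get("action")
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return tuple(component * speed_scale for component in ACTIONS[action])
```

In a deployment along the lines described above, the returned setpoint would then be passed to a MAVLink library for transmission over USB; rejecting unknown actions early keeps malformed messages away from the flight controller.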

PERFORMANCE ANALYSIS
This section provides an in-depth analysis of the testbed performance. The performance is analyzed in terms of Glass-to-Reaction-to-Execution (GRE) latency, Glass-to-Glass (G2G) latency, Human Reaction Latency (HRL), and Command Transmission Latency (CTL). It is worth noting that the tests were not simulated but carried out using a real-life implementation. It is important to recall that, for efficiently controlling a remote device/UAV, very low video transmission latency is essential. In this paper, GRE latency is defined as the time between the moment a motion or an event is captured by the UAV's 360° camera and the moment a user's reaction to this event has been received and executed by the UAV. GRE is particularly important for determining how the system reacts to predicted events (e.g., approaching an obstacle such as a wall). GRE comprises three delays, namely: i) G2G latency, ii) human reaction latency, and iii) command transmission latency. These different delays were measured following the detailed architecture in Figure 7.
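If the three components are read as sequential, non-overlapping stages (an assumption made here for illustration; the text states only that GRE comprises these three delays), the decomposition can be written as:

```latex
\mathrm{GRE} \approx \mathrm{G2G} + \mathrm{HRL} + \mathrm{CTL}
```

Under this reading, reducing any one of the three terms lowers the overall responsiveness figure by the same amount.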
This architecture was implemented using the hardware configuration shown in Figure 5. It is composed of an Oculus Quest 2 as a HMD, a UAV equipped with a flight controller and a SBC, another SBC for G2G latency measurement, and an Insta 360 One X as a 360° video camera. We used a local wireless network as well as the 4G and 5G networks of a Finnish telecom operator, the characteristics of which are summarized in Table 2. Note that 5G speeds drop markedly (by as much as 7.5 times) when the receiving devices are not in direct line of sight of the antennas, which remains one of the biggest challenges of 5G [70].
The term G2G is defined in the literature as the delay between the moment an event is captured by a camera (for instance, a 360° camera) and the moment the event is shown on the display (here, the HMD display). The human reaction latency refers to the time a user takes to perceive a visual event and react to it. Finally, the command transmission latency is the time a user command takes to reach and be executed by the UAV. Table 3 summarizes the parameters used in this paper. Figure 6 illustrates the analyzed delays.

TABLE 3
The notations used in this paper.

Parameter   Description
G2G         Glass-to-Glass latency
GRE         Glass-to-Reaction-to-Execution latency
T_LED       The LED blink time at the UAV
T_LS        The time a light sensor detects the LED blink from the HMD lenses
T_UAVU      The time a user command is received at the UAV
T_UAVS      The time a light sensor command is received at the UAV
HRL         The human reaction latency (visual stimuli delay)
CTL         The command transmission latency
SRL         The sensor reaction latency
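Using the notation of Table 3, the measurement procedure in this section takes G2G as the difference between T_LS and T_LED, and GRE as the delay between T_LED and T_UAVU. The following is a minimal sketch of that arithmetic over a set of recorded trials. The `Trial` structure and function names are hypothetical, and the timestamps are assumed to be expressed in seconds on a common time base.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trial:
    """Timestamps for one LED blink, in seconds on a shared time base."""
    t_led: float    # T_LED: LED blink time at the UAV
    t_ls: float     # T_LS: light sensor detects the blink at the HMD lens
    t_uav_u: float  # T_UAVU: user command received at the UAV


def g2g_ms(trial: Trial) -> float:
    """Glass-to-Glass latency in milliseconds: T_LS - T_LED."""
    return (trial.t_ls - trial.t_led) * 1000.0


def gre_ms(trial: Trial) -> float:
    """Glass-to-Reaction-to-Execution latency in ms: T_UAVU - T_LED."""
    return (trial.t_uav_u - trial.t_led) * 1000.0


def summarize(trials: list) -> dict:
    """Average G2G and GRE over all trials (hypothetical helper)."""
    return {
        "g2g_ms": mean(g2g_ms(t) for t in trials),
        "gre_ms": mean(gre_ms(t) for t in trials),
    }
```

Averaging over many blinks is what a hardware-triggered setup makes practical, in contrast to methods that require comparing images manually.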

Metrics measurement
Several techniques have been proposed to measure G2G latency. Software-based techniques encode timestamps within the stream frames and retrieve them at the receiver. G2G is then calculated as the difference between the local system time and the decoded timestamp. Timestamps based on EAN-8 barcodes are more commonly used than numbers and characters, since they are accurate and easy to decode [71] [72]. However, these techniques are compute-intensive and may strain both sender and receiver resources to encode/decode the timestamps. On the other hand, hardware-based techniques [73] do not involve the system resources, but instead use external tools and devices to measure the G2G delay. MacCormick [74] and Hill et al. [75] measured the G2G delay using a camera that films a clock on a computer screen, with its stream displayed on a second screen; the difference between the clock readings on the two screens represents the G2G delay. However, the system required manual intervention to compare the clock images, which limits the number of latency samples that can be collected and thus the accuracy. Robert et al. [76] used a similar method and measured visual latency on video for AR devices with a hardware instrumentation-based measurement method. However, their method requires manually comparing the event source and the screen of the HMD. Xu et al. [77] compared the G2G delay of video conferencing tools such as Microsoft Skype and Google+ using a method analogous to that of Hill et al. [75]. In their method, the clock readings were not compared manually but extracted automatically from pictures using Optical Character Recognition (OCR) software. However, OCR cannot be used on panoramic frames such as those of 360° video, since images are stretched and the clock's digits may therefore not be recognized.

In this paper, a hardware-based technique inspired by [78] and applied to the VR application on the HMD is used to measure G2G, GRE, HRL, and CTL. The method follows the architecture in Figure 7, with the hardware shown in Figure 5. It consists of triggering a light source, a Light Emitting Diode (LED), using the SBC attached to the UAV, within the camera's Field of View (FoV) at $T_{LED}$. The LED's flash is captured by the 360° camera and streamed to the HMD display. The light sensor connected to the HMD captures the blink at $T_{LS}$ and triggers its SBC. The G2G delay is then calculated as the difference between $T_{LS}$ and $T_{LED}$. Since the UAV's SBC and the SBC equipped with the light sensor are connected through a wire, and the UAV's SBC notifies the light sensor's SBC when the LED is triggered, the G2G delay can be measured at the latter SBC as follows:

$$\mathrm{G2G} = T_{LS} - T_{LED}$$

Furthermore, building on the G2G latency measurement tool, the GRE latency is measured using the SBC on the UAV to which the LED is attached. This SBC blinks the LED at a constant frequency, and the LED is placed in front of the camera to simulate an event, such as an approaching object or obstacle. The user recognizes this object/obstacle through the HMD display and reacts by pressing a button on the HMD controller (e.g., to turn away from that object). This command reaches the UAV at $T_{UAV_U}$, where it is executed. The GRE is then measured as the delay between the moment the SBC blinks the LED, $T_{LED}$, and the moment the user command is received by the UAV, $T_{UAV_U}$.

Simultaneously, since the user wears the HMD with a light sensor placed on one of the HMD lenses, the same setup can be used to measure the light sensor's reaction latency, which comprises the command transmission latency and the G2G delay (as illustrated in Figure 6). Once the light sensor detects the light at $T_{LS}$, the SBC to which it is attached reacts and sends a command to the UAV. This command reaches the UAV at $T_{UAV_S}$, and the Sensor Reaction Latency (SRL) can be computed at the SBC of the UAV as follows:

$$\mathrm{SRL} = T_{UAV_S} - T_{LED}$$

This latency can also be expressed theoretically as follows:

$$\mathrm{SRL} = \mathrm{G2G} + \mathrm{CTL}$$

CTL represents the E2E transmission latency of commands, excluding the latency needed for processing and rendering the video, as shown in Figure 7. Measured from the moment the user triggers the HMD's right controller until the reception of the command by the UAV, this latency consists of two main delays, since the transmission involves two protocols: one from the user to the edge server, based on the MQTT protocol, and a second from the edge server to the UAV, based on the MAVLink protocol. For the MQTT protocol, we start a timer at the user side when a command is sent to the MQTT broker at the edge. We configured the subscriber (i.e., the flight controller module) to echo a message to the client once it receives a command. The time difference between when the command is sent and when the echo is received represents the round-trip latency of the MQTT communication. Therefore, we take the MQTT protocol latency to be the round-trip latency divided by two. The same is done for the MAVLink protocol latency: we use a Ping library implemented within the telemetry messages and calculate the round-trip latency.
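The timestamp differences and the round-trip halving described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the testbed's code: the class and function names are hypothetical, timestamps are assumed to be expressed in seconds on a common clock, SRL is assumed to be the difference between the light-sensor command arrival and the LED blink, and CTL is assumed to be the sum of the two one-way protocol estimates.

```python
from dataclasses import dataclass


@dataclass
class BlinkTimestamps:
    """Timestamps (seconds) for one LED blink, named after Table 3."""
    t_led: float    # LED blink time at the UAV
    t_ls: float     # light sensor detects the blink on the HMD lens
    t_uav_u: float  # user command received at the UAV
    t_uav_s: float  # light-sensor command received at the UAV


def latencies_ms(ts: BlinkTimestamps) -> dict:
    """Derive the per-blink delays, in milliseconds, from timestamp differences."""
    return {
        "G2G": (ts.t_ls - ts.t_led) * 1000.0,
        "GRE": (ts.t_uav_u - ts.t_led) * 1000.0,
        "SRL": (ts.t_uav_s - ts.t_led) * 1000.0,  # assumed reading of SRL
    }


def one_way_ms(rtt_ms: float) -> float:
    """Approximate a one-way latency as half the echo round-trip time."""
    return rtt_ms / 2.0


def ctl_ms(mqtt_rtt_ms: float, mavlink_rtt_ms: float) -> float:
    """Assumed CTL estimate: MQTT one-way plus MAVLink one-way latency."""
    return one_way_ms(mqtt_rtt_ms) + one_way_ms(mavlink_rtt_ms)


# Hypothetical values, chosen only to illustrate the arithmetic.
sample = BlinkTimestamps(t_led=10.0, t_ls=10.5, t_uav_u=10.875, t_uav_s=10.625)
print(latencies_ms(sample))
print(ctl_ms(mqtt_rtt_ms=120.0, mavlink_rtt_ms=80.0))
```

Halving the round-trip time assumes that the uplink and downlink paths are roughly symmetric, which may not hold on cellular links.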

Results analysis
This section presents an analysis of the results obtained through experiments on the system. While the HMD uses WiFi to access services running at the edge and control the remote UAV, the UAV connects to the edge server through WiFi, LTE, or 5G. The 360° camera streams at 30 FPS and encodes video using the H.264 codec [79]. Figure 8 shows the GRE latency for different streaming bit rates (Mbps), for different access networks of the UAV (i.e., WiFi, LTE, 5G), and for two video qualities (i.e., HD, 1280×720, or 4K, 3840×1920). Within the same setup, two scenarios are considered, namely with and without human reaction latency. Figure 8 (a) plots the measured G2G latency. The results demonstrate that the G2G delay increases with the Constant Bit Rate (CBR) used for encoding. This is intuitively due to network bandwidth limitations and lower throughput when increasing the streaming CBR. Moreover, the increase is more noticeable in the case of 4G, especially for 4K 360° videos. High network latency and bandwidth limitations in 4G are the main factors behind the increase in the G2G latency. On the other hand, 5G shows very good results compared to 4G, since the G2G latency obtained with 5G is close to the G2G latency experienced when the UAV is connected through a dedicated WiFi connection.
Figures 8 (b) and (c) show the GRE latency and the Sensor Reaction Latency (SRL), respectively. Both metrics are measured simultaneously, once the LED blink is displayed on the HMD. Both the user and the light sensor react to this blinking. The user reacts by pressing the controller button, whereas the light sensor reacts by sending a command to the UAV. Naturally, the sensor reacts faster than a human being. From these figures, we also observe that the GRE and SRL increase in the same fashion as the G2G latency when the streaming bitrate is increased. This is expected, as both latencies include the G2G delay.
Overall, the average GRE is about 900 ms (Figure 8 (b)). This latency represents the overall E2E round-trip latency of the system, from the moment an action occurs or an event is detected and streamed by the camera until the execution of the user's command at the UAV. As illustrated in Figure 6, this latency comprises the G2G latency, the HRL (200 to 400 ms), and the CTL.
It is worth noting that each sample in the graphs shown in Figures 8 (a), (b), and (c) represents an average of the results obtained from 40 iterations of the experiment described above. Iterations were carried out by different people (i.e., from the research team) in order to minimize the impact of individual human reaction latency on the overall analysis. Indeed, the human reaction latency, also known as the visual stimuli delay, differs from one individual to another, but a visual stimulus is known to be perceived approximately 200 ms after it occurs [80]. We measured the HRL by having different persons react to the LED blinking without wearing the headset. The visual stimuli delay is then the time an individual takes to react to an LED blink without wearing a headset, minus the command transmission delay. As shown in Figure 6, an offset delay is what differentiates the sensor reaction and GRE delays. This offset corresponds to the HRL. SRL consists of the G2G latency and the command transmission latency. Figures 8 (d) and (e) illustrate the measured human reaction latency and the command transmission latency, respectively. The human reaction latency is independent of network delays and depends only on each person's visual reaction delay. Therefore, this delay tends to converge towards a constant value of around 220 ms. For the measured command transmission delay, 5G shows delays close to those of the WiFi network, with a mean of 103 ms for 5G and 88 ms for WiFi, whereas 4G tends to have a higher delay, with an average of 138 ms. This is mostly due to the network latency and bandwidth limitations of 4G compared to WiFi and 5G.
Furthermore, to assess the video quality, we measured an objective and a subjective quality assessment metric, namely the View-Port Peak Signal to Noise Ratio (VP-PSNR) and Video Multimethod Assessment Fusion (VMAF), developed by Netflix [36], based on the open-source library FFmpeg360 [81]. VMAF is a Full Reference (FR) metric that combines multiple secondary metrics using machine learning to offer a good prediction of subjective video quality (human perception) on a scale of 0 to 100. It was first designed to assess 2D video quality, but its ability to work with 360° VR content without any adaptation is validated by [82]. The VP-PSNR is an objective video quality metric used to measure the distortion introduced by encoding during video transmission. To measure the VP-PSNR and VMAF values, we followed these steps:

• Step 2: Generate the view-rendered video from the reference video by applying a filter from the FFmpeg360 library to it. The filter replicates the user's view at a given orientation (pitch, yaw, and roll), which was 0°, 0°, and 90° in our case. The resulting video is referred to as the reference-view video.

• Step 3: Stream this reference video through the Internet and our streaming system, and record it at the receiving HMD while it is viewing at the 0°, 0°, 90° orientation. We call this recorded video the user-view video.

• Step 4: Compare the visual quality of the original reference-view video and the user-view video by applying the PSNR and VMAF filters provided by FFmpeg.

• Step 5: Repeat steps 3 and 4 while changing the streaming rate, and repeat all of the steps for an HD reference video.
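
The comparison in Step 4 can be sketched as a small Python wrapper around the FFmpeg command line. This is a minimal sketch under stated assumptions rather than the procedure actually used: the function names and file names are hypothetical, the `libvmaf` filter is only available when FFmpeg is built with libvmaf support, and both videos are assumed to have the same resolution, frame rate, and frame alignment.

```python
import subprocess


def quality_cmd(user_view: str, reference_view: str, metric: str) -> list:
    """Build an FFmpeg command that scores user_view against reference_view."""
    if metric not in ("psnr", "libvmaf"):
        raise ValueError(f"unsupported metric: {metric}")
    return [
        "ffmpeg",
        "-i", user_view,        # first input: the received (distorted) video
        "-i", reference_view,   # second input: the reference video
        "-lavfi", metric,       # two-input comparison filter
        "-f", "null", "-",      # discard the output; scores appear in the log
    ]


def run_quality(user_view: str, reference_view: str, metric: str) -> str:
    """Run the comparison and return FFmpeg's log (requires ffmpeg on PATH)."""
    result = subprocess.run(
        quality_cmd(user_view, reference_view, metric),
        capture_output=True, text=True, check=True,
    )
    return result.stderr  # FFmpeg writes its log, including the scores, to stderr


# Hypothetical file names; only the command is built here, nothing is executed.
print(" ".join(quality_cmd("user_view.mp4", "reference_view.mp4", "psnr")))
```

Repeating the call for each streaming rate, as in Step 5, gives one score per rate.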
The VP-PSNR and VMAF measurements were done considering different streaming rates, as shown in Figure 9(a). The values represent the mean VP-PSNR and VMAF results, for each streaming rate, over a 500-frame video stream, for both 4K and HD quality streamed over WiFi. Figure 9(b) plots the variation of the VP-PSNR and VMAF values for each of the 500 frames of a 360° HD video streamed at a rate of 2 Mbps. From Figure 9(a), we observe that the values of VMAF and VP-PSNR increase with the streaming rate for both HD and 4K video streams. This increase is explained by the lower distortion at higher streaming rates. We also notice that 4K videos get more distorted than HD videos when streamed at low rates, which is due to 4K videos containing more data than HD videos. Overall, the quality assessment of our received streams was satisfactory, since the VMAF value at the lowest streaming rate, 2 Mbps, is 40 for 4K and 50 for HD. In contrast, it reaches 78 and 90, respectively, when the streaming rate is 8 Mbps. We also observe satisfactory values for VP-PSNR for both 4K and HD videos and for all the considered streaming rates. Accordingly, even at low streaming rates, our video quality remains acceptable.
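
For reference, the standard PSNR definition and the per-clip averaging mentioned above can be sketched in Python as follows. This is an illustrative sketch only: it assumes 8-bit samples and frames that have already been rendered to the viewport and flattened into sequences of pixel values, and the function names are hypothetical. The actual measurements relied on FFmpeg filters.

```python
import math


def psnr_db(reference, received, max_value=255):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(received) or not reference:
        raise ValueError("frames must be non-empty and equally sized")
    # Mean squared error between the reference and the received frame.
    mse = sum((r - d) ** 2 for r, d in zip(reference, received)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames, no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr_db(reference_frames, received_frames):
    """Average the per-frame PSNR over a clip (e.g., one streaming rate)."""
    scores = [psnr_db(r, d) for r, d in zip(reference_frames, received_frames)]
    return sum(scores) / len(scores)


# Tiny made-up frames, used only to show the call pattern.
ref_clip = [[100, 100, 100, 100], [50, 60, 70, 80]]
rx_clip = [[110, 90, 100, 100], [50, 60, 70, 90]]
print(mean_psnr_db(ref_clip, rx_clip))
```

A higher value means less distortion, which matches the trend of VP-PSNR rising with the streaming rate.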

Results validation
To validate the results provided in Figure 8, both CTL and HRL can be calculated as follows:

$$\mathrm{CTL} = \mathrm{SRL} - \mathrm{G2G}, \qquad \mathrm{HRL} = \mathrm{GRE} - \mathrm{SRL}$$

The measured GRE was validated by measuring the delays composing it and comparing them against the deduced ones, as shown in Figure 10. The HRL, CTL, and G2G latencies were measured, and their sum represents the theoretical GRE latency. The measured GRE latency was then validated by comparing it against this theoretical value.
The measured CTL and HRL results, shown in Figures 8 (d) and (e), are compared against the deduced results, as shown in Figure 11. We can see that the human reaction delay, shown in Figure 11 (a), is independent of the network and is therefore almost steady for all network types. Moreover, the error between the deduced results and the measured ones is small: 11.8 ms for WiFi experiments, 15.26 ms for 4G, and 11.59 ms for 5G. These delays are between 200 ms and 300 ms, which is similar to the results reported in the state of the art [80].
Thereafter, we compared the measured and deduced delays for the command transmission latency, as shown in Figure 11 (b). As we can see, this delay is network-dependent: delays are highest with 4G, lower with 5G, and lowest with WiFi. The error between the deduced and measured values was very small: 4.37 ms for WiFi, 10.39 ms for 5G, and 14.73 ms for 4G. These small errors between the theoretically deduced and measured delays support the validity of our measurement method.
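
The validation logic can be illustrated with a short Python sketch. It assumes the decomposition described earlier, in which SRL consists of the G2G latency plus the CTL, and the offset between GRE and SRL corresponds to the HRL. The function names are hypothetical, the error is assumed here to be a mean absolute difference (the text does not specify the error measure), and the numbers are made up for illustration, not taken from the experiments.

```python
def deduce_ctl_hrl(g2g_ms: float, srl_ms: float, gre_ms: float):
    """Deduce CTL and HRL from the measured G2G, SRL, and GRE (milliseconds)."""
    ctl = srl_ms - g2g_ms   # SRL = G2G + CTL
    hrl = gre_ms - srl_ms   # GRE = SRL + HRL
    return ctl, hrl


def mean_abs_error(measured, deduced) -> float:
    """Mean absolute difference between measured and deduced samples."""
    if len(measured) != len(deduced) or not measured:
        raise ValueError("sample lists must be non-empty and equally sized")
    return sum(abs(m - d) for m, d in zip(measured, deduced)) / len(measured)


# Made-up sample values, for illustration only.
ctl, hrl = deduce_ctl_hrl(g2g_ms=500.0, srl_ms=600.0, gre_ms=820.0)
print(ctl, hrl)
print(mean_abs_error(measured=[100.0, 110.0], deduced=[104.0, 102.0]))
```

A small error between the two series indicates that the independently measured delays are consistent with the measured GRE and SRL.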

CONCLUSION
In this paper, we proposed a real-life testbed, along with its management architecture, for VR-based remote control of UAVs, assisted by several IoT sensors. In the experiments, UAVs were reachable through a WiFi network or a 4G or 5G cellular system. In the evaluation, several delays were defined and a methodology for their measurement was proposed. The obtained results were promising and demonstrated the efficiency of the proposed VR-based UAV remote control architecture and the delay measurement method. The errors between the measured and deduced delays were very small, which validates the proposed measurement method. Whilst the obtained results were encouraging, there is still room for improvement to minimize the 360° video streaming delays, such as the use of a camera rig instead of a 360° camera, and the application of machine learning techniques to predict the Field of View of a user and to ultimately stream the watched FoV with high quality and stream the non-watched FoV with lower quality to reduce the overall

Fig. 1. The high-level architecture of the envisioned system.

Fig. 2. Operations for the control of the field of view of a UAV.

Fig. 4. Displaying the IoT sensed data in the immersive view.

Fig. 6. The different delays analyzed in this paper.

Fig. 7. Testbed's hardware and software components for VR-based UAV control and the measurement of the different considered delays.
Fig. 8. (a) G2G latency. (b) GRE latency. (c) Sensor reaction latency. (d) Measured Human Reaction Latency (HRL). (e) Measured Command Transmission Latency (CTL). Each sample in (a), (b), and (c) is an average over 40 iterations of the experiment, carried out by different people.
Fig. 9. (a) VP-PSNR and VMAF values of a 360° video streamed over WiFi at different streaming rates. (b) VP-PSNR and VMAF values of different frames of a 360° HD video streamed over WiFi at a streaming rate of 2 Mbps.

Fig. 11. (a) Validation of human reaction latency. (b) Validation of command transmission latency.

TABLE 1
Benchmark of different streaming protocols.

TABLE 2
Testbed's parameters and values.