Network-Aware 5G Edge Computing for Object Detection: Augmenting Wearables to "See" More, Farther and Faster

Advanced wearable devices are increasingly incorporating high-resolution multi-camera systems. As state-of-the-art neural networks for processing the resulting image data are computationally demanding, there has been growing interest in leveraging fifth generation (5G) wireless connectivity and mobile edge computing for offloading this processing to the cloud. To assess this possibility, this paper presents a detailed simulation and evaluation of 5G wireless offloading for object detection within a powerful, new smart wearable called VIS4ION, for the Blind-and-Visually Impaired (BVI). The current VIS4ION system is an instrumented book-bag with high-resolution cameras, vision processing and haptic and audio feedback. The paper considers uploading the camera data to a mobile edge cloud to perform real-time object detection and transmitting the detection results back to the wearable. To determine the video requirements, the paper evaluates the impact of video bit rate and resolution on object detection accuracy and range. A new street scene dataset with labeled objects relevant to BVI navigation is leveraged for analysis. The vision evaluation is combined with a detailed full-stack wireless network simulation to determine the distribution of throughputs and delays with real navigation paths and ray-tracing from new high-resolution 3D models in an urban environment. For comparison, the wireless simulation considers both a standard 4G-Long Term Evolution (LTE) carrier and high-rate 5G millimeter-wave (mmWave) carrier. The work thus provides a thorough and realistic assessment of edge computing with mmWave connectivity in an application with both high bandwidth and low latency requirements.


I. INTRODUCTION
Technology in smart wearables is advancing rapidly with an increasing integration of rich camera and sensor data [1], [2]. At the same time, there has been remarkable progress in machine vision technology for processing this visual information. A key challenge of deploying advanced machine vision algorithms in the wearable setting is that state-of-the-art deep neural networks are computationally demanding, particularly for mobile devices that are limited in power and processing resources for high-resolution images [3].
Mobile edge computing combined with the massive mobile broadband capabilities of fifth generation (5G) cellular wireless systems offers the possibility of offloading these computationally intensive vision processing tasks to the network edge [4], [5]. Importantly, 5G systems can leverage the millimeter-wave (mmWave) bands which afford vastly greater spectrum for higher-rate and lower-latency connectivity compared to standard 4G ones [6]-[8]. With mmWave connectivity, a mobile device or wearable can upload high-resolution video data to edge servers, where much greater computational processing can be performed while keeping resources closer to the user to reduce the overall latency. Wireless offloading can thereby enable support for multiple cameras for an enlarged field-of-view. Edge connectivity may also provide real-time access to data from other users, converging to new cooperative service strategies.
In this work, we study the potential of wireless offloading of machine vision processing for a powerful, smart wearable for the Blind-and-Visually Impaired (BVI). The system, called VIS4ION (Visually Impaired Smart Service System for Spatial Intelligence and Navigation) [9]-[16], is a human-in-the-loop, sensing-to-feedback advanced wearable that supports a host of microservices during BVI navigation, both outdoors and indoors. The current VIS4ION system is implemented as an instrumented backpack; more specifically, a series of miniaturized sensors are integrated into the support straps and connected to an embedded system for computational analysis; real-time feedback is provided through a binaural bone conduction headset and an optional reconfigured waist strap turned haptic interface.
A key limitation of the current VIS4ION system is that all the machine vision is performed locally by an embedded processor in the backpack, which limits the image resolution and the frame rate at which visual computation (e.g., object detection) can be performed. Furthermore, the battery needed to enable prolonged operation adds considerably to the backpack weight. Here, we investigate the wireless offloading of the vision computations to edge servers as shown in Fig. 1. In the system studied, the wearable is augmented with multiple high-resolution cameras to increase the field of view of the device and enhance functionality (the current system has a single stereo camera). When wireless connectivity is available, the camera data will be uploaded over a cellular network to a mobile edge server. We analyze the system in the case where the cellular wireless link can include both traditional lower data rate carriers (e.g., sub-6-GHz carriers in 4G) as well as higher data rate 5G carriers in the mmWave band. Since the data rate in the multi-carrier system may be variable, we consider an adaptive video scheme where the number of camera feeds and bit rate per camera are adapted based on the estimated uplink wireless rate and delay.
The requirements for such a wireless system are considerable and well beyond those considered in prior vision offloading studies. For example, as we will see in Section III, accurate object detection for pedestrian scenes at reasonable distances can require over 100 Mbps if four cameras are used. Moreover, based on physiological markers (see Section II-D), the total maximum end-to-end delay of the system will likely need to be less than 100 ms. After removing the time for video acquisition, compression, and inference, there is a limited time for uplink and downlink transmission. As described in the related prior work section below, most prior applications of edge computing for machine vision processing with video rate adaptation (e.g., [18]-[22]) have considered relatively low-resolution, single-camera data where the requirements are much less strict. In these cases, sub-6-GHz carriers with relatively limited bandwidths are generally sufficient. In contrast, we will see that the 5G mmWave bands are uniquely capable of meeting the peak requirements for the enhanced VIS4ION wearable.
Yet, 5G mmWave connectivity presents considerable technical challenges of its own when used for offloading. Most importantly, data rates in mmWave outdoor links are highly variable since the signals have limited range and are strongly susceptible to blockage from buildings, pedestrians, and other objects in the environment [23], [24]. In addition, mmWave links are highly directional and require continuous beam tracking to maintain connectivity [6], [25]. This beam management and rate prediction can cause significant additional delays [26]. The broad goal of this paper is to provide a detailed assessment of the feasibility of 5G mmWave machine vision edge processing in a high-data-rate, low-latency application.
Our study does not include the important factor of power consumption that arises from mmWave connectivity, or the potential power savings from avoiding local computation. Power analyses of mmWave device transceivers and beam tracking can be found in [27]-[29], along with power measurements of commercial devices in [30]. Our focus here is on the functional benefits of 5G connectivity, such as support for an increased number of cameras, higher resolution, and greater object detection range.
FIGURE 1: Wireless offloading study: The VIS4ION wearable jacket from [10]-[13], [16], [17] is outfitted with multiple cameras for 360° view. Due to power limitations, local processing of the camera data may be limited to a single camera at low resolution. When high rate wireless connectivity is available, multi-camera, high-resolution data can be sent to an edge server where greater processing capabilities are available. The detection results are then returned in the downlink. The paper assesses the feasibility of this approach in urban environments under realistic wireless channel conditions and deployment assumptions.

Our analysis follows four main steps, each of which bears significant new contributions to handle the unique nature of the mmWave offloading system for the enhanced VIS4ION wearable:
• Creation of the NYU-NYC StreetScene dataset: First, to evaluate object detection, we curated a custom dataset, NYU-NYC StreetScene, of high-resolution videos taken during the 'Last Mile' pedestrian segment of commuting in NYC. The videos were manually annotated with objects specific to BVI navigation. A depth estimation method was developed for selected objects (standing people) so that the distance of the objects could also be estimated, which is key to assessing the detection range. The dataset is made public and is itself a contribution of the work [31].
• Evaluation of the impact of video resolution and bit rate on object detection: Using the dataset, we conducted an extensive study to evaluate the impact of video resolution and bit rate on the object detection accuracy and the reliable detection range. As discussed in the prior work section, previous analyses such as [21], [22] considered only low-resolution images, and did not explicitly study the detection range. Moreover, [32] did not consider the effect of compression.
• Wireless network evaluation: We next conducted detailed, realistic wireless network simulations of end users engaged in 'Last Mile' pedestrian commuting similar to those from which the video was captured. To assess the unique capabilities of 5G, we simulated both a 5G mmWave carrier at 28 GHz, and a traditional 4G Long Term Evolution (LTE) carrier at 1.9 GHz. To accurately predict propagation at both frequencies, we used state-of-the-art ray tracing [33] combined with new highly detailed 3D models acquired from GeoPipe [34]. The use of such detailed models in wireless simulation is the first of its kind. The channel data was integrated into a widely-used end-to-end network simulator, ns-3 [35], that captures blocking, beam tracking, 4G and 5G protocol functionalities as well as delays in the core network and edge network.
• Performance analysis and availability: Combining the video and wireless analysis with inference times, we compared the performance of three scenarios: local processing only, offloading with LTE, and offloading with 5G mmWave + LTE. For each option, we were able to determine key performance numbers such as the object detection accuracy, range of detection, number of cameras that can be supported, and end-to-end latency. In addition, since the channel quality is variable, we determined the percentage of time that the performance values can be obtained for a given scenario. We further evaluated the performance achievable with an adaptive offloading scheme, which switches between edge and local computing and varies the video resolution based on the wireless throughput.
In summary, we completed a thorough assessment of combining state-of-the-art machine vision and mobile edge computing for contemporary advanced wearables.
We point out that the present work focuses on cellular wide area technologies such as 4G and 5G since our target application is outdoor mobility. Of course, in indoor settings and hotspots, Wireless Local Area Networks (WLANs), including high data rate versions such as [36], [37], may be available; see also the prior work section below for studies in mobile edge computing with WiFi. The study of indoor navigation with wireless offloading over high data rate WLAN is an interesting topic of future research.

RELATED PRIOR WORK
With the growing use of computationally intensive deep learning methods for vision tasks, there has been significant work in studying offloading of this computation via edge computing; see, e.g., [18] for an excellent recent survey. However, very few consider the unique challenges of high-resolution images transmitted over massive broadband 5G links as needed by the VIS4ION system. For example, some of the most recent works are as follows:
• Edge-AI [19] studies edge computing for image classification. Since the study is on relatively limited data rate 4G links, the work studies dynamically partitioning the layers in a CNN for vision classification. In this work, we assume the processing is entirely done at the edge or local device.
• mVideo [20] considers offloading large batches of surveillance images to the cloud for face detection. This work also only considers 4G.
• Hochsteetler et al. [38] study inference time on a very low-power edge device without access to an edge server.
• Liu et al. [21] considered edge-assisted object detection in mobile Augmented Reality (AR) applications. To meet the stringent delay requirement, they combine edge computing for object detection and fast local object tracking, which we adopt in our work as well. They also propose slice-based processing so that video transmission and object detection can be run sequentially over successive slices to reduce the edge computing delay. Their simulations only consider indoor WiFi connections between the VR headset and the nearby server, while we consider users walking through urban streets using 4G and 5G cellular networks. They also investigated the effect of video resolution and bit rate on the object detection accuracy and processing delay. However, they only examine resolutions up to 720P and consider the Faster R-CNN object detector, while we consider resolutions up to 2.2K and focus on the more popular YOLO detector.
• Ran et al. [22] also study edge computing for AR applications. They assume the local processor (smartphone) can run either a tiny-YOLO or a big-YOLO model, while the edge server only runs the big-YOLO model. Through measurement studies, they consider the trade-off among decision variables including spatial resolution, frame rate, mobile power consumption, edge vs. local processing, and object detection model. They then propose a measurement-driven optimization framework that sets these decision variables to optimize a weighted average of the detection accuracy and frame processing rate, under delay, bandwidth, and power constraints. However, the resolution range considered in their study is very low (only up to 480×480). They also do not simulate real wireless networks.
• Jiang et al. [39] propose methods to adapt frame size and frame rate as well as detection models to meet a target detection accuracy based on video content. They leverage the spatial (cross cameras) and temporal correlation of the optimal configuration to reduce the computation cost of profiling. However, this work does not consider the impact of video compression (and hence the bit rate).
• Huang et al. [32] consider the impact of frame size and frame rate (as well as detection model configuration) on the object detection accuracy. This work is not evaluated in the context of edge computing and hence does not consider the effect of compression.
In addition to the above studies on mobile edge computing, there is also a large body of work on delivering low-latency services in 5G. Indeed, one of the core design requirements of 5G is so-called Ultra-Reliable Low Latency Communications (URLLC) [40], [41], targeting airlink latencies of 1 to 10 ms. For the application in this paper, the URLLC features of 5G are critical in meeting the overall delay requirements. However, we will also see that several other operations contribute to the overall delay including video framing, video encoding, delays in the core network, and inference time. One goal of this work can be seen as evaluating end-to-end delays with a realistic assessment of the major components of the overall application.
Finally, on a more general note, the work [42] describes an Internet of No Things and "seeing the invisible". The focus of this paper is more limited, specifically increasing the range and field of view via computational offloading.

II. THE VIS4ION SYSTEM AND OFFLOADING ARCHITECTURE
A. MOTIVATION
Immobility is a fundamental challenge for persons with BVI [43]. Loss of sight leads to loss of spatial cognition [44]. Spatial cognition can be defined as the knowledge or cognitive representation of the structure, entities/objects, and relationships within space [45]. The overwhelming majority of physical spaces are not visually accessible or do not allow for safe and efficient travel [46], [47]. This limits spatial cognition for the blind and leads to inefficiencies and peril during navigation. This gap requires new tools to bridge such accessibility barriers and to promote independence in daily tasks. A host of assistive technologies for citizens with BVI have been proposed. The white cane [10] is arguably the most widely used [48], [49] and affordable tool, but its short perceptive range limits its function as a direct extension of physical touch [50]-[55]. High-tech hardware-based wearable devices have been developed to provide assistive features such as outdoor navigation [56]-[58]. However, they are generally either high cost or overly cumbersome [59]-[61]. In contrast, software-based solutions that run on ordinary smartphones are more affordable and accessible for BVIs. For example, Microsoft Seeing AI [52] and Blind Square [53] are widespread sensory substitution and navigation apps. However, these applications are not capable of offering advanced computer-vision-based assistive services or features due to the smartphones' limited on-board sensing capabilities and computing power.

FIGURE 2: The current VIS4ION system [13], [16], [17] is an instrumented book bag with a single ZED camera, NVIDIA Jetson GPU for local vision processing, and haptic and audio feedback. In this work, we consider augmenting it (proposed augmentation shown in blue) with three additional cameras for rear, left and right visibility and wireless connectivity for edge processing.

B. THE CURRENT VIS4ION SYSTEM AND ITS LIMITATIONS
To address these challenges, we recently developed VIS4ION, a Visually Impaired Smart Service System for Spatial Intelligence and Navigation [9]-[11], [13], [14], [62]. The system is implemented as a mobile sensor-to-feedback wearable device in the form of an instrumented bookbag (see Fig. 2). This smart service system is capable of real-time scene understanding with human-in-the-loop navigation assistance, supporting both mobility and orientation [12], [17]. VIS4ION has four components: (1) distance and ranging/image sensors scaffolded into the shoulder straps of a backpack; these sensors (including a stereo camera and an Inertial Measurement Unit) extract pertinent information about the environment; (2) an embedded system (micro-computer) with both computing and communication capability (inside backpack); (3) a haptic interface (waist strap) that communicates spatial information computed from the sensory data to the end-user in real time via an intuitive, torso-based, ergonomic, and personalized vibrotactile scheme; and (4) a headset that contains both binaural bone conduction speakers and a noise-cancelling microphone for oral communication [10], [11], [13], [15]. The system leverages stereo cameras as its primary sensory input and employs advanced computer vision algorithms on Nvidia Jetson processing boards. The goal of the embedded system is to enable continuous mapping, localization, and surveillance within a dynamically changing environment [10]-[13], [15], [17].
The key limitation in the current system is that the video processing is performed entirely locally, which is computationally and power intensive and limits the performance of visual analytics. Indeed, the wearable runs off a laptop battery with approximately 66 Wh at 0.5 kg yielding 2-3 hours of function if continuously running the vision processing. To stay within these power limits, the wearable uses the Jetson Xavier NX to perform object detection. With a standard YOLO model the system is able to process only WVGA resolution video at a rate of 10-13 frames per second (FPS) (see Table 4); results are often even poorer in mobile phone applications employing similar approaches [63]. As we discuss below, the low resolution results in poor detection accuracy and limited range.
Another limitation of the current system is that it deploys only a single stereo camera providing a field of view of approximately 90 degrees horizontally and 60 degrees vertically. At about 3 meters of distance or range from the end user, a 90-degree field of view is very restrictive, leaving potentially pertinent spatial obstacles out of the perceptive capabilities of the system, ones that may be encountered with even slight orientation shifts in forward paths. While this may be circumvented by simply using ultra-wide angle cameras, the geometric distortion in such cameras can degrade performance of visual analytics [64]. In order to address these shortcomings, we propose to embed multiple cameras in the VIS4ION backpack to provide omnidirectional coverage.
Although multiple cameras or 360 degrees of perception may seem superfluous for a human with no disability, a person with a disability may benefit significantly from a system that can provide advanced notice and anticipate danger omnidirectionally. These full-field approaches to environmental analysis are now a common practice in myriad autonomous systems, from robots to cars and drones [65], [66].

C. WIRELESS OFFLOADING SYSTEM STUDIED
Wireless offloading of vision processing to a mobile edge server offers two key potential benefits: (1) greater processing capability at the edge can enable analysis of multiple high-resolution camera streams for fast and more accurate object detection, over a wider field of view and greater range (distance); and (2) reducing processing on the wearable can prolong the battery life and/or reduce the battery weight.
To assess these potential gains, we consider two augmentations to the backpack, depicted in blue in Fig. 2. First, we consider a version of the wearable with four stereo cameras, for example, to cover four sectors of 90 degrees each. Second, to process multiple cameras at higher resolution, we consider adding cellular connectivity to an edge server with greater processing capabilities. The system will adapt the number of camera streams to be uploaded to the edge server and their target bit rates based on the estimated uplink network capacity. Furthermore, the compression configuration (e.g., frame rate and spatial resolution) of each video stream will also be adapted to the target rates. The mobile edge will analyze the videos from multiple cameras using a deep learning network for object detection (and other tasks such as environmental mapping) and send the results back to the wearable over the downlink; see also Fig. 1.
When the wireless connection is temporarily down (e.g., due to blockage), the local processor can analyze one video stream at a lower resolution, while storing the captured high-resolution video within its on-board memory. The high-resolution video can then be opportunistically uploaded to the edge server when the user reestablishes a high-bandwidth wireless connection, to enable the mapping of various environments, perform post-hoc behavioral analysis, etc.
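The adaptation described above can be illustrated with a minimal sketch. The per-camera rate ranges are the ones reported in Section III-C; everything else (the function name `plan_offload`, the rule of filling cameras at a minimum per-camera rate, and the default of 6.0 Mbps) is a hypothetical policy chosen for illustration, not the scheme evaluated in this paper.

```python
from dataclasses import dataclass

# Per-camera rate ranges (Mbps) reported in Section III-C.
# The selection policy built on top of them is an assumption.
RESOLUTION_BY_RATE = [  # (minimum per-camera rate in Mbps, resolution)
    (26.2, "2.2K"),
    (6.0, "1080P"),
    (0.35, "720P"),
]


@dataclass
class OffloadPlan:
    mode: str                    # "edge" or "local"
    num_cameras: int             # number of camera streams processed
    resolution: str              # spatial resolution of each stream
    rate_per_camera_mbps: float  # uplink rate assigned to each stream


def plan_offload(uplink_mbps: float, max_cameras: int = 4,
                 min_rate_per_camera: float = 6.0) -> OffloadPlan:
    """Hypothetical policy: upload as many cameras as can each receive
    min_rate_per_camera, then pick the resolution for the resulting rate."""
    num = min(max_cameras, int(uplink_mbps // min_rate_per_camera))
    if num < 1:
        # Link too slow or down: process one low-resolution stream locally.
        return OffloadPlan("local", 1, "WVGA", 0.0)
    rate = uplink_mbps / num
    for threshold, resolution in RESOLUTION_BY_RATE:
        if rate >= threshold:
            return OffloadPlan("edge", num, resolution, rate)
    return OffloadPlan("local", 1, "WVGA", 0.0)


if __name__ == "__main__":
    for uplink in (3.0, 30.0, 120.0):
        print(uplink, plan_offload(uplink))
```

Under this rule, four cameras at the 26.2 Mbps operating point need about 105 Mbps in total, which is consistent with the "over 100 Mbps" requirement quoted in the Introduction.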

D. DELAY REQUIREMENTS
For real-time pedestrian navigation, there is no generally agreed upon requirement for the tolerable total delay between the time an object appears in the environment of a pedestrian and the time it should be detected and reported to the pedestrian. In our previous work, we suggested 100 ms [67]. This benchmark is predicated on a physiologic marker, the high-end of the duration range for a large-amplitude saccade (fast eye movement); such an eye movement would be used by a normal-sighted pedestrian to identify a potential hazard. This stringent delay requirement enables the detection of dynamic, high-velocity objects (e.g., a suddenly appearing scooter in a pedestrian walkway).
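As a rough illustration of how the 100 ms target constrains the wireless link, the sketch below subtracts the non-network delay components from the total budget. The component values in the example are placeholders chosen only to show the arithmetic; they are not measurements from this paper.

```python
def transmission_budget_ms(total_ms: float = 100.0, **component_ms: float) -> float:
    """Time left for uplink plus downlink after the named delay components."""
    return total_ms - sum(component_ms.values())


if __name__ == "__main__":
    # Placeholder values (assumptions): one frame interval at 30 Hz for
    # acquisition, plus illustrative encoding and inference times.
    budget = transmission_budget_ms(
        capture_ms=1000.0 / 30.0,
        encode_ms=10.0,
        inference_ms=20.0,
    )
    print(f"Remaining for uplink and downlink: {budget:.1f} ms")
```

A negative result would indicate that the chosen configuration cannot meet the target regardless of the network.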

III. IMPACT OF VIDEO COMPRESSION AND SPATIAL RESOLUTION ON OBJECT DETECTION ACCURACY AND RANGE
A. OVERVIEW
Due to the high bit rate of raw high-resolution video, compression is needed to stream image frames to the edge server via a throughput-constrained link. In this section, we analyze how video compression (including reducing video resolution) impacts object detection performance and inference time. The analysis will provide the bit rate and delay requirements for the wireless uploading described in Section IV.
Although the augmented VIS4ION system will use multiple stereo cameras to enable a wider field of view and distance estimation, the analysis in this section focuses on a single monocular video. How to combine detection results from multiple views or make use of depth information in object detection is an interesting subject of future work. Thus, in the wireless evaluation in the next section, we will consider only uploading of multiple monocular streams.
Given a target rate for a camera, the video can be compressed at different spatial resolutions (frame size in terms of pixels) and temporal resolutions (frame rate), as illustrated in Fig. 3. With the chosen spatiotemporal resolution, the bit rate is controlled by the quantization stepsize, which controls the amplitude resolution and affects the pixel quality. While there has been significant work in relating spatial, temporal, and amplitude resolution (STAR) to perceptual video quality [68]-[70], the effect of STAR on object detection accuracy is less understood. As mentioned in the Introduction, most prior works have only studied relatively low-resolution images.
Here, we conduct a study to systematically evaluate the impact of spatial and amplitude resolution on the object detection accuracy using a popular object detection deep learning model (YOLO [71]). We leave out the consideration of the temporal resolution at this time because the YOLO model works on video frames independently. This study enables us to determine the optimal spatial resolution for a given bit rate, and the achievable detection accuracy under the optimal resolution at this rate. We will further characterize the effect of the object distance (from the camera) on the detection accuracy under different spatial and amplitude resolutions, to provide recommendations/guidelines on the necessary spatial resolution and bit rate to meet the desired detection range for wearables that support pedestrian navigation applications. Finally, we characterize the computational cost of YOLO (including inference time) at different spatial resolutions, which provides guidance on the tolerance for the roundtrip delay with wireless offloading.

B. CREATION OF THE NYU-NYC STREETSCENE DATA SET
Currently, there are no public datasets containing high-resolution videos captured from the perspective of a typical pedestrian. A significant effort in this work is the creation of a new, manually-annotated 'StreetScene' video dataset for this purpose.

Video collection
To test the performance of the YOLO model for detecting objects of interest for pedestrian navigation, we recorded a set of videos while wearing the current VIS4ION backpack, which has a single stereo camera on the front shoulder strap. The camera model is the ZED camera from StereoLabs [72], a lightweight, powerful recording device, ideal for wearables requiring spatial intelligence. A total of 9 videos were captured with the ZED at the 2.2K spatial resolution and 15 Hz temporal resolution, with a total video length of 43 minutes.

FIGURE 3: Under the same bit rate constraint, one can represent a video using different combinations of spatial, temporal, and amplitude resolutions as shown here with an example video compressed to 1 Mbps. The bottom row shows a crop from each version of the video to better illustrate the differences in compression artifacts.

Object annotation
We manually annotated the bounding boxes for 15 objects of interest, listed in Table 1, along with the number of occurrences for each object. We annotated every 30th frame of each video (only left view). This annotated dataset is publicly available at [31]. Because YOLO was trained with a single 'traffic light' class that covers both 'vehicle traffic light' and 'pedestrian signal', we grouped these two separately annotated objects into the same object type when applying the YOLO model, leaving 14 object types. Furthermore, because the detection performance of YOLO on 'bench', 'stop sign' and 'dining table' is very poor, we only report the detection performance for the remaining 11 objects.

Video compression
We compressed all videos (left view only) in the StreetScene dataset using the FFmpeg software with the x265 codec [73], [74], which follows the latest international video coding standard H.265/HEVC [75]. We kept the same temporal resolution, and compressed the video either at the original 2.2K spatial resolution or reduced spatial resolutions (see Table 2) under different quantization parameters (QPs). Default down-sampling filters ('bicubic') in FFmpeg were used for the spatial downsampling. Considering the low-delay requirement of the navigation application, we used a Group of Picture (GOP) length of 60 frames, without B-frames, i.e., each GOP starts with one I-frame, followed by 59 P-frames.
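For readers who wish to reproduce a comparable encoding, the sketch below assembles an FFmpeg command line with the settings described above (x265, bicubic downscaling, a fixed QP, a 60-frame GOP, and no B-frames). The exact command used by the authors is not given in the text, so the option choices here are assumptions: in particular, `scenecut=0` is added to keep scene-cut detection from inserting extra I-frames, `-an` drops audio, and the file names and the 1280×720 target are placeholders.

```python
from typing import List


def build_x265_command(src: str, dst: str, width: int, height: int,
                       qp: int, gop: int = 60) -> List[str]:
    """Return an ffmpeg argument list for a low-delay x265 encode.

    The command is only built, not executed, so ffmpeg need not be installed.
    """
    x265_params = (
        f"keyint={gop}:min-keyint={gop}"  # fixed GOP length
        ":scenecut=0"                     # assumption: no extra I-frames
        ":bframes=0"                      # I-frame followed by P-frames only
        f":qp={qp}"                       # constant quantization parameter
    )
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale={width}:{height}:flags=bicubic",
        "-c:v", "libx265",
        "-x265-params", x265_params,
        "-an",
        dst,
    ]


if __name__ == "__main__":
    cmd = build_x265_command("input.mp4", "output_720p.mp4", 1280, 720, qp=30)
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg was built with libx265 support.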

Distance estimation for standing people
To examine how distance affects the detection accuracy in the StreetScene dataset, which does not have accurate distance measurements (see footnote 1), we developed a method to estimate the distance of standing people in our 'StreetScene' dataset. We focus on distance estimation for this object type since this is a relatively small object whose detection can be greatly affected by the object distance. Given that the variation of the physical size of standing people is relatively small, the size of the box bounding a standing person is mainly determined by the distance of the person from the camera. Based on this observation, we trained a distance estimation model (containing a few fully connected layers) based on the bounding box width and height using the KITTI dataset [76], which has annotated bounding boxes and distances for standing people. To account for the difference in the camera used for the KITTI data and our ZED camera, we used our ZED camera to capture a set of videos with standing people at multiple distances against a variety of backgrounds, with the distance captured using the positional tracking system of the ZED camera. Using this dataset, we were able to learn the mapping from the distance estimated by the model trained on the KITTI data to the distance from the video captured by the ZED camera. To apply this model on the people detected in the StreetScene dataset using the YOLO model, which includes both standing and sitting people in the same object category, we looked at the distribution of the height over width ratio among the standing and sitting people in the KITTI data, and found that using a ratio threshold of 2.0 can fairly reliably separate standing people from sitting people. Therefore, we used this ratio threshold to detect standing people in the StreetScene dataset. 
By applying the distance estimation model to the detected bounding boxes for the standing people followed by the camera mapping, we generated the distance measurements of standing people in the StreetScene dataset.
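The two-stage procedure can be summarized in a short sketch. The 2.0 height-to-width threshold is taken from the text; treating a ratio equal to the threshold as standing is an assumption. The two callables stand in for the KITTI-trained network and the learned KITTI-to-ZED mapping, whose actual forms are not specified here, so the lambdas in the example are purely hypothetical placeholders.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[float, float]  # (width, height) of a bounding box in pixels


def is_standing(box: Box, ratio_threshold: float = 2.0) -> bool:
    """Classify a 'person' box as standing if height/width meets the threshold."""
    width, height = box
    if width <= 0:
        return False
    return height / width >= ratio_threshold


def estimate_distance(box: Box,
                      kitti_model: Callable[[float, float], float],
                      camera_map: Callable[[float], float]) -> Optional[float]:
    """Apply the KITTI-trained model, then the camera mapping.

    Returns None for boxes that are not classified as standing people.
    """
    if not is_standing(box):
        return None
    return camera_map(kitti_model(*box))


if __name__ == "__main__":
    # Placeholder stand-ins (assumptions), not the trained models.
    toy_model = lambda w, h: 1000.0 / h
    toy_map = lambda d: 1.1 * d
    print(estimate_distance((40.0, 100.0), toy_model, toy_map))
    print(estimate_distance((60.0, 100.0), toy_model, toy_map))
```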

C. EFFECT OF SPATIAL AND AMPLITUDE RESOLUTION ON OBJECT DETECTION ACCURACY
We first examine the impact of spatial and amplitude resolutions (with corresponding bit rates) on the object detection accuracy on the StreetScene dataset, in which objects appear at varying distances. Results show that the optimal spatial resolution varies with the target bit rate (which is constrained by the network throughput). We applied a pretrained YOLO 5s model [77] on the decompressed videos in the StreetScene dataset to detect the 14 objects of interest. Fig. 4 shows the weighted mean average precision (wmAP) over the objects of interest (see Table 1) vs. bit rate (see footnote 2). The weight of an object type is proportional to its occurrence frequency in the dataset. The figure reveals that there is an optimal spatial resolution at each bit rate that will maximize the wmAP. Specifically, 720P is best for 0.35-6.0 Mbps, 1080P for 6.0-26.2 Mbps, 2.2K for higher bit rates. However, 2.2K provides only marginal improvement over 1080P above 26.2 Mbps. We note that this could be because the YOLO model was trained mainly on low-resolution images. Although WVGA is best at a very low rate (below 0.35 Mbps), the achievable AP is too low to be usable.

TABLE 1: Annotated objects of interest and their numbers of occurrences.
Object                   Occurrences
person                   9783
car                      5442
vehicle traffic light    1977
pedestrian signal        651
potted plant             1122
bicycle                  704
truck                    459
chair                    450
fire hydrant             370
bus                      349
umbrella                 335
motor cycle              162
bench                    247
stop sign                217
dining table             180

Footnote 1: The depth estimation from the stereo disparity in the ZED camera SDK is not very accurate and only works when the distance is within 20 meters.
At 26.2 Mbps and using 1080P resolution, the weighted mean AP is about 54%, which is still far from perfect. However, this relatively low detection accuracy is due to limitations of the YOLO 5s model, which was trained using uncompressed low-resolution images. Better detection models (e.g., models specifically trained for street scenes and/or models that are separately optimized for different resolutions) will likely further improve the detection accuracy. It is plausible that the trend of the detection accuracy vs. rate vs. spatial resolution would be preserved for future, more powerful models. Fig. 5 presents the detection result for the person category. We see a similar trend as in Fig. 4, although the specific rate points at which higher resolutions overtake lower resolutions are slightly different. The AP for the person category is higher than the wmAP over 11 objects at similar bit rates, which shows that the YOLO model is more effective in detecting people than other object categories. This is consistent with the performance reported in [78], likely because there are significantly more instances of people than other objects in the training set.
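As a small illustration of the metric used above, the weighted mean AP can be sketched as an occurrence-weighted average of per-class APs. The function name `weighted_map` and the example AP values in the usage note are hypothetical; only the weighting rule (weights proportional to occurrence counts) comes from the text.

```python
def weighted_map(ap_by_class, count_by_class):
    """Occurrence-weighted mean of per-class average precision (AP).

    ap_by_class:    dict mapping class name -> AP in [0, 1]
    count_by_class: dict mapping class name -> number of occurrences
    """
    total = sum(count_by_class[c] for c in ap_by_class)
    if total == 0:
        raise ValueError("no occurrences for the requested classes")
    return sum(ap_by_class[c] * count_by_class[c] for c in ap_by_class) / total
```

For instance, with made-up APs of 0.6 for a class seen three times and 0.4 for a class seen once, `weighted_map` returns 0.55, closer to the AP of the more frequent class.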
Sample frames from videos compressed to around 10 Mbps using different settings are shown in Fig. 6(a-b). A priori, it is not clear which decompressed image will lead to better object detection. However, from the detection results shown in Fig. 6(c-d), the YOLO model did better for image (b), which was represented with a high spatial resolution but low amplitude resolution. Table 3 summarizes, for each resolution, the bit rate at which the detection accuracy for multiple object detection (wmAP) plateaus, and the corresponding detection accuracy. We also list the AP for human detection, and the recall when the precision is 0.80. As we can see, using higher resolution video enables higher object detection accuracy, which translates to more correctly detected objects. For example, going from WVGA to 1080P, the recall for the "person" category increased from 48% to 74%, while keeping the false detection rate at 20%. Therefore, by transmitting high resolution video to the edge server when the network throughput is sufficiently high, we are able to see "more" objects.

² Note that although the videos in the StreetScene dataset are captured and compressed at 15 Hz, we report the equivalent bit rates for videos at 30 Hz, which is necessary to meet the real-time navigation requirement as further detailed in Sec. II-D. This is accomplished by scaling the actual bit rates corresponding to different spatial resolutions and QPs with different scaling factors, determined by a separate experiment in which we compressed several sample videos captured at 30 Hz at both 30 Hz and 15 Hz, at multiple resolutions and QPs.

D. EFFECT OF SPATIAL RESOLUTION ON DETECTION ACCURACY AT DIFFERENT DISTANCES
The performance measures reported so far are aggregated results for objects appearing at varying distances from the camera. Generally, detecting a faraway object is harder than detecting a nearby object. On the other hand, being able to detect an object while it is still far away provides more time for navigation planning. Therefore, it is important to understand how the detection accuracy degrades as the distance increases and what the maximum distance is at which an object can be detected reliably. How distance affects detection also depends on the physical size of the object. We show such results for the detection of standing pedestrians as a case study, wherein we use the algorithm described in Sec. III-B to detect standing pedestrians and estimate the distance of each detected person from the bounding box size.
We quantized the distances to several bins and determined the AP within each bin. Fig. 7 illustrates how the detection accuracy drops as the object distance increases under different spatial resolutions. When the distance is very short, YOLO performs very well even at the WVGA resolution, but the accuracy drops quickly as the distance increases at this low resolution. As expected, higher spatial resolutions enjoy a slower decay rate. The 2.2K resolution leads to significantly better detection than 1080P only when the distance is greater than 21 m. Note that the 2.2K resolution did worse than the lower resolutions in the short distance range in this study. This is likely because the YOLO 5s model was trained on low-resolution video (close to WVGA). People within a short distance occupy a very large area in the 2.2K image, requiring bounding box sizes that rarely occur in the training data. Fig. 8 shows the detection ranges for different spatial resolutions. Here, we see clearly that going from WVGA to 1080P, we are able to extend the detection range from about 6 m to 12 m. This study shows that we should use at least 1080P video to be able to reliably detect people at a distance important for navigation planning. Note that the results would be similar even if the videos were uncompressed, because the average detection accuracy for each particular resolution has already plateaued at its corresponding bit rate, as shown in Fig. 5.

Table 4 summarizes, for the YOLO 5s model and for videos at different spatial resolutions, the computational complexity (measured by the FLOP count), the inference time per video frame and corresponding speed (frames per second, or fps) on the embedded processor in our VIS4ION backpack (Jetson Xavier NX running at 15 W, using a GPU at 1.1 GHz), and the inference time per video frame and speed on an edge server equipped with an RTX 8000 GPU. As will be explained in Sec. V, to meet the total delay requirement for real-time navigation, local processing should be completed within 67 ms, which is barely possible with the WVGA video, severely limiting the achievable object detection accuracy and detection range (cf. Table 3, Fig. 8). On the other hand, offloading the computation to the edge server allows us to process the 1080P video and consequently significantly increase the detection performance, while still meeting the delay constraint.

Table 4: Impact of spatial resolution on the detection model complexity and on the running time on the local processor and the edge server, respectively. The Jetson Xavier NX is used as the local processor, while the server uses an RTX 8000 GPU.

IV. WIRELESS EVALUATION
Having analyzed the bit rate and latency requirements for video processing, we now simulate the wireless network to determine what percentage of time these requirements can be met.

A. USER ROUTE AND RAY TRACING
We simulate a hypothetical end-user commuting through the streets of Manhattan, a challenging environment from a wireless perspective due to blockage by tall buildings. In particular, we identified a walking route starting at Lighthouse Guild (a healthcare and research center that assists persons with BVI) and finishing at the NYU Tisch Hospital in Kips Bay, as depicted in Fig. 9. The environment of this route is similar to where the NYU-NYC StreetScene dataset described in Section III was collected. Along this route, we selected four sites (red rectangles in Fig. 9) to perform a realistic full-stack 5G simulation. This simulation involved two main steps: 1) generation of ray-tracing data of the wireless environment from the realistic 3D layout of the city at these sites, and 2) end-to-end simulation using Network Simulator 3 (ns-3) and the ray-tracing information obtained in the previous step. Here, we describe the generation of ray-tracing data, using the figures for the top site in Fig. 9 as an example.
In wireless communications, ray-tracing involves the calculation of the paths that electromagnetic waves, represented as rays, follow while propagating in a known 3D environment according to a set of transmitting and receiving positions. For each combination of transmitting and receiving locations, the output of a ray-tracing simulation consists of a set of propagation information for each ray: path loss, propagation time, phase offset, Angles of Departure (AoD), and Angles of Arrival (AoA). We used Remcom's Wireless InSite [33] software to generate ray-tracing data in our work. This software has been successfully used in a number of other studies [79]- [81]. Accurate ray-tracing of a 3D scenario at mmWave and sub-6-GHz frequencies requires precise information about building materials, since each material affects the propagation of the electromagnetic wave differently at each frequency range. In addition, details of the vegetation in a given area are essential for accurate ray-tracing at mmWave frequencies. To capture all this information, we imported into Wireless InSite the 3D layout of the city at each of the four sites, as provided by GeoPipe [34]. Fig. 10 depicts the 3D layout with building materials and vegetation at the top site of Fig. 9. This data from GeoPipe provides one of the most accurate models for mmWave ray tracing. In particular, the models contain small building features, which are known to influence mmWave propagation significantly [82]. For each material (i.e., concrete, wood, glass, etc.) and frequency range, Wireless InSite uses different parameters for the reflection and diffraction coefficients, as well as different propagation properties. We performed the simulation at two frequencies: 1.9 GHz for a typical sub-6 GHz 4G LTE carrier, and 28 GHz for a typical 5G mmWave carrier.
Moreover, we considered a real placement of the base stations in each of the four sites as provided by [83]. This database contains information regarding the actual foreseen placement of 5G mmWave Base Stations (BSs) in New York City. We assumed that the LTE 4G and 5G mmWave base stations are co-located, meaning that, at each cell site, base station equipment is available for both frequencies. This co-location is common since, once the operator has secured a site, it generally utilizes it maximally. Note that in 3GPP terminology, a 4G base station is called eNB (evolved Node B), and a 5G base station is called gNB (next generation Node B).

Fig. 10: 3D layout at the top site of Fig. 9, with material information for each building and vegetation provided courtesy of GeoPipe [34].
For ray tracing, each base station site represents a transmitting location, whereas the receiver positions are placed one meter apart along the route taken by the hypothetical user at each specific site (more details in IV-D). The ray tracing then provides the channel from each BS to each position along the route at both frequencies. Note that due to symmetry, the large scale channel parameters are identical in the uplink and downlink. Hence, we can use the estimated channel in both directions.

B. NETWORK SIMULATION
The second step in the evaluation of wireless offloading involves the end-to-end (full-stack) network simulation using ns-3. Ns-3 is an open-source, discrete-event network simulator that supports end-to-end simulations (i.e., simulations that model the entire network stack from the physical layer up to the application layer) with support for user mobility and traffic modeling, among other features. To compare the performance of mmWave connectivity with a standard sub-6 GHz system, we run two simulations:
• A 4G LTE system at 1.9 GHz with 40 MHz downlink and 40 MHz uplink total bandwidth; and
• A 5G mmWave system at 28 GHz with 400 MHz total bandwidth that is Time-Division Duplexed (TDD) for the uplink and downlink.
For both the 4G and 5G systems, the deployment would likely be on multiple carriers, as is common today. For example, the LTE system could use two standard carriers of 20 MHz each and the 5G mmWave system four carriers of 100 MHz each. The parameter values for both systems are shown in Table 5 and are representative of typical 4G and 5G simulation studies; see, e.g., [7], [84], [85]. In addition, details for the 4G-LTE ns-3 module can be found in [86]. To account for loading, we assume that an individual UE obtains a fraction 0.25 of the total bandwidth, which would represent a moderate loading level relative to standard evaluation methodologies [85]. Hence, we simulate the 4G user as operating in a system with 10 + 10 MHz bandwidth and the 5G user as operating in a system with 100 MHz total bandwidth. Moreover, the 5G system operating at mmWave frequencies uses Numerology 2, and the TDD configuration allows symbols to be flexible, meaning that each symbol can be used for either uplink or downlink traffic. In both the 4G and 5G cases, we model the wearable as a User Equipment (UE) traversing the path described above. The channels from the ray tracing in Section IV-A are imported into the ns-3 simulator.
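The bandwidth and numerology figures above follow from simple arithmetic, sketched below. The 0.25 loading fraction and the total bandwidths come from the text; the subcarrier spacing and slot duration use the standard 5G NR relations (15 kHz scaled by 2^mu, and a 1 ms slot divided by 2^mu). The function names are illustrative only.

```python
def per_ue_bandwidth_mhz(total_mhz: float, fraction: float = 0.25) -> float:
    """Bandwidth seen by a single UE under the assumed loading fraction."""
    return total_mhz * fraction


def subcarrier_spacing_khz(mu: int) -> int:
    """5G NR subcarrier spacing for numerology mu: 15 kHz * 2^mu."""
    return 15 * 2 ** mu


def slot_duration_ms(mu: int) -> float:
    """5G NR slot duration for numerology mu: 1 ms / 2^mu."""
    return 1.0 / 2 ** mu


lte_ue_mhz = per_ue_bandwidth_mhz(40.0)      # per direction (uplink or downlink)
mmwave_ue_mhz = per_ue_bandwidth_mhz(400.0)  # shared by uplink and downlink (TDD)
```

With Numerology 2, this gives a 60 kHz subcarrier spacing and 0.25 ms slots, which is one reason the mmWave air-link can respond faster than the 1 ms LTE subframe structure.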
Since ray tracing captures only the buildings and foliage, it does not capture blockage from objects such as humans and vehicles in the environment. As discussed in the Introduction, modeling blockage is critical to accurately assess mmWave coverage [23], [24]. To model this additional blockage, we employ the Blockage Model A in [85], which is integrated into the ns-3 simulator [35]. This model adopts a stochastic approach for capturing human and vehicular blocking. In particular, multiple 2D angular blocking regions, in terms of azimuth and elevation angular spreads, are generated around the UE. One blocking region, denoted as the self-blocking region, captures the effect of human body blocking, whereas K_NSB non-self-blocking regions with random sizes are used to model other sources of blockage (K_NSB can be changed to increase or decrease the density of blockers). Once the blocking regions are computed, each cluster (or ray, in the case of our ray-tracing data) is attenuated accordingly, based on the angular spreads and position of each blocking component. The parameter T_NSB denotes the time interval at which new blockers are randomly generated.
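The per-ray attenuation step can be illustrated with a minimal sketch. It is not the ns-3 implementation: the `BlockingRegion` fields, the choice to sum attenuations when regions overlap, and the attenuation values passed in by the caller are all assumptions made for illustration. The only idea taken from the text is that a ray is attenuated when its arrival angles fall inside an angular blocking region around the UE.

```python
from dataclasses import dataclass


@dataclass
class BlockingRegion:
    az_center_deg: float   # azimuth of the region center
    az_span_deg: float     # full azimuth width of the region
    el_center_deg: float   # elevation (zenith) of the region center
    el_span_deg: float     # full elevation width of the region
    attenuation_db: float  # loss applied to rays inside the region (caller-supplied)


def angular_diff_deg(a: float, b: float) -> float:
    """Smallest absolute difference between two azimuth angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def blockage_loss_db(ray_az_deg: float, ray_el_deg: float, regions) -> float:
    """Total extra loss for one ray; overlapping regions are summed (an assumption)."""
    loss = 0.0
    for r in regions:
        inside_az = angular_diff_deg(ray_az_deg, r.az_center_deg) <= r.az_span_deg / 2
        inside_el = abs(ray_el_deg - r.el_center_deg) <= r.el_span_deg / 2
        if inside_az and inside_el:
            loss += r.attenuation_db
    return loss
```

In a time-varying simulation, the list of non-self-blocking regions would be regenerated every T_NSB seconds, while the self-blocking region would follow the orientation of the user's body.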
Given the blocked channels, the ns-3 simulator then models the full stack communication. At the Physical (PHY) and Medium Access Control (MAC) layers, the modeling includes beam tracking, Channel Quality Information (CQI) reports, rate prediction, Hybrid Automatic Repeat reQuest (HARQ), and scheduling. The total number of HARQ processes for both systems is specified in Table 5. At the higher layers, the simulator models all the Radio Link Control (RLC) segmentation and buffering, Radio Resource Control (RRC) signaling, and handovers. In our simulations, we use the Acknowledged Mode (AM) for the RLC layer for both 4G and 5G systems.
An important parameter in the network configuration is the location of the edge server. In cellular systems, data in the uplink traverses a path: UE (wearable) → base station (4G eNB or 5G gNB) → core network → server in the public Internet. The downlink follows the reverse path. In conventional deployments, operators have relatively few gateway points from the core network to servers in the public Internet. The data may thus need to traverse a long path in the core network to the closest gateway, resulting in high delay from the base station to the server; see, for example, the measurements in commercial networks in [30]. Mobile edge computing reduces the core network delay by placing the edge servers much closer to the base station [87]. In this study, we will assume that the one-way delay from the base station to the edge server is D_core = 5 ms. As we will see in the delay analysis below, this lower delay will be critical to meet the strict delay requirements for BVI navigation.

C. TRAFFIC MODELING
Ideally, in the uplink, we would model the adaptive multi-camera video encoding application in the ns-3 simulator. This analysis would then be specific to the video adaptation algorithm used. To provide a more general and simpler analysis, we instead model the uplink video data as a single Transmission Control Protocol (TCP) stream with a full buffer up to a maximum data rate of 120 Mbps. This maximum data rate is sufficient to support four cameras at 30 Mbps each. Since TCP has congestion control, it will automatically adjust the sender rate to the available uplink capacity. As a simplification, we assume that the video encoding rate can be adapted to match the TCP rate exactly. Hence, the full-buffer TCP rate at any time can be regarded as an approximation of the actual video rate.
For the downlink, data from the edge server to the wearable is used to carry the object detection results. We model this downlink traffic as a constant bit rate application at 30 packets per second, corresponding to the expected video frame rate. We assume that the total rate is 1 Mbps, which is ample to specify a large number of detected objects, including their bounding boxes and their probabilities of belonging to different object classes.
For low-latency streaming applications, one should not use TCP as a transport protocol. For example, one can use User Datagram Protocol (UDP) or Real-time Transport Protocol (RTP) that are designed for real-time applications [88]. Here, we use TCP only to simulate the available link rate, since TCP's congestion control automatically adjusts to the link rate and therefore can be regarded as a proxy for the video rate adaptation that would need to be incorporated on top of any real-time transport protocol such as RTP.
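The traffic parameters in this subsection reduce to a few lines of arithmetic, sketched here for reference. The constants come from the text (four cameras, 30 Mbps each, a 1 Mbps downlink at 30 packets per second); the function names are illustrative, and the per-packet size is simply the value implied by those numbers rather than a figure stated in the paper.

```python
NUM_CAMERAS = 4
PER_CAMERA_MBPS = 30
DOWNLINK_RATE_BPS = 1_000_000
DOWNLINK_PACKETS_PER_S = 30


def uplink_cap_mbps(cameras: int = NUM_CAMERAS, per_camera: int = PER_CAMERA_MBPS) -> int:
    """Maximum uplink application rate used for the full-buffer TCP stream."""
    return cameras * per_camera


def downlink_packet_bytes(rate_bps: int = DOWNLINK_RATE_BPS,
                          packets_per_s: int = DOWNLINK_PACKETS_PER_S) -> float:
    """Payload per feedback packet implied by the constant-bit-rate downlink."""
    return rate_bps / packets_per_s / 8
```

Under these numbers, each per-frame feedback message carries roughly 4 kB of detection results.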

D. SIMULATION RESULTS
As mentioned above, we performed end-to-end wireless simulation of an end user walking through four sites in Manhattan, near Lighthouse Guild, Midtown, Herald Square, and NYU Langone, as shown in Fig. 9. As one example, Fig. 11 depicts the route of the user (UE) for the Herald Square site. The UE starts walking from the north-west corner in the figure and moves south-east across 34th Street (yellow line in the picture). At this site, two base stations are present, both with 4G-LTE and 5G. Fig. 12(a) shows the SINR over time for this scenario for the two mmWave base stations (gNB 1 and gNB 2), as well as the maximum SINR for the two LTE cells. We see that, in this case, the LTE SINR is continuously high (mostly >40 dB) due to the favorable propagation of the lower frequency (1.9 GHz) carrier. In contrast, there is a period from approximately 80 to 170 seconds where the SINR from both mmWave cells is low. This time period corresponds to a segment of the route where the UE is in Non-Line-of-Sight (NLOS) to both mmWave cells.
The resulting TCP end-to-end throughput is shown in Fig. 12(b). We see that the LTE rate is more continuously available. However, the maximum rate is limited to ∼36 Mbps corresponding to the maximum modulation and coding scheme (MCS) with a 10 MHz bandwidth. In contrast, the mmWave system can obtain the full 120 Mbps rate, but the rate falls below the LTE rate during some parts of the NLOS segment.
The simulation results for each of the four scenarios have been aggregated and are depicted in Fig. 13. Fig. 13(a) plots what we will call the delay-constrained throughput, which is calculated as follows. We divide the time into intervals of T seconds, where 1/T = 30 Hz corresponds to the expected video frame rate (T is the frame interval). For each TCP packet transmitted in the interval, we measure its uplink delay from the UE to the edge server application. We also measure the downlink delay for each feedback packet transmitted in that interval. Note that these delays contain all the air-link and core network delays. For a given delay constraint D_max, we define the delay-constrained throughput as b/T, where b is the number of uplink bits transmitted in the interval T for which the uplink + downlink delay ≤ D_max. The delay-constrained rate is computed separately for the LTE and mmWave systems. Since mmWave systems are always deployed with a sub-6 GHz fallback carrier, we also estimate the rate of the mmWave+LTE system as the maximum of these two rates under the same delay constraint. Fig. 13(a) plots the delay-constrained rates for the LTE and mmWave+LTE systems under delay constraints of D_max = 30, 40 and 50 ms.
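The definition of the delay-constrained throughput can be made concrete with a short sketch. The packet record layout and function names below are hypothetical, and the sketch assumes that each uplink packet has already been paired with a downlink delay representative of its interval, since the text does not specify how uplink and feedback packets are matched.

```python
def delay_constrained_throughput(packets, d_max_s, interval_s=1.0 / 30.0):
    """Compute b/T per interval, counting only bits that meet the delay constraint.

    packets: iterable of (tx_time_s, size_bits, ul_delay_s, dl_delay_s)
    Returns a dict mapping interval index -> throughput in bits per second.
    """
    bits_per_interval = {}
    for tx_time_s, size_bits, ul_delay_s, dl_delay_s in packets:
        k = int(tx_time_s // interval_s)
        bits_per_interval.setdefault(k, 0)  # keep intervals with no qualifying bits
        if ul_delay_s + dl_delay_s <= d_max_s:
            bits_per_interval[k] += size_bits
    return {k: b / interval_s for k, b in sorted(bits_per_interval.items())}


def combine_with_fallback(mmwave_rates, lte_rates):
    """mmWave+LTE estimate: per-interval maximum of the two delay-constrained rates."""
    keys = sorted(set(mmwave_rates) | set(lte_rates))
    return {k: max(mmwave_rates.get(k, 0.0), lte_rates.get(k, 0.0)) for k in keys}
```

The empirical distribution of the per-interval values returned by these functions is the kind of quantity plotted in Fig. 13(a) for each value of D_max.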
We see in Fig. 13(a) that the LTE system achieves a peak uplink rate of approximately 36 Mbps, and attains over 20 Mbps more than 90% of the time. However, this rate is only achievable with a delay constraint of D_max = 50 ms. At a tighter delay constraint of D_max = 40 ms, the delay-constrained rate of approximately 36 Mbps is supported less than 80% of the time; at D_max = 30 ms, there are virtually no LTE data within this delay constraint. In contrast, the mmWave 5G system coupled with LTE can obtain the peak rate of 120 Mbps at least 40% of the time, even under a delay constraint of D_max = 30 ms. However, for approximately 25% of the time, the delay-constrained rate of the mmWave+LTE system is similar to that of the LTE system, because mmWave coverage is not always available and the UE then falls back to the LTE carrier.
To understand the delay differences, Fig. 13(b) plots the CDF of the delays of packets without any delay constraint. We see that for the mmWave system, the minimum delay is ≈ 15 ms, which includes two times the core network delay of D_core = 5 ms along with an additional 5 ms for the transmission of the uplink and downlink data. The LTE packet delays are generally higher. Although the core network delay is assumed to be the same, the LTE frame structure as well as the lower throughput result in a higher air-link delay.
Finally, it is useful to compare these results with the URLLC requirements of 5G. As mentioned in the Introduction, the URLLC design goal is to achieve air-link latencies of 1 to 10 ms [40], [41]. When the UE has 5G mmWave connectivity, we see a median delay of 15 ms, which is consistent with an air-link round-trip latency of 5 ms along with our assumed core network delay of 5 ms each way. However, our study also includes blockage and environments where the 5G coverage is not uniformly available. As a result, the   UE must occasionally fall back to LTE links where the delay is higher and the bandwidth cannot sustain the peak rates. In these cases, the overall delay grows significantly, beyond the 5G URLLC levels.

V. PERFORMANCE EVALUATION

A. OVERVIEW
We now combine the video processing requirements in Section III with the wireless simulation results in Section IV to assess the potential benefits of wireless offloading. We consider three scenarios for edge connectivity: (1) Local only, where all the processing is performed on the wearable; (2) LTE only, where edge processing can be accessed by the LTE link; and (3) mmWave+LTE, where edge processing can be accessed by LTE or mmWave, whichever has the highest rate. For each such scenario, we consider different possible video options in terms of the number of cameras, spatial resolution, and bit rate. We can then use the wireless analysis in Section IV to assess the percentage of time such video options would be available within a delay budget close to the target of 100 ms. Although there are a large number of possible video configurations, in the sequel we will focus on the options in Table 6, as these provide a good demonstration of the capabilities of the system. The table also highlights some of the key values in red, orange, and green to draw attention to performance that is relatively poor, medium, or good, respectively. We also examine an adaptive offloading strategy, which switches between edge and local computing and furthermore adapts the video resolution based on the wireless link throughput when edge computing is chosen. The remaining subsections describe the details of these options and their analysis.

Notes to Table 6:
† The average wmAP and AP presented are for a total delay of ≤ 100 ms. These numbers are 53.9% and 66.0%, respectively, for a total delay of ≤ 150 ms.
There are many other feasible configurations, including (1) processing 1080P video locally for increased detection accuracy, at a total delay of 211 ms; and (2) sending both views of each stereo camera, or one view plus a depth map, with an increased uplink rate and reduced availability.

B. VIDEO CONFIGURATIONS
In Table 6, for edge computing with mmWave or LTE connectivity, we have considered the case where the video from each monocular camera would be delivered at 1080P spatial resolution and 30 Hz temporal resolution, at 26 Mbps. Based on the video analysis in Section III, this configuration provides a high object detection accuracy and good detection range; see the wmAP and detection range rows in Table 6. Going beyond 1080P resolution and 26 Mbps brings only very slight gains, yet processing 2.2K video consumes substantially more computation time.
For the local processing scenario, we have considered only WVGA and 720P. With local processing, the inference time for higher spatial resolution 1080P (see Table 4) would substantially exceed the delay budget. As shown in Table 6, this lower resolution results in both a lower object detection accuracy and reduced object detection range.

C. DELAY ANALYSIS
The delay computations in Table 6 consider four components:
• Video frame delay, which is the interval of one video frame (i.e., the inverse of the frame rate). For edge computing, the video frame interval needs to be considered because an object may appear at any time between two adjacent frames.
• Video encoding delay, which is the time to encode the video for edge computing.
• Round-trip time (RTT), which is the total time to transmit the packets from the wearable to the edge server and back.
• Inference time, which is the time for the object detection network (either local or edge) to compute the detection results.
In our analysis, the video frame delay, encoding delay, and inference time are fixed. The only variable component is the RTT. Table 6 shows the median RTT and the corresponding median total time. We see that mmWave+LTE offers a dramatically lower median RTT of 15 ms relative to the median RTT of 37 ms with LTE only. Recall that the RTT includes 2 D_core = 10 ms of assumed delay from the base station (gNB or eNB) through the core network to the edge server and back. Note that with direct local processing, only the frame delay and inference time are required since there is no video encoding or communication.
As suggested in [21], to avoid detection mismatch, we will assume that the local processor runs a simple object tracking algorithm that predicts the current locations of the objects detected in the last frame for which the edge server detection results were fed back. For example, if the total delay for edge processing is twice the frame interval, then at the time when frame t is captured, the detection results for frame t − 2 will be used as the reference, and the motion between frame t − 2 and frame t will be used to predict the locations of these detected objects in frame t. In the meantime, frame t will be delivered to the edge server, with its processing results to be fed back at the time when frame t + 2 is captured. The motion vectors between frames generated for video compression can be leveraged for local object tracking. Such an approach should be able to track small movements of previously detected objects within a few frames, while also reporting any newly appearing objects within the 100 ms delay.
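A minimal sketch of this prediction step is given below. It assumes a purely translational motion model in which each detected box is shifted by the accumulated per-frame displacement; the function names, the box format `(x, y, w, h)`, and the way a displacement is obtained for each box (for example, by averaging the codec motion vectors inside it) are illustrative assumptions, not the method of [21].

```python
import math


def frames_of_lag(total_delay_ms: float, frame_interval_ms: float = 1000.0 / 30.0) -> int:
    """Number of frames by which the fed-back detections lag the current frame."""
    return math.ceil(total_delay_ms / frame_interval_ms)


def predict_box(box, motions):
    """Shift a box (x, y, w, h) by the accumulated per-frame displacements.

    motions: list of (dx, dy) displacements, one per elapsed frame, e.g. the
    average compression motion vector inside the box (an assumed input).
    """
    x, y, w, h = box
    dx = sum(m[0] for m in motions)
    dy = sum(m[1] for m in motions)
    return (x + dx, y + dy, w, h)
```

With a total delay of two frame intervals, `frames_of_lag` returns 2, so the detections for frame t − 2 would be shifted by the two intermediate displacements to obtain predicted boxes for frame t.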

D. SUPPORTABLE VIDEO STREAMS UNDER DIFFERENT DELAY CONSTRAINTS
From the wireless simulation results in Fig. 13, we derive the probability of supporting one or more camera streams under different round-trip delay constraints. These are plotted as heat maps in Fig. 14. Fig. 14(a) shows that, with multi-connectivity using both mmWave and LTE links, we can support one video stream over 75% of the time, and two streams over 73% of the time, under the round-trip delay constraint of 30 ms. All four cameras can be supported 65% of the time. Furthermore, we can support one and four cameras with high availability (91% and 67%, respectively) if the delay constraint is relaxed to 40 ms. On the other hand, with the LTE link only (Fig. 14(b)) and a round-trip delay of 40 ms, the availability for supporting one and four cameras drops to 79% and 0%, respectively, since the peak LTE rate is approximately 36 Mbps. The LTE link can sustain one camera with high probability (93%) only if the delay constraint is relaxed to 50 ms.
For any configuration, we can also compute the probability that the total delay will meet a certain delay target. For example, from Section II-D, the estimated total delay requirement is 100 ms. This is not met by local processing.
When using offloading, the video frame delay, encoding delay, and inference take a total of 33 + 17 + 19 = 69 ms, so there would be D_max = 100 − 69 = 31 ms for the RTT. Similarly, if we relax the total delay requirement to 150 ms, the communication delay constraint would be D_max = 150 − 69 = 81 ms. The availability numbers listed in the final two rows of Table 6 are the percentages of time the delay-constrained throughput meets the minimum uplink data rate requirement at an RTT of 31 and 81 ms, respectively. The availability percentages for these two delay constraints are not shown in Fig. 14, but we have extracted the numbers from the wireless simulation results in a similar manner. Note that the availability for an RTT of 31 ms is practically the same as for 30 ms. Using Fig. 14, one can also find the corresponding availability for total delays of 110 ms (RTT of 40 ms) or 120 ms (RTT of 50 ms).
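The delay budget arithmetic above can be written out as a short sketch. The three fixed components and the median RTTs of 15 ms and 37 ms are taken from the text; the function names are illustrative, and the median totals in the usage note are simply the fixed components plus the median RTT.

```python
# Fixed delay components for edge processing, in milliseconds (from the text).
FIXED_EDGE_DELAYS_MS = {"frame": 33, "encode": 17, "inference": 19}


def rtt_budget_ms(total_budget_ms: int, fixed=FIXED_EDGE_DELAYS_MS) -> int:
    """RTT constraint D_max left after subtracting the fixed components."""
    return total_budget_ms - sum(fixed.values())


def total_delay_ms(rtt_ms: int, fixed=FIXED_EDGE_DELAYS_MS) -> int:
    """End-to-end delay for a given round-trip time."""
    return sum(fixed.values()) + rtt_ms
```

Here `rtt_budget_ms(100)` gives 31 ms and `rtt_budget_ms(150)` gives 81 ms, while `total_delay_ms(15)` and `total_delay_ms(37)` give 84 ms and 106 ms for the median mmWave+LTE and LTE-only RTTs, respectively.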
In practice, given the limited total throughput, it may be better to upload only the front-facing camera stream at a high rate (for the highest detection accuracy), and use a lower rate for the other cameras (side and back facing) to enhance situational awareness. As an example, Fig. 14(c) shows the probability of supporting one camera at the full rate of 26 Mbps and additional cameras at 10 Mbps each. In this case, with mmWave+LTE connectivity and a total delay of 100 ms, we could increase the probability of supporting four cameras from 65% to 72%.
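The availability figures discussed in this subsection amount to asking how often the delay-constrained throughput reaches the aggregate rate needed by a given camera configuration. A minimal sketch follows; the function names and the sample throughput values in the usage note are hypothetical, while the 26 Mbps full rate and 10 Mbps reduced rate come from the text.

```python
def required_rate_mbps(n_cameras: int, full_mbps: float = 26.0, reduced_mbps=None) -> float:
    """Aggregate uplink rate for n cameras.

    With reduced_mbps=None, every camera uses the full rate; otherwise one
    (front-facing) camera uses the full rate and the rest use the reduced rate.
    """
    if n_cameras < 1:
        raise ValueError("need at least one camera")
    if reduced_mbps is None:
        return n_cameras * full_mbps
    return full_mbps + (n_cameras - 1) * reduced_mbps


def availability(rates_mbps, required_mbps: float) -> float:
    """Fraction of intervals whose delay-constrained rate meets the requirement."""
    rates = list(rates_mbps)
    return sum(r >= required_mbps for r in rates) / len(rates)
```

Four full-rate cameras need 104 Mbps, whereas one full-rate camera plus three reduced-rate cameras need only 56 Mbps, which is why the mixed configuration is available a larger fraction of the time.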

E. ADAPTIVE OFFLOADING
Given the variability in availability, particularly in mmWave, it is natural to consider a strategy that adaptively selects the rate, number of cameras, and whether to use local or remote processing based on the available uplink bandwidth. For example, as a simple strategy, when the available throughput is ≥ 26 Mbps, we can transmit one or more cameras. When the throughput is lower than 26 Mbps, we can still offload the video for edge computing, but at lower rates, enabling adaptive switching between camera number and video quality based on bandwidth constraints and functional need (refer to Fig. 14 as an example). From Fig. 4, when the throughput is between 6 Mbps and 26 Mbps, the system should deliver the video at 1080P but at a lower rate, leading to proportionally lower detection accuracy. When the rate is between 1 Mbps and 6 Mbps, the system should upload the video at 720P resolution. In the "adaptive" column of Table 6, the detection accuracy is derived by assuming an average detection accuracy of 51.5% and 63.2% for multi-object and person, respectively, when the bit rate is between 6 and 26 Mbps (which occurs 2% of the time with a delay constraint of 30 ms, from Fig. 13(a)); and an accuracy of 41.1% and 49.7%, respectively, when the bit rate is between 1 Mbps and 6 Mbps (which occurs 1% of the time). The overall availability for adaptive offloading under a total delay of 100 ms is the probability that the throughput is ≥ 1 Mbps at an RTT constraint of 30 ms.
Under the relaxed total delay constraint of 150 ms, when the throughput is between 10 Mbps and [...] (with a probability of 2.2%), the system should still upload the 1080P video. When the throughput is below 10 Mbps (with a probability of 0.8%), the wearable could locally process the uncompressed video at 720P resolution for better detection performance (see Fig. 4). Therefore, the availability is 100%. The average accuracy would be 53.92% and 66.00% for multi-object and person, respectively.
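The simple threshold strategy for the 100 ms case can be sketched as a small decision function together with a probability-weighted accuracy average. The thresholds (1, 6, and 26 Mbps) come from the text; the function names, the tuple format, the behavior below 1 Mbps (returning a local-processing marker without a resolution), and the example probabilities in the usage note are assumptions for illustration.

```python
def choose_configuration(throughput_mbps: float):
    """Return (where, resolution, video_rate_mbps) for one camera stream."""
    if throughput_mbps >= 26.0:
        return ("edge", "1080P", 26.0)             # full-rate 1080P
    if throughput_mbps >= 6.0:
        return ("edge", "1080P", throughput_mbps)  # 1080P at a reduced rate
    if throughput_mbps >= 1.0:
        return ("edge", "720P", throughput_mbps)   # lower resolution
    return ("local", None, None)                   # offloading unavailable (assumed fallback)


def expected_accuracy(regimes) -> float:
    """Probability-weighted mean accuracy over (probability, accuracy) pairs."""
    total_p = sum(p for p, _ in regimes)
    if total_p <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(p * a for p, a in regimes) / total_p
```

For example, `expected_accuracy([(0.5, 60.0), (0.5, 40.0)])` evaluates to 50.0; applying the same weighting to the per-regime accuracies and occurrence probabilities is how an average such as the one in the "adaptive" column of Table 6 would be obtained.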

VI. CONCLUSIONS AND FUTURE WORK
Mobile edge computing coupled with the high data rate capabilities of mmWave holds significant promise for accessing powerful video analytics by wearable devices.
We have assessed the feasibility of such capabilities for an advanced smart wearable with multiple high-resolution cameras, where the wireless and video requirements are particularly demanding. Several new elements were required in the analysis, including developing a large labeled video dataset, evaluating object detection algorithms at variable resolutions and bit rates, and performing detailed, high-accuracy wireless simulations with ray tracing. Overall, the wireless simulations provide a high level of realism and can identify the key limitations in high-data-rate edge computing. These tools can be applied in other applications and may prove valuable as video processing and spatial intelligence become more widely used in mobile scenarios.
For the VIS4ION application, our simulation results suggest that, at bandwidths and loading similar to current deployments, systems in traditional sub-6-GHz bands combined with low-delay mobile edge computing can provide gains by offloading camera data a large fraction of the time, improving the accuracy and detection range. However, meeting the end-to-end delay requirement of 100 ms is challenging. The mmWave bands can reduce the delay to meet this requirement and provide additional capabilities, including multiple cameras at high resolution. However, due to blockage and the limited range of mmWave signals, the peak performance is not uniformly available at typical cell-site densities. Thus, fallback to lower-frequency carriers and local processing, combined with adaptation in the video resolution and number of camera streams, will be required.
In the current work, we have abstracted this adaptation by assuming that the number of cameras and their bit rate can be adjusted to the available throughput. An obvious line of future work is to actually simulate a particular adaptive video application over wireless links and assess its performance. Additionally, we have simply relied on the pretrained YOLO network that was trained on uncompressed low-resolution images. A second line of work is to train multiple detection networks for different resolutions and bit rates, or a single network that can perform well across resolutions and bit rates. More ambitiously, one can also consider new compression schemes that are trained end-to-end with object detection accuracy as the goal, as opposed to the standard compression algorithms, which are optimized for image reconstruction.
Dr. Porfiri is the recipient of the National Science Foundation CAREER Award, the Outstanding Young Alumnus Award by the College of Engineering, Virginia Tech, the American Society of Mechanical Engineers (ASME) Gary Anderson Early Achievement Award, the ASME DSCD Young Investigator Award, and the ASME C.D. Mote, Jr. Early Career Award. His other significant recognitions include invitations to the Frontiers of Engineering Symposium and the Japan-America Frontiers of Engineering Symposium organized by the National Academy of Engineering. He has served on the Editorial Board of the ASME Journal of Dynamics Systems, Measurements and Control, the ASME Journal of Vibrations and Acoustics, Flow, the IEEE CONTROL SYSTEMS LETTERS, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, and Mechatronics. He is a Fellow of the ASME.
TODD E. HUDSON is an Assistant Professor of Rehabilitation Medicine at New York University's Grossman School of Medicine, holding cross-appointments in Neurology, and also in the Department of Biomedical Engineering at the New York University Tandon School of Engineering.
Prof. Hudson is a Computational Neuroscientist whose research focuses on modeling sensory and motor systems, particularly with regard to eye and arm movements in healthy and disease states, movement planning, and spatial orientation. He received his PhD from Columbia University, and has authored over 50 peer-reviewed articles, as well as the textbook 'Bayesian Data Analysis for the Behavioral and Neural Sciences' from Cambridge University Press.
WILLIAM SEIPLE received the B.S. and M.S. degrees in psychology from the Albright College, Reading, PA and the University of North Carolina, Greensboro, NC and the Ph.D. degree in zoology from the University of Illinois, Urbana, IL.
He is the Chief Research Officer at Lighthouse Guild, Research Professor of Ophthalmology at New York University School of Medicine, and Adjunct Faculty at the Institut de la Vision, Paris. He is the author of two book chapters and more than 150 peer-reviewed articles.
His research interests include development and assessment of functional interventions for people with vision loss, rehabilitation training, and visual electrophysiology and psychophysics of vision.
Dr. Rizzo is currently a physician-scientist at NYU Langone Medical Center's Rusk Rehabilitation, where he serves as vice chair of Innovation and Equity for Physical Medicine and Rehabilitation with cross-appointments in the Department of Neurology and the Departments of Biomedical & Mechanical and Aerospace Engineering at NYU-Tandon School of Engineering. He is also the Associate Director of Healthcare for the NYU Wireless Laboratory in the Department of Electrical and Computer Engineering at NYU-Tandon. He leads the Visuomotor Integration Laboratory (VMIL) and the REACTIV Laboratory (Rehabilitation Engineering Alliance and Center Transforming Low Vision), where his team focuses on assistive technology for the visually impaired and benefits from his own personal experiences with vision loss. He is the author of 10 book chapters, more than 80 peer-reviewed articles and many poster presentations. His research interests lie within the realm of neurorehabilitation, assistive technologies, and health equity.
Dr. Rizzo was awarded the prestigious Crain's 40 under 40 award in New York Business for his medical devices, including his wearable technology. He has also been featured in several lay articles, videos, and press releases. In 2018, he was a highlighted speaker in NYU's TEDx "Re-Vision" Series. He is a member of the American Medical Association, American College of Physicians, American Academy of PM&R (AAPM&R), Association of Academic Physiatry (AAP), and American Heart Association.