
Infrastructure Enabled Guided Navigation for Visually Impaired



Abstract:

Despite advancements in navigation-assistive technology, independent outdoor traveling remains challenging for individuals with vision loss due to uncertain information. We present an outdoor navigation assistive system that collaborates with infrastructure to address these limitations. Our system includes an RGB-D inertial sensor, a GPS sensor, a Jetson Orin NX 16GB, and bone-conduction headphones. It improves localization accuracy through semantic segmentation, depth map enhancement (Depth-Decoder), and a prior map. The proposed semantic segmentation convolutional neural network (CNN) is designed to operate on low-compute devices, achieving competitive performance with 71.2 mean Intersection over Union (mIoU) on Cityscapes and 74.46 mIoU on CamVid. It operates at 209 frames per second (fps) at 480×848 image resolution on a GTX 1080 desktop GPU and 50 fps on the low-compute Jetson. Additionally, our Depth-Decoder enhances raw depth maps using a geometrical loss with semantic segmentation constraints and structure-from-motion. The Depth-Decoder demonstrated an average deviation of 0.66 m, compared to 2.23 m for raw depth maps, for landmarks within 10 m. Notably, the outcomes of both CNNs are inferred from a single RGB image. Two visually impaired individuals assessed the technology. Participant S001 completed 6/8 trials and S002 completed 7/7 trials. Infrastructure collaboration is demonstrated by using existing prior maps and by dynamically adjusting paths based on real-time information from sensing nodes to avoid collisions with oncoming agents not visible to the wearable camera. This research provides fundamental design principles for future outdoor navigation assistive technologies, offering insights into addressing challenges for people with visual impairment and other vulnerable road users.
Page(s): 6764 - 6777
Date of Publication: 21 February 2025



SECTION I.

Introduction

Navigation-assistive apps operating on mobile phones enhance the mobility of individuals with visual impairments [1], [2], [3], [4]. Such apps can provide location information (Google Maps), information about surroundings (BlindSquare), and aid with specific tasks such as street crossing (OKO AI Copilot for the Blind). These capabilities augment traditional aids. A long cane serves as an efficient tool for avoiding collisions, maintaining a straight heading, and obtaining nearby information through sound reflections [5], [6]. Some people who are blind and visually impaired (BVI) use guide dogs for enhanced mobility, but adoption among BVI is only about 2% [7], suggesting that for most, the responsibilities and costs of caring for a dog outweigh the benefit provided [8]. Therefore, apps combined with a long cane or guide dog allow BVI to increase their independent mobility. Despite these advances, significant limitations remain. Dynamic information and precise localization are still not possible with commercial GPS (accuracy is about 5 m). Important information related to transportation, such as the location of a specific bus or car, requires local sensing and precision guidance. Wearable navigation systems with cameras and processors can help detect and recognize obstacles even at greater distances and guide users through traversable areas in complex environments. Thus, while conventional aids provide useful but limited support for navigation tasks, modern computer vision and robotics technology can add significant additional benefit [1], [2], [3], [4], [9], [10], [11], [12].

We have designed a novel outdoor navigation assistive system aimed at providing improved navigation assistance for BVI, particularly in outdoor urban environments where a user may need to rendezvous with a shuttle or car. The system guides BVI to their destinations through a collaborative interface with autonomous vehicle infrastructure. This technology anticipates a future in which infrastructure for autonomous vehicles is accessible to pedestrians and interconnected through smart city infrastructure. Our system uses a head-worn color-depth (RGB-D) inertial camera, bone-conduction headphones, a GPS sensor, and a Jetson Orin NX 16 GB (NVIDIA) that processes perception, localization, path planning, and audible feedback cues in real time. We developed a multi-task network that takes the RGB-D data as input and simultaneously performs semantic segmentation and depth map enhancement from raw depth maps, which allows the system to detect in-path obstacles and provide navigation assistance to the user to avoid collisions with those obstacles. We chose a head-worn camera because it provides more natural feedback regarding pose and direction of movement, is hands-free, and allows more flexible movement than a smartphone camera held in the hand or worn on the torso. The head-worn camera, however, can only detect objects within a clear line of sight, while other relevant objects may not be viewable. For example, a shuttle on another street will be blocked from view by buildings. To obtain precise location information for such objects, our system is connected with the autonomous systems infrastructure available in the Mcity facility (https://mcity.umich.edu/).

We conducted a pilot user study with two participants who are BVI to evaluate specific functionalities and the overall user experience of our system (see Fig. 1). The study focused on the user’s ability to adapt to our navigation assistive system, on hesitancy and confusion during navigation with the system, on design intuitiveness for future mobility systems for pedestrians and infrastructure, and on possible cooperation with infrastructure. Two participants with severe visual impairment traversed five different routes. The locations of a virtual shuttle and of moving objects (e.g., pedestrians) were obtained from the Mcity infrastructure network. The participants followed cues generated by the wearable system to reach the shuttle location.

Fig. 1. The prototype system being used by a visually impaired person for navigation guidance. It shows the RGB-D inertial sensor mounted on the head, a bone-conduction headphone for guidance cues, a GPS receiver on the shoulder, and the Jetson Orin NX 16GB embedded computer with batteries inside the backpack.

Our three main contributions are listed below:

  • We demonstrate a novel outdoor navigation assistive technology that is able to collaborate with infrastructure designed for autonomous vehicles. This prototype system can be the basis of future navigation assistive systems.

  • We developed a multi-modal convolutional neural network (CNN) that simultaneously performs semantic segmentation (Semantic-Decoder) and depth map enhancement (Depth-Decoder) from raw depth maps. Our semantic segmentation performance is comparable to other CNNs that pursue a lightweight design, and the developed modules can be easily adapted to other CNN architectures. The Depth-Decoder achieves higher accuracy than raw depth maps. Notably, our multi-modal model uses a single RGB input, which enables efficient operation on low-compute devices.

  • Through a user study, we support the concept of a future navigation assistive technology that includes wearable components and augmentation through autonomous systems infrastructure.

SECTION II.

Related Work

Navigation assistive technology for the visually impaired has made significant strides in recent years, with smartphone applications playing a key role in providing essential assistance. These applications offer features such as GPS navigation, voice guidance, and tactile feedback to help individuals with visual impairments navigate outdoor environments with greater independence and confidence. By leveraging the capabilities of smartphones, these tools have become increasingly accessible and widely used, empowering users to overcome navigation challenges and engage more actively in their daily lives. Aira [13], for example, provides real-time guidance by connecting the user to a trained remote human agent who assists the visually impaired person through the app. BlindSquare [14] and NaviBlind [15] use high-precision GPS and smartphones for navigation and require a human agent to plan a proper path for the user before traveling. There are also Bluetooth Low Energy (BLE) beacon-based navigation systems [1], [16], [17] that have been shown to be very effective for indoor environments such as airports and shopping malls. DeepNavi [4] proposed a smartphone-based CNN for real-time object detection and navigation. These solutions provide approximate location information that benefits users, but they often rely on a human agent or are limited to a small indoor area.

Researchers have also developed navigation assistance robots and drones. Cabot [9] is a suitcase-shaped autonomous navigation robot that uses a lidar, stereo cameras, and a local floor plan for localization, point-of-interest notifications, and path planning. The robot conveys guidance cues through a haptic handle and was demonstrated in a university-building user study. Alternatively, drone-based navigation robots [18], [19] can also be a solution. However, navigation assistive robots and drones might not be preferred due to safety, noise, lack of control, or large size, which can be impractical. Cane-based robots (a.k.a. smart canes) [20], [21], [22], [23] help users avoid collisions and change direction but add weight and interfere with existing cane skills. Therefore, modifying the long cane may not be ideal for navigation assistance.

Wearable systems have been studied for decades as navigation alternatives for people with vision loss [11], [12], [24], and advances in hardware and software have made them increasingly feasible. A system using a portable GPU and custom glasses with an integrated camera and headphones [25] has shown navigation benefits. Researchers have used body-worn cameras that process visual information via a laptop or cloud server [26], [27], [28], [29] for navigation assistance. However, standalone wearable systems cannot detect obstacles beyond their field of view, and using cloud servers [30] can introduce latency, which is problematic for real-time navigation. Moreover, many systems do not address initial user alignment, requiring multiple calibration steps, which can create a hazard for the user.

Vehicle-to-Everything (V2X) technology enhances safety and efficiency by enabling communication between vehicles and their surrounding environment. Significant research has been conducted on V2X, particularly in the areas of protocols, architectures, and supporting technologies [31], [32], [33], [34], [35]. It is well established that V2X communication improves the safety of vulnerable road users by sharing objects’ GPS locations [36], [37], [38], [39]. However, the majority of studies have primarily focused on the development of efficient V2X protocols, networks, and services. In this work, we demonstrate the concept of a novel outdoor navigation assistive technology that collaborates with autonomous vehicle infrastructure, which can integrate V2X technology, with a particular focus on scenarios involving people with visual impairment.

SECTION III.

Method

A. System Architecture

In this work, we use an RGB-D inertial sensor mounted on the head of the visually impaired user and an embedded computer (see Table I for specifications of the prototype system) that can communicate with the infrastructure sensing system (IX-node) [40], [41] typically installed to support autonomous vehicles. The diagram in Fig. 2 illustrates the framework of our system, which performs three main operations to efficiently guide users: (i) perception, (ii) localization, and (iii) path planning and guidance. Perception and localization, including initial orientation toward the proper path, are executed using deep learning and sensor fusion, and appropriate feedback is provided via a bone-conduction headphone. The system uses autonomous vehicle infrastructure in two ways: (i) the prior map used for localization is generated from an autonomous vehicle, and (ii) the system receives real-time information about dynamic obstacles and autonomous vehicles from the IX-node [42].
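To make this three-stage flow concrete, the following is a minimal sketch of a per-frame processing loop; the class and method names (Perception, Localizer, Planner, IXNodeClient, AudioFeedback) are illustrative placeholders rather than the authors' implementation.

```python
# Minimal sketch of the per-frame loop: perception -> localization ->
# path planning/guidance, with IX-node data folded into planning.
# All component objects are placeholders injected by the caller.
class NavigationSystem:
    def __init__(self, perception, localizer, planner, ix_client, audio):
        self.perception = perception   # semantic segmentation + depth enhancement
        self.localizer = localizer     # prior-map matching with GPS/IMU fusion
        self.planner = planner         # D* Lite planning and cue selection
        self.ix_client = ix_client     # real-time obstacles from the IX-node
        self.audio = audio             # bone-conduction guidance cues

    def step(self, rgb, raw_depth, gps, imu):
        semantics, depth = self.perception.infer(rgb, raw_depth)
        pose = self.localizer.update(semantics, depth, gps, imu)
        obstacles = self.ix_client.latest_obstacles()
        cue = self.planner.plan(pose, semantics, obstacles)
        self.audio.say(cue)            # e.g., "forward", "veer left", "stop"
        return cue
```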

TABLE I. Our System Hardware and Software Specifications

Fig. 2. System architecture: The data stream comes from an Intel RealSense D435i and a SparkFun GPS. “AV” is an autonomous vehicle representing a shuttle, and the IX-node is a router for smart mobility systems that helps autonomous vehicles.

B. Perception

Perception is an essential part of our system because its outcome affects the accuracy of localization and path planning. Within the perception module, semantic segmentation and improved depth maps are obtained and applied to enhance the localization process during navigation. Semantic segmentation is challenging, especially under constraints such as real-time operation and low-compute hardware, which are critical for mobile robots. Lightweight semantic segmentation methods have emerged recently to address these challenging requirements. BiSeNetv2 [44], an evolution of BiSeNet [45], MGSeg [46], MSCFNet [47], DDPNet [48], SGCPNet [49], FBSNet [50], and PMSIEM [51] have been designed to achieve real-time performance through dual or multiple pathway architectures, efficient convolution layers, and fusion modules operating on feature maps at variable scales. Although these lightweight semantic networks achieve real-time performance, semantic segmentation is only part of the information required for robust navigation. We also need accurate depth maps, because the raw depth obtained from the RGB-D inertial sensor is generally very noisy and incomplete. We therefore built a new multi-task network based on our prior work [52], so that we can perform semantic segmentation and depth map enhancement through a single network that shares a common backbone.

1) CNN Model Architecture:

The proposed CNN architecture for the perception module is illustrated in Fig. 3. Our multi-modal CNN is composed of a Semantic-Depth encoder, a Semantic-Decoder (see Fig. 4), a Depth-Decoder (see Fig. 5), and PoseNet, which handle semantic segmentation, depth map enhancement, and relative pose estimation, respectively (see Figs. 4, 5, and 6). Semantic segmentation and depth map enhancement share the common backbone, and each decoder infers its respective output. PoseNet supports depth map enhancement by estimating relative poses between adjacent frames (see Fig. 6). Its backbone is the same as the Semantic-Depth encoder except for the last layer. During training, all CNNs are trained with three sequential images and ground truth. During navigation experiments, however, a single RGB image is used to obtain the semantic segmentation result and the enhanced depth map, and PoseNet is not operational when the system runs.
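As an illustration of the shared-backbone, dual-decoder idea (not the paper's exact layer configuration), a minimal PyTorch sketch could look like the following; channel widths and module names are assumptions.

```python
# Illustrative skeleton of a shared-backbone multi-task network: one encoder
# feeds a semantic head and a depth head. Layer sizes are placeholders.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.stem = nn.Sequential(  # "Stem": conv - batch norm - ReLU
            nn.Conv2d(3, ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.stage = nn.Sequential(
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(ch * 2), nn.ReLU(inplace=True))

    def forward(self, x):
        low = self.stem(x)          # 1/2 resolution features
        high = self.stage(low)      # 1/4 resolution features
        return low, high

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.encoder = SharedEncoder()
        self.semantic_head = nn.Conv2d(64, num_classes, 1)                  # Semantic-Decoder stand-in
        self.depth_head = nn.Sequential(nn.Conv2d(64, 1, 1), nn.Sigmoid())  # Depth-Decoder stand-in

    def forward(self, rgb):
        _, feats = self.encoder(rgb)
        sem = self.semantic_head(feats)   # per-pixel class logits
        disp = self.depth_head(feats)     # normalized disparity/depth
        return sem, disp

# Single RGB image at inference, as in the paper's deployment setting:
# sem, disp = MultiTaskNet()(torch.randn(1, 3, 480, 848))
```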

Fig. 3. The overall architecture of our multi-modal CNN: It shows the inputs for training, which consist of three sequential RGB images; only a single RGB input is used for inference. The elements highlighted in red indicate the input and outputs used for inference. The images by the Semantic-Decoder are predictions and ground truth. The images next to the Depth-Decoder are predictions and raw depth maps. Outcomes of the pose estimation backbone are used in ${\mathcal {L}}_{pose}$.

Fig. 4. Semantic segmentation CNN architecture: “Stem” consists of convolution-batch normalization-ReLU as described in [43]. “C” is a “Stem” with stride 2. “CC” is concatenation. “CW” is a simple convolution layer. The abbreviations “MPD”, “MPDU”, and “MPDS” are described in Figs. 7 and 8. The outputs of the colorized blocks are inputs of the “Dilated Fusion” blocks of our Depth-Decoder shown in Fig. 5.

Fig. 5. Depth-Decoder architecture: “CS” is a convolution-sigmoid layer. The abbreviations “MPDU”, “MPDU-S” and the “Dilated Fusion” block are described in Figs. 7 and 8.

Fig. 6. PoseNet architecture: PoseNet is an auxiliary network used to train the Depth-Decoder and is used only during training.

We focus on minimizing the number of network parameters and operations, thereby reducing the overall computational cost of the system to achieve real-time operation. To meet these constraints, we propose three-way dilation blocks, employed in the short and standard blocks (MPDS and MPD in Fig. 7). Each feature map is split into three sub-feature maps, and a dilated convolution operates on each of them. This enlarges the receptive field to capture contextual information without increasing computational cost. The Semantic-Decoder mainly consists of MPDU blocks operating at different scale levels. The Depth-Decoder is constructed with upsampling blocks resembling those in the Semantic-Decoder. However, the upsampling block in the final layers (MPDU-S in Fig. 8) is simplified using a pixel shuffling layer, which does not increase computational cost compared to interpolation. A module using squeeze-and-excitation layers [53] helps remove mosaic patterns. The enhanced depth map is evaluated against the raw depth map and against projection errors to previous and future frames using semantic predictions. The fusion of semantic information helps decrease the projection loss.
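A minimal PyTorch sketch of the three-way dilation idea follows; the channel split, dilation rates (within the stated 2 to 6 range), and residual fusion are assumptions rather than the exact MPD/MPDS blocks.

```python
# Sketch of a three-way dilation block: channels are split into three groups
# and each group passes through a dilated convolution with a different rate.
import torch
import torch.nn as nn

class ThreeWayDilationBlock(nn.Module):
    def __init__(self, channels, dilations=(2, 4, 6)):
        super().__init__()
        assert channels % 3 == 0, "channels must split evenly into three groups"
        c = channels // 3
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(c, c, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for d in dilations])

    def forward(self, x):
        chunks = torch.chunk(x, 3, dim=1)              # split channel-wise
        out = [b(c) for b, c in zip(self.branches, chunks)]
        return torch.cat(out, dim=1) + x               # fuse and keep a residual path

# Example: ThreeWayDilationBlock(48)(torch.randn(1, 48, 60, 106))
```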

Fig. 7. This figure shows sub-modules used to construct our multi-modal CNN. “DC” is a dilated convolution layer with dilation rates ranging from 2 to 6. “BN” is batch normalization.

Fig. 8. This figure shows sub-modules to construct our multi-modal CNN.

Our multi-modal CNN is trained in a semi-supervised manner, with distinct loss functions tailored to each network. For PoseNet training, we used three sequential images and a structure-from-motion approach. For semantic segmentation, ${\mathcal {L}}_{sem}$ uses the OHEM loss [54]. ${\mathcal {L}}_{depth}$ and ${\mathcal {L}}_{dontcare}$ are smooth L1 losses against the raw depth maps, as in equation (1):
\begin{equation*} {\mathcal {L}}_{depth}, {\mathcal {L}}_{dontcare} = \frac {1}{n \cdot m}\sum _{t=1}^{m}\sum _{i=1}^{n} z_{i}^{t} \tag {1}\end{equation*}
where $z_{i}^{t}$ is given by:
\begin{align*} z_{i}^{t} = \begin{cases} 0.5\,(x_{i}^{t} - y_{i}^{t})^{2}/\beta, & \text {if } \left | x_{i}^{t} - y_{i}^{t} \right | \lt \beta \\ \left | x_{i}^{t} - y_{i}^{t} \right | - 0.5\,\beta, & \text {otherwise} \end{cases} \tag {2}\end{align*}

$y_{i}^{t}$ is the raw depth map and $x_{i}^{t}$ is the enhanced depth map at time $t$. ${\mathcal {L}}_{depth}$ is applied to points closer than the threshold value (23 m) and ${\mathcal {L}}_{dontcare}$ to points beyond it, with $\beta = 1.0$. We chose 23 m to reduce computational cost; it is unnecessary to consider farther objects for localization, particularly for pedestrian navigation. For the pose loss, we modified a cosine embedding loss with semantic segmentation information, as in equation (3):
\begin{align*} & {\mathcal {L}}_{pose} = \sum _{i=1}^{D}(1 - \cos \theta _{i}), \\ & \ \ \cos \theta _{i} = \frac {1}{N}\sum _{j=1}^{N}\frac {P_{t\rightarrow t'}A_{j}^{t} \cdot B_{j}^{t'}}{\lVert P_{t\rightarrow t'}A_{j}^{t}\rVert \cdot \lVert B_{j}^{t'}\rVert } \tag {3}\end{align*}

$N$ is the number of valid generated 3D points, and $(t,t') \in \{(t,t-1), (t,t+1), (t-1,t), (t+1,t)\}$ indicates adjacent time frames of PoseNet from time $t$ to $t'$. $D$ is the number of enhanced depth maps produced by the Depth-Decoder at different scales. $P_{t\rightarrow t'}$ denotes the outcome of PoseNet from time $t$ to $t'$.
\begin{align*} A_{j}^{t} & = f(disp_{j}^{t}, seminf^{t}, intrinsic) \odot mask_{A_{j}^{t}} \\ B_{j}^{t'} & = f(disp_{j}^{t'}, seminf^{t'}, intrinsic) \odot mask_{B_{j}^{t'}} \tag {4}\end{align*}

$A_{j}^{t}$ and $B_{j}^{t'}$ are point clouds generated at times $t$ and $t'$ from the predicted semantic segmentation results and enhanced depth maps. $seminf^{t}$ denotes the segmentation inference for the input RGB image at time $t$, and $disp_{j}^{t}$ is the disparity map from the Depth-Decoder at time $t$ and scale $j$. $intrinsic$ contains the camera's intrinsic parameters used to compute the point cloud. $mask_{A_{j}^{t}}$ and $mask_{B_{j}^{t'}}$ are filters based on the semantic inferences and predicted depth maps. The total loss is given in equation (5):
\begin{align*} {\mathcal {L}}_{total} & = W_{sem}\,{\mathcal {L}}_{sem} + W_{depth}\,{\mathcal {L}}_{depth} \\ & \quad + W_{pose}\,{\mathcal {L}}_{pose} + W_{dontcare}\,{\mathcal {L}}_{dontcare} \tag {5}\end{align*}

The weights are $W_{sem} = 1.2$, $W_{depth} = 1$, $W_{pose} = 0.8$, and $W_{dontcare} = 0.05$.
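The sketch below shows how the depth losses of equations (1)-(2) and the weighted total of equation (5) could be assembled in PyTorch, assuming ${\mathcal {L}}_{depth}$ covers points within the 23 m threshold and ${\mathcal {L}}_{dontcare}$ those beyond it; the semantic and pose losses are passed in as precomputed terms.

```python
# Sketch of the depth losses (smooth L1 split at a 23 m threshold) and the
# weighted total loss with the stated weights. Mask conventions are assumptions.
import torch
import torch.nn.functional as F

def depth_losses(pred, raw, threshold=23.0, beta=1.0):
    valid = raw > 0                                   # ignore missing raw depth
    near = valid & (raw <= threshold)
    far = valid & (raw > threshold)
    def smooth_l1(mask):
        if mask.sum() == 0:
            return pred.new_tensor(0.0)
        return F.smooth_l1_loss(pred[mask], raw[mask], beta=beta)
    return smooth_l1(near), smooth_l1(far)            # (L_depth, L_dontcare)

def total_loss(l_sem, l_depth, l_pose, l_dontcare,
               w_sem=1.2, w_depth=1.0, w_pose=0.8, w_dontcare=0.05):
    return (w_sem * l_sem + w_depth * l_depth
            + w_pose * l_pose + w_dontcare * l_dontcare)
```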

2) Datasets and Training:

We used two public datasets to evaluate the performance of the proposed semantic segmentation network: (i) Cityscapes [55] and (ii) CamVid [56]. In addition, we adapted our semantic segmentation CNN to the dataset we collected at Mcity, where we tested our system with individuals with visual impairment. Cityscapes is a large-scale urban street scene dataset with 19 classes, divided into training (2975 images), validation (500 images), and testing (1525 images) sets to facilitate robust evaluation and analysis. The Cambridge-driving Labeled Video Database (CamVid) is a relatively small dataset for understanding driving scenes. We combined its training and validation sets into a unified training set of 468 images and used the original test set of 233 images for testing due to its limited size. Although the dataset encompasses 32 classes, our analysis focuses on 11 of them, in line with similar studies [57], [58], [59]. The final dataset was gathered at Mcity. We collected the data from March to July and performed the test in October, which is helpful for measuring overfitting. Transfer learning was applied from a model pre-trained on the Cityscapes dataset. The training set comprises 2000 images, while 743 images were set aside for validation, randomly sampled from a total pool of 2743 images. The data collection covered various routes and included five distinct classes at an image resolution of 480×848. Ultimately, all the training and validation sets were used to train our network for the human subject navigation tests. We trained all of our CNNs using a stacking approach. Initially, we trained only the semantic segmentation CNN on the Cityscapes dataset, using 800 epochs, a 0.05 learning rate, 0.9 momentum, 0.0001 weight decay, and a poly learning rate scheduler, with a batch size of 8 and random augmentation [60]. The training input size is 1024×1024 and the validation input size is 1024×2048. For CamVid, we trained both from scratch and from a Cityscapes pre-trained model. For CamVid training from scratch, we used 200 epochs with a 0.025 learning rate; the other options are the same as in the Cityscapes configuration, with training and validation input sizes of 720×960. When fine-tuning the Cityscapes pre-trained model on CamVid, we changed the learning rate to 0.01 and trained for 100 epochs. Following the training of the semantic segmentation CNN, we used the Cityscapes pre-trained model to train the Depth-Decoder on the Mcity dataset while fine-tuning the semantic segmentation CNN, again with 800 epochs, 0.9 momentum, 0.0001 weight decay, and a poly learning rate scheduler. The learning rates are assigned differently when training the multi-modal model: 0.015 for the semantic segmentation CNN, 0.025 for the Depth-Decoder, and 0.025 for PoseNet. The input and output image size is 480×848.
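As a rough illustration of this training setup (SGD with momentum 0.9, weight decay 0.0001, and a poly learning-rate schedule), a PyTorch sketch follows; the poly power of 0.9 and the steps-per-epoch value are common conventions assumed here, not values reported in the paper.

```python
# Sketch of the optimizer and poly learning-rate schedule described above.
# The poly power (0.9) and steps_per_epoch are illustrative assumptions.
import torch

def make_optimizer_and_scheduler(model, base_lr=0.05, epochs=800, steps_per_epoch=372):
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=1e-4)
    total_iters = epochs * steps_per_epoch
    poly = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda it: max(0.0, 1.0 - it / total_iters) ** 0.9)  # poly decay toward 0
    return opt, poly
```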

C. Prior Map

Our system leverages a prior map comprising various layers, including RGB, semantic, and intensity layers. The prior map of the Mcity test facility used in our experiments was created by Ford as described in [61]. To enhance computational efficiency and robustness, we only use the semantic layer of this prior map. The prior map is synchronized with GPS coordinates, which facilitates mapping the GPS coordinates from our system’s GPS sensor onto the prior map. It is noteworthy that the prior map was originally intended for use by Ford autonomous vehicles (AVs) and was created with a completely different set of sensors, typical of an AV. The prior map’s different modality from our sensors underscores its distinct purpose within our navigation system. Fig. 9 shows the satellite RGB prior map as an example.

Fig. 9. The main image shows a prior map based on satellite images, produced by Ford Motor Company. The inset images show point-of-view video frames obtained from a smartphone (“A” and “B”). “C” shows a painted façade in Mcity. Dashed lines are the routes tested.

D. Localization

Localization begins by using pose data from GPS to obtain an approximate location of the user on the prior map. Based on this initial pose, a relevant segment of the prior map is extracted via brute-force sampling around the GPS pose. We sample the prior map every 50 cm along the x-axis and y-axis and every 15 degrees in yaw. The sampled data from the map are matched with the estimated bird’s-eye-view semantic segmentation and depth map patches obtained from the user’s camera image. By comparing the mean Intersection over Union (mIoU) between the live camera patches and the prior map samples, the top K results are identified. Kernel density estimation is then applied to these top K results to mitigate the impact of outliers and obtain the measured pose of the user. We use the IMU to compensate for motion between successive camera measurements via a complementary filter. The compensated yaw is used for the next round of particle sampling and, in particular, for measuring rotational motion in 90-degree turn areas. The three parameters (x, y, and yaw) derived from this process determine the user’s current position and heading on the prior map.
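A simplified sketch of this pose search is given below: candidate poses are sampled on a 50 cm / 15 degree grid around the GPS seed, scored by mIoU against prior-map crops, and the top-K results fused. A weighted mean stands in for the kernel density estimation step, and the search extent and helper functions (e.g., crop_prior_map) are assumptions.

```python
# Brute-force pose search around the GPS seed, scored by semantic mIoU.
import numpy as np

def miou(patch_a, patch_b, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(patch_a == c, patch_b == c).sum()
        union = np.logical_or(patch_a == c, patch_b == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

def localize(gps_xy, gps_yaw, live_patch, crop_prior_map, num_classes=5, k=10):
    candidates = []
    for dx in np.arange(-2.0, 2.01, 0.5):             # 50 cm grid in x (extent assumed)
        for dy in np.arange(-2.0, 2.01, 0.5):         # 50 cm grid in y
            for dyaw in np.arange(-45, 46, 15):       # 15 degree steps in yaw
                pose = (gps_xy[0] + dx, gps_xy[1] + dy, gps_yaw + dyaw)
                score = miou(live_patch, crop_prior_map(pose), num_classes)
                candidates.append((score, pose))
    top = sorted(candidates, reverse=True)[:k]        # best-matching poses
    w = np.array([s for s, _ in top])
    poses = np.array([p for _, p in top])
    return tuple(np.average(poses, axis=0, weights=w + 1e-6))
```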

E. Moving Obstacles and Information Through IX-Node

We also integrate obstacles detected by the infrastructure (IX) node into our navigation system. Mcity has several IX-nodes installed, which consist of sensors, compute, and V2X communication devices. These IX-nodes are capable of detecting pedestrians and other objects [62] that are otherwise not visible to the onboard sensing of our navigation system. Obtaining information about moving obstacles outside the line of sight of the onboard sensors is important for safe path planning. The IX-nodes broadcast information about moving obstacles along with object types (pedestrians and cars). Our system receives this information from the IX-node through a Wi-Fi network connection and uses it for safe path planning. We performed several experiments in which the visually impaired participants were exposed to controlled scenarios where the obstacles were outside the line of sight of the onboard sensors but were easily detected by the IX-node and transmitted to our navigation system.
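To illustrate how such broadcasts might be consumed, a hedged sketch follows; the UDP transport, port number, and JSON field names are purely illustrative, as the paper only states that obstacle types and locations arrive over a Wi-Fi connection.

```python
# Illustrative receiver for IX-node obstacle broadcasts (format assumed).
import json
import socket

def listen_for_obstacles(port=9000, timeout_s=0.05):
    """Return the most recent list of obstacles, or [] if nothing arrived."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    sock.settimeout(timeout_s)
    obstacles = []
    try:
        data, _ = sock.recvfrom(65535)
        msg = json.loads(data.decode("utf-8"))
        # assumed shape: {"objects": [{"type": "pedestrian", "x": 12.3, "y": -4.1,
        #                              "vx": 0.8, "vy": 0.0}, ...]}
        obstacles = msg.get("objects", [])
    except socket.timeout:
        pass
    finally:
        sock.close()
    return obstacles
```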

F. Path Planning

We use the D* Lite [63] path planning algorithm to provide one of the five guidance cues shown in Fig. 10, along with stay and stop commands. Path planning is performed whenever localization is executed, based on pre-defined waypoints such as braille blocks and sharp-turn regions (see Fig. 9). At every braille block and at locations requiring a 90-degree turn, our system aligns users toward the next direction. Figs. 17, 18, and 19 show in yellow how the planned path changes in different situations. Our system guides users in the proper directions to their final destinations. Fig. 9 shows the five different paths used for the experiments: (A, bus), (A, C), (B, bus), (bus, B), and (C, bus) are the start and end position pairs.
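A possible mapping from the heading error toward the next D* Lite waypoint to a discrete guidance cue is sketched below; the cue names follow those mentioned in the paper, but the angular thresholds are assumptions.

```python
# Illustrative guidance-cue selection from the heading error to the next waypoint.
import math

def guidance_cue(user_xy, user_yaw_deg, waypoint_xy, collision_predicted=False):
    if collision_predicted:
        return "Collision prediction. stop"
    dx = waypoint_xy[0] - user_xy[0]
    dy = waypoint_xy[1] - user_xy[1]
    target_yaw = math.degrees(math.atan2(dy, dx))
    err = (target_yaw - user_yaw_deg + 180) % 360 - 180   # wrap to [-180, 180)
    if abs(err) < 15:                                     # thresholds are assumptions
        return "forward"
    if abs(err) < 60:
        return "veer left" if err > 0 else "veer right"
    return "rotate body left" if err > 0 else "rotate body right"
```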

Fig. 10. Expected motion for each guidance cue.

Fig. 11. This figure compares performance and computational costs.

Fig. 12. Visualized point clouds with RGB colors using raw depth maps (left) and enhanced depth maps (right).

Fig. 13. The orange and yellow symbols mark the location where we collected data on initial alignment. The arrows show the headings tested, and the light green arrows provide a reference.

Fig. 14. Initial heading estimation results: None of this data is contained in the training set. The red arrows indicate the current headings. All the illustrated cases demonstrate successful results. Red and blue rectangles are sampled headings and locations (particles). Points illustrated in red, blue, and yellow are point clouds from predicted depth maps and semantic information (road, sidewalk, and braille block, respectively). The green arrows point at the estimated location and heading. The left column of each image is the RGB input image, and the middle is the semantic segmentation result. The faint cyan color is a 3D point cloud map used for visualization only. White lines enhance the outline of the sidewalk.

Fig. 15. These images show participant trajectories. The trajectory in green is the ground truth based on recorded video footage. The trajectory in cyan is the estimated localization using semantic patches. The trajectory in red is the recorded GPS coordinates. “A” is the result of S001 on Route 5 and “B” is the outcome of S002 on Route 3.

Fig. 16. The result of our system processing with S002 on Route 5. Each image shows the guidance cue changes in the center of the image depending on the user’s location. The top-left images are the point-of-view images and generated semantic patches with inferred semantic information. The different colors of point clouds represent roads (dark red), sidewalks (cyan), braille blocks (red), 90-degree turn areas (gray), and destinations (yellow stars). Yellow dots are the planned path from the current location to the next waypoints.

Fig. 17. Simulation on the results of S002 on Route 3: In image “A” the cue is “forward” because no collision is detected between the planned path and the cars based on TTC. When the user starts to cross a street and a collision is detected, our system provides “Collision prediction. stop” to the user in the simulation (B, C, and D).

Fig. 18. Simulation on the results of S002 on Route 5: Each image shows how the guidance cue changes when our system obtains information through the IX-node (“Collision prediction. stop”). The pedestrian stays for a while to align heading before crossing a street (A). After the moving object passes the planned path (B), a “forward” cue is issued (C).

Fig. 19. Simulation on the results of S001 on Route 2: The red pedestrian is a moving obstacle on the sidewalk. Depending on the presence of the moving obstacle, the planned path (yellow lines) differs (A and B).

G. User Study

We conducted a user study with two BVI participants. Since we tested a small number of participants, the results are qualitative [64], but they do give some insight into people’s needs and the context within which our future technology might be used. There are five different routes. Each route has a destination that is not visible from the start point, such as a shuttle parked by a store or a bus stop. One of the routes was not used for collecting data to train our multi-modal CNN, and the rest were used for training; however, the environment had changed since the time the data were collected. The length of each route is given in Table IX.

TABLE II. Model Comparison on the Cityscapes Test Set With Lightweight Semantic Segmentation CNNs. We Used Reported Values From Their Studies. GPU Computational Efficiency: GTX 1080 < GTX 1080 Ti < TITAN XP < RTX 2080 Ti, According to the GPU Benchmark. “*” Indicates TensorRT Use. The Number in Parentheses Is the mIoU on the Cityscapes Validation Set

TABLE III. mIoU and FPS Comparison on the CamVid Test Set. The mIoU and FPS Are Taken From Published Results. The mIoU in Parentheses Is the Accuracy Without Cityscapes Pre-Training

TABLE IV. The Maximum Distance Threshold Is 16 m. The Unit Is Meters. “NM” Indicates “Not Measurable.” “INF” Means the Value Cannot Be Computed. “AVG” Is the Mean Error Over 13 Landmarks

TABLE V. “Step Error” Indicates the L1 Error From the Ground Truth Location (the Center of Sampled Particles With a 50 cm Step). “Heading Success” Evaluates Whether the System Can Provide Correct Guidance Cues

TABLE VI. Localization Error in Meters. “ESTM” Is the Location Estimated by Our System. “Safe” Is an Error of Less Than 65 cm, Because a Long White Cane Is Used. “Minor Risk” Is an Error of More Than 65 cm but Less Than 1 m. “Major Risk” Is an Error of More Than 1 m, Where the Localized User Is Standing on the Road Rather Than on a Sidewalk. Values Are Truncated at the Third Decimal Place

TABLE VII. Measurement of Processing Time of Each Module on the Jetson Orin NX 16 GB During the Experiments

TABLE VIII. Trust and Mental Effort Rated by the Participants

TABLE IX. We Counted the Number of Assists During Navigation and the Length of Each Route. “S” Represents Success and “T” Represents Trials

SECTION IV.

Results

A. Semantic Segmentation Evaluation

We evaluated our semantic segmentation performance against lightweight semantic segmentation CNNs, which usually have lower accuracy than CNNs designed primarily for high accuracy, because our system requires a trade-off between accuracy and usability. Small objects can matter for high accuracy, but Wayve [65] shows that even autonomous cars can drive without highly accurate inference results. Pedestrians move slowly relative to automobiles; thus, the navigation assistive system does not need to consider far objects, and we focused on balancing computational cost and performance. Multiple papers claim to achieve real-time, efficient, lightweight semantic segmentation [66], [67], [68]. However, in many cases, real-time performance was reported as FPS on a desktop, not on a mobile computing platform. We need to consider low-compute devices, including smartphones. We selected recently published lightweight semantic segmentation CNNs for evaluation [44], [46], [47], [48], [49], [50], [51]. Our code for estimating FPS uses torch.cuda.Event(enable_timing=True) and torch.cuda.synchronize() to measure inference time only, and warm-up iterations were run first and discarded when calculating the inference speed. Table II shows that our semantic segmentation model achieves comparable performance with respect to accuracy and real-time performance on low-compute devices. Our prior work contains a more extensive comparison with additional models.
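A sketch of this FPS measurement procedure is shown below: CUDA events time only the forward pass and warm-up iterations are excluded; the model handle, input shape, and iteration counts are placeholders.

```python
# Inference-speed measurement with CUDA events; warm-up iterations are not timed.
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 480, 848), warmup=50, iters=200):
    model = model.cuda().eval()
    x = torch.randn(*input_shape, device="cuda")
    for _ in range(warmup):                      # warm-up iterations, discarded
        model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    elapsed_ms = start.elapsed_time(end)         # total milliseconds over all iterations
    return iters / (elapsed_ms / 1000.0)
```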

We used the Net-fps score [52] to evaluate multi-modal factors, including computational cost. We measured the fps of SGCPNet and PMSIEM on a GTX 1080, which was used to compute the Net-fps score. Even though our prior semantic segmentation CNN [52] shows a higher Net-fps score, we used the current semantic segmentation CNN because, at 480×848 image resolution, the current model operates faster than our prior model on the Jetson Orin NX (53 fps vs. 43 fps). In addition, the proposed model has fewer parameters, which enables real-time operation during inference even when it runs alongside other tasks (e.g., depth map enhancement and localization). The result of the semantic segmentation CNN is illustrated in Fig. 11. The circle size indicates the model size, which is the sum of the number of parameters and the number of floating-point operations (GFLOPs); the number of parameters is re-scaled to GFLOPs. The fps of SGCPNet and PMSIEM were re-measured on the GTX 1080, and BiSeNetv2 used TensorRT to optimize the network. On the CamVid dataset, our current model outperformed the other models except PMSIEM with the Cityscapes pre-trained model (see Table III).

B. Depth Enhancement Evaluation

We evaluate the depth enhancement by comparing it with the raw depth maps from the RGB-D inertial sensor. Although other studies have demonstrated models that predict semantic segmentation and depth completion [69], [70], [71], [72], it is problematic to compare our outcomes with theirs due to differences in dataset modality. We defined landmarks on each route we tested and measured the distances using Google Maps. Table IV shows the results. Raw depth maps come from the RGB-D inertial sensor with Intel RealSense D435i SDK filtering. The error is computed with respect to ground truth derived from Google satellite images. We collected the data for depth estimation separately, so it was not used for training our networks. Our predictions outperformed the raw depth maps, with an average error of 1.29 m compared to 2.54 m. Fig. 12 shows comparisons between the raw depth map and our proposed depth map enhancement output. The crosswalks are not detected with the raw depth maps, while stop lines and the middle orange lines are reconstructed more accurately with the enhanced depth maps; compare the ellipses colorized in orange and green between the left and right images.

C. Initial Alignment Estimation Evaluation

We evaluated the initial alignment process separately using 15 test cases (see Fig. 13). The locations are similar to those in the dataset collected for our multi-modal CNN training, but the headings were not used for collecting training data, except for three forward directions toward the destinations used in the human subject experiments (see Fig. 14). Table V shows the results: 13 out of 15 test cases were successful. The success criterion is whether the system can start to provide guidance cues (rotate body left, rotate body right, and stay) for alignment. The two failures are due to a lack of semantic information: in the Route 3 case with a 90-degree initial heading, no semantically labeled regions such as sidewalks are in view of the camera, and the Route 3 case with a 45-degree initial heading has a similar issue. We did not test cases where obstacles block the camera’s field of view, for example when roads are occluded by a bus stop or the camera faces store panels, because our semantic segmentation CNN cannot make proper predictions in those situations.

D. Localization Estimation Evaluation

We conducted a test on five different routes with two participants with severe visual impairment. To prepare them for the test, we provided a 30-minute training session using an app that allowed them to familiarize themselves with our system and provide feedback; the routes used for this session were not used for training our CNN model or in the navigation experiments. Fig. 15 shows example results, and Table VI shows the error. Compared to GPS-only localization, the estimation using semantic information outperformed the GPS-only results except on Route 2. “Safe” is less than a 65 cm difference from the ground truth, since users carry a long white cane. “Minor risk” is a difference of more than 65 cm but less than 1 m from the ground truth. “Major risk” includes cases where the estimate was more than 1 m from the ground truth and the user was walking on the road rather than on a sidewalk.

E. Real-Time Performance Evaluation

We measured the operation time of each module running on the Jetson Orin NX 16 GB; Table VII shows the results. We also independently evaluated the frames per second of our full CNN model (with depth enhancement) on a smartphone: it takes about 190 ms per frame (about 5 fps) on a Samsung Galaxy Note 8, which was released in 2018. Our previous model [52] reached 180 fps on the GTX 1080, while our semantic segmentation CNN reaches 209 fps on the same machine at the 480×848 image resolution used by our navigation system. In addition, we measured the fps on the Jetson Orin NX 16 GB: our previous model [52] reaches 43 fps, and our semantic segmentation CNN reaches 50 fps.

F. User Study

In our study, we recruited two individuals with visual impairment, both without residual vision. They conducted navigation experiments on five different routes (see Fig. 9), with a total traveling length of 213.36 m based on Google Maps. To enhance the robustness of our findings, we replicated the experiments on identical routes. S002 performed 8 trials and succeeded in 6 of them; S001 performed 7 trials and succeeded in all of them. During the experiments, assistance was provided on several occasions when users drifted off course and had no chance to recover by themselves or through our system (see Table IX). This was caused by semantic information errors or by GPS errors, since GPS provides the seed location for generating the semantic patches. Fig. 16 demonstrates how our system guided S002.

G. User Feedback

Table VIII shows the participants’ ratings of our system regarding trust and mental effort. Trust reflects how reliably our system can guide the user, so higher is better. Mental effort indicates how much effort the system demands of the user while navigating, so lower is better. The trust scores vary with the participants’ trials and success, and overall mental effort is very low. However, S001 was confused by the veer left and veer right guidance cues, so their mental effort score was higher than S002’s.

H. Moving Obstacles Simulation

The moving-obstacle scenario was simulated to understand the benefit of connecting to the IX-node. Figs. 17, 18, and 19 show how the guidance cue changes based on a simple time-to-collision (TTC) algorithm. When the currently planned path is blocked, our system delivers “Collision prediction. stop.” to make the user stop. If the planned path is unaffected by the predicted trajectories of moving obstacles, the current guidance remains unchanged, as illustrated in Fig. 17. Fig. 19 shows the case where the moving obstacle is a pedestrian. Notably, the planned path avoids a moving obstacle that is not in the line of sight of the user and is observed only through the IX-node.
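A minimal sketch of such a TTC-style check is shown below: a broadcast obstacle is extrapolated at constant velocity and tested for proximity to upcoming waypoints; the time horizon, time step, and clearance radius are assumptions.

```python
# Illustrative collision prediction against the planned path using a
# constant-velocity extrapolation of a broadcast obstacle.
import math

def predicts_collision(path_xy, obstacle, horizon_s=8.0, dt=0.5, clearance_m=1.5):
    ox, oy = obstacle["x"], obstacle["y"]
    vx, vy = obstacle.get("vx", 0.0), obstacle.get("vy", 0.0)
    t = 0.0
    while t <= horizon_s:
        px, py = ox + vx * t, oy + vy * t          # predicted obstacle position
        for wx, wy in path_xy:
            if math.hypot(px - wx, py - wy) < clearance_m:
                return True                        # predicted conflict with the path
        t += dt
    return False
```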

SECTION V.

Conclusion

In this paper, we demonstrated a new wearable system for outdoor navigation assistance. The novelty of this system is its direct cooperation with infrastructure meant for self-driving cars. Specifically, we use a prior map generated for self-driving cars and communicate directly with IX-nodes to obtain information about vehicle locations (destinations) and other dynamic agents (hazards). Our system can incorporate information that is not visible to the wearable camera and provides feasible localization accuracy and guidance using semantic information and enhanced depth maps. Both participants agreed that our system has the potential to improve their confidence while navigating and that it requires low mental effort to adapt to. We also proposed a new efficient semantic segmentation architecture based on our prior work, together with a model for enhancing depth maps simultaneously from monocular image inputs. Overall, our system addresses many of the requirements of outdoor navigation for people with visual impairments. However, further research is needed to better handle more dynamic environments, in terms of both algorithms and feedback, and to build an even more lightweight and efficient multi-modal CNN for obtaining semantic information and depth values.
