Introduction
Navigation assistive technology apps operating on mobile phones enhance the mobility of individuals with visual impairments [1], [2], [3], [4]. Such apps can provide location information (Google Maps), information about the surroundings (BlindSquare), and aid with specific tasks such as street crossing (OKO AI Copilot for the Blind). These capabilities augment traditional aids. A long cane is an efficient tool for avoiding collisions, walking in a straight line, and gathering nearby information through sound reflections [5], [6]. Some people who are blind or visually impaired (BVI) use guide dogs for enhanced mobility, but adoption among BVI is only about 2% [7], since for many the responsibilities and costs of caring for a guide dog outweigh the benefits [8]. Apps combined with a long cane or guide dog therefore allow BVI to increase their independent mobility. Despite these advances, significant limitations remain: dynamic information and precise localization are still not possible with commercial GPS (accuracy is on the order of 5 m).
We have designed a novel outdoor navigation assistive system aimed at providing improved navigation assistance for BVI, particularly in outdoor urban environments where a user may need to rendezvous with a shuttle or car. The system guides BVI users to their destinations through a collaborative interface with autonomous vehicle infrastructure. This work anticipates a future in which infrastructure built for autonomous vehicles is accessible to pedestrians and interconnected through smart city systems. Our system uses a head-worn camera, bone conduction headphones, a single color-depth (RGB-D) inertial sensor, a GPS sensor, and a Jetson Orin NX 16 GB (NVIDIA) that performs perception, localization, path planning, and audible feedback in real time. We developed a multi-task network that takes the RGB-D data as input and simultaneously performs semantic segmentation and depth map enhancement from raw depth maps, allowing the system to detect in-path obstacles and guide the user around them. We chose a head-worn camera because it provides more natural feedback regarding pose and direction of movement, is hands-free, and allows more flexible movement than a smartphone camera held in the hand or worn on the torso. The head-worn camera, however, can only detect objects within a clear line of sight, while other relevant objects may not be viewable; for example, a shuttle on another street will be blocked from view by buildings. To obtain precise location information for such objects, our system connects to the autonomous systems infrastructure available in the Mcity facility (https://mcity.umich.edu/).
We conducted a pilot user study with two BVI participants to evaluate specific functionalities and the overall user experience of our system (see Fig. 1). The study focused on the users' ability to adapt to our navigation assistive system, their hesitancy and confusion while navigating with it, the intuitiveness of the design for future pedestrian mobility systems and infrastructure, and possible cooperation with infrastructure. The two participants, both with severe visual impairment, traversed five different routes. The locations of a virtual shuttle and of moving objects (e.g., pedestrians) were obtained from the Mcity infrastructure network, and the participants followed cues generated by the wearable system to reach the shuttle location.
The figure shows the prototype system being used by a visually impaired person for navigation guidance: the RGB-D inertial sensor mounted on the head, a bone conduction headphone for guidance cues, a GPS receiver on the shoulder, and the Jetson Orin NX 16 GB embedded computer with batteries inside the backpack.
Our three main contributions are listed below:
We demonstrate a novel outdoor navigation assistive technology that collaborates with infrastructure designed for autonomous vehicles. This prototype can serve as the basis of future navigation assistive systems.
We developed a multi-modal convolutional neural network (CNN) that simultaneously performs semantic segmentation (Semantic-Decoder) and depth map enhancement (Depth-Decoder) from raw depth maps. Its semantic segmentation performance is comparable to other CNNs that pursue lightweight design, and the developed modules can easily be adapted to other CNN architectures. The Depth-Decoder produces more accurate depth than the raw depth maps. Notably, our multi-modal model uses a single RGB input at inference, which enables efficient operation on low-compute devices.
Through a user study, we support the concept of a future navigation assistive technology that includes wearable components and augmentation through autonomous systems infrastructure.
Related Work
Navigation assistive technology for the visually impaired has made significant strides in recent years, with smartphone applications playing a key role in providing essential assistance. These applications offer features such as GPS navigation, voice guidance, and tactile feedback to help individuals with visual impairments navigate outdoor environments with greater independence and confidence. By leveraging the capabilities of smartphones, these tools have become increasingly accessible and widely used, empowering users to overcome navigation challenges and engage more actively in their daily lives. Aira [13], for example, provides real-time guidance by connecting the user to a trained remote human agent who assists the visually impaired person through the app. BlindSquare [14] and NaviBlind [15] use high-precision GPS and smartphones for navigation and require a human agent to plan a proper path for the user before traveling. There are also Bluetooth Low Energy (BLE) beacon-based navigation systems [1], [16], [17] that have been shown to be very effective for indoor navigation in places like airports and shopping malls. DeepNavi [4] proposed a smartphone-based CNN for real-time object detection and navigation. These solutions provide approximate location information that is beneficial to users, but they often rely on a human agent or are limited to a small indoor area.
Researchers have also developed navigation assistance robots and drones. Cabot [9] is a suitcase-shaped autonomous navigation robot that uses a lidar, stereo cameras, and a local floor plan for localization, point-of-interest notifications, and path planning. The robot conveys guidance cues through a haptic handle and was demonstrated in a user study in a university building. Alternatively, drone-based navigation robots [18], [19] can also be a solution. However, navigation assistive robots and drones might not be preferred due to safety concerns, noise, lack of control, or large size, which can make them impractical. Cane-based robots (a.k.a. smart canes) [20], [21], [22], [23] help users avoid collisions and change direction, but they add weight and interfere with existing cane skills. Therefore, modifying the long cane may not be ideal for navigation assistance.
Wearable systems have been studied for decades as alternatives for guiding people with vision loss [11], [12], [24], and advances in hardware and software have made them increasingly feasible. A system using a portable GPU and custom glasses with an integrated camera and headphones [25] has shown navigation benefits. Researchers have also used body-worn cameras whose visual information is processed on a laptop or cloud server [26], [27], [28], [29] for navigation assistance. However, standalone wearable systems cannot detect obstacles beyond their field of view, and using cloud servers [30] can introduce latency, which is problematic for real-time navigation. Moreover, many systems do not address initial user alignment, requiring multiple calibration steps that can create a hazard for the user.
Vehicle-to-Everything (V2X) technology enhances safety and efficiency by enabling communication between vehicles and their surrounding environment. Significant research has been conducted on V2X, particularly in the areas of protocols, architectures, and supporting technologies [31], [32], [33], [34], [35]. It is well established that V2X communication improves the safety of vulnerable road users (VRUs) by sharing objects' locations together with GPS [36], [37], [38], [39]. However, the majority of studies have focused primarily on developing efficient V2X protocols, networks, and services. In this work, we demonstrate the concept of a novel outdoor navigation assistive technology that collaborates with autonomous vehicle infrastructure, which can integrate V2X technology, with a focus on scenarios involving people with visual impairment.
Method
A. System Architecture
In this work, we use an RGB-D inertial sensor mounted on the user's head and an embedded computer (see Table I for specifications of the prototype system) that can communicate with the infrastructure sensing system (IX-node) [40], [41] typically installed to assist autonomous vehicles. The diagram in Fig. 2 illustrates the framework of our system, which performs three main operations to efficiently guide users: (i) perception, (ii) localization, and (iii) path planning and guidance. Perception and localization, including the initial orientation toward the proper path, are executed using deep learning and sensor fusion, and appropriate feedback is provided via a bone conduction headphone. The system uses autonomous vehicle infrastructure in two ways: (i) the prior map used for localization is generated by an autonomous vehicle, and (ii) the system receives real-time information about dynamic obstacles and autonomous vehicles from the IX-node [42].
System architecture: the data stream comes from an Intel RealSense D435i and a SparkFun GPS receiver. "AV" is an autonomous vehicle (here, a shuttle), and the IX-node is a router of the smart mobility system that assists autonomous vehicles.
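To make the data flow concrete, the following is a minimal sketch of how the three operations could be orchestrated on the embedded computer. All object and method names (camera, perception, localizer, planner, ix_node, audio) are illustrative placeholders, not the prototype's actual interfaces.

```python
import time

def navigation_loop(camera, gps, ix_node, perception, localizer, planner, audio):
    """Illustrative processing loop; all objects are hypothetical interfaces."""
    while True:
        rgb, raw_depth = camera.read()                    # head-worn RGB-D frame
        seg, depth = perception.infer(rgb)                # segmentation + enhanced depth
        pose = localizer.update(gps.read(), seg, depth)   # GPS prior refined with semantics
        obstacles = ix_node.latest_obstacles()            # out-of-sight agents via IX-node
        cue = planner.next_cue(pose, obstacles)           # e.g., "forward", "veer left", "stop"
        audio.say(cue)                                    # bone-conduction feedback
        time.sleep(0.05)                                  # simple pacing for the sketch only
```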
B. Perception
Perception is an essential part of our system because its outcome affects the accuracy of localization and path planning. Within the perception module, semantic segmentation and improved depth maps are obtained and applied to enhance the localization process during navigation. Semantic segmentation is challenging, especially under constraints such as real-time operation and low-compute hardware, which are critical for mobile robots. Lightweight semantic segmentation methods have emerged recently to address these requirements. BiSeNetV2 [44], an evolution of BiSeNet [45], MGSeg [46], MSCFNet [47], DDPNet [48], SGCPNet [49], FBSNet [50], and PMSIEM [51] achieve real-time performance through dual or multiple pathway architectures, efficient convolution layers, and fusion modules over feature maps of varying scale. Although these lightweight networks achieve real-time performance, semantic segmentation provides only part of the information required for robust navigation. We also need accurate depth maps because the raw depth obtained from the RGB-D inertial sensor is generally very noisy and incomplete. We therefore built a new multi-task network based on our prior work [52] that performs semantic segmentation and depth map enhancement through a single network sharing a common backbone.
1) CNN Model Architecture:
The proposed CNN architecture for the perception module is illustrated in Fig. 3. Our multi-modal CNN is composed of a Semantic-Depth encoder, a Semantic-Decoder (see Fig. 4), a Depth-Decoder (see Fig. 5), and PoseNet, which together perform semantic segmentation, depth map enhancement, and relative pose estimation (see Figs. 4, 5, and 6). The semantic segmentation and depth map enhancement outputs are produced through the shared backbone, with each decoder inferring its respective outcome. PoseNet is designed to support depth map enhancement using relative pose estimates between adjacent frames (see Fig. 6); its backbone is the same as the Semantic-Depth encoder except for the last layer. During training, all CNNs are trained with three sequential images and ground truth. During navigation, however, only a single RGB image is used to obtain the semantic segmentation result and the enhanced depth map, and PoseNet is not used at inference time.
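As a rough illustration of the shared-backbone design, the sketch below shows a toy PyTorch model with one encoder feeding a semantic head and a depth head; the layer sizes and upsampling choices are placeholders, not the architecture reported here.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Schematic shared-backbone model: one encoder, two task decoders.
    Layer sizes are placeholders, not the paper's actual architecture."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(            # stand-in for the Semantic-Depth encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.semantic_decoder = nn.Sequential(   # stand-in for the Semantic-Decoder
            nn.Conv2d(64, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )
        self.depth_decoder = nn.Sequential(      # stand-in for the Depth-Decoder
            nn.Conv2d(64, 1, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Sigmoid(),                        # normalized disparity/depth
        )

    def forward(self, rgb):
        feats = self.encoder(rgb)                # shared features feed both heads
        return self.semantic_decoder(feats), self.depth_decoder(feats)

logits, depth = MultiTaskNet()(torch.randn(1, 3, 256, 512))
```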
The overall architecture of our multi-modal CNN: the training inputs consist of three sequential RGB images, whereas only a single RGB input is used for inference. The elements highlighted in red indicate the input and outputs used at inference. The images next to the Semantic-Decoder are predictions and ground truth; the images next to the Depth-Decoder are predictions and raw depth maps. Outcomes of the pose estimation backbone are used in the loss computation.
Semantic segmentation CNN architecture: "Stem" consists of convolution, batch normalization, and ReLU, as described in [43]. "C" is a "Stem" with stride 2, "CC" is concatenation, and "CW" is a simple convolution layer. The abbreviations "MPD", "MPDU", and "MPDS" are described in Figs. 7 and 8. The outputs of the colorized blocks are inputs to the "Dilated Fusion" blocks of our Depth-Decoder shown in Fig. 5.
PoseNet architecture: PoseNet is an auxiliary network used to train the Depth-Decoder; it is used only during training.
We focus on minimizing the number of network parameters and operations, thereby reducing the overall computational cost of the system to achieve real-time operation. To meet these constraints, we propose three-way dilation blocks employed in the short and standard blocks (MPDS and MPD in Fig. 7). Each feature map is split into three sub feature maps, and a dilated convolution operates on each of them. This enlarges the receptive field to capture contextual information without increasing computational cost. The Semantic-Decoder mainly consists of MPDU blocks at different scale levels. The Depth-Decoder is constructed with upsampling blocks resembling those in the Semantic-Decoder; however, the upsampling block in the final layers (MPDU-S in Fig. 8) is simplified using a pixel shuffling layer, which does not increase computational cost compared to interpolation. A module using squeeze-and-excitation layers [53] helps remove mosaic patterns. The enhanced depth map is evaluated against the raw depth map and against projection errors to previous and future frames using the semantic predictions. The fusion of semantic information helps decrease the projection loss.
This figure shows the sub-modules used to construct our multi-modal CNN. "DC" is a dilated convolution layer with dilation sizes ranging from 2 to 6. "BN" is batch normalization.
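A minimal PyTorch sketch of the three-way dilation idea is shown below, assuming an equal channel split and dilation rates of 2, 4, and 6 within the reported 2-6 range; the residual connection and exact rates are assumptions.

```python
import torch
import torch.nn as nn

class ThreeWayDilation(nn.Module):
    """Sketch of a three-way dilated block: the input feature map is split into
    three groups and each group is convolved with a different dilation rate."""
    def __init__(self, channels, dilations=(2, 4, 6)):
        super().__init__()
        assert channels % 3 == 0
        c = channels // 3
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, padding=d, dilation=d, bias=False),
                          nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for d in dilations
        )

    def forward(self, x):
        xs = torch.chunk(x, 3, dim=1)                        # split channel-wise
        out = torch.cat([b(xi) for b, xi in zip(self.branches, xs)], dim=1)
        return out + x                                       # residual connection (assumed)

y = ThreeWayDilation(48)(torch.randn(1, 48, 64, 128))
```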
Our multi-modal CNN is trained in a semi-supervised manner, with distinct loss functions tailored to each sub-network. For PoseNet training, we used three sequential images and a structure-from-motion approach. For semantic segmentation, a supervised loss ${\mathcal {L}}_{sem}$ against the labeled ground truth is used. The depth and don't-care losses are computed over the $n$ pixels of the $m$ frames as \begin{equation*} {\mathcal {L}}_{depth}, {\mathcal {L}}_{dontcare} = \frac {1}{nm}\sum _{t=1}^{m}\sum _{i=1}^{n} z_{i}^{t} \tag{1}\end{equation*}
\begin{align*} z_{i}^{t} = \begin{cases} 0.5\,(x_{i}^{t} - y_{i}^{t})^{2}/\beta, & \text{if } \left| x_{i}^{t}-y_{i}^{t} \right| < \beta \\ \left| x_{i}^{t}-y_{i}^{t} \right| - 0.5\,\beta, & \text{otherwise} \end{cases} \tag{2}\end{align*}
\begin{align*} {\mathcal {L}}_{pose} &= \sum _{i=1}^{D}(1 - \cos \theta _{i}), \\ \cos \theta _{i} &= \frac {1}{N}\sum _{j=1}^{N}\frac {P_{t \to t'}A_{j}^{t} \cdot B_{j}^{t'}}{\| P_{t \to t'}A_{j}^{t}\| \cdot \| B_{j}^{t'}\|} \tag{3}\end{align*}
\begin{align*} A_{j}^{t} & = f(disp_{j}^{t}, seminf^{t}, intrinsic) \odot mask_{A_{j}^{t}} \\ B_{j}^{t'} & = f(disp_{j}^{t'},seminf^{t'}, intrinsic) \odot mask_{B_{j}^{t'}} \tag {4}\end{align*}
\begin{align*} {\mathcal {L}}_{total} & = W_{sem}*{\mathcal {L}}_{sem} + W_{depth}*{\mathcal {L}}_{depth} \\ & \quad + W_{pose}*{\mathcal {L}}_{pose} + W_{dontcare}*{\mathcal {L}}_{dontcare} \tag {5}\end{align*}
The weights $W_{sem}$, $W_{depth}$, $W_{pose}$, and $W_{dontcare}$ balance the contribution of each loss term.
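For illustration, a simplified PyTorch version of the depth and total losses is sketched below; the validity mask, the omission of the pose and don't-care terms, and the example weights are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def depth_loss(pred, raw_depth, beta=1.0):
    """Smooth-L1 (Huber) loss of Eqs. (1)-(2), averaged over valid pixels.
    Here validity is taken as raw_depth > 0, an assumption about the mask."""
    valid = raw_depth > 0
    return F.smooth_l1_loss(pred[valid], raw_depth[valid], beta=beta)

def total_loss(sem_logits, sem_gt, depth_pred, raw_depth,
               w_sem=1.0, w_depth=1.0):                     # example weights only
    """Weighted sum in the spirit of Eq. (5); the pose and don't-care
    terms are omitted here for brevity."""
    l_sem = F.cross_entropy(sem_logits, sem_gt, ignore_index=255)
    l_depth = depth_loss(depth_pred, raw_depth)
    return w_sem * l_sem + w_depth * l_depth
```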
2) Datasets and Training:
We used two public datasets to evaluate the performance of the proposed semantic segmentation network: (i) Cityscapes [55] and (ii) CamVid [56]. In addition, we adapted our semantic segmentation CNN to the dataset we collected at Mcity, where we tested our system with individuals with visual impairment. Cityscapes is a large-scale urban street scene dataset with 19 classes, divided into training (2975 images), validation (500 images), and testing (1525 images) sets to facilitate robust evaluation and analysis. The Cambridge-driving Labeled Video Database (CamVid) is a relatively small dataset for understanding driving scenes. Because of its limited size, we combined its training and validation sets into a unified training set of 468 images and used the original test set of 233 images for testing. Although the dataset contains 32 classes, our analysis focuses on 11 of them, in line with similar studies [57], [58], [59]. The final dataset was gathered at Mcity. We collected the data from March to July and performed the test in October, which helps assess overfitting to the environment. Transfer learning was applied from a model pre-trained on Cityscapes. The training set comprises 2000 images, while 743 images were set aside for validation, randomly sampled from a total pool of 2743 images. The data collection covered various routes and includes five distinct classes with an image resolution of
C. Prior Map
Our system leverages a prior map comprising various layers, including an RGB layer, a semantic layer, and an intensity layer. The prior map of the Mcity test facility used in our experiments was created by Ford as described in [61]. To enhance computational efficiency and robustness, we utilize only the semantic layer of this prior map. The prior map is synchronized with GPS coordinates, which allows GPS readings from our system's GPS sensor to be mapped onto it. It is noteworthy that the prior map was originally intended for use by Ford autonomous vehicles (AVs) and was created with a completely different set of sensors, typical of an AV. That the prior map comes from a different sensor modality than ours underscores its distinct original purpose and our system's ability to reuse it for navigation. Fig. 9 shows the satellite RGB prior map as an example.
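As an illustration of how GPS readings might be projected onto such a georeferenced map, the sketch below uses a simple flat-earth approximation around a map origin; the actual georeferencing of the Ford prior map may differ, so treat this purely as an example.

```python
import math

def gps_to_map_px(lat, lon, origin_lat, origin_lon, px_per_meter):
    """Hypothetical GPS-to-prior-map conversion using a flat-earth (local
    tangent plane) approximation; the real map georeferencing may differ."""
    earth_r = 6378137.0                                        # WGS-84 equatorial radius, meters
    dx = math.radians(lon - origin_lon) * earth_r * math.cos(math.radians(origin_lat))
    dy = math.radians(lat - origin_lat) * earth_r
    return dx * px_per_meter, dy * px_per_meter                # pixel offsets from the map origin
```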
The main image shows a prior map based on satellite images, produced by Ford Motor Company. The inset images show point-of-view video frames obtained from a smartphone ("A" and "B"). "C" shows a painted façade in Mcity. Dashed lines are the tested routes.
D. Localization
Localization begins by using pose data from GPS to obtain an approximate location of the user on the prior map. Based on this initial pose, a relevant segment of the prior map is extracted via brute-force sampling around the GPS pose. We sample the prior map every 50
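A minimal sketch of this brute-force refinement is given below, assuming the observed semantics are rendered as a top-down patch and scored by label agreement with the prior map's semantic layer; the search radius, step, and scoring function are illustrative, not the system's exact parameters.

```python
import numpy as np

def refine_pose(prior_map, gps_xy, observed_patch, search_radius_px=20, step_px=2):
    """Illustrative brute-force refinement around the GPS pose: each candidate
    offset is scored by how many semantic labels agree with the prior map."""
    best_score, best_xy = -1.0, gps_xy
    h, w = observed_patch.shape
    for dy in range(-search_radius_px, search_radius_px + 1, step_px):
        for dx in range(-search_radius_px, search_radius_px + 1, step_px):
            x, y = gps_xy[0] + dx, gps_xy[1] + dy
            if x < 0 or y < 0:
                continue
            ref = prior_map[y:y + h, x:x + w]                  # semantic layer crop
            if ref.shape != observed_patch.shape:
                continue
            score = np.mean(ref == observed_patch)             # fraction of agreeing labels
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```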
E. Moving Obstacles and Information Through IX-Node
We also integrate the obstacles detected by the infrastructure (IX) node into our navigation system. Mcity has several IX-nodes installed, each consisting of sensors, compute, and V2X communication devices. These IX-nodes can detect pedestrians and other objects [62] that are otherwise not visible to the onboard sensing of our navigation system. Obtaining information about moving obstacles outside the line of sight of the onboard sensors is important for safe path planning. The IX-nodes broadcast information about moving obstacles together with object types (pedestrians and cars). Our system receives this information from the IX-node over a Wi-Fi network connection and uses it for safe path planning. We performed several experiments in which the visually impaired participants were exposed to controlled scenarios where obstacles were outside the line of sight of the onboard sensors but were easily detected by the IX-node and transmitted to our navigation system.
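The sketch below illustrates one way such broadcasts could be consumed over the Wi-Fi link, assuming UDP transport and a simple JSON payload; the actual IX-node message format and port are not specified in this paper and may differ.

```python
import json
import socket

def listen_ix_node(port=6000):
    """Hypothetical listener for IX-node obstacle broadcasts over Wi-Fi.
    The transport (UDP) and JSON schema below are assumptions for illustration."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _ = sock.recvfrom(4096)
        msg = json.loads(data)   # e.g. {"type": "pedestrian", "lat": ..., "lon": ..., "speed": ...}
        yield msg["type"], msg["lat"], msg["lon"], msg.get("speed", 0.0)
```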
F. Path Planning
We use the D* Lite [63] path planning algorithm to provide one of five guidance cues, as shown in Fig. 10, along with a
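As an illustration of how a planned path segment could be turned into a spoken cue, the sketch below maps the heading error between the user and the D* Lite path direction to a cue; the thresholds and the exact cue vocabulary are assumptions (Fig. 10 defines the real set), drawing only on cue names mentioned elsewhere in the text.

```python
def heading_to_cue(path_bearing_deg, user_heading_deg, collision_predicted=False):
    """Maps the planned path direction to a spoken cue. Thresholds and cue
    names are illustrative, not the system's actual definitions."""
    if collision_predicted:
        return "Collision prediction. Stop."
    err = (path_bearing_deg - user_heading_deg + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    if abs(err) < 10.0:
        return "forward"
    if abs(err) > 60.0:
        return "rotate body left" if err > 0 else "rotate body right"
    return "veer left" if err > 0 else "veer right"
```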
Visualized point clouds with RGB colors using raw depth maps (left) and enhanced depth maps (right).
The orange and yellow symbols mark the locations where we collected data on initial alignment. The arrows show the headings tested, and the light green arrows provide a reference.
Initial heading estimation results: none of this data is contained in the training set. The red arrows indicate the current headings. All the illustrated cases are successful results. Red and blue rectangles are sampled headings and locations (particles). Points shown in red, blue, and yellow are point clouds derived from the predicted depth maps and semantic information (road, sidewalk, and braille block, respectively). The green arrows point at the estimated location and heading. The left column of each image is the RGB input, and the middle is the semantic segmentation result. The faint cyan color is a 3D point cloud map used for visualization only. White lines enhance the outline of the sidewalk.
These images show participant trajectories. The green trajectory is the ground truth based on recorded video footage, the cyan trajectory is the localization estimated using semantic patches, and the red trajectory is the recorded GPS coordinates. "A" is the result of S001 on Route 5 and "B" is the result of S002 on Route 3.
Results of our system running with S002 on Route 5: each image shows how the guidance cue (center of the image) changes depending on the user location. The top-left images are the point-of-view image and the generated semantic patch with inferred semantic information. The different colors of the point clouds represent roads (dark red), sidewalks (cyan), braille blocks (red), 90-degree turn areas (gray), and destinations (yellow stars). Yellow dots are the planned path from the current location to the next waypoint.
Simulation on the results of S002 on Route 3: in image "A" the cue is "forward" because no collision between the planned path and the cars is predicted based on TTC. When the user starts to cross the street and a collision is detected, the system provides "Collision prediction. Stop." in the simulation (B, C, and D).
Simulation on the results of S002 on Route 5: each image shows how the guidance cue changes when our system obtains information through the IX-node ("Collision prediction. Stop."). The pedestrian (user) waits to align heading before crossing the street (A). After the moving object has passed the planned path (B), a "forward" cue is issued (C).
Simulation on the results of S001 on Route 2: the red pedestrian is a moving obstacle on the sidewalk. The planned path (yellow lines) differs depending on whether the moving obstacle is present (A and B).
G. User Study
We conducted a user study with two BVI participants. Since we tested a small number of participants, the results are qualitative [64], but they give some insight into people's needs and the context within which our future technology might be used. There were five different routes. Each route has a destination that is not visible from its start point, such as a shuttle parked by a store or a bus stop. One of the routes was not used when collecting data to train our multi-modal CNN, and the rest were used for training; however, the environment had changed since the time of data collection. The length of each route is listed in Table IX.
Results
A. Semantic Segmentation Evaluation
We evaluated our semantic segmentation performance against lightweight semantic segmentation CNNs, which usually have lower accuracy than CNNs designed purely for high accuracy, because our system requires a trade-off between accuracy and usability. Small objects can matter for high accuracy, but Wayve [65] has shown that even autonomous cars can drive without highly accurate inference results. Pedestrians move slowly relative to automobiles, so the navigation assistive system does not need to consider far-away objects, and we focused on balancing computational cost and performance. Multiple papers claim real-time, lightweight semantic segmentation [66], [67], [68]; however, in many cases the claimed real-time performance is based on FPS measured on a desktop rather than on a mobile computing platform. We need to consider low-compute devices, including smartphones. For evaluation, we selected recently published lightweight semantic segmentation CNNs [44], [46], [47], [48], [49], [50], [51]. Our code for estimating FPS uses torch.cuda.Event(enable_timing=True) and torch.cuda.synchronize() to measure inference time only; warm-up iterations were discarded when computing inference speed. Table II shows that our semantic segmentation model achieves comparable accuracy and real-time performance on low-compute devices. For a more extensive comparison, our prior work [52] contains additional models and evaluations.
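A sketch of the timing procedure described above is shown below; the input resolution and iteration counts are examples rather than the values used in the paper.

```python
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 3, 512, 1024), warmup=20, iters=100):
    """CUDA-event timing with synchronization; warm-up iterations are discarded,
    mirroring the measurement procedure described in the text."""
    x = torch.randn(*input_shape, device="cuda")
    model = model.eval().cuda()
    for _ in range(warmup):                       # warm-up, not timed
        model(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return iters / (start.elapsed_time(end) / 1000.0)   # elapsed_time is in milliseconds
```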
We used the Net-fps score [52] to evaluate multi-modal factors, including computational cost. We measured the fps of SGCPNet and PMSIEM on a GTX 1080, which was used to compute the Net-fps score. Even though our prior semantic segmentation CNN [52] shows a higher Net-fps score, we used the current semantic segmentation CNN because, with
B. Depth Enhancement Evaluation
We evaluate the depth enhancement by comparing it with the raw depth maps from the RGB-D inertial sensor. Although other studies have demonstrated models that predict semantic segmentation and depth completion [69], [70], [71], [72], a direct comparison with our outcomes is problematic due to differences in dataset modality. We defined landmarks on each route we tested and measured the distances using Google Maps. Table IV shows the results. Raw depth maps come from the RGB-D inertial sensor using Intel RealSense D435i SDK filtering. The error is computed with respect to ground truth obtained from Google satellite imagery. We collected the data for depth estimation separately, so it was not used for training our networks. Our predictions outperformed the raw depth maps, with an average error of 1.29
C. Initial Alignment Estimation Evaluation
We evaluated the initial alignment process separately using 15 test cases (see Fig. 13). The locations are similar to those in the dataset we collected for multi-modal CNN training, but the headings were not used when collecting the training data, except for the three forward directions toward the destinations used in the human subject experiments (see Fig. 14). Table V shows the results. Overall, 13 out of 15 test cases were successful. The success criterion is whether the system can start providing guidance cues (rotate body left, rotate body right, and stay) for alignment. The two failures are due to a lack of semantic information: in the Route 3, 90° initial heading case, no semantically labeled regions such as sidewalks are in view of the camera, and the Route 3, 45° initial heading case has a similar issue. We did not test cases where obstacles block the camera's field of view, for example when roads are occluded by a bus stop or the camera faces store panels, because our semantic segmentation CNN cannot predict properly in those situations.
D. Localization Estimation Evaluation
We conducted a test on five different routes with two participants with severe visual impairment. To prepare them for the test, we provided a 30-minute training session using an app that allowed them to familiarize themselves with our system and provide feedback on the different routes; the routes used in this session were not used for training our CNN model or in the navigation experiments. Fig. 15 shows example results, and Table VI shows the errors. Compared to GPS-only localization, the estimation using semantic information performed better on every route except Route 2. "Safe" is less than 65
E. Real-Time Performance Evaluation
We measured the operation time of each module running on the Jetson Orin NX 16 GB; Table VII shows the results. We also independently evaluated the frames per second of our CNN model (full model with depth enhancement) on a smartphone: it takes about 190 ms per frame (roughly 5 fps) on a Samsung Galaxy Note 8, which was released in 2017. Our previous model [52] achieved 180 fps on a GTX 1080, and our current semantic segmentation CNN achieves 209 fps on the same machine with
F. User Study
For our study, we recruited two individuals with visual impairment, neither of whom had residual vision. They conducted navigation experiments on five different routes (see Fig. 9), and the total length of travel was 213.36
G. User Feedback
Table VIII shows the participants' ratings of our system regarding trust and mental effort. Trust measures how reliably the participants felt the system could guide them, so higher is better. Mental effort indicates how much effort the system demands of the user while navigating, so lower is better. The trust scores vary depending on the trials and their success, while the overall mental effort scores are very low. However, S001 was confused by the veer-left and veer-right guidance cues, so their mental effort score was higher than that of S002.
H. Moving Obstacles Simulation
The moving obstacles scenario was simulated to understand the benefit of connecting to the IX-node. Figs. 17, 18, and 19 show how the guidance cue changes depending on a simple time-to-collision (TTC) algorithm. When the currently planned path is blocked, our system delivers "Collision prediction. Stop." to make the user stop. If the planned path is unaffected by the predicted trajectories of moving obstacles, the current guidance remains unchanged, as illustrated in Fig. 17. Fig. 19 shows the case where the moving obstacle is a pedestrian. Notably, the planned path avoids a moving obstacle that is not in the line of sight of the user and is observed only through the IX-node.
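A minimal constant-velocity TTC check in the spirit of this simulation is sketched below; the collision radius and the reduction of the planned path to the user's current position are simplifying assumptions made for illustration.

```python
import numpy as np

def time_to_collision(user_pos, obstacle_pos, obstacle_vel, path_radius=1.0):
    """Simple constant-velocity TTC check: returns the time (s) at which the
    obstacle comes within `path_radius` meters of the user, or None if never.
    Parameters and the point-based approximation of the path are illustrative."""
    rel = np.asarray(obstacle_pos, float) - np.asarray(user_pos, float)
    vel = np.asarray(obstacle_vel, float)
    speed2 = float(vel @ vel)
    if speed2 < 1e-9:
        return None                                  # obstacle is not moving
    t_closest = -float(rel @ vel) / speed2           # time of closest approach
    if t_closest < 0:
        return None                                  # obstacle is moving away
    closest = rel + vel * t_closest
    return t_closest if float(np.linalg.norm(closest)) <= path_radius else None

# The guidance logic would then issue "Collision prediction. Stop." when
# the returned TTC falls below a threshold, and "forward" otherwise.
```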
Conclusion
In this paper, we demonstrated a new wearable system for outdoor navigation assistance. The novelty of this system is its direct cooperation with infrastructure built for self-driving cars: we use a prior map generated for self-driving cars and communicate directly with IX-nodes to obtain information about the vehicle location (destination) and other dynamic agents (hazards). Our system can incorporate information that is not visible to its onboard sensors while providing feasible localization accuracy and guidance based on semantic information and enhanced depth maps. Both participants agreed that our system has the potential to improve their confidence while navigating and requires low mental effort to adapt to. We also proposed a new efficient semantic segmentation architecture based on our prior work, together with a model that simultaneously enhances depth maps from monocular image inputs. Overall, our system addresses key requirements of outdoor navigation for people with visual impairments. However, further research is needed to better handle more dynamic environments, in terms of both algorithms and feedback, and to build even more lightweight and efficient multi-modal CNNs for obtaining semantic information and depth values.