Pedestrian Lane Detection for Assistive Navigation of Vision-Impaired People: Survey and Experimental Evaluation

Pedestrian lane detection is a crucial task in assistive navigation for vision-impaired people. It can provide information on walkable regions, help blind people stay on the pedestrian lane, and assist with obstacle detection. An accurate and real-time lane detection algorithm can improve travel safety and efficiency for the visually impaired. Despite its importance, pedestrian lane detection in unstructured scenes for assistive navigation has not attracted sufficient attention in the research community. This paper aims to provide a comprehensive review and an experimental evaluation of methods that can be applied for pedestrian lane detection, thereby laying a foundation for future research in this area. Our study covers traditional and deep learning methods for pedestrian lane detection, general road detection, and general semantic segmentation. We also perform an experimental evaluation of the representative methods on a large benchmark dataset that is specifically created for pedestrian lane detection. We hope this paper can serve as an informative guide for researchers in assistive technologies, and facilitate urgently-needed research for vision-impaired people.

1 https://documents.uow.edu.au/~phung/plvp3.html

3) We discuss the technical challenges and future research directions in pedestrian lane detection to bridge the gaps towards a practical assistive navigation system.

The remainder of this paper is organized as follows. Section II reviews the traditional detection methods that are based on hand-crafted features. Section III reviews road segmentation methods that are based on deep neural networks. Section IV presents experimental evaluations of the major methods on the PLVP3 dataset. Section V discusses the technical challenges and future directions for pedestrian lane detection. Section VI gives the concluding remarks.

This section reviews representative feature-based methods, which are categorized into three different groups: (i) color-based approaches (Section II-A); (ii) border-based approaches (Section II-B); and (iii) combined approaches using both road color and border features (Section II-C).

A. COLOR-BASED APPROACHES
Color-based approaches classify image pixels by comparing each pixel to a reference color model. The reference color model can be constructed using different color spaces [8], [9], [10], [11], [20].

In [8], Crisman and Thorpe proposed a road detection method called SCARF. This method constructs color models as multiple Gaussian distributions in the red-green-blue (RGB) color space for both the road and off-road classes. First, regions corresponding to the road and the background in the previous frame are selected to construct color models for the current frame. Next, each region is clustered into four homogeneous color groups. Then, four Gaussian distributions are generated for each class from the color groups. Finally, two color models are constructed to segment the road and background regions. The road location in the first frame needs to be defined manually or by another algorithm (e.g., UNSCARF [20]). Because the color models are represented by multiple Gaussian distributions, this method can cope with variations in road colors and textures. However, it relies heavily on the continuity of adjacent frames, which may produce errors if there are sudden changes between two frames.

In [11], Tan et al. proposed a road detection algorithm that uses color histograms to represent road models in the normalized red and green color space. Compared to the standard RGB color space, the normalized space copes better with illumination variations. This method assumes that the center-bottom part of an input image is a homogeneous road region. Accordingly, the sample region is defined as a rectangle at the center-bottom part of the input image. For each frame, one color distribution is generated from the pixels in the sample region. The final road model is represented by four color distributions generated over time from different frames. Since this method considers the frame continuity with multiple color distributions, it improves the detection stability.

In another color-based approach, chromatic pixels are classified using chromatic information, whereas achromatic pixels (with extreme intensity or low saturation) are classified using intensity only. To classify a pixel, a threshold value is defined as a function of two parameters: (i) the distance between the pixel and the previously-predicted road center, and (ii) the maximum threshold value of the previous frame.

This method assumes that the two lane boundaries are approximately parallel (Fig. 3). The above methods can accurately detect simple roads with clear boundaries in structured scenes, but they are not effective in coping with occlusions, degraded lane edges, or atypical road shapes. This is because the road models used to match lanes are simplified, and the performance of these methods depends highly on clear road features.
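To make the border-based family concrete, the following is a minimal sketch of the classical edge-plus-line pipeline that such methods build upon: Canny edge detection followed by a Hough transform to find candidate boundary segments. The use of OpenCV and all thresholds here are illustrative assumptions, not the implementation of any surveyed method.

    import cv2
    import numpy as np

    def detect_lane_border_segments(bgr_image):
        """Toy border-based detector: keep strong, non-horizontal line segments
        in the lower half of the image as candidate lane edges."""
        gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 50, 150)              # illustrative thresholds
        edges[: edges.shape[0] // 2] = 0              # keep only the lower half

        lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=50,
                                minLineLength=60, maxLineGap=10)
        segments = []
        if lines is not None:
            for x1, y1, x2, y2 in lines[:, 0]:
                angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
                if 20 < angle < 160:                  # discard near-horizontal edges
                    segments.append((x1, y1, x2, y2))
        return segments

    # Example usage: segments = detect_lane_border_segments(cv2.imread("walkway.jpg"))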

To overcome this problem, several methods have employed vanishing points to determine lane boundaries [17], [18], [19]. In [17], Rasmussen

point. To construct road models, input images are clustered into homogeneous color regions. One Gaussian distribution is generated for each homogeneous pixel group in the sample region. One GMM represents the road colors in one video frame. A fixed number of GMMs from different frames are stored for lane detection. This method can cope with various road shapes and road surface textures. However, because the sample regions are refined by the lines connecting the vanishing points and two predefined points at the image bottom, this method cannot cope well with road regions far from the image center.
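To illustrate the Gaussian-mixture color modelling that these methods share, below is a minimal sketch that fits a GMM to RGB pixels drawn from an assumed road sample region and scores every pixel against it. The sample-region location, the number of components, and the decision threshold are assumptions for illustration, not the procedure of any specific surveyed method.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def road_likelihood_mask(rgb_image, n_components=4, log_lik_threshold=-12.0):
        """Fit a GMM road color model on a center-bottom sample region, then
        label every pixel whose color log-likelihood exceeds a threshold."""
        h, w, _ = rgb_image.shape
        # Assumed homogeneous road sample: a rectangle at the center-bottom.
        sample = rgb_image[int(0.8 * h):, int(0.3 * w):int(0.7 * w)].reshape(-1, 3)

        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(sample.astype(np.float64))

        scores = gmm.score_samples(rgb_image.reshape(-1, 3).astype(np.float64))
        return (scores > log_lik_threshold).reshape(h, w)   # True = road-like pixel

    # Example with a dummy image (replace with a real H x W x 3 RGB array):
    mask = road_likelihood_mask(np.random.randint(0, 256, (240, 320, 3)))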

In [21], He et al. generated color models as Gaussian distributions from the pixels within estimated road boundaries. First, edge images are generated by applying edge detectors to projection images of lanes. Next, pseudo road boundaries are determined from the edge images using vanishing points and eight curvature models. The pseudo road boundaries are much narrower than the real boundaries, which ensures that all pixels within them belong to the road class. Finally, color models constructed from these pixels are used to segment the real lane areas. Due to the assumption of predefined curvature models, this method can detect only a few types of road structures.

then computes a weighted sum of the output segmentation masks from each model. Higher weights are given to regions whose shapes are more similar to a typical road (usually a trapezoid). However, the performance of this method may be severely affected if one model produces imprecise predictions. In [29], Yadav et al. proposed a conditional random field (CRF) framework, in which the segmentation masks produced by SegNet are used as prior knowledge to create two color-lines models, one for the road and the other for the background. The above methods apply boundary features to refine segmentation results, but they only use these features in the last few stages and therefore may not fully utilise the road features.

In lane detection, the network might misclassify roads in the distance, preventing users from planning their routes ahead. The network might also misjudge small obstacles or tripping hazards as roads, which could endanger the user. Second, the FCN focuses more on local information than on the global context, which makes the prediction outputs lack spatial consistency. This might cause the network to predict lanes in incorrect places, such as on water or in the sky.

Similarly, there are two challenges in applying classification networks to semantic segmentation. The first challenge is the loss of image resolution. Classification networks utilise down-sampling and max-pooling operations to extract low-resolution feature maps. However, semantic segmentation tasks require precise location information, which can only be achieved with high-resolution feature maps. The second challenge is the lack of multi-scale context. Objects in an image appear at multiple scales, and convolutions with fixed-size kernels are not enough to capture both local and global contexts. Over the years, many network structures have been developed to address these two challenges.

can be bridged by the skip pathways, shown in green and blue in Fig. 11. Moreover, because UNet++ produces full-resolution predictions at multiple semantic levels (X^{0,1}, X^{0,2}, X^{0,3}, X^{0,4}), the network can be trained using deep supervision. At inference time, UNet++ can operate in a fast mode, where the final segmentation map is generated from one of the intermediate semantic levels (e.g., X^{0,1}, X^{0,2}, or X^{0,3}).
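To make the deep-supervision idea concrete, the following is a minimal PyTorch sketch (not the UNet++ reference implementation) that averages a pixel-wise loss over several full-resolution side outputs; the tensor names and shapes are assumptions for illustration.

    import torch
    import torch.nn as nn

    def deep_supervision_loss(outputs, target):
        """Average a pixel-wise loss over several full-resolution side outputs.

        outputs: list of logits, each of shape (B, C, H, W), e.g. predictions
                 taken from the X^{0,1}..X^{0,4} nodes of a UNet++-style decoder.
        target:  ground-truth mask of shape (B, H, W) holding class indices.
        """
        criterion = nn.CrossEntropyLoss()
        losses = [criterion(out, target) for out in outputs]
        return torch.stack(losses).mean()

    # Dummy example: four side outputs, two classes (lane / background).
    outputs = [torch.randn(1, 2, 320, 320, requires_grad=True) for _ in range(4)]
    target = torch.randint(0, 2, (1, 320, 320))
    loss = deep_supervision_loss(outputs, target)
    loss.backward()  # every side output receives a gradient signal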

DeepLabv3+ [38] is the latest extension and has the highest performance in the DeepLab family. Compared to DeepLabv3, DeepLabv3+ adopts an encoder-decoder architecture to restore the spatial resolution (Fig. 15). The output feature maps from the encoder are bilinearly upsampled by a factor of 4 and then concatenated with the low-level feature maps. DeepLabv3+ also uses the atrous separable convolution, which consists of an atrous depthwise convolution followed by a 1 × 1 pointwise convolution. The atrous separable convolution reduces the computational complexity significantly while maintaining similar or better performance. In summary, dilated convolution is effective for enlarging the receptive field and handling multiscale features.
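As an illustration, here is a minimal PyTorch sketch of an atrous separable convolution, built as a dilated depthwise convolution followed by a 1 × 1 pointwise convolution; this is a generic sketch rather than the DeepLabv3+ reference implementation.

    import torch
    import torch.nn as nn

    class AtrousSeparableConv(nn.Module):
        """Depthwise convolution with dilation, followed by a 1x1 pointwise convolution."""
        def __init__(self, in_ch, out_ch, kernel_size=3, dilation=2):
            super().__init__()
            # Depthwise: one filter per input channel (groups=in_ch), with atrous rate `dilation`.
            self.depthwise = nn.Conv2d(
                in_ch, in_ch, kernel_size,
                padding=dilation * (kernel_size - 1) // 2,
                dilation=dilation, groups=in_ch, bias=False)
            # Pointwise: 1x1 convolution mixes information across channels.
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    # The spatial size is preserved thanks to the "same" padding above.
    x = torch.randn(1, 256, 40, 40)
    y = AtrousSeparableConv(256, 256, dilation=2)(x)
    print(y.shape)  # torch.Size([1, 256, 40, 40])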

The attention mechanism is designed to improve segmentation performance by placing a stronger emphasis on important features. It is inspired by the way humans perceive images, focusing on the important parts rather than the entire image. This mechanism is helpful for the lane detection task as it encourages segmentation networks to focus on roads instead of unrelated classes.

Training deep learning models often requires a huge amount of computation time and labeled data. Using encoders pre-trained on the ImageNet dataset is a common practice when designing deep networks for downstream tasks such as lane segmentation. Many road detection methods also pre-train the entire network on other large road-scene datasets, such as the Cityscapes [59] and Mapillary [60] datasets.

This section discusses the conceptual differences between traditional and deep learning methods. The key difference is in the feature extraction approach. Traditional lane detection methods use hand-engineered feature extractors designed based on some prior knowledge. For example, the methods in [8], [9], and [10] are based on the observation that the difference in lane positions between two adjacent video frames is usually negligible. The methods in [17], [18], and [19] are based on the assumption that the lane shape is a simple arc and has no branches. However, these heuristics and assumptions work under limited circumstances, and the hand-engineered feature extractors have difficulty coping with different lane types.

We evaluated the different methods on the pedestrian lane dataset PLVP3 [28]. This dataset extends the PLVP dataset [24] and the PLVP2 dataset [27], which have 2,000 and 5,000 images, respectively. The PLVP3 dataset contains 10,000 images in total. Note that other datasets exist for the visually impaired, but they are not suitable for pedestrian lane detection. Hence, these datasets are not used for the experimental evaluation in this survey.

Other benchmark datasets such as Mapillary [60] and Cityscapes [59] contain pixel-wise annotations of the sidewalk class. However, these datasets are created for self-driving vehicles, and hence are not suitable for pedestrian lane detection. In these datasets, the images are taken near the center of the vehicle roads; the pedestrian regions are often on the side and occupy relatively small areas. In other words, there is a domain gap between these datasets and our desired application of assistive navigation for blind people.

To measure model performance, we use three quantitative metrics which have been widely adopted in semantic segmentation research: 1) pixel accuracy, 2) mean intersection over union, and 3) F1 score. To obtain the overall evaluation score on the test set, the metrics are computed for individual images and then averaged over the entire test set.

1) Pixel accuracy is the ratio of correctly classified pixels to the total number of pixels.
2) Mean intersection over union (mIoU) computes the average IoU over all semantic classes. Let S be a machine-predicted segmentation map, and G be the corresponding ground-truth mask. Intersection over union (a.k.a. Jaccard index) is defined as the area of overlap between S and G divided by the area of their union:

IoU(S, G) = |S ∩ G| / |S ∪ G|.

3) F1 score is the harmonic mean of precision and recall.

2) EXPERIMENTAL SETUP
We employed 5-fold cross-validation to evaluate the representative methods. The dataset was divided randomly into five equal-sized partitions. For each fold, one partition was used as the test set, and the remaining four partitions were used as the training set. This step was repeated five times for different choices of the test partition, and the segmentation measures were then averaged. Note that each training set was further divided into 90% of the images for training and 10% for validation. Collectively, each cross-validation fold consisted of 7,200 training images, 800 validation images, and 2,000 test images. The images were resized to 320 × 320 pixels.
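For concreteness, the metrics defined above can be computed per image as in the following minimal NumPy sketch; this is an illustration rather than the evaluation code used in this survey, and averaging the F1 score over both classes is an assumption here.

    import numpy as np

    def per_image_metrics(pred, gt, num_classes=2):
        """pred, gt: integer label maps of identical shape (H, W)."""
        pixel_acc = np.mean(pred == gt)

        ious, f1s = [], []
        for c in range(num_classes):
            p, g = (pred == c), (gt == c)
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            if union > 0:
                ious.append(inter / union)          # IoU = |S ∩ G| / |S ∪ G|
            if p.sum() + g.sum() > 0:
                f1s.append(2 * inter / (p.sum() + g.sum()))  # per-class F1
        return pixel_acc, np.mean(ious), np.mean(f1s)

    # Dummy example: metrics are computed per image, then averaged over the test set.
    pred = np.random.randint(0, 2, (320, 320))
    gt = np.random.randint(0, 2, (320, 320))
    print(per_image_metrics(pred, gt))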

To train the deep neural networks, we used the Adam optimizer [64] with a learning rate of 0.001. The exponential decay rates for the first and second moment estimates were set to 0.9 and 0.999, respectively. All models used ImageNet-pretrained weights for the encoders.

Assistive navigation for blind people requires real-time algorithms. In Table 3, five methods achieve an inference speed higher than 30 FPS: U-Net, BGN, DeepLabv3+, PSPNet, and the NAS-based method. Among them, the BGN and the NAS-based method achieve the highest inference speed. The BGN uses Bayesian Gabor layers instead of common convolutional layers, which significantly reduces the number of trainable weights. The NAS-based method searches directly on the pedestrian lane dataset instead of relying on networks found with other image datasets. Consequently, it obtains a deep network with a structure optimized for the lane detection task. Note that the two methods with the highest segmentation accuracy (multiscale HRNet and multiscale HRNet-OCR) have the lowest inference speed (below 4 FPS).

Fig. 19 presents examples of pedestrian lane detection results produced by the different segmentation methods. The visual results indicate that the border-based method does not cope well with variations in lane shape. This is because it assumes that all lanes are formed by two straight edges pointing to the vanishing point. The combined method performs better than the border-based method. However, it only achieves medium performance, especially when the lane region has varying textures or a color similar to the background. This is because the combined method uses the color model constructed from the lower half of the lane to detect the entire lane region. This method also relies substantially on the accuracy of the detected vanishing point. For example, in Row 3, Column 4, this method misses the true lane and misclassifies the handrail as a lane.

The deep learning methods achieve significant improvements over the traditional methods in both inference speed and detection accuracy. However, they still produce segmentation errors, especially when the background has textures similar to the lane regions (Rows 6 and 8). This finding indicates that, despite their high performance in Table 3, the deep-learning-based methods are still not robust enough to maintain high accuracy in complex scenes, as demonstrated in Fig. 19.

In terms of segmentation accuracy (Table 5), MobileNetV3 achieves the best performance with an mIoU of 96.10% and an F1-score of 97.98%. EfficientNet-b6 achieves the second-best accuracy with an mIoU of 95.92%. Compared to the original encoder in U-Net [34], the SOTA backbones improve the mIoU score by 1.34% to 2.46%. In terms of inference speed and model size, MobileNetV3 has the smallest model size (19.80 MB) and the second-highest inference speed (64.516 images/s). MobileNetV2 achieves the highest inference speed (64.935 images/s) and the second-smallest model size (26.80 MB).
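As a concrete illustration of how such encoder swaps are typically configured, the following sketch builds a U-Net with an ImageNet-pretrained MobileNetV2 encoder and the Adam settings reported earlier in this section. The segmentation_models_pytorch library is an assumed choice for illustration and is not necessarily what was used in the reported experiments.

    import torch
    import segmentation_models_pytorch as smp

    # U-Net decoder with an ImageNet-pretrained MobileNetV2 encoder. Other
    # backbones (e.g., ResNet or EfficientNet variants) can be swapped in by
    # changing encoder_name.
    model = smp.Unet(
        encoder_name="mobilenet_v2",
        encoder_weights="imagenet",
        in_channels=3,
        classes=2,              # pedestrian lane vs. background
    )

    # Adam with the settings reported above: lr = 0.001, betas = (0.9, 0.999).
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

    # One dummy optimisation step on a 320 x 320 input.
    images = torch.randn(2, 3, 320, 320)
    masks = torch.randint(0, 2, (2, 320, 320))
    loss = torch.nn.functional.cross_entropy(model(images), masks)
    loss.backward()
    optimizer.step()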

Despite the high performance that the deep networks achieve, many aspects still need to be tackled before a practical assistive navigation system is possible. This section discusses future research directions to address the current technical limitations of pedestrian lane detection.

results in Section IV-C show that even the

annotations for other objects in traffic scenes. Note that creating a new large dataset containing all the class labels required for assistive navigation is time-consuming and costly. Therefore, developing models that can learn from multiple existing road-scene datasets is useful to address the lack of labeled data.

4) Developing methods to compress lane detection networks for real-time inference on smartphones or edge devices: As shown in Section IV, most methods with larger storage. See also Fig. 21 and Fig. 22 for a sum-