
Enhancing Object Detection in Assistive Technology for the Visually Impaired: A DETR-Based Approach




Abstract:

This paper presents a real-time obstacle detection and recognition system designed to enhance navigation for visually impaired individuals through assistive technology. The system integrates a mobile application equipped with a mini camera for real-time image capture and employs advanced deep learning techniques for object detection and classification. A comparative evaluation of YOLOv8, Faster R-CNN and DETR (Detection Transformer) is conducted based on precision, recall, F1-score, confidence score and processing efficiency. DETR demonstrates superior performance, achieving a 99% confidence score, 98% precision and a processing speed of 40 ms per frame. While Faster R-CNN and YOLOv8 provide competitive results, they offer a trade-off between accuracy and computational efficiency. The system follows a structured workflow, including real-time acquisition, preprocessing, innovative data augmentation and optimization for edge devices using TensorFlow Lite for efficient deployment. It classifies 80 types of obstacles, such as pedestrians, vehicles and traffic signals, and provides immediate audio feedback to ensure safe navigation. The model, trained over 20 epochs, achieves an accuracy of 98% in the final epoch. This study introduces a scalable and practical solution integrating IoT and real-time image processing, empowering visually impaired users with enhanced mobility and safety.
Published in: IEEE Access ( Volume: 13)
Page(s): 71647 - 71661
Date of Publication: 07 April 2025
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Visually impaired individuals rely on visual aids as their fundamental tool for navigating their environment and identifying objects, with outputs delivered audibly [1]. Addressing their specific challenges demands focused effort and innovative solutions. Machines use object detection to identify and determine the presence of objects in images [2]. To address these needs, this work proposes an IoT-enabled warning system designed for real-time environments. The system integrates a mobile application with a mini camera to capture real-time images, processes them using advanced deep learning techniques and delivers immediate audio feedback through an assistive device, ensuring enhanced safety and autonomy [3].

Advances in deep learning have dramatically enhanced both the accuracy and speed of object detection models; frameworks such as the Single-Shot Detector (SSD), which fold localization and object classification into a single efficient pass, have been pivotal in real-time applications. More recent models, such as YOLOv8 [4] and Vision Transformers (ViTs) [5], extend these capabilities by leveraging attention mechanisms and computational optimizations, achieving unprecedented performance in real-time environments. Various technologies, including sensors, cameras and machine learning, have been explored for obstacle detection [6]. Traditional object detection techniques require multiple sequential steps, beginning with region proposal generation followed by object classification. Features such as bounding boxes, confidence scores and thresholds are widely employed in existing object detection research [7]. Mini cameras have been widely used in live systems, with models like YOLO proving effective for obstacle detection and recognition [8]. These techniques are complemented by text-to-speech systems that provide audio feedback [9].

The motivation behind this work is threefold: designing a user-friendly interface that enables obstacle detection systems to interact with users and deliver understandable, trustworthy outputs; ensuring the long-term reliability and easy maintenance of systems that may degrade over time due to environmental changes or sensor wear; and achieving real-time performance, since obstacle detection in applications such as autonomous vehicles requires quick responses while maintaining accuracy standards.

Solving these challenges demands a collaborative approach that unites computer vision techniques with machine learning algorithms and specialized domain expertise [10]. It also requires rapid processing to ensure immediate feedback, which is crucial for safety and effective navigation [11]. In [12], ultrasonic sensors embedded in walking sticks and shoes help identify obstacles and provide immediate alerts to users. The DBS-YOLOv8 model [13] was introduced to improve object detection in low-light environments, addressing a critical challenge faced by visually impaired individuals and enhancing detection accuracy in difficult conditions. While prior systems have used computer vision algorithms to enhance outdoor navigation, they have often fallen short of the required accuracy [14].

The primary objective of this work is to comprehensively analyze the performance of several advanced models.

This paper has the following contributions:

  • The central aim of this study is to transform assistive technologies into more accessible solutions that maintain high accuracy standards for visually impaired users in real-time environments.

  • The study aims to enhance assistive technologies by improving their accessibility and effectiveness, with a particular focus on accuracy for visually impaired individuals.

  • The study extensively evaluates and benchmarks the performance of detection algorithms, including DETR [15], Faster R-CNN [16] and YOLOv8 [17], in a real-time environment, focusing on their accuracy, confidence scores and processing speeds for obstacle detection.

  • This research contributes to the science of autonomous systems through its thorough examination and assessment of different detection methods. The findings and recommendations presented in this document create substantial opportunities to enhance obstacle detection systems in real-world applications.

The visually impaired experience daily obstacles when using navigation systems and when attempting to avoid objects. Detection algorithms have become prominent because they provide fast and efficient performance in real-time applications.

The paper follows this structure. Section II presents the literature review, an in-depth examination of existing obstacle detection models. Section III analyzes the frameworks and architectures of the detection algorithms and reports the implementation results of the three detection algorithms in an outdoor environment, together with the discussion and limitations. Section IV concludes the paper.

SECTION II.

Related Work

Blind people encounter numerous obstacles in their everyday activities because they depend on assistance for daily tasks. Safety emerges as the primary challenge for visually impaired people, since without visual assistance they become vulnerable to accidents, falls and collisions with objects or other people. Navigation represents another major obstacle: crossing roads safely, detecting obstacles, finding destinations and maintaining independence all pose significant challenges for these individuals. This review covers the technologies, methodologies and engineered approaches that developers use to create such systems, and it examines typical obstacles and system constraints alongside the practical effects of this technology in real-world applications [18].

As shown in Table 1, the existing literature spans a broad spectrum of technologies, hardware configurations and objectives, summarizing prior techniques and their achieved outcomes. Different approaches have shown varying levels of success across different implementation settings. Most of these methods rely on traditional approaches; while they demonstrate noteworthy accomplishments, they continue to face challenges regarding efficiency, accuracy and adaptability. Table 1 presents an extensive summary of research activity across the field, and the following section details previous research studies.

TABLE 1 Summary of literature review: Methods, Goals, techniques and achievements

A deep learning-based obstacle detection system has been developed for Unmanned Surface Vehicles (USVs) that uses computer vision techniques combined with sensor fusion to identify and avoid obstacles in real time, ensuring safe navigation in dynamic maritime environments [19]. Kulkarni et al. [20] reviewed multiple attempts to develop assistive gadgets for visually impaired individuals. Their manuscript surveys the sensors used in navigation systems for people and examines recent advancements in assistive technologies, including visual sensors, proximity sensors and LiDAR sensors, among others. These sensors generate abundant data that is processed into environmental representations. The manuscript examines the strengths of each sensor, identifies the prior information that yields the best results, and presents sensor-related issues alongside potential solutions.

In [21], Rajesh et al. employed sensors to acquire quantitative datasets for model training. Their mobile application serves as both a security platform and an obstacle sensor. Radiofrequency technology determines an obstacle-free path, and the system employs GPS tracking for navigation.

Jayachitra et al. [22] described a smart glove system that provides directional guidance through hand gestures. The model uses deep neural networks and YOLOv5 for training and incorporates both cameras and microphones in its design. The researchers employed image-based data to develop their model, which also achieves the objective of identifying object colors. Sissodia et al. [23] developed a smart car system that helps visually impaired users through voice commands and object detection sensors. The system contains multiple sensors that identify objects while simultaneously receiving voice commands from users, and it tracks how far the user stands from an object. The developed application enables users to choose paths while tracking the car's movements. The sensors generate quantitative measurements to determine object-to-object distance, and artificial intelligence was used to train the application's model.

Chauhan et al. [24] proposed an IoT-based wheelchair system to assist disabled people. The machine learning algorithm's path-finding decisions operate with enhanced security and safety measures, and the sensor employed in this research gathers numerical data to produce results. Lima et al. [25] aimed to improve navigation technology for the visually impaired to enhance their quality of life. The research examined environmental directions and obstacles to gain better understanding; the brute-force variant of the K-Nearest Neighbor (KNN) algorithm supports feature matching processes that help prevent dangerous situations. Prathibha et al. [26] implemented the YOLO algorithm from the Convolutional Neural Network (CNN) family, improving both accuracy and efficiency in a system that helps visually impaired individuals. This research addresses multiple challenges that visually impaired people encounter while navigating their environment. White canes function as obstacle-detection devices; the white cane in this work incorporates sensors that collect data, and the cane system enhances mobile application performance through training with the YOLO algorithm, which leads to better model accuracy.

Masud et al. [27] combine cameras with detection sensors to detect objects. The user employs a stick to detect objects, which sends the collected data to a mobile application for classification. The authors employed a deep learning algorithm for their classification work; the TensorFlow Object Detection framework performed model training and classification. The ultrasonic sensor detected objects and triggered beep alerts, while the camera system transmitted object details to the user. The device aims to provide users with enhanced safety when surrounded by people.

The obstacles that visually impaired people encounter restrict their ability to perform daily tasks in their environment. A navigation system is the most essential requirement for visually impaired people to receive path directions, and through IoT, users with vision loss can navigate with assistance in outdoor areas. In [28], Dhou et al. designed a mobile application to detect objects. The system collects data through sensors, which is then sent to a mobile application for preprocessing and classification; Google Maps serves as the navigation tool for pathfinding. The system's prototype was tested in multiple environmental conditions, with accuracy rising from 81% to 98% during the implementation period. The research evaluates the CNN model against the KNN classifier to enhance its operational capabilities.

The following research questions guide the comparative analysis of obstacle detection systems for visually impaired individuals:

RQ1: What is the accuracy and reliability of various detection algorithms when detecting obstacles in outdoor environments?

RQ2: What is the comparative performance of DETR, Faster R-CNN and YOLOv8 in terms of accuracy, confidence score and processing speed for real-time obstacle detection?

Object detection research has evolved into a specialized field focused on improving detection methods. New techniques bring innovations aimed at solving problems identified in previous work, and examining recent detection methods helps clarify their strengths and weaknesses in addressing the issues identified in the broader literature presented in Table 1.

SECTION III.

Materials and Methods

The proposed system leverages a real-time obstacle detection and recognition framework tailored for visually impaired individuals. This system integrates state-of-the-art deep learning models and innovative hardware configurations to provide accurate and timely assistance in outdoor environments. Below, we outline the materials, experimental setup and methodology employed in this study.

A. System Overview

The system consists of the following components:

1) Mini Camera

A lightweight, high-resolution mini camera is mounted on the user’s cap to capture real-time video streams from the user’s perspective. This setup ensures hands-free operation and seamless data acquisition.

2) Mobile Application

A dedicated mobile application is designed to process the captured image data. The application handles tasks such as data preprocessing, augmentation, classification and feedback generation. It is optimized for edge devices using TensorFlow Lite to ensure efficient processing.
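As a rough illustration of this edge optimization step, the sketch below converts a trained detection model to TensorFlow Lite with post-training optimization; the saved-model path and output filename are hypothetical assumptions, not the paper's actual artifacts.

```python
import tensorflow as tf

# Hypothetical path to an exported detection model (not the paper's artifact).
SAVED_MODEL_DIR = "saved_model/detector"

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
tflite_model = converter.convert()

# Write the flat buffer that the mobile application would bundle and load.
with open("detector.tflite", "wb") as f:
    f.write(tflite_model)
```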

3) Internet Connectivity

The system employs a stable internet connection to transmit image data from the mini camera to the mobile application. This ensures uninterrupted real-time processing.

4) Deep Learning Models

The mobile application integrates advanced deep learning models such as YOLOv8, Faster R-CNN and DETR for object detection. Among these, DETR demonstrated superior performance in terms of accuracy and speed, making it the primary model used for classification.

B. Architecture of the Proposed Model

1) Real-Time Data Collection

The data collection process is conducted in real-time outdoor environments on busy roads. The mini camera captured real-time images of dynamic environments, including moving vehicles, pedestrians and stationary obstacles such as traffic lights and signboards. The captured data was transmitted to the mobile application in real time for further processing, as shown in Figure 1. The experiments were conducted on outdoor roads during active traffic conditions; the dynamic nature of this setting provided a robust test environment for evaluating the system's performance under real-time scenarios. Visually impaired individuals evaluated the system to assess its usability and effectiveness. Real-time video data was streamed from the mini camera to the mobile application via a Wi-Fi internet connection.
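As a minimal sketch of this streaming step, the code below reads frames from a network camera with OpenCV; the stream URL and endpoint format are hypothetical, since mini cameras expose different protocols (MJPEG, RTSP, vendor SDKs).

```python
import cv2

# Hypothetical Wi-Fi mini-camera endpoint; real devices expose their own URL or SDK.
STREAM_URL = "http://192.168.1.42:8080/video"

cap = cv2.VideoCapture(STREAM_URL)
try:
    while cap.isOpened():
        ok, frame = cap.read()  # one BGR frame from the live stream
        if not ok:
            break
        # `frame` would be handed to the preprocessing and detection pipeline here.
finally:
    cap.release()
```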

FIGURE 1. The system architecture for the proposed Obstacle Detection Model.

2) Data Preprocessing, Augmentation and Classification

Once the image data is received by the mobile application, the images are resized and normalized to ensure compatibility with the input dimensions of the deep learning models, and noise reduction techniques are applied to enhance image clarity [29]. Data augmentation techniques such as random rotation, scaling and flipping were employed to increase the variability of the dataset; this step improved the model's ability to generalize across different scenarios and lighting conditions [30], [31]. The preprocessed and augmented data were fed into the DETR model for object detection and classification, and obstacles were categorized into classes such as pedestrians, vehicles and traffic signals with a high degree of accuracy and speed [32].

C. Training Module

To evaluate the effectiveness of the obstacle detection system, a real-time experiment was conducted with a visually impaired user navigating an outdoor road environment. The experiment aimed to evaluate the system's performance under varying conditions, including daylight and nighttime scenarios.

The mini camera mounted on the user's cap captured real-time image data of the surroundings. The camera connected to a mobile application via the internet, enabling seamless data transmission for processing, and the user wore headphones to receive audio feedback generated by the application. The experiment was conducted on a moderately busy road with regular traffic. The road included various obstacles, such as moving vehicles, pedestrians and stationary objects (e.g., parked cars, motorbikes, buses and traffic lights) [33]. The user walked continuously for 30 minutes in two distinct sessions: one during daylight and the other at night.

  • During the daylight session, the user navigated the road under optimal lighting conditions. The mini camera captured video frames with clear visibility of obstacles, and the application processed approximately 12,000 bounding boxes during the session, classifying objects such as cars, motorbikes, buses and pedestrians. Reflections from parked vehicles occasionally caused minor distortions in object detection; however, the system consistently provided accurate classifications and timely audio feedback to guide the user.

  • In the nighttime session, the user walked along the same road. The mini camera used its built-in low-light enhancement capabilities to capture real-time images, with streetlights and vehicle headlights providing additional illumination. The application processed approximately 10,500 bounding boxes during this session.

D. Experimental Results of Preprocessing and Augmentation

The experimental data was generated in a Python environment using Jupyter Notebook as the interface for model development and testing. The process uses NumPy for advanced numerical calculations alongside TensorFlow for deep learning model creation and training. This setup provided an efficient way to manage the intricate tasks involved in dataset preprocessing and augmentation. A camera integrated into the visual impairment cap collects visual data for its users. Careful preprocessing is essential to achieve optimal model input conditions: the camera-generated images are resized to a standard $224 \times 224$-pixel format that matches the model's input requirements. Resizing both accelerates processing and preserves uniformity throughout the dataset. The training dataset receives data augmentation to boost its diversity, since this directly impacts the model's ability to generalize; the images are manipulated through rotation, flipping, zooming and brightness adjustment. The original dataset consisted of two thousand images before augmentation, and various augmentation techniques were applied to increase the dataset size and improve model generalization.

The data augmentation methods used to enhance the training dataset appear in Table 2, boosting the model's generalization capability and robustness. Each augmentation method was applied to the original 2000 images, generating a corresponding number of augmented images (e.g., 20% of 2000 = 400 images). Applying multiple augmentation techniques introduced diverse real-time scenarios; the distribution of augmentation methods shows their impact on the overall augmentation process along with the number of images generated per technique. These augmentations help the model handle challenges such as varying object sizes, orientations, lighting conditions and distortions during inference, contributing to its high accuracy and adaptability.
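A minimal augmentation sketch in TensorFlow/Keras covering the operations named above (rotation, flipping, zooming and brightness adjustment); the transformation ranges are illustrative placeholders, not the proportions reported in Table 2.

```python
import tensorflow as tf

# Illustrative augmentation pipeline; factor values are placeholders.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.1),       # random rotation
    tf.keras.layers.RandomFlip("horizontal"),  # random horizontal flip
    tf.keras.layers.RandomZoom(0.2),           # random zoom in/out
])

def augment_image(image):
    """Apply geometric augmentations plus a brightness shift to one image tensor."""
    image = augment(tf.expand_dims(image, 0), training=True)[0]
    image = tf.image.random_brightness(image, max_delta=0.2)
    return image
```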

TABLE 2 Overview of data augmentation techniques and their proportional contribution to the augmented dataset used for model training

E. Image Classification Results by Using DETR

Correctly detecting objects as they appear is the essential operational requirement of this model; negligence in this regard leads to accidents and other types of losses [34]. The proposed intelligent real-time obstacle detection model is designed to prevent such accidents: the system uses intelligent notifications to inform visually impaired users about their present environment.

Figure 2 demonstrates the image transformation steps: raw outdoor images (a)-(f) undergo preprocessing and augmentation before classification to boost object detection performance. The first column displays unprocessed outdoor environment images recorded directly by the mini camera. The preprocessing and augmentation techniques were evaluated by training the model on the augmented database.

FIGURE 2. Transformation of outdoor environment images through various stages: original image, augmented image and classified image using DETR.

The model successfully identifies key objects in the scene, such as persons, cars and motorcycles, as well as eighty other object classes, with high confidence scores ranging from 0.92 to 0.99. The annotation system delimits objects with red boxes that display object labels along with their associated confidence values for precise observation. The tight alignment of bounding boxes with objects shows how efficiently DETR performs object localization and classification in outdoor environments. The high confidence scores indicate strong robustness in differentiating objects and handling variations in size and position.

The outdoor environment, with its combination of trees, streetlights and visible road surface, creates a complex scene for object detection tasks. The model demonstrates flexible detection capabilities that maintain precise results across different lighting situations and distances. The image demonstrates how DETR performs simultaneous detection of multiple objects, which makes it suitable for real-time applications such as obstacle detection systems for visually impaired users, and its ability to detect multiple object classes in the same frame underscores its versatility.

The mathematical equations used in preprocessing are given below; a short code sketch of Eqs. (1)-(3) follows their definitions.

Preprocessing involves normalizing, resizing and scaling the input images to prepare them for the model:\begin{equation*} I_{norm}\left ({{ x,y }}\right )=\frac {I(x, y)}{255}, \tag {1}\end{equation*} where,

  • I(x,y) is defined as original pixel value at position (x,y)

  • Inorm(x,y) is defined as the normalized pixel value at the position (x,y), where pixel values are scaled between 0 and 1 by dividing by 255.

To ensure all input images have the same dimensions, they are resized:\begin{equation*} I_{resize}\left ({{ x,y }}\right )=I \left ({{\frac {x \cdot w_{new}}{w_{orig}}, \frac {y \cdot h_{new}}{h_{orig}}}}\right ), \tag {2}\end{equation*} where,

  • Iresize(x,y) is the pixel value at position (x,y) in the resized image.

  • I(x,y) is the original pixel value at position (x,y) in the original image.

  • worig, horig are the original width and height of the image.

  • wnew, hnew are the target width and height of the image.

Images are scaled to a specific range for consistency:\begin{equation*} I_{scaled}=\frac {I_{norm}- \mu }{\sigma }, \tag {3}\end{equation*} where,

  • Iscaled represents the scaled pixel value.

  • Inorm is the normalized pixel value (previously scaled between 0 and 1).

  • $\mu $ is the mean pixel value of the dataset or image.

  • $\sigma $ is the standard deviation of pixel values of the dataset or image.

  • The equation standardizes the pixel values to have a mean of 0 and a standard deviation of 1, which is typically done to improve model performance in machine learning tasks.
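As referenced above, the sketch below implements Eqs. (1)-(3) with NumPy and OpenCV; the 224 x 224 target size follows the earlier description, while the mean and standard deviation values are placeholders rather than the dataset's actual statistics.

```python
import cv2
import numpy as np

def preprocess(image, size=(224, 224), mean=0.5, std=0.25):
    """Resize, normalize and standardize one image as in Eqs. (1)-(3)."""
    resized = cv2.resize(image, size)                # Eq. (2): match the model input dimensions
    normalized = resized.astype(np.float32) / 255.0  # Eq. (1): scale pixel values to [0, 1]
    scaled = (normalized - mean) / std               # Eq. (3): zero-mean, unit-variance scaling
    return scaled
```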

Augmentation involves transformations that increase dataset variability; a short code sketch of Eqs. (4)-(6) follows their definitions below.

Images are rotated by a certain angle:\begin{align*} \begin{pmatrix} x' \\ y' \end{pmatrix}= \begin{pmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \tag {4}\end{align*} where,

  • (x’, y’) represent the new pixel coordinates after rotating the original image by the angle $\theta $ .

  • (x,y) are the original pixel coordinates before rotation.

  • $\theta $ is the angle of rotation, typically in radians.

  • This equation applies the 2D rotation transformation to all pixel coordinates of the image.

Augmented images are resized using scaling factors:\begin{equation*} I_{scaled}\left ({{ x,y }}\right )=I(\alpha x,\alpha y), \tag {5}\end{equation*} where,

  • Iscaled (x,y) represents the scaled pixel value at the position (x,y).

  • I($\alpha $ x, $\alpha $ y) represents the pixel value from the original image at the scaled coordinated ($\alpha $ x,$\alpha $ y).

  • $\alpha $ is the scaling factor that determines the zoom level of the image; if $\alpha \lt 1$, the image is zoomed out.

Pixels are shifted by a translation vector $(t_{x}, t_{y})$:\begin{align*} \begin{bmatrix} x' \\ y' \end{bmatrix}= \begin{bmatrix} x+ t_{x} \\ y+ t_{y} \end{bmatrix}, \tag {6}\end{align*} where,

  • (x’,y’) are the pixel coordinates after translation.

  • (x,y) are the original pixel coordinates.

  • tx is the translation along the x-axis.

  • ty is the translation along the y-axis.
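As referenced above, the following OpenCV sketch mirrors the rotation, scaling and translation transforms of Eqs. (4)-(6); the angle, scaling factor and offsets are arbitrary illustrative values.

```python
import cv2
import numpy as np

def rotate(image, angle_deg=15.0):
    """Eq. (4): rotate pixel coordinates about the image centre."""
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, M, (w, h))

def scale(image, alpha=0.8):
    """Eq. (5): resample the image with scaling factor alpha."""
    return cv2.resize(image, None, fx=alpha, fy=alpha)

def translate(image, tx=20, ty=10):
    """Eq. (6): shift every pixel by the translation vector (tx, ty)."""
    h, w = image.shape[:2]
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    return cv2.warpAffine(image, M, (w, h))
```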

The preprocessed and augmented images are then passed to the classification model for object detection and classification. The DETR model used in this study takes an innovative approach, combining a CNN backbone for feature extraction with a Transformer-based architecture for detection. The following equations outline how it works:

The input image is passed through a convolutional neural network (CNN) to extract spatial feature maps:\begin{equation*} F = CNN(I), \tag {7}\end{equation*} where,

  • I is the input image.

  • F is the extracted feature map containing rich spatial and semantic information.

The dataset used for model training consists of 2000 images collected in outdoor road environments using a mini camera mounted on the user's cap. The dataset includes diverse object classes such as pedestrians, vehicles and motorcycles captured under various lighting conditions (day and night), and it was split into 75% for training and 25% for testing. During training, the images were preprocessed and resized to $512\times 512$ resolution before being fed into the CNN for feature extraction. The feature maps are flattened, positional encodings (PE) are added, and the result is passed through the Transformer encoder:\begin{equation*} Z^{enc}=Encoder\left ({{ F+PE }}\right ), \tag {8}\end{equation*} where,

  • Zenc is the output of the Transformer encoder.

  • F represents the feature input (which could be the image features, for example).

  • PE is the positional encoding, which helps the model understand the spatial relationships in the data, especially in models like Transformers.

  • Encoder refers to the encoding process or function (such as the Transformer encoder), which processes the sum of the features and positional encodings to produce the encoded representation.

The flattened feature maps are enriched with positional encodings to ensure that spatial relationships are preserved, and the encoder processes these enriched features to generate a robust representation of the objects in the image. The DETR model was trained for 20 epochs on the collected dataset and integrated into a mobile application, with hardware specifications of a quad-core processor and 4 GB of RAM on a standard Android phone.

Learnable object queries Q interact with the encoded features in the Transformer decoder to produce predictions:\begin{equation*} Z^{dec}=Decoder(Q, Z^{enc}), \tag {9}\end{equation*} where,

  • Q is object queries used to detect objects.

  • Zdec is the output of the decoder.

  • Decoder is the decoding function (e.g., the Transformer decoder), which processes both the object queries and the encoded representation to produce the decoded output.

Object queries enable the model to detect multiple objects within a single frame, making it suitable for real-time applications. During training, twenty object queries were used, corresponding to the maximum number of objects expected in a frame. The model achieved consistent improvement in detection accuracy, reaching 98% on the validation set by the 20th epoch. Training was conducted in a Python-based framework, leveraging PyTorch for the DETR implementation.
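For reference, the sketch below runs an off-the-shelf pretrained DETR checkpoint through the Hugging Face `transformers` API; it is not the paper's fine-tuned model, and the image path, checkpoint name and confidence threshold are assumptions made for illustration.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Publicly available checkpoint used purely for illustration.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()

image = Image.open("frame.jpg")  # hypothetical frame captured by the mini camera
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and box predictions into labelled detections above a threshold.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())
```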

Bounding boxes are predicted using linear projections and a sigmoid activation function to normalize the coordinates:\begin{equation*} b^{i} = \sigma (W_{bbox} Z_{i}^{dec}), \tag {10}\end{equation*} where,

  • bi is the bounding box for object i.

  • $\sigma $ is the activation function, commonly a sigmoid function when predicting bounding box coordinates in object detection tasks, ensuring that the outputs lie within a valid range (e.g., between 0 and 1).

  • Zidec is the decoder output for object i.

  • Wbbox is the weight matrix applied to the decoded feature representation Zidec.

Bounding boxes are critical for marking the detected objects in real time. During testing, bounding box predictions were overlaid on the live feed from the mini camera; the bounding boxes were dynamically resized based on object proximity and camera angle, and these results were transmitted to the mobile application for visualization.

Class probabilities $P_{i}$ for each detected object i are calculated using a SoftMax activation:\begin{equation*} P_{i}=Softmax(W_{cls}Z_{i}^{dec}), \tag {11}\end{equation*} where,

  • Pi is probability distribution over object classes for object i.

  • Wcls learnable weights for classification.

  • Zidec is the decoded feature representation of the ith object.

  • The SoftMax function is applied to generate a probability distribution over all classes.

Object classification was performed across multiple classes, including pedestrians, cars, motorcycles and eighty other classes. Each detected object was assigned a confidence score; DETR provided confidence scores above 99% for most detections, ensuring reliability in real-world conditions. The classification outputs are sent to the mobile application and converted into audio feedback for the user by a text-to-speech engine.
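A minimal sketch of the audio feedback step, assuming an offline text-to-speech library such as `pyttsx3`; the spoken message format is an illustrative choice, not the application's actual phrasing.

```python
import pyttsx3

engine = pyttsx3.init()  # offline text-to-speech engine

def announce(label, confidence):
    """Speak one detection result to the user through the headphones."""
    engine.say(f"{label} ahead, confidence {int(confidence * 100)} percent")
    engine.runAndWait()

announce("car", 0.99)
```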

DETR uses a set-based loss that combines classification loss, bounding box loss and generalized IoU (Intersection over Union) loss:\begin{equation*} L_{Hungarian}\left ({{ y,y' }}\right )= \lambda _{class} L_{class}+ \lambda _{bbox}L_{bbox}+ \lambda _{giou}L_{giou}, \tag {12}\end{equation*} where,

  • Lclass is the classification loss, typically calculated using cross-entropy over the predicted object classes.

  • Lbbox is the bounding box regression loss, usually an L1 loss, which measures the difference between the predicted and true bounding box coordinates.

  • Lgiou is the Generalized Intersection over Union (GIoU) loss, which penalizes poor overlap between the predicted and ground-truth boxes.

  • $\lambda _{class}$, $\lambda _{bbox}$, $\lambda _{giou}$ are weights that balance the contribution of each loss term to the total loss. These weights are hyperparameters that can be adjusted to tune the relative importance of classification, bounding box regression and IoU during training.

  • y and y' denote the ground-truth and predicted values, respectively.

These equations highlight the DETR model’s strength in object detection by efficiently learning from diverse inputs and accurately identifying multiple object classes in real-time.

The training process optimized the DETR model by minimizing a combination of classification and bounding box losses. The Hungarian matching algorithm ensured an optimal assignment between predicted and ground-truth objects, and the loss decreased steadily over the training epochs, contributing to the model's high accuracy. Training logs were monitored on a laptop with 8 GB of RAM and an NVIDIA GTX 1050 GPU, which was also used for data preprocessing and augmentation before the trained model was deployed in the mobile application.
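A toy illustration of the Hungarian (optimal assignment) step using SciPy; the cost matrix values are invented for demonstration, whereas in DETR each entry would combine the classification, L1 box and GIoU terms of Eq. (12).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Invented matching costs: rows are predictions, columns are ground-truth objects.
cost = np.array([
    [0.2, 1.5, 0.9],
    [1.1, 0.3, 1.4],
    [0.8, 1.2, 0.4],
])

pred_idx, gt_idx = linear_sum_assignment(cost)  # minimum-cost one-to-one assignment
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))  # e.g. [(0, 0), (1, 1), (2, 2)]
```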

Figure 3 below illustrates the comparative classification results obtained from three advanced detection algorithms, Faster R-CNN, YOLOv8 and DETR. The first column represents the original images used as input for detection, while the second column displays the corresponding augmented images after applying preprocessing techniques such as random cropping, rotation, and brightness adjustments. The subsequent columns highlight the classified results generated by each algorithm. This visual representation highlights the strength and limitations of each model, including their accuracy and ability to detect objects in challenging conditions, such as varying lighting, occlusions and distortions introduced during augmentation.

FIGURE 3. Object detection and recognition using detection models.

The first row of Figure 3 displays the outputs of Faster R-CNN, with column (a) showing the original and augmented images and column (b) presenting the detection results. The second row contains the outputs of YOLOv8, while the third row shows the results from DETR. The output images also include the confidence scores of each algorithm.

The Faster R-CNN classification algorithm (c) introduces a Region Proposal Network (RPN) to generate region proposals and then applies the Fast R-CNN detector for object detection.

Class prediction probability:\begin{equation*} P\left ({{ class_{a} }}\right )=\frac {e^{s_{a}}}{\sum \nolimits _{j=1}^{C} e^{s_{j}}}, \tag {13}\end{equation*} where,

  • $s_{a}$ is the score (or logit) for class a.

  • C is the number of classes.

  • $e^{s_{a}}$ is the exponential of the score (or logit) $s_{a}$ for class a.

  • $\sum _{j=1}^{C} e^{s_{j}}$ is the sum of the exponentials of the scores for all C classes.

  • $P(class_{a})$ is the predicted probability of class a.

\begin{equation*} Box\ Loss_{Faster\ R\text{-}CNN}=Smooth\ L1\ Loss (X_{true}, Y_{pred}), \tag {14}\end{equation*} where,

  • The Smooth L1 loss makes the model robust to outliers by reducing sensitivity to large coordinate errors.

  • Xtrue are the true bounding box coordinates (the ground truth).

  • Ypred are the predicted bounding box coordinates.

The YOLOv8 algorithm (d) is a real-time object detection system that frames object detection as a single regression problem, mapping directly from image pixels to bounding box coordinates and class probabilities.

For each grid cell, YOLOv8 computes the bounding box regression loss:\begin{equation*} \lambda _{box}=1-IoU+\frac {\rho ^{2}\left ({{ b,b^{gt} }}\right )}{c^{2}}+ \alpha \nu , \tag {15}\end{equation*} where,

  • $\lambda _{box}$ is the weight or contribution of the box loss.

  • IoU is an intersection over union between the predicted b and ground truth box bgt. It measures the overlap between the predicted and true boxes.

  • $\rho (b,b^{gt})$ is the distance between the centers of the predicted bounding box b and the ground-truth box $b^{gt}$.

  • $\rho ^{2}(b,b^{gt})$ is the squared distance between the box centers.

  • c is the diagonal length of the smallest enclosing box.

  • $\nu $ is the aspect ratio consistency term.

  • $\alpha $ is the balancing factor.

Each class in a bounding box receives its final score by multiplying its objectness score with its class probability.

The comparison highlights DETR's superior performance in accurately identifying and classifying objects, particularly in complex scenarios, as evidenced by its ability to preserve details in augmented images. YOLOv8 and Faster R-CNN also exhibit robust detection capabilities but show slight variations in handling certain features, reflecting their respective trade-offs between accuracy and computational efficiency. This visualization underscores the importance of integrating multiple detection algorithms for a comprehensive evaluation and emphasizes the role of augmentation in enhancing model robustness for real-world applications.

F. Performance Metrics Results

The evaluation of the proposed system is based on standard performance metrics, including accuracy, precision, recall, F1-score and Intersection over Union (IoU). These metrics are critical for understanding the effectiveness and reliability of the object detection algorithms used in the system: YOLOv8, Faster R-CNN and DETR. The model was trained on a dataset of 2000 images, split 75% for training and 25% for testing, and evaluated over 20 epochs; the results highlight the comparative strengths of the algorithms in detecting and classifying objects under varying conditions. DETR demonstrates exceptional performance, achieving a confidence score of 99%, a precision of 98% and an IoU score exceeding 95%, highlighting its ability to handle complex and diverse scenarios.

The following subsections provide a detailed analysis of each performance metric, supported by graphical visualizations and tables to illustrate the results comprehensively.

The distribution of confidence scores for the DETR, YOLOv8 and Faster R-CNN detection models appears in Figure 4. Confidence scores reflect the reliability of each model's object detection predictions. The DETR model demonstrates outstanding performance, achieving a median confidence score of 99%; its narrow interquartile range (IQR) shows that predictions remain consistent across the dataset without major confidence variations. Faster R-CNN produces confidence scores around 0.90 on average, but its IQR is wider than DETR's, indicating that its predictions are more sensitive to object types and environmental conditions. YOLOv8 reaches a median confidence score of 92%, which surpasses Faster R-CNN yet remains below DETR; its IQR is narrower than Faster R-CNN's, demonstrating more consistent prediction outcomes.

  • DETR outperforms the other models in terms of median confidence and consistency, making it the most reliable choice for this application.

  • YOLOv8 balances performance and stability, while Faster R-CNN, despite its variability, still performs well.

FIGURE 4. Confidence score distribution of detection algorithms.

The Intersection over Union (IoU) performance of the three object detection models across twenty training epochs is shown in Figure 5. The DETR model maintains the highest IoU values throughout training, ending with a final IoU exceeding 0.95; YOLOv8 reaches an IoU of around 0.85 by the 20th epoch, while Faster R-CNN plateaus at a lower value of approximately 0.80. These results show that DETR steadily improves its bounding box prediction accuracy during training, making it the strongest model for object identification tasks. YOLOv8 offers a favorable trade-off between speed and accuracy over the epochs, whereas the lower detection quality of Faster R-CNN may require more refined training datasets or further optimization. The analysis demonstrates that DETR delivers superior performance when precision requirements are high.

FIGURE 5. Intersection over Union (IoU) vs. epochs for object detection algorithms.

Figure 6 illustrates the time-series processing speed (measured in ms/frame) of the three object detection models. Faster R-CNN remains above 65 ms/frame, the highest per-frame processing time of the three; its stability nevertheless makes it suitable for applications that require precise results rather than immediate execution. YOLOv8 maintains moderate processing times, averaging around 50 ms/frame, a balance that suits real-time applications with reasonable accuracy. DETR shows the lowest per-frame processing time, consistently near 40 ms/frame, and combines this with high accuracy, making it well suited for precision-critical tasks. Where DETR's unoptimized processing speed lags slightly behind YOLOv8, its superior confidence scores (99%) and precision (98%) still make it more suitable for reliable obstacle detection in critical real-time scenarios, such as navigation for visually impaired users. To address the processing speed limitation, we optimized DETR using TensorFlow Lite, achieving a processing time of 40 ms/frame, which is within the acceptable range for real-time applications.
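Per-frame latency of this kind can be estimated with a simple timing loop, as in the sketch below; `detect` and `frames` are placeholders for any detector call and a list of captured frames, not names from the paper's implementation.

```python
import time

def average_latency_ms(detect, frames):
    """Return the mean processing time per frame in milliseconds."""
    start = time.perf_counter()
    for frame in frames:
        detect(frame)                      # run one forward pass of the detector
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(frames)
```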

FIGURE 6. Time-series plot of processing speed (ms/frame) for the three detection algorithms.

Figure 7 illustrates the accuracy achieved by each detection model: DETR achieved the highest accuracy at 98%, Faster R-CNN achieved 90% and YOLOv8 achieved 92%. DETR was selected for this model due to its superior performance. Table 3 gives the precision, recall and F1-score results of the three models.

TABLE 3 Precision, Recall and F1-Score of the detection models used.
FIGURE 7. Accuracy performance of the detection algorithms evaluated in the obstacle detection model.

The equations used by the model to measure accuracy are:
\begin{equation*} Precision= \frac {True\ Positive\ (TP)}{True\ Positive\ (TP)+False\ Positive\ (FP)}, \tag {16}\end{equation*}
Precision measures the proportion of correct detections among all positive predictions and delivers a unified accuracy assessment of model performance across every object category.
\begin{equation*} Recall=\frac {True\ Positive\ (TP)}{True\ Positive\ (TP)+ False\ Negative\ (FN)}, \tag {17}\end{equation*}
Recall determines the percentage of correct obstacle detections among all actual obstacles, i.e., true positives plus false negatives.
\begin{equation*} F1\ score=2 \times \frac {Precision \times Recall}{Precision+Recall}, \tag {18}\end{equation*}
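The short sketch below evaluates Eqs. (16)-(18) from raw detection counts; the TP/FP/FN numbers in the example call are illustrative, not the study's measured values.

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall and F1-score from detection counts (Eqs. 16-18)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example with made-up counts.
print(detection_metrics(tp=98, fp=2, fn=3))
```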

The chart legend shows both the algorithms and their associated performance metrics to help viewers differentiate between the lines.

The heatmap in Figure 8 shows how well the DETR model predicts each class, with darker cells indicating stronger performance; the diagonal dominance indicates good classification accuracy. It provides a detailed summary of the model's performance across all eighty classes, which is essential for multi-class classification problems.

FIGURE 8. Confusion matrix visualization of the proposed model.

Table 4 presents the conducted results, which indicate that DETR delivers superior real-time accuracy among the evaluated models; the proposed detection model therefore uses DETR as its foundation.

TABLE 4 Comparison based on two categories.

Table 5 shows the accuracy reported in previous research alongside the accuracy of the proposed model. The DETR model in the proposed study achieves slightly higher precision and F1-score than previous work, demonstrating improved detection accuracy, and its improved recall indicates a superior precision-recall trade-off. The algorithms evaluated in the proposed study perform better than existing work, indicating progress in both object detection methods and algorithm optimization approaches. Table 5 also presents an extensive accuracy comparison of the evaluated algorithms relative to the existing work reported in [40]. The proposed study demonstrates the effectiveness of the applied optimization methods, which can help users select appropriate algorithms for their specific applications; continued research should advance existing algorithms while making them more precise and faster in operation.

TABLE 5 Accuracy comparison with existing work.

G. Discussion

This study evaluates and compares the performance of three state-of-the-art object detection models, DETR, Faster R-CNN and YOLOv8, in the context of obstacle detection and recognition. The analysis emphasizes two primary metrics, accuracy and confidence score, which are critical for assessing the reliability and robustness of these models in real-world applications.

1) Accuracy

The study highlights significant differences in the accuracy of the three models, reflecting their strengths and areas of application:

  • DETR (Detection Transformer) emerged as the most accurate model, achieving a precision of 98%, recall of 97% and an F1-score of 98%. Its ability to capture fine-grained details and model object relations effectively positions it as the best choice for scenarios requiring high precision and recall. Despite a slightly slower inference time compared to YOLOv8, its superior accuracy makes it highly dependable for complex obstacle detection tasks.

  • Faster R-CNN demonstrates robust performance, achieving a precision of 94%, recall of 91% and an F1-score of 93%. Its robust architecture ensures accurate detections, particularly in cases involving small or overlapping objects. However, its slower inference time limits its applicability in real-time scenarios, making it more suitable for applications where accuracy is prioritized over speed, such as medical imaging or controlled environments.

  • YOLOv8 delivered moderate results, with a precision of 91%, recall of 88% and an F1-score of 89%. While its accuracy is lower than that of DETR and Faster R-CNN, it compensates with faster detection times, making it ideal for real-time applications where quick responses are critical.

2) Confidence Score

The confidence score plays a pivotal role in demonstrating the reliability of obstacle detection systems, particularly in reducing false positives and false negatives.

  • DETR consistently achieved the highest confidence score of 98%, demonstrating its reliability and robustness. This high level of confidence ensures minimal errors, making DETR suitable for high-stakes applications such as autonomous driving or assistive technologies for visually impaired individuals.

  • Faster R-CNN followed with a confidence score of 94%, reflecting its strong ability to accurately detect and classify objects. This reliability makes it an excellent choice for applications where detection accuracy is critical, even at the expense of slower processing speeds.

  • YOLOv8 achieved a confidence score of 89%, indicating moderate reliability. While this score is lower than that of DETR and Faster R-CNN, YOLOv8 remains a viable option for scenarios where faster detection times outweigh the need for the highest confidence levels, such as low-resource environments or real-time obstacle avoidance systems.

3) Implications and Application Suitability

The results underscore the importance of aligning model selection with application requirements:

DETR, with its unmatched accuracy and confidence, is well-suited for applications demanding precise obstacle detection and recognition, such as autonomous systems and healthcare diagnostics. However, its slower inference time may limit its usability in time-sensitive environments.

Faster R-CNN strikes a balance between high accuracy and reliability, making it a strong candidate for tasks where precise detection is prioritized, such as detecting small or occluded objects in cluttered environments.

YOLOv8, while delivering moderate accuracy and confidence, excels in scenarios requiring real-time performance, such as navigation assistance for visually impaired individuals or dynamic obstacle avoidance in robotics.

4) Limitations and Future Directions

Despite the promising results, certain limitations are identified in this study:

  • Dataset Bias: The dataset used may contain inherent biases or limitations in diversity which could influence results. Future studies should consider using larger, more diverse datasets to improve generalization.

  • Environmental Constraints: The simulations were conducted under controlled conditions, and the models' real-world performance may vary due to factors such as occlusions, varying lighting conditions and environmental dynamics.

  • Computational Requirements: The reliance on specific hardware configurations, including high-performance GPUs, may limit the feasibility of deploying certain models in low-resource environments.

  • Model-Specific Optimization: Additional fine-tuning of hyperparameters specific to each model could further improve their performance and provide a more nuanced comparison.

SECTION IV.

Conclusion

This study provides valuable insights into the performance of DETR, Faster R-CNN and YOLOv8 for obstacle detection and recognition. DETR leads in accuracy and confidence, making it the most reliable option for high-stakes applications. Faster R-CNN offers a strong balance between accuracy and reliability, while YOLOv8 stands out for its speed and efficiency in real-time applications.

Future work should focus on addressing dataset biases, incorporating real-world environmental factors and optimizing models for specific deployment scenarios to enhance their robustness and generalization.

Data Availability

The primary data used to support the findings of this study is available and will be provided on request.

Conflicts of Interest

The authors have disclosed no potential conflicts of interest concerning the authorship and/or publication of this article.

Authors’ Contributions

Sunnia Ikram conceptualized the research, designed the methodology, conducted the experiments and wrote the initial manuscript draft; Imran Sarwar Bajwa supervised the project, provided critical insights for model development, and contributed to the manuscript review and editing; Sujan Gyawali has validated the model; Amna Ikram assisted in data collection; and Najah revised and edited the manuscript.

ACKNOWLEDGMENT

No AI tool was used to prepare this manuscript. All figures in the article were created by the authors and can be supplied in editable format on request.
