PerceptGuide: A Perception Driven Assistive Mobility Aid Based on Self-Attention and Multi-Scale Feature Fusion

The paper introduces PerceptGuide, a novel wearable aid that helps visually impaired individuals perceive the scene around them. It is designed as a lightweight, wearable chest rig bag that incorporates a monocular camera, ultrasonic sensors, vibration motors, and a mono-earphone, powered by an embedded Nvidia Jetson development board. The system provides directional obstacle alerts through the vibration motors, allowing users to avoid obstacles on their path. A user-friendly push-button enables the user to inquire about the scene in front of them. The scene details are conveyed through a novel scene understanding approach that combines multi-scale feature fusion, self-attention models, and a multilayer GRU (Gated Recurrent Unit) architecture on a ResNet50 backbone. The proposed system generates coherent and descriptive captions by capturing image features at different scales, enhancing the quality and contextual understanding of the scene details. The self-attention in both the encoder (ResNet50 + feature fusion model) and the decoder (multilayer GRU) effectively captures long-range dependencies and attends to relevant image regions. Quantitative evaluations conducted on the MSCOCO and Flickr8k datasets show the effectiveness of the model, with improved scores of BLEU-1 67.7, ROUGE-L 47.6, METEOR 22.7, and CIDEr 67.4. The PerceptGuide system exhibits strong real-time performance, generating audible captions in just 1.5 to 2 seconds. This rapid response time significantly aids visually impaired individuals in understanding the scenes around them. The qualitative evaluation of the aid emphasizes its real-time performance, demonstrating the generation of context-aware, semantically meaningful captions. This validates its potential as a wearable assistive aid for visually impaired people, with the added advantages of low power consumption, compactness, and a lightweight design.


I. INTRODUCTION
According to a report released by the World Health Organization (WHO) in October 2022 [1], approximately 2.2 billion individuals worldwide are affected by either near or distant vision difficulties. Between 1990 and 2020, there was a significant surge in the number of individuals affected by both blindness and moderate to severe vision impairment: the prevalence increased by 51% for blindness and by 92% for moderate and severe vision impairment [2]. The global population aged 65 years or older is projected to double between 2020 and 2050, reaching approximately 2 billion people. This significant increase in the aging population will have profound implications for age-related conditions, including age-related blindness. The rise in age-related diseases associated with vision loss will pose critical challenges in the coming decades [3].
One of the significant issues is the lack of accessibility in various environments such as public spaces, transportation, and visual media. The limited access to information, infrastructure, and services poses a challenge to independent mobility. Vision impairment significantly impacts the quality of life of adults, resulting in reduced workforce participation and productivity and increased rates of depression and anxiety [4]. In older adults, it leads to social isolation, difficulty in walking, higher risks of falls and fractures, and increased reliance on nursing or care homes. Economically, vision impairment imposes a substantial global burden, with an estimated annual productivity loss of US$ 411 billion [5].
Blind and visually impaired individuals encounter daily challenges in perceiving the objects around them and finding an obstacle-free navigation path. The assistive solutions developed to date are useful for navigation and obstacle detection. The most commonly used mobility aids are the white cane, guide dogs, and smartphone applications, but these solutions have limitations. White canes are effective for detecting ground-level obstacles but offer limited spatial awareness of higher-level obstacles and may not detect small objects or people in close proximity, increasing the risk of collisions or accidents. Guide dogs need special training and attention, which may not be feasible for the average person. Despite rapid advancements in smartphone applications such as voice assistance and navigation maps for visually impaired people [6], [7], achieving complete and optimal utilization of these technologies still poses a challenge.
Significant advancements have been made in the field of wearable blind assistive devices. One notable example is a wearable aid that utilizes an RGB camera and a convolutional neural network to calculate the plane, providing indoor object detection and safe walking route assistance for visually impaired individuals [8]. Various other wearable mobility aids have also been developed to cater to the needs of blind or visually impaired individuals. For instance, a system using sensor data fusion and fuzzy logic-based decision-making offers safety orientation assistance [9]. Additionally, the StereoPilot, a head-mounted target location system, incorporates an RGB-D camera to capture and process 3D spatial information of the surroundings, providing users with intuitive navigation cues. These cues are further enhanced through the integration of spatial audio rendering (SAR) technology, allowing them to be transmitted as 3D sound, which effectively localizes the target [10].
Situational awareness assistance in indoor environments has been provided by processing RGB-D camera-based depth map information to generate verbal or haptic feedback for multi-target recognition, face recognition, text reading, and navigation [11]. A cloud-based wearable assistive aid was built using a convolutional transformer with weak-attention suppression, and a novel approach to navigation based on speech recognition was presented [12]. It provided traffic light detection, obstacle avoidance, payment, and navigation assistance.
Despite the notable advancements in assistive technologies and wearable assistive aids, there has been significant resistance to their adoption among blind individuals. This resistance stems from the fact that many blind individuals are accustomed to the traditional white cane. Scene perception plays a crucial role in their ability to navigate and to interact with others based on sensory cues. When confronted with complex or unfamiliar environments, blind or visually impaired people find navigation challenging.
Advancements in scene understanding [13], [14], [15], [16] and image captioning [17], [18], [19], [20], [21] are driving the evolution of novel assistive solutions, enabling visually impaired individuals to navigate the world more effectively. The pursuit of improved quality of life for blind and visually impaired individuals fuels the increasing research and development of enhanced assistive devices. These innovative devices play a crucial role in granting them access to audible information about their surroundings, empowering them with independence and reducing their reliance on caretakers.
Image captioning has emerged as a prominent research area in the domains of computer vision and natural language processing. It focuses on generating meaningful and descriptive captions for images and has attracted considerable interest from researchers. It serves as a bridge between visual understanding and textual comprehension, enabling applications such as content indexing, image retrieval, and accessibility enhancement for the visually impaired. To tackle this challenging task, researchers have proposed various approaches that leverage deep learning and sequence modeling techniques [22]. Extracting informative visual features from images and combining them with a language model capable of generating accurate and contextually meaningful captions is crucial. In this regard, feature fusion models have been introduced to capture multi-scale visual information effectively. These models aim to combine features extracted from different levels of the image hierarchy, allowing for a more comprehensive understanding of visual content. ResNet50, a widely used convolutional neural network architecture, has shown promising results in extracting discriminative features from images [23].
Attention mechanisms have been instrumental in the progress of image captioning, as they allow models to dynamically concentrate on essential regions of the image, leading to improved caption quality and relevance [24]. Notably, Anderson et al. introduced a two-stage attention mechanism that incorporates both bottom-up and top-down attention [25]. This method enhances the model's capability to grasp both local and global context, resulting in a more comprehensive image understanding. Additionally, Lu et al. proposed a visual sentinel mechanism that assesses the importance and relevance of visual features for caption generation [26]. These attention mechanisms have significantly contributed to the advancements in image captioning by enhancing the model's ability to focus on relevant image regions. Evaluation metrics are crucial for assessing the quality of generated captions. Commonly employed metrics in this context include BLEU, METEOR, ROUGE, and CIDEr, which compare the generated captions against human-generated reference captions to measure their similarity and overall quality [27].
This research paper presents a wearable assistive chest rig bag that serves as a complementary aid, enabling visually impaired individuals to gain an understanding of their surroundings. In this paper, we demonstrate the design and functional evaluation of this innovative wearable aid. It includes a novel approach to image captioning that combines a multi-scale feature fusion model with the ResNet50 architecture for effective extraction of multi-scale visual features. Additionally, we leverage self-attention models in both the encoder (ResNet50 + feature fusion) and decoder (multilayer GRU) components to capture long-range dependencies and enhance the accuracy of both the image features and the generated textual features. The proposed aid aims to generate fluent, coherent, and contextually aware captions, which are then converted into auditory feedback for visually impaired people, by effectively integrating visual and textual information.
The main objective of this research is to create a user-friendly and efficient wearable device specifically designed for visually impaired individuals. The device processes scenes in real time without requiring excessive computational power, ensuring long-lasting battery operation. By using advanced computer vision methods and real-time processing, the proposed system delivers quick and precise auditory feedback to visually impaired users. The aim is to provide visually impaired individuals with an assistive tool that can effectively interpret their surroundings, offering relevant auditory information to help them navigate and understand their environment with ease. The PerceptGuide was experimentally evaluated on two popular datasets, Flickr8k and MSCOCO, using the standard evaluation metrics BLEU, ROUGE-L, METEOR, and CIDEr. The results demonstrate the effectiveness of our method in generating accurate and semantically meaningful captions. Moreover, the aid is lightweight and easy to wear, making it comfortable and convenient for everyday mobility assistance to visually impaired people.

II. RELATED WORK
Assistive solutions for visually impaired people have evolved significantly to help them navigate independently. Navigation assistance involves finding obstacles in indoor and outdoor environments and generating feedback to avoid them. An indoor object detection system was developed using a deep convolutional neural network (CNN) framework, the RetinaNet model [13]. The system was evaluated using ResNet, DenseNet, and VGGNet backbones to enhance detection accuracy and processing speed, achieving 84.61% mAP (mean average precision).
Researchers have conducted significant studies in the field of image captioning using self-attention, driven by its potential to improve the accuracy and contextuality of image descriptions, benefiting areas such as visual understanding, human-computer interaction, and content-based image retrieval. In recent years, there has been significant research in image captioning, with several notable approaches proposed. Vinyals et al. introduced a neural image caption generator, commonly known as ''Show and tell,'' which utilized neural networks to generate captions for images and employed the MS COCO dataset [28]. Ryan et al. presented an encoder-decoder pipeline that effectively learns a shared multimodal embedding space for images and text. This enables the system to rank images and sentences while also generating novel descriptions from scratch [29]. To enhance the image captioning process, Xu et al. proposed a method called ''Show, attend and tell,'' which incorporated a visual attention mechanism. By attending to different regions of an image, the model generated captions that were more aligned with the relevant visual features. The study also employed the MS COCO dataset for training and evaluation [24].
Another approach to image captioning was presented by Karpathy and Fei-Fei, who focused on aligning visual and semantic information for generating image descriptions. Their method, based on deep visual-semantic alignments, utilized the Flickr8k dataset and aimed to bridge the gap between the visual and textual domains in image captioning [30]. To collect and evaluate image captions effectively, Chen et al. introduced the Microsoft COCO captions dataset, accompanied by a data collection and evaluation server [31].
Addressing the challenge of generating diverse captions for images containing multiple objects, Wu et al. proposed a method specifically designed for captioning images with diverse objects. By leveraging the MS COCO dataset, the model aimed to generate a variety of captions that accurately described the different objects present in the image [32]. Chen et al. explored the concept of order-embeddings of images and language. By embedding images and language in an ordered representation, their method aimed to enhance the understanding and generation of image captions; the study employed the MS COCO dataset for experimentation [33]. Moving beyond static images, Yao et al. focused on describing videos by exploiting their temporal structure. Leveraging the MPII Movie Description dataset, their method generated descriptions for videos by considering the temporal dependencies and structure inherent in video data [34].
Hendricks et al. addressed the challenge of generating captions for novel object categories without paired training data. Their approach, known as deep compositional captioning, aimed to generate captions for objects that were not present in the training set; the Visual Genome dataset was used to evaluate its effectiveness [35]. Gu et al. introduced the stack-captioning approach, which employed a coarse-to-fine learning strategy for image captioning. By iteratively refining the generated captions, the model aimed to improve their quality and coherence [36]. Steven et al. introduced a reinforcement learning framework for training image captioning models using self-critical sequence training, which optimizes caption generation through reinforcement learning [37]. Justin et al. introduced a method for generating image captions from scene graphs, which encode both the visual and structural information of a scene [38]. Li et al. introduced the ''Person-Object Interaction Network'' for image captioning, which focused on capturing the interactions between people and objects in images. The model incorporated object recognition and relationship modeling to generate more informative and contextually rich captions [39].
Chen et al. proposed the ''Image Caption with Global-Local Attention'' model, which combined global and local attention mechanisms to capture both the overall context and fine-grained details in images. This approach enhanced the descriptive quality and relevance of the generated captions [40]. Zhang et al. introduced the ''Conceptual Captioning'' approach, which utilized external knowledge from textual resources to generate captions. The model employed a graph convolutional network to encode the knowledge graph and incorporated it into the captioning process, resulting in more informed and detailed captions [41]. Lu et al. presented the ''Neural Baby Talk'' model, which focused on generating fine-grained descriptions for objects in images [42]. The approach employed an LSTM network and an attention mechanism to capture object-level details, leading to more specific and precise captions [43]. Li et al. proposed a cross-modal retrieval framework called ''Stacked Cross Attention for Image-Text Matching'' (SCAN). Although primarily designed for image-text retrieval, this model can also be applied to image caption generation by aligning image and text representations [44]. Li, Jiuxiang, et al. introduced the ''Contextual Attention LSTM'' (CALSTM) model, which utilized a contextual attention mechanism to selectively attend to relevant regions of the image while generating captions. This approach enhanced the generation of detailed and contextually coherent captions [45].

Chen et al. introduced the ''Groupcap'' model, which focuses on group-based image captioning. This approach incorporates structured relevance and diversity constraints to generate captions that are not only relevant to the image group but also diverse within the group. By considering the interplay between relevance and diversity, the model produces captions that capture both the shared context and the distinctive characteristics of the image group [46]. Yao et al. proposed a method to enhance image captioning by incorporating attributes. The model leverages attribute detectors to extract visual attributes from images, which are then used to enrich the captioning process. By explicitly considering attributes, the model generates more detailed and informative captions that capture fine-grained visual characteristics [47]. Aneja et al. introduced a convolutional image captioning model that combines convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The model utilizes a hierarchical CNN to extract spatial features from images, which are then fed into an RNN for generating captions [48]. Pan et al. presented a study on automatic image captioning, using the visual content of images to generate corresponding textual descriptions [49]. An unsupervised approach to image captioning with unlabeled images [50], [51] utilizes visual-semantic embedding and a language model, which learns a joint representation space for images and captions. The approach achieves competitive results compared to supervised methods, generating captions without relying on paired training data. Haque et al. presented an approach to mimic human captioning using attention and object features [52]. A different approach that uses geometrical semantics to generate captions was introduced in [53].
While the existing literature on image captioning has explored various aspects such as attention mechanisms, semantic information, and compositional structures, a common limitation is the lack of effective feature fusion models that capture multi-scale visual information comprehensively. Additionally, some existing approaches apply attention mechanisms only on the encoder or only on the decoder side, but not both. Our proposed model integrates feature fusion, self-attention models, and a multilayer GRU architecture in both the encoder and decoder, allowing us to capture long-range dependencies and improve the accuracy of the image and textual features. This holistic approach, combining the feature fusion model, the ResNet50 architecture, and self-attention mechanisms, enables the model to capture multi-scale visual features, attend to salient regions, and generate fluent, coherent, contextually aware, and semantically meaningful captions, filling the gap in the existing literature. A further objective of this research is to design a low-computation, battery-operated embedded system that processes the scene in real time and generates fast audible feedback for the visually impaired user. The experimental evaluation on the Flickr8k and MSCOCO datasets demonstrates the superiority of our approach, as evidenced by improved BLEU and ROUGE scores, validating its effectiveness.

III. METHODOLOGY
The PerceptGuide is a novel real-time embedded system that provides obstacle detection and scene recognition to visually impaired individuals. The proposed aid is a wearable chest rig bag equipped with essential components, including an embedded Jetson Nano unit, a monocular camera, an ultrasonic range finder, vibration motors, a mono-earphone, a push button, and a rechargeable battery. Figure 1 illustrates the system-level block diagram with the input sensors, output actuators, and the embedded Jetson Nano processing unit.
The proposed system is powered by a 5 V, 3 A rechargeable battery. It utilizes the Maxbotix MB1040-000 LV-EZ4 sensor for obstacle detection using the sonic time-of-flight concept. Two vibration motors, placed at the bottom left and bottom right of the aid, generate directional obstacle vibration alerts to the user. It is equipped with a 2-megapixel OV2710 monocular camera with a 140° field of view (FOV) to capture the scene in the forward direction. The captured scene is converted to a caption and then presented to the user as an audible message using a text-to-speech module.
The compact and portable setup of the PerceptGuide aid is depicted in Figure 2. Figure 2(a) presents a real-life scenario where a blind-folded volunteer is wearing the PerceptGuide, and Figures 2(b) and (c) show the placement of the sensors, actuators, embedded processing unit, and battery bank. As a hands-free aid, it can effectively complement the traditional white cane. The proposed aid is designed as a chest rig bag that is worn beneath the chest, as depicted in Figure 2(b). The design process placed significant emphasis on anthropometric factors to ensure that the aid is well-suited for visually impaired individuals of various sizes, ages, and genders. The wearable features adjustable shoulder straps that enable users to customize the height, as well as a back strap that allows the upper waist strap to be tightened or loosened. This design approach ensures that the aid can effectively accommodate users of varying sizes, ages, and genders, making it a versatile and inclusive solution. The use of a mono-earphone ensures no obstruction to the user's auditory sensory system. Its ergonomic design ensures ease of use, allowing users to integrate the PerceptGuide into their daily routines. It is a convenient and user-friendly solution to enhance the mobility of visually impaired people.

A. SAFETY-INTEGRATED HARDWARE DEVELOPMENT FOR THE PERCEPTGUIDE
The PerceptGuide aid is equipped with a low-power embedded computer, the Jetson Nano board. This board features a quad-core ARM A57 CPU clocked at 1.43 GHz, 4 GB of RAM, and a 128-core GPU. The aid is powered by a rechargeable portable battery pack. The PerceptGuide's power requirement is modest, with the CPU consuming 364 mW (72 mA). The USB-connected camera and mono earphone consume 2.3 W (448 mA), which is reasonable considering their functionalities. The power requirement of the PerceptGuide is well-managed, balancing functionality and energy efficiency. The efficient power consumption of the PerceptGuide is crucial for ensuring a longer battery life, making it practical for use by visually impaired individuals.
The hardware integration of the PerceptGuide prioritizes safety precautions to ensure the well-being of users.
Throughout the development process, safety standards and guidelines for assistive devices have been closely followed to maintain safety and reliability in the PerceptGuide aid. Every electronic component is carefully insulated and securely harnessed, guaranteeing no human contact with any exposed electrical parts. The battery circuits are carefully enclosed and shielded, and proper isolation measures have been implemented throughout the design to prevent any potential risk to users. The device is ergonomically designed to be comfortable and lightweight, minimizing physical strain or discomfort during use.

B. FASTER SCENE PERCEPTION WITH OBSTACLE AVOIDANCE MODEL
The PerceptGuide introduces a novel algorithm called Faster Scene Perception with Obstacle Avoidance. This algorithm governs the operation of the multisensory system, facilitating efficient scene perception and obstacle avoidance. The PerceptGuide system is divided into two parts: i) faster scene perception with audio feedback, and ii) way finding with obstacle detection and directional vibratory feedback.
The Faster Scene Perception Architecture (FSPA) is a novel module combining CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) components. It consists of two submodules, namely the Encoder and the Decoder. Figure 3 details how the captured image is processed by the FSPA module. The encoder submodule extracts the image features, while the decoder block is fed with image captions for training. The FSPA module generates a caption that describes the details of the scene.
Figure 4 details the FSPA (Faster Scene Perception Architecture) scene understanding module. The novelty of the FSPA module lies in its three components: i) multi-scale feature fusion to capture dense features, ii) self-attention to capture important contextual relationships between different elements of the input image, and iii) a three-layer GRU (Gated Recurrent Unit) decoder with a sequential self-attention technique to generate meaningful captions.

1) DATASET DETAILS
MSCOCO 2017 and Flickr8k are the two datasets used to train and test the model. The MSCOCO dataset consists of over 330,000 images with 80 object categories and 5 captions per image. It serves as a popular benchmark for image captioning due to its diverse content. In contrast, the Flickr8k dataset contains 8000 images with multiple captions, offering a variety of scenes and objects. For training, 6000 images from Flickr8k are used, with 1000 images each for validation and testing. The model is initially designed using Flickr8k and then optimized for MSCOCO, providing a comprehensive evaluation of its performance.
The dataset is pre-processed to ensure compatibility and to optimize the results of the model. The images are first converted to RGB and resized to 281 × 281, then randomly cropped to a fixed size of 224 × 224.

2) ENCODER MODEL
The Encoder model is responsible for understanding and representing the visual content of an image: it takes an input image and transforms it into a feature representation that captures the relevant information. The proposed encoder module, shown in Figure 5, consists of three important sub-modules: a) the backbone ResNet-50 model, b) multi-scale feature fusion, and c) self-attention for image features. The encoder uses ResNet-50, trained on the COCO dataset, as the backbone network. ResNet-50 processes the image through multiple convolutional and pooling layers, extracting hierarchical features at different scales. The feature fusion module then extracts features from five branches, which makes the encoder features more robust. Further, a self-attention module is used to refine the features, allowing the model to selectively attend to relevant image regions and improve the overall representation. After passing through the attention module, the output undergoes a series of operations, including an average pooling layer, a fully connected layer, ReLU activation, and dropout layers. These operations produce the encoder's output, a condensed representation known as the image feature vector or image embedding. This feature vector, along with the text embedding, is subsequently fed into the decoder module to generate the caption.
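Before the encoder sub-modules are described in detail, the resize-and-crop pre-processing described above can be sketched as follows. This is a minimal illustration assuming the standard torchvision transform API; the normalization statistics and the file path are assumptions and are not stated in the paper.

```python
from PIL import Image
from torchvision import transforms

# Pre-processing described above: convert to RGB, resize to 281x281,
# random-crop to 224x224, then tensorize. The normalization values below
# are the common ImageNet statistics (an assumption, not from the paper).
preprocess = transforms.Compose([
    transforms.Resize((281, 281)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")   # path is illustrative
tensor = preprocess(image)                          # shape: (3, 224, 224)
```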

a: BACKBONE ResNet-50 NEURAL NETWORK
The backbone module is initialized with the first 8 layers of the ResNet-50 model, and these layers are fine-tuned by keeping their parameters in the gradient computation during training. These layers include convolutional, batch normalization, ReLU activation, and max pooling layers. The backbone network is illustrated in Figure 6.
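A minimal sketch of this backbone construction, assuming the torchvision ResNet-50 implementation (the paper states the backbone is trained on the COCO dataset; the default torchvision weights are used below only as a stand-in):

```python
import torch
import torch.nn as nn
from torchvision import models

# First 8 top-level blocks of ResNet-50: conv1, bn1, relu, maxpool,
# layer1..layer4 (the average-pool and fully connected head are dropped).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:8])

# Fine-tuning: keep the backbone parameters in the gradient computation.
for param in backbone.parameters():
    param.requires_grad = True

x = torch.randn(1, 3, 224, 224)   # a pre-processed RGB image
features = backbone(x)            # (1, 2048, 7, 7): 32x spatial downsampling
print(features.shape)
```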

b: MULTI-SCALE FEATURE FUSION MODEL
The feature fusion module includes additional branches that perform convolutional operations on the backbone features, as shown in Figure 7. These branches have different numbers of layers and are used to extract different levels of information from the backbone features. Each branch takes the 2048-channel backbone features as input and applies two convolutional layers. The five branches generate 256-, 128-, 64-, 32-, and 16-channel feature maps. These feature maps are resized to match the spatial dimensions of the backbone features using bilinear interpolation and then concatenated with the original backbone features. This allows the model to capture both low-level and high-level features at different spatial resolutions. The feature fusion module enhances the model's representation learning capability, enabling it to extract more comprehensive and discriminative features from the input images, which can ultimately improve the model's performance. The concatenated output has 2544 channels (2544 × H × W) and is fed to the self-attention block.
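A minimal sketch of such a fusion block is given below, assuming 2048-channel backbone features and the five branch widths listed above. The kernel sizes, strides, and activations inside each branch are assumptions, since the paper only states that each branch uses two convolutional layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Sketch of the multi-scale fusion block: five convolutional branches over the
    2048-channel backbone features, resized and concatenated with the original map."""
    def __init__(self, in_channels=2048, branch_channels=(256, 128, 64, 32, 16)):
        super().__init__()
        # Each branch: two conv layers mapping 2048 -> c channels
        # (layer hyperparameters below are assumed for illustration).
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, c, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(c, c, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            for c in branch_channels
        ])

    def forward(self, x):                          # x: (B, 2048, H, W)
        outs = [x]
        for branch in self.branches:
            y = branch(x)
            # Bilinear resize back to the backbone's spatial size.
            y = F.interpolate(y, size=x.shape[-2:], mode="bilinear", align_corners=False)
            outs.append(y)
        return torch.cat(outs, dim=1)              # (B, 2544, H, W)

fused = FeatureFusion()(torch.randn(1, 2048, 7, 7))
print(fused.shape)                                  # torch.Size([1, 2544, 7, 7])
```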

c: ENCODER SELF-ATTENTION MODEL FOR IMAGE FEATURES
The self-attention mechanism in the encoder is a key component for modeling long-range dependencies and capturing contextual information in the image. By applying convolutional operations and computing attention weights, the model can selectively attend to relevant image regions and enhance the representation of important features. The architecture of the self-attention model on image data consists of three convolutional layers (query_conv, key_conv, value_conv) with kernel size 1, which reduce the dimensionality of the input features. Figure 8 details the encoder self-attention. It takes the concatenated image features as input and applies convolutional operations to transform them into query, key, and value tensors. It then computes attention weights based on the similarity between query and key, and uses these weights to attend to different parts of the value tensor. The attended features are combined with the original features using a scaling parameter (γ).
This attention-based fusion allowed the model to effectively combine information from multiple branches and spatial resolutions, resulting in a more comprehensive and discriminative feature representation.The self-attention mechanism plays a crucial role in improving the model's ability to understand and interpret complex visual patterns, leading to enhanced performance.
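A minimal sketch of this encoder self-attention over the 2544-channel fused features is shown below. The channel-reduction factor used in the query and key projections is an assumption; the 1 × 1 convolutions and the learnable scaling parameter gamma follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageSelfAttention(nn.Module):
    """Sketch of the encoder self-attention: query/key/value via 1x1 convolutions,
    spatial attention weights, and a gamma-scaled residual connection."""
    def __init__(self, in_channels=2544, reduction=8):
        super().__init__()
        self.query_conv = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.key_conv   = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))       # learnable scaling parameter

    def forward(self, x):                                # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.query_conv(x).flatten(2).transpose(1, 2)   # (B, HW, C/r)
        k = self.key_conv(x).flatten(2)                      # (B, C/r, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)            # (B, HW, HW)
        v = self.value_conv(x).flatten(2)                     # (B, C, HW)
        out = torch.bmm(v, attn.transpose(1, 2)).view(B, C, H, W)
        return self.gamma * out + x                           # attended features + original

attended = ImageSelfAttention()(torch.randn(1, 2544, 7, 7))
print(attended.shape)
```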

3) DECODER MODEL FOR SCENE CAPTION GENERATION
The output features from the Encoder module are passed to the Decoder module as the initial input for the caption generation process. The decoder structure is illustrated in Figure 9. The Decoder module contains a network responsible for generating sequences. First, the input captions are passed through an embedding layer and combined with the feature tensor. This concatenated input is then fed into a GRU recurrent neural network. The GRU processes the embedded tokens and produces output tokens through a linear layer, with dropout applied beforehand. The GRU also generates a sequence of hidden states, which capture contextual information from the input sequence. Furthermore, the Decoder module incorporates a self-attention mechanism over the GRU's hidden states, involving query, key, and value linear transformations, followed by attention calculation and application steps.

a: GRU (GATED RECURRENT UNIT)
A GRU is a type of recurrent neural network used to capture the sequential pattern of the input text embeddings in the decoder module to generate the scene caption. GRUs incorporate gating mechanisms that allow them to selectively remember or forget information from previous time steps. GRUs are suitable for handling long-range dependencies in sequences while avoiding the vanishing gradient problem. Figure 10 shows the basic structure of a single GRU cell. It consists of two gates: the update gate and the reset gate. The update gate controls how much of the previous hidden state should be retained and how much of the new input should be integrated. The reset gate determines which parts of the previous hidden state should be forgotten, enabling the model to adapt to changing patterns.
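For reference, the standard GRU cell updates corresponding to this description can be written as follows (notation ours; x_t is the input embedding, h_{t-1} the previous hidden state, z_t the update gate, and r_t the reset gate):

z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t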

b: MULTILAYER-GRU FOR SEQUENTIAL CAPTION TRAINING
The GRU generates captions by sequentially creating one word at each time step, considering the context vector, prior hidden state, and previously generated words.To begin, the input token (either a word or character) undergoes transformation through an embedding layer, converting it into a dense vector representation.
The embedded input token is then fed into the GRU cell, where it is processed alongside the previous hidden state to generate a new hidden state, as shown in Figure 11. The hidden state is further passed through a linear layer, commonly referred to as the ''hidden linear layer,'' responsible for dimension transformation. The transformed hidden state becomes the input for the subsequent iteration of the GRU cell, along with the embedded input token of the current time step. This iterative process continues for each time step, with the hidden state and input token being recurrently fed back into the GRU cell. Finally, the output of the last hidden state is forwarded through another linear layer, known as the ''output linear layer,'' which further modifies the dimensions of the hidden state. The resulting output from the output linear layer represents the predicted output token, serving as the model's prediction for the current input.
The 3-layer GRU model effectively captures dependencies between the input tokens and generates sequential outputs based on the learned patterns in the training data.
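The step-by-step decoding described above can be sketched as follows. The vocabulary size, embedding and hidden dimensions, dropout rate, and start-token id are assumptions for illustration, and in the full system the initial hidden state would be derived from the image embedding produced by the encoder.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of the 3-layer GRU decoder: embed token -> GRU -> hidden linear
    layer -> output linear layer over the vocabulary."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.hidden_linear = nn.Linear(hidden_dim, hidden_dim)
        self.output_linear = nn.Linear(hidden_dim, vocab_size)
        self.dropout = nn.Dropout(0.5)

    def step(self, token, hidden):
        x = self.embed(token).unsqueeze(1)                    # (B, 1, E)
        out, hidden = self.gru(x, hidden)                     # (B, 1, H)
        out = torch.relu(self.hidden_linear(out.squeeze(1)))  # dimension transformation
        logits = self.output_linear(self.dropout(out))        # (B, vocab_size)
        return logits, hidden

# Greedy decoding sketch: one word per time step, feeding the prediction back in.
decoder = CaptionDecoder()
hidden = torch.zeros(3, 1, 512)   # placeholder; derived from the image embedding in practice
token = torch.tensor([1])         # assumed <start> token id
generated = []
for _ in range(20):               # assumed maximum caption length
    logits, hidden = decoder.step(token, hidden)
    token = logits.argmax(dim=-1)
    generated.append(token.item())
```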

c: SELF-ATTENTION FOR TEXT EMBEDDINGS
This self-attention module is specifically designed for sequential data, such as text or time-series data. It performs linear transformations on the input data, calculating attention weights to highlight relevant elements. Additionally, it incorporates a gamma parameter for scaling. In the context of this module, the self-attention mechanism is applied to the RNN features, as demonstrated in Figure 12.

It takes the hidden states from the GRU module as input and employs linear transformations to compute the query, key, and value tensors. Attention weights are then calculated from the query and key tensors and applied to the value tensor to obtain the attended RNN features. The self-attention mechanism focuses on important parts of the hidden states, weighting each hidden state according to its relevance within the context. The resulting attention-applied tensor is scaled using the gamma parameter and added to the original GRU hidden state to produce the output. During training, the gamma parameter is learned alongside the other model parameters, enabling the model to adaptively determine the appropriate scaling for the attention-applied tensor. This parameter allows the model to control the impact of the attention mechanism on the final output, effectively adjusting the importance of the attention-applied tensor relative to the original input.

The self-attention module enhances the representation of the hidden states by capturing long-range dependencies and contextual information, leading to improved performance in tasks such as caption generation. The attention output is further processed through a linear layer that maps the hidden states to the vocabulary size, generating the final predicted tokens for caption generation.
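A minimal sketch of this sequential self-attention over the GRU hidden states is given below; the hidden dimension and the scaled dot-product normalization are assumptions, while the gamma-scaled residual follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceSelfAttention(nn.Module):
    """Sketch of the decoder self-attention over GRU hidden states: linear
    query/key/value maps, attention weights, and a gamma-scaled residual."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key   = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, h):                                    # h: (B, T, H)
        q, k, v = self.query(h), self.key(h), self.value(h)
        attn = F.softmax(torch.bmm(q, k.transpose(1, 2)) / (h.size(-1) ** 0.5), dim=-1)
        attended = torch.bmm(attn, v)                        # (B, T, H)
        return self.gamma * attended + h                     # scaled and added to the original states

out = SequenceSelfAttention()(torch.randn(2, 20, 512))
print(out.shape)
```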

C. MODEL IMPLEMENTATION
1) TRAINING OBJECTIVE
The main objective of training the encoder-decoder model is to optimize its parameters so that it generates accurate and meaningful captions for images. This objective is accomplished by minimizing the loss function during the training process. Here, the loss is the categorical cross entropy, which quantifies the dissimilarity between the predicted word probabilities and the true word labels in the captions. The categorical cross entropy encourages the model to assign higher probabilities to the correct words in the captions and penalizes deviations from the ground truth. It is computed using eq. (1):

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{M} y_{i,t} \log(P_{i,t})    (1)

where N is the batch size, M is the maximum sequence length, y_{i,t} represents the true label of the t-th word of the i-th caption, and P_{i,t} denotes the predicted probability of that word. Minimizing this loss function enables the model to learn the statistical patterns and relationships within the captions, thereby improving its ability to generate accurate and contextually appropriate captions for images. The model parameters are optimized using the Adam optimizer, an adaptive optimization algorithm that dynamically adjusts the learning rate of each parameter based on its gradients, combining the advantages of AdaGrad and RMSProp for efficient optimization. The learning rate is set to 3e-4, and the model is trained for 100 epochs on an NVIDIA RTX 3050 GPU for 3 days.
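A minimal sketch of this training objective in PyTorch is given below. The `encoder`, `decoder`, and data `loader` are assumed components (not defined here), and the padding-token id of 0 is an assumption, while the cross-entropy loss, the Adam optimizer, the 3e-4 learning rate, and the 100 epochs follow the description above.

```python
import torch
import torch.nn as nn

def train(encoder, decoder, loader, epochs=100):
    """Captioning objective: categorical cross entropy between the predicted
    word distributions and the ground-truth caption tokens (teacher forcing)."""
    criterion = nn.CrossEntropyLoss(ignore_index=0)          # 0 = assumed padding id
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=3e-4)             # learning rate from the paper

    for _ in range(epochs):                                   # 100 epochs, as reported
        for images, captions in loader:                       # captions: (B, M) token ids
            features = encoder(images)                        # image feature vectors
            logits = decoder(features, captions[:, :-1])      # predict each next word
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             captions[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```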

2) EVALUATION METRICS
Evaluation metrics play a vital role in assessing the quality of machine-generated text. BLEU (Bilingual Evaluation Understudy) measures the similarity between generated and reference text using n-gram matches. The BLEU score ranges from 0 to 1; a score of 1 indicates that the machine-generated translation exactly matches the reference translation.
The BLEU score is calculated using eq. (2):

BLEU = brevity\_penalty \cdot \exp\left( \sum_{n=1}^{N} w_n \log(precision_n) \right)    (2)

where precision_n is the n-gram precision, i.e., the overlap between the machine-generated translation and the reference translation in terms of n-grams (n = 1, 2, 3, and 4); N is the maximum n-gram order considered (usually 4); w_n are the n-gram weights; and the brevity penalty adjusts the BLEU score according to the length of the machine-generated translation relative to the reference translation.
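The paper computes these scores with the publicly available MSCOCO evaluation tools; purely as an illustration of the metric itself, cumulative BLEU-1 to BLEU-4 can be computed with NLTK as in the following sketch (the example sentences are invented):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["a", "man", "rides", "a", "motorcycle", "down", "the", "street"]]
candidate = ["a", "man", "riding", "a", "motorcycle", "on", "a", "street"]

# Cumulative BLEU-1..BLEU-4; smoothing avoids zero scores when a higher-order
# n-gram has no match in such short sentences.
smooth = SmoothingFunction().method1
weights = [(1.0, 0, 0, 0), (0.5, 0.5, 0, 0), (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
for n, w in enumerate(weights, start=1):
    score = sentence_bleu(reference, candidate, weights=w, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```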
The METEOR (Metric for Evaluation of Translation with Explicit Ordering) score is employed to assess the quality of automatically produced translations. Compared to the BLEU score, which considers only n-gram overlap, METEOR uses a more sophisticated method: it accounts for word choice, word order, synonyms, and similarity to the reference translation(s), and it employs a stemming algorithm, which is one of its distinctive features. The METEOR score is based on a weighted harmonic mean of unigram precision and recall, computed as in eq. (3):

F_{mean} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}    (3)

where P is the precision score, R is the recall score, and α is a tunable parameter that controls the relative impact of precision and recall on the overall score (typically set to 0.5).
ROUGE-L is a metric used to measure the similarity between summaries by comparing the longest common subsequence (LCS) found in both. It uses an LCS-based F-measure to assess two summaries, X of length m (the gold-standard summary) and Y of length n (the generated summary sentence). When X and Y are identical, ROUGE-L equals 1; if there is no common subsequence between X and Y, ROUGE-L equals 0.
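Written out, the standard LCS-based F-measure underlying ROUGE-L takes the following form (standard definition, not reproduced from the paper):

R_{lcs} = \frac{LCS(X, Y)}{m}, \quad P_{lcs} = \frac{LCS(X, Y)}{n}, \quad ROUGE\text{-}L = F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}

where β weights recall relative to precision.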
The CIDEr score, designed for image captioning, considers n-gram matches and their relevance using TF-IDF weighting. CIDEr captures the saliency, importance, and consensus present in the reference captions.
These metrics provide quantitative measures to gauge the performance of text generation models, yet they should be supplemented with human evaluation and qualitative analysis for a comprehensive understanding of the generated text's quality.

IV. RESULTS AND DISCUSSIONS
During the training process, the model's convergence is monitored by recording and analyzing the training loss using TensorBoard. The validation loss is also calculated on a separate validation dataset to evaluate the generalization ability of the model. Figures 13 and 14 display the training and validation loss curves across the 100 epochs.
The experimental results exhibit a consistent decrease in the loss over the epochs, indicating the effectiveness of the proposed model in generating accurate and meaningful captions for images.The model achieves low training and validation losses, demonstrating its capability to capture the intricate relationships between images and their corresponding captions.

A. QUANTITATIVE RESULTS
To statistically assess the model's effectiveness, the widely used evaluation metrics BLEU, METEOR, ROUGE-L, and CIDEr are employed. These metrics assess the quality of generated sentences by comparing them to reference phrases using the freely available MSCOCO evaluation tools. They provide insights into the consistency of n-grams between the generated and reference phrases, facilitating a comprehensive comparison with existing image captioning methods. Table 1 presents the evaluation metric scores of the image captioning model on both the MSCOCO and Flickr8k datasets. The BLEU, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L scores are reported, showcasing the model's performance in terms of similarity, precision, recall, and overlap between the generated and reference text.
On the MSCOCO dataset, higher scores on these metrics reflect better performance and alignment with human-generated captions. The BLEU-1 score of 67.7 shows the closeness of the generated captions to the ground truth. ROUGE-L assesses the longest common subsequences; the METEOR score of 22.7 and ROUGE-L score of 47.6 indicate that the model achieves moderate similarity between the generated captions and reference sentences and can generate contextually relevant captions. CIDEr considers n-gram matches and relevance using TF-IDF weighting, and the higher CIDEr score validates the model's effectiveness in producing diverse and informative captions.
On the Flickr8k dataset, the proposed model scored 63.8 (BLEU-1), 44.8 (BLEU-2), 30.4 (BLEU-3), and 20.7 (BLEU-4). Although these scores are slightly lower than on the MSCOCO dataset, the model still generated meaningful captions for the images. The METEOR score of 20.5 and ROUGE-L score of 46.7 demonstrate its proficiency in generating contextually relevant and coherent captions, and the CIDEr score of 65.0 further confirms its capacity to produce informative and diverse captions. This shows a strong ability to understand visual content and convert it into relevant textual descriptions.
To further evaluate the model trained on the MSCOCO dataset, it was tested on 10 random images from the internet to check its efficiency in generating meaningful captions. Its performance is tabulated in Table 2. The BLEU-1 score, which measures the accuracy of unigram captions, achieved an impressive value of 60.2, indicating the model's proficiency in accurately describing individual objects and scenes within the images. As per Table 2, the BLEU-2 score stands at 31.7, signifying the model's ability to generate coherent bigram captions and capture word sequences more effectively. However, the BLEU-3 score for trigram captions displayed a relatively lower value of 14.4: while the model excels at unigrams and bigrams, it has limitations in accurately capturing longer and more complex language patterns for trigram captions. The ROUGE-L score, which evaluates the similarity of the generated captions with the reference sentences, demonstrated a high value of 47.6. This highlights the model's proficiency in generating contextually relevant and coherent captions, aligning well with the reference captions. The CIDEr score attained a value of 58.6, indicating the model's effectiveness in generating diverse and informative captions. On the other hand, the METEOR score, measuring the quality of captions through paraphrases, obtained a moderate value of 22.8. The model captures paraphrases to a certain extent, but there is scope for improvement in incorporating a wider range of language expressions in its generated captions.
Table 3 compares the performance of various image captioning methods using multiple evaluation metrics. GoogleNIC exhibited moderate performance, with relatively higher scores in B1 and B2 [28]. Log Bilinear showed good performance in B1 and B2 but lacked precision in B3 and B4, limiting it to shorter captions [29]. LRCN performed well in B1 and B2 but experienced a significant drop in B3 and B4 scores [43]. CapsNet showed a higher B4, representing longer and more accurate captions [52].

Xception + YoloV4 Object Importance demonstrated impressive performance in ROUGE-L and CIDEr but exhibited lower BLEU scores, showing a limitation in caption precision [51]. Attention NIC, on the other hand, showed excellent results in ROUGE-L and CIDEr but lacked data for the BLEU scores [26]. Our proposed model, with self-attention in both the encoder and decoder, demonstrates the ability to generate coherent and contextually meaningful captions for images in real time.

B. QUALITATIVE RESULTS
Even though the model is evaluated using these metrics, the evaluation should be supplemented with human assessment for a comprehensive understanding of the generated text.
Qualitative analysis complements quantitative metrics in evaluating machine-generated text. Human evaluation considers coherence, relevance, and creativity, while identifying limitations and biases. It helps identify cases where the model produces grammatically correct but semantically incorrect or nonsensical captions and provides a deeper understanding of the model's strengths and weaknesses. The ground truth (GT) and the generated caption (GC) are illustrated in Figure 15 for images chosen to reflect situations faced by visually impaired people.
The analysis of the generated captions in Figure 15 reveals both strengths and limitations of the model's performance in image captioning. In subfigure 15(a), the model effectively captures the main idea of a person riding a motorcycle on a street, demonstrating its proficiency in conveying the essential visual information with minor variations in wording. Similarly, in subfigure 15(b), the model successfully conveys the scene details of a truck and a car driving on a street, despite differences in word order.
However, in subfigure 15(c), the model introduces an additional element, ''vegetables,'' that is not present in the ground truth, indicating a tendency to generalize and include similar items commonly found in markets. This suggests the need for fine-tuning the model to prevent such over-generalizations.
In subfigure 15(d), the GC fails to accurately infer the specific action of ''walking'' and instead identifies it as ''standing.'' This highlights a limitation in the model's capability to accurately recognize dynamic actions within scenes. In subfigure 15(e), the GC misses the object ''ball'' but accurately conveys the game and the action of playing, showcasing the model's ability to focus on the main elements of the scene.
Finally, in subfigure 15(f), the GC provides an accurate and succinct representation of the scene details, demonstrating the model's ability to effectively convey the main elements present in the ground truth. The generated caption expresses a similar meaning to the ground truth, as can be seen visually.
The model exhibits strong performance in capturing the main ideas and essential elements of the scenes, but it also reveals areas for improvement, particularly in accurately recognizing specific actions and avoiding over-generalizations. Combining the quantitative evaluation metrics with qualitative analysis leads to a more robust and reliable assessment of the model's performance in generating accurate, meaningful, and contextually aligned scene captions.

C. AUDITORY FEEDBACK TO USER ABOUT SCENE UNDERSTANDING
In the PerceptGuide aid, the scene captions generated by the model are converted into audio using a text-to-speech module. The audio feedback message is thoughtfully designed to be played twice, giving the user ample time to comprehend the scene details effectively. To facilitate user interaction, the PerceptGuide aid is equipped with a convenient push-button. Upon pressing the push-button, the system captures the current scene and promptly generates a concise and informative audio description. This feature allows users to easily request scene information whenever required, facilitating a smooth and effortless understanding of their surroundings. The real-time, concise audio descriptions enable users to quickly understand the essential details, highlighting the PerceptGuide's effectiveness in enhancing the independence, mobility, and overall navigation experience of visually impaired people.
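A minimal sketch of this audio feedback step is given below, assuming an offline text-to-speech library such as pyttsx3; the actual TTS module used on the device is not named in the paper, and the caption string and speaking rate are illustrative.

```python
import pyttsx3  # offline text-to-speech; used here only as an illustrative stand-in

def speak_caption(caption: str, repeats: int = 2) -> None:
    """Play the generated caption as audio, twice, as described above."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)   # speaking rate (assumed value)
    for _ in range(repeats):
        engine.say(caption)
    engine.runAndWait()

# Hypothetical usage once the FSPA model has produced a caption:
# speak_caption("a man riding a motorcycle on a street")
```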

D. OBSTACLE DETECTION AND DIRECTIONAL PATH GUIDANCE
The PerceptGuide aid has two ultrasonic sensors and vibration motors, one on either side of the rig bag. Obstacle detection uses the sonic time-of-flight concept: the ultrasonic range finder emits sound waves, which travel to the target obstacle and reflect off its surface, and the sensor then detects the returned waves. The time taken by the waves to make the round trip and the speed of sound are used to calculate the distance to the obstacle, using the formula presented in equation (4).

distance = \frac{speed\ of\ sound \times time\ of\ flight}{2}    (4)

The obstacle detection function of the aid was evaluated in a controlled environment. The sonic rangefinder sensors have a maximum detection range of 5 meters. During the evaluations, the aid detected obstacles within a 5 m range and provided directional vibration alerts to the user. The vibration actuators on the left and right sides of the vest vibrated to alert the user to detected obstacles, providing real-time feedback regarding the presence and direction of potential obstacles along the path. When the user approached a left-side obstruction, the left-side vibration motor was activated to indicate the need to move away from that direction. Similarly, if an obstacle was detected on the right side, the vibration motor on the right side alerted the user to adjust their path accordingly.
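A minimal sketch of this distance calculation and the directional alert logic described above; the speed-of-sound constant, the 5 m threshold's use in code, and the motor-driver calls are assumptions (the actual GPIO/PWM interface to the vibration motors is not specified here):

```python
SPEED_OF_SOUND_M_S = 343.0   # speed of sound at roughly 20 degrees C (assumed constant)

def obstacle_distance(time_of_flight_s: float) -> float:
    """Distance to the obstacle from the ultrasonic round-trip time, per eq. (4)."""
    return SPEED_OF_SOUND_M_S * time_of_flight_s / 2.0

def directional_alert(left_dist_m: float, right_dist_m: float, threshold_m: float = 5.0):
    """Drive the motor on the side of a detected obstacle (driver calls are placeholders)."""
    alerts = []
    if left_dist_m < threshold_m:
        alerts.append("vibrate_left")    # placeholder for the left motor driver call
    if right_dist_m < threshold_m:
        alerts.append("vibrate_right")   # placeholder for the right motor driver call
    return alerts

print(obstacle_distance(0.0146))         # ~2.50 m for a 14.6 ms round trip
print(directional_alert(2.5, 7.0))       # obstacle detected on the left side only
```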
The combined use of ultrasonic sensors for obstacle detection and directional vibration alerts demonstrated the PerceptGuide's efficiency in assisting visually impaired individuals.The real-time feedback provided by the vibrating motors allowed users to make informed decisions and avoid potential collisions with obstacles, thereby increasing their mobility and safety.

E. INSIGHTS INTO THE OPERATION OF THE PERCEPTGUIDE AID
The PerceptGuide aid offers both scene information assistance and obstacle avoidance alerts, enhancing navigation for visually impaired individuals within their environment. The aid consistently generates scene captions in surroundings lit with a minimum of 500 lumens of light. However, in areas that lack proper illumination and experience low-light conditions (below 500 lumens), the aid seamlessly transfers control from the monocular camera-based mechanism to the ultrasonic sensor-based directional obstacle avoidance functionality. In situations where the camera-based system struggles to make decisions due to low light, assistance is provided entirely by the ultrasonic sensors. Similarly, in dusty conditions the camera-based computer vision functionality cannot generate a scene description; in such scenarios the camera-based mechanism also transfers control to the ultrasonic-based obstacle avoidance functionality, which provides assistance for an obstacle-free path.
In light drizzling rain, the PerceptGuide aid seamlessly combines camera-based scene understanding and ultrasonic sensor-based obstacle avoidance for effective assistance. However, in heavy rain or extremely poor visibility, the camera-based scene perception struggles. In response, the aid guides users to seek help from pedestrians or caregivers and encourages white cane use for better environmental awareness.
The PerceptGuide Aid employs mono earphones, allowing visually impaired users to hear external acoustic signals and scene descriptions.A 10-12-hour training program will be provided to help users differentiate external sounds from scene-related audio messages.Additionally, vibratory feedback is integrated to convey obstacle information on the navigation path.These dual feedback modes ensure effective operation in noisy environments, maintaining user awareness.
Prior research has introduced various assistive aids to address the mobility challenges faced by the visually impaired community.Augmented canes focused on obstacle alerts and navigation support, while wearable camera-equipped aids offer obstacle recognition and object detection.However, these camera-based solutions often rely on smartphone apps [6], [7], requiring constant internet connectivity, potentially raising costs, and draining smartphone batteries.
Semantic segmentation-based navigation [8] lacks scene descriptions and the ability to detect moving objects. Ultrasonic sensor-based safety orientation [9] cannot discern object details. A 3D sound rendering system [10] provides spatial information but lacks scene descriptions. RGB-D camera-based assistance [11] in indoor environments involves complex processing, leading to slower responses and requiring sophisticated hardware, without providing environmental details. A cloud-based wearable assistive aid uses speech recognition for navigation [12], offering features like traffic light detection, obstacle avoidance, payment support, and navigation. Another aid relies on RetinaNet for indoor navigation [13] with obstacle detection. These aids provide only obstacle detection and cannot give any details about the surrounding scene.
The PerceptGuide distinguishes itself with the Faster Scene Perception Architecture, a lightweight design, and low power requirements. It provides real-time audio information about the scene contents and vibration alerts for directional obstacle detection, and it incorporates ergonomic considerations to ensure user-friendliness.

V. FUTURE SCOPE
The future development of this research involves conducting a comprehensive evaluation with visually impaired individuals. This evaluation aims to carefully assess the effectiveness and real-life usefulness of the PerceptGuide aid in assisting visually impaired people. Our primary focus is to demonstrate the aid's potential as a highly effective assistive tool while also identifying any limitations it may have. Through this evaluation process, we will actively seek feedback for meaningful improvements and refinements to the PerceptGuide.
The PerceptGuide aid encounters challenges in providing accurate scene captions under diverse lighting conditions. A future version of the model can be improved by integrating datasets containing images with varying degrees of lighting variation. This augmentation will improve the aid's ability to deliver precise and contextually appropriate descriptions across different lighting scenarios.

Another limitation is that the aid lacks training on weather-related conditions such as sunny, cloudy, or windy scenes. Bridging this gap requires a dataset with weather-related scenarios, and the model needs to be fine-tuned accordingly.

The proposed model can be further improved for scene understanding by incorporating additional modalities, such as scene audio and scene textual information, to enrich the captioning process. The model can be fine-tuned by training it on domain-specific data relevant to visually impaired individuals to ensure more contextually relevant scene understanding. Real-time object tracking can make the system responsive to dynamic changes in the environment. Extending the language model to support multiple languages can broaden the accessibility of the PerceptGuide to diverse users.

VI. CONCLUSION
The PerceptGuide aid proves to be a highly effective, lightweight, power-efficient, and user-friendly assistive mobility aid for visually impaired individuals. It efficiently combines real-time scene captioning and ultrasonic range-finding sensors, provides obstacle detection with directional vibration alerts, and demonstrates its suitability as an assistive mobility aid. The PerceptGuide shows remarkable real-time scene perception performance, providing auditory feedback about the scene within an impressive timeframe of 1.5 to 2 seconds.

The results of this study present a robust and effective approach for generating accurate and contextually meaningful scene captions. By leveraging feature fusion, self-attention mechanisms, and a multilayer GRU, the proposed method showcases remarkable performance in capturing the intricate relationships between images and their corresponding textual descriptions. The quantitative evaluations conducted on the MSCOCO and Flickr8k datasets show the effectiveness of the model, with improved scores of BLEU-1 67.7, ROUGE-L 47.6, METEOR 22.7, and CIDEr 67.4. The qualitative evaluations of the model highlight that the scene captions are semantically correct and convey significant scene information to the visually impaired person, thereby assisting them to navigate independently.

FIGURE 2 .
FIGURE 2. (a) A blind-folded volunteer is wearing the PerceptGuide Wearable Aid and white-cane, (b) Placement of sensors, actuators, and processing unit on the PerceptGuide aid, (c) Size and dimensions of the aid.

FIGURE 3 .
FIGURE 3. Overview of the faster scene perception architecture (FSPA) generating the scene caption.

FIGURE 4 .
FIGURE 4. Overview of the proposed FSPA (faster scene perception architecture) scene understanding architecture.

FIGURE 5 .
FIGURE 5. The proposed encoder model with multi-scale feature fusion and self-attention.

FIGURE 6 .
FIGURE 6. CNN layers used from the ResNet-50 backbone. The model takes RGB images as input (with 3 channels) of size 224 × 224 × 3 and processes them through the ResNet-50 backbone. The backbone consists of 7 sequential blocks, each with multiple residual bottleneck blocks. These bottleneck blocks consist of convolutional, batch normalization, and ReLU layers. The output of the model is the feature representation of the input images. The feature representation of the image is downsampled by 32, making the output feature size 2048 × 7 × 7. Subsequently, these features are combined with the features from the fusion model at various scales in accordance with the corresponding scales of alignment.

FIGURE 7 .
FIGURE 7. Architecture proposed for feature fusion network.

FIGURE 8 .
FIGURE 8. Encoder self-attention model for image features.

FIGURE 9 .
FIGURE 9. Decoder Network for generating Caption from the features extracted from Encoder.

FIGURE 12 .
FIGURE 12. Decoder self-attention model for sequential data.

FIGURE 13 .
FIGURE 13. Training and Validation loss over the 100 epochs for Flickr8k dataset.

FIGURE 14 .
FIGURE 14. Training and Validation loss over the 100 epochs for MSCOCO dataset.

FIGURE 15 .
FIGURE 15. Qualitative evaluation of images from the internet, comparing human-generated captions with PerceptGuide captions.

TABLE 1 .
Various metrics for image captioning model.

TABLE 2 .
Metric scores for testing the image captioning model on random images.


TABLE 3 .
Performance comparison of scene understanding on MSCOCO.