CNN-based object recognition and tracking system to assist visually impaired people

Visually impaired persons (VIPs) comprise a significant portion of the population and are present in every part of the world. In recent times, technology has proved its presence in every domain, and innovative devices assist humans in their daily lives. In this work, a smart and intelligent system is designed for VIPs to assist their mobility and ensure their safety. The proposed system provides real-time navigation using an automated voice. Though VIPs cannot see objects in their surroundings, they can sense and visualize the environment in which they are roaming. Moreover, a web-based application is developed to ensure their safety. The user of this application can turn on the on-demand function for sharing his/her location with the family without compromising privacy. Through this application, the family members of VIPs are able to track their movement (get location and snapshots) while being at their homes. Hence, the device allows VIPs to visualize the environment and ensures their security. Such a comprehensive device was a missing link in the existing literature. The application uses the MobileNet architecture because its low computational complexity allows it to run on low-power end devices. To assess the efficacy of the proposed system, six pilot studies have been performed, which reflected satisfactory results. For object detection and recognition, a deep Convolutional Neural Network (CNN) model is employed with an accuracy of 83.3%, whereas the dataset contains more than 1000 categories. Moreover, a score-based quantitative comparative analysis is performed using the supported features of devices. It is found that the proposed system outperforms the existing devices with a total score of 9.1/10, which is 8% higher than the second-best.


I. INTRODUCTION
The World Health Organization (WHO) has reported that 285 million people worldwide are blind or visually impaired. Out of these, 39 million are blind [1]. The major diseases that cause visual impairment include refractive error, glaucoma, trachoma, corneal opacities, cataracts, diabetic retinopathy, and unaddressed presbyopia [2]. Visually impaired persons (VIPs) face difficulties in performing activities of daily living (ADLs), e.g., access to work and schooling, moving in their surroundings, interacting with the environment, and searching for common objects (indoor/outdoor), on their own or even with some assistance. The main challenges for VIPs are object detection and recognition, currency identification, reading and translating textual information (signs, symbols), mobility/navigation, and safety [3]. In the past, several approaches, systems, devices, and applications have been developed in the domain of assistive technology to help VIPs perform tasks that they were formerly unable to accomplish [4]. Such solutions generally comprise electronic devices equipped with cameras, sensors, and microprocessors capable of making decisions and providing tactile or auditory feedback to the user. Many of the existing object detection and recognition systems claim high accuracy but cannot provide the information and attributes necessary for tracking VIPs to ensure their safe mobility [5]. Though blind people cannot see objects in their surroundings, it would be helpful for them to know about these objects. Furthermore, there is a need to develop a tracking system through which family members of VIPs can monitor their movement. In light of the requirements above, this paper presents a smart system that performs real-time object localization and recognition. As soon as the system recognizes an object, it sends audio feedback to the user. For example, after a known object (e.g., a car) is identified, the user will hear the word "car."
Moreover, the user's location and a snapshot of the most recently viewed scene are periodically stored on a server that can be accessed by the family members using an application to track the user. For object detection and recognition, the MobileNet architecture is used because its low computational complexity allows it to run on low-power end devices. Since wearable hardware resources are limited, and the system's feedback about the object's name needs to be as close to real time as possible, complex state-of-the-art object recognition approaches might not be viable as first-choice techniques. The main objective of this study is to develop a system for VIPs that has the following features:
• It performs real-time object detection and recognition using a deep learning framework.
• It pronounces the names of the objects that can be seen through the camera, i.e., objects present in the current frame.
• It periodically sends the location of the user to a web server.
• It sends a live feed/snapshots to the web server.
• It provides a web-based interface that the family members of VIPs can use to track their movement while being at their homes. It is an on-demand feature that provides security and ensures the user's privacy.
The rest of the article is organized as follows: the literature review is presented in Section II. Section III presents the proposed framework. The experimental analysis is discussed in Section IV, and the conclusion is drawn in Section V.

II. LITERATURE REVIEW
Researchers have proposed several techniques to develop assistive devices for VIPs. The following technologies have been used in the existing devices: vision-based sensors (cameras), non-vision-based sensors (e.g., IR, ultrasonic, inertial, and magnetic sensing), and other technologies such as Bluetooth low-energy beacons, GPS, and GPRS. In this section, the main focus is on vision-based sensing devices due to their relevance to our proposed system. The existing assistive devices can be grouped into the following three classes according to their functionalities:
• Object detection devices
• Navigation devices
• Hybrid devices (object detection and navigation)


The devices can be sub-classified based on their working principle and by means (sound or vibrations) of conveying information to VIPs. Figure 1 presents the taxonomy of the assistive devices for VIPs. The object detection devices use sensors (laser scanners, ultrasonic devices) and cameras to collect information from the surrounding environment, process it, and provide feedback to the users. The basic working principle of such devices is that they detect the object around the user and give instructions about the object/obstacle and its distance using vibrations or sound waves. Saputra et al. [6] have presented an obstacle avoidance system with the help of a Kinect depth camera for VIPs. It helps detect the obstacle and calculate its distance using the auto-adaptive threshold. The device is tested on ten blind people aged 20-40 years to evaluate the system's performance. The result of the proposed system is promising as it detects the obstacle without any collision from any direction. Yi et al. [7] have presented a blind-guide crutch using multiple sensors. The triplet ultrasonic module detects the obstacle from the front, left, and right sides. It identifies the object using the voice and vibration waves. Kumar et al. [8] have presented an ultrasonic cane that provides information related to the environment and enables the user to move safely. The ultra-Cane consists of a narrow beam ultrasound system that provides 100% obstacle detection. It detects objects which are 2-4 meters away. The proposed device is tested on ten people (ages 20-26 years). The volunteers effectively detected the hurdles within the proposed range. Petal et al. [9] developed a multi-sensor system that helps the user detect an object in the indoor surroundings. The system uses statistical parameters and the SVM (Support Vector Machine) algorithm for detecting the object. The response is given using the audio. Bauer et al. 
[10] have introduced a device based on a wireless camera that captures the surroundings and builds a depth map of them. The object detector determines the semantics of an object, whereas the depth map provides 3D information about the surroundings. Haptic or spoken feedback is given to the user. The application introduced by Chen et al. [11] is helpful in the detection of obstacles and comprises glasses, a long cane, and a mobile application. Objects are detected through the glasses and the cane. If the user falls, the information is uploaded to an online platform and displayed on the mobile device. Pogi and Mattoccia [12] have developed a smartphone-based wearable system for VIPs using deep learning and 3D vision. The system uses an RGB camera to capture frames and a CNN to detect obstacles. The system's performance is exceptional, i.e., close to 98% with the LeNet architecture. The system is lightweight, small in size, and performs detection in real time, but it cannot categorize most daily life objects. The user is guided through tactile and audio feedback. The system is tested on more than 20 individuals with different degrees of visual impairment. The result is very promising with good detection performance.

FIGURE 1. Taxonomy of assistive devices for VIPs

Chae et al. [13] developed a system for detecting collisions through image segmentation. It detects obstacles that the existing solutions cannot detect, e.g., walls and doors. The input frames are segmented into superpixels through the lattices algorithm.
On the segmentation results, a graph-based merging algorithm is used to detect both non-textured and textured objects successfully. The system is tested on sighted and blindfolded people with dynamic and static objects. The experimental results show that the proposed system efficiently detects objects in real time. Salavati and Mahvash [14] have introduced a camera-based system for detecting obstacles using a Deep Neural Network (DNN). The system uses unsupervised deep neural networks to extract global features of the image, whereas a supervised DNN is used to extract local attributes. Nguyen et al. [15] have proposed an electro-tactile device placed on the tongue that helps the user navigate. The device is supported by an antenna that enhances wireless communication. Xiao et al. [16] have introduced an outdoor assistive navigation system controlled by a Raspberry Pi with a GPS sensor that guides navigation. The device has a list of recordings. The user selects any sound from the list to receive navigation instructions. The device also has a speech recognition system to enhance its capabilities. The system is tested at New York City College to evaluate its performance. The result shows that every direction is followed properly using the developed device. Bharambe et al. [17] have presented a device that uses an ultrasonic sensor for navigation and direction. This device is based on GSM and GPRS. GPS is used to obtain the specific location of a particular person. The device is tested on blindfolded persons to evaluate its performance. Martinez-Sala et al. [18] have developed the SUGAR navigation device for VIPs. The designed device is tested on blindfolded persons. The system uses ultra-wideband techniques that provide accurate location information. Its accuracy is about 95%. Aladren et al. [19] have introduced a system based on an RGB-D sensor that guides navigation. The RGB-D sensor is used to obtain range and color information.
The individual wears the device to evaluate the performance. The results show that the algorithm is robust and gives approximately 95% accuracy. Yamashita et al. [20] have introduced a wearable device that helps VIPs navigate using RFID, the Quasi-Zenith system, and the Microsoft HoloLens. The HoloLens contains multiple cameras and sensors which help in positioning. Four experiments have been conducted with each individual to assess the system's performance, so, in total, 16 different scenarios were obtained. In thirteen of these scenarios, users could follow the right path without any collision. Mancini et al. [21] have offered a vision-based device that guides VIPs in jogging, walking, and running. The system consists of gloves, a camera, and a board equipped with motors. The system recognizes the correct path at speeds of around 10 km/h and conveys it through the gloves. Caraiman et al. [22] have proposed an accurate and robust approach for segmentation and reconstruction of outdoor surroundings using IMU and SoV camera data. It helps the mobility of a user in the outdoor environment. The system provided a 3D environment even though the data was noisy.
The system's performance is demonstrated on a large dataset recorded using a camera. The information is conveyed to the user through an audio and haptic interface. "Let Blind People See" is a system proposed by Jiang et al. [23] to detect obstacles, in which audio feedback is used to convey an alert to the users. A mobile camera records the video that serves as the input to the You Only Look Once (YOLO) [24] algorithm, which detects the obstacles in real time. Moreover, a 3D game engine creates and sends 3D locations and audio for the objects. The device is tested on a sighted user whose eyes are covered with paper to evaluate the system's performance. The user successfully identifies the objects. Yang et al. [25] presented a wearable device that helps detect the traversable area through an RGB-D camera based on random sample consensus (RANSAC) and surface normal vector estimation. The device provides a non-semantic interface for conveying the results to VIPs. It works in both indoor and outdoor surroundings. However, due to heavy computation, it is difficult for the device to work in real time, and it detects only a short range of traversable area. To assess the performance of the system, it is tested on eight individuals, three of whom are blind. The results show that the system is useful in an unknown environment. Zhao et al. [26] introduced a hybrid system utilizing a CNN and image features to achieve first-person vision (FPV) navigation. The system takes a video sequence as input and gives as output a map showing the user's position. The process is divided into two sub-processes, i.e., estimation of the location and calculation of the movement. A CNN based on AlphaMex is developed to recognize scenes in real time, and SIFT tracking algorithms are proposed to analyze the movement.
The IMU sensor is easily influenced by noise, and the complete navigation system depends on a single camera instead of using GPS or any other sensors. Kunz et al. [27] have introduced the idea of virtual navigation to help VIPs deal with unfamiliar locations through real walking while staying in a controlled environment. To achieve this, the user wears a camera on the head and carries a laptop on the back. Zhang et al. [28] developed a navigation system for VIPs using ARCore that provides safety and efficiency. A dual-channel human-machine interaction mechanism is introduced to deliver continuous and accurate directions, keeping the path away from risky areas and supporting obstacle detection. The information is conveyed to the user through an audio and haptic interface. For evaluation of the system, different experiments have been conducted. The results show that the proposed system is much more efficient than the already available systems. It was tested on one blind and three visually impaired subjects to assess the system's performance. After the test, a survey was conducted, and the results show that all subjects efficiently followed the path to reach their destination. Kubanek and Bobulski [29] have developed an electronic device that helps VIPs orient themselves in their surroundings through voice signals. The device comprises a memory card, helmet, plug, cables, sensor, Raspberry Pi, headphones, and a resistor for the Kinect sensor. The area in front of the user is scanned through the Kinect sensor, and different algorithms are used to generate the acoustic space. The experiments have been performed on people without visual impairment to check the efficiency of the system. The results showed that the performance of the presented system is high. Rehman et al. [30] have developed an electronic assistance system for VIPs that helps them navigate. It comprises a data transmitter device, sensors, microcontrollers, and a smartphone.
The system is associated with the smartphone. The information is collected through sensors from the environment; the obstacle is detected through a microcontroller after processing the collected information. Bluetooth module helps in communication between smartphones and micro-controller. The smartphone provides the instructions continuously to keep the user on track. The device is tested in a real-time environment for evaluating performance. The obstacle detection is performed at 100cm with 97.33% accuracy. Kannan et al. [31] have designed a robot that acts as path guidance for blind persons. It is an addition to the guide dog.
The robot can move on multiple paths. It can also locate those places that are not even traced by the GPS module. The proposed system comprised the map, camera, and image processing algorithms. The robot helps the users in tracking their location. The robot is operated through a joystick that converts axes into voltage, and the ultrasonic sensor is used to find the distance in the range of 2-3 cm. It is placed on the motor for detecting the obstacles. The IR sensor is used to help in measuring the depth. The response is conveyed to the user through vibration. It is helpful for both indoor and outdoor environments. The experiments are performed on real-time data that give accurate results. Rahman et al. [32] have proposed a wearable system that helps VIPs to move independently. The system helps detect an obstacle that is placed in front of the user. The system consists of a motion sensor, smartphone application, accelerometer, ultrasonic sensor, microcontroller, and the device for data transmission. The audible instruction is produced to assist the user in navigation. The data is transmitted through a microcontroller to the smartphone using Bluetooth. It also helps in updating the current location and keeps them on track. In case, if the user falls, it notifies the guardians. The performance is evaluated by testing it on the target users. The result shows an accuracy of 98.3% for the proposed system. Minhas et al. [33] have developed X-Eye, which facilitates detection and recognition of an object and helps trace a location. It is composed of a camera and a smartphone. Meshram et al. [34] have introduced NavCane, which helps the VIPs based on the communication module, ultrasonic sensor, and the motor for vibration. It helps in guiding the accurate path and also in the detection of an object. It works only in indoor surroundings. To evaluate the system's performance, 80 users participated in the experiments. 
Out of the total number, 30 participants were blind, 30 had low vision, and 20 were older adults. The result shows that the proposed system helps in navigation. Berge et al. [35] have introduced a device that helps VIPs and other disabled people through an Android-based system. It not only detects obstacles but also provides navigation.
The system was tested on 15 users. Ten of these were visually impaired, and five were blind. All of these participants tested the system in indoor and outdoor surroundings. The results of the proposed system are very promising as it successfully helps in navigation. Kammoun et al. [36] have invented a system called NAVIG to aid VIPs. It contains microphones, cameras, headphones, GPS, sensors, and a computer. The device uses a stereoscopic camera located at the top to capture images of the environment. Machine learning techniques are used for object detection. NAVIG works in both indoor and outdoor surroundings. The results demonstrated that the system works correctly. Jung et al. [37] introduced a wearable system that assists VIPs using artificial intelligence (AI) techniques. The system guides VIPs at zebra crossings and marked crosswalks. The system comprises a waist-mounted device, a walking cane, and sunglasses. A deep learning method is used to recognize zebra crossings in real time. When VIPs reach a zebra crossing, they instantly receive a message telling them whether to move or stop. The system achieved an accuracy of up to 90%. Jung et al. [38] developed an intelligent obstacle detection system for VIPs. The system consists of a walking stick, smart glasses, a cloud-based platform, and a mobile application. VIPs wear the smart glasses and hold the walking stick to identify obstacles. The stick vibrates to guide VIPs. Furthermore, a notification is sent to their registered family member in case of any accident. The authors claimed that the system is 98.3% accurate. Rahman and Sheikh [39] developed a system for VIPs that helps them identify outdoor and indoor objects through audio messages. The system is based on four sensors that help detect objects from all directions. The overall accuracy of the system is approximately 99%.
In previous years, numerous studies have been performed to address the challenges, i.e., understanding the environment, increasing perception, the nature and position of an obstacle on the travel path, and facilitating VIPs in outdoor and indoor environments. Mostly, the developed technologies include functions for selecting routes and detecting an obstacle [41]. But still, there is a need to design and develop intelligent systems that help in object detection and recognition and facilitate securing the mobility of VIPs.

III. THE PROPOSED FRAMEWORK
The proposed system comprises a Raspberry Pi digital signal processing (DSP) board with a GSM and global positioning system (GPS) module, headphones, and a camera. The DSP board captures a live feed from the video camera and passes it to the object detection and recognition module (CNN model).
The model predicts the objects in the current video frame and passes their names to the text-to-speech converter (SAPI) module, which pronounces them through the headphones. Moreover, the labeled snapshot is encoded using the Joint Photographic Experts Group (JPEG) encoder. The encoded image, together with the user's precise location, is saved in the server's database. Furthermore, a user-driven feature is provided to VIPs that enables family members to track their movement (get the current labeled frame and location) from home through a web interface. The user can enable or disable this feature with a single button press. Figure 2 provides the system architecture of the proposed solution, which comprises the following main modules:

A. DSP Board (Raspberry Pi 3 Model B)
The Raspberry Pi Foundation has developed a number of small single-board computers to handle limited control functions. A system on a chip (SoC) with an integrated Acorn RISC Machine (ARM) based central processing unit (CPU) is installed on each Raspberry Pi. We have used the Raspberry Pi 3 Model B for the development of the proposed system [42]. The device is about the size of a standard credit card, and it is the third generation of the series. Its primary specifications include an ARM Cortex-A53 (Broadcom BCM2837) quad-core 64-bit CPU running at 1.2 GHz and 1 GB of RAM. The board offers a 4-pole stereo output and composite video port, a full-size HDMI port, and a CSI camera port that is used to connect the camera, along with wireless LAN (802.11 b/g/n) and a Bluetooth 4.1 chipset.

B. Software-based Object Detection and Recognition Module (SODRM)
The system employs the MobileNet architecture [41] for object detection and recognition because it is lightweight and provides good accuracy, which makes it a suitable choice for devices/DSPs with low computational power. A convolutional neural network (CNN) is trained for object detection in the Python programming language using the TensorFlow API.

C. MobileNet Architecture
MobileNet [41] is based on a streamlined architecture for building lightweight deep neural networks. It consists of 21 layers that are either simple or deep convolutional layers, where the deep layers are built on depth-wise separable convolutions. Its architecture is defined in Table 1. The architecture of the simple convolutional layers is shown in Figure 3. A simple convolutional layer consists of three steps: a convolution with a 3×3 kernel, followed by batch normalization (BN) and the rectified linear unit (ReLU) activation function. A deep layer, on the other hand, consists of six steps. Its first three steps are the same as those of a simple layer. These steps are followed by a 1×1 convolution, batch normalization (BN), and the ReLU activation function. The architecture of the deep convolutional layers is shown in Figure 4.
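The saving from the deep (depth-wise separable) layers can be illustrated with a short arithmetic sketch. It assumes that the 3×3 stage of a deep layer is a depth-wise convolution (one filter per input channel) and that stride 1 with same padding is used, so the feature map size does not change. The layer dimensions are hypothetical and are not taken from Table 1.

```python
def standard_conv_cost(k, c_in, c_out, feat):
    """Multiply-adds of a k x k standard convolution on a feat x feat map."""
    return k * k * c_in * c_out * feat * feat


def separable_conv_cost(k, c_in, c_out, feat):
    """Multiply-adds of a k x k depth-wise convolution plus a 1 x 1 convolution."""
    depthwise = k * k * c_in * feat * feat   # one k x k filter per input channel
    pointwise = c_in * c_out * feat * feat   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 14x14 feature map.
std = standard_conv_cost(3, 256, 256, 14)
sep = separable_conv_cost(3, 256, 256, 14)
ratio = sep / std  # analytically equal to 1/c_out + 1/k**2

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.3f}")
```

For this hypothetical layer, the separable form needs roughly 11.5% of the multiply-adds of the standard convolution, which is consistent with the low computational complexity that motivates the choice of MobileNet.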

D. Structural Similarity (SSIM) Index
In a video signal, there is a strong correlation among the frames belonging to the same scene. Detecting, recognizing, and pronouncing the same objects again for such frames is not useful. It is better to check the similarity between two successive frames in display order. If the similarity index is not greater than a certain threshold T, the current frame is processed; otherwise, it is skipped (i.e., there is no need to process it). This skipping saves computational power and helps to achieve real-time operation. In this work, the SSIM index [42] is used to measure the similarity between two successive frames and thus to decide whether the current frame needs to be processed. The SSIM index between frames I and J of dimension M×N is measured by the following expression:

\[ \mathrm{SSIM}(I, J) = \frac{(2\mu_I \mu_J + C_1)(2\sigma_{IJ} + C_2)}{(\mu_I^2 + \mu_J^2 + C_1)(\sigma_I^2 + \sigma_J^2 + C_2)} \]

where
• \mu_I and \mu_J indicate the mean of the frames I and J, respectively.
• \sigma_I^2 and \sigma_J^2 indicate the variance of the frames I and J, respectively.
• \sigma_{IJ} is the covariance of the frames I and J.
• C_1 and C_2 are small constants that stabilize the division.
After detailed simulations on various video streams, the SSIM index threshold T is set to 0.7. If SSIM > T, the frame is skipped; otherwise, it is processed.
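The frame-skipping rule can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the prototype: it computes a global, single-window SSIM over whole grayscale frames (the SSIM index is commonly averaged over local windows), and the constants C1 = (0.01·255)² and C2 = (0.03·255)² are the customary defaults, assumed here. The function names are hypothetical.

```python
def ssim_index(frame_i, frame_j, dynamic_range=255.0):
    """Global (single-window) SSIM between two equally sized grayscale frames."""
    a = [float(p) for row in frame_i for p in row]
    b = [float(p) for row in frame_j for p in row]
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and of the same size")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # assumed default constant
    c2 = (0.03 * dynamic_range) ** 2  # assumed default constant
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def should_process(prev_frame, cur_frame, threshold=0.7):
    """Skip the frame if SSIM > threshold; otherwise process it."""
    if prev_frame is None:  # the first frame is always processed
        return True
    return ssim_index(prev_frame, cur_frame) <= threshold
```

With T = 0.7, a frame identical to its predecessor yields an SSIM of 1 and is skipped, whereas a frame with substantially different content falls below the threshold and is passed to the CNN.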

E. Text to Speech converter (SAPI)
In this work, the Speech API (SAPI 5.3) [43] is used to generate audio from the input text. To pronounce the objects detected in the current frame, SAPI takes the textual input (consisting of the names of the objects present in the current frame) from the CNN and generates an audio signal. The output of SAPI is sent to the headphones, through which VIPs can hear the names of the detected objects.

F. Encoder
The CNN provides labeled snapshots in which bounding boxes are drawn around the objects and labels/names are assigned. These images are encoded using the Joint Photographic Experts Group (JPEG) encoder so that they can be stored in a database and accessed through the web interface.

G. GPS and GSM module
The Pi Anywhere 4G and LTE Hat for the Raspberry Pi Beta is used in the proposed system [44]. It provides the Raspberry Pi minicomputer with 4G mobile data, GPS positioning information, and battery support. It is a simple plug-and-play module. It provides ultra-fast 4G (100 Mbps down / 50 Mbps up) Internet connectivity for video streaming and downloads. Moreover, it provides easy access to the onboard GPS, which exposes location data. The prime purpose of the module is to provide connectivity and the location of the user for tracking. Figure 5 shows the GPS and GSM module chips.

H. Web-based Application
The motive behind the development of the web interface is to ensure the safety of VIPs. This interface enables the family members of VIPs to track their movement (get location and snapshots) while sitting at home. The web server runs on the DSP board. It periodically (in real time) sends the person's position (GPS coordinates) and a labeled snapshot to the web interface. The web interface plots the received coordinates on a map using the Google Maps API, which provides the user's real-time location. Moreover, it displays the most recently received labeled image and stores all the received images in a gallery. The gallery is helpful for reviewing the visited locality.
In this work, Django [45] is used to develop the web-based application. To monitor a visually impaired person, a user needs to log in to a web page via a web browser using the given ID. After signing in, the user is redirected to the dashboard, which contains the most recent snapshot received from the web server and the real-time location of the VIP.
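The data sent for tracking can be pictured as a small record holding the GPS coordinates and the JPEG-encoded snapshot. The sketch below is only an illustration using the Python standard library; the field names, the JSON/base64 encoding, and the function `build_tracking_record` are assumptions and do not describe the actual message format of the prototype.

```python
import base64
import json
from datetime import datetime, timezone


def build_tracking_record(latitude, longitude, jpeg_bytes, sharing_enabled):
    """Return a JSON string for the server, or None when sharing is switched off."""
    if not sharing_enabled:  # on-demand feature: nothing is sent unless enabled
        return None
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "latitude": latitude,
        "longitude": longitude,
        # JPEG bytes are base64-encoded so that they fit into a JSON document.
        "snapshot_jpeg_b64": base64.b64encode(jpeg_bytes).decode("ascii"),
    }
    return json.dumps(record)
```

Returning `None` when sharing is disabled reflects the on-demand nature of the tracking feature: no location or snapshot leaves the device unless the user has switched the feature on.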

IV. TECHNICAL EVALUATION
This section presents scenario-driven evaluations to test the efficacy of the proposed prototype. The idea is that blind people are concerned about the objects present in their close vicinity. The system reports the presence of an object, conveys its type through voice, and periodically sends the person's location to the web server. For object detection and recognition, MobileNet is used because it is fast and requires less computational power. The model is trained on the ImageNet dataset, which is a large-scale object detection, segmentation, and captioning dataset comprising 2.1 million images in 1000 categories [46]. Sample objects of the ImageNet dataset are shown in Figure 6. The training of MobileNet is performed on the Ubuntu 16.04 LTS platform with a Xeon E3-1231 v3 (4 cores @ 3.40 GHz) and a GTX 1060 [47]. It achieves an accuracy of 83.3%. Table 2 compares the MobileNet architecture with other models in terms of accuracy, number of model parameters, error rate, and complexity in millions of floating-point operations (MFLOPs) [48]. It shows that the MobileNet architecture is 32 times smaller in terms of parameters and 27 times less computationally intensive than VGG [41]. However, the accuracy of ShuffleNet [48] is higher than that of MobileNet. The proposed system is flexible enough to accommodate any model, including ShuffleNet. In a future implementation, ShuffleNet may be incorporated instead of MobileNet.

FIGURE 8. Practical analysis of prototype for a live video stream
The average loss graph of the trained model is shown in Figure 7. It is evident from the figure that after 800k iterations, the loss reaches its minimum of about 1.39. The prototype works as follows: it takes input from the camera, performs object detection and recognition, and pronounces the names of the objects that can be seen through the camera. The notification is sent to the user through the headphones. The system can detect and recognize multiple objects in a snapshot. In the case of multiple objects in one snapshot, the objects are sorted in descending order according to their confidence level/probability of prediction. The object with the highest confidence level is pronounced first, followed by the rest of the objects. Whenever an object is detected, whether it is predicted correctly or incorrectly, the label of the closest match is conveyed to the user.
On the other hand, if the object is unknown, it will not be detected, and the user will get feedback that there is an unknown object that the system is unable to identify. The object detection and recognition feature of the proposed system helps the user search for objects in the street, home, or store. It may also help VIPs visualize the environment in which they are roaming. The second main feature of the prototype is to provide secure mobility to the user through an application created to monitor VIPs. Through this application, a family member can see the location of the user and the snapshot/live feed. It is an on-demand/optional feature to ensure the user's privacy. It is up to the user whether to turn it on or off. They may turn on this option only if they are lost. After signing in to the application, the observer is directed to the dashboard, which contains the most recent snapshots from the user and shows the user's real-time location on Google Maps.
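The announcement order described in this section (objects sorted by descending confidence, with fallback feedback for unknown objects) can be sketched as follows. The `(label, confidence)` input format, the `min_confidence` cut-off of 0.5, and the function name are assumptions made for illustration; the text does not specify how the prototype decides that an object is unknown.

```python
def build_announcement(detections, min_confidence=0.5):
    """Build the text to be spoken from a list of (label, confidence) pairs."""
    # Assumed rule: predictions below the cut-off are treated as unknown.
    known = [d for d in detections if d[1] >= min_confidence]
    if not known:
        return "unknown object"
    # The object with the highest confidence is pronounced first.
    ranked = sorted(known, key=lambda d: d[1], reverse=True)
    return ", ".join(label for label, _ in ranked)


print(build_announcement([("bicycle", 0.91), ("person", 0.98)]))
```

The returned string would then be handed to the text-to-speech module, so the user hears the most confident detection first.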
The prototype is tested on the most common objects of day-to-day life in different environments to evaluate the performance of the proposed system. In the first scenario, multiple objects from the live video stream, such as a human, cat, horse, laptop, cup, and bottle, are detected and recognized with high confidence. This scenario demonstrates live detection and shows the capability of deep learning algorithms to localize and detect different objects. It also shows that the prototype can detect multiple objects in a single snapshot.
In the second scenario, the prototype is tested on offline images. The results are shown in Figure 9. Different images were uploaded, and the system recognized them accurately. The system can detect single or multiple objects. In frame 2, the prototype detects multiple objects, such as persons with a maximum confidence level of 98% and bicycles with a confidence level of 91%, even in the offline state.
In the third scenario, the prototype is tested on offline videos. The videos are selected so that they contain phenomena that can disturb object detection, e.g., bright scenes and dark scenes. The scenes also cover both indoor and outdoor environments. This is a prerequisite for validating the flexibility and strength of the system to function properly in any situation. Figure 10 shows the results for different scenes of the test video. The system also works in dim light. In the 1st frame, a person is moving outside, and the prototype detects the person with high confidence. In the 2nd frame, the system detects the person even in the dark. In the 3rd frame, the system recognizes the person with a 94% confidence level. It is also observed that the prototype functions properly in both indoor and outdoor environments.

FIGURE 9. Practical analysis of prototype for offline images
In the fourth scenario, the prototype is tested on household objects. These are the most common objects of day-to-day life, and the system can easily detect and recognize them, as displayed in Figure 11. In the 1st frame, the system correctly detects the table and conveys the information to the user. Similarly, in the 4th frame, the system detects multiple objects in a single frame, i.e., the bed, the person sitting on the bed, and a vase.
In the fifth scenario, the prototype is tested in an outdoor environment. Figure 12 shows the experimental results. In the 1st frame, the system detects the car, but it cannot detect multiple small-sized objects. In frame 4, there are two buses; the system recognized one but was unable to recognize the other. In the other frames, the system provides accurate results.
In the sixth scenario, the prototype is tested in the outdoor environment to evaluate the second major feature, i.e., secure mobility. When this option is turned on, the user's current location is shown to the observer. The complete trajectory of the user is drawn on Google Maps and can be monitored by the observer. In addition, the application periodically receives snapshots from the user to provide details of the current environment. The working of this feature is shown in Figure 13.
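The on-demand location sharing evaluated in the sixth scenario can be pictured with a small data-structure sketch. It is an illustration under stated assumptions only: the class names, field names, JSON payload layout, and example coordinates are hypothetical and are not taken from the prototype's application.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LocationFix:
    latitude: float
    longitude: float
    timestamp: float                   # seconds since the epoch
    snapshot_id: Optional[str] = None  # hypothetical reference to an uploaded image


@dataclass
class TrackingSession:
    sharing_enabled: bool = False      # on-demand switch controlled by the user
    trajectory: List[LocationFix] = field(default_factory=list)

    def record(self, fix: LocationFix) -> Optional[str]:
        """Store the fix and return a JSON payload, or None if sharing is off."""
        if not self.sharing_enabled:
            return None                # privacy: nothing is stored or sent
        self.trajectory.append(fix)
        return json.dumps(asdict(fix))


session = TrackingSession()
ignored = session.record(LocationFix(33.6844, 73.0479, 0.0))  # sharing is off
session.sharing_enabled = True
payload = session.record(LocationFix(33.6850, 73.0485, 60.0, "snap-001"))
print(payload)
```

In this sketch the observer's map would be drawn from `session.trajectory`, which only grows while the user has sharing switched on.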

V. TECHNICAL EVALUATION OF REAL-TIME SCENARIOS
The proposed system has been tested in real time with four users (one blind, one blindfolded, and two sighted persons) aged 20-25, as shown in Figure 14.
The users used the prototype as displayed in Figure 15. They were instructed to move along multiple trajectories one by one, and their feedback on object detection and voice quality was collected. Several objects, e.g., a dust bin, bottle, and table, were placed along these trajectories. The results show that the system correctly detected most objects along the path. The users could also hear the name of the object in the current frame through the text-to-speech converter. The camera used in the evaluation was able to detect obstacles within a range of 6 m. The results establish that the system accurately detected obstacles and measured their distance. The system produces a fast response and allows a person to [...] The objects comprise static objects (e.g., bed, table, and chair) and moving objects (e.g., bicycle, car, and pedestrian). A visually impaired person faces many problems in unfamiliar surroundings. An application has been created in Android Studio and integrated with the Google Maps API. The precise location and snapshot of the user were transmitted from the application to the web server so that caregivers and family members could track the user to ensure the security of VIPs. Several experiments were performed with the user moving in indoor and outdoor environments to evaluate the module. The application displayed the longitude and latitude of the location on the map, as shown in Figure 16. The pilot testing of the system was successful, as most of the participants were able to navigate correctly.
The system also has a few limitations. During the outdoor experiments, participants faced problems in detecting road curbs, changes in the road surface, and staircases. In the indoor environment, participants faced difficulties in detecting soft fabrics such as curtains and cushions.

VI. COMPARATIVE ANALYSIS
This section presents a quantitative comparative analysis of the proposed system against existing assistive devices. The quantitative analysis is based on the features listed in Table 3, a few essential attributes used by researchers in previous studies [1]. Assistive systems for VIPs need to provide clear and concise information in a timely manner, reliable performance throughout the day and night, proper operation in both indoor and outdoor environments, real-time analysis, and a high accuracy rate. Otherwise, the manufacturer may not be able to compete and survive in the market. The assessed features are essential for designing an assistive device/system for VIPs. For evaluation, each feature is assigned a weightage: a device that contains all the mentioned features receives a weightage of 10. A device that works only in the daytime or only at night, or only indoors or only outdoors, receives a score of 5 for that feature.
The assistive devices developed so far for VIPs provide multiple features, such as efficient performance during the night or day, operation in both outdoor and indoor environments, object detection, and identification of dynamic or static objects. A few of the devices assist in navigation and cover many other features. The evaluated features are the basic features that guide the design of assistive devices. The highest weightage, 10, is achieved by devices having all the basic features. The score for each feature of every system is assigned based on the collected information. Some of the features are still at the research stage. The users' responses are also considered, but only for those devices tested in real-time scenarios. The value of feature k of a device is denoted Vk. It ranges from 0 to 10, where 10 is given to a device that contains all features. A feature value of 5 is given to a system that works only indoors or only outdoors, e.g., Smart Cane and other devices. The same criteria are applied to the detection range of the device: a value of 10 is assigned if the device covers a large range, and a value of 2.5 is assigned to devices whose detection range is low, i.e., less than or equal to 1 m. An equation is defined to compute the total score of every system based on the features in Table 3, where N represents the total number of features and k denotes a particular feature. Table 4 gives a quantitative analysis of the assistive devices. It is evident from Table 4 that the proposed system outperforms the others in terms of the total score, owing to the richer set of features it supports. It is important to note that none of the devices is perfect, and all need improvements in their design to fulfill the requirements of VIPs. Figure 17 presents a complete picture of the evaluation, with an individual score for each system.
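The scoring equation itself is not reproduced in the text. One form consistent with the definitions above, assuming every feature carries equal weight, is the mean of the feature values: Score = (V1 + V2 + ... + VN) / N. The sketch below applies that assumed form together with the value rules stated in the text; the function names, the feature keys, the value of 0 for a device that supports neither setting, and the example device are hypothetical.

```python
from typing import Dict


def coverage_value(indoor: bool, outdoor: bool) -> float:
    """10 if the device works in both settings, 5 if only one, 0 if neither (assumed)."""
    if indoor and outdoor:
        return 10.0
    return 5.0 if (indoor or outdoor) else 0.0


def range_value(detection_range_m: float) -> float:
    """2.5 for a range of 1 m or less, 10 otherwise (intermediate grades omitted)."""
    return 2.5 if detection_range_m <= 1.0 else 10.0


def total_score(values: Dict[str, float]) -> float:
    """Assumed aggregation: the mean of the N feature values V_k."""
    return sum(values.values()) / len(values)


# Hypothetical device, not one of the systems listed in Table 4.
device = {
    "coverage": coverage_value(indoor=True, outdoor=False),  # 5.0
    "range": range_value(0.8),                               # 2.5
    "real_time": 10.0,
    "object_type": 10.0,
}
print(total_score(device))  # 6.875
```

With this aggregation, a device reaches the maximum of 10 only when every feature value is 10, which matches the stated 0 to 10 range of the score.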
Systems with higher scores demonstrate robust and enhanced performance.

The device functions in the daytime, at night, or both.

Analysis Type
Whether the device responds in real time or not.

Object Type
The device detects static or dynamic objects.

Security
The device helps in securing the person through monitoring.

Range
The device is either short range or long range.

VII. CONCLUSION
This paper presented a smart and intelligent system that assists VIPs in mobility and ensures their safety. The proposed system is based on the day-to-day requirements of VIPs. It assists them in visualizing the environment and provides a sense of the surroundings. They can recognize objects around them and sense the natural environment using the CNN-based, low-power MobileNet architecture. Moreover, a web-based application is developed to ensure the safety of VIPs. The user of this application can turn on the on-demand function to share his/her location with the family. This benefits family members, who can monitor the movement of VIPs and track their location using the live feed from the camera. The experimental analysis shows that the proposed system provided satisfactory results and outperformed other devices in terms of supported features. The task is very challenging, but the suitability of a device can be computed using state-of-the-art techniques. In the future, additional criteria for selecting features, such as the aesthetics and appearance of the device, may also be considered to evaluate its performance. Moreover, bone conduction headphones may be incorporated to improve feedback. Finally, after launching the device, surveys and user comments will be gathered to enhance its performance.