Visualization of Salient Object With Saliency Maps Using Residual Neural Networks

Visual saliency techniques based on Convolutional Neural Networks (CNNs) perform very well at predicting fixations in a scene, but such networks are hard to train because of their complexity. The Residual Network (ResNet) model is better able to optimize features for predicting salient areas, expressed as saliency maps, within images. To obtain saliency maps, we present a combined framework that contains two streams of the Residual Network model (ResNet-50). Each ResNet-50 stream enhances low-level and high-level semantic features, and together the streams form a network of 99 layers that operates at two image scales to generate the saliency map. The model is initialized by transfer learning from weights pretrained on ImageNet for object detection, with some modifications to minimize prediction error. At the end, the features of the two streams, computed at the low and high image scales, are fused. The model is fine-tuned on four commonly used datasets and compared, both qualitatively and quantitatively, with state-of-the-art deep saliency models.


I. INTRODUCTION
The field of computer vision has changed dramatically with the rise of Convolutional Neural Networks (CNNs), one of the most successful forms of Artificial Neural Network (ANN) architecture. Visualizing the salient object in an image with CNN models has therefore become a highly active research area, which falls under supervised machine learning [5]. CNNs learn hierarchically and extract highly discriminative information for classification from raw images [16]. Visual saliency detection is one of the main challenges in computer vision, and CNNs are widely used to integrate features from different layers into saliency maps [38]. Saliency map processing has attracted considerable research interest and has proved beneficial in numerous applications [2]. Saliency maps, as depicted in Fig. 1, can be of great benefit for 2D image applications, object classification, action classification, video applications, video analysis, and quality assessment [2], [34].
The associate editor coordinating the review of this manuscript and approving it for publication was Peng Liu.
In the common paradigm, fixation prediction yields sparse, blob-like salient areas that pop out, whereas salient object detection typically produces clean, connected regions [1]. Computational fixation prediction models need more effort than humans to produce a saliency map of the salient object in an image. The human visual system, by contrast, directs eye fixations to useful content naturally and rapidly when viewing real-world scenes [12]. Researchers therefore aim to understand and predict visual saliency in a way that simulates the human visual process, which pinpoints the most prominent object in a scene effortlessly [1].
The aim of this paper is to extract the most salient, informative objects together with their semantic regions in an image in order to understand the whole scene. This extraction simulates the functionality of biological visual attention systems [43]. The main motivation is to provide new insights into human biological attentional processes and to open new ways of understanding visual attention, understanding complex scenes, detecting salient objects in low-clutter contexts, and building new artificial intelligence applications based on image or video saliency detection [11], [43]. Visual saliency models commonly use a multiscale configuration, which integrates information at low and high image scales, to improve accuracy [11]. This improves the saliency detection performance of our model, which finds small salient regions at the high scale and the centers of large salient regions at the low scale [11].
In addition, various prediction models build saliency maps from the probability distribution of eye fixation positions on the image [11], from low-level features such as multiscale contrast and color spatial distribution that describe a salient object locally and regionally, or from high-level features such as ''objectness information'' [2]. Saliency models have since emerged that fixate the most prominent regions while ignoring less significant parts, but there is still room for improvement because the problem is complex: scenes contain many object types and large dissimilarities between multiple objects [12], and images vary with camera viewpoint, illumination, object pose, partial occlusion, and unrelated background, as shown in Fig. 2. Although CNNs have produced a series of breakthroughs in image detection, they are difficult to train because of their complexity. Training a CNN model normally takes a long time, and CNN-based saliency systems may have limited power when known and obvious objects are not present in the image [2]. To address this problem, the Residual Network (ResNet) model [9], one of the deep CNN models, is used to carry strong semantic features of the image. A feature describes a particular attribute of an object; commonly used features are size, color, and shape. The primary objective is to produce a saliency map that spatially encodes the level of saliency for visual attention. We therefore propose a two-stream framework that progressively extracts abstract features from raw image pixels and that has richer prior information for saliency prediction, because the model has already learned to identify images from the ImageNet [33] dataset.
This is a case of transfer learning, in which features learned on one task are reused for another, with or without fine-tuning. Transfer learning is considered especially important for smaller saliency datasets [21]. For image identification, ResNet [9] uses a single stream with shortcut connections across its blocks of layers, which reduces computational complexity, and then sums the results at the end. In contrast, we use two ResNet-50 [9] streams running in parallel at two image dimensions and, at the end, produce grayscale saliency maps from one combined deep residual network [9] of up to 99 layers.
Overall contribution: our proposed framework addresses these challenges as follows.
• We explore several CNN models that integrate feature maps after fusion, and design a two-stream framework based on ResNet-50 [9], which is efficient at capturing global visual contrast information.
• We investigate the effect of ResNet-50 [9] at different image dimensions. The key ideas are input data diversity and high image dimensions, which yield better saliency and enhance the robustness of the saliency framework.
• We use four challenging datasets for the analysis and evaluation of our saliency model.
• We provide an extensive analysis and a fair comparison with state-of-the-art saliency prediction models in terms of qualitative and quantitative results.
The rest of this paper is organized as follows: Section II reviews related work. Section III describes the design of our visual saliency model. Section IV gives the details of model training and the evaluation of our model. Section V discusses the final results. Finally, Section VI concludes the paper.
VOLUME 9, 2021

II. RELATED WORK
This section discusses recent CNN-based saliency models that predict the probability distribution of eye fixations over an image. In a saliency map, each pixel has its own intensity, which indicates how strongly that pixel belongs to the most salient object. These methods achieve better performance than conventional visual saliency models. In [2], Jia et al. proposed EML-NET, an improved saliency method that uses multiple CNN layers to learn visual features. It achieved encouraging results on a comprehensive saliency dataset by merging prior knowledge learned by CNN models. The approach can be scaled further, although extracting features from several layers then becomes more challenging. In [3], the researchers proposed a framework built on two jointly trained CNN models: one generates top-down visual saliency and the other performs classification. In addition, the authors collected an eye gaze map dataset with a Tobii T60 eye tracker and evaluated performance in two forms: saliency map quality and improved classification accuracy. Furthermore, a comparison between the Inception VGG-19 and SalClassNet classifiers was shown.
In [1], Feng et al. designed a comprehensive CNN architecture that captures global and local contrast features at different scales and can successfully locate the salient region within images. Comparative results with ten state-of-the-art architectures were also reported. In [4], the authors designed a network that extracts multi-level semantic features and learns end-to-end, pixel-wise visual saliency at different scales, capturing the global perspective through link layers with large receptive fields. Its key factors were large depth, kernels of different sizes working in parallel to pinpoint saliency, larger receptive fields for global context, and a location-dependent center bias for pattern identification. The network proposed in [8] contained two end-to-end CNN streams: a fixation stream pretrained on eye tracking data for human fixation prediction, and a semantic stream pretrained on an image identification dataset to extract semantic cues from the input images. These two streams were merged by an inception-like module with convolution and deconvolution layers to detect complex salient objects. The authors of [13] presented a deep CNN-based attentional push approach for saliency prediction. The model contained two pathways: a saliency pathway, fed with the whole image, that computes the augmented saliency maps, and a push pathway, fed with a 2-D cropped image of an actor's head, that estimates the gaze of the actors in the scene. A small ConvNet then merged the two pathways and generated the saliency map. In [15], a ConvLSTM model built on LSTM iteratively attended to different locations in the image to refine the predicted features, and learned prior maps generated by Gaussian functions. In this way, the ConvLSTM model learned the center bias without manually mixing in prediction features.
In [17], Cao et al. established a simple end-to-end CNN that identifies input features with fewer parameters to produce visual saliency maps. The authors also performed extensive experiments to select high-quality features at low, middle, and high layers. The main motivation of this design was to select input features that enable the network to improve its results and that closely match the contrast evidence present in the ground truth masks. In [23], Ji et al. proposed a new encoder-decoder CNN framework that introduces multi-scale spatial-wise and channel-wise attention layers. These attention layers combine the contextual information of features at varying scales to produce the saliency map. The structure was also designed to produce saliency maps with accurate edge information, and it showed effective results on various datasets. In [25], Monroy et al. extended a CNN architecture by transfer learning to accurately predict saliency for 2D and omnidirectional images (ODIs). The saliency maps generated by this pipeline were very close to the ground truth.
In [34], the authors suggested a novel way to extend a 2D prediction method to 360-degree images by applying it to cube face images. This method used a CNN-based fusion approach trained on CMP images with a new loss function. In [30], the researchers introduced two new datasets: odd-one-out (O^3) images and psychophysical patterns (P^3). These datasets were used to evaluate the ability of visual saliency algorithms to find a single target. The effect of training CNN-based saliency models on these datasets was also investigated, and no major improvement in the ability to find an odd-one-out target was found. A pyramid feature attention network (PFAN), proposed in [38], enhanced contextual and spatial features by using a novel CNN. It contained four modules: context-aware pyramid feature extraction (CPFE), channel-wise attention (CA), spatial attention (SA), and edge preservation (EP). CPFE was designed to acquire rich context features at multiple scales, while the other modules were used to generate feature maps, which were then fused to produce the saliency regions.
Moreover, in [28], the authors presented a novel computational saliency method named the deep spatial contextual long-term recurrent convolutional network (DSCLRCN), which first learns local features from the input images in parallel. It then captures long-term spatial interactions with the global context and the overall scene context to infer the saliency maps. In [29], the authors proposed a multiresolution convolutional neural network (Mr-CNN), an eye fixation prediction framework that learns two types of features simultaneously from input images. It was trained on fixation and non-fixation locations at multiple resolutions, taking raw image pixels as input. Finally, top-down and bottom-up features were integrated in the last layer to predict visual saliency.
Gaze II [21] used deep features from the VGG-19 model, which was originally trained for image identification, and trained a few readout layers on top of VGG-19 for saliency prediction. A rigorous test with conservative cross-validation achieved a top performance of 87% in the area under the curve metric. In [40], the authors introduced hierarchical deep CNNs (HD-CNN), which embed deep CNNs in a two-level hierarchy. This model separates easy classes from difficult classes by using coarse and fine category classifiers. During training, the coarse category classifiers use a multinomial logistic loss with global fine-tuning. In addition, the fine category classifiers and layer parameters make HD-CNNs more scalable for visual prediction. In [7], the researchers analyzed noisy saliency maps and presented a novel hypothesis about irrelevant features that pass through the ''ReLU'' activation function. They then proposed a method that applies layer-wise thresholding during back propagation. A comparison of the CNN models used for visual saliency and a quantitative comparison of the evaluation metrics of different deep learning saliency models on the challenging MIT300 dataset are shown in Table 1 and Table 2, respectively.

III. DESIGN OF VISUAL SALIENCY MODEL
In this section, we discuss the main properties of the residual neural network and the details of our two-stream visual saliency model architecture.

A. RESIDUAL NEURAL NETWORK
First, we introduce the basic structure of the Residual Network (ResNet) model [9], a deep feed-forward arrangement of interconnected convolutional layer blocks that can learn and highlight features from input data. We then describe the proposed two-stream framework in Section III-B. Most CNN designs, such as AlexNet [18], VGG-16 [35], and GoogLeNet [36], are comparatively ''shallow'' for generating saliency maps, whereas the residual network builds a much deeper architecture on the popular CNN model for saliency prediction [2]. At the time of its introduction, ResNet [9] was among the deepest architectures presented for classification in vision. It won first place on the tasks of ImageNet [33] detection, ImageNet localization, COCO detection, and COCO segmentation in the ILSVRC & COCO 2015 competitions [9]. Deeper networks exhibit a degradation problem, a loss of information that reduces training accuracy [2], [9]. Therefore, to train faster and to construct a really deep network, ResNet [9] introduces two ideas: a stack of building blocks with similar connection shapes, and a new skip connection approach [9], [10].
These building blocks, known as ''Residual Units'' in [10], make the Residual Network (ResNet) model [9] easier to optimize than plain deep learning models. This is due to identity mapping in the feed-forward network, implemented as a shortcut connection that skips some layers and adds its input to the output of the stacked layers. The skip connection is an information compensation strategy that carries information from an earlier layer at the same scale to compensate the features of the current layer [2].
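The shortcut connection described above can be illustrated with a minimal sketch. It is not the authors' implementation: for brevity it replaces the convolutions of a real residual unit with two dense (matrix) layers, and the names `residual_unit`, `w1`, and `w2` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between.

    Dense layers stand in for the convolutions of a real residual unit
    (an assumption made only to keep the sketch short).
    """
    fx = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(fx + x)      # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_zero = np.zeros((8, 8))

# With zero weights the residual branch vanishes, so the unit only passes
# its input through the shortcut (followed by the final ReLU).
y = residual_unit(x, w_zero, w_zero)
```

Because the shortcut carries the input forward unchanged, a unit whose residual branch has learned nothing still behaves like an identity mapping, which is one intuition for why very deep stacks of such units remain trainable.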

B. TWO STREAM VISUAL SALIENCY NETWORK
Our design is inspired by ''Salicon'' [11], a pioneering effort to train a DNN model for ''Visual Saliency Prediction''. Our main focus is on making visual saliency prediction usable in applications. The architecture computes saliency predictions for images using ResNet-50 [9], which is pretrained on ImageNet [33] for object classification. We therefore propose a trainable end-to-end two-stream ResNet-50 [9] framework that addresses the fixation map problem and allows the parameters of the pretrained ResNet [9] to be learned by back propagation to optimize saliency. There are several ways to integrate two-stream data, ranging from early fusion to late fusion [39].
To achieve this, we design a very simple two-stream architecture with 49 convolutional layers in each stream; the streams are fused at the end to combine the extracted features and reduce the semantic gap for saliency maps.
Consequently, one ResNet-50 [9] generates the high-scale stream R_H and another ResNet-50 [9] generates the low-scale stream R_L. The detailed architecture is displayed in Fig. 3. The two streams are fed with input images of dimensions ''1000 × 800 × 3'' and ''500 × 400 × 3''. The first two dimensions index the spatial location of the neuron's receptive field, and the third indexes the templates to which the neuron is tuned [11]. Because the two streams share the same filters at different scales, their neurons are tuned to detect the same patterns. The model contains 99 ''Convolutional'' layers in total, two ''Max Pooling'' layers, and one ''Concatenate'' layer. First, ResNet-50 is employed to obtain the initial features by initializing the first 30 layers from the ResNet-50 [9] pretrained on the ImageNet [33] dataset. We then modify some of its parameters to capture saliency. The parameters of the model architecture are shown in Fig. 4. One ''Max Pooling'' layer is used after the first convolutional layer with pad = '0' and stride = '3'. When RGB images are resized to the high (1000 × 800 × 3) and low (500 × 400 × 3) scales, the responses of the two streams after the second-to-last convolutional layer of the ''Conv5'' block have spatial dimensions of ''35 × 48'' and ''17 × 24'', respectively, and both streams have a third dimension of 2048 at this level.
Note that R_L has half the spatial resolution of R_H at the second-to-last convolutional layer of the ''Conv5'' block. Next, the output of the low-scale residual network is upsampled to ''35 × 48'' with linear interpolation to match the spatial resolution of the high-scale residual network. The responses of the two residual networks are then combined to create the saliency maps. The last ''Max Pooling'' layer is used with stride = '1' and pad = '0' to capture the global features. We then add a last convolutional layer to learn the global visual contrast information. This convolutional layer uses a single filter with padding = ''0'' and stride = ''0'', and it determines whether the responses in the last layer correspond to the salient region, yielding the saliency map. This layer produces an output resolution of ''37 × 52''. Finally, we resize the ground truth maps to match the size of our network output.
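The fusion step described above (upsample the low-scale response, concatenate along the channel axis, apply a single-filter convolution) can be sketched as follows. This is an illustrative sketch under several assumptions that are not taken from the paper: the channel count is reduced from 2048 to 8, the weights are random, the final ''Max Pooling'' layer is omitted, the single filter is modelled as a 1 × 1 convolution followed by a sigmoid, and the interpolation aligns the corner pixels. The names `resize_bilinear` and `fuse_streams` are hypothetical.

```python
import numpy as np

def resize_bilinear(feat, out_h, out_w):
    """Bilinearly resize a (H, W, C) array to (out_h, out_w, C)."""
    in_h, in_w, _ = feat.shape
    ys = np.linspace(0, in_h - 1, out_h)   # source row coordinate per output row
    xs = np.linspace(0, in_w - 1, out_w)   # source column coordinate per output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :, None]          # horizontal interpolation weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bottom = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def fuse_streams(r_high, r_low, weights, bias=0.0):
    """Upsample r_low to r_high's size, concatenate, and apply a 1x1 filter."""
    up = resize_bilinear(r_low, r_high.shape[0], r_high.shape[1])
    fused = np.concatenate([r_high, up], axis=-1)   # (H, W, 2C)
    logits = fused @ weights + bias                 # single-filter 1x1 convolution
    return 1.0 / (1.0 + np.exp(-logits))            # sigmoid (assumed output activation)

rng = np.random.default_rng(0)
channels = 8  # illustrative only; the text reports 2048 channels per stream
r_high = rng.standard_normal((35, 48, channels))
r_low = rng.standard_normal((17, 24, channels))
weights = rng.standard_normal(2 * channels) * 0.1
saliency = fuse_streams(r_high, r_low, weights)
print(saliency.shape)  # (35, 48)
```

A 1 × 1 filter over the concatenated channels is the simplest way to let one output value per location depend on both scales; the network described in the text additionally pools before its last convolution, which this sketch leaves out.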

IV. EXPERIMENTAL DETAILS
Extensive experiments have shown that saliency algorithms typically do not detect singleton targets in natural images adequately. Our framework can therefore be extended simply to incorporate a variety of prior knowledge for visual saliency detection. All experiments are conducted on four commonly used datasets: ECSSD [31], HKU-IS [20], PASCAL-S [37], and DUT-OMRON [14].

A. MODEL TRAINING
We implement the proposed model in PyCaffe, using ResNet-50 [9] pretrained on ImageNet [33] as the base model to extract early features. The four datasets are used for further training at the high and low scales of the input images. During training, we fine-tune on the training images of each of the four datasets separately, with a momentum of 0.9 and a weight decay of 0.0005, until the training loss converges. Training runs for 80 epochs with ground truth fixation masks. The fine-tuning of the ResNet-50 [9] model for visual saliency over 80 epochs is shown in Fig. 5. The learning rate of the first 30 convolutional layers is set to 0, and the learning rate of the remaining convolutional layers is set to 0.0001. The network parameters are optimized with the ''Adam optimizer'' and a batch size of 16. Visual saliency detection can be treated as a binary prediction problem; thus we use binary cross entropy as the loss function. We trained the system at PUCIT on an NVIDIA Titan GPU with 12 GB of memory, and training time differed across the four datasets.
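The binary cross entropy loss mentioned above can be written out as a short sketch. It assumes predictions and ground truth masks are arrays with values in [0, 1]; the function name and the clipping constant `eps` are illustrative choices, not details from the paper.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean pixel-wise binary cross entropy between a predicted map and a mask."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    target = np.asarray(target, dtype=float)
    loss = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return float(np.mean(loss))

# An uninformative prediction of 0.5 everywhere costs log(2) per pixel.
uniform_loss = binary_cross_entropy(np.full((2, 2), 0.5), np.array([[1, 0], [0, 1]]))
```

The loss is smallest when salient pixels receive values near 1 and background pixels values near 0, which is what makes it a natural fit for treating saliency as a per-pixel binary decision.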

B. DATASETS
As more models have been proposed in the literature, more datasets have been introduced to advance saliency detection models, yet more datasets are still required. Some widely used datasets play an important role in visualizing the most prominent object. Different benchmarks have used various datasets for assessing visual saliency for salient objects and for evaluating the performance of saliency generation models. In this work, we evaluate the proposed visual saliency model on the most influential datasets, ECSSD [31], HKU-IS [20], PASCAL-S [37], and DUT-OMRON [14], which are commonly used in earlier work on saliency fixation.
PUCIT: Punjab University College of Information Technology, Lahore, Pakistan.

The PASCAL-S [37] dataset contains 850 natural images from the validation set of the PASCAL VOC 2010 dataset with full segmentation ground truth, and it has 8 viewers who explored the images under free viewing. For each input image, viewers were asked to identify salient objects by clicking, with no time limit and no constraint on the number of objects they could choose.
The HKU-IS [20] dataset contains 4447 complex images, many of which contain multiple disconnected salient objects at diverse spatial locations. This dataset is challenging because the background and foreground look similar.
The ECSSD [31] dataset contains 1000 challenging images with diverse patterns in both foreground and background. It is a structurally complex scene dataset of natural images for saliency detection with corresponding ground truth masks, which were produced by five helpers.
The DUT-OMRON [14] dataset contains 5168 complex images with pixel-wise ground truth masks of salient objects. It is a diverse dataset that consists of images with a side length of 400 pixels.

C. EVALUATION METRICS
In this section, we discuss the three criteria used to evaluate the performance of our proposed model: maximum F-measure (MaxF_β), mean absolute error (MAE), and the precision-recall (PR) curve. The PR curve evaluates the estimated saliency map at thresholds ranging from 0 to 255. At each threshold, the saliency map is converted into a binary map, and its precision and recall are obtained by comparing the binarized map with the ground truth mask. Repeating this comparison at every threshold value produces the PR curves for the four datasets.
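The thresholding procedure above can be sketched for a single image as follows. The sketch assumes an 8-bit saliency map and a boolean ground truth mask; the use of `>=` for binarization and the function name `pr_curve` are assumptions, since the text does not specify them.

```python
import numpy as np

def pr_curve(saliency, mask):
    """Precision and recall of a uint8 saliency map at thresholds 0..255."""
    mask = np.asarray(mask).astype(bool)
    positives = max(int(mask.sum()), 1)          # guard against an empty mask
    precision, recall = [], []
    for t in range(256):
        binary = saliency >= t                   # binarize at this threshold
        tp = int(np.logical_and(binary, mask).sum())
        precision.append(tp / max(int(binary.sum()), 1))
        recall.append(tp / positives)
    return np.array(precision), np.array(recall)

saliency = np.array([[0, 255], [128, 64]], dtype=np.uint8)
mask = np.array([[False, True], [True, False]])
precision, recall = pr_curve(saliency, mask)
```

In a dataset-level evaluation the per-image precision and recall values would typically be averaged at each threshold before plotting; that averaging step is omitted here.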
The F-measure is the weighted harmonic mean of average precision and recall. It is based on pixel-wise error and evaluates overall performance [12].
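As a sketch, the F-measure can be computed from precision P and recall R as F_β = (1 + β²)·P·R / (β²·P + R), and the maximum F-measure is the largest value over all thresholds. The weight β² = 0.3 used below is a common convention in salient object detection and is an assumption here, because the text does not state the value used; the function names are hypothetical.

```python
import numpy as np

def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall (beta_sq is beta squared)."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = beta_sq * precision + recall
    safe = np.where(denom > 0, denom, 1.0)       # avoid division by zero
    return np.where(denom > 0, (1.0 + beta_sq) * precision * recall / safe, 0.0)

def max_f_measure(precision, recall, beta_sq=0.3):
    """Largest F-measure over all points of a PR curve."""
    return float(np.max(f_measure(precision, recall, beta_sq)))

# Two operating points of a PR curve: (P=0.5, R=1.0) and (P=1.0, R=0.5).
best = max_f_measure([0.5, 1.0], [1.0, 0.5])
```

With β² below 1, precision is weighted more heavily than recall, so the second operating point scores higher than the first even though the two are symmetric.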
Mean absolute error (MAE) is the average absolute difference between the estimated saliency map and the ground truth saliency map. It often ignores structural similarities [1].
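A minimal sketch of MAE follows, assuming that both the saliency map and the ground truth mask have been normalized to the range [0, 1]; the function name is illustrative.

```python
import numpy as np

def mean_absolute_error(saliency, mask):
    """Average absolute per-pixel difference between two maps in [0, 1]."""
    s = np.asarray(saliency, dtype=float)
    g = np.asarray(mask, dtype=float)
    return float(np.mean(np.abs(s - g)))

# Two pixels are exact and two are off by 0.5, so the mean error is 0.25.
mae = mean_absolute_error([[0.0, 1.0], [0.5, 0.5]], [[0, 1], [1, 0]])
```

Because every pixel contributes independently, two maps with the same MAE can differ greatly in how well they preserve object shape, which is the limitation noted above.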
To validate the efficiency of our model, we perform several experiments, which show that our rich hierarchical model exploits representations at both the pixel and the semantic level to learn visual saliency, and that these representations can be used to recover local details. The results of these experiments show that the saliency optimization process generates saliency maps of better quality [45]. In general, the accuracy of the model improves after multilevel feature fusion between the low and high scales, as our analysis on the ECSSD [31], HKU-IS [20], PASCAL-S [37], and DUT-OMRON [14] datasets shows. DUT-OMRON [14] is the largest and most challenging of the four datasets; its large number of complex scenes makes it a difficult test of model performance [8], [14]. The PR curves show a clear and comparatively narrow distribution of precision and recall points when a binary cross entropy loss function is used [43]. As a result, the produced saliency maps are sensitive to the binary thresholds, which produces smooth PR curves. Our model shows higher performance and better PR curves, especially on the ECSSD dataset. However, the curve of SMD [41] drops faster than those of the closest methods, ELD [19] and MSI-CNN [26], on all datasets used. According to our observations, the multi-scale fusion strategy plays an important role in model performance, and the quantitative results can be further improved through factors such as the number of layers, the image dimensions, and the hyper-parameter values. Fig. 6 shows the comparative results of our method in terms of PR curves on the four commonly used datasets.
We compare our method with MGCC [1], FSN [8], MSA-CNN [23], SMD [41], ELD [19], MSI-CNN [26], JLSD [27], CNET + PNET [24], and BENDer#1 [43] on four commonly used datasets. We choose these methods because they are CNN-based, recognized as benchmarks, and recently developed. As illustrated in Table 3, after the two-scale fusion of residual global features from each stream, our model performs significantly better than 7 of the other methods, while JLSD [27] performs on par with our method. From Table 3 and Fig. 6, we can see that our model achieves considerable performance over all four datasets, including the lowest MAE and the highest maximum F-measure (MaxF_β) after the two-scale fusion of residual global features from each stream. We observe that our trained model yields stable values of maximum F-measure but less stable values of MAE. In addition, it has lower MAE values on the PASCAL-S [37] and ECSSD [31] datasets, which is better, and it ranks second in terms of MaxF_β on all four datasets compared to the other seven models, while JLSD [27] and BENDer#1 [43] are very close to our model. Overall, our method shows encouraging results across the performance metrics.
Qualitative Comparison: We selected ResNet-50 [9] for its stronger generalization ability, which stems from the large number of operations applied to deep features, to obtain improved performance [44]. One stream produces pixel-level visual saliency maps and the other produces full-resolution semantic-level visual maps [42]. By using ResNet-50 [9], we obtain improved saliency maps. The improvement comes from fusing the results of the two streams, which highlight the fixation and semantic levels, but obtaining the best-quality saliency scores remains a big challenge. Fig. 7 compares our model's saliency maps with the results of six other saliency prediction models, as provided by their authors. These state-of-the-art models are Gaze-II [21], Salicon [11], LSTM (SAM) [28], Deep CNN [32], ELM [40], and Mr-CNN [29]. Our model can predict the correct salient object region even in complex scenes and in images of humans and animals. It can also detect significant regions when clutter and unrelated background are present in the images. Our model indicates the most prominent object well and also achieves encouraging results for objects of different sizes.
TABLE 3. Performance of the proposed method and 6 other state-of-the-art approaches on four commonly used datasets. Red, blue, and green indicate the best, the second best, and the third best results in terms of maximum F-measure (''↑'' means larger is better) and MAE (''↓'' means smaller is better); ''-'' means not reported.

VI. CONCLUSION
Visual saliency map generation has recently become a useful area of study for image and video applications. To build a visual saliency prediction system, we fine-tune the ResNet-50 [9] model that is pretrained on ImageNet [33]. We perform various experiments on a re-architected ResNet-50 [9] model in the form of two streams. The two streams are fed with input images at low and high scales to produce the saliency prediction. In the future, this model can be tested with a larger number of layers to obtain better results.

CONFLICT OF INTEREST
None of the authors have a conflict of interest related to the research and results presented in this paper.

DATA AVAILABILITY STATEMENT
The datasets used in the experiments and discussed in the paper will be available if required.