SATSal: A Multi-level Self-Attention Based Architecture for Visual Saliency Prediction

Human visual attention modelling is a persistent interdisciplinary research challenge that has gained new interest in recent years, mainly due to the latest developments in deep learning. This is particularly evident in saliency benchmarks. Novel deep learning-based visual saliency models show promising results in capturing high-level (top-down) human visual attention processes. They therefore differ strongly from earlier approaches, which were mainly characterised by low-level (bottom-up) visual features. These developments account for innate human selectivity mechanisms, which rely on both high- and low-level factors that interact with each other. Motivated by the importance of these interactions, in this project we tackle visual saliency modelling holistically, examining whether both the high- and low-level features that govern human attention can be considered together. Specifically, we propose a novel method, SAtSal (Self-Attention Saliency). SAtSal leverages both high- and low-level features using a multilevel merging of skip connections during the decoding stage. We incorporate convolutional self-attention modules on the skip connections from the encoder to the decoder network to properly integrate the valuable signals from multilevel spatial features. Thus, the self-attention modules learn to filter out the latent representation of the salient regions from other irrelevant information, in an embedded and joint manner with the main encoder-decoder backbone. Finally, we evaluate SAtSal against various existing solutions to validate our approach, using the well-known standard saliency benchmark MIT300. To further examine SAtSal's robustness on other image types, we also evaluate it on the Le Meur saliency painting benchmark.


I. INTRODUCTION
Visual attention consists of perceptual and cognitive mechanisms that empower humans to rapidly select and interpret the most interesting parts of a complex visual scene. For human information processing, selective mechanisms associated with attention work as a "data preprocessing bottleneck". Often, the selective mechanisms are a result of so-called bottom-up processes, in which the viewer is guided by perceptual signals and analyses the surroundings with no conscious intentions [72] [7]. However, cognitively-driven top-down mechanisms are equally important in the way humans direct their attention to selected elements, whether they are visual, auditory, olfactory or otherwise [20].
In addition to the top-down vs bottom-up dichotomy, the visual attention literature distinguishes overt from covert attention. Covert attention operates when the eyes are not moving because they are focused on a specific fixation point (one might intentionally pay attention to peripheral information without moving the eyes). Overt attention, on the other hand, relies on eye movements shifting from one location to another in a given visual scene; foveal processing enables capturing high levels of detail from objects of interest while suppressing the context information into a low-resolution, low-colour processing mode [68] [50]. The scene is constantly analysed through rapid eye movements, i.e., saccades, by which visual attention processes rapidly scan new objects of interest. Such mechanisms help humans prioritise and filter stimuli from the early visual processing stages to later stages where higher-level cognitive processing can occur. The human ability to detect the saliency of objects enables efficient scene scanning, and as such, it is one of the fundamental attention mechanisms [49].
Accordingly, one visual attention modelling task is known as saliency prediction [48]. Saliency prediction deals with detecting the most attention-grabbing regions of a given visual scene from a bottom-up perceptual perspective. For a given input, be it an image or a video sequence, saliency prediction algorithms encode each pixel of the visual scene with an intensity value in [0,255] or [0,1], indicating the probability that the pixel is salient [58]. The corresponding map returned by saliency algorithms is known as a saliency map. The dominant understanding in the field is that the higher the saliency value, the more likely observers' eye movements will be drawn to that area in the image or video frame, assuming that there are no top-down cognitive or task-driven biases. Saliency maps are usually visualized as blobs distributed around the regions that naturally stand out (or pop out) of the visual scene. Therefore, saliency maps are typically represented as density maps (or heat maps) of probabilities. The accuracy of the predicted saliency of a given scene is measured against the recorded eye movements on the same scene. This relies on the understanding that eye movements and visual attention are closely related [60]. Examining statistics on different levels of visual saliency enables an in-depth understanding of the processes that govern human attention and, by extension, human behaviour. Due to its broad relevance, predicting human eye movement patterns and visual saliency has an impressive range of applications in computer vision and related fields such as image compression [35], image captioning [22], image retrieval [31], image re-targeting [61], quality assessment of multimedia content (i.e. image [3], [17], [18], stereo [62], 3D meshes [?], [1], etc.), remote sensing [30], watermarking [34], map viewing [51] [5], indoor localization [29], perception [14], image enhancement [12], [19], and healthcare [38], among many others.
Saliency prediction links to the pioneering work by Treisman and Gelade [72] on the feature integration theory. According to Treisman and Gelade [72], early visual features are registered as viewers perceive a visual scene and are then combined into a complete object-based perceptual identification. The theory also introduces the so-called preattentive and attentive stages, corresponding to bottom-up and top-down information processing. Being able to separate and organize the visual information hierarchically based on its perceptual saliency and its importance paves the way for mimicking human attention in mathematical models and makes feature-based models eligible for predicting salient regions from stimuli that are viewed freely (not task-driven).
In terms of algorithmic developments, the seminal work by Koch and Ullman [48] establishes the basis of central saliency and incorporates low-level information. Building on this model, Itti et al. [40] proposed the first computational model of saliency. From this point forward, many biologically inspired computational architectures have been proposed, such as graph-based visual saliency, or GBVS [36]. Several computer vision and image analysis methods focus on the extraction of low-level and high-level features to model and detect objects in images and videos [4], [11]. The ever-increasing success of machine learning and deep learning approaches in vision-related computational processing has allowed a critical development for saliency prediction over the last few years. Convolutional Neural Networks (CNNs) have reached high accuracy rates in learning complex semantic representations from large-scale image recognition datasets [52]. Due to the advance of deep learning techniques in mimicking human behaviour, recent CNN-based saliency architectures have remarkably reduced the gap between human eye movements (the typical baseline for saliency studies) and the performance of prediction models. Current CNN-based models focus on high-level, semantically informative representations because low-level features contain a high ratio of noise signals that are not semantically helpful [28]. Another main drawback of relying only on these deep hierarchical CNN representations is the limited receptive field, which is proportional to the depth of the layer in the network. The consistency of scene semantics has an effect on eye movements, i.e., the eye tends to remain fixated longer on objects that are semantically informative regarding a scene's content [37] [71]. If a visual scene contains too many objects, the representational inconsistency of scene semantics increases, which is highly correlated with human eye movements during free viewing.
Given the above, predicting object characteristics links to the accuracy of the visual attention model. Despite the availability of several "deep saliency models" (i.e., deep learning-based saliency models), much of the knowledge in psychology and neuroscience describing various aspects of human visual attention has not been adequately tackled yet [49] [9]. In some tasks, even the traditional saliency models (those proposed before deep-learning models) offer decent saliency predictions, which can be superior to psychophysical evaluations [49], possibly explained by the importance of the low-level features in an image in detecting the early fixations [27]. Both high and low-level features may serve a purpose in the way humans process visual information. Inspired by the previous studies considering human attention as a multilevel selection process [45], we integrate low and intermediate-level feature mapping to leverage the discriminant part of both low-level and deep semantic features to propose a new saliency model. To implement and test the proposed model, we incorporate convolutional Self-Attention modules on skip connections from the encoder to the decoder architecture opposite layers. As a result, the proposed model can effectively predict visual saliency patterns from multilevel contextual scene representations and overcomes the limitation of narrow receptive fields by employing the ability of Self-Attention to capture the context from an extended range of sequence dependencies.
The main contributions of this publication are summarised as follows:
• We develop a novel approach for visual saliency prediction using both high- and low-level factors in learning multilevel features for producing static saliency maps. In addition, a self-attention module is incorporated on the encoder-decoder skip connections to boost the global information in the deep layers and generate a highly representative saliency distribution.
• We evaluate the effectiveness of our model on the established MIT300 benchmark and the Le Meur [57] paintings dataset. All comparisons demonstrate that the proposed model is consistent, efficient, and superior to or competitive with other state-of-the-art methods.
• We further test the robustness of the proposed approach in an ablation study on challenging scene samples that include both high- and low-level features. The results reveal that multilevel attentive skip connections are effective and boost the performance of the backbone encoder-decoder model.
The rest of the manuscript is organized as follows: Section II introduces an overview of related saliency literature. Section III provides a detailed description of the proposed approach. Section IV demonstrates the benchmarking experiments and compares results to state-of-the-art methods. Finally, conclusions and outlook are given in Section V.

II. RELATED WORK
Visual attention modelling has been a topic of interest to computer vision for many decades, starting from the seminal work by Koch and Ullman [48], which was then implemented by [40] as a bottom-up model that predicts saliency using multi-scale low-level features. On top of the previous efforts, the GBVS [36] framework extracts image features to predict saliency using graph theory-based formulas that define Markov chains over different input maps. Zhang et al. [56] proposed a Bayesian framework tackling bottom-up saliency as self-information over linear visual features and the overall saliency as the point-wise mutual information between features and target. Around the same time, Bigdely-Shamlo et al. [63] proposed a method to detect visual saliency relying on the Kalman filter. Achanta et al. [2] extracted salient pixels in images using features of colour and luminance in the Fourier domain. Colour spaces and their role in saliency extraction were further investigated later by several researchers, e.g., [13].
The approaches mentioned above focus on visual attention using low-level spatial features. Sun and Fisher [71] introduced a hierarchical object and location-based visual attention model using a grouping-based salience. They treat complex visual tasks that depend on the current scene and the observer's goals, thus introducing a top-down cognitive aspect to saliency prediction. Integrating another feature of human cognitive processing, Jin-Gang and Gui-Song [41] presented an object-based saliency detection with a paradigm based on the Gestalt grouping cues. Kai-Yueh et al. [44] introduced a model that relies on the relationships between saliency and "objectness", a concept in which a scene element is ranked for its meaning.
Recently, saliency modelling has gained remarkable performance by applying deep learning techniques that can learn top-down representations. This was made possible by the construction of large-scale eye movement datasets such as [43], [42], [6], [39] and [15]. The eye movement datasets mentioned above were collected using free-viewing eye-tracking sessions. Free viewing differs from task-driven scenarios, where the variability of tasks could result in an unbalanced specificity of eye movements toward visual features related to the tasks. Among the deep learning approaches to saliency prediction, Vig et al. [73] proposed the Ensemble of Deep Networks (eDN), an early shallow CNN architecture that learns end-to-end saliency by merging different layer feature maps. The achieved performance did not mark an important leap in results, as shallow networks cannot learn high-level features.
Many recent methods leverage classification architectures pre-trained on the ImageNet dataset [24]. These architectures have a superior ability to extract deep semantic representations from images [26].
As deep learning-based models started to populate the scene, Oyama and Yamanaka [64] explored the influence of the classification accuracy of the models on saliency estimation. The well-known DeepGaze1 by Kümmerer et al. [53] re-employs an early light object recognition network to explore the limits of deep learning in saliency prediction. Kümmerer et al. later introduced a new network, namely DeepGaze2 [55], based on the VGG [70] classification network. Both DeepGaze1 and DeepGaze2 use fixed prior maps to regulate the possible biases in the data. Pan et al. [66] compare two approaches for predicting saliency in an end-to-end fashion. The first one relies on a lightweight network whose parameters are learnt from scratch. The second one is deeper and takes advantage of a pre-trained image classification network. In contrast, Cornia et al. [21] propose an architecture that extracts and combines features from multiple levels, then uses a learned prior to tackle the bias of the dataset and introduces a new custom pixel-based loss function. Huang et al. [39] use a two-stream VGG network in their SALICON model with different input scales, in which the feature maps of both output streams are concatenated to model the final saliency. DVA [74] learns multi-level information from different layers with different receptive field sizes; its decoders are composed of a series of deconvolution layers with upsampling operations, and the resulting multi-level maps are fused to produce the final saliency map. SalGAN [65] uses a deep convolutional generative adversarial network (GAN). The model contains a saliency generator and a discriminator network, which compete in a min-max game to produce a saliency map that is qualitatively indistinguishable from the ground truth based on eye movement recordings.
Liu and Han [59] propose a deep spatial contextual long-term recurrent convolutional network that learns local features on each image location in parallel via fine-tuning a pre-trained CNN model. Afterwards, the model simultaneously learns to incorporate global scene context to predict saliency. Cornia et al. [23] introduce a set of prior maps generated by a Gaussian function, together with a neural attention mechanism and convLSTM (convolutional LSTM) layers on feature maps, to refine the predicted saliency maps iteratively. Most recently, [46] proposed SALYPATH, an architecture to simultaneously predict saliency and the associated scan-path, using a combined loss function that uses pixel-level and distribution functions and the Normalized Scanpath Saliency (NSS) [67] metric. Table 1 provides a summary overview of the most prominent methods mentioned above in terms of their major contributions to predicting saliency on natural images.

Models | Major contribution to predict saliency maps
SAM-ResNet [23] | ResNet and RNN to refine the features
SalGAN [65] | Trains a deep convolution generative neural network
DVA [74] | Combination VGG high level features
SAM-VGG [23] | VGG and RNN to refine the features
ML-Net [21] | High multilevel VGG features and learnable bias
DeepGaze [53] | VGG and fixed prior maps to regulate possible range of biases
SALICON [39] | Two streams of VGG network on two different scales

III. PROPOSED MODEL
In light of the prior work, this section describes the architecture of the proposed model, as illustrated in Figure 1. It consists of a VGG encoder network and a decoder network composed of five deconvolution blocks interspersed with extended convolutional self-attention (ECSA) modules. Each ECSA module takes as input the hidden vector decoded by the previous block and, through a skip connection, the features of the opposite encoder layer; the ECSA output is fed into the next decoder block, and a saliency map is produced at the final stage. The motivations behind the proposed model are explained below:
• CNNs can only process features in a local neighbourhood, which makes them inefficient at modelling long-range, multilevel dependencies across spatial regions. Incorporating self-attention modules that connect the multilevel encoder and decoder networks instead yields robust predictions: fine details at every position are properly considered together with others in distant portions of the low-level attention from early layers. This also helps to overcome the limited receptive field issue in learning object-based correlations from deep semantic representations.
• Most recent studies consider the high-level features, as they focus on solving the complex top-down attention problem, while underestimating the importance of capturing global and local low-level attention. Multilevel skip connections help in leveraging the deep semantic representations from the last decoder layers and the simple, attractive structural features from the first encoder layers, which yields better, more representative saliency distributions than those obtained by examining only high-level features.

A. EXTENDED CONVOLUTIONAL SELF-ATTENTION
The so-called integrated attention mechanism has recently shown important improvements in the performance of various downstream computer vision tasks [16] [25] [46]. Unlike the absolute attention mechanisms, the mechanism mentioned above learns in a fully adaptive, joint, and task-oriented manner, which allows the network to prioritise and associate weights to feature vectors. Self-attention, or intra-attention, calculates the response at a position in a vector by attending to all positions within the same vector. In greater detail, the self-attention module draws the relationship between distant features, and we incorporate the module at multilevel connections between the encoder and decoder network layers. This prompts the model to generalise better over static visual attentive cues at low and high levels, boosting the representation capability of the full network. The prediction performance of this design generalises well across various static saliency datasets (Fig. 1).
The main goal of self-attention is to determine a new set of vector values representing the global dependencies between vector features. Thus, self-attention reveals which values deserve more attention given the interactions among the input vector features. In simpler words, for a given vector, we need to extract query, key and value vectors from it, simulating the selection process applied in retrieval systems. Such a system measures attention by calculating the similarity between a query and the best related key features using a score function. The output scores go through a normalisation step so that the probability values sum to one. The final value vector is a weighted combination of the previous value vectors based on the normalised scores. The overall architecture of the proposed extended self-attention is described in Figure 2.
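The retrieval analogy above can be illustrated with a minimal sketch for a single query. It is not the authors' implementation: the function names `softmax` and `attend`, the dot product as score function, and the toy vectors are assumptions chosen only to show the score, normalise, and combine steps.

```python
import math

def softmax(scores):
    """Normalise raw scores so they are positive and sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query against key/value pairs."""
    # Score function (assumed here): dot-product similarity with each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Output: weighted combination of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical toy data: three positions with two-dimensional features.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([2.0, 0.0], keys, values)
```

The first and third keys are equally similar to the query, so they receive equal weights, while the second key, which is orthogonal to the query, contributes least to the output.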
In Equation 1, the hidden, encoded features of the $i$th VGG encoder block are given as a function of the input image $X \in \mathbb{R}^{256 \times 128 \times 3}$:
$$H_i = f_{0-i}(X), \qquad (1)$$
where $f_{0-i}$ is the $i$th VGG encoder block, and $h_i$ and $w_i$ are the down-sampled input height and width after the $i$th max-pooling operation. The exception is $i = 5$, which denotes the last encoder output $H_5 = Z_0$, where max-pooling is not applied. We denote the decoded variable after the $i$th block by $Z_i$. Each ECSA module is placed just before a decoder block, takes both $Z_i$ and $H_{5-i}$ as input, and transforms the intermediate features $H_{5-i}$ into three variables $Q$, $K$ and $V$. Unlike [75], which uses a single layer of $1 \times 1$ convolution without an activation function, we extend our implementation with a shallow series of activated CNNs interspersed with down-sampling and up-sampling operations. The ECSA module architecture differs slightly depending on the position $i$ of the block: because we extract an attention vector at an encoder layer and inject it into the decoder, the two vector spaces are not similar, so a deeper transformation needs to be applied, while keeping in mind the computational efficiency of the whole architecture. The resulting pair of feature spaces $(\text{Query}, \text{Key}) \in \mathbb{R}^{\bar{C} \times N}$ is obtained from $Q(H_i)$ and $K(H_i)$, respectively, with the dimensions of $H_i$ simplified to $H_i \in \mathbb{R}^{C \times N}$, where $N = h_i \times w_i$ is the number of feature locations and $\bar{C} = C/8$ is the number of output channels of both the $Q$ and $K$ streams.
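To make the sizes in this paragraph concrete, the following sketch computes the spatial grid of each $H_i$ and the resulting number of attention entries $N \times N$. It assumes that every max-pooling halves both sides exactly and that block 5 skips pooling, as stated above; the helper name `encoder_grid` is hypothetical.

```python
def encoder_grid(i, height=256, width=128):
    """Spatial size of H_i, assuming each max-pooling halves both sides
    and the fifth block applies no pooling."""
    pools = min(i, 4)
    return height >> pools, width >> pools

for i in range(1, 6):
    h, w = encoder_grid(i)
    n = h * w  # number of feature locations N = h_i * w_i
    print(f"H_{i}: {h}x{w}, N={n}, attention entries N*N={n * n}")
```

Under these assumptions, the earliest block would already require an attention map with tens of millions of entries, which is consistent with the down-sampling operations inside the ECSA modules being motivated by computational efficiency.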
The attention map is obtained by normalizing the dot product between the Query and Key vectors with a softmax function, where $S$ represents the similarity between the Query and Key feature spaces:
$$S = \text{Query}^{\top}\,\text{Key}, \qquad A = \operatorname{softmax}(S).$$
The attention map $A \in \mathbb{R}^{N \times N}$ shows the likelihood that a particular positional feature at the $l$th location appears at the $j$th location among the $N$ feature locations, with $j, l \in \{1, \dots, N\}$. The Value feature space is further enhanced by multiplying it by the attention map:
$$\text{Value}_{enhanced} = \text{Value} \cdot A.$$
The dimension of the context vector, which is the enhanced Value feature space, is equal to that of $Z_i$. Moreover, we scale the context by a learnable parameter $\gamma$ in order to learn how much the decoder network should rely on the context from the encoder features at each stage. Finally, we add it to the decoded variable $Z_i$:
$$Z_i \leftarrow Z_i + \gamma \cdot \text{Value}_{enhanced}.$$
Furthermore, our approach offers flexibility in that it has no restriction regarding receptive field dimensions, using the capability of self-attention to capture distant features, i.e., the system can theoretically work with input images of any width and height.
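The computation described in this paragraph can be sketched on small matrices as follows. This is an illustrative sketch under stated assumptions rather than the ECSA module itself: Query, Key and Value are taken as given (in the model they come from small convolutional stacks), the softmax is assumed to normalise each column of the similarity matrix, and the helper names are hypothetical. A practical implementation would use tensor operations instead of nested lists.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def column_softmax(s):
    """Softmax over each column, so every column of the result sums to one
    (the normalisation axis is an assumption of this sketch)."""
    cols = []
    for col in transpose(s):
        peak = max(col)
        exps = [math.exp(x - peak) for x in col]
        total = sum(exps)
        cols.append([e / total for e in exps])
    return transpose(cols)

def ecsa_residual(query, key, value, z, gamma):
    """query, key: C_bar x N; value, z: C x N.
    Returns z + gamma * (value @ A) with A = softmax(query^T key)."""
    similarity = matmul(transpose(query), key)  # N x N
    attention = column_softmax(similarity)      # N x N
    context = matmul(value, attention)          # C x N
    return [[zv + gamma * cv for zv, cv in zip(z_row, c_row)]
            for z_row, c_row in zip(z, context)]

# Hypothetical toy shapes: C_bar = 1, C = 2, N = 3.
query = [[1.0, 0.0, -1.0]]
key = [[0.5, 1.0, 0.0]]
value = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
z = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
out_zero = ecsa_residual(query, key, value, z, gamma=0.0)
out_one = ecsa_residual(query, key, value, z, gamma=1.0)
```

With `gamma=0.0` the decoded features pass through unchanged, which corresponds to a decoder that ignores the encoder context; larger values of the scale let more of the attended encoder features through.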

A. EXPERIMENTAL SETUP
1) Loss Function and Training Stage
The whole model, including the encoder-decoder and the ECSA modules, was trained using the loss function denoted as $L$. The loss $L$ is defined as a combination of the Kullback-Leibler Divergence (KLD), the Normalized Scanpath Saliency (NSS), and the Binary Cross-Entropy (BCE). Each term (KLD, NSS, BCE) covers a particular aspect of learning the best set of weights [10]. Specifically, KLD evaluates the divergence between the predicted output distribution and the ground-truth distribution, BCE is used for binary classification of each CNN output vector independently, and NSS [67] provides a saliency metric that measures the mean saliency value at ground-truth fixation locations. We detail each term and its contribution to the loss $L$ below. Our rationale for using a combination of metrics as the objective function is that it steers the optimisation toward weights that capture more accurate and representative saliency distributions.
BCE is mainly designed for calculating the distance between two normalized distributions in the interval [0, 1].
In probabilistic terms, BCE measures the accuracy of the modeled probability distribution of saliency for a given input image pixel.
$L_{KLD}$ has been widely used for training saliency models, as KLD is often used as one of the metrics in different benchmarks. It is chosen as a weighted main loss in our work.
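$L_{NSS}$ is adopted from the standard NSS metric, which is a similarity metric. Its negative is used for minimization so that the model weights are optimised in the right direction; the goal of adding the $L_{NSS}$ loss is to maximise the similarity metric:
$$L_{NSS} = -\frac{1}{N}\sum_{i} \bar{P}(i)\, Q_B(i), \qquad \bar{P} = \frac{P - \mu(P)}{\sigma(P)},$$
where $P$ is the predicted saliency map, $Q_B$ is the binary fixation map, and $N = \sum_{i} Q_B(i)$ refers to the total number of fixations.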
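A combination of the three terms could be sketched as below, on flattened maps stored as plain lists. The weights `w_kld`, `w_nss` and `w_bce` are placeholders, since the text does not state the weighting used; the stabilising constant `EPS` and all function names are likewise assumptions of this sketch, which uses the standard definitions of KLD, NSS and BCE.

```python
import math

EPS = 1e-7  # small constant to avoid log(0) and division by zero (assumed)

def kld_loss(pred, target):
    """KL divergence between the sum-normalised ground truth and prediction."""
    p_sum, t_sum = sum(pred), sum(target)
    total = 0.0
    for p, t in zip(pred, target):
        p, t = p / (p_sum + EPS), t / (t_sum + EPS)
        total += t * math.log(EPS + t / (p + EPS))
    return total

def nss_loss(pred, fixations):
    """Negative NSS: mean standardised prediction at fixated pixels.
    Assumes the binary fixation map contains at least one fixation."""
    mean = sum(pred) / len(pred)
    std = math.sqrt(sum((p - mean) ** 2 for p in pred) / len(pred))
    n_fix = sum(fixations)
    score = sum((p - mean) / (std + EPS) * f
                for p, f in zip(pred, fixations)) / n_fix
    return -score

def bce_loss(pred, target):
    """Mean per-pixel binary cross-entropy; both maps lie in [0, 1]."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def combined_loss(pred, density, fixations, w_kld=1.0, w_nss=1.0, w_bce=1.0):
    """Weighted sum of the three terms; the weights are placeholders."""
    return (w_kld * kld_loss(pred, density)
            + w_nss * nss_loss(pred, fixations)
            + w_bce * bce_loss(pred, density))

# Hypothetical four-pixel example with a single fixation on the first pixel.
density = [1.0, 0.0, 0.0, 0.0]
fixations = [1, 0, 0, 0]
good = combined_loss([0.9, 0.1, 0.1, 0.1], density, fixations)
bad = combined_loss([0.1, 0.1, 0.1, 0.9], density, fixations)
```

A prediction that peaks at the fixated pixel yields a lower combined loss than one that peaks elsewhere, since all three terms move in the same direction in this toy case.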

2) Implementation details
We implemented our model in PyTorch and trained it on the MIT1003 dataset, using 900 images for training and 103 images for validation. We initialised the encoder with the pre-trained VGG [70], and both the decoder and the attention modules were randomly initialised using the Xavier method [33]. We used the Adam optimiser [47] to train the model. We opted for a learning rate of $10^{-4}$ and a step scheduler that divides it by a factor of 2 every 20 epochs. During the first ten epochs, the ECSA parameter $\gamma$ was set to zero to focus on learning the main task. At the same time, the decoder layers were frozen gradually, from the bottom to the top, progressively increasing the complexity. After the first ten epochs, the whole model was trained end-to-end, including all parameters.
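The schedule described above amounts to the following arithmetic. The sketch assumes zero-based epoch indices and hypothetical helper names; it only illustrates the step decay and the warm-up period for $\gamma$, not the layer-freezing procedure.

```python
def learning_rate(epoch, base_lr=1e-4, step=20, factor=2.0):
    """Step schedule: divide the base rate by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))

def gamma_is_trainable(epoch, warmup_epochs=10):
    """During warm-up the ECSA scale stays at zero; afterwards it is learned."""
    return epoch >= warmup_epochs

for epoch in (0, 9, 10, 19, 20, 40):
    print(epoch, learning_rate(epoch), gamma_is_trainable(epoch))
```

In PyTorch, the same decay can be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`; note that the scheduler's `gamma` argument is a decay factor and is unrelated to the ECSA parameter $\gamma$.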

3) Computational load
The embedded self-attention modules are trained in an end-to-end manner with the encoder-decoder backbone model. The entire training procedure takes about 5 hours in the Google Colab environment with a single NVIDIA Tesla T4 GPU and a 2.0GHz Intel(R) Xeon(R) CPU. Since our model does not need any pre- or post-processing steps, it takes only about 0.0106 s to process an image of size 256 × 192.

B. EXPERIMENTAL RESULTS
In this section, we evaluate our model on the MIT300 benchmark dataset [54], which is one of the most well-known benchmarks for saliency models. The dataset consists of 300 natural images; the corresponding ground-truth saliency maps are held out privately for a fair comparison. We also used the newly published Le Meur [57] paintings dataset, which offers a different, more specialised stimuli space, as paintings differ in many ways from natural scenes. Testing our approach on multiple types of stimuli helps us study our model's performance on different datasets. We also want to demonstrate the effectiveness of the extended self-attention module in capturing the global representations on another type of space, in which inherently different cues would attract the viewers' gaze compared to natural scenes. Le Meur's dataset consists of 150 painting images related to five different art periods and their respective saliency maps. We used the entire dataset for testing.

1) Competitors
We compare our model with a representative set of state-of-the-art models, namely SALICON [39], DeepGaze1 [53], SAM by Cornia et al. [23], and ML-Net [21]. We selected these models due to their ability to address visual attention on different stimuli domains, e.g., indoor, outdoor, painting. For the sake of generality on low-level attention, we further compare our model with some earlier static attention models and frameworks, i.e., the Itti-Koch model [40] and GBVS [36].

2) Metrics
We compared our model's results against the selected competitors using six saliency metrics, which are divided into two categories:
• Distribution-based metrics: These metrics compare the predicted saliency map to the ground-truth distribution from eye movement recordings. We used three of them, namely Kullback-Leibler Divergence (KLD), Similarity Metric (SIM), and Linear Correlation Coefficient (CC).
• Location-based metrics: These metrics compute statistics at fixation locations, such as Normalized Scanpath Saliency (NSS), Area under Curve (AUC) and its variants AUC-Judd (AUC-J) and shuffled AUC (s-AUC).
Reference articles 8315047 6180177 provide more detailed descriptions of all the metrics used in our experiments.
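As an illustration of the distribution-based category, the sketch below computes CC and SIM on flattened maps using their standard definitions (Pearson correlation and histogram intersection). It is not the benchmark's evaluation code; the function names and the toy maps are assumptions, and the maps are assumed to be non-constant with a positive sum.

```python
import math

def cc(pred, target):
    """Pearson linear correlation coefficient between two flattened maps."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

def sim(pred, target):
    """Histogram intersection after each map is scaled to sum to one."""
    sp, st = sum(pred), sum(target)
    return sum(min(p / sp, t / st) for p, t in zip(pred, target))

# Hypothetical four-pixel maps.
reference = [1.0, 2.0, 3.0, 4.0]
print(cc(reference, reference), cc(reference, reference[::-1]))
print(sim([1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]))
```

CC ranges from -1 to 1 and SIM from 0 to 1, with higher values indicating a closer match in both cases, whereas a lower KLD indicates a closer match.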

3) Performance
We calculated the results on the MIT300 dataset by sending the output predictions to the active benchmark service. We tested our method on Le Meur's dataset using the same protocol described in their work [57]. Table 2 shows our results on the MIT300 benchmark. As Table 2 demonstrates, our model scored the highest among the compared models on both the CC and SIM metrics for this dataset, while achieving a very close second place for the AUC and scoring competitive results for the NSS and KLD. Next, we made the same comparisons for the second dataset (Le Meur paintings dataset). The outcomes of this comparison are shown in Table 3. As Table 3 demonstrates, our model achieves the best score on the KLD metric and a close second place on SIM and CC for the painting dataset, and it remains competitive on the remaining metrics. Figures 3 and 4 illustrate the qualitative results (i.e., the visual outputs) of our model against the "ground truth" (i.e., the eye movement data) and other state-of-the-art models. In these two figures, the stimuli are overlain with the predicted and ground-truth saliency maps for the MIT300 and Le Meur datasets. The figures show that our model can capture both the global and the local attention patterns, demonstrating an important generalisation capability for different image (i.e., scene content) distributions. We attribute the continuity between the most intense salient regions to the effect of self-attention in extracting the global context of the scene, which captures the most salient objects. It also provides intuition on how attention could shift from one object to another, which would be beneficial for scan-path prediction. Figure 5 shows the effectiveness of our model in predicting salient regions on synthetic images characterised by low-level features standing out of the visual stimulus, e.g., shape, contrast, colour, orientation.
Based on these findings, SAtSal (our proposed model) is superior to or competitive with previous state-of-the-art models. For the sake of fair comparison, we used standard metrics accounting for the effectiveness of saliency detection in static images. Overall, the results demonstrate that the method is robust over multiple datasets.

C. ABLATION STUDY
This section provides a detailed evaluation of the proposed approach from several aspects through an ablation study, to verify the effectiveness of the proposed multilevel self-attention modules and examine the influence of different training protocols. We conducted the ablation study on two subsets, one that contains natural images (from MIT1003 [43]), and another that has explicitly low-level features (from CAT2000 [6]). We use this protocol to study the effect of multilevel self-attention modules on natural images using different settings. We first tackled images that contain both bottom-up and top-down attention stimuli. The same patterns exist in the distribution of images from the Le Meur and MIT300 test sets. Therefore, to further examine the robustness regarding only the bottom-up cues, we conduct the same test protocol on a specific category of images containing only low-level features with no semantic meaning. Also, in this ablation test protocol we are restricted to excluding models trained on the same data distribution, to avoid training-test overlap. Below we detail the outcomes of the two ablation studies.

1) Effectiveness of multilevel self-attention modules on natural images
First, we study the effect of the multilevel self-attention modules (ECSA modules) by disabling the main components in the following three settings. Note that for each setting, we used the same training protocol as for the original model and tested the model on 100 images from the MIT1003 dataset [43]. The results are summarized in Table 3. The significant drop in Settings 2 and 3 compared to Setting 1 is caused by the nature of the testing dataset, whose images depict objects with high semantic meaning. Even though it is relatively subtle, the drop in performance with Setting 1 indicates the importance of the low-level features, which would be even more pronounced with other stimuli. Since one of the strong points of our model is that it integrates low-level feature detection into deep learning-based saliency prediction, in combination with mid- and high-level features, we conducted an additional ablation study with a focus on low-level features. We repeated the ablation study on 100 images from the CAT2000 dataset [6], using the same settings as in the previous section. Images from CAT2000 contain patterns prepared for perceptual psychology studies, with low-level features including geometrical elements, pop-out, conjunction, search asymmetry, textures, etc. We present the results of this study in Table 4.
We report only the distribution-based metrics in this part of the study because we are interested in testing the model's accuracy in separating the region of interest from the other, non-attractive low-level features. Fixations located far from the areas of interest in this kind of scene can be considered outliers that do not represent the bottom-up saliency of the scene.
With the CAT2000 dataset, we see a minor improvement (e.g., CC: 0.9448 → 0.9470) in Setting 3 compared to the proposed approach. We believe this may be due to the nature of the scenes in CAT2000, which do not contain much semantic meaning; thus, there is no need to calculate the attention for high-level deep representations. However, the drop in performance is quite evident in Settings 1 and 2 compared to the results obtained with SAtSal. The information learned from low-level features by the multilevel skip self-attention modules is essential for modelling saliency better and could boost the performance on a general scene.
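The idea of enabling or disabling self-attention on individual skip connections can be pictured with the toy NumPy sketch below. It is not the ECSA module or the SAtSal decoder: the dot-product attention form, the residual path, merging by addition, the nearest-neighbour upsampling, the equal channel count at every level, and all names (`self_attention`, `decode`, `use_attention`) are assumptions made for illustration, and the flags do not correspond to Settings 1 to 3 of the study.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(feat, w_q, w_k, w_v):
    """Spatial self-attention over a (C, H, W) map, added back as a residual."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w).T             # (N, C): one row per position
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # (N, d), (N, d), (N, C)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N, N) weights
    out = attn @ v                           # every position sees all others
    return feat + out.T.reshape(c, h, w)


def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)


def decode(skips, params, use_attention):
    """Merge encoder skips from coarse to fine.

    skips:         list of (C, H, W) maps, coarsest first, each 2x the previous
    params:        per-level (w_q, w_k, w_v) projection matrices (hypothetical)
    use_attention: per-level flags; False bypasses attention on that skip
    """
    x = None
    for feat, p, enabled in zip(skips, params, use_attention):
        if enabled:
            feat = self_attention(feat, *p)
        x = feat if x is None else upsample2x(x) + feat
    return x
```

For example, `decode(skips, params, [True, True, False])` would bypass attention on the finest skip only, which is the kind of variant an ablation of this type compares against the full model.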

V. DISCUSSION AND CONCLUSIONS
In this paper, we set out to build, implement and test a new architecture for visual attention modelling, specifically for saliency prediction. Unlike most previous methods, we designed our approach to predict saliency from a more "holistic" perspective, accounting for both bottom-up (low-level) and top-down (high-level) features in a scene. Our model has shown great flexibility (and thus early signs of generalisability) in predicting visual saliency over datasets containing images with inherently different visual characteristics: natural scenes, paintings, and highly simplified perceptual psychology stimuli. SAtSal's saliency scores are either superior to or competitive with those of state-of-the-art models on multiple metrics. We introduced an extended CNN self-attention module that uses skip connections on multiple levels to model the representation of low- and high-level features equally, capturing both the local and the global factors that attract human attention. Our approach enables local features to contribute to modelling human visual attention after they are filtered out of the noise and merged with deeper global representations. Together, these steps allow global and local visual information to generate more accurate predictions than models focusing only on low-level or high-level features. Specifically, the main contribution of our work is a new architecture that can capture relations between separated spatial dependencies from multiple hierarchical levels. Furthermore, these steps improve the accuracy of the extracted saliency maps because they take all stimulus features into account. We evaluated our model on a well-known benchmark and a newly proposed dataset, attaining competitive results against a representative set of state-of-the-art models.
Although the model is trained on a small set of data, both the quantitative and the qualitative outcomes demonstrate the effectiveness and robustness of our model and its capability to generalise across different data distributions. Furthermore, SAtSal's performance provides evidence of the importance of taking features from multiple levels into account to improve saliency prediction. As a future extension, we intend to address the temporal dimension in order to predict fixations and their duration. To do that, we aim to employ the capability of self-attention to capture the temporal dimension while exploiting the contextual and semantic characteristics of the stimuli. This work also opens questions about the interpretability of deep saliency models, the features responsible for improving saliency prediction, and the reasons why accuracy varies from one distribution to another covering different cues. Finally, we consider the contributions of bottom-up and deep semantic cues in both the qualitative and the quantitative results.