Learning Nonlinear Feature Mapping via Constrained Non-convex Optimization for Unsupervised Salient Object Detection

Mimicking the biological visual attention mechanism to detect salient objects in images has been widely studied in recent years. Most existing computational models rely on external learning for saliency prediction and therefore lack robustness in diversified scenes. In this paper, an unsupervised learning model is proposed to detect salient objects by fully exploiting the internal information of the scene. Specifically, we formulate saliency detection as a mathematical programming problem that learns a nonlinear feature mapping from multi-view features to saliency scores. The optimization objective is to maximize the between-class variance of the attended and background regions in the resulting saliency maps, which is statistically optimal. Moreover, to solve the non-convex constrained mathematical programming problem, a hybrid algorithm that combines the external point method with particle swarm optimization is developed to find the optimal solution rapidly. Finally, extensive experiments are conducted on four classical saliency benchmark datasets to test the effectiveness of the proposed method, which shows qualitative and quantitative performance superior to that of 16 other state-of-the-art unsupervised saliency models.


I. INTRODUCTION
Visual saliency is a remarkable behavior of primates, which allows rapid scene analysis for novel information discovery. Over the past two decades, computational modeling of this intelligent behavior has been an emerging research topic that has benefited a wide range of scientific and engineering fields [1]-[3]. In its early stage, the inspiration for visual saliency modeling mainly came from biological cognition rules, such as local/global contrast, singularity/sparsity, and shape/location priors [4]-[7]. These models are based on hand-designed rules and generally have good theoretical interpretability. Later on, learning models with handcrafted features were widely studied for this problem; they try to build mappings from low-level visual features to saliency degrees [8]-[10]. Usually, external training images are needed to learn the parameters of the detection model.
In recent years, end-to-end learning frameworks have been designed to boost the detection performance without manual feature design [11]-[14]. To support an exact mapping from raw data to saliency score, massive numbers of externally labelled images are used to fit the model parameters. In terms of the basic model design process, existing saliency detection methods can be roughly classified into two categories, i.e., rule-based and learning-based.
In general, learning-based models are not easily restricted by rigid rules and are thus more flexible and autonomous in handling this ill-posed problem. However, the knowledge transfer from external images to the test image makes learning-based models too general to match a specific scene well. In fact, the biological visual system is more intelligent and can directly extract the regions of interest (ROIs) from the scene without external guidance [15]. Therefore, designing learning models that make full use of the internal information of the image is a promising direction towards more effective saliency detection. Different from traditional rule-based or learning-based models, this paradigm of learning from the image itself has great potential for better formulating the problem. The remaining challenge is how to develop a self-taught mechanism that makes the saliency learning process more intelligent. So far, little work has been done in this direction, and developing an unsupervised self-taught learning model (Fig. 1) for saliency detection is a meaningful research topic. As an important attempt towards this goal, unsupervised optimization-based salient object detection methods have been proposed in recent years [16], [17]. The sparse and low-rank structures of the salient object and the background are characterized by different regularization norms to build an unsupervised decomposition model for salient object detection. From the learning perspective, these methods belong to discriminative models and are designed to learn background subspaces for sparse salient outlier detection. As is known, generative models are less dependent on specific assumptions and can provide a more universal expression for complex mappings. Thus, it is possible to design an unsupervised generative model that learns the complex relationships between image features and saliency scores without external supervision. The core is to design a reasonable self-taught mechanism that drives the learning process in a more intelligent direction.
Motivated by the above observations, an unsupervised non-convex optimization model is proposed in this paper, which learns nonlinear feature mappings directly from the internal scene for accurate saliency prediction. Different from previous methods, the saliency mapping function is designed in a nonlinear quadratic form, which contains both the self-information and the mutual information of the features. This nonlinearity is better able to describe the complex correlation between the feature spaces and the saliency domain. Correspondingly, to learn the optimal mapping vector for each individual image, a constrained non-convex programming problem is proposed with the objective of maximizing the between-class variance of the attended and background regions in the resulting saliency map. This special optimization objective provides self-supervision information that guides the learning process in favor of saliency modeling. After that, a hybrid algorithm combining the ideas of the external point method and particle swarm optimization is derived to solve the above programming problem efficiently. Finally, with the learned mapping coefficient vector, high-quality saliency maps can be obtained through nonlinear feature transform and joint statistical inference.
The contributions of this paper are summarized as follows. First, we propose a novel unsupervised self-taught paradigm that learns from the internal information of the image for adaptive saliency detection in complex scenes. Secondly, a nonlinear-mapping-based constrained non-convex programming model is established to combine multi-view features for accurate saliency prediction. Thirdly, we design a hybrid iterative algorithm that combines the external point method and particle swarm optimization to learn the optimal mapping space from the above model efficiently.
The rest of this paper is organized as follows. In Section II, we introduce the proposed unsupervised nonlinear feature mapping model and the hybrid iterative optimization algorithm in detail. Experimental results and performance evaluation on benchmark datasets, as well as a model ablation study, are given in Section III. Finally, we briefly conclude this work in Section IV.

II. PROPOSED METHOD
Our method is mainly inspired by the following observations. First, previous methods mostly build linear feature mappings for saliency prediction, which ignore the coupling of different feature dimensions. To achieve stronger modeling capability, a nonlinear feature mapping model is proposed for saliency detection. Secondly, to avoid the bias of knowledge transfer from external images, an unsupervised internal learning scheme is developed for thorough scene analysis. The general pipeline of the proposed salient object detection method is shown in Fig. 2. It is composed of three components, i.e., nonlinear feature mapping (NFM) construction, unsupervised internal learning (UIL) model design, and hybrid iterative optimization (HIO) algorithm derivation. Detailed descriptions of these components are given in the following subsections.

A. UNSUPERVISED NONLINEAR FEATURE MAPPING (NFM) MODEL
Building connections between low-level visual features and high-level perception results conforms to the basic biological cognition rules [18]. Most existing methods make linear assumptions for saliency modeling, which are not sufficient to describe this complex relationship. The low-level features are commonly treated as independent channels, and the coupling among them is seldom studied. In this paper, we propose a nonlinear feature mapping model to better capture the information transfer process from low-level features to saliency degree.
Given an image in RGB color space, we first extract its multi-scale contrast ($f_1$), center-surround histogram ($f_2$), and color spatial distribution ($f_3$) feature maps according to [19]. Meanwhile, the input image is over-segmented into several superpixel regions $\{s_1, s_2, \ldots, s_k\}$ using the LSC algorithm [20], each of which represents a uniform local image structure. Correspondingly, the feature values of the pixels residing in each superpixel are averaged and assigned as the superpixel feature value. In this way, we can create a feature matrix that holds the three feature values of every superpixel. The basic idea is that, under a properly designed expression space, the saliency of a superpixel region can be precisely predicted from its feature vector. From a mathematical perspective, the saliency value $sa_i$ of the $i$-th superpixel region is obtained by passing its feature vector through a functional projection space (1). Previous methods of this kind adopt a linear projection and assume no coupling among features, which limits their modeling capability. To improve the model adaptability, a quadratic mapping function (2) that considers both the self-information and the mutual information of the features is proposed in this paper. It has two mapping coefficient vectors, one for the self-information terms and one for the mutual-information terms. The quadratic and product terms can be considered as extended features, and their linear weighted sum is used to predict the saliency score. In practice, we need to learn an optimal mapping function such that foreground regions have greater saliency values than background regions. To guarantee the existence and uniqueness of the solution, we enforce the constraints (3) on the mapping coefficient vector: all the weighting coefficients are forced to reside in [0, 1] and their sum is equal to 1. By this, the feature representation space is restricted to the surface of a unit hyperplane.
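The quadratic mapping described above can be illustrated with a short sketch. It assumes that the extended features of a superpixel are the three squared features followed by the three pairwise products, and that a single six-element weight vector stacks the self-information and mutual-information coefficients. The function names, the ordering of the terms, and the sample values are assumptions made for illustration, not the paper's notation.

```python
from itertools import combinations


def extended_features(f):
    """Map [f1, f2, f3] to squared terms followed by pairwise products."""
    squares = [x * x for x in f]
    products = [f[a] * f[b] for a, b in combinations(range(len(f)), 2)]
    return squares + products


def is_feasible(w, tol=1e-9):
    """Check the stated constraints: every weight in [0, 1], weights sum to 1."""
    in_range = all(-tol <= x <= 1.0 + tol for x in w)
    return in_range and abs(sum(w) - 1.0) <= tol


def saliency(f, w):
    """Saliency of one superpixel: weighted sum of its extended features."""
    return sum(wi * zi for wi, zi in zip(w, extended_features(f)))


# Hypothetical (f1, f2, f3) of one superpixel, each assumed to lie in [0, 1].
features = [0.8, 0.5, 0.2]
# Hypothetical coefficients: self terms for f1 and f2, mutual term for f1*f3.
weights = [0.4, 0.3, 0.0, 0.0, 0.3, 0.0]
score = saliency(features, weights)
```

Under these assumptions every extended feature lies in [0, 1] and the weights form a convex combination, so the predicted saliency also stays in [0, 1].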
So far, we have formulated the basic nonlinear feature mapping model. The remaining problem is how to find an ideal mapping space for accurate saliency score prediction from the extended features. Inspired by the idea of maximizing the between-class variance, we propose an optimality criterion to evaluate the quality of the solution. The basic idea is to maximize the between-class variance of foreground and background saliency so as to produce high-contrast saliency maps. Given the numbers of foreground and background superpixels, the between-class variance $\sigma_B^2$ of a given saliency map is defined from the deviations of the mean foreground saliency and the mean background saliency from $s_g$, the average value of the saliency map. Through mathematical deduction, this definition can be rewritten and its two terms expanded over $s_f$ and $s_b$, the sets of foreground and background superpixels. As a statistical criterion, $\sigma_B^2$ reveals the divergence between the foreground and background saliency value distributions. By maximizing $\sigma_B^2$, we can find an optimal mapping coefficient vector with which to produce high-quality saliency maps. To this end, we build an objective function to evaluate the quality of the solution.
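The rewriting of the between-class variance can be checked numerically. The sketch below assumes the usual two-class form, in which each class is weighted by its share of the superpixels, and compares it with the compact form that depends only on the gap between the two class means. Whether the paper normalizes by the total count in exactly this way is an assumption, and the function names and sample values are hypothetical.

```python
def between_class_variance(fg, bg):
    """Definition form: weighted squared deviations from the global mean."""
    n_f, n_b = len(fg), len(bg)
    n = n_f + n_b
    mean_f = sum(fg) / n_f
    mean_b = sum(bg) / n_b
    mean_g = (sum(fg) + sum(bg)) / n  # average value of the whole map
    return (n_f / n) * (mean_f - mean_g) ** 2 + (n_b / n) * (mean_b - mean_g) ** 2


def between_class_variance_compact(fg, bg):
    """Compact form: depends only on the gap between the two class means."""
    n_f, n_b = len(fg), len(bg)
    n = n_f + n_b
    gap = sum(fg) / n_f - sum(bg) / n_b
    return (n_f * n_b / n ** 2) * gap ** 2


# Hypothetical saliency values of foreground and background superpixels.
foreground = [0.9, 0.8, 0.7]
background = [0.1, 0.2, 0.3, 0.2, 0.2]
v_definition = between_class_variance(foreground, background)
v_compact = between_class_variance_compact(foreground, background)
```

Both functions return the same value for any split, which is why maximizing the criterion amounts to pushing the mean foreground and mean background saliency apart while keeping the two classes reasonably balanced in size.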
Combining the objective with the constraint conditions in (3), we obtain the standard form of our saliency optimization model.
Since $s_f$ and $s_b$ are not known in practice, they need to be learned iteratively together with the mapping coefficient vector to maximize the objective function. This is a compound constrained optimization problem with a dynamic solution search space. Below we develop a feasible way to solve this problem efficiently.
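For orientation, one plausible way to write the model down is given below. The symbols $\mathbf{w}$ (the six stacked mapping coefficients), $\mathbf{z}_i$ (the extended feature vector of the $i$-th superpixel), $f_{ij}$ (the $j$-th feature of the $i$-th superpixel), $n_f$, $n_b$ (class sizes) and $\bar{s}_f$, $\bar{s}_b$ (class means) are names assumed for this sketch and are not necessarily the paper's own notation.

```latex
\begin{aligned}
\mathbf{z}_i &= \left[f_{i1}^2,\ f_{i2}^2,\ f_{i3}^2,\ f_{i1}f_{i2},\ f_{i1}f_{i3},\ f_{i2}f_{i3}\right]^{T},
\qquad sa_i = \mathbf{w}^{T}\mathbf{z}_i, \\
\bar{s}_f &= \frac{1}{n_f}\sum_{i \in s_f} sa_i, \qquad
\bar{s}_b = \frac{1}{n_b}\sum_{i \in s_b} sa_i, \qquad n = n_f + n_b, \\
\sigma_B^2 &= \frac{n_f}{n}\left(\bar{s}_f - s_g\right)^2 + \frac{n_b}{n}\left(\bar{s}_b - s_g\right)^2
= \frac{n_f n_b}{n^2}\left(\bar{s}_f - \bar{s}_b\right)^2, \\
\max_{\mathbf{w}}\ \sigma_B^2 \quad &\text{s.t.} \quad \sum_{j=1}^{6} w_j = 1, \qquad 0 \le w_j \le 1 .
\end{aligned}
```

Under this reading, for a fixed partition the objective is a convex quadratic function of $\mathbf{w}$, so maximizing it is a non-convex problem, and the dependence of $s_f$ and $s_b$ on $\mathbf{w}$ makes the search space change during optimization.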

B. EFFICIENT HYBRID ITERATIVE OPTIMIZATION (HIO) ALGORITHM
In the above hybrid saliency optimization model, the objective function depends on both the immediate variable (the mapping coefficient vector) and the two hidden variables $s_f$ and $s_b$. The whole learning process works in a self-taught manner with no external information involved. To solve this optimization problem, we first borrow the idea of the external point method (EPM) to merge the constraint terms into the optimization objective. Here, we use a quadratic penalty function to construct the corresponding penalty term.
By adding the penalty term to the objective function, we obtain the overall fitness function (11).
In the fitness function, the penalty coefficient controls the penalty strength for any illegal solution. In practice, the penalty coefficient gradually increases along with the iteration process to guarantee that the solution moves towards the feasible space. The fitness function contains both hidden variables and a hyper-parameter, and thus has no explicit analytical solution. Here, we propose a hybrid algorithm based on particle swarm optimization (PSO) for efficient solution search. The basic idea is to first randomly sample a group of candidate solutions in the feasible space and then iteratively update their positions towards greater overall fitness values. Given a candidate solution, the saliency value of each superpixel region can be determined according to (2). After that, we compose a coarse saliency map from the predicted saliency of the superpixel regions. Using the Otsu algorithm, we binarize the coarse saliency map to get the foreground and background superpixel sets. In turn, based on the divided foreground and background sets, the fitness value of the candidate solution can be calculated with (11). After each iteration, the position of each candidate solution is updated according to the heuristic rules of PSO, and the penalty coefficient is increased to force the candidate solution to reside in the feasible space. This iteration process proceeds until the optimal solution of the swarm reaches a stable state. The hidden variables and the hyper-parameter in (11) change constantly during the iterations, and the fitness function value is only an approximate estimate because of the coarse saliency map binarization. Therefore, (11) is a dynamic objective function in a noisy optimization environment, which PSO can handle in a stable manner [21].
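As a rough illustration of the penalty construction, the sketch below assumes one squared term for the sum-to-one constraint and one squared term for each violated bound, and subtracts the weighted penalty from the objective because the fitness is maximized. The exact penalty used in the paper is not recoverable from this text, so `quadratic_penalty`, `fitness` and `mu` are hypothetical names for an assumed form.

```python
def quadratic_penalty(w):
    """Sum of squared violations of sum(w) == 1 and 0 <= w_j <= 1."""
    equality = (sum(w) - 1.0) ** 2
    bounds = sum(max(0.0, -x) ** 2 + max(0.0, x - 1.0) ** 2 for x in w)
    return equality + bounds


def fitness(objective_value, w, mu):
    """Penalized fitness to maximize: objective minus mu times the penalty."""
    return objective_value - mu * quadratic_penalty(w)


# Hypothetical coefficient vectors: one legal, one violating the constraints.
feasible = [0.4, 0.3, 0.0, 0.0, 0.3, 0.0]
infeasible = [1.2, -0.1, 0.0, 0.0, 0.0, 0.0]

p_feasible = quadratic_penalty(feasible)      # no violation
p_infeasible = quadratic_penalty(infeasible)  # 0.1**2 + 0.2**2 + 0.1**2
```

A legal vector keeps its fitness regardless of `mu`, while an illegal one loses fitness as `mu` grows, which is what pushes the swarm back towards the feasible space over the iterations.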
Pseudo code of the proposed hybrid iterative optimization algorithm is summarized in Algorithm 1. The general process combines the ideas of the external point method and particle swarm optimization. The penalty coefficient is magnified along with the iteration number so as to drive the solution towards the legal space. Meanwhile, a group of particles collaborate intelligently to search for the optimal solution that maximizes the fitness function. By simultaneously considering the legality and optimality of the solution, satisfactory results can be obtained rapidly for saliency mapping. It is also worth mentioning that the division of ...
With the above EPM-PSO algorithm, the optimal mapping coefficient vector can be obtained, with which we can predict the saliency values of superpixel regions according to (2). After filling the predicted saliency values into the corresponding image space and performing pixel-level statistical inference, we get the final saliency map of the image for subsequent analysis. Since both low-level and mid-level information of the image is exploited, a high-quality saliency map can be obtained, as will be seen later.
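The EPM-PSO loop described above can be sketched roughly as follows. This is an illustrative reading of the prose, not the authors' Algorithm 1: the penalty form, the fixed iteration count (instead of the stability-based stopping rule), the growth factor of the penalty coefficient, the velocity clamp `v_max`, the exhaustive Otsu split over sorted scores, and all function names and toy data are assumptions.

```python
import random
from itertools import combinations


def extend(f):
    """Squared terms followed by pairwise products of one feature vector."""
    return [x * x for x in f] + [f[a] * f[b] for a, b in combinations(range(len(f)), 2)]


def otsu_split(scores):
    """Return (fg_indices, bg_indices, variance) for the cut of the sorted
    scores that maximizes the between-class variance."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    best_var, best_cut = -1.0, 1
    for cut in range(1, n):
        bg = [scores[i] for i in order[:cut]]
        fg = [scores[i] for i in order[cut:]]
        gap = sum(fg) / len(fg) - sum(bg) / len(bg)
        var = len(fg) * len(bg) / n ** 2 * gap ** 2
        if var > best_var:
            best_var, best_cut = var, cut
    return order[best_cut:], order[:best_cut], best_var


def penalty(w):
    """Quadratic penalty for violating sum(w) == 1 and 0 <= w_j <= 1."""
    return (sum(w) - 1.0) ** 2 + sum(max(0.0, -x) ** 2 + max(0.0, x - 1.0) ** 2 for x in w)


def fitness(w, Z, mu):
    """Between-class variance after the Otsu split, minus the weighted penalty."""
    scores = [sum(wj * zj for wj, zj in zip(w, z)) for z in Z]
    return otsu_split(scores)[2] - mu * penalty(w)


def epm_pso(Z, n_particles=36, iters=30, inertia=1.0, c1=2.0, c2=2.0,
            mu=1.0, growth=1.5, v_max=0.2, seed=0):
    rng = random.Random(seed)
    dim = len(Z[0])

    def simplex_point():
        # Random legal start: non-negative weights that sum to 1.
        raw = [rng.random() for _ in range(dim)]
        total = sum(raw)
        return [x / total for x in raw]

    pos = [simplex_point() for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    for _ in range(iters):
        # The penalty weight changed, so re-score the remembered positions.
        pbest_fit = [fitness(p, Z, mu) for p in pbest]
        for i in range(n_particles):
            fit = fitness(pos[i], Z, mu)
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
        gbest = pbest[max(range(n_particles), key=pbest_fit.__getitem__)]
        for i in range(n_particles):
            for d in range(dim):
                v = (inertia * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # assumed velocity clamp
                pos[i][d] += vel[i][d]
        mu *= growth  # stronger penalty on illegal solutions each iteration
    return gbest[:]


# Hypothetical feature matrix: 6 superpixels x (f1, f2, f3), values in [0, 1].
F = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.2, 0.1, 0.3],
     [0.1, 0.2, 0.2], [0.3, 0.2, 0.1], [0.85, 0.7, 0.8]]
Z = [extend(f) for f in F]
w_star = epm_pso(Z)
```

The personal bests are re-scored at the start of every iteration because a position that looked good under a weak penalty may become poor once the penalty coefficient has grown. With a fixed seed the sketch is deterministic, so repeated calls return the same coefficient vector.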

C. IMPLEMENTATION DETAILS
In the experiments, the expected number of superpixels in the LSC algorithm and the number of particles in the swarm are set to 500 and 36, respectively. The three velocity updating parameters in PSO, i.e., the inertia weight, $c_1$ and $c_2$, are fixed to 1, 2 and 2 during the iteration process. The initial value of the penalty coefficient and the termination error are chosen to be 1 and 0.001. For the initialization of the swarm particles, we use a random function to generate legal vectors for the particle positions, velocities, personal bests and global best. Finally, the fitness vector Swarm.fit is initialized to an sn-dimensional vector of negative infinity to ease subsequent updates of the solutions. Our code is written in MATLAB and run on an HP Z8 G4 workstation with an 8-core 1.70 GHz CPU, 64 GB of RAM, and the 64-bit Windows 10 operating system. It is also worth mentioning that each image is assigned its own optimal mapping function during its unsupervised self-taught learning process.
Both qualitative and quantitative results are obtained for comprehensive analysis of the modeling capabilities. As a common practice, saliency maps are used to evaluate the quality of the detection results qualitatively. For ease of observation, only saliency maps of the ten top-performing unsupervised methods (chosen based on the quantitative comparison results) and the three deep learning methods are displayed for visual comparison. Meanwhile, precision-recall (PR) and F-measure curves of the saliency detectors are drawn to quantitatively evaluate the detection performance. Precision is the ratio of true positives (TP) to the sum of true positives and false positives (FP), while recall is the ratio of true positives to the sum of true positives and false negatives (FN). The two indexes are complementary and usually need to be balanced in practice. To give a unified evaluation standard, the F-measure is used, which is the weighted harmonic mean of precision and recall:
$$F_\beta = \frac{(1 + \beta^2)\,\mathrm{Precision} \times \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \quad (12)$$
where $\beta^2 = 0.3$ is used to put more emphasis on precision than on recall, as suggested in [43]. For each 8-bit grey-scale saliency map $S$ of size $m \times n$, we use a dynamic threshold ranging from 0 to 255 to binarize it, and 256 precision-recall point pairs are calculated by comparing the result with the binary ground-truth map $G$ of the same size. After plotting all 256 point pairs on a 2-D plane, we obtain the PR curve of the saliency map. Correspondingly, 256 F-measure values can be derived according to (12) to form the F-measure curve.
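The thresholding procedure can be sketched in a few lines. The example treats a saliency map as a flat list of 8-bit values and counts a pixel as positive when its value is at least the threshold. That comparison rule, the convention of reporting a precision of 1 when nothing is detected, and the toy data are assumptions of this sketch.

```python
def precision_recall(saliency, truth, threshold):
    """Binarize at `threshold` (positive when value >= threshold) and
    compare the result with the binary ground truth."""
    tp = sum(1 for s, g in zip(saliency, truth) if s >= threshold and g)
    fp = sum(1 for s, g in zip(saliency, truth) if s >= threshold and not g)
    fn = sum(1 for s, g in zip(saliency, truth) if s < threshold and g)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall."""
    denom = beta_sq * precision + recall
    return (1.0 + beta_sq) * precision * recall / denom if denom else 0.0


def pr_curve(saliency, truth):
    """256 (precision, recall) pairs, one per threshold from 0 to 255."""
    return [precision_recall(saliency, truth, t) for t in range(256)]


# Hypothetical 6-pixel saliency map and its binary ground truth.
saliency_map = [250, 200, 120, 30, 10, 240]
ground_truth = [1, 1, 0, 0, 0, 1]

curve = pr_curve(saliency_map, ground_truth)
f_curve = [f_measure(p, r) for p, r in curve]
```

At threshold 100 the toy map yields three true positives and one false positive, so precision is 0.75 and recall is 1. Any threshold from 121 to 200 separates the two classes perfectly and gives an F-measure of 1.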
In addition, we use four numerical indexes, namely area under curve (AUC), mean absolute error (MAE), S-measure (SM) [44] and weighted F-measure (WF) [45], to evaluate the salient object detection performance more comprehensively. AUC is the area under the receiver operating characteristic (ROC) curve, which is drawn from 256 pairs of detection rate and false alarm rate in the same way as the PR curve. MAE is the average pixel-wise absolute difference between the saliency map and its corresponding ground-truth map on a benchmark dataset. As two more recently proposed indexes, SM and WF take image structure and dependency relationships into consideration and are thus able to provide more reliable and objective performance evaluation results.
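The two simpler indexes, MAE and AUC, can be sketched as follows. The sketch assumes that the 8-bit saliency map is scaled to [0, 1] before the absolute difference is taken and that the ROC curve is integrated with the trapezoidal rule. Both choices, the function names, and the toy data are assumptions. S-measure and weighted F-measure are defined in [44] and [45] and are not reproduced here.

```python
def mae(saliency, truth):
    """Mean absolute error between a [0, 255] map and a {0, 1} ground truth."""
    return sum(abs(s / 255.0 - g) for s, g in zip(saliency, truth)) / len(saliency)


def roc_auc(saliency, truth):
    """Area under the ROC curve built from thresholds 255 down to 0.
    Requires at least one positive and one negative ground-truth pixel."""
    n_pos = sum(1 for g in truth if g)
    n_neg = len(truth) - n_pos
    points = [(0.0, 0.0)]  # nothing detected: both rates are zero
    for t in range(255, -1, -1):
        tp = sum(1 for s, g in zip(saliency, truth) if s >= t and g)
        fp = sum(1 for s, g in zip(saliency, truth) if s >= t and not g)
        points.append((fp / n_neg, tp / n_pos))  # (false alarm, detection)
    # Trapezoidal rule over consecutive points of the curve.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical 6-pixel saliency map and its binary ground truth.
saliency_map = [250, 200, 120, 30, 10, 240]
ground_truth = [1, 1, 0, 0, 0, 1]

error = mae(saliency_map, ground_truth)
area = roc_auc(saliency_map, ground_truth)
```

In the toy example every salient pixel scores higher than every background pixel, so the AUC is 1 even though the MAE is clearly above zero. This shows why the two indexes are reported side by side: AUC only reflects the ranking of pixels, whereas MAE also penalizes saliency values that are far from the ground truth.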
Below we will give the saliency maps and numerical indexes of the various saliency detectors on the benchmark datasets for comprehensive analysis and comparison of the model performance.

B. PERFORMANCE EVALUATION RESULTS
The ability to accurately detect salient objects in complex scenes is a major point of difference among saliency models. It is therefore necessary to observe model behavior in typical challenging situations. Fig. 3 shows some representative saliency maps of the ten top-performing unsupervised methods (DSR, HS, LPS, MBS+, MC, MST, RC, ST, WF and NFM) as well as the three deep learning methods (DFI, JCS and LDF) on the four benchmark datasets. As can be seen from the results, our method produces high-quality saliency maps comparable to or better than those of its counterparts when faced with multiple, hidden, and cropped objects as well as low image contrast and motion blur. Due to the complexity of the saliency generation mechanism, models relying heavily on a certain kind of cognitive cue lack robustness in diversified scenes. For example, the cow in the fourth image is not detected in its entirety by most methods because of their excessive dependence on the boundary prior. Also, the football and its player are not given full attention by some methods because of their overemphasis on the focusness prior. Free from specific cognitive cues, our model can build a representative mapping space for saliency modeling in complex scenes. It provides a more general way to handle the different forms of saliency information and is thus better suited to this problem.
Meanwhile, PR and F-measure curves of all the saliency models are shown in Figs. 4 and 5, respectively, for objective performance evaluation. As can be observed from Fig. 4, among all the unsupervised methods, the PR curves of NFM lie closest to the upper-right corner of the plot, indicating the superiority of our model in unsupervised saliency detection. At present, saliency modeling performance is approaching saturation on some simple datasets but still needs improvement on challenging ones.
PR indexes of the saliency detectors on the simple THUS-10000 dataset are relatively high, and the advantage of our method over the others is not obvious there. On more challenging datasets such as PASCAL-S and ECSSD, the performance gap between our method and the others widens. Our method also improves the PR index on the complex DUT-OMRON dataset to a large extent, implying its strong modeling capability in complex scenes. Similar results can be observed from the F-measure index in Fig. 5. Since our model adopts a nonlinear feature representation and a self-taught learning mechanism, it is more suitable for detecting salient objects in diversified scenes than the other unsupervised methods. The three deep learning methods have favorable PR and F-measure indexes owing to their use of massive external supervision information.
In addition, we show the numerical indexes, including area under ROC curve (AUC), mean absolute error (MAE), S-measure (SM) and weighted F-measure (WF), of the saliency models on the four datasets in Tables 1, 2, 3 and 4, respectively, for further performance comparison. An up/down arrow means that the larger/smaller the value is, the better the detection performance is. For ease of observation, the best three non-deep-learning results on each index are marked in red, green and blue. Since SM and WF both consider the dependency relationships among pixels, they provide more objective evaluation standards than AUC and MAE. As can be seen, our method has the best SM and WF indexes among all the unsupervised methods on the four datasets, implying its superior performance under diversified scenarios. According to the three indexes (AUC, MAE and WF) provided in [17] (on PASCAL-S) and [16] (on ECSSD and THUS-10000), our method shows better numerical results than these two recent unsupervised optimization-based approaches.
Finally, concerning the modeling efficiency, the EPM-PSO algorithm converges to an ideal solution within a few iterations. The average running time of our method on a test image is around 3 seconds on the aforementioned computing platform. It is worth mentioning that the EPM-PSO algorithm can be further parallelized to greatly speed up the modeling process. In this regard, our method can achieve running efficiency comparable to or even better than that of traditional salient object detection approaches. Since models based on deep learning rely on high-performance GPU platforms, they are generally fast during the testing stage. Below we conduct an ablation study to verify the effectiveness of each component of our model for saliency detection.

C. ABLATION STUDY AND MODEL VERIFICATION
From the viewpoint of information fusion, our saliency model combines multi-view feature maps under a nonlinear mapping space for comprehensive saliency prediction. Through joint optimization learning, individual feature maps are efficiently fused to obtain enhanced saliency maps. To further verify the effectiveness of the proposed saliency fusion model, we conduct an ablation study on the THUS-10000 dataset with the PR curve as the evaluation criterion. Specifically, the three feature maps used in our model, namely multi-scale contrast (MSC), center-surround histogram (CSH), and color spatial distribution (CSD), are compared with the ground-truth map to derive their PR curves on THUS-10000. Fig. 6 shows some example feature maps and the corresponding saliency maps fused by the proposed method on the THUS-10000 dataset. Meanwhile, the PR curves of the three feature maps on the THUS-10000 dataset are drawn in Fig. 7 to evaluate the performance improvement.
As can be seen from Figs. 6 and 7, both the qualitative and quantitative results are boosted to a large extent by the proposed nonlinear optimization modeling process. The multi-view feature maps contain rich contour, location and color information, which is fully exploited by the representative and discriminative learning model for high-quality saliency generation. As a result, the fused saliency map shows better visual and numerical results than each individual feature map. After the fusion process, the maximum precision value is increased by 0.1, which is a significant improvement on this dataset.
In Table 5, we summarize the optimal mapping coefficient vectors learned by the proposed method for the 6 test images displayed in Fig. 6. As can be observed, each image is assigned a non-negative projection vector in the feature representation space for accurate saliency estimation. Take the first image as an example. The self-information from MSC and CSH, as well as the mutual information between MSC and CSD, are mainly used to constitute the saliency map. The two self-information components provide contour and location estimation results, and the mutual-information component provides a compound contour and color analysis result. Together they contribute to the formation of the final fine-grained saliency map. Owing to the designed learning and optimization mechanism, our model can sense the importance of each feature map and at the same time build correlations among them to compose the ideal saliency map. Through the unsupervised self-taught learning process, a scene-specific saliency mapping function can be constructed for each individual image without external guidance. Thus, it is well suited to the robust modeling of saliency in complex open scenes.
Since PSO has some randomness due to the initialization of the particle velocities and positions as well as the update of the particle velocities, it is necessary to observe the stability of the optimization algorithm during the saliency modeling process. We run the EPM-PSO algorithm on the PASCAL-S dataset 10 independent times to check the influence of its randomness on the detection performance. The experimental results indicate that PSO converges stably to the optimal solution in each individual test, leading to exactly the same saliency map result. In terms of convergence speed, the variance of the average running time (in seconds) on a test image among the 10 independent runs is 0.02, which further demonstrates the stability of the convergence process.
In general, the essence of the optimization objective is to find a nonlinear unit mapping such that the between-class variance of the feature points after projection is maximized. The collaborative search mechanism of PSO helps the optimization algorithm avoid being trapped in local optima and converge rapidly to the optimal solution.

IV. CONCLUSION
In this paper, we propose an unsupervised internal learning model to construct a nonlinear feature mapping for robust salient object detection in complex natural images. A novel optimization objective based on between-class variance maximization is derived to form a constrained non-convex programming problem. Based on this, we develop a particle swarm optimization algorithm based on the external point method to search for the ideal solution rapidly. Extensive experiments on benchmark datasets verify the superiority of the proposed method over other classic saliency detection models, especially under complex imaging conditions and scenes. Also, the effectiveness of the nonlinear feature integration scheme is confirmed by the model ablation study. Different from previous methods, the proposed model is free from specific rules or generalization bias and possesses a self-driven saliency learning capability. In general, we provide a more representative and focused model for boosting the performance of feature-mapping-based saliency detection. In the future, developing unsupervised scene-specific saliency learning models that meet the requirements of open-world scene understanding remains a meaningful research topic.