Salient Object Detection With Importance Degree

In this article, we introduce salient object detection with importance degree (SOD-ID), which is a generalized technique for salient object detection (SOD), and propose an SOD-ID method. We define SOD-ID as a technique that detects salient objects and estimates their importance degree values. Hence, it is more effective than SOD for some image applications, which we show via examples. We introduce and discuss the definition, evaluation procedure, and data collection for SOD-ID, and propose an evaluation metric and a data preparation procedure, whose validity is discussed using simulation results. Moreover, we propose an SOD-ID method, which consists of three technical blocks: instance segmentation, saliency detection, and importance degree estimation. The saliency detection block is based on a convolutional neural network that uses the results of the instance segmentation block. The importance degree estimation block uses the results of the other two blocks. The proposed method suppresses inaccurate saliencies and estimates the importance degree for multi-object images. In the simulations, the proposed method outperformed state-of-the-art methods with respect to the F-measure for SOD, and with respect to Spearman's and Kendall rank correlation coefficients and the proposed metric for SOD-ID.


I. INTRODUCTION
Saliency detection (SD) is an image processing technique that estimates salient local regions in images [1]-[7]. Salient regions are generally defined as areas that attract human attention because of characteristics such as high contrast, unique orientation, and distinctive color. Detecting these regions is important for image applications, such as human eye fixation estimation and context-aware image coding.
Recently, several methods have been proposed for salient object detection (SOD), which is similar to SD [8]-[27]. Instead of estimating local regions, SOD identifies characteristic objects, such as a tall man, a red car, or signs. Some image processing applications require not only salient information but also the locations of important objects [28]-[31]. For example, image retargeting uses the object locations and resizes images while retaining the shapes of the objects. Thus, SOD has been shown to be more useful than SD for some applications.
The associate editor coordinating the review of this manuscript and approving it for publication was Ikramullah Lali.
Moreover, Islam et al. proposed an expansion of SOD [25], which is called RSOD in this article, and studies have shown that it has high potential for image applications. SOD classifies estimated objects as salient or non-salient, whereas RSOD estimates salient object contours and their importance scores. Importance scores are useful for several applications, as we show in Fig. 1, where (a) is an input image, (b) and (c) are its ideal saliency maps in SOD and RSOD, respectively, and (d) and (e) are the retargeting results for (a) using (b) and (c), respectively, according to [30]. In (b), the white and black areas represent salient and non-salient regions, respectively, whereas in (c), the white, gray, and black areas represent first salient, second salient, and non-salient regions, respectively. In (d), a part of the dog, which seems to be the most important object in (a), is cropped because the chair and dog are given the same importance value by SOD, as shown in (b). By contrast, because of the different scores in (c), the dog is completely preserved in (e). Fig. 1 shows one advantage of the expansion, and our experiments suggest that it has high potential not only for image retargeting but also, for example, for content-aware image coding and image representation. However, the discussion of RSOD in [25] was not sufficient to introduce it as a new theme of computer vision. The authors presented the expansion with little detail as a supplement to the main topic. The details of its definition were omitted, and its significance was not discussed. As the evaluation metric, Spearman's rank correlation coefficient [32] was simply used, but its validity was not discussed. This inadequate discussion is an obstacle to tackling the new theme.
In this article, we refer to the technique denoted by RSOD as SOD with importance degree (SOD-ID); refine it into a new theme by discussing its definition, significance, and assessment; and propose an SOD-ID method that outperforms state-of-the-art methods. First, we discuss and construct the definition and significance of SOD-ID using several pieces of evidence and application examples. We represent the importance score in N degrees and refer to this as the importance degree. Based on this discussion, we present the evaluation procedure and the dataset preparation of SOD-ID. We also propose an assessment index based on the squared error and Kendall rank correlation coefficient [33], and create a dataset based on those of SD and instance segmentation. Finally, we propose an SOD-ID method based on deep learning and instance segmentation.
Our contributions are summarized as follows:
• We introduce and define a new theme, SOD-ID, via a discussion and examples.
• We introduce the importance degree using N, which is the generalized importance degree of SOD, and show its efficacy and advantages.
• We introduce valid dataset preparation and an effectual evaluation procedure for SOD-ID to evaluate methods without actually applying them to image applications, which contributes to the development of this theme.
• We propose an SOD-ID method via combining instance segmentation and SD based on deep learning as a separable system, which is also useful for SOD.
In simulations, the proposed method perceptually demonstrated N-degree salient objects, and objectively outperformed RSOD [25]. The proposed method accurately detected salient objects and estimated their importance degree values. The proposed method was objectively compared with RSOD using Spearman's rank correlation coefficient [32], the Kendall rank correlation coefficient [33], and the proposed metric, and obtained better scores. Moreover, in the evaluation procedure of SOD, the results of the proposed method were objectively comparable with those of state-of-the-art SOD methods; therefore, we demonstrated that it is also effective for SOD.
The remainder of this article is organized as follows: In Section II, we provide an overview of existing methods of SD, SOD, fixation estimation, semantic segmentation, and instance segmentation. In Section III, we briefly present the fundamentals of SOD and RSOD. In Section IV, we discuss the definition, significance, evaluation metric, and dataset of SOD-ID. In Section V, we explain the proposed method. Finally, we present experimental comparisons in Section VI, and conclude this article in Section VII.

II. RELATED WORKS
In this section, we explore existing methods of SD, SOD, semantic segmentation, and instance segmentation. First, we describe SD and human eye fixation. Second, we explain two types of SOD and RSOD. Finally, we discuss the difference between semantic segmentation and instance segmentation, and review their recent methods.
SD is similar to human eye fixation; that is, they estimate regions of interest that correspond to human attention [1]- [7], [34], [35]. The traditional SD method uses characteristic features, such as high contrast, unique orientation, and distinctive color [1]. Harel et al. proposed a method that uses a graph-based algorithm, calculates activation maps based on several features, and combines them to generate one saliency map [2]. Recently, methods based on a convolutional neural network (CNN) have been proposed, and effectively extract global and complex features as a result of training using a large number of images and their corresponding gaze information [6], [7]. Although they accurately estimate human interest, they cannot estimate object contours.
SOD simultaneously estimates object regions and whether they are salient [8]-[24], [26], [27]. Traditional SOD methods use the propagation algorithm [11], [22], [36]. They iteratively propagate salient and background information based on color similarities between neighboring pixels and the Markov absorption probability. However, they often produce inaccurate results along object boundaries. Recent methods based on fully convolutional network (FCN) architectures have successfully reduced inaccurate detection. Liu and Han proposed a deep hierarchical saliency network that realizes coarse-to-detailed estimation for salient objects [21]. Another method adopts a recurrent network to consider the connection of salient pixels [16].
Although a major SOD dataset contains the importance degree for objects [10], existing methods produce binary results; that is, they classify detected objects into salient or non-salient. The PASCAL-S dataset provides integer saliency values in [0, 255] with object contours. However, SOD methods disregard the priority of each object, and instead focus on estimating the contours of salient objects. Because detecting salient objects and their correct contours is a challenging task, researchers generally propose the estimation of the priority of each detected object as future work.
Semantic segmentation is a technique that identifies categories to which pixels belong, such as human, tree, and car [37]- [39]. Traditional semantic segmentation uses contour detection and the histogram of oriented gradients feature [37]. Recently, the FCN, which is a breakthrough approach for semantic segmentation, has been used to successfully detect image regions [38]. However, semantic segmentation methods cannot separate objects that belong to the same category.
Instance segmentation is derived from semantic segmentation and can identify not only object classes but also their instances [40], [41]. A basic instance segmentation method uses an FCN to detect small windows that each include one object [41]. Another method uses a recurrent architecture to iteratively detect object regions based on previous detection results [40]. Although instance segmentation and SOD similarly detect object contours, instance segmentation disregards the importance of the objects; therefore, the purposes of the two approaches are different.

III. FUNDAMENTALS OF SOD
A. PASCAL-S DATASET
The PASCAL-S dataset contains images, their fixation data, and their SOD maps with multiple values that can be used as ground truth (GT) for SOD-ID [10]. It contains 850 natural images whose full segmentation masks are provided in [42]. The fixation data were obtained by using an eye tracker on eight subjects who were instructed to perform a free-viewing task for the images. In the SOD experiment, 12 subjects were given the images and asked to highlight salient objects by clicking on them. The pixels of the SOD maps have integer values in [0, 12], and they are linearly normalized to [0, 255] for the PNG format. Therefore, we regard PASCAL-S as an SOD-ID dataset with 13 degrees.
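The linear normalization between click counts and 8-bit gray levels can be illustrated with a minimal sketch. The function names `clicks_to_gray` and `gray_to_clicks` are hypothetical, and the use of rounding to the nearest integer is an assumption; the exact rounding used when the PASCAL-S maps were stored is not stated here.

```python
def clicks_to_gray(clicks, n_subjects=12):
    """Map a click count in [0, n_subjects] to an 8-bit gray level in [0, 255].

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if not 0 <= clicks <= n_subjects:
        raise ValueError("click count out of range")
    return round(clicks * 255 / n_subjects)


def gray_to_clicks(gray, n_subjects=12):
    """Recover the click count (degree index) from an 8-bit gray level."""
    return round(gray * n_subjects / 255)


# The 13 degrees of a 12-subject experiment map to 13 distinct gray levels.
levels = [clicks_to_gray(k) for k in range(13)]
```

Under this assumption, the mapping is invertible for all 13 degrees, which is what allows the stored maps to be read back as importance degree values.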

B. FCN METHODS
The FCN was introduced for image classification based on Visual Geometry Group (VGG) networks in [43], and then several FCN architectures were proposed for several applications [6], [34], [35], [38]. The VGG architecture consists of five blocks that each have two or three convolutional layers and a pooling layer. The FCN architecture is constructed by replacing the last layer of the VGG architecture with a one-channel convolutional layer. Some methods that apply merge and convolution layers to the FCN obtain superior results to past methods because the layers realize both shallow and deep convolutions; thereby, they can capture both global and local features [6], [38].

C. LOCATION-BIASED DETECTION
In SD and SOD, the location assumption is generally used as prior information [7], [11], [13], [15], [24], [36]. Photographers generally center interesting objects in images, and thus natural images often present salient areas at their center. To exploit this tendency, some SOD methods apply higher weights to salient pixels closer to the center of images [13], [24]. Following this strategy, in an SD method, a location-biased convolution layer was introduced in the FCN, which obtained superior results [7].

D. RSOD
The CNN model of RSOD detects the contours of salient objects and estimates their multiple saliency values through its architecture [25]. The architecture recursively calculates saliency maps from coarse to fine levels, and finally fuses the resultant saliency maps. The calculation units are learned using the multi-stage GT of the saliency maps, which is generated from PASCAL-S by thresholding its saliency maps at various values. Therefore, the fused maps have various pixel values that reflect saliency levels from coarse to fine.
As an additional process, the method estimates the importance score for each salient object from the output saliency map [25]. In basic terms, the score value is calculated by averaging the saliency values of pixels within the object as

$$\mathrm{Score}(S, X) = \frac{1}{N_X} \sum_{i \in \mathcal{X}} \chi_i, \tag{1}$$

where $S$, $X$, $\mathcal{X}$, $\chi_i$, and $N_X$ denote a predicted saliency map, a candidate salient object, the set of indices of pixels that belong to $X$, the saliency value of the i-th pixel, and the total number of pixels in $X$, respectively. It is unknown whether the calculated values are normalized because this is not clearly described in [25]. Note that the authors used the GT segmentation masks of PASCAL-S in this process. In experiments, the method simply uses conventional methods for evaluation. Spearman's rank correlation coefficient [32] is used as the evaluation metric, and the resultant scores are linearly normalized to [0, 1]. PASCAL-S without the images used in training is directly used for testing the method.
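The averaging step described above can be sketched in a few lines. This is only an illustration of the per-object mean, not the implementation of [25]; the function name `average_object_score`, the nested-list data layout, and the toy values are assumptions made for this example.

```python
def average_object_score(saliency, mask):
    """Mean saliency over the pixels of one candidate object.

    saliency: 2-D list of floats (a predicted saliency map).
    mask: 2-D list of 0/1 values with the same shape (the object region).
    """
    total = 0.0
    count = 0
    for sal_row, mask_row in zip(saliency, mask):
        for value, inside in zip(sal_row, mask_row):
            if inside:
                total += value
                count += 1
    if count == 0:
        raise ValueError("object mask is empty")
    return total / count


# Toy 2x3 saliency map and an object that covers three pixels.
saliency_map = [
    [0.9, 0.8, 0.1],
    [0.7, 0.2, 0.1],
]
object_mask = [
    [1, 1, 0],
    [1, 0, 0],
]
object_score = average_object_score(saliency_map, object_mask)
```

Because the sum is divided by the pixel count, a small object and a large object with the same mean saliency receive the same score, which is the property that the summation-based procedure discussed later changes.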

IV. DISCUSSION ON SOD-ID
A. DEFINITION OF SOD-ID
We define SOD-ID as a technique that detects the contours of salient objects and estimates their importance degree. Its methods produce a saliency map whose pixel values represent the importance degree scores of the objects to which they belong. SOD-ID is largely similar to SOD, but in contrast to the binary maps of SOD, its GT saliency maps have several values for N-degree objects, as shown in Fig. 2. N-degree means that the maps have integer values in [0, N − 1], where zero indicates that the pixel of the map belongs to a non-salient object, and the values are linearly normalized according to the coding format. As mentioned in Section III-A, PASCAL-S can be regarded as an SOD-ID dataset with N = 13 according to its experiments. Moreover, note that SOD-ID is a generalized version of SOD; that is, SOD is SOD-ID with N = 2.
In this article, N = 7 is empirically used based on the characteristics of natural images. Table 1 shows the distribution of natural images in PASCAL-S [10] with respect to the number of salient objects within them, where the first, second, and last rows denote the number of salient objects, the number of images that include the corresponding number of salient objects, and the distribution, respectively, and "7+" in the eighth column indicates seven or more salient objects. From Table 1, natural images typically contain six or fewer salient objects. They rarely contain seven or more salient objects, and in most such cases, some objects in one image have the same saliency levels. Therefore, because 7 degrees adequately realize SOD-ID for natural images, N = 7 is generally valid. The value of N can be chosen flexibly for various image applications.
VOLUME 8, 2020
TABLE 1. Distribution of natural images in PASCAL-S [10] with respect to the number of salient objects.

B. SIGNIFICANCE OF SOD-ID
SOD-ID is a generalization of SOD and more suitable for image applications than SOD. People ordinarily rank objects in an image with respect to their interests. Similarly, it has been observed in experiments that subjects sometimes recognize salient objects as non-salient because of the objects' locations. SOD-ID estimates general results of this ranking, and therefore saliency information produced by SOD-ID is more related to human behavior than SOD. Moreover, by thresholding with various parameters as post-processing, SOD-ID produces various saliency maps of SOD. SOD-ID, which is used as pre-processing, results in a variety of saliency information useful to image applications, such as retargeting, content-aware coding, and summarizing.
For instance, our experiments indicate that SOD-ID is more suitable than SOD for image retargeting. Similar to Fig. 1, Fig. 3 shows retargeting results according to [30] for a multi-object image. The input image in Fig. 3 (a) represents "dogs pull a sled and a human rides," and therefore its important words are "dog," "pull," "sled," "human," and "ride." Image retargeting should retain the important words and sentences of input images in its results. In that sense, Fig. 3 (d) shows a failure because the dog is not clearly visible, and hence the result represents the wrong sentence, "something pulls a sled and a human rides." By contrast, Fig. 3 (e), the retargeting result for SOD-ID, accurately represents the original sentence, "dogs pull a sled and a human rides." For other images and retargeting methods, the results sometimes demonstrate the superiority of SOD-ID for image retargeting, as in Figs. 1 and 3.
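The thresholding post-processing mentioned in this subsection, by which one SOD-ID map yields several SOD maps, can be sketched as follows. The function name `threshold_degree_map` and the toy map with N = 7 are hypothetical and serve only as an illustration.

```python
def threshold_degree_map(degree_map, min_degree):
    """Binarize an N-degree map: 1 where the degree is at least min_degree."""
    return [[1 if d >= min_degree else 0 for d in row] for row in degree_map]


# Toy SOD-ID map with N = 7, that is, integer degrees in [0, 6].
sod_id_map = [
    [0, 3, 3],
    [6, 6, 0],
    [1, 0, 0],
]

# Every object with a non-zero degree is treated as salient.
all_salient = threshold_degree_map(sod_id_map, 1)
# Only the most important object is kept.
most_salient = threshold_degree_map(sod_id_map, 6)
```

Varying `min_degree` between 1 and N − 1 produces a family of binary maps, so an application can choose how selective its saliency input should be without rerunning the detector.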

C. SUPERVISED EVALUATION METRIC
The supervised evaluation metric of SOD-ID should measure the degree of similarity with respect to both segmentation and the importance degree. Because SOD-ID methods aim to detect the contours of salient objects, they should be evaluated in the same manner as segmentation. Additionally, they should be evaluated by calculating the correlation and the similarity of the importance degree scores. An object that has a higher score than another object in the GT should also have a higher score in the results of SOD-ID methods, and the difference between the score values of the GT and those of the results should be as small as possible. Unfortunately, because conventional rank correlation coefficients evaluate the correlation but ignore the similarity of scores, Spearman's rank correlation coefficient, which is used in [25], is unsuitable for evaluating the importance degree.
In this article, we propose an evaluation metric for the importance degree of SOD-ID. As the evaluation metric for segmentation, conventional methods, for example, the F-measure, can be used. An evaluation metric for SOD-ID is then defined as a linear combination of the F-measure and the proposed metric, or as their parallel use. The proposed metric F, given in (2), simply combines metrics for the correlation and the score similarity, where R, I, α, v_p, and v_t denote the correlation metric, the similarity metric, a balancing free parameter, and two vectors whose elements are the score values of the objects, respectively. We use the Kendall rank correlation coefficient as R [33] because it straightforwardly evaluates the correlation and therefore is more suitable than Spearman's rank correlation coefficient. For I, we use the squared error; its definition in (3) involves N, v_pi, and v_ti, which denote the number of objects and the i-th elements of v_p and v_t, respectively, and σ, which is a free parameter that controls the variance of the Gaussian distribution. Proposing a metric requires much experimental evidence, but because of the limited space in this article, the validity of F is briefly shown in Section VI-A, and a detailed discussion on this topic remains as future work.
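To make the structure of such a metric concrete, the following is a minimal sketch under explicitly assumed forms, because the exact equations are not reproduced in this text. It assumes a convex combination F = αR + (1 − α)I, a similarity term I = exp(−MSE / (2σ²)) built from the mean squared error, and the tie-free tau-a variant of the Kendall coefficient. All function names are hypothetical, and the defaults α = 0.5 and σ = 2.0 are the values reported for the simulations.

```python
import math
from itertools import combinations


def kendall_tau(a, b):
    """Kendall rank correlation (tau-a): (concordant - discordant) / pairs."""
    pairs = list(combinations(range(len(a)), 2))
    if not pairs:
        raise ValueError("need at least two objects")
    balance = 0
    for i, j in pairs:
        product = (a[i] - a[j]) * (b[i] - b[j])
        if product > 0:
            balance += 1  # concordant pair
        elif product < 0:
            balance -= 1  # discordant pair
    return balance / len(pairs)


def score_similarity(v_p, v_t, sigma):
    """Assumed form: Gaussian of the mean squared error between the scores."""
    mse = sum((p - t) ** 2 for p, t in zip(v_p, v_t)) / len(v_t)
    return math.exp(-mse / (2.0 * sigma ** 2))


def sod_id_metric(v_p, v_t, alpha=0.5, sigma=2.0):
    """Assumed form: alpha * R + (1 - alpha) * I."""
    rank_term = kendall_tau(v_p, v_t)
    value_term = score_similarity(v_p, v_t, sigma)
    return alpha * rank_term + (1.0 - alpha) * value_term


v_t = [6, 4, 2, 1]            # hypothetical GT degrees of four objects
f_same = sod_id_metric(v_t, v_t)
f_shifted = sod_id_metric([5, 3, 1, 0], v_t)   # same rank, offset values
f_reversed = sod_id_metric([1, 2, 4, 6], v_t)  # reversed rank
```

With these assumed forms, a prediction that preserves the ranking but shifts every value scores lower than a perfect prediction and higher than one that reverses the ranking, which matches the stated requirement that the metric account for both the rank correlation and the value similarity.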

D. DATASET PREPARATION
To create SOD-ID datasets, the procedure of PASCAL-S mentioned in Section III-A is suitable. The segmentation masks are obtained manually, and the importance degree is determined as follows. Under a strict procedure, the subjects of the experiments are asked to both collect and rank interesting objects in one image. This strict procedure requires several subjects, and ranking is a difficult task for them. By contrast, the procedure of PASCAL-S only asks subjects to collect interesting objects. For each object, the number of subjects that recognize it as salient is directly used as its importance degree value, and to create a GT map of SOD-ID, the pixels within each salient object are uniformly assigned this score based on the segmentation mask. If M subjects participate, the resultant map has M + 1 degrees, that is, integer values in [0, M], as in PASCAL-S, where 12 subjects yield 13 degrees. This is simple and useful, but a large number of subjects is required to create general datasets.
To avoid experiments using subjects, we introduce a preparation procedure for the SOD-ID dataset based on existing SD data. As mentioned above, subjective experiments have the troublesome characteristic of requiring many people and large costs. To avoid this, we use existing SD data to produce the SOD-ID maps. The proposed procedure calculates the sum of the pixel values within each object in the GT maps of SD, and the resultant values are considered as the importance degree scores, which are defined in one image as

$$\mathrm{Deg}_i = \sum_{j \in \Omega_i} s_j, \tag{4}$$

where $\mathrm{Deg}_i$, $s_j$, and $\Omega_i$ denote the score of the i-th object, the j-th pixel value of the SD map, and the set of indices of pixels within the i-th object, respectively. To produce the SOD-ID map, the pixel values within the i-th object are uniformly set to $\mathrm{Deg}_i$, and the resultant map is linearly quantized using N. Because the GT maps of SD represent the degree of saliency for each pixel, the summed value within an object approximately represents the degree of interest in the object. Similarly, a pixel value within an object in the GT maps of SD is approximately considered as the number of subjects that recognize the object and categorize it as salient, and therefore, in the case of a large number of subjects, the summation procedure is regarded as equivalent to that of PASCAL-S for SOD mentioned in Section III-A. Based on the above assumptions, we believe that the proposed procedure is valid for creating SOD-ID datasets.
We experimentally show that the proposed procedure has high validity compared with the RSOD procedure mentioned in Section III-D [25]. Using these procedures, SOD-ID maps are produced from the full segmentation masks and fixation data of PASCAL-S. Table 2 shows this comparison, where "Sum." and "Ave." denote the results of the proposed and RSOD procedures, respectively; that is, they show the values of the evaluation metrics between the SOD maps of PASCAL-S and the resultant maps. For simplicity, we use Spearman's and Kendall rank correlation coefficients as the metrics [32], [33]. From Table 2, the proposed procedure is clearly better than the RSOD procedure, which supports the assumptions above.
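The summation-based preparation procedure can be sketched as follows. The per-object sum follows the description above; the quantization step is an assumption of this sketch (sums are scaled by the largest sum in the image and rounded to integers in [0, N − 1]), because the text only states that the map is linearly quantized using N. The function names and the convention that label 0 marks the background are also hypothetical.

```python
def importance_degrees(sd_map, instance_labels, n_degrees=7):
    """Sum SD values per object, then quantize the sums to [0, n_degrees - 1].

    sd_map: 2-D list of saliency values.
    instance_labels: 2-D list of integer object labels (0 = background,
        which is an assumption of this sketch).
    """
    sums = {}
    for sd_row, label_row in zip(sd_map, instance_labels):
        for value, label in zip(sd_row, label_row):
            if label != 0:
                sums[label] = sums.get(label, 0.0) + value
    if not sums:
        return {}
    top = max(sums.values())
    if top <= 0:
        return {label: 0 for label in sums}
    # Assumed quantization: scale by the largest sum in the image.
    return {
        label: round(total / top * (n_degrees - 1))
        for label, total in sums.items()
    }


def render_degree_map(instance_labels, degrees):
    """Assign each object's degree uniformly to all of its pixels."""
    return [[degrees.get(label, 0) for label in row] for row in instance_labels]


sd_map = [
    [0.9, 0.9, 0.0],
    [0.6, 0.1, 0.2],
]
labels = [
    [1, 1, 0],
    [1, 2, 2],
]
degrees = importance_degrees(sd_map, labels)
degree_map = render_degree_map(labels, degrees)
```

In this toy image, object 1 accumulates most of the saliency and receives the top degree, whereas object 2 receives a low but non-zero degree, so it remains salient in the resulting map.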

V. PROPOSED SOD-ID METHOD
A. OVERVIEW
The proposed SOD-ID method is briefly shown in Fig. 4. The system consists of three technical blocks: instance segmentation, SD, and importance degree estimation. First, instance segmentation is applied to an input image to detect object contours; any instance segmentation method can be used here, such as those in [40], [41], [47], [48]. Second, the salient regions of the input image are detected by the proposed CNN method using the object contours detected in the first block. Finally, using the results of the first and second blocks, the proposed method outputs an SOD-ID map with N degrees through the importance degree estimation block. The technical blocks can be developed independently, and therefore the system provides suitable expandability and serves as a fundamental design of SOD-ID methods.

B. PROPOSED CNN METHOD FOR SD
In this section, we explain the proposed CNN method for SD in the second block, which uses the contours detected in the first block. The architecture uses the contours as a part of the input and extracts their multi-resolution features to estimate the saliency values. The loss function imposes different weights for object and background regions based on the contours. Note that the proposed CNN method considers location bias similarly to conventional SD and SOD methods. Fig. 5 and Table 3 show the architecture of the proposed CNN method and its parameters, respectively. Figs. 5 (a)-(c) correspond to Tables 3 (a)-(c), respectively. In Table 3, "Conv.", "Pool.", and "p*" indicate convolution layers, max pooling layers, and the pyramid pooling module, respectively. The rectified linear unit [49] is used as the activation function in the convolution layers. A VGG-based method is used to extract image features in Fig. 5 (a). The results of the first block and the features after Pool.3, Pool.4, and Conv.5-3 are merged along the channel direction, and the merged signals are input into Conv. 6-1. The signals after Conv. 6-2 are transformed using the pyramid pooling module proposed in [39], and the resultant signals are resized to the same size as the signals after Conv. 6-2. Finally, the resized signals and those after Conv. 6-2 are merged along the channel direction, and processed through Conv. 7-1, 7-2, and 7-3.

2) LOSS FUNCTION
The loss function of the proposed CNN method assigns high and medium weights to salient and object regions, respectively, and low weights to background regions because they are generally uninteresting. The loss function L is formulated in terms of y_i, x_i, O_i, φ(·), and β, which denote the true saliencies, the estimated saliencies, the object region masks, a normalization function, and a free parameter, respectively. The masks are produced by binarizing the instance segmentation results. φ(·) normalizes the estimated saliency values to [0, 1]. We generally set β to 2 or to the maximum value of O_i + y_i. If the i-th pixel is in a salient object, β − (O_i + y_i) is a low value, and hence this pixel is assigned a high weight.

3) TRAINING
For training, the loss function in Section V-B2 and the training datasets of COCO and SALICON were used [45], [46]. COCO contains natural images and their segmentation masks, and SALICON has saliency maps that correspond to them. The maps were binarized using a threshold value of τ = 0.15, and their elements corresponding to background pixels, which were detected by the masks, were set to zero. Stochastic gradient descent was used as the optimizer, where the Nesterov momentum, the weight decay, and the learning rate were set to 0.9, 0.5, and $10^{-3}$, respectively [50]. β in the loss function was set to 2.3, which was experimentally determined from the ratios of salient, object, and background regions.

C. ESTIMATION OF THE IMPORTANCE DEGREE
In the proposed method, the estimation block process is defined similarly to the proposed procedure in Section IV-D. Object contours are already detected in the first block and their saliency values are estimated in the second block. In the third block, the values within one object contour are summed and the result is its score of the importance degree as given in (4). Similar to the proposed procedure, SOD-ID maps are created based on the resultant scores and linearly quantized with N .

VI. SIMULATION
In this section, we compare the performance of the proposed method with that of state-of-the-art methods for SOD and SOD-ID. We present the comparisons in Sections VI-B and VI-C, respectively, and before that, we discuss the validity of the proposed metric in Section VI-A by presenting some examples. For this simulation, we used the instance segmentation method proposed in [40] in the first block of the proposed method because, although it is not recent, it has high accuracy. Based on Section IV-D, we created a dataset from the test sets of COCO and SALICON, which contain images with segmentation masks and their SD maps, respectively; this dataset is called the SALICON-based dataset in this section. Note that the proposed method is also denoted by Prop. in this section.

A. VALIDITY OF THE PROPOSED METRIC
As mentioned in Section IV-C, the validity of the proposed metric is briefly shown in this section. Table 4 shows the scores of pairs of arbitrary vectors for Spearman's and Kendall rank correlation coefficients and the proposed metric. In Table 4, the pairs from top to bottom represent the following scenarios: same rank and slightly different values, slightly different rank and values, same rank and quite different values, and quite different rank and slightly different values. As mentioned in Section IV-C, SOD-ID metrics have to simultaneously evaluate the rank correlation and the value similarity. In that sense, the first and third pairs show that only the proposed metric satisfies this property. We observed from the second and fourth pairs that the Kendall coefficient is too sensitive to the rank difference to be used as the SOD-ID metric. In the fourth pair, the ranks are quite different, but the values are almost the same, and hence the importance of the objects is considered to be comparable. However, the score obtained using Spearman's coefficient is rather low, which shows that its weighting of the rank correlation and the value similarity is unbalanced. The proposed metric is clearly more suitable as the SOD-ID metric than the two coefficients.

1) SETTINGS
HDCT [15], RFCN [16], DHS [21], DSSOD [24], and RSOD [25] were used as SOD methods for comparison. The methods were applied to the test sets of the DUTS, PASCAL-S, and SALICON-based datasets [10], [44]-[46], and the results were evaluated using the F-measure [51]. To calculate the F-measure, the saliency maps of the PASCAL-S and SALICON-based datasets and the results of the methods must be binarized. Because we set N = 7 in this article and the maps of PASCAL-S have integer values in [0, 255] with N = 13, objects whose importance degree scores were one or more were recognized as salient for the SALICON-based dataset, and therefore the maps of PASCAL-S were binarized with a threshold value of 36. Accordingly, the results of HDCT, RFCN, DHS, and DSSOD were binarized with a threshold value of 0.14, and for RSOD and Prop., the value was 1.
Unfortunately, the instance segmentation method often detected nothing for DUTS because of its above characteristic, as shown in the upper half of Table 5. However, the results of Prop. except in that case were equivalent to those of the other methods. This problem can be solved by using an efficient instance segmentation method that accurately detects objects.
TABLE 9. Scores for the estimation of the importance degree for the SALICON-based dataset [45], [46].
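The binarization and F-measure computation described in the settings above can be sketched as follows. This is a generic weighted F-measure, not necessarily the exact configuration of [51]: the default `beta2 = 0.3` is a common convention in the SOD literature and is an assumption here, as are the function names, the use of a "greater than or equal to" comparison, and the toy maps.

```python
def binarize(saliency, threshold):
    """Mark a pixel as salient when its value is at least the threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def f_measure(pred, truth, beta2=0.3):
    """Weighted F-measure between two binary maps of the same shape."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    if tp == 0:
        return 0.0  # no correctly detected salient pixel
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


# Toy continuous result in [0, 1], binarized with the threshold 0.14.
result = [
    [0.9, 0.2],
    [0.05, 0.6],
]
ground_truth = [
    [1, 0],
    [0, 1],
]
binary_result = binarize(result, 0.14)
score = f_measure(binary_result, ground_truth)
```

For maps that store integer degrees, the same `binarize` call with a threshold of 1 treats every object with a non-zero degree as salient, which corresponds to the setting used for RSOD and Prop.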

C. COMPARISON OF THE PROPOSED METHOD WITH SOD-ID METHOD
1) SETTINGS
In SOD-ID, Prop. was compared with RSOD, which is the only existing SOD-ID method. The methods were applied to the PASCAL-S and SALICON-based datasets, and the results were evaluated using Spearman's and Kendall rank correlation coefficients and the proposed metric (2), where α and σ were experimentally set to 0.5 and 2.0, respectively. The GT and resultant maps were uniformly normalized with N = 7. Tables 8 and 9 show the scores of RSOD and Prop. in the metrics for each dataset, and Figs. 6 and 7 show the images and GT maps in the datasets, and their resultant maps, where high pixel values in the maps indicate high importance degree scores.

2) EVALUATION
Note that the rows in Tables 8 and 9 correspond to those in Figs. 6 and 7, respectively. From Tables 8 and 9, Prop. clearly outperformed RSOD in terms of the metrics. From Figs. 6 and 7, Prop. accurately estimated the importance degree of objects. In particular, in "Party," "Woman," and "Man," Prop. estimated the importance degree of small objects that had low saliency scores and were located in highly salient objects.

VII. CONCLUSION
In this article, we introduced SOD-ID by discussing its definition, significance, dataset condition, and evaluation metric property, and proposed its dataset, metric, and method. The proposed metric consists of the Kendall rank correlation coefficient and the mean squared error, and simultaneously evaluates the rank correlation and value similarity for SOD-ID. The proposed dataset is generated using the proposed procedure based on the COCO and SALICON datasets. The proposed method of SOD-ID consists of three processing blocks: instance segmentation, SD, and importance degree estimation. We proposed a CNN-based SD method for the second block that uses the results of the first block. With this strategy, the proposed method objectively outperformed state-of-the-art methods with respect to SOD and achieved an accurate SOD-ID.