Listwise View Ranking for Image Cropping

Ranking-based learning with deep neural networks has been widely used for image cropping. However, the performance of ranking-based methods is often poor, mainly for two reasons: 1) image cropping is a listwise ranking task rather than a pairwise comparison; 2) the rescaling caused by pooling layers and the deformation introduced in view generation damage composition learning. In this paper, we develop a novel model to overcome these problems. To address the first problem, we formulate image cropping as a listwise ranking problem of finding the best view composition. For the second problem, a refined view sampling (called RoIRefine) is proposed to extract refined feature maps for candidate view generation. Given a series of candidate views, the proposed model learns the Top-1 probability distribution of views and picks the best one. By integrating refined sampling and listwise ranking, the proposed network, called LVRN, achieves state-of-the-art performance in both accuracy and speed.


Introduction
Image cropping is a common photo manipulation process that improves the overall composition by removing unwanted regions. It is widely used in the photographic, film processing, graphic design, and printing businesses. Recent methods tend to learn photo composition and extract well-composed regions from ill-composed photos.
With the development of deep learning, most researchers have devoted their efforts to deep networks based on a ranking approach. For ranking-based training, a number of candidate views in each image are labelled with an aesthetic ordering. The image cropping task is then formalized as the classification of view pairs into two categories (correctly ranked and incorrectly ranked). Finally, a sliding window [Chen et al., 2017b], a detector [Wang and Shen, 2017] or a reinforcement learning agent [Li et al., 2018] is adopted to find the best view.
In [Chen et al., 2017a], Chen et al. first investigated learning-to-rank methods for image cropping. The view finding network (VFN) [Chen et al., 2017b], based on a pairwise ranking layer, was proposed to model photo composition and crop images with a sliding window. Wei et al. trained a view evaluation network (VEN) [Wei et al., 2018] with a pairwise Siamese architecture. Inspired by knowledge distillation [Hinton et al., 2015] and anchor boxes [Liu et al., 2016], the view proposal network (VPN) was proposed to transfer knowledge from VEN. In [Li et al., 2018], reinforcement learning is adopted to crop the image step by step, with each step controlled by the aesthetic score generated by a pre-trained VFN.
However, these ranking-based cropping methods often perform poorly, mainly for two reasons. First, pairwise training is unsuitable for the image cropping process. In image cropping, the main goal is to pick the best-composed view from a list of candidate views. That is, image cropping is a listwise ranking task rather than a pairwise comparison. In addition, pairwise training depends heavily on careful pair selection, because samples drawn from differing distributions introduce training bias. Therefore, pairwise training significantly increases the computational complexity and makes the training procedure unstable.
Second, coarse features extracted from a convolutional neural network (CNN) reduce the accuracy of model learning. Previous methods crop and warp the views in raw images or feature maps, and then calculate the rank score for each one. Pixel accuracy matters more in image cropping than in object classification, and the warp operation reduces it. In addition, the rescaling caused by the pooling layers lowers the sampling resolution and damages composition learning.
To overcome these problems of image cropping, we propose a listwise ranking method with refined view sampling. In refined view sampling, a novel region of interest (RoI) operation called RoIRefine is proposed to extract refined feature maps of candidate views. Instead of carefully selecting the view pairs, we take advantage of all annotated views and train the model with a listwise ranking loss.
In summary, our main contributions are:
• We learn a deep network for image cropping with listwise ranking.
• We propose a refined view sampling named RoIRefine to alleviate the problem of rescaling and distortion.
• The proposed model significantly outperforms the state-of-the-art methods in both accuracy and speed.
et al., 2007], visual saliency information, face and skin color detection results are combined for placing the bounding box in image cropping. Aesthetic-based methods emphasize the general attractiveness of the cropped image. [Fang et al., 2014] proposed an aesthetic photo cropping system that combines three models: visual composition, boundary simplicity and content preservation. A set of aesthetic quality classifiers was trained to discriminate the quality of candidate windows [Wang and Shen, 2017]. With the development of datasets labelled with comparative aesthetic scores, ranking-based methods have been adopted to grade the composition of candidate windows [Kong et al., 2016; Chen et al., 2017a].
Recently, ranking-based methods combined with other novel frameworks (e.g. knowledge transfer [Wei et al., 2018] and reinforcement learning [Li et al., 2018]) have achieved state-of-the-art performance.

Learning to Rank
Ranking is widely used in information retrieval [Yao et al., 2016], recommender systems [Li et al., 2016] and software engineering [Xuan and Monperrus, 2014]. In a learning-to-rank task, the training data consists of lists of items with some partial order specified between the items in each list. Most ranking algorithms fall into three groups according to their input representation and loss function: the pointwise, pairwise, and listwise approaches [Liu and others, 2009]. Pointwise approaches assume that each item in the training data has a numerical or ordinal score, so the learning-to-rank problem can be approximated by a regression problem. Ordinal regression and classification algorithms can be used to predict the score of a single item. For example, the perceptron ranking (PRank) algorithm was proposed to find a rank-prediction rule that assigns each instance a rank order [Crammer and Singer, 2002].
The pairwise approach formalizes the learning task as the classification of object pairs into two categories (correctly ranked and incorrectly ranked). RankNet [Burges et al., 2005] learned a ranking rule using gradient descent and a natural probabilistic cost function on pairs of examples. RankBoost [Freund et al., 2003] used boosting to train a ranking model by minimizing classification errors on instance pairs. Listwise approaches try to directly optimize an objective over all items in the training data. ListNet [Cao et al., 2007] defines a listwise loss function for learning to rank and introduces two probability models, referred to as permutation probability and Top-1 probability. Suppose that π is a permutation of the n objects, and Φ(·) is an increasing and strictly positive function. Then, given the list of scores s, ListNet defines the probability of permutation π as
$$P_s(\pi) = \prod_{j=1}^{n} \frac{\Phi(s_{\pi(j)})}{\sum_{k=j}^{n} \Phi(s_{\pi(k)})} \quad (1)$$
and the Top-1 probability of object j is defined as
$$P_s(j) = \sum_{\pi:\, \pi(1) = j} P_s(\pi) \quad (2)$$
where π(1) = j means that object j is ranked first in permutation π. Thus, from Eq. (1) and (2), we can obtain
$$P_s(j) = \frac{\Phi(s_j)}{\sum_{k=1}^{n} \Phi(s_k)} \quad (3)$$
where $s_j$ is the score of object j = 1, 2, . . . , n.
In general, with the use of top one probability, cross entropy is used to represent the distance between the two given score lists.
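The relation between Eq. (1), (2) and (3) can be checked numerically on a short list. The following is a minimal sketch, assuming Φ is the exponential function (one common choice; any increasing, strictly positive function would do) and using illustrative function names that do not come from the paper.

```python
import itertools
import math


def permutation_probability(scores, perm, phi=math.exp):
    """Eq. (1): probability of one permutation of the objects."""
    prob = 1.0
    for j in range(len(perm)):
        # Normalize over the objects that are still unranked at position j.
        denom = sum(phi(scores[perm[k]]) for k in range(j, len(perm)))
        prob *= phi(scores[perm[j]]) / denom
    return prob


def top1_by_enumeration(scores, j, phi=math.exp):
    """Eq. (2): sum over all permutations that rank object j first."""
    n = len(scores)
    return sum(
        permutation_probability(scores, perm, phi)
        for perm in itertools.permutations(range(n))
        if perm[0] == j
    )


def top1_closed_form(scores, j, phi=math.exp):
    """Eq. (3): closed form of the Top-1 probability."""
    return phi(scores[j]) / sum(phi(s) for s in scores)


if __name__ == "__main__":
    example_scores = [2.0, 1.0, 0.5, -1.0]  # arbitrary illustrative scores
    for j in range(len(example_scores)):
        print(j, top1_by_enumeration(example_scores, j),
              top1_closed_form(example_scores, j))
```

Enumeration costs n! evaluations, which is why the closed form of Eq. (3) is the one used in practice.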

The Proposed Approach
In this paper, we propose a listwise view ranking network (LVRN) for image cropping. As illustrated in Figure 1, a refined view sampling (called RoIRefine) extracts high-resolution features to rank candidate views with a listwise loss.

Listwise View Ranking
To address the shortcomings of pairwise approaches, we formulate composition learning as a listwise ranking problem. The proposed model ranks the candidate views as a list and picks the best one.
Consider a set of annotated images $\{d_1, \ldots, d_m\}$, each with a list of views $V_i = \{v_i^1, \ldots, v_i^n\}$, where m is the number of images and n is the number of views per image. For each view $v_i^j$ in the i-th image, a rank score $y_i^j$ is labelled to represent the relative quality of its composition, giving the ground-truth scores $Y_i = \{y_i^1, \ldots, y_i^n\}$. For instance, n is 24 in the CPC dataset, which is labelled with a listwise protocol. In the view ranking network, we denote the rank function as f(·), which takes a view $v_i^j$ (sampled from image $d_i$) as input and outputs a rank score $f(v_i^j)$. For the i-th image, we obtain a list of scores $Z_i = \{f(v_i^1), \ldots, f(v_i^n)\}$ from the list of views $V_i$. Therefore, the ranking function f(·) can be optimized by minimizing the loss between $Z_i$ and the ground-truth scores $Y_i$.
Unlike pairwise approaches, which require careful selection of training pairs, listwise learning removes this training bias because all candidate views are seen in each iteration. Even so, some bias remains within the list: the best-composed view is more important than the worse ones. To address this, a nonlinear transformation Φ(·) is adopted to amplify the effect of the best view. We define Φ(·) as a common increasing function. According to Eq. (3), we rewrite the output scores as the Top-1 probability
$$P_{Z_i}(v_i^j) = \frac{\Phi(f(v_i^j))}{\sum_{k=1}^{n} \Phi(f(v_i^k))}.$$

Figure 1: The framework of listwise view ranking for image cropping. The model first applies VGG16 (truncated before the last max pooling layer) as the backbone to extract features from the input image. Given a series of candidate boxes, the refined view sampling (RoIRefine) extracts the RoI features within each bounding box. Three fully connected (FC) layers then generate the ordering distribution, and cross entropy measures the listwise distance (listwise loss).
Similarly, the ground-truth scores are rewritten as the probability $P_{Y_i}$. Following [Cao et al., 2007], we employ cross entropy as the metric to minimize the distance between the output probability $P_{Z_i}$ and the ground-truth probability $P_{Y_i}$. The loss function is defined as
$$L(Y_i, Z_i) = -\sum_{j=1}^{n} P_{Y_i}(v_i^j) \log P_{Z_i}(v_i^j).$$
The ranking function f(·) is found by minimizing the loss function $L(Y_i, Z_i)$ over the training images. Once the ranking function f(·) is learned, we simply use it to calculate the rank scores of the candidate views and crop the image to the best one.
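The listwise loss for a single image can be sketched in a few lines. This is an illustrative version only: it assumes Φ is the exponential function, so the Top-1 probability becomes a softmax, and the function names are hypothetical rather than taken from the authors' code.

```python
import math


def top1_probabilities(scores):
    """Top-1 probability of every view, assuming phi = exp (a softmax)."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(predicted_scores, ground_truth_scores):
    """Cross entropy between ground-truth and predicted Top-1 distributions."""
    p_y = top1_probabilities(ground_truth_scores)
    p_z = top1_probabilities(predicted_scores)
    return -sum(py * math.log(pz) for py, pz in zip(p_y, p_z))


if __name__ == "__main__":
    truth = [3.0, 2.0, 1.0, 0.0]       # illustrative rank scores of 4 views
    good = [2.9, 2.1, 0.8, 0.1]        # prediction that agrees with the order
    bad = [0.0, 1.0, 2.0, 3.0]         # prediction with the order reversed
    print(listwise_loss(good, truth), listwise_loss(bad, truth))
```

Because the exponential grows quickly, the best-scored view dominates the ground-truth distribution, which matches the stated aim of amplifying the effect of the best view.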

Refined View Sampling
Coarse features extracted from the CNN backbone limit the performance of image cropping. Previous methods generate candidate views from images and then warp them to a fixed size (e.g., 227 × 227 in VFN). However, warping is not suitable for composition learning because it deforms the view. The deformation of features seriously damages common composition rules, such as the golden ratio, the golden spiral and the rule of thirds. In addition, the rescaling and repeated downsampling in the CNN backbone make the model insensitive to view contents.
For view generation, there are three common RoI-aware operations, shown in Figure 2:
• RoIPool [Girshick, 2015] (Figure 2(a)) is a standard operation for extracting a small feature map from each RoI. The quantized RoI is subdivided into spatial bins, and the feature values covered by each bin are then aggregated. The quantizations introduce misalignments between the RoI and the extracted features.
• RoIAlign (Figure 2(b)) removes the harsh quantization of RoIPool, properly aligning the extracted features. In each RoI, RoIAlign uses bilinear interpolation to compute the exact features at four regularly sampled locations, and aggregates the RoI features using max/average pooling.
• RoIWarp (Figure 2(c)) was proposed in [Dai et al., 2016]. Unlike RoIAlign, RoIWarp crops a feature map region and warps it to a target size by interpolation. Even though RoIWarp also adopts bilinear resampling, it overlooks the alignment of RoIs with floating-point coordinates.
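The misalignment introduced by quantization can be made concrete with a small calculation. The sketch below assumes a feature stride of 16 (four pooling layers before the truncated VGG16 output) and rounds box coordinates to the nearest feature cell, as RoIPool implementations commonly do; the box values and function names are illustrative.

```python
def float_roi(box, stride):
    """Map an image-space box (x0, y0, x1, y1) to feature-map coordinates."""
    return tuple(c / stride for c in box)


def quantize_roi(box, stride):
    """RoIPool-style mapping: round each coordinate to a whole feature cell."""
    return tuple(round(c / stride) for c in box)


def misalignment_in_pixels(box, stride):
    """Per-edge shift, in image pixels, caused by the quantization."""
    return [abs(q * stride - c) for q, c in zip(quantize_roi(box, stride), box)]


if __name__ == "__main__":
    box = (37, 21, 203, 150)  # hypothetical candidate view in image pixels
    print(float_roi(box, 16))
    print(quantize_roi(box, 16))
    print(misalignment_in_pixels(box, 16))
```

With a stride of 16, each edge can move by up to half a cell (8 pixels), which is a noticeable shift for a composition-sensitive task on a 224 × 224 input.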
These RoI-aware operations are widely used in object detection and instance segmentation, but they are unsuitable for image cropping. Inspired by RoIAlign and RoIWarp, we propose an RoIRefine layer, shown in Figure 2(d), that extracts high-quality features with less deformation. Our proposed change is simple: we first upsample the full feature map and then resample the RoI-aware features. The first bilinear interpolation improves the sampling resolution. Although it adds no new information, the higher resolution makes the features sensitive to floating-point RoI coordinates. Without it, floating-point coordinates cannot be located in the feature map, so candidate boxes are shifted or rescaled. In other words, the interpolation implements finer sampling at floating-point precision. The second resampling avoids inconsistency between the feature maps and the candidate views. In this paper, we simply upsample the full feature map to 2× its size and resample the RoI-aware features to 14 × 14. Considering the trade-off between performance and efficiency, 2× upsampling is the best choice, as larger upsampling scales (4× or 8×) hardly improve performance. Compared with previous RoI-aware operations, RoIRefine leads to large improvements, as shown in Section 4.3.
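The two-stage sampling described above can be sketched on a single-channel feature map. This is not the authors' implementation: it is a minimal illustration that assumes an align-corners interpolation convention, boxes given in normalized [0, 1] coordinates, and plain nested lists instead of tensors; all function names are hypothetical.

```python
def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D map at floating-point location (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def upsample(fmap, scale=2):
    """First interpolation: upsample the full map (align-corners style)."""
    h, w = len(fmap), len(fmap[0])
    out_h, out_w = h * scale, w * scale
    return [
        [bilinear_sample(fmap, i * (h - 1) / (out_h - 1), j * (w - 1) / (out_w - 1))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def roi_refine(fmap, box, out_size=14, scale=2):
    """Second interpolation: resample the RoI from the upsampled map.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] coordinates (an assumption
    of this sketch), so no coordinate is ever rounded to a whole cell.
    """
    up = upsample(fmap, scale)
    h, w = len(up), len(up[0])
    x0, y0, x1, y1 = box
    out = []
    for i in range(out_size):
        y = (y0 + (y1 - y0) * i / (out_size - 1)) * (h - 1)
        row = []
        for j in range(out_size):
            x = (x0 + (x1 - x0) * j / (out_size - 1)) * (w - 1)
            row.append(bilinear_sample(up, y, x))
        out.append(row)
    return out


if __name__ == "__main__":
    # A 6 x 8 ramp whose value equals the column index.
    ramp = [[float(j) for j in range(8)] for _ in range(6)]
    view = roi_refine(ramp, (0.25, 0.0, 0.75, 1.0))
    print(len(view), len(view[0]), view[0][0], view[0][-1])
```

On a linear ramp, bilinear interpolation is exact, so the first and last columns of the output land on the fractional box edges rather than on whole feature cells.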

Implementation
We initialize the backbone CNN with VGG16 pre-trained on ImageNet. The weights of the three FC layers are initialized from a normal distribution (zero mean and 0.01 standard deviation), the biases are set to zero, and the channels are set to 1024, 512 and 1, respectively. The proposed model is trained on the CPC dataset [Wei et al., 2018], which includes 10,797 images, each with 24 candidate views. We directly rank the 24 views and assign the order of the views as the ground-truth rank score.
During training, the images are resized to 224 × 224 regardless of their original size. Resizing the original image to a fixed size fits the VGG16 pre-trained on 224 × 224 images (ImageNet) and benefits fine-tuning. Although resizing the original image does cause global deformation, its effect is weak. Global deformation does not affect listwise ranking because the ranked objects are views rather than original images. All candidate views in one list share the same global deformation, and the listwise ranking loss is not sensitive to it.
We trained the network for 10 epochs using stochastic gradient descent (SGD) with a momentum of 0.9 and a learning rate of 0.001 that decays by 0.1 after 4 epochs. The batch size is set to 50, so each mini-batch includes 50 × 24 candidate views cropped from 50 images. Early stopping was adopted based on validation results on the FCDB dataset [Chen et al., 2017a].
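The training schedule can be written down as a short helper. The text does not say whether the decay happens once or repeatedly, so the sketch below assumes a step schedule that multiplies the rate by 0.1 every 4 epochs; that reading, and the function name, are assumptions.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=4):
    """Step decay: multiply the rate by `gamma` every `step` epochs (assumed)."""
    return base_lr * gamma ** (epoch // step)


IMAGES_PER_BATCH = 50
VIEWS_PER_IMAGE = 24
VIEWS_PER_BATCH = IMAGES_PER_BATCH * VIEWS_PER_IMAGE  # views scored per step

if __name__ == "__main__":
    for epoch in range(10):
        print(epoch, learning_rate(epoch))
    print(VIEWS_PER_BATCH)
```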

Experiments
We validate the effectiveness of the proposed model on two public image cropping databases (FCDB [Chen et al., 2017a] and FLMS [Fang et al., 2014]). We also compare the time efficiency on a GPU to existing image cropping models in Table 4.

Experimental Settings
To evaluate our model, we utilize the sliding window strategy of [Chen et al., 2017b] to generate candidate views and

FCDB Dataset
FCDB contains 348 test images, each labelled by a photography hobbyist. To evaluate the generalization ability of our model, we adopt the same metrics as previous works [Chen et al., 2017a; Wei et al., 2018], including intersection-over-union (IoU) and boundary displacement (Disp). For the i-th image, the IoU is computed as
$$\mathrm{IoU}_i = \frac{Area_i^{gt} \cap Area_i^{pred}}{Area_i^{gt} \cup Area_i^{pred}}$$
where $Area_i^{gt}$ and $Area_i^{pred}$ denote the area of the ground-truth and best-ranking crop view, respectively. Boundary displacement is given by
$$\mathrm{Disp}_i = \frac{1}{4} \sum_{j=1}^{4} \left| \tilde{B}_i^j - B_i^j \right|$$
where $\tilde{B}_i^j$ and $B_i^j$ denote the four corresponding edges of the ground-truth and best-ranking crop view, respectively. The reported scores are averages over the test images.
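Both metrics are straightforward to compute for axis-aligned boxes. The sketch below uses (x0, y0, x1, y1) boxes and, as an assumption that follows common practice for this benchmark, normalizes each edge offset by the image width or height before averaging the four edges; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def boundary_displacement(gt_box, pred_box, width, height):
    """Mean offset of the four edges, normalized by image size (assumed)."""
    gx0, gy0, gx1, gy1 = gt_box
    px0, py0, px1, py1 = pred_box
    offsets = [
        abs(gx0 - px0) / width,
        abs(gx1 - px1) / width,
        abs(gy0 - py0) / height,
        abs(gy1 - py1) / height,
    ]
    return sum(offsets) / 4


if __name__ == "__main__":
    gt, pred = (0, 0, 100, 100), (50, 0, 150, 100)  # illustrative boxes
    print(iou(gt, pred), boundary_displacement(gt, pred, width=200, height=100))
```

A higher IoU and a lower displacement both indicate a crop closer to the annotation.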

FLMS Dataset
FLMS contains 500 test images, each with 10 annotations from 10 different people. The evaluation metric differs slightly because each image has more annotations than in FCDB. Following previous methods, Top-1 maximum IoU is chosen as the evaluation metric. Top-1 means that only the best-ranked cropping view is used to compute the result. We compute the IoU between each ground-truth annotation and the Top-1 view, and then take the maximum IoU as the final result.
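The Top-1 maximum IoU for one image can be sketched as follows. The helper names and box values are illustrative assumptions; boxes are (x0, y0, x1, y1).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def top1_max_iou(top1_view, annotations):
    """Best IoU between the Top-1 view and any of the image's annotations."""
    return max(iou(top1_view, gt) for gt in annotations)


if __name__ == "__main__":
    annotations = [(0, 0, 100, 100), (20, 20, 120, 120)]  # hypothetical crops
    print(top1_max_iou((20, 20, 120, 120), annotations))
```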

Quantitative Evaluation
In this section, we compare the cropping accuracy of our model with that of state-of-the-art methods. We evaluate the performance on the FCDB and FLMS datasets. VFN uses the ground-truth window as a candidate view, which leads to a remarkable improvement, and VPN performs post-processing that discards small views to improve performance. For a fair comparison, the results (shown in Table 1 and Table 2) are evaluated without ground-truth windows and post-processing, as in [Li et al., 2018].

FCDB Dataset
Table 1 reports the cropping performance on the FCDB dataset. Besides the methods discussed above (VFN, A2-RL and VPN), we choose two other pairwise learning-to-rank methods as baselines. AesRankNet [Kong et al., 2016] ranks photo aesthetics using a pairwise loss function. RankSVM [Chen et al., 2017a] uses AlexNet to extract aesthetic features and finds the best cropping window among candidate views. According to Table 1, the proposed model achieves the best IoU and Disp scores among the compared methods.

FLMS Dataset
We also evaluate on the FLMS dataset; the results are shown in Table 2. Following [Li et al., 2018], we choose Top-1 maximum IoU (Max IoU) as the metric for cropping accuracy. In addition to ranking-based methods, two classification-based methods are also compared on the FLMS dataset. Fang et al. [Fang et al., 2014] learn an aesthetic-based cropping model by discriminative classifier training. In [Wang and Shen, 2017], an attention box prediction (ABP) network and an aesthetics assessment (AA) network are proposed to model the photo assessment problem as aesthetic quality classification. The experiments in Table 2 show that our model outperforms the other methods in cropping accuracy.

Time Efficiency
To validate the time efficiency, we compare the time cost of our model with that of state-of-the-art methods (VFN, A2-RL and VPN) on the FCDB dataset. All the results in Table 4 are evaluated on the same platform with one NVIDIA GeForce 1080 GPU. The selection of candidate views plays an important role in image cropping. In Table 4, Candidate means the number of bounding boxes used to find the best view (in VFN and VPN) or to extract the evaluation feature (in A2-RL). In general, the model that evaluates the most candidate views is the most likely to find the best result, as shown in Table 3. According to Table 4, the proposed method uses the most candidate views in the least time (120+ frames per second) and achieves the best accuracy.

Performance Analysis
Performance of Listwise Learning
To illustrate the effectiveness of listwise learning, we design a contrast experiment in this section. As described in Section 3, we build VGG16-based networks without RoIRefine using different ranking losses (pairwise and listwise). We train the networks on the CPC dataset and compute the rank scores for all candidate views at once. Following [Chen et al., 2017b; Wei et al., 2018], the pairwise loss is defined as
$$L(v_i^1, v_i^2) = \max\left(0,\ 1 + f(v_i^2) - f(v_i^1)\right)$$
where $v_i^1$ and $v_i^2$ are two views selected from the same image $d_i$, and $v_i^1$ is preferred over $v_i^2$. For listwise training, the training setting is the same as in Section 4.3 except that RoIRefine is not used.
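A pairwise hinge ranking loss of the kind used by the cited pairwise methods can be sketched as below. The margin of 1 and the function name are assumptions made for illustration.

```python
def pairwise_hinge_loss(score_preferred, score_other, margin=1.0):
    """Zero once the preferred view outscores the other by at least `margin`."""
    return max(0.0, margin + score_other - score_preferred)


if __name__ == "__main__":
    print(pairwise_hinge_loss(2.0, 0.5))  # well separated, correct order
    print(pairwise_hinge_loss(1.0, 0.5))  # correct order, inside the margin
    print(pairwise_hinge_loss(0.5, 2.0))  # wrong order
```

Each call sees only two views, so the loss carries no information about where the pair sits in the full list; this is the limitation the listwise loss is meant to remove.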
The result of pairwise training depends heavily on pair selection, because samples drawn from differing distributions introduce training bias. To make the comparison as detailed as possible, we train three models with the different selection methods shown in Table 5. Without pair selection, there are more than 2.6 million pairs in the CPC dataset. With simple pair selection, we set a threshold of 0.5 to drop pairs with a minor gap in rank score, which yields 1.3 million pairs. With careful pair selection, we train the model following [Wei et al., 2018]. The marked improvement in Table 5 shows that listwise training overcomes the problems of pairwise training.
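Simple pair selection by a score-gap threshold can be sketched as follows. The exact rule is not given in the text, so this version assumes that tied pairs are always skipped and that a pair is kept when its absolute score gap is at least the threshold; the function name and example scores are hypothetical.

```python
import itertools


def select_pairs(scores, threshold=0.0):
    """Return (preferred, other) index pairs whose score gap passes the threshold."""
    pairs = []
    for a, b in itertools.combinations(range(len(scores)), 2):
        gap = scores[a] - scores[b]
        if gap == 0 or abs(gap) < threshold:
            continue  # skip ties and pairs with only a minor gap
        pairs.append((a, b) if gap > 0 else (b, a))
    return pairs


if __name__ == "__main__":
    scores = [3.0, 2.8, 1.0, 1.0]  # illustrative rank scores of four views
    print(select_pairs(scores))
    print(select_pairs(scores, threshold=0.5))
```

A list of 24 distinct scores yields 24 × 23 / 2 = 276 pairs per image, which is why pairwise training needs far more samples per image than a single listwise pass.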

Performance of RoIRefine
In this section, we study the improvement brought by the proposed RoIRefine. To show the contribution of the listwise loss function when the better view sampling (RoIRefine) is used, we train and evaluate eight models with different ranking losses and different RoI-aware operations. The experimental results of RoIRefine and the three RoI-aware operations on FCDB are shown in Table 6. These RoI-aware operations differ in the number and placement of interpolations, as shown in Figure 2. RoIPool aggregates the view features after RoI-crop without any interpolation; RoIAlign aligns the features using interpolation before RoI-crop; RoIWarp resamples the features using interpolation after RoI-crop. Inspired by RoIAlign and RoIWarp, RoIRefine adopts bilinear interpolation both before and after RoI-crop.
RoI-aware operations reduce the deformation caused by traditional view generation and achieve a marked improvement of about 5.0% IoU. Without interpolation, RoIPool causes feature deformation and performs worst among the four RoI-aware operations. RoIAlign and RoIWarp remove the harsh quantization of RoI boundaries and improve IoU by about 2.0% to 2.5% over RoIPool. RoIRefine combines pre-interpolation (RoIAlign) and post-interpolation (RoIWarp) to refine the view features, and gains about 1.5% IoU over RoIAlign and 1.0% IoU over RoIWarp. The experimental results demonstrate that the high-quality features extracted by RoIRefine can overcome the problems of rescaling and deformation.
The column (f) shows the cropping results of our model.

Qualitative Visualization
Figure 3 shows five groups of qualitative results generated by different methods on the FCDB dataset. The comparison shows that our model extracts better views than the other methods.
For A2-RL (Figure 3(b)), reinforcement learning is sensitive to the initial state and the number of iteration steps, resulting in the unstable performance seen in the second and fifth images. VPN (Figure 3(c)) uses 895 anchor boxes, including the original image, and tends to select the full image, as in the first two images. Because of its high computational complexity, VFN cannot evaluate a large number of candidate views to achieve high-accuracy results, as shown in Figure 3(d). Comparing Figure 3(a) and Figure 3(e), we can see that our predicted boxes are close to the ground truth. In the last column, the results cropped by our method have better visual quality than the original images.

Conclusion
Image cropping is a common photo manipulation process that improves the overall composition by removing unwanted regions. In this paper, we formulate the learning of photo composition as a listwise ranking problem to overcome the problems of pairwise approaches. Furthermore, a novel RoIRefine operation is proposed to extract high-quality features for view generation. The experimental results on two common datasets show that our method sets new state-of-the-art results at a faster speed of 120+ frames per second.
In future work, we will study multi-task learning to combine composition evaluation and box regression. How to design the multi-task loss remains an open problem. Inspired by the success of detection frameworks, an RCNN-like [Girshick, 2015] or SSD-like [Liu et al., 2016] method will be our first choice.