Bidirectional Multiscale Refinement Network for Crisp Edge Detection

With the development of deep convolutional neural networks (CNNs), contour detection has made great progress, and some CNN-based contour detectors now outperform human beings on standard benchmarks. However, CNNs tend to learn similar features for adjacent pixels, and the numbers of background and edge pixels in the training samples are highly imbalanced. As a result, the edges predicted by CNN-based detectors are thick and require post-processing to obtain crisp edges. Accordingly, we introduce a novel parallel attention model and a novel loss function that combines cross-entropy and dice loss through adaptive coefficients, and propose a novel bidirectional multiscale refinement network (BMRN) that stacks multiple refinement modules to achieve richer feature representations. Experimental results show that our method outperforms the state-of-the-art on the BSDS500 dataset (ODS F-score of 0.828), the NYUDv2 depth dataset (ODS F-score of 0.778), and the Multi-Cue dataset (ODS F-score of 0.905 (0.002)).


I. INTRODUCTION
Contours are an important image feature, and accurate edge detection is a basic task of machine vision and image processing. In the early years of image processing, the target edge was usually obtained by looking for abrupt changes in gray intensity in the image. Today, however, the goal is usually the semantically meaningful object contour, because it is the basis of object recognition, image segmentation, and optical flow algorithms such as EpicFlow. With the development of deep learning, CNNs have become the dominant approach to edge detection, e.g., N4-Fields [26], DeepContour [11], DeepEdge [4], HED [12], and RCF [13]. These methods make full use of the hierarchical feature learning ability of neural networks and obtain the best F-score performance on benchmark datasets such as BSDS500 and NYUDv2.
Although CNNs achieve better accuracy in detecting object contours than traditional classical algorithms, their predictions are much thicker than those of classical methods. Fig. 1 shows the contour maps obtained by the Canny operator and the richer convolutional features (RCF) detector, without non-maximum suppression. The previous literature shows that deep learning methods pay more attention to the precision and recall of contours, and less attention to the crispness of boundaries (precisely localizing edge pixels), than traditional methods. CED [14] added a residual enhancement module to HED [12] and used sub-pixel convolution to recover the image resolution step by step, which allows the decoding process to integrate the features of image coding as fully as possible and improves the crispness of contours. LPCB [15] achieved greater contour accuracy by improving the loss function. BDCN [42] trained the neural network using multiscale features. Inspired by LPCB [15], BDCN [42], LRC [47], RCF [13], and CBAM [32], we propose a bidirectional multiscale refinement network (BMRN). The BMRN model is trained with a multi-scale training method on the augmented Berkeley segmentation dataset (BSDS500), the NYUDv2 depth dataset, and Multi-Cue to further improve its generalization performance.

II. RELATED WORK
For the task of edge detection, the early literature used abrupt changes in gray intensity between object edges and background and applied local gradients to detect target contours. For example, the Prewitt operator [1] and the Canny operator [2] both use derivatives to detect contours. In more recent literature, Tao Fang et al. proposed BAR [45], which is based on the bilateral asymmetric receptive field mechanism of the visual pathway, and Qing Zhang et al. proposed DIDY [46], a biologically inspired model for contour detection based on binocular disparity and receptive field dynamics. With the evolution of machine learning, both global and local image information came to be fully exploited.
Edge detection is regarded as a pixel-level classification problem, with the pixels classified by extracting local image features. Martin et al. proposed the Pb detector [21] using the brightness gradient (BG), color gradient (CG), and texture gradient (TG) channel features together with a logistic regression algorithm. Arbeláez et al. proposed gPb [3], which assigns each pixel a global probability by computing global characteristics on the basis of multi-scale Pb [21]. Subsequently, Arbeláez et al. constructed an image Gaussian pyramid, extracted contour features at different scales to achieve contour extraction, and proposed the multi-scale combinatorial grouping (MCG) algorithm. In recent years, with the success of deep CNNs in computer vision, they have increasingly been applied to contour detection. Inspired by FCN [43], Xie et al. proposed HED, the first end-to-end contour detection algorithm [12]. Its network structure is divided into coding and decoding networks, and the entire network is trained directly end-to-end by backpropagation. On the basis of HED [12], Liu et al. proposed RCF [13], which compresses the features of every convolution layer at the same resolution, greatly improving the network's ability to express image features. Also on the basis of HED [12], Wang et al. added a residual enhancement module and used sub-pixel convolution to gradually restore the image resolution, so that the decoding process could synthesize the features of image coding as fully as possible; their CED [14] greatly improved contour accuracy.
Because CNN-based edge detection methods can learn features automatically, the robustness of such algorithms is improved to a certain extent. However, end-to-end edge detection models such as RCF [13] and CED [14] use the traditional VGG16 [44] classification network as the backbone coding network, which makes it easier to produce similar responses for adjacent pixels. At the same time, the pooling layers in the network structure further reduce the image resolution, and the edge and non-edge pixels in the training samples are imbalanced. These factors make it difficult to guarantee both the accuracy and the crispness of the output contour map. Thick and fuzzy edge contours can be a disadvantage in some visual tasks; for example, recent optical flow methods require accurate and crisp edge input to interpolate sparse matching results.
In view of the above problems, this paper proposes a bidirectional multiscale refinement network based on BDCN [42]. The three main contributions of this paper are: (1) To improve the network model's ability to capture the local details of contours, we propose a novel parallel attention model, which can encode the spatial context information of a large range of the feature map into local features.
(2) We propose a novel refinement block, which automatically learns multiscale feature weights during end-to-end training, so that multiscale features can be reconstructed.
(3) We propose an adaptive weight loss function, which combines cross-entropy and dice loss functions and minimizes the distance between image and ground truth from the image level to the pixel level.

III. PROPOSED METHOD
In this section, the specific structure of the proposed method is described. First, the formulation of the network model is introduced, and the overall structure of the network model is described. To obtain a crisper contour, the loss function is improved: a new adaptive loss function is created by fusing the cross-entropy and dice loss functions.

A. FORMULA
(X, Y) denotes a pair of samples in the training dataset, where X is the input image and Y the corresponding ground-truth edge map. According to the scale of the depicted objects, Y is decomposed into S binary edge maps:

Y = Σ_{s=1}^{S} Y_s,  (1)

where Y_s is the edge map corresponding to scale s. In BDCN, Y_s is described by two complementary label edge maps: one ignores the edges with scale larger than s, and the other ignores the edges with scale smaller than s. The two complementary labels are defined by

Y_s^s2d = Y − Σ_{i>s} Y_i,  (2)
Y_s^d2s = Y − Σ_{i<s} Y_i,  (3)

where s2d represents information propagated from the shallow layers to the deep layers, and d2s represents information propagated from the deep layers to the shallow layers. Therefore, Y_s is defined as

Y_s = Y_s^s2d + Y_s^d2s − Y.  (4)

Based on the above formulas, the edge prediction at scale s is obtained by interpolating the two directional predictions, P_s^s2d + P_s^d2s. To make full use of multiscale image features when training the neural network, we propose a bidirectional multiscale refinement network model, whose specific structure is described in Section B.
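As a numerical sanity check of the decomposition above (using hypothetical, disjoint per-scale edge maps and numpy only), the identity in formula (4) can be verified directly:

```python
import numpy as np

# Hypothetical binary edge maps for S = 3 scales; each edge pixel is
# assigned to exactly one scale, so the per-scale maps are disjoint.
H, W, S = 8, 8, 3
scales = np.zeros((S, H, W))
for s in range(S):
    scales[s, s + 1, :] = 1.0            # one edge row per scale

Y = scales.sum(axis=0)                    # formula (1): Y = sum_s Y_s

s = 1                                     # check the middle scale
Y_s2d = Y - scales[s + 1:].sum(axis=0)    # formula (2): drop scales larger than s
Y_d2s = Y - scales[:s].sum(axis=0)        # formula (3): drop scales smaller than s

# formula (4): Y_s is recovered from the two complementary labels
Y_s = Y_s2d + Y_d2s - Y
assert np.array_equal(Y_s, scales[s])
```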

B. NETWORK STRUCTURE
The overall structure of the proposed network architecture is shown in Fig. 2. The proposed network is composed of multiple ID Blocks and Refinement Blocks, each trained by layer-specific supervision inferred from a bidirectional cascade structure, as shown in Fig. 2(a). An optimized VGG16 [44] is used as the backbone network: the three fully connected layers and the final pooling layer are removed to decrease the number of parameters and to avoid the unclear features caused by too small a scale. The 13 convolution layers of VGG16 [44] are divided into five modules (ID Blocks). A 2 × 2 max pooling layer is used between consecutive ID Blocks to increase the receptive field layer by layer. Richer multiscale hierarchical features can thus be obtained along the horizontal direction of the network. The feature maps output at different resolutions by the different ID Blocks are fed to the corresponding Refinement Blocks in a cascade manner, as shown in Fig. 2(b). This structure integrates the multi-scale features generated by the coding process and allows us to make full use of the multi-resolution side outputs for deeply supervised learning during training. Previous studies have shown that this type of structure captures hierarchical features well and produces semantic edge contours through fusion.
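The grouping above can be sketched as follows; the layer counts come from the standard VGG16 configuration, and the 320 × 320 input size is only an example:

```python
# VGG16's 13 conv layers grouped into the five ID Blocks (standard VGG16
# "D" configuration); a 2x2 max pool sits between consecutive blocks, so
# the spatial resolution halves four times (the final pool is removed).
id_blocks = [2, 2, 3, 3, 3]          # conv layers per ID Block
channels  = [64, 128, 256, 512, 512]  # output channels per ID Block

def side_output_sizes(h, w, n_blocks=5):
    """Spatial size of each ID Block's output feature map."""
    sizes = []
    for b in range(n_blocks):
        sizes.append((h, w))
        if b < n_blocks - 1:          # pooling between blocks only
            h, w = h // 2, w // 2
    return sizes

print(side_output_sizes(320, 320))
# -> [(320, 320), (160, 160), (80, 80), (40, 40), (20, 20)]
```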

ID Blocks and Refinement Blocks are the basic modules of the entire network model. We use the attention mechanism to improve the ID Blocks of BDCN [42], making them more conducive to extracting the local details of contours while retaining the SEM module. The input feature map is propagated from the shallow layers to the deep layers along the main coding network, and the predicted edges P_s^s2d and P_s^d2s are generated after each intermediate Refinement Block.
The pairs (P_s^s2d, Y_s^s2d) and (P_s^d2s, Y_s^d2s) are used for bidirectional supervised training of the network model. Finally, all the side outputs of the middle layers are fused to compute the final edge contour. This network structure does not increase the horizontal depth of the network but expands it vertically. It thereby increases the nonlinearity of the model, reduces the correlation of adjacent pixels, and is more conducive to the extraction of crisp edges.

1) PARALLEL ATTENTION MODEL (PAM)
Inspired by CBAM [32], SENet [34], and DANet [33], we observe that each channel of a high-level feature map can be regarded as a response to a specific class, and that different semantic responses are related to each other. The channel attention model can enhance the expression of specific semantic features and emphasize the most important channels of the feature map. The position attention model, which is one part of the parallel attention model, can encode the spatial context information of a wide range of the feature map into local features, so that the semantic parts of the feature map are enhanced. Our goal is to detect crisp contours, and to do so we need to capture more local contour details. Based on this, we propose a novel parallel attention model that uses the position attention module of DANet [33] and the SE module of SENet [34]. Fig. 4 shows the details of the proposed attention model, which can be embedded into other network models as a whole module. The SE module and the position attention module can be connected in parallel or in series in a network structure; the series connection can be further divided into two variants according to the order of the two modules, with either the SE module or the position attention module in front. Among these connections, our ablation experiments show that the parallel connection best extracts the local features of the contour. To further trade off the proportions of the position attention and SE attention modules, the feature values produced by position attention and SE attention are multiplied by factors α and β, and then a pixel-level summation is performed. At the beginning of training, α and β are given initial values, and the optimal values are gradually reached during training. As shown in Fig. 4, the calculation process of the position attention module is given by formulas (5) and (6).
In formula (5), we first feed the feature map X ∈ R^{C×H×W} into a convolution layer to generate two new feature maps, B and C, and reshape them to {B, C} ∈ R^{C×N}, where N = H × W is the number of pixels. Matrix multiplication is performed between the transpose of B and C, and the spatial attention map S ∈ R^{N×N} is then obtained by the softmax function:

s_ji = exp(B_i · C_j) / Σ_{i=1}^{N} exp(B_i · C_j).  (5)

In formula (6), we feed X into a convolution layer to generate a new feature map D, reshaped to D ∈ R^{C×N}. Matrix multiplication is performed between D and the transpose of S, the result is reshaped to R^{C×H×W}, and an element-wise sum with the features X gives the final output X̃ ∈ R^{C×H×W}:

X̃_j = Σ_{i=1}^{N} (s_ji D_i) + X_j.  (6)
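The two formulas can be sketched in numpy as follows; the 1 × 1 convolutions are stood in for by hypothetical C × C projection matrices Wb, Wc, and Wd:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(X, Wb, Wc, Wd):
    """Formulas (5)-(6): spatial self-attention over all N = H*W positions.
    X: (C, H, W); Wb, Wc, Wd: hypothetical 1x1-conv weights of shape (C, C)."""
    C, H, W = X.shape
    N = H * W
    Xf = X.reshape(C, N)
    B, Cm, D = Wb @ Xf, Wc @ Xf, Wd @ Xf      # three 1x1-conv projections
    S = softmax(B.T @ Cm, axis=0)             # (N, N) attention map, formula (5)
    out = D @ S                               # aggregate features over positions
    return (out + Xf).reshape(C, H, W)        # residual sum with X, formula (6)
```

With a zero D projection the attention branch vanishes and only the residual remains, which gives a quick correctness check.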
In formula (7), X ∈ R^{C×H×W} is the input feature map, and the c-th channel x_c of X is processed by a global average pooling layer to obtain the scalar z_c:

z_c = F_sq(x_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j).  (7)

The excitation step is

s = F_ex(z, W) = σ(W_2 δ(W_1 z)),  (8)

where z = [z_1, z_2, ⋯, z_C], W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)}, δ denotes the ReLU function, and σ the sigmoid activation function. Formula (9),

x̃_c = F_scale(x_c, s_c) = s_c · x_c,  (9)

gives the c-th channel feature map of the output X̃, obtained by multiplying the scalar s_c with the c-th channel feature map of the input X. The final definition of the parallel attention model is

PAM(X) = α · PA(X) + β · SE(X).  (10)

In formula (10), SE denotes the squeeze-and-excitation module and PA denotes the position attention module. α and β are given initial values at the beginning of training (empirically set to 0.01 and 1.1, respectively), and the optimal values are learned during training.
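Formulas (7)-(10) can be sketched similarly; W1, W2, and the fusion inputs below are hypothetical stand-ins rather than trained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_attention(X, W1, W2):
    """Formulas (7)-(9): squeeze (global average pool), excite (two FC
    layers), then rescale each channel. W1: (C//r, C), W2: (C, C//r)."""
    z = X.mean(axis=(1, 2))                   # formula (7): per-channel squeeze
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0))   # formula (8): ReLU then sigmoid
    return s[:, None, None] * X               # formula (9): channel rescaling

def pam(pa_out, se_out, alpha=0.01, beta=1.1):
    """Formula (10): parallel fusion of the two attention branches; alpha
    and beta start from the empirical initial values and are then learned."""
    return alpha * pa_out + beta * se_out
```

With zero excitation weights, sigmoid(0) = 0.5, so every channel is scaled by one half; this makes the rescaling step easy to verify by hand.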

2) REFINEMENT BLOCK
Inspired by DRC [41], in order to better integrate multiscale features, we propose the Refinement Block to transform the original network model. The main idea is to use the intermediate edge feature maps to aggregate multiscale edge evidence. The structure is shown in Fig. 5(a) and includes a weighted convolution layer, sub-pixel convolution, and an element-wise add layer. As shown in Fig. 5(b), the weighted convolution layer takes two identical feature maps as inputs, one passing through a 1 × 1 convolution followed by a ReLU layer and the other multiplied by a trainable parameter α activated by a sigmoid function. Through the weighted convolution layer, different feature weights can be learned automatically in the end-to-end training of the model, which is more effective than manually specifying weights. Because each ID Block outputs a feature map at a different resolution, we use sub-pixel convolution to transform low-resolution feature maps into high-resolution ones. We use sub-pixel convolution rather than up-sampling, because up-sampling with bilinear interpolation does not restore the spatial details of pixels, which blurs the edges of detected objects. Sub-pixel convolution is a standard convolution layer followed by a rearrangement of feature values called the phase shift. This method helps to eliminate the blocking effect in image super-resolution tasks while keeping the computational cost low, and it is highly effective for accurate edge localization. Finally, we adopt an element-wise add layer to fuse the outputs of the multiple weighted convolution layers, because it has fewer training parameters than a 1 × 1 convolution layer applied after concatenating those outputs. The output channel number is set to the smallest dimension within the same Refinement Block.
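The phase-shift rearrangement at the heart of sub-pixel convolution can be sketched in numpy (the rearrangement only, without the preceding convolution):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Phase shift used by sub-pixel convolution: a (C*r^2, H, W) feature
    map becomes (C, H*r, W*r). Values are only moved, never interpolated,
    which is why spatial detail is preserved."""
    Cr2, H, W = x.shape
    C = Cr2 // (r * r)
    x = x.reshape(C, r, r, H, W)
    x = x.transpose(0, 3, 1, 4, 2)     # -> (C, H, r, W, r)
    return x.reshape(C, H * r, W * r)
```

For example, four 1 × 1 channels holding the values 0..3 become a single 2 × 2 map [[0, 1], [2, 3]].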

C. ADAPTIVE WEIGHT LOSS FUNCTION
The loss function is an important part of an end-to-end CNN, because it directly affects the prediction results. In CNN-based edge detection, the design of the loss function needs to address three things: the imbalance between positive and negative samples, the combination of multiple losses, and the accuracy of the test results. Based on these three considerations, we fuse cross-entropy and the dice coefficient with adaptive weights, which improves edge detection performance. Because the edge and non-edge pixels of the input image are highly imbalanced in contour detection, HED [12] uses a weighted cross-entropy loss function to address this problem, so that the end-to-end CNN can be trained. The weighted cross-entropy loss is

ℒ_w(P, Y) = −λ Σ_{j∈Y+} log P(y_j = 1 | X; W) − (1 − λ) Σ_{j∈Y−} log P(y_j = 0 | X; W).  (11)

In formula (11), Y+ denotes the edge pixels, Y− denotes the non-edge pixels, λ = |Y−| / |Y|, and 1 − λ = |Y+| / |Y|. X is the input image, and P(y_j | X; W) is the classification probability of pixel j given by the softmax function.
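A minimal numpy sketch of formula (11), written for predicted probabilities rather than raw logits:

```python
import numpy as np

def weighted_bce(p, y, eps=1e-6):
    """Class-balanced cross-entropy in the style of formula (11).
    p: predicted edge probabilities in [0, 1]; y: binary ground truth.
    lam = |Y-| / |Y| up-weights the sparse edge pixels and down-weights
    the abundant background pixels."""
    y = y.astype(float)
    lam = (y == 0).mean()                              # |Y-| / |Y|
    pos = -lam * np.sum(y * np.log(p + eps))           # edge-pixel term
    neg = -(1 - lam) * np.sum((1 - y) * np.log(1 - p + eps))
    return pos + neg
```

A perfect prediction yields a near-zero loss, and moving the probabilities toward 0.5 increases it, as expected of a cross-entropy.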
The dice loss is

ℒ_D(P, Y) = (Σ_i p_i² + Σ_i y_i²) / (2 Σ_i p_i y_i).  (12)

In formula (12), p_i and y_i are the values of the prediction map and the ground truth at the i-th pixel, respectively. The dice loss measures the similarity of two sets of image pixels, and it does not need to account for the imbalance between edge and non-edge pixels. Dice loss can be regarded as an image-level similarity measure, while cross-entropy loss focuses on pixel-level differences. Therefore, fusing the two loss functions minimizes the distance between prediction and label from the image level down to the pixel level. In LPCB [15], the two weights of the loss function must be set manually; because they are empirical values obtained through many experiments, errors can easily occur. To solve this problem, following the principle of Nash equilibrium, we take γ_1 ℒ_CE(P, Y) + γ_2 ℒ_D(P, Y) as one side of the game and construct the opposing term (1/γ_1 + 1/γ_2) as the other side, so that the network can be trained end-to-end and the weight coefficients reach a Nash equilibrium. The proposed adaptive weight loss function is

ℒ(P, Y) = γ_1 ℒ_CE(P, Y) + γ_2 ℒ_D(P, Y) + (1/γ_1 + 1/γ_2).  (13)

In formula (13), ℒ_D(P, Y) is equation (12), ℒ_CE(P, Y) is the standard cross-entropy loss, ℒ_CE(P, Y) = −Σ_{i=1}^{N} (y_i log p_i + (1 − y_i) log(1 − p_i)), and N is the total number of pixels in the input image. During training, larger values of γ_1 and γ_2 increase the contribution of γ_1 ℒ_CE(P, Y) + γ_2 ℒ_D(P, Y), while smaller values decrease it; the last term (1/γ_1 + 1/γ_2) is the regularization term for γ_1 and γ_2. All ID Blocks and Refinement Blocks are trained with two layer-specific side supervisions in our network. In addition, we fuse the intermediate edge predictions with a fusion layer to produce the final result. Therefore, BMRN is trained with three types of loss. We define the total loss as

ℒ = w_side Σ_{s=1}^{S} [ℒ(P_s^s2d, Y_s^s2d) + ℒ(P_s^d2s, Y_s^d2s)] + w_fuse ℒ(P_fuse, Y),  (14)

where w_side and w_fuse are the weights for the side loss and the fusion loss, respectively. ℒ(P_s^s2d, Y_s^s2d), ℒ(P_s^d2s, Y_s^d2s), and ℒ(P_fuse, Y) all adopt formula (13); P_fuse is the final edge prediction and Y the ground truth.
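Under this reading of formula (13), the behavior of the adaptive weights can be sketched as follows; the closed-form stationary point is only an illustration, since in practice γ_1 and γ_2 are learned by backpropagation:

```python
import numpy as np

def adaptive_loss(l_ce, l_dice, g1, g2):
    """One reading of formula (13): g1, g2 are trainable weights, and the
    term (1/g1 + 1/g2) penalizes driving either weight toward zero."""
    return g1 * l_ce + g2 * l_dice + (1.0 / g1 + 1.0 / g2)

# At the stationary point d(g*l + 1/g)/dg = 0, each weight settles at
# g = 1 / sqrt(l): a branch with a larger loss receives a smaller weight.
def g_star(l):
    return 1.0 / np.sqrt(l)
```

For l_ce = 4 and l_dice = 1, the equilibrium weights are 0.5 and 1.0, and the loss value 2·sqrt(l_ce) + 2·sqrt(l_dice) = 6 is lower than at any other weight choice.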

IV. EXPERIMENT
In this section, we introduce the datasets and hyperparameter settings and use ablation experiments to verify the proposed method. The ablation results show how the subsystems of the proposed model affect the performance of the entire network. In addition, comparisons with state-of-the-art methods verify that our method performs better at detecting crisp contours.

A. IMPLEMENTATION DETAILS
1) DATASET ENHANCEMENT
In the experiments, the widely used BSDS500 [3], NYUDv2 [6] depth, and Multi-Cue edge/boundary [21] datasets were used to train and evaluate our model. BSDS500 consists of 500 image pairs: 200 image-label pairs for training, 100 for validation, and 200 for testing. NYUDv2 consists of 1,449 pairs of RGB and depth images: 381 image-label pairs for training, 414 for validation, and 654 for testing. To avoid insufficient training and poor robustness due to the lack of training images, the BSDS500 training set was randomly scaled by factors of 0.75-1.25, and the image-label pairs were rotated to 16 different angles and flipped at each angle, so that the number of image-label pairs in the training set exceeded 10,000. On the NYUDv2 dataset, the image-label pairs were rotated to four angles (0°, 90°, 180°, and 270°) and flipped at each angle. Multi-Cue contains 100 natural scenes, each with two frame sequences, one from the left view and one from the right view; the last frame of the left-view sequence is labeled with edges and boundaries. Following HED [12] and RCF [13], we randomly divided the 100 labeled scene images into 80 for training and 20 for testing, carried out three independent experiments, and used the average of the three results as the final output.
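A back-of-envelope count of the BSDS500 augmentation described above; the number of scale samples per image is an assumption, since only the scale range is stated:

```python
# Augmentation bookkeeping for the BSDS500 training set. The 3 scale
# samples drawn from the 0.75-1.25 range are a hypothetical choice;
# rotations and flips are as described in the text.
train_pairs = 200
rotations = 16
flips = 2
scales = 3          # assumption: three random scales per image

total = train_pairs * rotations * flips * scales
print(total)        # -> 19200, i.e. "more than 10,000" image-label pairs
```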

2) HYPERPARAMETERS
In this paper, we used the PyTorch 1.2.0 framework to implement the proposed neural network model. The network uses SGD as the optimizer, and the backbone uses the VGG16 [44] model pretrained on the ImageNet dataset. The other hyperparameters are set as follows: mini-batch size 10, global learning rate 1e-4, and momentum 0.9. On the BSDS500 and NYUDv2 datasets, training ran for 40,000 iterations, and the learning rate was divided by 10 every 10,000 iterations. We trained for 2,000 and 4,000 iterations on Multi-Cue edge and boundary, respectively; the side-loss and fusion-loss weights were set to 0.5 and 1.1, respectively; and all experiments were performed on an Nvidia GeForce RTX 2070 graphics card with 8 GB of RAM.
Following previous work, standard non-maximum suppression was performed to obtain the final edges. To evaluate the performance of the different methods, we used average precision (AP) and the F-measure, the latter at both the optimal dataset scale (ODS) and the optimal image scale (OIS) [30, 31]. To match the edge predictions to the ground truth annotations, the maximum tolerance distances on BSDS500 and NYUDv2 were set to 0.0075 and 0.011, respectively [14].
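A simplified sketch of the two F-measure aggregations; the actual benchmark accumulates match counts across the dataset rather than averaging per-image F-scores, so this is only an approximation:

```python
import numpy as np

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall + 1e-12)

def ods_ois(per_image_pr):
    """per_image_pr: array (n_images, n_thresholds, 2) of (precision, recall)
    values. ODS picks one threshold for the whole dataset; OIS picks the
    best threshold per image and averages the resulting F-scores."""
    p = per_image_pr[..., 0]
    r = per_image_pr[..., 1]
    f = f_measure(p, r)                # (n_images, n_thresholds)
    ods = f.mean(axis=0).max()         # best single dataset-wide threshold
    ois = f.max(axis=1).mean()         # best threshold per image, averaged
    return ods, ois
```

By construction OIS is never below ODS, since choosing thresholds per image can only help.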

B. ABLATION EXPERIMENTS
We performed ablation experiments to analyze the contribution of each subsystem of our method on the BSDS500 dataset. We used the BSDS500 training and validation sets to train the network model and evaluated its performance on the test set. Table 1 shows the influence of different attention modules on the performance of the proposed network model; the proposed PAM performs better than previous attention models. In addition, we applied the proposed loss function to HED [12], CED [14], and the proposed BMRN for comparative experiments. The results, summarized in Table 2, show that the proposed loss function achieves better output results. Table 3 shows the ablation results of BDCN combined with the proposed modules. The results show that each proposed module improves the performance of the BDCN model, and our full algorithm performs best. Compared with the BDCN model, our method improves ODS, OIS, and AP by 2.73%, 1.82%, and 3.31%, respectively. All other parameters were kept at the same settings throughout the ablation experiments.

C. COMPARISON WITH STATE-OF-THE-ART METHODS
1) PERFORMANCE COMPARISON ON BSDS500
The proposed method was compared with several deep learning-based contour detection methods, including N4-Fields [26], DeepContour [11], DeepEdge [4], HED [12], RCF [13], CED [14], LRC [47], TIN [49], and PiDiNet [50], and traditional contour detection methods, including Canny [1], gPb [3], SE [8], and PMI [18]. The results are summarized in Table 4. Our method achieved an ODS F-score of 0.828, higher than the previous best method, CED [14], and also exceeded the human benchmark on the BSDS500 dataset (ODS = 0.803). The qualitative results in Fig. 8 show that our output is thinner and crisper than that of HED [12] and CED [14]. Note that in Table 4, the suffix MS on a network model means that multi-scale fusion is adopted; the others use a single scale. In single-scale edge detection, we input an original image into the network model, which outputs an edge probability map. To improve the quality of the predicted edges, we build image pyramids to detect multi-scale edges, and each of these images is separately input to the network model. All predicted edge probability maps are resampled to the original image size using bilinear interpolation. Finally, these prediction maps are averaged to obtain the final prediction map. To trade off accuracy and speed, we adopted the three scales 0.5, 1.0, and 2.0.
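The multi-scale fusion described above can be sketched as follows; `predict` is a hypothetical single-scale model, and nearest-neighbor resizing stands in for the bilinear interpolation used in the paper:

```python
import numpy as np

def multiscale_predict(image, predict, scales=(0.5, 1.0, 2.0)):
    """Average edge maps predicted over an image pyramid. `predict` maps an
    (H, W, ...) image to an (H, W) edge probability map; every prediction
    is resampled back to the original size before averaging."""
    H, W = image.shape[:2]

    def resize(a, h, w):
        # nearest-neighbor stand-in for bilinear interpolation
        ri = (np.arange(h) * a.shape[0] / h).astype(int)
        ci = (np.arange(w) * a.shape[1] / w).astype(int)
        return a[ri][:, ci]

    preds = []
    for s in scales:
        scaled = resize(image, max(1, int(H * s)), max(1, int(W * s)))
        preds.append(resize(predict(scaled), H, W))   # back to original size
    return np.mean(preds, axis=0)
```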

2) PERFORMANCE COMPARISON ON NYUDV2
The NYUDv2 dataset contains two sub-datasets: an RGB dataset and an HHA dataset. Experiments were carried out on both sub-datasets, and the final RGB-HHA result was obtained by averaging the predictions of the RGB and HHA models. The proposed method was compared with several non-deep learning methods, including gPb-UCM [3], gPb-NG [45], and SE [8], and deep learning methods, including HED [12], RCF [13], AMH-Net-ResNet [41], LPCB [15], BDCN [42], LRC [47], TIN [49], and PiDiNet [50]. All experiments used single-scale inputs. The quantitative comparisons are summarized in Table 5, which shows that the RGB-HHA version performs better than the versions trained on RGB or HHA data alone in terms of ODS, OIS, and AP. It can also be seen that COB-ResNet50 [19] achieves the best results, because it uses UCM [3] in post-processing, which benefits contour detection. Fig. 9 shows the precision-recall curves of the different methods. The qualitative results of our method and CED [14] are shown in Fig. 10. Compared with CED [14], our predictions have crisper edges, which demonstrates the effectiveness of our method.

V. CONCLUSION
In this paper, we analyze the reasons that CNN-based edge detectors do not produce crisp edges, and propose a novel model that largely improves the localization performance of CNN-based edge detectors. The proposed network makes each middle layer focus on a specific scale through a bidirectional multiscale refinement structure and trains the model with layer-specific supervision at each middle layer. To accurately locate edge pixels, we propose a refinement block and a parallel attention model and embed them into the overall model structure. At the same time, to address the significant imbalance between edge and non-edge pixels, an adaptive weight loss function is proposed. Independent experiments on the BSDS500, NYUDv2, and Multi-Cue datasets verify that the proposed method outperforms the state-of-the-art.