Wireframe Parsing With Guidance of Distance Map

We propose an end-to-end method for simultaneously detecting local junctions and the global wireframe in man-made environments. Our pipeline consists of an anchor-free junction detection module, a distance map learning module, and a line segment proposing and verification module. A set of line segments is proposed from the predicted junctions with guidance of the learned distance map, and further verified by the proposal verification module. Experimental results show that our method outperforms the previous state-of-the-art wireframe parser by a decent margin. In terms of line segment detection, our method shows competitive performance on standard benchmarks. The proposed networks are end-to-end trainable and efficient.<xref ref-type="fn" rid="fn1"><sup>a</sup></xref><fn id="fn1"><label><sup>a</sup></label><p>The code will be released on GitHub for reproduction of the results.</p></fn>


I. INTRODUCTION
Inferring 3D geometric information of a scene from 2D images has been a fundamental yet difficult problem in computer vision. For a long time, researchers have modeled and reconstructed 3D scenes by extracting and matching local features (e.g., SIFT features, corners and patches). The feasibility of incorporating these local features by matching or tracking has been demonstrated by numerous works. Meanwhile, modern scenarios, which often involve complex interactions of autonomous agents (UAVs, cars, home robots) with cluttered man-made environments (indoor or outdoor), present greater challenges for conventional approaches based on local features. Specifically, man-made scenes often consist of large areas of textureless surfaces, and there may exist areas of repetitive patterns (e.g., facades) which cause much ambiguity for matching. Therefore, it is crucial for vision systems to capture more global features to analyze the scenes more accurately.
To cope with these challenging environments, many works exploit prior knowledge about the scene to relax the problem. For instance, the Manhattan world assumption [1]-[5] has proven to benefit 3D reconstruction tasks. However, the Manhattan world assumption is often violated in cluttered man-made environments. Fortunately, independent of the assumption, we can formulate such environments as the ''wireframe'' defined in [6]. The wireframe [6] consists of a subset of salient lines and junctions in the scene. Conceptually, such junctions and lines could just be a subset among all the local corner and edge features extracted by traditional methods, but they encode rich information about the large-scale geometry of the scene, which is the key to understanding the scenes more globally. Detection of line segments [7] and junctions [5], [8] in 2D images has been studied in previous works. The detection results are further used to recover the geometry of the scenes in [2], [4], [9]-[11]. Typically, these methods rely on low-level cues and involve various heuristics, including the search for appropriate thresholds, RANSAC-based verification techniques, etc., which limit their scalability in new scenarios.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Shen.
With the success of deep learning in computer vision, more and more approaches based on deep learning have been proposed to tackle vision tasks. Specifically, in the literature on line segment and junction detection, several methods have been proposed to detect line segments [12] or estimate the wireframe [6] in the image. In [12], a line attraction field map (AFM) method is proposed to detect line segments by learning an embedding v_dir for each pixel p in the image. The v_dir encodes the direction (with length) to the line segment pixel which is the closest to p. In the post-processing stage, the line segment candidate pixels are found by adding [...]
• The junction decoder predicts a fixed number of junctions (i.e., one) in each grid cell. Considering the uneven distribution of junctions in scenes, the assumption that there exists only one junction in a single grid cell does not apply to scenes with dense junctions, such as images of facades and urban street views.
• The process of proposing line segments from junctions suffers from inaccurate junction predictions. In particular, inaccurate junction branches make it difficult to pair junctions.
• The proposed line segments are not further verified to reject false connections between junctions.
Based on the above discussion of previous works on line segment detection and wireframe parsing, we propose an end-to-end network for parsing the wireframe of man-made environments. Overall, our network contains two stages. In the first stage, our network learns three per-pixel embedding maps: a junction heatmap, a junction branch map, and a distance map. In the second stage, a set of line segment proposals is generated from the junctions with guidance of the distance map, and further verified through a verification network.
As depicted in Fig. 2, with an image fed into the network, the junction detection module outputs a junction heatmap (heatmap decoder) and a junction branch map (branch decoder). Junctions are obtained by picking the locations of high density over the junction heatmap; thus our network detects junctions in an anchor-free manner and is able to detect an unlimited number of junctions. Meanwhile, the distance decoder outputs a distance map. Examples of distance maps are shown in Fig. 1. Then we propose a set of line segments using our Least Distance algorithm from the predicted junctions guided by the learned distance map. Finally, the proposals are passed to a verification module and each is assigned a confidence score.
FIGURE 1. Second column: The pre-computed ground-truth for distance map learning; Third column: The converted distance map. We set an array of thresholds {b_i, i = 1, 2, ..., 5} for the log-normalized distance value, [...]
Contributions of this work include: (i) We propose an end-to-end learning network for wireframe parsing in images of man-made environments, consisting of an anchor-free junction detection module, a distance learning module and a line segment proposing and verification module. (ii) Our method outperforms the previous wireframe parser by a decent margin, and is more efficient. (iii) Compared with the previous wireframe parser, our method produces much cleaner wireframe results with guidance of the learned distance map. (iv) In terms of line segment detection, our method is close to the state-of-the-art line segment detection method (AFM [12]).

II. RELATED WORK
A. JUNCTION DETECTION AND KEYPOINT DETECTION
Junctions play an important role in computer vision; however, detecting junctions in images remains a challenging problem. Typical local methods, such as the Harris operator [14], are based on 2D variation in the intensity signal and are weak at handling textured regions. More recent methods find junctions in natural images [15] by studying contour curvature. Others group line segments [5], [8], [9] to find junctions (i.e., the intersections of segments). Psychophysical analysis suggests that local junctions are difficult to detect, even for humans [16].
Keypoint detection is closely related to some reconstruction tasks, such as reconstruction of scenes, faces and human poses. Many recent approaches for keypoint estimation apply deep neural networks (DNNs). Earlier works formulate keypoint estimation as a location (keypoint coordinates) regression problem [17]. In later works, keypoint detection methods mostly rely on regressing a pre-computed keypoint heatmap [18], [19]. The heatmap can be easily converted to keypoint coordinates by searching for the local maxima. Junctions can be naturally considered as geometric keypoints. Huang et al. [6] train an anchor-based junction detector using deep neural networks on a large-scale dataset with junction annotations, and achieve state-of-the-art results. Recently, Zhou et al. [20] propose to find junctions in an end-to-end manner.
FIGURE 2. The overall architecture of our end-to-end wireframe parser. After the first few layers, the input image is downsampled to H/8 × W/8. F_n is the mid-layer feature generated by the n-th hourglass module, and the number of channels of all F_n is 256 in our experiments. We use 3 hourglass modules in total, i.e., n = 3. The decoders of distance, heatmap and branch are sets of convolution layers with the same layer configuration. In the proposal verification module, we perform RoI-Align [13] on F_3.

B. PIXEL-LEVEL EDGE DETECTION AND LINE SEGMENT DETECTION
Line segment detection has long been studied. Conventional methods typically rely on grouping low-level cues [7], [21]-[24]. These approaches suffer from the difficulty of finding an appropriate threshold to filter out false detections. Some other works extend the Hough transform to line segment detection [25]-[27]. In recent years, machine learning approaches, especially those based on deep learning, have been shown to produce state-of-the-art results in generating pixel-wise edge maps [28]-[31]. Huang et al. [6] propose to parse the wireframe in man-made scenes. The wireframe defined in [6] consists of a subset of salient line segments and the intersections between them in scenes. They also build a large-scale dataset for training, in which each image is annotated with line segments and their intersecting relationships. Following the Wireframe Parser, Xue et al. propose a line attraction field network [12] for line segment detection and achieve state-of-the-art line segment detection results. Unlike the Wireframe Parser, they only focus on the detection of line segments, ignoring the relationship between the segments. Zhang et al. [32] formulate the wireframe parsing task as a graph optimization problem.

C. LEARNING THE GEOMETRY
With the success of deep learning, learning-based approaches for inferring pixel-level geometric properties of scenes have been developed for several years. Geometric properties such as depth [33], [34] and surface normals [35], [36] have been extensively studied. Recently, more and more work pays attention to higher-level geometric primitives.
For instance, [37] proposes a method to recognize planes in a single image, [38] uses an SVM to classify indoor planes, and [39]-[41] train fully convolutional networks to predict room layout edges formed by the pairwise intersections of room faces. Liu et al. propose PlaneNet [42] to detect piecewise planar regions from a single image. Yang and Zhou [43] recover planar regions with deep networks.

D. EMBEDDING LEARNING FOR DETECTION
Embedding learning is frequently used in instance segmentation. Usually the embedding is learned by regressing a pre-computed embedding map generated from the ground-truth annotations, and it acts as the cue to cluster pixels or group keypoints into instances. For example, Liang et al. use proposal-free networks (PFN) [44] to handle the instance segmentation task. PFN takes the coordinates of the center, top-left corner and bottom-right corner of the object instance that a specific pixel belongs to as the target embedding, which is used to cluster segmentation results into object instances. Bai and Urtasun [45] apply deep networks to learn a distance embedding, which measures how close a pixel in an instance object is to the boundary of the instance. The network outputs an embedding map with peaks and basins. The watershed transform is then used in post-processing to find the basins, i.e., the instance boundaries. Papandreou et al. [46] propose to learn a pose keypoint heatmap along with the displacement of each keypoint to other keypoints. Recently, Xue et al. [12] propose line attraction field networks to learn an embedding for line segment detection. The line attraction embedding encodes the direction (with length) of each pixel to the closest line segment pixel, and is post-processed with the squeeze module to find line segments.

III. METHOD
For clarity of notation, we let p and x denote an image pixel and a junction, respectively, and we denote a line segment by l. In addition, the superscripts 'g' and 'p' represent ground truth and predictions, respectively. In terms of distance functions, we use the smaller of the Euclidean distances from x to x_s and to x_e, the two endpoints of a line segment. d(x_i, x_j) is the Euclidean distance between two junctions. d_⊥(x, l) defines the perpendicular distance of junction x to line segment l, while d(p, l) denotes the shortest distance of pixel p to line segment l, as illustrated in Fig. 3. The projection of junction x on line segment l is denoted by ρ(x, l). In the distance map, we use d_p to represent the shortest distance of pixel p to any line segment pixel in the image, and d_p^n to denote the log-normalized distance value.
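The distance functions above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, points are assumed to be (x, y) tuples with x as the column index, ρ(x, l) is assumed to project onto the supporting line of l without clamping, and the log-normalization log(d_p + 1) is taken from Section III-B with the natural logarithm assumed.

```python
import math

def point_segment_distance(p, seg):
    """d(p, l): shortest distance from point p to the segment l = (x_s, x_e)."""
    (px, py), ((sx, sy), (ex, ey)) = p, seg
    dx, dy = ex - sx, ey - sy
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:  # degenerate segment: both endpoints coincide
        return math.hypot(px - sx, py - sy)
    # Position of the closest point along the segment, clamped to [0, 1].
    t = ((px - sx) * dx + (py - sy) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (sx + t * dx), py - (sy + t * dy))

def projection(x, seg):
    """rho(x, l): projection of x onto the supporting line of l (assumed unclamped)."""
    (px, py), ((sx, sy), (ex, ey)) = x, seg
    dx, dy = ex - sx, ey - sy
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return (sx, sy)
    t = ((px - sx) * dx + (py - sy) * dy) / length_sq
    return (sx + t * dx, sy + t * dy)

def perpendicular_distance(x, seg):
    """d_perp(x, l): distance from x to its projection on the supporting line."""
    qx, qy = projection(x, seg)
    return math.hypot(x[0] - qx, x[1] - qy)

def log_distance_map(height, width, segments):
    """Per-pixel target d_p^n = log(d_p + 1), with d_p the minimum over all segments."""
    return [
        [math.log(min(point_segment_distance((col, row), s) for s in segments) + 1.0)
         for col in range(width)]
        for row in range(height)
    ]
```

For a pixel beyond the end of a segment, d(p, l) measures the distance to the nearest endpoint while d_⊥ measures the distance to the extended line, so the two can differ.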

A. ANCHOR-FREE JUNCTION DETECTION
Junctions play an important role in our line segment proposing algorithm. We follow the definition and representation in [6], with some modifications that make the representation more convenient.

1) JUNCTION HEATMAP REGRESSION
In [6], the junction sub-network divides the image into a K × K grid, and detects at most one junction in each grid cell. The cell centers act as a kind of anchor commonly used in object detection methods [47]. The use of anchors is based on the assumption that there exists at most one junction in each cell. However, this assumption does not apply to some scenes with dense junctions.
As mentioned in Section II, heatmap regression has become a common practice in keypoint detection tasks, such as human pose estimation [18]. By applying a heatmap, estimation of the keypoint coordinates is transformed into spatial density regression. Inspired by this, we detect junctions by first regressing a junction heatmap. The ground truth of the junction heatmap is generated by placing a Gaussian probability distribution with a radius of 11 (the image size is 384) centered at each junction x. As shown in Fig. 2, the junction heatmap decoder outputs an H × W junction heatmap. Then locations with local maximum density on the learned heatmap are extracted as junction candidates through non-maximum suppression (NMS). After NMS, we obtain a set of junction candidates {(x_i, c_i), i = 1, 2, ..., N_max}, where N_max is the maximum number (a pre-set value) of junctions in an image and c_i is the heatmap density value at x_i. Overall, the proposed junction detection method shows more flexibility without relying on any anchors or assumptions.
FIGURE 4. The representation of junction branches, adopted from [6]. If a branch falls into the k-th bin, then it is represented with (k, Δ_k), where Δ_k is the offset of the branch from the bin center (the black dashed line) in the clockwise direction. Note that Δ_k is normalized to [−1, 1).
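The heatmap target and the peak extraction described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the relation between the radius and the Gaussian standard deviation (sigma = radius / 3), the 3 × 3 suppression window, and the function names are all hypothetical choices.

```python
import math

def gaussian_heatmap(height, width, junctions, radius=11, sigma=None):
    """Ground-truth heatmap with one Gaussian blob per junction (integer (x, y))."""
    sigma = sigma if sigma is not None else radius / 3.0  # assumed radius-to-sigma relation
    heat = [[0.0] * width for _ in range(height)]
    for jx, jy in junctions:
        for y in range(max(0, jy - radius), min(height, jy + radius + 1)):
            for x in range(max(0, jx - radius), min(width, jx + radius + 1)):
                value = math.exp(-((x - jx) ** 2 + (y - jy) ** 2) / (2.0 * sigma ** 2))
                # Where blobs overlap, keep the stronger response.
                heat[y][x] = max(heat[y][x], value)
    return heat

def extract_junctions(heat, n_max, min_score=0.0, window=1):
    """Simple NMS: keep strict local maxima, sorted by score, at most n_max of them."""
    height, width = len(heat), len(heat[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            c = heat[y][x]
            if c <= min_score:
                continue
            neighbours = [
                heat[yy][xx]
                for yy in range(max(0, y - window), min(height, y + window + 1))
                for xx in range(max(0, x - window), min(width, x + window + 1))
                if (yy, xx) != (y, x)
            ]
            # Strict comparison: flat plateaus are dropped in this simplified version.
            if all(c > v for v in neighbours):
                peaks.append(((x, y), c))
    peaks.sort(key=lambda item: item[1], reverse=True)
    return peaks[:n_max]
```

Because every peak is kept regardless of which grid cell it falls in, two junctions that lie close together can both be recovered, which is the property the anchor-free formulation relies on.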

2) JUNCTION BRANCH DEFINITION AND REPRESENTATION
For junction branches, we adopt the same multi-bin representation proposed in [6]. As depicted in Fig. 4, the circle (i.e., from 0 to 360 degrees) is divided into K equal bins, with each bin spanning 360/K degrees. Let the center of the k-th bin be b_k; then an angle θ is represented as (k, Δ_k) if θ is located in the k-th bin, where Δ_k is the angle offset from the center b_k in the clockwise direction. Thus, each valid bin (i.e., a bin in which a junction branch exists) regresses to this local orientation Δ_k. We normalize Δ_k to be in [−1, 1) in all our experiments. For each predicted junction location (from the heatmap), the branch decoder in Fig. 2 outputs a 2K-dimensional vector, i.e., a confidence score and an angle offset for each of the K bins.
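As a small worked example of the multi-bin representation, the sketch below encodes an angle as a bin index plus a normalized offset and builds the 2K-dimensional target for one junction. The bin layout (bin k covering [k · 360/K, (k + 1) · 360/K) with its center in the middle), the direction in which angles increase, and the function names are assumptions made for illustration.

```python
def encode_branch(theta_deg, num_bins):
    """Return (k, offset): bin index and offset from the bin center in [-1, 1)."""
    bin_width = 360.0 / num_bins
    theta = theta_deg % 360.0
    k = min(int(theta // bin_width), num_bins - 1)  # guard against float rounding
    center = (k + 0.5) * bin_width
    offset = (theta - center) / (bin_width / 2.0)  # half a bin maps to 1
    return k, offset

def decode_branch(k, offset, num_bins):
    """Inverse of encode_branch: recover the angle in degrees."""
    bin_width = 360.0 / num_bins
    return ((k + 0.5) * bin_width + offset * bin_width / 2.0) % 360.0

def branch_target(thetas_deg, num_bins):
    """Build the per-junction target: K confidences and K offsets."""
    confidence = [0.0] * num_bins
    offsets = [0.0] * num_bins
    for theta in thetas_deg:
        k, offset = encode_branch(theta, num_bins)
        confidence[k] = 1.0
        offsets[k] = offset  # a later branch in the same bin overwrites an earlier one
    return confidence, offsets
```

With K = 8, an angle of 100 degrees falls in bin 2 (center 112.5 degrees), so its normalized offset is (100 − 112.5) / 22.5 ≈ −0.56.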
However, the junction branch defined in [6] is incomplete for proposing line segments. Geometrically, a junction is the intersection of n (n ≥ 2) line segments; that is, each junction should have at least two branches. However, the endpoints of some line segments lie on the image boundary (the line is cut off by the image boundary) or on curved lines. Neither case is considered in the previous definition. Hence we extend the junction definition to the union of intersections of line segments and isolated endpoints of line segments (e.g., on the image boundary or on non-straight lines). Each junction then contains n (n ≥ 1) branches. This definition is more flexible, and more suitable for finding line segments from junctions.
In [6], junction locations and branches are predicted separately. We similarly use two decoders to predict the junction heatmap (1 × H × W) and the branch map (2K × H × W), respectively.

B. DISTANCE MAP LEARNING
The shortest distance of a point to a line segment is depicted in Fig. 3. For an image, the distance map D is computed by choosing the smallest among all distances {d(p, l_i), i = 1, 2, ..., N_L} for each pixel p, where N_L is the total number of ground-truth line segments in the image. As mentioned previously, the embedding learned in instance segmentation is exploited to group pixels into instances in post-processing. In our case, the distance map is designed to filter out non-line textures and encodes a value for each pixel. Examples of ground-truth distance maps are presented in Fig. 1. We use the log-normalized distance d_p^n = log(d_p + 1) as the target for distance map learning, since d_p can be very large. [...]
Let r_ik denote the ray formed by junction x_i and branch θ_ik, and ψ(x_i, x_j, r_ik) be the angle difference between the vector from x_i to x_j and the ray r_ik. The Least Distance algorithm takes the following steps.
• At first, we initialize an empty candidate set C_i and an empty proposal set P. C_i stores the line segment candidates starting from [...]
• For a single candidate in C_i, we extract a fixed length of distance values from D using our modified version of the RoI-Align module [13]. Specifically, for candidate (x [...]
Let ϕ denote the angle difference between two lines, [...] ) if either of the two conditions below is satisfied. Condition 1 covers most cases, and condition 2 handles the situation where a ground-truth line segment was not annotated with accurate length (a little shorter or longer). The tolerance threshold τ_⊥ is set to 0.2 in all our experiments.
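The angle difference ψ(x_i, x_j, r_ik) used by the Least Distance algorithm can be sketched as below. This covers only the angle computation, not the algorithm itself; the angle convention (measured with atan2 in image coordinates, in degrees, and shared with the branch angle θ_ik) and the function name are assumptions.

```python
import math

def angle_difference(x_i, x_j, theta_ik_deg):
    """psi: angle in [0, 180] degrees between the vector x_i -> x_j and the ray
    leaving x_i at angle theta_ik (assumed to use the same angle convention)."""
    vx, vy = x_j[0] - x_i[0], x_j[1] - x_i[1]
    vec_angle = math.degrees(math.atan2(vy, vx))
    diff = abs(vec_angle - theta_ik_deg) % 360.0
    # Wrap around so that, e.g., 350 and 10 degrees differ by 20, not 340.
    return min(diff, 360.0 - diff)
```

A small ψ means that junction x_j lies close to the direction in which branch k of junction x_i points.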

2) PROPOSAL VERIFICATION
In the proposal verification module, features for all the line segment proposals are extracted by performing the modified RoI-Align on F_n (the feature layer after the n-th hourglass module) and fed into a proposal classification network, which outputs a confidence score for each proposal. For RoI-Align, we first sample S equidistant sampling points from the starting point to the ending point of a line segment, and sample values on the feature map F_n (shown in Fig. 2) at those sample points. Note that the original RoI-Align module [13] is intended for extracting features of bounding box areas; hence we modified the module to extract features for line segments.
To align the features with the line segment, our modified RoI-Align module samples feature values by bilinear interpolation to get sub-pixel features. Assuming the number of channels of F_n is C, the extracted feature map for each line segment proposal is of shape S × C.
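The line-segment sampling described above can be sketched in plain Python as follows. It is a minimal illustration rather than the authors' module: the feature map is assumed to be a nested list indexed as feature[channel][row][column], endpoints are assumed to be given in feature-map coordinates and included among the S samples, and coordinates outside the map are clamped.

```python
import math

def bilinear(feature, x, y):
    """Interpolate all C channels of feature[c][row][col] at sub-pixel (x, y)."""
    height, width = len(feature[0]), len(feature[0][0])
    x = min(max(x, 0.0), width - 1.0)   # clamp to the valid range
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * ch[y0][x0] + ax * ch[y0][x1])
        + ay * ((1 - ax) * ch[y1][x0] + ax * ch[y1][x1])
        for ch in feature
    ]

def line_roi_align(feature, start, end, num_samples):
    """Sample S equidistant points from start to end; returns an S x C list."""
    samples = []
    for s in range(num_samples):
        t = s / (num_samples - 1) if num_samples > 1 else 0.0
        x = start[0] + t * (end[0] - start[0])
        y = start[1] + t * (end[1] - start[1])
        samples.append(bilinear(feature, x, y))
    return samples
```

The output has one row per sampling point and one column per channel, matching the S × C shape stated above regardless of the length or orientation of the proposal.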

E. NETWORK ARCHITECTURE
Our network follows a typical encoder-decoder style, as shown in Fig. 2. The first few layers are convolution and pooling layers, which downsample the input to 1/8 of the original resolution. Then n (n = 3 in our experiments) hourglass modules [18] are stacked to get the feature maps {F_i, i = 1, 2, ..., n}. The following decoders can take any F_i as input, and we use F_3 as input for the decoders, as shown in Fig. 2. Each decoder consists of a set of convolution and upsampling layers, followed by a 1 × 1 convolution layer to get the desired embedding maps. For simplicity, all the decoders share the same layer configuration. To classify the proposals, the proposal verification module consists of two convolution layers and one fully connected layer. Note that all convolution layers are followed by a batch normalization layer.
[...] where J^p and J^g refer to the predicted and ground-truth junction heatmaps, respectively. Note that we adopt the weighting scheme of the focal loss [48] to combat the imbalance between zero and non-zero values on the heatmap. [...] where we weight the loss of each pixel p with d_p^p, since pixels far from lines result in a much larger loss if not weighted.
For junction branches, as mentioned above, we first match every predicted junction x_i^p ∈ X to the ground-truth junctions G = {x_i^g}. We define a mapping function f(i) to represent the index of the ground-truth junction which is matched with the predicted junction i. c_i and r_i are both K-dimensional vectors (K is the number of bins) representing the branch confidence and residual of junction x_i, respectively. Assume there are N junction predictions in X, and [...]
As mentioned in Section III-D, each line segment proposal l_k^p is assigned a label g_k. Let cls(l_k^p) denote the confidence score output by the proposal classification network, then [...]
The total loss combines the sub-losses as
L = w_J L_J + w_D L_D + w_{bc} L_{bc} + w_{br} L_{br} + w_{prop} L_{prop}    (6)
where w_J, w_D, w_bc, w_br, w_prop are weights to balance the sub-losses.

G. TRAINING
In the training and testing phases of our network, images are resized to 384 × 384 before input. We follow the data augmentation strategy in [6], including mirroring and flipping images upside-down. The training of our network consists of two stages. In the first stage, we set w_prop to zero for the first 25 epochs, since there are not enough proposals for classification in the early stage of training. In the second stage, we resume the training of L_prop and train for 5 more epochs.² We train our network with the stochastic gradient descent (SGD) optimizer with momentum 0.9, an initial learning rate of 0.25,³ and a batch size of 4.
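The two-stage schedule can be expressed as a small helper. The sketch below assumes zero-based epoch indices, that the total loss is a weighted sum of the sub-losses, and placeholder weight values of 1.0; the dictionary keys and function names are hypothetical, and only the numbers 25, 5, 0.9, 0.25 and 4 come from the text.

```python
# Optimizer settings quoted in the text; the dictionary layout is illustrative.
TRAIN_CONFIG = {
    "optimizer": "SGD",
    "momentum": 0.9,
    "initial_lr": 0.25,
    "batch_size": 4,
    "warmup_epochs": 25,   # stage 1: proposal loss switched off
    "extra_epochs": 5,     # stage 2: proposal loss switched on
}

def proposal_loss_weight(epoch, w_prop, warmup_epochs=25):
    """Return 0 during the first stage and w_prop afterwards (epoch is zero-based)."""
    return 0.0 if epoch < warmup_epochs else w_prop

def total_loss(losses, weights, epoch, warmup_epochs=25):
    """Weighted sum of sub-losses with the proposal term gated by the schedule."""
    total = 0.0
    for name, weight in weights.items():
        if name == "prop":
            weight = proposal_loss_weight(epoch, weight, warmup_epochs)
        total += weight * losses[name]
    return total
```

For example, with all weights set to 1.0, the proposal term contributes nothing at epoch 24 and its full value from epoch 25 onward, while the other four terms are unaffected.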

IV. EXPERIMENTS
We evaluate the proposed method and make a comparison with the previous state-of-the-art wireframe parser [6].⁴ We also compare our line detection results with the [...]
The Wireframe dataset [6] contains 5462 images of indoor and outdoor man-made environments. Each image is annotated with a set of junctions and lines. We follow the split in [6], i.e., 5000 images for training and validation, and 462 images for testing. On the York Urban dataset, we simply use the model trained on the Wireframe dataset to evaluate the performance.
² Weights of other losses are not changed.
³ Note that our learning rate is large because the scale of our loss is small.
⁴ https://github.com/huangkuns/wireframe

B. EVALUATION METRICS
In all our experiments, the precision and recall as described in [6], [51] are reported. Specifically, precision and recall are computed by comparing the line segment pixels map of the prediction and ground-truth. The precision measures the proportion of true detections among all the detected line segments, whereas recall indicates the percentage of true detections among all the ground-truth line segments in the image.
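As a rough illustration of the pixel-map comparison, the sketch below computes precision and recall from two binary maps. It is a simplification under stated assumptions: a predicted line pixel counts as a true detection only if the same pixel is set in the ground-truth map, whereas the protocol of [6], [51] may allow a matching tolerance; the function name is hypothetical.

```python
def precision_recall(pred_map, gt_map):
    """Precision and recall between two binary maps given as nested lists."""
    true_pos = sum(
        1
        for pred_row, gt_row in zip(pred_map, gt_map)
        for p, g in zip(pred_row, gt_row)
        if p and g
    )
    num_pred = sum(1 for row in pred_map for p in row if p)
    num_gt = sum(1 for row in gt_map for g in row if g)
    precision = true_pos / num_pred if num_pred else 0.0
    recall = true_pos / num_gt if num_gt else 0.0
    return precision, recall
```

Evaluating this pair at each threshold of a detector yields one point on its precision/recall curve.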

C. COMPARISONS WITH OTHER METHODS
The proposed method is compared with the previous Wireframe Parser [6], the Line Segment Detector (LSD) [7], the Markov Chain Marginal Line Segment Detector (MCMLSD) [50], and the Line Attraction Field (AFM) [12]. These methods filter out false detections with different thresholds. We set an array of thresholds for −log(NFA) (NFA is the number of false alarms) in the LSD implementation, i.e., 0.01 × {1.75^0, ..., 1.75^19}. For MCMLSD, we select the top K (in terms of confidence) line segment detections for comparison. For AFM, the aspect ratio is used to filter out false detections, and it varies in the range (0, 1] with a step size of 0.1. The Wireframe Parser rejects false detections by applying an array of thresholds [2, 6, 10, 20, 30, 50, 80, 100, 150, 200, 250, 255] to binarize the line heatmap, while keeping the thresholds of junction confidence and junction branch confidence fixed. In the proposed method, we simply vary the threshold on the confidence of line segment detections in the range (0, 1] with a step size of 0.1 to pick true detections.
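The threshold arrays above can be generated as in the following sketch. The LSD array is read here as 0.01 × 1.75^i for i = 0, ..., 19, which is an interpretation of the text, and keeping detections whose score is greater than or equal to the threshold is likewise an assumption; the variable and function names are hypothetical.

```python
# Thresholds on -log(NFA) for LSD, read as 0.01 * 1.75**i for i = 0..19.
lsd_thresholds = [0.01 * 1.75 ** i for i in range(20)]

# Confidence thresholds in (0, 1] with a step of 0.1 for the proposed method.
confidence_thresholds = [round(0.1 * i, 1) for i in range(1, 11)]

# Heatmap binarization thresholds quoted for the Wireframe Parser.
wireframe_thresholds = [2, 6, 10, 20, 30, 50, 80, 100, 150, 200, 250, 255]

def sweep(detections, thresholds):
    """For each threshold, keep the segments whose score reaches it.

    detections: list of (segment, score) pairs.
    Returns a list of (threshold, kept_segments) pairs.
    """
    return [
        (t, [segment for segment, score in detections if score >= t])
        for t in thresholds
    ]
```

Each entry returned by `sweep` corresponds to one operating point, so a detector with ten thresholds contributes ten points to its precision/recall curve.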
The precision/recall curves are presented in Fig. 5. As shown, our method outperforms the previous Wireframe Parser by a decent margin on the Wireframe dataset, and is close to the Wireframe Parser on the York Urban dataset. It is worth noting that the York Urban dataset is a small dataset intended for Manhattan line estimation, and some line segments in the images are not annotated. Hence the performance of almost all approaches drops significantly on this dataset. Compared with the state-of-the-art line segment detectors, our method outperforms LSD and MCMLSD, and is close to the recent AFM on the Wireframe dataset.
All our experiments are conducted on one NVIDIA Titan X GPU device. We compare the inference speed of all the methods mentioned above in Table 1. As shown, our method is much faster than the previous Wireframe Parser [6], since it does not require any post-processing step. Our method also runs as quickly as AFM-unet [12].

D. QUALITATIVE EVALUATION OF RESULTS
The results of all aforementioned methods are visualized in Fig. 6. For the proposed method and the Wireframe Parser, the junctions (yellow) and line segments (green) are displayed together. Since the line segment detection methods do not output junctions, we simply treat the endpoints of the line segments as junctions and present the ''junctions'' and the line segments.
As shown, LSD [7] and MCMLSD [50] are sensitive to local textures, and produce many short line segments. AFM [12] greatly alleviates this limitation by suppressing the interference of local textures. However, AFM still yields some short line segments. The Wireframe Parser further reduces the effect of local textures by defining junctions explicitly and then establishing connections between junctions. Unfortunately, it produces some wrong connections because it has no means of rejecting them. Compared with the Wireframe Parser [6], our method produces a much cleaner wireframe. Specifically, our method is better at suppressing wrong connections between junctions in two respects. On the one hand, the learned distance map guides the establishment of connections between the detected junctions, according to the Least Distance algorithm. On the other hand, the line proposal verification module explicitly suppresses wrong line proposals.
In terms of line segment detection, our method is close to AFM [12], the state-of-the-art line segment detector. Different from AFM, our network yields both the line segments and the intersections between pairs of them.

V. CONCLUSION
In this paper, we propose an end-to-end wireframe parsing method based on the previous state-of-the-art wireframe parser. The proposed parser consists of an anchor-free junction detection module, a distance map learning module and a line segment proposing and verification module. Line segments are proposed from junctions with guidance of the distance map, and further verified by the verification module. The network is end-to-end trainable and efficient. Experimental results show that our method outperforms the previous wireframe parser, and is close to the recent AFM [12] in terms of line segment detection. There is still plenty of room for improvement, and the end-to-end pipeline can be further optimized. Considering the difficulty of the task and the great variety in the dataset, one future direction is to generate large-scale synthetic wireframe data to help train better models.