DNN-Based Map Deviation Detection in LiDAR Point Clouds

In this work we present a novel deep learning-based approach to detect and specify map deviations in erroneous or outdated high-definition (HD) maps using both sensor and map data as input to a deep neural network (DNN). We first present our proposed reference method for map deviation detection (MDD) utilizing a sensor-only DNN detecting traffic signs, traffic lights, and pole-like objects in LiDAR data, with deviations obtained by subsequently comparing detected objects and examined map. Second, we facilitate the object detection task by using the examined map as additional input to the network. Third, we employ a specialized MDD network to directly infer the correctness of the map input. Finally, we demonstrate the robustness of our approach for challenging scenes featuring occlusions and a reduced point density, e.g., due to heavy rain. Our code is available at https://github.com/Volkswagen/3dhd_devkit.

Due to shortcomings of today's perception and environment modeling algorithms [1], the driving function of automated vehicles relies on prior knowledge regarding the stationary environment in the form of high-definition (HD) maps. Such a driving function is prone to failure as map data can deviate from the real world. In our previous work [2], we proposed a system framework to achieve dependable maps to address the issue of such map deviations. Such dependability requires a system that detects, specifies, and corrects map deviations to obtain maps that are safe to use, reliable, and available. Specifically, the safe use of map data refers to the detection of deviations ahead of the vehicle, while reliable maps feature short update cycles. Available maps are obtained by correcting deviations on the fly within the vehicle, e.g., in construction sites. This article proposes a map deviation detection (MDD) method for such a system.
[...] known (e.g., for road markings [5], traffic lights [6], or lanes [7]) but underperform in bad weather conditions [8] or in the presence of partial occlusions [7]. Therefore, a deviation detection method is needed that simplifies the detection task wherever possible by leveraging the map as a source of prior hypotheses regarding possible element locations and features. In this work, we present a novel approach to MDD based on a deep neural network (DNN) that takes both map and sensor data as input (see Fig. 1 for illustration). With this additional map input, fewer measurements are potentially required to either verify or falsify the correctness of an existing hypothesis, as respective measurements can be compared to an expected measurement distribution for a specific type of map element. Using a DNN, respective distributions and the comparison method can be learned, whereas hand-crafting yielded unsatisfying results in a related attempt [9], as expected distributions have proven difficult to derive manually.
Thereby, contextual knowledge from the scene can be incorporated to facilitate the deviation detection task, e.g., locations for pole-like objects (subsequently referred to as poles) increase the likelihood for traffic signs.
As proof of concept, we train and evaluate our DNN using the 3DHD CityScenes dataset [10] as the only publicly available dataset providing a holistic set of HD map elements, i.e., comprising signs, lights, and poles. The dataset features high-density LiDAR point clouds that were used to annotate the corresponding HD map, yielding accurate labels both in terms of completeness and spatial alignment between sensor and map data. Such a high-quality dataset has several key advantages compared to data obtained from onboard sensors. First, nearly all deviations are known, apart from those due to annotation mistakes, as we induce artificial deviations by changing both map and LiDAR data intentionally. Also, no other dataset provides annotated map deviations. Second, the usage of high-density point clouds without occlusions allows for respective ablation studies, where both a reduced point density (simulating onboard scans or bad weather conditions) and occlusions can be exactly controlled. Third, the highly precise spatial data alignment allows for the induction of artificial misalignment errors in a controlled fashion. Our holistic approach extends to various types of HD map elements and can be combined with different DNN architectures for 3D object detection. In this work, we apply 3DHDNet [10] being designed for predicting vertically stacked elements such as signs. As 3DHD CityScenes does not provide camera images, we only employ LiDAR point clouds as method input. However, also multimodal architectures additionally fusing camera images can be employed (e.g., [11], [12], [13]), given a respective dataset.
Our contributions are fourfold. First, we present our reference method for MDD utilizing a sensor-only DNN for object detection, with deviations being obtained by comparing detected objects with the examined map. It comprises a multitask extension of our earlier single-task 3DHDNet architecture [10], now capable of predicting traffic signs, traffic lights, and poles simultaneously. Second, we demonstrate a way of facilitating the object detection task by providing the DNN with the encoded, examined map as additional input. Third, we present our specialized MDD network that directly evaluates the correctness of individual map elements as the full expression of our concept shown in Fig. 1. Last, we perform ablation studies simulating bad weather and low-density onboard scans by reducing point density, and simulating partial occlusions of objects, showing the superior performance of our specialized MDD network. Our deviation annotations published along with our code allow for benchmarks in the field of MDD, while our open-source method may serve as a baseline for future research.

II. RELATED WORK
In this section, we review detection methods for HD map elements in both geodesy and in the automotive domain regarding traffic signs, traffic lights, and poles, followed by a review of existing methods for map deviation detection. We conclude with a brief review regarding DNN-based object detection in LiDAR point clouds.

A. MAP ELEMENT DETECTION
The generation of HD maps based on detecting map elements in high-density point clouds is a frequently researched topic in geodesy [14], [15], [16], while in the automotive field, such maps are typically only used while driving [17], [18]. Compared to subsequently presented approaches, we provide a holistic, DNN-based method that extends to various types of map elements and omits the need for hand-crafted, type-specific algorithms. Our novel multitask 3DHDNet integrates our previous work on DNN-based pole and sign detection [10], [19], and is extended in this work to also detect traffic lights.

B. MAP DEVIATION DETECTION (MDD)
Our conceptual system framework for dependable maps [2] assumes an MDD component capable of evaluating individual HD map elements, for which we propose a holistic and robust solution in this article. In comparison, early approaches to MDD examine ordinary navigation maps, which do not provide the level of detail (e.g., lane geometry) required for map-based driving [4], [57], [58]. More recently, MDD methods for lane markings have been proposed [55], [59], [60], [61], [62], [63], [64], whereby respective methods rely on prior object detection results as input. In contrast, our proposed solution features an additional map input to facilitate object detection in the first place, which allows for a robust map verification in challenging conditions. Only few works consider the additional map input for MDD [3], [65], [66]. Specifically, Hartmann et al. [65] consider the detection of lane geometry deviations using a DNN predicting the probability of the entire map being correct, whereas our method is able to evaluate map elements individually. Also predicting the correctness of the map as a whole, Lambert and Hays [66] fuse image and rendered map data within a DNN, whereby map deviations regarding lane geometry and crosswalks are simulated by modifying the map. They highlight the generalization of simulated deviations to those seen in the real world. In our work, we also simulate map deviations, but we modify both map and sensor data to create various deviation types.
Moreover, probabilistic methods for evaluating map correctness have been proposed [9], [67], [68]. Specifically, Raaijmakers [9] manually derived estimated sensor measurement distributions for roundabouts, which yielded unsatisfying results. Fabris et al. [67], [68] estimate map correctness using Bayesian networks assuming respective conditional probabilities, e.g., to model the influence of bad weather on the map correctness estimation. In contrast, we propose a method that learns expected measurement distributions without the need of prior assumptions.

C. DNN-BASED OBJECT DETECTION IN POINT CLOUDS
While early research for DNN-based object detection relied on hand-crafted encodings [69], [70], the paradigm has shifted towards learned feature encodings [71], [72], [73], reducing the loss of geometrical information. More recent research examines point-based networks [74], [75] that omit the discretization step and create predictions for each point. Our 3DHDNet architecture draws from network topologies designed for road user detection [71], [72], [73], but specifically allows for the resolution of vertically stacked map elements such as signs.

III. DNN-BASED MAP DEVIATION DETECTION
In this section, we introduce our approach to DNN-based map deviation detection (MDD). To this end, we first provide definitions on maps and types of map deviations in Section III-A. Subsequently, we present an overview on the three MDD method variants examined in Section III-B. We then present in detail our reference method MDD-SC in Section III-C, serving as performance reference for later evaluations. Thereby, we present our core concepts on DNN-based map element and deviation detection, which are adopted or modified by the method variants MDD-MC and MDD-M, subsequently presented in Sections III-D and III-E, respectively. The variant MDD-M incorporates our specialized MDD network that directly infers the correctness of the map input, implementing the concept shown in Fig. 1.

A. DEFINITIONS ON MAPS AND MAP DEVIATIONS
An HD map can be defined as a set $M = E \cup R$, consisting of a set of map elements $E$ (e.g., traffic signs or lane markings) and relations $R$ between elements (association, composition, or link relation) [2]. Map elements can be categorized into physical (real-world objects) and semantical elements (mental models, e.g., lanes or roundabouts), and can be differentiated by a type (major class, e.g., sign, light, or pole) and a set of (mandatory or characteristic) attributes (e.g., subclass or orientation). In general, to provide a complete definition for map deviations, deviating map items comprise both elements and relations: $F = F_E \cup F_R$. However, we only consider deviations regarding map elements $F_E$ in the following. To obtain map deviations, a set of map elements $\hat{E}$ extracted from sensor data is associated with a set of examined, presumably deviating map elements $\tilde{E}$, which yields the set of evaluated map elements $E_\text{eval} = V \cup U \cup F_E$, with the set of verifications $V$ comprising successfully associated ("verified") elements, $U$ being the set of "unknown" elements that could not be evaluated due to occlusion, and $F_E$ being the set of "deviating" map elements. To successfully associate elements, their major class must match and type-dependent overlap criteria must be fulfilled. Moreover, the set $F_E = F_{PS} \cup F_A$ can be further categorized into the set of deviating physical and semantical map elements $F_{PS}$, referring to the existence of respective elements, and the set of elements with attributional deviations $F_A$. In our approach to MDD, physical and semantical "deviations" $F_{PS} = D \cup I \cup S$ are determined first during the association step, comprising elements that are missing in the examined map (deletions $D$), falsely existing elements (insertions $I$), and replaced elements (substitutions $S$).
In a second step, map elements with attributional deviations $F_A$ may be obtained from the initial set of verifications $V$ by comparing attributes of associated elements on a more detailed level. As the second step is straightforward, we focus only on the more challenging predecessor step of predicting $F_{PS} = D \cup I \cup S$ in this work. An examined element $e \in \tilde{E}$ for which we detect a deviation ($e \in (I \cup S) \subset \tilde{E}$) is called "falsified".

B. OVERVIEW ON EXAMINED MDD VARIANTS
The method variants performing MDD that we compare in this work are shown in Fig. 2. Subsequently, we briefly introduce these variants, with respective details provided in Sections III-C to III-E.
Our reference method for MDD (MDD-SC) in Fig. 2 (a) utilizes a sensor-only (-S) object detection DNN and a subsequent comparison (-C), serving as a performance reference for later evaluations. Specifically, map elements are first detected in the sensor data using an object detection algorithm, yielding the set of predicted map elements $\hat{E}$, which is then compared to the examined set of (potentially deviating) map elements $\tilde{E}$ to obtain the set of evaluated map elements $E_\text{eval}$, which comprises respective map deviations $F_{PS}$. To this end, a point cloud $P$ as unordered set of points is sorted into a spatial voxel grid $m$ of fixed size featuring three spatial dimensions (subsequently referred to as "LiDAR feature map", see Fig. 1). Our DNN uses an encoder stage to learn an optimal feature representation for points contained in each voxel, which yields the encoded LiDAR feature grid $g$ that preserves the spatial dimensions. The network's backbone further extracts abstract features of higher semantics and provides the extracted feature grid $g'$ as input to the network heads. Specifically, we attach three individual network heads (visualized as one "heads" block in Fig. 2 for simplicity) that output bounding shapes for traffic signs $u^s$, traffic lights $u^l$, and poles $u^p$, respectively. In a post-processing step, these respective output feature maps are converted into the set of predicted map elements $\hat{E}$ as input to the aforementioned comparison with $\tilde{E}$. As we use high-density point clouds that are free of occlusions as sensor data for our proof of concept, the evaluated set of map elements $E_\text{eval} = V \cup F_{PS} = V \cup D \cup I \cup S$ omits the set of unknown elements $U$, which yields the four "evaluation states" $\mathcal{S} = \{\text{VER}, \text{DEL}, \text{INS}, \text{SUB}\}$ considered in this work: verification, deletion, insertion, or substitution, respectively.
As shown in Fig. 2 (b), the proposed and more advanced method MDD-MC, featuring a map-supported object detection DNN, uses the encoded, presumably deviating map as additional input $m'$ (-M) to facilitate the detection task, with map deviations still obtained by comparing detected objects in $\hat{E}$ with the set of examined map elements $\tilde{E}$ (-C). The network can leverage the map as a source of initial hypotheses regarding possible element locations and features, and as a source of contextual knowledge. However, as the map contains deviations, the network cannot rely on the map alone to detect missing map elements (deletions) or falsify existing map hypotheses (insertions or substitutions) in $\tilde{E}$, but has to decide internally when to rely on sensor or map data, respectively. To this end, MDD-MC additionally includes the map encoding step in Fig. 2 (b), whereby map elements in $\tilde{E}$ are matched to a voxel grid of the same size as the encoded LiDAR grid $g$, with respective element features (e.g., major class and bounding shape features) being incorporated into matched voxels, yielding the map representation $m'$.
The third method variant includes a map-supported deviation detection DNN (MDD-M) that directly evaluates the correctness of the additional map input (-M) without an explicit comparison. Specifically, we force the network to classify the evaluation state $s \in \mathcal{S}$ for each map element individually by using a specialized loss function, with a changed network output including the evaluation state classification in $u^s$, $u^l$, and $u^p$. In consequence, the network has to directly evaluate an existing element hypothesis by comparing available sensor data to map data. Thereby, the network needs to model expected sensor data distributions for specific map element types internally to infer an element's evaluation state, i.e., by comparing current sensor data to a learned, typical distribution in order to verify an element. A point density ablation study (cf. Section III-C1) will demonstrate the high performance of such an approach in the presence of degenerated sensor data, where the evaluation state has to be determined using only a few LiDAR points.

C. REFERENCE VARIANT MDD-SC: SENSOR-ONLY
In this section, we provide details regarding our reference method MDD-SC. First, we briefly summarize our DNN architecture for object detection and the loss function used during training. Subsequently, we present the required post-processing and comparison steps.

1) OBJECT DETECTION DNN
For the object detection DNN in Fig. 2 (a) and (b) used for the methods MDD-SC and MDD-MC, respectively, we apply the 3DHDNet architecture [10] comprising three stages: encoder, backbone, and heads. Compared to the original architecture, we provide a multitask extension of the network with multiple heads attached to the backbone, simultaneously predicting signs, lights, and poles. The network provides a learned encoding g for the point cloud input P [19], [71]. For more details regarding the internal operation of the encoder and backbone stage, see Appendix A.
Let $p = (p^\text{int}, p^\text{crd}) \in P$ be a single point of the point cloud $P$ that is input to any of the three methods depicted in Fig. 2, with $p^\text{int} \in \mathbb{I} = [0, 1]$ being an intensity measurement of the reflected LiDAR beam, and $p^\text{crd} \in \mathbb{R}^3$ being the Cartesian coordinate of a point. As a pre-processing step, the point cloud $P$ is first voxelized into a 3D voxel grid with $N_x$, $N_y$, and $N_z$ voxels in the x-, y-, and z-dimension of the grid. Subsequently, to increase network speed, only the $N$ occupied voxels with a maximum of $K = 96$ points per voxel are collected into the augmented LiDAR feature map $m = (m_{n,k})$ of size $N \times K \times 10$ as input to the encoder, with $n \in \mathcal{N} = \{1, \dots, N\}$ being the voxel index and $k \in \mathcal{K} = \{1, \dots, K\}$ being the point index, respectively. Each (augmented) point of the grid $m_{n,k} = (p^\text{int}_{n,k}, p^\text{crd}_{n,k}, p^\text{crd}_{n,k} - p_n, p^\text{crd}_{n,k} - v_n)$ provides ten features, with $p_n \in \mathbb{R}^3$ being the mean of all Cartesian point measurements contained in voxel $n$, and $v_n \in \mathbb{R}^3$ being the Cartesian center coordinate of the point's assigned voxel $n$. First, the encoder stage encodes all points contained in a voxel, providing a single feature vector of length $L = 256$ for each voxel, which yields the encoded LiDAR grid $g \in \mathbb{R}^{N_x \times N_y \times N_z \times L}$. Subsequently, the 3D backbone processes $g$ using a series of 3D (transposed) convolutions, extracting more abstract features and including context from surrounding voxels, which provides $g' \in \mathbb{R}^{N_x \times N_y \times N_z \times L'}$, with $L' = 384$ features per voxel. As shown in Fig. 3, in a last step, the abstract feature grid $g'$ is decoded by three individual map element heads to provide the network outputs $u^s$, $u^l$, $u^p$ for signs, lights, and poles, respectively, which we define subsequently. Note that for poles, $g'$ is reorganized by concatenating all features along the vertical z-dimension, providing a poles head input in $\mathbb{R}^{N_x \times N_y \times (N_z \cdot L')}$.
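To make the pre-processing concrete, the following minimal sketch builds the ten-feature augmented points for all occupied voxels. It is an illustration under stated assumptions, not the implementation from the accompanying repository: the function name `build_lidar_feature_map`, the input layout `[intensity, x, y, z]`, and a grid origin at zero are hypothetical choices.

```python
import numpy as np

def build_lidar_feature_map(points, s_vox=0.4, max_pts=96):
    """points: (P, 4) array with rows [intensity, x, y, z] (assumed layout).

    Returns the feature map of shape (N, K, 10) and the integer indices of
    the N occupied voxels. Grid origin at 0 is an assumption of this sketch.
    """
    crd = points[:, 1:4]
    idx = np.floor(crd / s_vox).astype(np.int64)          # voxel index per point
    occupied, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                          # robust across NumPy versions
    m = np.zeros((len(occupied), max_pts, 10), dtype=np.float32)
    for n in range(len(occupied)):
        in_voxel = points[inverse == n]
        mean = in_voxel[:, 1:4].mean(axis=0)               # mean of all points in voxel n
        center = (occupied[n] + 0.5) * s_vox               # voxel center coordinate
        pts = in_voxel[:max_pts]                           # keep at most K points
        p_crd = pts[:, 1:4]
        m[n, :len(pts)] = np.hstack(
            [pts[:, :1], p_crd, p_crd - mean, p_crd - center]
        )
    return m, occupied
```

Unused point slots remain zero-padded, so the tensor has a fixed size regardless of how many points fall into a voxel.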
Each head operates in a single-shot fashion [76], simultaneously predicting existence likelihoods and regressing bounding shapes for a set of predefined objects, so-called "anchors". If an anchor is likely to contain (part of) a real-world object, the existence likelihood increases. Let $et \in \mathcal{T} = \{s, l, p\}$ be the map element type (sign, light, or pole), with $\mathcal{T}$ being the set of considered map element types. For signs and lights, we employ the 3D anchor grids $\mathcal{G}^s, \mathcal{G}^l = \{1, 2, \dots, G\}$ with $G = N_x \cdot N_y \cdot N_z$ being the number of voxels, allowing for the vertical resolution of individual objects. For poles, due to their ground placement, we employ a 2D anchor grid in the x-y-plane, which reduces runtime and memory requirements, with $\mathcal{G}^p = \{1, 2, \dots, G'\}$ and $G' = N_x \cdot N_y$ being the number of cells in the 2D grid. The object detection head predicts an existence likelihood $o^{et}_g \in \mathbb{I}$ for each anchor $g$, with $\mathbb{I} = [0, 1]$, while the regression head adapts the anchor's (predefined) bounding shape to match the size and position of a real-world object. The bounding shape parameters differ for each map element head. For poles, we employ a bounding cylinder with the set of regression parameters $\mathcal{Q}^p = \{p_x, p_y, p_z, d\}$, with $p_x, p_y, p_z$ being the Cartesian position of a pole's base point, and $d$ being the pole's diameter. For lights, we use a bounding box $\mathcal{Q}^l = \{p_x, p_y, p_z, h, w, \phi\}$, with $h$ being the bounding box height, $w$ the width of a square base plate, and $\phi$ being the light's orientation. Last, for signs modeled as bounding rectangles, the set $\mathcal{Q}^s = \{p_x, p_y, p_z, h, w, \phi\}$ applies, with $h$ and $w$ indicating the rectangle's height and width, respectively. For signs and lights, $p_{(\cdot)}$ indicates the element's center position.
Instead of directly predicting bounding shape parameters, the regression head predicts the difference between a real-world object's bounding shape (superscript $O$) and the bounding shape of the respective anchor (superscript $A$) for normalization purposes. With $s^x_\text{vox} \times s^y_\text{vox} \times s^z_\text{vox} = s^3_\text{vox}$ and $s_\text{vox} = 40$ cm being the size of a cubic voxel, and the orientation $\phi$ being encoded as complex number with real and imaginary components $\phi_\text{re}$ and $\phi_\text{im}$, we define the regression head outputs $r^{et}$ in (1), (2), and (3), with $j = \sqrt{-1}$ being the imaginary unit, and $\kappa = 1$ for lights. Regarding signs, the network cannot easily distinguish between a sign's front and back without camera images. Thus, we limit a sign's orientation to $\phi \in [-90^\circ, 90^\circ]$ instead of encoding a full $360^\circ$ range. To ensure that signs with the same spatial orientation (difference of $180^\circ$) generate no loss, we use the factor $\kappa = 2$ in (3). We derive the default anchor parameters $(\cdot)^A_g$ based on the ground truth (GT) parameter distributions (cf. Appendix B), which yields $w^A_g = h^A_g = 0.65$ m for signs, $w^A_g = 0.3$ m and $h^A_g = 0.9$ m for lights, and $d^A_g = 0.2$ m for poles.
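The effect of the factor κ can be illustrated with a short sketch. It assumes that the orientation is encoded as the real and imaginary part of exp(jκϕ); this is a plausible reading of the description above rather than a reproduction of (3), and the function names are hypothetical.

```python
import numpy as np

def encode_orientation(phi_deg, kappa):
    """Encode an angle as (real, imaginary) part of exp(j * kappa * phi).

    ASSUMPTION: illustrative encoding, not necessarily identical to Eq. (3).
    """
    z = np.exp(1j * kappa * np.deg2rad(phi_deg))
    return z.real, z.imag

def decode_orientation(phi_re, phi_im, kappa):
    """Invert the encoding; for kappa = 2 the result lies in (-90, 90] degrees."""
    return np.rad2deg(np.arctan2(phi_im, phi_re)) / kappa
```

With κ = 2, the angles 30° and 210° map to the same encoded value, so a sign seen from the front or the back yields identical regression targets, whereas κ = 1 keeps the two orientations distinct, as required for lights.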

2) OBJECT DETECTION LOSS
To generate the target existence likelihoods $\bar{o}^{et}_g$ and target regression features $\bar{r}^{et}_g$ (with $\bar{(\cdot)}$ denoting ground truth (GT)) during training, GT map elements have to be matched to the anchor grid, setting $\bar{o}^{et}_g = 1$ for matching anchors and their regression features $\bar{r}^{et}_g$ according to (1), (2), and (3), whereby a single element can match with multiple anchors. To this end, we employ type-specific matching strategies, association metrics, and criteria. Anchors with only a small overlap with a GT element are ignored during training using a "don't care" state. For more details, see Appendix C.
We define the loss function optimizing the object detection DNN applied in MDD-SC and MDD-MC (cf. Fig. 2) similar to PointPillars [71] and SECOND [73] in (4), with $et \in \mathcal{T} = \{s, l, p\}$ denoting the element type (sign, light, or pole), $N^{et}$ being the type-specific number of elements contained in a training sample, $J(o^{et}_g, \bar{o}^{et}_g)$ and $J(r^{et}_g, \bar{r}^{et}_g)$ denoting the object detection and the regression loss obtained for a specific element type, respectively, and $\lambda = 2/3$ being a weight factor. We formulate the object detection loss as focal loss [77], which focuses on particularly hard-to-detect elements using adaptive weights. Using the original paper settings [77] for the weight factors $\alpha = 0.25$, $\beta = 2$, and with the mask factor $\lambda^\text{mask}_g \in \{1, 0\}$ being zero in the "don't care" state, we define the object detection loss in (5), whereby an anchor contained in voxel $g$ is weighted higher when a prediction is further away from the target value (see the distinction of cases depending on $\bar{o}^{et}_g$ in (5)). The type-specific regression loss is formulated in (6) using the smooth L1-loss (also known as Huber loss [78]), with $q \in \mathcal{Q}^{et}$ being a type-specific regression parameter defined in Section III-C1, the loss being applied to the difference between predicted and target regression values, and with $\lambda^\text{obj}_g \in \{1, 0\}$ being 1 only if an element is present in voxel $g$.
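For orientation, the two loss terms can be sketched as follows. The sketch uses the standard focal loss of [77] with the stated factors α = 0.25 and β = 2 and the common smooth L1 definition with a transition point of 1; the exact forms of (5) and (6), the normalization by the number of elements, and the weighting by λ are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def focal_loss(o, o_gt, mask, alpha=0.25, beta=2.0, eps=1e-7):
    """Masked focal loss summed over anchors (standard form, assumed here).

    o:    predicted existence likelihoods in [0, 1]
    o_gt: target likelihoods (1 for matched anchors, 0 otherwise)
    mask: 0 for "don't care" anchors, 1 otherwise
    """
    o = np.clip(o, eps, 1.0 - eps)
    pos = -alpha * (1.0 - o) ** beta * np.log(o)          # case target = 1
    neg = -(1.0 - alpha) * o ** beta * np.log(1.0 - o)    # case target = 0
    return np.sum(mask * np.where(o_gt == 1, pos, neg))

def smooth_l1(delta):
    """Smooth L1 loss on the difference between predicted and target values."""
    a = np.abs(delta)
    return np.where(a < 1.0, 0.5 * a ** 2, a - 0.5)
```

A confident correct prediction contributes almost nothing to the focal loss, while a confident wrong prediction is weighted strongly, which matches the described focus on hard-to-detect elements.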

3) POST-PROCESSING AND COMPARISON
To obtain the list of predicted elements $\hat{E}$ during inference for both MDD-SC and MDD-MC, the map element head outputs $u^{et}$ are post-processed (cf. Fig. 2 (a) and (b)). First, only map element predictions for single voxels $g$ with a predicted existence likelihood $o^{et}_g$ exceeding an element-type-specific threshold $\tau^{et}_\text{score}$ are considered valid detections, i.e., $o^{et}_g > \tau^{et}_\text{score}$, with the selection of thresholds being discussed in Section V-A. For valid detections, the normalization of bounding shape predictions is reverted. As multiple anchors can make predictions for the same real-world object, respective overlapping bounding shapes are filtered using non-maximum-suppression [71] to obtain the predicted map element set $\hat{E}$ comprising signs, lights, and poles, respectively. Thereby, only the element with the highest existence likelihood among overlapping predictions remains. To measure the overlap, we apply the association metrics detailed in Appendix C.
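The thresholding and non-maximum-suppression steps can be sketched as follows. As the association metrics of Appendix C are not reproduced here, the sketch substitutes a plain center-distance criterion (`min_dist`) as overlap measure; this criterion and the function name `postprocess` are assumptions made for illustration only.

```python
import numpy as np

def postprocess(scores, centers, tau_score, min_dist):
    """Score thresholding followed by greedy non-maximum-suppression.

    ASSUMPTION: two predictions "overlap" if their centers are closer than
    min_dist; the paper uses type-specific association metrics instead.
    """
    valid = scores > tau_score                      # keep valid detections only
    scores, centers = scores[valid], centers[valid]
    kept = []
    for i in np.argsort(-scores):                   # highest likelihood first
        if all(np.linalg.norm(centers[i] - centers[j]) > min_dist for j in kept):
            kept.append(i)
    kept = np.array(kept, dtype=int)
    return centers[kept], scores[kept]
```

Processing candidates in descending score order guarantees that, among overlapping predictions, the one with the highest existence likelihood survives.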
To obtain verifications $V$ and map deviations $D \cup I \cup S$, predicted map elements $\hat{E}$ are now compared with elements in the potentially deviating map $\tilde{E}$. In the first association step, map elements in $\hat{E}$ and $\tilde{E}$ with the same major type (sign, light, pole) are compared, again using the association metrics. Successfully associated elements are appended to $V$, while predicted elements in $\hat{E}$ that cannot be found in the examined map $\tilde{E}$ are considered deletions and appended to $D$. Vice versa, elements in $\tilde{E}$ for which no prediction from $\hat{E}$ can be associated are considered falsely existing and appended to the set of insertions $I$. Substitutions $S$ are a special case, where one element of a specific major type (e.g., light) is interchanged with another element of a different major type (e.g., sign). In our case, only signs and lights are physically interchangeable with one another. Specifically, a sign cannot be interchanged with a pole, as the pole must already exist to serve as a mounting location. If the sign is mounted on a wall, there is no space available for a pole. Similar considerations apply regarding lights. For instance, after the first association step, assume a light falsely exists in $\tilde{E}$ (insertion), with a sign missing in $\tilde{E}$ (deletion) at the same location. To identify such substitutions in a second step, deletions and insertions are associated among signs and lights. As the aforementioned association criteria are defined for elements of the same major type, we simply require the Euclidean distance $d_E$ between elements to be $d_E \le \tau^\text{SUB}_E$ for successful association of deletions and insertions, with $\tau^\text{SUB}_E = 0.3$ m being a threshold value. If an association is successful, both deviations are removed from the respective sets $D$ and $I$, and replaced by a single substitution appended to $S$.
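The two association steps can be summarized in a compact sketch. Elements are represented as hypothetical `(type, position)` tuples, and the type-specific association metrics are replaced by a simple distance criterion (`assoc_dist`), which is an assumption of this sketch; only the 0.3 m substitution threshold is taken from the text.

```python
import math

def compare(predicted, examined, assoc_dist=0.5, tau_sub=0.3):
    """Return verifications V, deletions D, insertions I, substitutions S.

    Elements are (type, (x, y, z)) tuples with type in {"sign", "light", "pole"}.
    ASSUMPTION: same-type association uses a center distance of assoc_dist.
    """
    V, D, S = [], [], []
    unmatched = list(examined)
    # Step 1: associate elements of the same major type.
    for pred in predicted:
        match = next((e for e in unmatched
                      if e[0] == pred[0]
                      and math.dist(e[1], pred[1]) <= assoc_dist), None)
        if match is None:
            D.append(pred)            # detected, but missing in the map
        else:
            unmatched.remove(match)
            V.append(match)
    I = unmatched                      # in the map, but not detected
    # Step 2: merge nearby sign/light insertions and deletions.
    swappable = ("sign", "light")
    for ins in list(I):
        for dele in D:
            if (ins[0] in swappable and dele[0] in swappable
                    and ins[0] != dele[0]
                    and math.dist(ins[1], dele[1]) <= tau_sub):
                I.remove(ins)
                D.remove(dele)
                S.append((ins, dele))
                break
    return V, D, I, S
```

Poles are excluded from the second step, reflecting that only signs and lights are considered physically interchangeable.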

D. MDD-MC: MAP-SUPPORTED OBJECT DETECTION
The method MDD-MC follows the same concepts as presented in the previous section for MDD-SC, while additionally applying the map representation m as input to the object detection DNN (cf. Fig. 2 (b)), providing the network with initial hypotheses for possible map elements locations and features. In this section, we first define the map representation and subsequently present the applied map encoding procedure to obtain m , with a respective example result depicted in Fig. 7 (a). Last, we describe the integration of the generated map representation into the DNN.
Let the tensor $m' = (m'_g)$ be of size $N_x \times N_y \times N_z \times 10$, whereby $g$ denotes the voxel index of the spatial grid (cf. Section III-C1). Each map feature vector $m'_g$ provides ten features, with $P^{(\cdot)} \in \mathcal{P}$ and $\mathcal{P} = \{0, 1\}$ being a prior existence hypothesis that a sign $s$, light $l$, or pole $p$ is present in voxel $g$, whereby we allow only one hypothesis per voxel, resulting in a one-hot encoding for $P^s$, $P^l$, $P^p$. If an element of a specific type is present in voxel $g$, the respective regression features $r^{(\cdot)}$ are set in a normalized fashion according to (1), (2), and (3). If a pole is present, we provide the respective pole diameter $r^{(d)}_g$ with the other regression features set to 0. For signs and lights, we provide the width parameter [...]
To obtain map feature vectors, map elements $\tilde{E}$ have to be matched to voxels in the spatial grid of $m'$, whereby a single element can match with multiple voxels. For such matching, we approximate map elements using cuboid shapes. Such an approximation is less precise than the matching strategies applied during target generation (cf. Appendix C), but provides a greatly reduced runtime during training and inference. For a successful match of a map element with voxel $g$, we require $|p_{dim} - v_{dim}| \le \tau^{dim}_\text{dist}$ for all $dim \in \{x, y, z\}$, with $dim$ being a spatial dimension, $\tau^{dim}_\text{dist}$ being a dimension-specific threshold value, and $p_{dim}$ and $v_{dim}$ being the element's and the voxel's center positions in that dimension, respectively. Furthermore, the threshold $\tau^{dim}_\text{dist}$ is defined based on the element's extent in the respective dimension, with $\nu = 1.1$ being an empirically determined factor enlarging the element's shape to include voxels surrounding the element as matches. Regarding poles, we apply the diameter $d$ as extent for both the x- and y-dimension, and a pole-class-specific default height value $h$ for the z-dimension. Otherwise, for signs and lights, we use the width parameter $w$ for the x- and y-dimension. The max-operation in the threshold definition ensures a minimal element size to obtain a match also for small elements. For poles and (square-shaped) lights, this approximation yields sufficient matching results. For signs, however, matches are further refined to account for their rotated shape using the matching strategies applied during target generation. For unmatched voxels, we set $m'_g = 0$.
Finally, to integrate the obtained map representation $m'$ into the network, $m'$ (providing 10 features per voxel) is concatenated along the feature dimension to the encoded LiDAR grid $g$ providing $L$ features, yielding a spatial grid of size $N_x \times N_y \times N_z \times (10 + L)$ as input to the backbone.
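The cuboid matching and the subsequent concatenation can be illustrated with a small sketch. Several details are assumptions made for illustration: the threshold is taken as the larger of the enlarged half extent and half a voxel, only the one-hot type hypothesis is written into the first three channels, the grid origin is zero, and the names `encode_map` and `TYPE_CHANNEL` are hypothetical.

```python
import numpy as np

TYPE_CHANNEL = {"sign": 0, "light": 1, "pole": 2}   # assumed channel layout

def encode_map(elements, grid_shape, s_vox=0.4, nu=1.1):
    """Match map elements to voxels using an axis-aligned cuboid approximation.

    elements: dicts with "type", "center" (x, y, z), and "extent" (ex, ey, ez).
    Returns a map representation of shape (Nx, Ny, Nz, 10).
    """
    m_map = np.zeros((*grid_shape, 10), dtype=np.float32)
    axes = [(np.arange(n) + 0.5) * s_vox for n in grid_shape]   # voxel centers
    vx, vy, vz = np.meshgrid(*axes, indexing="ij")
    for el in elements:
        px, py, pz = el["center"]
        # ASSUMPTION: enlarged half extent, but at least half a voxel.
        tx, ty, tz = (max(nu * e / 2.0, s_vox / 2.0) for e in el["extent"])
        hit = ((np.abs(vx - px) <= tx) & (np.abs(vy - py) <= ty)
               & (np.abs(vz - pz) <= tz))
        m_map[hit] = 0.0                               # one hypothesis per voxel
        m_map[hit, TYPE_CHANNEL[el["type"]]] = 1.0     # one-hot type hypothesis
    return m_map

def fuse(g, m_map):
    """Concatenate encoded LiDAR grid and map representation per voxel."""
    return np.concatenate([g, m_map], axis=-1)
```

For an encoded LiDAR grid with 256 features per voxel, the fused backbone input carries 266 features per voxel, while the spatial dimensions remain unchanged.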
Note that the camera modality can be integrated in the same way as m , following the concept from [11]. For instance, a bounding box for each individual voxel can be projected into the camera image to crop and resize [79] respective image features (i.e., RGB or abstract features), which subsequently can be used as a voxel's feature vector.

E. MDD-M: MAP-SUPPORTED DEVIATION DETECTION
In this section, we first present our specialized MDD network and the according loss function used for training. We close with a presentation of the post-processing specific to MDD-M.

1) DEVIATION DETECTION DNN
The deviation detection DNN applied in method MDD-M as depicted in Fig. 2 (c) directly predicts map deviations instead of performing a mere object detection and obtaining deviations by comparison with the (deviating) map, omitting the need for such a comparison. To this end, we adapt the network heads, now providing the outputs $u^s$, $u^l$, and $u^p$, while leaving the rest of the network architecture (encoder and backbone) and the pre-processing steps (voxelization and map encoding) unchanged.
To predict map deviations, we substitute the output of the object detection head $o^{et}$ by the output of a deviation detection head $d^{et}$, yielding the head output $(d^{et}, r^{et})$. Thereby, the spatial resolution of the anchor grids remains unchanged, comprising $N_x$, $N_y$, and $N_z$ voxels in the respective dimensions, with predictions being made for each anchor (or voxel, respectively). The deviation detection head output $d_g = (d_{g,\text{VER}}, d_{g,\text{DEL}}, d_{g,\text{INS}}, d_{g,\text{SUB}})$ for an anchor $g$ comprises four score values, representing the evaluation states, with each $d_{g,s}$, $s \in \mathcal{S}$, being a state score value. To provide the target $\bar{d}_g$ during training, we set the respective score value to 1 according to the element's evaluation state. Otherwise, if no element is present in a voxel, all score values are set to 0. In case of substitutions, the network is trained to predict the map element type present in the sensor data. For instance, in case of the map falsely hypothesizing a light, which is substituted by a sign in the real world, we set $\bar{d}^s_g = (0, 0, 0, 1)$ to predict a substitution and $\bar{r}^s_g$ according to (1), (2), and (3) to predict the sign's bounding shape.

2) DEVIATION DETECTION LOSS
Accordingly, we change the loss function (4) optimizing the network to comprise the deviation detection loss $J(d)$, which is computed on the predicted state scores. Further, we apply the same regression loss as previously formulated in (6). In case of insertions (elements not existing in the sensor data), we train the network to reproduce the element's bounding shape as provided by m.

3) POST-PROCESSING
As the network directly classifies the evaluation states for elements of the examined set E, which are encoded as m and provided as input to the network, verifications V, deletions D, insertions I, and substitutions S can be directly obtained from the network output. Only a prediction for voxel g with a state score exceeding a specific threshold is considered a valid detection. As opposed to the object detection DNN used in MDD-SC and MDD-MC, where three element-type-specific thresholds are applied, the deviation detection DNN in MDD-M allows for the selection and optimization of 3 × 4 = 12 type- and state-specific thresholds in total, one for each combination of element type et ∈ T and evaluation state s ∈ S = {VER, DEL, INS, SUB}. According to (13), an element is appended to the respective set V, D, I, or S only if the respective threshold is exceeded. To obtain deletions, for which no map hypothesis exists, we apply the ordinary object detection post-processing (cf. Section III-C3). Specifically, we consider all anchors for which the deletion score $d^{et}_{g,\mathrm{DEL}}$ is highest. The obtained deletion predictions are subsequently filtered using non-maximum suppression.
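The type- and state-specific thresholding can be sketched as follows. This is a simplified illustration under stated assumptions: each prediction is taken to carry an identifier, an element type, and four state scores; the state with the highest score is selected before thresholding; and the non-maximum suppression for deletions is omitted. The names `assign_states` and `thresholds` as well as the threshold value 0.5 are hypothetical.

```python
from typing import Dict, List, Tuple

ELEMENT_TYPES = ("sign", "light", "pole")
STATES = ("VER", "DEL", "INS", "SUB")

# (prediction id, element type, score per evaluation state)
Prediction = Tuple[str, str, Dict[str, float]]


def assign_states(
    predictions: List[Prediction],
    thresholds: Dict[Tuple[str, str], float],
) -> Dict[str, List[str]]:
    """Sort predictions into the sets V, D, I, and S.

    Assumption: the highest-scoring state of a prediction is accepted only
    if it exceeds the threshold for its (element type, state) combination.
    """
    result: Dict[str, List[str]] = {state: [] for state in STATES}
    for pred_id, element_type, scores in predictions:
        best_state = max(STATES, key=lambda s: scores[s])
        if scores[best_state] > thresholds[(element_type, best_state)]:
            result[best_state].append(pred_id)
    return result


# 3 x 4 = 12 thresholds; the value 0.5 is a placeholder, not an optimized one.
thresholds = {(t, s): 0.5 for t in ELEMENT_TYPES for s in STATES}
example = [
    ("a", "sign", {"VER": 0.9, "DEL": 0.0, "INS": 0.05, "SUB": 0.05}),
    ("b", "light", {"VER": 0.2, "DEL": 0.0, "INS": 0.3, "SUB": 0.4}),
]
print(assign_states(example, thresholds))
```

Prediction "b" is discarded in this example because its best score (0.4 for SUB) does not exceed the corresponding threshold.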

IV. EXPERIMENTAL SETUP
In this section, we describe our experimental setup. First, we present the utilized dataset and applied metrics. Last, we refer to the training strategy.

A. DATASET
We employ the 3DHD CityScenes dataset [10] depicted in Fig. 4, which provides a large-scale HD map with an underlying high-density point cloud. The map comprises 20,163 signs, 5,762 lights, and 67,540 poles. However, we exclude bollards being the least significant but most varying pole class, which leaves 32,283 considered poles in the dataset. If vertically-stacked rectangular signs are of similar size with no spatial gap between, we merge respective signs as the individual boundaries cannot be determined.
3DHD CityScenes offers poses on the HD map comprising geo-location and orientation of the vehicle based on recorded real-world trajectories, providing 57,510, 8,087, and 13,061 poses for the training, validation, and test set, respectively. To generate samples comprising a point cloud P and examined map elements E for training, inference, and evaluation, we take crops from the larger point cloud, only considering the subset of map elements within this crop as part of the sample.
For the MDD task of this work, in particular the detection of verifications V (map elements correctly representing the sensor data), deletions D (elements missing in the map), insertions I (elements falsely existing in the map), and substitutions S (false element types present in the map data), 3DHD CityScenes does not provide annotations directly. Thus, map deviations have to be induced into the map artificially, which we term map deviation simulation.
Note that we assume the initially provided map as deviation-free, i.e., all elements represent GT verifications V, respectively. To simulate deviations, we use a probabilistic approach, whereby each map element is examined and either left unchanged, remaining a verification with the evaluation state VER, or turned into a deviation: deletion DEL, insertion INS, or substitution SUB. Specifically, let P (s) be the probability of a map element having assigned a certain evaluation state s ∈ S = {VER, DEL, INS, SUB}. For validation and test, we apply the probabilities (P VER , P DEL , P INS , P SUB ) = (0.75, 0.1, 0.1, 0.05), which we regard as an intermediate setting between a mostly correct map (e.g., with P VER = 0.95), and a construction site scene (e.g., with P VER = 0.25). For reproducibility purposes, we use a fixed assignment of elements to evaluation states that is generated in an offline-fashion and stored prior to training. The assignment is published along with our code. During training, however, we dynamically simulate deviations for each sample, using higher deviation probabilities (0.5, 0.2, 0.2, 0.1) to speed up the learning process for the method MDD-M.
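The probabilistic state assignment can be sketched as follows. The probabilities are those stated above; the function name `simulate_states`, the use of a seed for the fixed offline assignment, and the handling of elements by identifier are assumptions for illustration. The sketch also ignores type-specific restrictions (e.g., poles not being substituted by other elements).

```python
import random
from typing import Dict, Hashable, Iterable, Optional, Sequence

STATES = ("VER", "DEL", "INS", "SUB")

# (P_VER, P_DEL, P_INS, P_SUB) as stated in the text.
EVAL_PROBS = (0.75, 0.1, 0.1, 0.05)   # validation and test
TRAIN_PROBS = (0.5, 0.2, 0.2, 0.1)    # training (higher deviation rates)


def simulate_states(
    element_ids: Iterable[Hashable],
    probs: Sequence[float],
    seed: Optional[int] = None,
) -> Dict[Hashable, str]:
    """Draw one evaluation state per map element.

    A fixed seed yields a reproducible assignment (as used for validation
    and test); seed=None yields a fresh draw (as used during training).
    """
    rng = random.Random(seed)
    return {eid: rng.choices(STATES, weights=probs, k=1)[0] for eid in element_ids}


assignment = simulate_states(range(10), EVAL_PROBS, seed=0)
print(assignment)
```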
If an element receives the VER evaluation state, both element and sensor data are left unchanged, with respective elements appended to the set of GT verifications V. On the other hand, if an element is assigned the DEL, INS, or SUB state, we turn it into a deviation by modifying map E or sensor data P, as depicted in Fig. 5. Subsequently, the simulated deviation is appended to the respective GT set D, I, or S. In particular, deletions D (Fig. 5 (a)) are simulated by deleting elements from the examined map E. To create insertions I (Fig. 5 (b)), we remove the parts of the point cloud contained in the respective bounding shapes. Typically, such a procedure leaves residual points from other map elements (e.g., an underlying pole) in close proximity, rendering the detection of insertions a valid task. Thereby, we consider the removal of semantic groups as a whole. Specifically, the removal of a pole from the point cloud leads to the removal of adjacent signs and lights as well. Further, we generate substitutions S (Fig. 5 (c)) by interchanging element types in E, with realistic bounding shapes being created for the artificial element (e.g., an element now being a light instead of a sign). The artificial element is appended to a separate set of artificial elements, while the respective correct element as indicated by the sensor data is appended to the set of GT substitutions S (remaining a sign in the example). Hence, the examined map E as input to all methods comprises verifications, insertions, and artificial elements.

B. EVALUATION STRATEGY AND METRICS
To evaluate the performance of the proposed method variants, we associate the predicted set of evaluated map elements $E_{eval} = V \cup D \cup I \cup S$ (cf. Fig. 2) with the respective GT set $\bar{E}_{eval} = \bar{V} \cup \bar{D} \cup \bar{I} \cup \bar{S}$. Thereby, only elements of sets with the same evaluation state (considered as "class") $s = \bar{s} \in S = \{\mathrm{VER}, \mathrm{DEL}, \mathrm{INS}, \mathrm{SUB}\}$ can be associated, e.g., $V$ to $\bar{V}$, which is the common approach in the field of object detection (i.e., cars and pedestrians not being associated).
To establish an association between predicted and ground truth elements among sets for which existing hypotheses in the examined set E are available (verifications V, insertions I, and substitutions S), an evaluated element's known identifier is utilized to find the element in the respective GT set. For deletions (elements missing in E without identifier), we apply the association criteria as provided in Appendix C. If a predicted element is successfully associated with an element in the respective GT set, the element is considered a true positive (TP). If a prediction cannot be associated with a GT element, the detection is considered a false positive (FP). Further, a GT element without an associated prediction is counted as a false negative (FN). We indicate the deviation detection performance for each set (or evaluation state, respectively) individually using the following metrics, with TP and FP being the number of true and false positives, respectively, and FN being the number of false negatives:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{14}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{15}$$

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{16}$$

Note that false predictions exist for each evaluation state. For instance, residual points in proximity of a GT insertion may lead a method to predict a FP verification instead, causing a (not associated or undetected) FN insertion. To evaluate the regression errors, we employ the L1 and L2 metrics:

$$\text{L1:} \quad E(q) = \frac{1}{N} \sum_{i=1}^{N} \left| q_i - \bar{q}_i \right|, \tag{17}$$

$$\text{L2:} \quad E(p_{pos}) = \frac{1}{N} \sum_{i=1}^{N} \left\lVert p_{pos,i} - \bar{p}_{pos,i} \right\rVert_2, \tag{18}$$

with $q \in \{d, w, h, \phi\}$, $p_{pos} = (x_{pos}, y_{pos}, z_{pos})$ being the center position of a predicted bounding shape, $i$ being the index of the predicted element, $N$ being the number of considered elements, and the overline denoting the associated GT value. Note that the regression errors can only be computed from TP detections, excluding insertion predictions, as no map element in the sensor data is available for regression in their case.
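As an illustration, the detection and regression metrics can be computed as sketched below, assuming the standard definitions of precision, recall, and F1 score, and assuming that the L1 and L2 errors are averaged over the associated TP detections. All function names are hypothetical, and returning 0.0 for empty denominators is a convention chosen for this sketch.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def l1_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute error of a scalar parameter (e.g., width or height)."""
    return sum(abs(a - b) for a, b in zip(pred, gt)) / len(pred)


def l2_position_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and GT center positions."""
    return sum(math.dist(a, b) for a, b in zip(pred, gt)) / len(pred)


print(f1_score(tp=8, fp=2, fn=2))
print(l2_position_error([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]))  # 5.0
```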

C. TRAINING STRATEGY
Our experimental pipeline is implemented in PyTorch. All networks are trained from scratch for 8 epochs using a 2-GPU setup (Tesla V100) with a batch size of 2, whereby the network weights are initialized from a normal distribution following [80]. For the point density and partial occlusion ablation studies, we use a batch size of 4 to allow for a reduced training time. We utilize the Adam optimizer with a fixed learning rate of $2 \cdot 10^{-4}$. Furthermore, we use a global augmentation strategy for both point cloud and map elements, applying a random

V. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we first discuss the threshold optimization on the validation set and then the results on the test set. Last, we conduct ablation studies, simulating degenerated sensor data and inducing sensor and map data misalignments.

A. VALIDATION SET THRESHOLD OPTIMIZATION
To obtain predictions from the anchor grids, all method variants require the selection of thresholds regarding predicted existence likelihoods (methods MDD-SC and MDD-MC) or evaluation state scores (method MDD-M) on the validation set, above which an anchor is considered to provide a valid prediction (see (8) and (13), respectively). Such selection requires an optimization procedure, which we describe in the following. The resulting optimized thresholds are applied later to obtain the test set results. For the method variants MDD-SC and MDD-MC that are based on object detection, three score thresholds in (8), one per element type et ∈ T = {s, l, p} for signs, lights, and poles, have to be selected, which we optimize by maximizing the $F_1$ score on the validation set for GT elements.

FIGURE 6. Validation set precision-recall curves for lights, poles, and signs comparing MDD-SC (blue dotted), MDD-MC (green dashed), and MDD-M (red solid). The "verified" evaluation state (elements reported as correct) is abbreviated as VER, while DEV denotes the "deviating" state, comprising all elements classified as deletions (DEL), insertions (INS), and substitutions (SUB). For poles, no substitution by other elements exists.
To this end, we create the precision-recall curves for all method variants depicted in Fig. 6. First, the VER performance is quite similar over all three method variants and element types, with MDD-SC showing the poorest performance and MDD-M providing the best $F_1$ score, reaching 0.99 for all element types. However, all methods perform around or well above an $F_1$ score of 0.9 at very high precision levels close to 1, showing that FP verifications occur only rarely (towards very low thresholds).
Second, the DEV evaluation state summarizing DEL, INS, and SUB elements shows similar characteristics over all element types, with MDD-M again performing best with $F_1$ scores above 0.80, and MDD-SC showing the poorest performance ($F_1 \approx 0.60$). The degenerated curves for the object detection-based variants MDD-SC and MDD-MC are due to inverse threshold characteristics of the DEL and INS states. Specifically, higher thresholds lead to fewer object detections, increasing the DEL precision and the INS recall, while decreasing the DEL recall and the INS precision (i.e., causing more FP insertion predictions, also see Fig. 7 for respective examples), and vice versa. Thus, the opposed characteristics diminish the DEV precision towards both low and high thresholds, causing the loop-like precision-recall curves. In contrast, MDD-M optimizes an individual threshold for each evaluation state, leading to regular DEV curves.
Third, the DEL performance depends on element types. For lights, the advanced methods MDD-M and MDD-MC perform best ($F_1 = 0.65$), with MDD-M operating at higher precision but lower recall than MDD-MC. A similar tendency can be observed for poles, with MDD-M slightly outperforming MDD-MC in this case. Regarding signs, however, MDD-M clearly shows the best performance with an absolute $F_1$ increase of 0.12 compared to MDD-MC. The significant performance gain can be traced to clusters of vertically stacked signs. For the comparison-based method variants MDD-SC and MDD-MC, detected objects are first associated to existing map hypotheses (e.g., yielding verification predictions). Only the remaining unassociated objects are considered as deletions, leaving many deletions undetected, causing a low recall just above 0.6. In contrast, MDD-M separates the evaluation of existing map hypotheses and the prediction of deletions (enabled by the evaluation state classification and exploited by the specialized post-processing), with existing hypotheses being evaluated directly and deletion predictions generated independently. Thus, more (not otherwise associated) deletion predictions are available, boosting the DEL recall. The optimized post-processing also improves the precision in sign clusters: Existing hypotheses cannot be deletions, preventing FP deletion predictions in close proximity (cf. the sign cluster in Fig. 7 (c) and (d)). Note that even without the optimized post-processing, MDD-M still performs best, mirroring the performance characteristics of lights in this case (increased precision, lower recall, and best $F_1$ score).
Interestingly, the classification of evaluation states in MDD-M further improves the INS precision, indicating an increased tendency of the network to trust and confirm a map hypothesis, which is also reflected by the slightly decreased INS recall compared to MDD-SC missing the additional map input: In some cases, MDD-M falsely confirms an existing map hypothesis, e.g., due to residual points or other objects in close proximity.
Last, regarding the SUB performance for lights, all method variants perform at a similar level with $F_1$ scores around 0.90. For signs, however, MDD-M clearly performs best, achieving $F_1 = 0.97$, which is an absolute improvement of 0.23 compared to MDD-SC. The curves for MDD-SC and MDD-MC appear degenerated in a loop-like fashion as observed for the DEV evaluation state. Also in this case, the inverse threshold characteristics of the DEL and INS states are the root cause, recalling that for both comparison-based methods, substitutions are created during post-processing from deletions and insertions (cf. Section III-C3).
In total, a clear rank order is visible, with MDD-M comprising our specialized MDD network performing best, and the object detection-based variants MDD-MC and MDD-SC being on second and third rank, respectively. Note the strong verification performance of MDD-M, indicating a greater trust of the network on the map data, leaving only few elements with an existing hypothesis undetected (high insertion precision) at the cost of generally fewer objects (e.g., lights) predicted if no such hypothesis is available (lower DEL recall but higher DEL precision).
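The threshold selection by F1 maximization used in this section can be illustrated with a simple sweep over candidate thresholds. The sketch makes simplifying assumptions: each prediction carries a score and a flag indicating whether it is associated with a GT element, and this association is taken to be independent of the threshold, which does not hold for the comparison-based variants. The names `optimize_threshold` and `f1_from_counts` are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 score expressed directly in terms of TP, FP, and FN counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def optimize_threshold(
    predictions: Sequence[Tuple[float, bool]],  # (score, associated with GT?)
    num_gt: int,
    candidates: Sequence[float],
) -> Tuple[Optional[float], float]:
    """Return the candidate threshold with the highest F1 score."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        tp = sum(1 for score, matched in predictions if score > t and matched)
        fp = sum(1 for score, matched in predictions if score > t and not matched)
        fn = num_gt - tp  # GT elements without a valid associated prediction
        f1 = f1_from_counts(tp, fp, fn)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


preds = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
print(optimize_threshold(preds, num_gt=3, candidates=[0.2, 0.5, 0.7]))
```

For MDD-M, such a sweep would be repeated for each of the 12 combinations of element type and evaluation state.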

B. TEST SET RESULTS
In the following, we present and discuss the test set results, which are obtained by applying the thresholds optimized on the validation set. Specifically, we compare the method variants MDD-SC, MDD-MC, and MDD-M based on unmodified, high-density point cloud inputs with highly precise spatial alignment with the HD map. First, we present the deviation detection and regression error results. Last, we provide example predictions for each method variant.

1) DEVIATION DETECTION PERFORMANCE
The test set results regarding the deviation detection performance obtained by comparing the predicted set of map elements $E_{eval}$ with the respective GT set $\bar{E}_{eval}$ are summarized in Table 1 in terms of $F_1$ score (16), precision (15), and recall (14). Overall, the test set results in Table 1 mirror the characteristics observed on the validation set discussed in the previous section, with the rank order of MDD-M, MDD-MC, and MDD-SC only changing on a detail level. Specifically, the $F_1$ score for light deletions is now best for MDD-MC, with MDD-M on second rank. Further, regarding light substitutions, MDD-SC slightly outperforms the other method variants, which, however, all show similar performance with $F_1 = 0.92$ or $F_1 = 0.91$, respectively.
Also on the test set, a clear rank order of methods appears regarding the $F_1$ score: The purely object detection-based method MDD-SC is overall the poorest method, with MDD-MC following on second rank, and MDD-M being the best for almost all element types and evaluation states. On average over all three map element types (lights, poles, signs), MDD-M achieves a strong $F_1 = 0.99$ for map verification, and a good $F_1 = 0.85$ for map deviation detection.

TABLE 1. The performance is measured using the F1 score (16), recall (14), and precision (15), reported separately for lights, poles, and signs. Results are indicated individually for possible evaluation (eval.) states (verification VER, deletion DEL, insertion INS, substitution SUB, with the DEV state summarizing elements with DEL, INS, or SUB evaluation states). Best F1 score is bold, second best is underlined.

TABLE 2. The errors for position $p_{pos}$ (18), width w, diameter d, and height h (all (17)) are indicated in centimeters, while the orientation error E(ϕ) (17) is provided in degrees. Best results in bold face, second best underlined.

2) REGRESSION PERFORMANCE
The L1 (17) or L2 (18) errors, given in centimeters or degrees, regarding the regression of bounding shapes obtained on the test set are provided in Table 2, indicating the performance of the method variants MDD-SC, MDD-MC, and MDD-M separately for each element type. As the object detection DNN in MDD-SC operates without additional map input, the reported errors indicate the network's capability to deduce respective bounding shapes from sensor data alone.
Regarding the L2 error of the predicted bounding shape center $E(p_{pos})$ compared to MDD-SC, best-performing MDD-M provides an absolute improvement of 2.2 cm, 5.7 cm, and 4 cm for lights, poles, and signs, respectively. Further, regarding width and height errors E(w) and E(h), MDD-M w.r.t. MDD-SC overall achieves greater improvements for signs than for lights, i.e., 3.4 cm and 6.7 cm for signs compared to 1.3 cm and 2.6 cm for lights, respectively. The higher error ranges for signs can be attributed to the greater variance of bounding shapes in size. Similar to signs, the diameter error E(d) for poles is reduced almost to half, comparing MDD-M and MDD-SC, with an absolute improvement of 3.1 cm. Regarding the orientation error E(ϕ), MDD-M reduces the obtained error almost to a third compared to MDD-SC, providing an absolute improvement of 9.6° and 8.1° for lights and signs, respectively.
First, the achieved error improvements of advanced method variants (e.g., comparing MDD-MC to MDD-SC) are due to the provision of regression hypotheses by the map representation m. Second, the specialized post-processing of MDD-M averages the regression features of multiple predictions for elements with an existing hypothesis, which yields a refined bounding shape. Overall, regarding the obtained regression errors, methods rank in the same order as before: MDD-SC, MDD-MC, and MDD-M, with MDD-M consistently achieving lowest errors.

3) EXAMPLE PREDICTIONS
Fig. 7 provides example predictions $E_{eval}$ for all method variants with visualized inputs E, m, and P (cf. Fig. 2). Subsequently, we discuss an exemplary subset of predictions to illustrate the performance characteristics discussed in the previous sections.
First, Fig. 7 (a) shows the examined set of elements E (black shapes) used to generate the map representation m (colored voxels). The scene comprises several verifications (map hypotheses matching the sensor data), two deletions (missing poles on the left without map hypothesis), a single sign substitution (with the map falsely indicating a light instead of a sign on the left), and a single insertion (top sign hypothesis on the right without sensor data), with the white labels indicating the ground truth (GT) evaluation states of mentioned examples. If an example element (be it VER, DEL, INS, or SUB) first appears as true positive (TP) (e.g., in Fig. 7 (b)), the respective label is omitted for better clarity in subsequent figures (e.g., in Fig. 7 (c) and (d)), if the example element remains correctly predicted (TP).
Regarding the GT pole deletions, both are correctly detected (TP) by each of the three method variants (indicated by the respective text labels connected in green in Fig. 7  (b), which are omitted in (c) and (d)), reflecting the high DEL recall for poles achieved by all variants (cf. Table 1). However, MDD-SC also provides a deletion prediction not associated with a map hypothesis during comparison, which yields a pole FP deletion (right of Fig. 7 (b)). The FP deletion vanishes for more advanced method variants, corresponding to the increased DEL precision in Table 1.
Further, the GT sign insertion is correctly detected by all variants (green connected "TP sign insertion" label in Fig. 7 (b), omitted in (c) and (d)). However, MDD-SC fails to detect the light object corresponding to the GT light verification (top of Fig. 7 (b)). Thus, during comparison, no object detection is associated to the light hypothesis in E (black shape in Fig. 7 (a)), leading to one FP light insertion prediction, and one missed light verification, counted as FN. However, the advanced methods MDD-MC and MDD-M succeed in this case, reflecting the increased VER recall in Table 1. Similar considerations apply regarding the sign substitution: Both MDD-SC and MDD-MC fail to detect the upper of both signs on the left (Fig. 7 (b) and (c)), predicting only a single sign in between both elements, which is associated to the bottom sign hypothesis, leading to a (slightly misplaced) TP sign verification. Therefore, as the sign prediction is not associated to the light hypothesis during comparison, a FP light insertion is predicted, with the sign substitution left undetected (FN). Only MDD-M (Fig. 7 (d)) succeeds in predicting both elements correctly (bottom sign verification and top sign substitution).
In total, we can observe in Fig. 7 that the method variant MDD-M clearly outperforms the comparison-based method variants MDD-SC and MDD-MC, with no false predictions (FPs or FNs) being made in the scene by MDD-M.

C. ABLATION STUDIES
Subsequently, we present results for ablation studies. First, we degenerate the point cloud input by reducing the point density. Second, we induce partial occlusions. Last, to demonstrate the suitability of our approach for onboard use cases, we induce an artificial misalignment between point cloud and map, which is caused in practice by localization errors of the ego-vehicle on the HD map. Note that driving functions of today's automated vehicles rely on available HD maps as the onboard perception is unable to reliably deduce the stationary environment, especially in bad weather conditions or occlusion scenes. Thus, an approach to MDD is required to at least verify existing map elements reliably.

1) REDUCING POINT DENSITY
We conduct a total of 12 experiments including training and threshold optimization, randomly keeping 100%, 50%, 25%, or 10% of all points for the method variants MDD-SC, MDD-MC, and MDD-M, with the original point density of point clouds in the 3DHD CityScenes dataset [10] being $1 / 1\,\mathrm{dm}^3$. As the $F_1$ score (16) results show similar trends for each element type, we depict results averaged over signs, lights, and poles in Fig. 8. We differentiate between verifications (VER) and deviations (DEV), the latter summarizing deletions, insertions, and substitutions.
Regarding deviations (DEV), all method variants show only a slight degradation of performance down to 25% point density; only when omitting 90% of the points, significant degradations are observed. Over all reported point densities, MDD-M remains about 21% absolute ($F_1$) ahead of MDD-SC. Similar observations hold for verifications (VER). Down to 25%, all method variants degrade only slightly in performance. However, MDD-M turns out to perform very robustly with an $F_1 > 0.90$ even if only 10% of the point cloud is available. The high verification performance at 10% point density of MDD-M demonstrates the network's capability to learn expected sensor data distributions for different map hypotheses internally, comparing (degenerated) sensor data to map hypotheses, while sensor-only MDD-SC frequently fails to detect and verify elements. Hereby, the explicit classification of evaluation states in MDD-M further facilitates distribution learning, shown by the performance gain w.r.t. MDD-MC. Further, the stable performance down to 25% point density for deviations and even to 10% for verifications highlights the suitability of MDD-M also for low-density onboard LiDAR or even RADAR data.
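The point density reduction can be sketched as a uniform random subsampling of the point cloud. The function name `subsample_points` and the representation of points as a Python list are assumptions for illustration; the keep ratios correspond to the settings named above.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]

KEEP_RATIOS = (1.0, 0.5, 0.25, 0.1)  # 100%, 50%, 25%, and 10% of all points


def subsample_points(
    points: Sequence[Point], keep_ratio: float, seed: Optional[int] = None
) -> List[Point]:
    """Randomly keep the given fraction of points (without replacement)."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)
    n_keep = round(len(points) * keep_ratio)
    return rng.sample(list(points), n_keep)


cloud = [(float(i), 0.0, 0.0) for i in range(1000)]
print(len(subsample_points(cloud, 0.1, seed=0)))  # 100
```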

2) INDUCING PARTIAL OCCLUSIONS
Also in the case of induced partial occlusions, we conduct a total of 12 experiments including training and threshold optimization, while randomly selecting and occluding 0%, 25%, 50%, or 75% of all map elements that are still present in the sensor data after simulating map deviations, whereby we apply each occlusion setting to MDD-SC, MDD-MC, and MDD-M, respectively. To generate occlusions, we always remove half of the element's points, randomly selecting between left, right, top, or bottom. We average and report the obtained test set results in terms of $F_1$ score (16) in Fig. 9 in the same fashion as we did in Fig. 8.
Again starting with the deviations (DEV), we observe an almost linear decrease of all methods in the $F_1$ measure w.r.t. the occlusion ratio. Again, MDD-M turns out to be consistently about 20% absolute ($F_1$) better than MDD-SC for all occlusion ratios. With respect to verifications (VER), MDD-SC gradually degrades up to −4% absolute, while MDD-MC and MDD-M show only −2% absolute degradation for up to 75% occlusion, highlighting the robustness of both variants against scenes with dense occlusions.
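The occlusion simulation can be sketched as follows. Several details are assumptions made for illustration: points are given in a hypothetical element-local frame in which y is the lateral axis and z the vertical axis, "left" corresponds to smaller y values, and the half to be removed is determined by a median split. The function names are hypothetical as well.

```python
import random
from statistics import median
from typing import Hashable, Iterable, List, Sequence, Set, Tuple

Point = Tuple[float, float, float]
SIDES = ("left", "right", "top", "bottom")


def select_occluded(
    element_ids: Iterable[Hashable], ratio: float, rng: random.Random
) -> Set[Hashable]:
    """Randomly pick the given fraction of elements to be occluded."""
    ids = list(element_ids)
    return set(rng.sample(ids, round(len(ids) * ratio)))


def occlude_half(points: Sequence[Point], side: str) -> List[Point]:
    """Remove the half of an element's points lying on the given side."""
    if side not in SIDES:
        raise ValueError(f"unknown side: {side!r}")
    axis = 1 if side in ("left", "right") else 2  # y: lateral, z: vertical
    cut = median(p[axis] for p in points)
    if side in ("left", "bottom"):
        # Remove the lower-coordinate half, keep points at or above the median.
        return [p for p in points if p[axis] >= cut]
    # Remove the higher-coordinate half, keep points at or below the median.
    return [p for p in points if p[axis] <= cut]


rng = random.Random(0)
element = [(0.0, float(y), float(z)) for y in range(4) for z in range(4)]
print(len(occlude_half(element, rng.choice(SIDES))))  # 8
```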

3) INDUCING MISALIGNMENT
In the previously presented studies, point cloud and map data feature a highly precise data alignment. In practice, however, the ego-poses on the map obtained from an onboard localization are erroneous, causing a shift between sensor and map data. Also, laser scanner and ego-localization (providing the geo-location and orientation to register sensor and map data) produce their respective measurements at different timestamps, while the scanner typically rotates, which further reduces temporal synchronization. While the last two effects can be compensated using odometry data, the imperfect ego-localization remains. As respective algorithms are well developed, however, localization errors seen in practical urban scenarios are indeed very small, e.g., approx. 10 cm in translation and 0.1° in rotation. For our study, we consider the default experiment for MDD-M from Section V-B1. During test inference, we induce a misalignment in x-, y-, and z-direction between 0 cm and 60 cm in steps of 10 cm. Rotational errors only cause a notable translation at larger distances (e.g., approx. 10 cm given a 60 m distance and a 0.1° rotational error), which we therefore omit.
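The stated magnitude of the translation caused by a rotational error follows from simple trigonometry, as the short check below illustrates. The function name `lateral_offset_cm` is hypothetical.

```python
import math


def lateral_offset_cm(distance_m: float, rotation_deg: float) -> float:
    """Lateral displacement (in cm) of a point at the given distance
    caused by a rotational error of the given angle."""
    return 100.0 * distance_m * math.tan(math.radians(rotation_deg))


# A rotational error of 0.1 degrees at a distance of 60 m:
print(round(lateral_offset_cm(60.0, 0.1), 1))  # 10.5
```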
To induce misalignment, we shift all map elements in E by applying a translation of constant magnitude with the direction randomly selected in spherical coordinates using a uniform distribution, while leaving the GT elements unchanged. The obtained results in terms of $F_1$ score (16) are averaged over all element types and reported in Fig. 10 separately for each evaluation state, with the gray curves depicting the performance without applying any countermeasures against misalignment. As expected, all MDD-M $F_1$ curves monotonically decrease with increasing map-sensor misalignment, whereby we observe the steepest performance drop for INS starting at 20 cm misalignment. This performance drop is mainly caused by an increasing number of insertion FPs as the network misclassifies actual verifications due to the sensor data being shifted out of scope. To further increase misalignment robustness, we retrain MDD-M for another 4 epochs while inducing a constant misalignment error of 30 cm during training. The obtained performance during test inference is indicated by the colored curves in Fig. 10. It is clearly visible that the network learns to compensate misalignment errors. The performance drops for all states are less pronounced, e.g., achieving an absolute $F_1$ increase of +0.15 for INS at 30 cm in Fig. 10. Thus, for all evaluation states, our proposed MDD-M shows a robust $F_1$ performance even up to 20 cm map-sensor misalignment when employing the proposed countermeasure.
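The misalignment induction can be sketched as follows. The sketch assumes that a single shift is drawn per sample and applied to all map elements, and that the direction is obtained by drawing the azimuth and polar angles uniformly. Note that uniformly drawn angles do not yield a uniform distribution of directions on the sphere; this literal reading of the description is an assumption. The function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def random_shift(magnitude: float, rng: random.Random) -> Point:
    """Draw a translation of constant magnitude with a random direction."""
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    polar = rng.uniform(0.0, math.pi)
    return (
        magnitude * math.sin(polar) * math.cos(azimuth),
        magnitude * math.sin(polar) * math.sin(azimuth),
        magnitude * math.cos(polar),
    )


def shift_elements(
    centers: Sequence[Point], magnitude: float, seed: Optional[int] = None
) -> List[Point]:
    """Shift all map element centers by the same random translation.

    The input (e.g., the GT elements) is left unchanged.
    """
    dx, dy, dz = random_shift(magnitude, random.Random(seed))
    return [(x + dx, y + dy, z + dz) for x, y, z in centers]


map_centers = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
print(shift_elements(map_centers, magnitude=0.3, seed=0))  # 30 cm shift
```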

D. BENCHMARKS
As previously stated in Section I, our deviation annotations for 3DHD CityScenes [10] published along with our entire MDD pipeline comprising code for training, inference, and evaluation allow for benchmarking MDD methods. To the best of our knowledge, no other MDD methods for signs, lights, and poles comparable to ours are publicly available, preventing direct 1:1 comparisons. However, our MDD method, be it MDD-SC, MDD-MC, or MDD-M as variants, can be combined with other DNN architectures for 3D object detection in the field. Especially MDD-SC is easily combinable with such architectures, e.g., PointPillars [71] or VoxelNet [72], requiring only an adaptation of the network heads to predict map elements instead of road users. Hence, we compare PointPillars [71], VoxelNet [72], and our 3DHDNet [10] as employed architectures for MDD-SC to our specialized MDD network in MDD-M, incorporating our proposed MDD novelties, i.e., the additional map input and the classification of evaluation states. Note that both MDD-MC and MDD-M principally generalize to other architectures as well, e.g., point-based networks [74], [75], given a modified map injection procedure.
All compared methods are trained¹ using the strategies described in Section IV-C with the larger 60.8 m crop. Recall that 3DHDNet employs large cubic voxels of size $s_{vox}^3$ with $s_{vox}$ = 40 cm to compensate for the computationally expensive 3D convolutions applied in the 3D backbone [10]. Table 3 summarizes the $F_1$ score performance (16) for the detection of verifications VER and deviations DEV, and the L1 (17) or L2 (18) errors measuring the regression of bounding shapes, all obtained on the test set, with the DEV state summarizing deletions DEL, insertions INS, and substitutions SUB. Comparing PointPillars, VoxelNet, and 3DHDNet as employed architectures for MDD-SC, 3DHDNet clearly performs best regarding all measures, whereby the performance boost relative to VoxelNet is most significant for signs, being +5% and +7% absolute $F_1$ for VER and DEV, respectively. In comparison, the $F_1$ performance gains for VER and DEV w.r.t. lights and poles are smaller, reaching between +1% and +4%, which is expected as only signs appear as vertically stacked objects, being the design focus of 3DHDNet.
Regarding the time for GPU data transfer and network execution, VoxelNet requires 210 ms + 203 ms = 413 ms, 3DHDNet 30 ms + 295 ms = 325 ms, and MDD-M incorporating the map 43 ms + 313 ms = 356 ms. Regarding network execution time, MDD-M is slower by a factor of 6 than PointPillars, the fastest architecture, while the application of larger voxels reduces data transfer time drastically from 206 ms to 43 ms. Using an optimized implementation with TensorRT and low-density onboard LiDAR, PointPillars achieves a network execution time of 16 ms in [71], translating to a runtime estimation for MDD-M of 16 ms · 6 = 96 ms regarding onboard use cases, being at the edge of real-time capability, considering the typical onboard scan duration of 100 ms.
However, sparse convolutions can reduce the runtime of the 3D convolutions required for 3DHDNet by a factor of 3 [73], further supporting real-time capability.

¹ The results for 3DHDNet [10] (MDD-SC) and MDD-M in Table 3 are taken from Table 1 and Table 2 to provide an easy comparison at a glance.
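The runtime figures quoted in this section can be retraced with a few lines of arithmetic. The values are taken from the text, with the first summand being the GPU data transfer time and the second the network execution time; the dictionary layout and function names are assumptions for illustration.

```python
# (GPU data transfer in ms, network execution in ms), as quoted in the text.
RUNTIMES_MS = {
    "VoxelNet": (210, 203),
    "3DHDNet": (30, 295),
    "MDD-M": (43, 313),
}

SCAN_DURATION_MS = 100  # typical onboard scan duration


def total_runtime_ms(name: str) -> int:
    """Sum of data transfer and network execution time."""
    return sum(RUNTIMES_MS[name])


def onboard_estimate_ms(reference_ms: float, slowdown_factor: float) -> float:
    """Scale an optimized reference runtime by a relative slowdown factor."""
    return reference_ms * slowdown_factor


for name in RUNTIMES_MS:
    print(name, total_runtime_ms(name), "ms")

estimate = onboard_estimate_ms(16, 6)  # PointPillars with TensorRT, factor 6
print(estimate, "ms, within scan duration:", estimate <= SCAN_DURATION_MS)
```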

VI. CONCLUSION
In this article, we introduce a novel deep learning-based approach to map deviation detection (MDD) in high-definition (HD) maps and LiDAR data, utilizing the map as additional input to a neural network. To this end, we propose a specialized MDD network that verifies individual map elements (i.e., signs, lights, and poles) or detects and specifies respective map deviations. We compare our method to variants relying on ordinary object detection, showing the superior performance of our MDD network. Specifically, our approach achieves a strong verification and a good deviation detection performance of $F_1 = 0.99$ and $F_1 = 0.85$, averaged over all element types. Furthermore, our ablation studies degenerating the sensor input to simulate bad weather and partial occlusions show that our network maintains its verification performance with $F_1 = 0.93$ on average, despite 90% of all measurements being removed, which may allow for a continued and safe operation of the driving function even in challenging conditions. Also, our approach is robust against misalignment of sensor and map data as seen in practice, highlighting the suitability for onboard applications.

APPENDIX
In this section, we provide additional in-depth information for the interested reader.

A. NETWORK ARCHITECTURE
Subsequently, we detail the inner operation of the encoder and 3D backbone stages applied in the multitask extension of our earlier 3DHDNet [10] as mentioned in Section III-C1. Note that encoder and backbone are unchanged w.r.t. [10], while the concatenation of the LiDAR feature map m and map representation m is novel. The network architecture visualized in Fig. 11 implements the object detection DNN in Fig. 2 used for MDD-SC and MDD-MC.
The encoder stage learns an optimal feature representation for the point cloud. In the first encoding stage, each point $m_{n,k}$ of the LiDAR feature map $m$ is initially mapped to $L = 128$ features using a 2D convolution with a $1 \times 1 \times 10$ kernel, which yields the point-wise features $c_1 \in \mathbb{R}^{N \times K \times L}$. Subsequently, the maximum for each feature among all $K$ points in a voxel $n$ is taken, which gives a maximum feature vector of shape $N \times 1 \times L$. This maximum feature vector of length $L$ is then repeated $K$ times to match the dimensions $N \times K \times L$ of $c_1$. Each point contained in $c_1$ is concatenated with the maximum feature vector previously obtained for its voxel $n$, which provides the input with $2L = 256$ features to the second encoder stage. After first mapping points to $L' = 256$ features, the subsequent maximum operation yields the final encoding $c_2 \in \mathbb{R}^{N \times 1 \times L'}$ for all points contained in each voxel $n$. In a last step, the $N$ obtained feature vectors, one per voxel, are scattered back to their original positions in the 3D voxel grid with zero-padding applied to empty voxels, which yields the encoded LiDAR grid $g \in \mathbb{R}^{N_x \times N_y \times N_z \times L'}$.
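The shape flow of the two encoder stages can be traced with a minimal NumPy sketch. It is not the authors' implementation: the function names `encode_voxels` and `scatter_to_grid`, the random weights, the ReLU after each linear map, and the omission of batch normalization and of padding for voxels with fewer than K points are all assumptions made for illustration. A 1 × 1 × 10 convolution over an N × K × 10 input acts as a per-point linear map, which is what the matrix product below expresses.

```python
import numpy as np


def encode_voxels(m, w1, w2):
    """Two-stage point encoder sketch.

    m:  (N, K, 10) point features for N non-empty voxels with K points each.
    w1: (10, L) stage-1 weights (per-point linear map, i.e. a 1x1x10 kernel).
    w2: (2L, L2) stage-2 weights.
    Returns the voxel encoding with shape (N, 1, L2).
    """
    # Stage 1: map each point to L features (ReLU is an assumption).
    c1 = np.maximum(m @ w1, 0.0)                        # (N, K, L)
    # Per-voxel maximum over the K points, then repeat it K times.
    c1_max = c1.max(axis=1, keepdims=True)              # (N, 1, L)
    c1_rep = np.repeat(c1_max, m.shape[1], axis=1)      # (N, K, L)
    # Concatenate point-wise and voxel-wise features.
    stage2_in = np.concatenate([c1, c1_rep], axis=-1)   # (N, K, 2L)
    # Stage 2: map to L2 features and take the per-voxel maximum.
    c2 = np.maximum(stage2_in @ w2, 0.0).max(axis=1, keepdims=True)
    return c2                                           # (N, 1, L2)


def scatter_to_grid(c2, voxel_idx, grid_shape):
    """Place voxel encodings at their (x, y, z) indices; empty voxels stay zero."""
    grid = np.zeros(tuple(grid_shape) + (c2.shape[-1],), dtype=c2.dtype)
    grid[voxel_idx[:, 0], voxel_idx[:, 1], voxel_idx[:, 2]] = c2[:, 0, :]
    return grid


# Toy example: 5 non-empty voxels, 8 points each, on a 4 x 4 x 2 grid.
rng = np.random.default_rng(0)
N, K, L, L2 = 5, 8, 128, 256
m = rng.normal(size=(N, K, 10))
w1 = rng.normal(size=(10, L))
w2 = rng.normal(size=(2 * L, L2))
voxel_idx = np.array([[0, 0, 0], [1, 2, 0], [3, 1, 1], [2, 2, 1], [3, 3, 0]])

c2 = encode_voxels(m, w1, w2)
g = scatter_to_grid(c2, voxel_idx, (4, 4, 2))
```

Here `c2` has shape (5, 1, 256) and `g` has shape (4, 4, 2, 256), mirroring the N × 1 × 256 encoding and the zero-padded voxel grid described above.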
The 3D backbone stage processes $g$ and comprises a downstream and an upstream network, both utilizing 3D (transposed) convolutions on the 3D voxel grid, which allows for the individual detection of vertically stacked traffic signs, as we showed in our previous work [10]. The downstream network is composed of three blocks ("Dn" in Fig. 11), comprising 3, 5, and 5 layers, respectively. Each layer is composed of one convolution, one batch normalization, and one ReLU activation. For instance, Dn1 applies a 3D convolution with a $3 \times 3 \times 3 \times L'$ kernel on $g$. Upsampling blocks are each composed of one transposed convolution, followed by one batch normalization and one ReLU activation. Using the upstream network, feature maps obtained at different scales by the downstream network are upsampled to the original grid size and subsequently concatenated to obtain a feature grid in $\mathbb{R}^{N_x \times N_y \times N_z \times L''}$ as input to the multitask heads in Fig. 3 from Section III-C1, with $L'' = 3L' = 768$ and $L' = 256$. Note that the concatenation of the additional map feature map $m$ with $g$ is only active for MDD-MC and MDD-M.
FIGURE 11. [...] DNN (Fig. 2 (a) and (b)) as a 3DHDNet multitask architecture. Blue: Processing stages. Gray: Encoder and decoder stages. A 2D convolution ("2D") Conv(1x1, F) with input dimension $[N \times K \times 10]$ uses F kernels of size $1 \times 1 \times 10$, while a 3D convolution ("3D") Conv(1x1x1, F) with $[N_x \times N_y \times N_z \times L']$-dim. input uses $1 \times 1 \times 1 \times L'$-sized kernels. "Concat" indicates a concatenation. Downsampling (Dn) blocks comprise multiple convolutional layers.
FIGURE 12. Ground truth distributions for bounding shape parameters $(\overline{\,\cdot\,})$ for signs, lights, and poles, with the overline denoting ground truth. Brighter colors indicate higher occurrence. The z-position of signs is encoded as height above ground level.

B. ANCHOR DESIGNS
In the following, we describe the anchor design for signs, lights, and poles used to normalize the regression head outputs $r^s_g$, $r^l_g$, and $r^p_g$, respectively. We design our anchors featuring positions $p^A_g$ (1) and sizes (2) based on the ground truth (GT) distributions for respective bounding shape parameters shown in Fig. 12 (a)-(d).
Regarding signs, the ground truth distribution for bounding rectangle height and width in Fig. 12 (a) features peaks around $\overline{h}, \overline{w} \approx 0.65$ m (with the overline indicating GT), which we use as default anchor size for both $w^A_g$ and $h^A_g$ in (2). Moreover, most signs are sized above 0.4 m. Hence, we use $s_\mathrm{vox} = 0.4$ m as voxel size for the anchor grid. Also, the histogram in Fig. 12 (b) shows that most signs feature a height above ground $\overline{z} < 6$ m, which is covered by the z-extent of point cloud crops used as network input (see Section IV-A). For lights, the distribution for bounding box height $\overline{h}$ and width $\overline{w}$ of the (square-sized) base plate in Fig. 12 (c) peaks around $\overline{w} \approx 0.3$ m, with three peaks occurring at $\overline{h} \approx 0.3$ m, $\overline{h} \approx 0.6$ m, and $\overline{h} \approx 0.9$ m, due to the varying number of stacked lights in traffic light boxes. As we only apply single-sized anchors, we select the highest peak $\overline{h} \approx 0.9$ m, which yields $w^A_g = 0.3$ m and $h^A_g = 0.9$ m used as anchor size in (2). Regarding the pole diameter distribution in Fig. 12 (d), no clear peak can be identified. Thus, we use the mean of all diameters as anchor size, which yields $d^A_g = 0.2$ m for (2).

C. MATCHING STRATEGIES, ASSOCIATION METRICS, AND CRITERIA
For target generation during training, comparison of map elements, and performance evaluation, measuring the overlap of map elements is required. To this end, we employ element-type-specific matching strategies (being algorithmic procedures) visualized in Fig. 13, as well as association metrics that measure the overlap of anchors and GT objects. Also, we define association criteria as thresholds for respective metrics that have to be met for a successful association. During training, GT objects are matched to the anchor grid to provide the targets $o_g$ and $r_g$ for the respective network outputs. Such strategies, metrics, and criteria also apply during the comparison (cf. Fig. 2) of predicted elements with the examined map to identify deviations, and during performance evaluation when comparing the (predicted) set of evaluated elements $E_\mathrm{eval} = V \cup D \cup I \cup S$ with the respective GT set $\overline{E}_\mathrm{eval}$.
In general, a single GT element can potentially match with multiple anchors. If an anchor contained in voxel $g$ matches with the GT element, we set the anchor's object detection target $o^{(\cdot)}_g = 1$ and the regression target $r^{(\cdot)}_g$ according to (1), (2), and (3) using respective ground truth shape parameters. On the other hand, in case of a mismatch, we set $o^{(\cdot)}_g = 0$ with $r^{(\cdot)}_g$ being "don't care", as regression features are not considered during loss computation for mismatches (cf. (6)). For all elements, to ensure that a GT element is matched with at least one anchor, we consider the anchor with the smallest Euclidean distance $d_E$ to the GT element as a match, independently of further association criteria.
Regarding poles, we employ the matching strategy visualized in Fig. 13 (a). To measure the overlap with anchors surrounding the ground truth element, we define the "Intersection over Smaller Area" (IoSA) metric as the intersecting area of the two circles in the x-y-plane (given by a respective anchor and the ground truth cylinder diameter), indicated relative to the smaller of both circle areas. (The IoSA measure attains its maximum value of 1.0 if the anchor completely encloses the ground truth element or vice versa.) We require IoSA > 0.2 as association criterion to consider surrounding anchors as matches (see the three green dashed matching anchors in Fig. 13 (a)). Otherwise, if IoSA = 0, the respective anchor is considered a mismatch (blue). Moreover, for anchors having an insufficient overlap with the GT element, i.e., 0 < IoSA ≤ 0.2, we allow a "don't care" state (cyan) during training, in which an anchor is not considered for loss computation. Note that for element comparison (cf. Fig. 2) and performance evaluation, we rely solely on the Euclidean distance $d_E$ as association metric, which has proven to be more robust than the additional usage of IoSA. For both comparison and evaluation, we require $d_E / s_\mathrm{vox} \le 0.75$ as criterion for a successful association.
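As an illustration of the IoSA metric and the three training states for poles, the following sketch computes the intersection area of two circles with the standard closed-form lens formula. The function names, the state labels, and the example geometry are assumptions for illustration and are not taken from the authors' code.

```python
import math


def _clamp(x):
    """Clamp to [-1, 1] to protect acos against floating-point round-off."""
    return max(-1.0, min(1.0, x))


def circle_intersection_area(d, r1, r2):
    """Intersection area of two circles with radii r1, r2 and center distance d."""
    if d >= r1 + r2:
        return 0.0  # disjoint or touching circles
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2  # one circle encloses the other
    a1 = r1 ** 2 * math.acos(_clamp((d ** 2 + r1 ** 2 - r2 ** 2) / (2 * d * r1)))
    a2 = r2 ** 2 * math.acos(_clamp((d ** 2 + r2 ** 2 - r1 ** 2) / (2 * d * r2)))
    tri = 0.5 * math.sqrt(
        (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2)
    )
    return a1 + a2 - tri


def iosa(center_a, diameter_a, center_b, diameter_b):
    """Intersection over Smaller Area of two circles in the x-y-plane."""
    r1, r2 = diameter_a / 2.0, diameter_b / 2.0
    d = math.dist(center_a, center_b)
    return circle_intersection_area(d, r1, r2) / (math.pi * min(r1, r2) ** 2)


def pole_anchor_state(value, match_threshold=0.2):
    """Training state of an anchor for poles, following the thresholds in the text."""
    if value > match_threshold:
        return "match"
    if value == 0.0:
        return "mismatch"
    return "dont_care"


# Anchor and GT pole with 0.2 m diameter each, at varying center distances.
enclosed = iosa((0.0, 0.0), 0.2, (0.0, 0.0), 0.2)
partial = iosa((0.0, 0.0), 0.2, (0.1, 0.0), 0.2)
marginal = iosa((0.0, 0.0), 0.2, (0.19, 0.0), 0.2)
disjoint = iosa((0.0, 0.0), 0.2, (1.0, 0.0), 0.2)
```

With these inputs, the concentric pair reaches the maximum of 1.0, the pair offset by 0.1 m is a match, the pair offset by 0.19 m falls into the "don't care" band, and the disjoint pair is a mismatch.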
For lights, we apply a two-stage matching strategy visualized in Fig. 13 (b). First, we only consider the x-y-plane to find possible matching candidates. To this end, we approximate the base plate of a light's bounding box as a circle with diameter w to reuse the IoSA metric defined previously, whereby we require IoSA > 0.05 as association criterion for a match.
Second, we consider the z-dimension for the obtained matching candidates, visualized in the second row of Fig. 13 (b). To measure the overlap with anchors in the vertical z-dimension, we define the "Vertical Overlap" (VO) metric as the overlap $z_\mathrm{over}$ of the line segments given by the respective $z_\mathrm{min}$ and $z_\mathrm{max}$ values of anchor and GT element, normalized by the smaller of both segment lengths $z$, i.e., $\mathrm{VO} = z_\mathrm{over} / z$. An anchor is considered a final match if VO ≥ 0.2 and is discarded as a mismatch if VO ≤ 0.1. The "don't care" state applies for loss computation if 0.1 < VO < 0.2.
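The VO metric and its thresholds can be written down directly from the definition above. The sketch below is illustrative only; the function names, the state labels, and the example segments are assumptions.

```python
def vertical_overlap(a_min, a_max, b_min, b_max):
    """Overlap of two vertical segments, normalized by the shorter segment length."""
    len_a, len_b = a_max - a_min, b_max - b_min
    if len_a <= 0.0 or len_b <= 0.0:
        raise ValueError("segments must have positive length")
    overlap = max(0.0, min(a_max, b_max) - max(a_min, b_min))
    return overlap / min(len_a, len_b)


def vo_state(vo):
    """Training state of a matching candidate, following the thresholds in the text."""
    if vo >= 0.2:
        return "match"
    if vo <= 0.1:
        return "mismatch"
    return "dont_care"


# Anchor spanning z in [0.0, 1.0] m compared with three GT segments.
vo_half = vertical_overlap(0.0, 1.0, 0.5, 1.5)      # half of the segment overlaps
vo_small = vertical_overlap(0.0, 1.0, 0.85, 1.85)   # roughly 15 percent overlap
vo_none = vertical_overlap(0.0, 0.9, 2.0, 2.9)      # no overlap
```

Normalizing by the shorter segment means that a small element fully contained in a taller one still reaches VO = 1, analogous to the behavior of IoSA in the x-y-plane.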
A similar two-stage matching strategy is employed for signs as shown in Fig. 13 (c). First, to identify matching candidates in the x-y-plane, we define the "Distance to Line" (DtL) metric, which indicates the shortest Euclidean distance between the line segment of the GT rectangle (defined by the edge points) and an anchor's center point. Here, we require $\mathrm{DtL} / s_\mathrm{vox} < 0.5$ as criterion for a match. Second, we apply the VO thresholds in the same way as described for lights.
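The DtL metric is the standard point-to-segment distance in the x-y-plane. The following minimal sketch assumes the sign is given by its two edge points and uses the voxel size of 0.4 m stated for the anchor grid; the function names and the example coordinates are hypothetical.

```python
import math


def distance_to_segment(p, a, b):
    """Shortest Euclidean distance from point p to the segment from a to b (2D)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Projection parameter of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_sign_xy_candidate(anchor_center, edge_a, edge_b, s_vox=0.4, threshold=0.5):
    """First matching stage for signs: DtL / s_vox must be below the threshold."""
    return distance_to_segment(anchor_center, edge_a, edge_b) / s_vox < threshold


# GT sign of 1 m width along the x-axis, with anchors at different offsets.
near = is_sign_xy_candidate((0.5, 0.1), (0.0, 0.0), (1.0, 0.0))  # DtL = 0.1 m
far = is_sign_xy_candidate((0.5, 0.3), (0.0, 0.0), (1.0, 0.0))   # DtL = 0.3 m
```

Anchors that pass this first stage would then be checked with the VO thresholds in the z-dimension, as described for lights.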