Deep Learning-Based Cooperative LiDAR Sensing for Improved Vehicle Positioning

Accurate positioning is known to be a fundamental requirement for the deployment of Connected Automated Vehicles (CAVs). To meet this need, a new emerging trend is represented by cooperative methods where vehicles fuse information coming from navigation and imaging sensors for joint positioning and environmental perception. In line with this trend, this paper proposes a novel data-driven cooperative sensing framework, termed Cooperative LiDAR Sensing with Message Passing Neural Network (CLS-MPNN), where spatially-distributed vehicles collaborate in perceiving the environment via LiDAR sensors. Vehicles process their LiDAR point clouds using a Deep Neural Network (DNN), namely a 3D object detector, to identify and localize possible static objects in the driving environment. Data are then aggregated by a centralized infrastructure that performs Data Association (DA) using a Message Passing Neural Network (MPNN) and runs the Implicit Cooperative Positioning (ICP) algorithm. The proposed approach is evaluated using two realistic driving scenarios generated by a high-fidelity automated driving simulator. The results show that CLS-MPNN outperforms a conventional non-cooperative localization algorithm based on Global Navigation Satellite System (GNSS) and a state-of-the-art cooperative Simultaneous Localization and Mapping (SLAM) method, while approaching the performance of an oracle system with ideal sensing and perfect association.


I. INTRODUCTION
The demand for highly-accurate localization in Cooperative Intelligent Transportation Systems (C-ITS) has recently skyrocketed thanks to advances in Vehicle-To-Everything (V2X) communication technologies [1], [2], which have raised attention toward Connected Automated Vehicles (CAVs) and their associated use cases [3].
Nowadays, most of the proposed solutions rely on the widely available Global Navigation Satellite System (GNSS). However, even when augmented with inertial sensors, differential corrections, or multi-constellation receivers, GNSS cannot satisfy the requirements of highly-accurate positioning services [4], [5], as it is heavily affected by the poor visibility conditions that may occur in urban scenarios [6]. Novel design strategies (from both algorithmic and technological points of view) are being considered to improve vehicle positioning by integrating multiple onboard perception sensors [7], [8], side information on the sensed environment [9], or by exploiting V2X communication technologies [10], rooted in 5G or beyond-5G scenarios [11]-[16].
Over the years, several positioning enhancements have been proposed to overcome GNSS degradation in urban environments. Among them, cooperative algorithms based on centralized or distributed frameworks [17]-[22] are gaining popularity, as they improve vehicle positioning through the sharing of location-dependent information over V2X networks. Indeed, V2X links allow aggregating the sensing information collected at multiple vehicles, which act as mobile probes of a larger distributed sensing system with an augmented view of the vehicular environment (compared to that of a single ego vehicle) that facilitates the detection of objects along roads [23].

A. Related works
Cooperative localization algorithms for vehicular applications have been proposed considering the exchange of GNSS [24]-[26] or other types of radio measurements [27]-[33], as well as data fusion approaches [34]-[36]. In [24], a cooperative localization framework is proposed for correcting GNSS pseudoranges and improving vehicle positioning accuracy, while [25] integrates GNSS and measurements obtained from Dedicated Short-Range Communications (DSRC) with a cubature Kalman filter. The authors in [26] analyze the theoretical performance of inter-vehicle measurements extracted from GNSS observables. Besides GNSS-based solutions, radio measurements can also be used to enhance localization accuracy. Methods exploiting angle measurements are proposed in [27]-[29], while [30], [31] use the multipath information available from the channel impulse response. Other approaches instead focus on inter-vehicle distance measurements [32] or on the GNSS carrier phase [33]. Lastly, data fusion methods combining GNSS information and distances extracted from Ultra Wide Band (UWB) modules are studied in [34], [35], whereas [36] combines distance and angle measurements.
Besides relying on GNSS, V2X radio measurements, or data fusion, cooperative localization systems can also be implemented by exploiting imaging data from vehicle onboard perception sensors. A fast-developing technology that has been shown to provide promising detection and recognition capabilities for accurate environmental perception is LiDAR. The integration of LiDAR sensors within cooperative positioning schemes is therefore expected to bring significant advantages in terms of robustness and accuracy. Despite the potential of LiDAR, few cooperative approaches based on this technology have been studied [37]-[45]. The works in [37], [38] investigate the association issue across multiple vehicles, whereas [39] develops a decentralized cooperative localization method to fuse LiDAR, GNSS, and high-definition maps. In [40], a cooperative Simultaneous Localization and Mapping (SLAM) approach is developed for vehicle pose estimation, while [41] proposes a cooperative mapping technique to fuse multiple LiDAR views into an improved unified map. Similarly, [42] develops a multi-vehicle cooperative scheme where vehicles share locally processed maps to improve cooperative perception. Other studies focus on collaborative Machine Learning (ML) methods where intermediate Neural Network (NN) outputs are shared by vehicles [43], [44] and the road infrastructure [45]. Despite the competitive performance of the aforementioned methods, they either require the sharing of raw or partially processed point clouds, which may not be feasible for bandwidth-limited V2X applications, or they assume the availability of maps, limiting their application in rural or poorly visited areas.
In perception problems where passive, i.e., non-cooperative, targets are present, algorithms handling the Data Association (DA) between measurements and targets are needed. The pioneering Hungarian algorithm was proposed in [46] for solving such problems. It was then extended and superseded by approaches exploiting semantic information, which uniquely identifies targets through visual features such as color, size, or movement patterns [47]. When measurements include only position-dependent parameters, such as distance or angle, the association problem becomes non-trivial to solve. Solutions based on the Sum Product Algorithm (SPA) aim at efficiently computing the marginal association probabilities through message passing over suitable factor graphs [48], [49], although they might suffer in the presence of cyclic graphs and non-Gaussian noise models. To overcome these limitations, some works employ a double-loop algorithm, which forces the convergence of Gaussian belief propagation on non-walk-summable models [50] and involves the use of a preconditioner in an iterative method based on diagonal loading. Recent works exploit the capabilities of NNs over graphs, also known as Message Passing Neural Networks (MPNNs) [51], [52], which learn the correct association directly from semantic data [38]. MPNNs can discern both linear and non-linear relations between input and output data, are time-scalable similarly to SPA [53], and have demonstrated superior performance on loopy graphs, provided sufficient training data is available [54], [55].

B. Contributions
In this paper, we propose a novel data-driven cooperative localization method for GNSS augmentation in connected driving scenarios, suited for both rural and urban areas. The method builds on the Implicit Cooperative Positioning (ICP) paradigm initially developed in [56] and later extended in [18] to address DA. Compared to those former works, which assumed point-like detections obtained via generic sensing systems (e.g., LiDAR, camera, or radar), we here introduce a data-driven framework for accurate environmental perception from the onboard LiDAR sensors of the vehicles. Furthermore, we also propose to solve the DA problem via MPNNs to address the limitations of the SPA-based solution in [18]. We refer to the proposed methodology as Cooperative LiDAR Sensing with Message Passing Neural Network (CLS-MPNN).
In the proposed CLS-MPNN solution, each vehicle is equipped with a Deep Neural Network (DNN), namely a 3D object detector, to efficiently process the LiDAR point clouds so as to recognize and localize passive static objects in the surroundings. Note that the proposed framework can easily accommodate other types of sensors (e.g., cameras or radars) or multimodal solutions by properly modifying the 3D object detectors. Poles are selected as the reference objects to be detected by vehicles as they are easily recognizable from the point clouds, widespread on most roads, and fixed. Nevertheless, the proposed framework is general enough to support other objects according to the scenario (e.g., rural or urban environments). The individual detections made by the vehicles are then exchanged with a centralized infrastructure, namely a Mobile Edge Cloud (MEC), via Vehicle-To-Infrastructure (V2I) communications. The MEC coherently combines the measurements provided by the vehicles by executing the MPNN-based DA algorithm, and it then runs the ICP method to refine both object and vehicle positions. We consider a MEC-centralized processing architecture as it provides the best performance and can be implemented at the road infrastructure, even if it has the drawback of being a critical single point of failure of the system. The method is general enough to be adapted to distributed architectures, following, e.g., the approach in [18].
With respect to previous works that rely on sharing either partially processed [43]-[45] or raw LiDAR data [40], [41], the proposed method foresees only the exchange of the detected object and vehicle positions, thereby substantially limiting the communication overhead. Furthermore, the developed method does not rely on maps of the driving environment, as in [39], making the cooperative approach potentially applicable to any driving situation, especially when high-definition maps are unavailable. The proposed solution improves upon all our previous works, which provided an initial proof of concept using only single-stage detectors [57], [58], or focused on studying the DA problem alone [37], [38]. Herein, we provide a more complete and robust solution that jointly addresses the DA problem and improves the localization accuracy thanks to the cooperation among vehicles. Specifically, the main contributions are as follows:
• We embed the DA in a cooperative mechanism by extending the method in [37], [38] to enable tracking over time.
• We extend our frameworks in [37], [38], [57], [58] by integrating both single-stage and double-stage 3D object detectors to provide a comprehensive cooperative localization platform.
• We introduce a comparison with state-of-the-art cooperative SLAM approaches to show the benefits of the proposal in terms of positioning accuracy as well as in reducing the communication overhead.
• We study the generalization abilities of CLS-MPNN in new driving scenarios on which neither the 3D object detector nor the MPNN was trained. Besides, we also characterize the ability of CLS-MPNN to work with cheaper, lower-resolution LiDAR sensors and its robustness against false alarms.
The overall goal is to provide the achievable performance of a cooperative system that uses LiDAR sensing to detect static objects in the road environment for vehicle positioning augmentation. For this reason, we assume noisy detections while neglecting false alarms, or False Positives (FPs) (i.e., undesired objects that are either erroneously detected by the DNN model or generate high-intensity backscattering). Experimental results show that CLS-MPNN substantially outperforms an ego-vehicle positioning method based on GNSS tracking for different traffic densities and LiDAR resolutions, while approaching the performance of an ideal cooperative LiDAR sensing system with perfect DA that serves as a possible lower bound on the performance. Additionally, the proposal is shown to be advantageous even when compared with state-of-the-art cooperative SLAM algorithms and when the FPs are not removed. Lastly, when testing is performed in a scenario different from the one used for training, the cooperative approach still largely outperforms the ego GNSS tracking one, even though more missed detections are observed.
The paper is organized as follows. Sec. II details the system model considered throughout the paper. Sec. III presents the ML methods adopted for environmental perception, while Sec. IV describes the Bayesian tracking framework and the MPNN-based DA. Sec. V discusses potential implementation challenges. Sec. VI highlights the performance of the 3D object detectors and the MPNN, while Sec. VII presents the results characterizing the proposed cooperative localization framework. Lastly, Sec. VIII draws conclusions.

II. VEHICULAR SCENARIO AND MODELS
In this section, we introduce the connected vehicular scenario and the main assumptions used in the paper for the design of the cooperative localization solution. More specifically, Sec. II-A defines the models for the vehicle dynamics and the localization measurements available for cooperative positioning refinement, whereas Sec. II-B presents the modeling of the DA problem and the related variables.

A. Localization model
The road scenario comprises a set V = {1, . . . , N_v} of N_v vehicles moving over a 3D space. At discrete time t, each vehicle v ∈ V is characterized by a 3D position u_{v,t} = [u_{v,x,t}, u_{v,y,t}, u_{v,z,t}]^T and a planar velocity \dot{u}_{v,t} = [\dot{u}_{v,x,t}, \dot{u}_{v,y,t}, 0]^T, which are collected into the time-varying state x_{v,t}^{(V)} = [u_{v,t}^T, \dot{u}_{v,t}^T]^T. Additionally, each vehicle can establish a V2I communication link with a centralized infrastructure, namely a MEC, for exchanging positioning and sensing information. The vehicle state is assumed to evolve according to the following first-order Markov model, namely the constant-velocity model [59]

x_{v,t}^{(V)} = F x_{v,t-1}^{(V)} + K w_{v,t-1}^{(V)},    (1)

with F = [I_3, T_s I_3; 0_{3×3}, I_3] and K = [0.5 T_s^2 I_2; 0_{1×2}; T_s I_2; 0_{1×2}], while w_{v,t-1}^{(V)} is a zero-mean Gaussian driving process modeling the acceleration uncertainty, and T_s is the sampling interval. Besides vehicles, the scenario contains a set O = {1, . . . , N_o} of N_o non-cooperative static objects that can be detected by nearby vehicles. Ideally, ad-hoc objects should be placed in the driving environment to facilitate their detection by the imaging sensors of the vehicles. For this initial study, we select poles as detectable objects since they are extremely common, easily recognizable, and fixed. Each pole k ∈ O is described by a position state x_k^{(O)} = [x_{k,x}, x_{k,y}, x_{k,z}]^T, here assumed to be time-invariant. A visualization of a portion of the scenario from the software used for simulation, i.e., CARLA [60], is provided in Fig. 1.
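As an illustration, the constant-velocity transition above can be sketched in a few lines of NumPy. The sampling interval T_s = 0.2 s and the acceleration noise level are example assumptions, not values mandated by the model.

```python
import numpy as np

Ts = 0.2  # sampling interval [s] (assumed for this example)

I3 = np.eye(3)
# State x = [position (3), velocity (3)]; the vertical velocity stays zero.
F = np.block([[I3, Ts * I3],
              [np.zeros((3, 3)), I3]])
# K maps a 2D planar acceleration noise into the 6D state (zero vertical components).
K = np.vstack([0.5 * Ts**2 * np.eye(2),
               np.zeros((1, 2)),
               Ts * np.eye(2),
               np.zeros((1, 2))])

def predict(x_prev, sigma_a=1.0, rng=None):
    """One step of the first-order Markov model x_t = F x_{t-1} + K w_{t-1}."""
    w = (rng or np.random.default_rng()).normal(0.0, sigma_a, size=2)
    return F @ x_prev + K @ w
```

With sigma_a = 0 the prediction reduces to the deterministic part F x_{t-1}, which is also how the predicted vehicle position used later for DA can be obtained.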
The vehicles are able to estimate their own position using navigation units, e.g., GNSS, which are assumed to be always available. The GNSS measurement at vehicle v is modeled as

ρ_{v,t}^{(V)} = u_{v,t} + w_{v,t}^{(GNSS)},    (2)

where w_{v,t}^{(GNSS)} is a zero-mean Gaussian noise whose covariance denotes the GNSS position accuracy.
Each vehicle is also able to detect possible objects present in the surroundings via LiDAR sensors. Specifically, the point cloud P_{v,t} captured at time t by vehicle v is processed by a 3D object detector, whose task is to generate a candidate set of bounding boxes physically encapsulating the detected poles. Such a set is denoted as O_{v,t}, comprising the objects within the LiDAR sensing range R_s (here assumed to be equal for all vehicles).
The 3D object detector provides two different types of measurements. The first one refers to the 3D positions of the bounding boxes' centroids and is given by

ρ_{v,t}^{(O)} = {ρ_{v,k,t}^{(O)}}_{k=1}^{|O_{v,t}|},    (3)

where ρ_{v,k,t}^{(O)}, with k = 1, . . . , |O_{v,t}|, denotes the position of the k-th bounding box center relative to the v-th vehicle position at time t. The second one, instead, represents the detected poles as a collection of eight corners

z_{v,t}^{(O)} = {z_{v,k,c,t}^{(O)}}, k = 1, . . . , |O_{v,t}|, c = 1, . . . , 8,    (4)

where z_{v,k,c,t}^{(O)} represents the 3D coordinates of a corner point. We specifically represent objects as bounding boxes as they can be easily obtained from the output of the 3D object detector (as highlighted in Sec. III). Note that other choices can still be used by converting the output of the 3D object detector into an appropriate format. For DA purposes, we also define the sets of estimated absolute locations of the centroids and of the bounding box corners as

ρ̄_{v,t}^{(O)} = {ρ_{v,k,t}^{(O)} + ū_{v,t}},    (5)
z̄_{v,t}^{(O)} = {z_{v,k,c,t}^{(O)} + ū_{v,t}},    (6)

where ū_{v,t} is the predicted position of vehicle v at time t, obtained from its previous state estimate x̂_{v,t-1}^{(V)} according to the motion model in (1). The definition of the sets (5)-(6) is required to enable the coherent fusion of the detected objects across vehicles by performing the DA step in a common reference system. Specifically, the set z_{v,t}^{(O)} is useful for solving the DA problem, as it represents the bounding boxes accounting for both orientation and dimension. By contrast, the set ρ_{v,t}^{(O)} enables the estimation of the absolute object positions x^{(O)}.
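For illustration, a bounding box parameterized by its center, dimensions, and yaw can be expanded into eight corner points, and vehicle-frame detections can be shifted into absolute coordinates by adding the predicted vehicle position. The helper names below are hypothetical, and only a yaw (z-axis) rotation is assumed.

```python
import numpy as np

def box_corners(center, dims, yaw):
    """Eight corners of a yaw-rotated 3D box with center c, dims [l, w, h]."""
    l, w, h = dims
    # Corner offsets in the box's local frame, before rotation.
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * w / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * h / 2
    R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0, 0.0, 1.0]])
    return (R @ np.vstack([x, y, z])).T + np.asarray(center)

def to_absolute(rel_points, u_pred):
    """Shift vehicle-frame measurements by the predicted vehicle position."""
    return np.asarray(rel_points) + np.asarray(u_pred)
```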

B. Data association model
A fundamental step required to coherently aggregate the bounding box measurements generated by each vehicle is the DA, which enables the identification of bounding boxes relating to the same object across multiple vehicles. This operation is typically affected by several errors. Examples include the intrinsic error on the point clouds (in the order of centimeters), the error introduced by the object detection process (for some reference values, please see the analysis provided in Sec. VI), and the accuracy of the onboard positioning system (e.g., GNSS), which represents the largest source of error (in the order of meters). Overall, the combination of these noise sources makes solving the DA problem non-trivial. To address these shortcomings, we propose to employ MPNNs, as detailed hereafter.
Given the set Z_t = {z̄_{v,t}^{(O)}}_{v∈V} comprising all the bounding box corners generated by all vehicles at time t, the MPNN input is modeled starting from Z_t so as to obtain a directed graph G = (N, E), where N and E are the sets of nodes and edges, respectively. The graph G is obtained via a mapping function Φ_t : N → Z_t × V such that each bounding box k = 1, . . . , |O_{v,t}| of vehicle v is associated with a graph node n ∈ N, while the edge (n, m) ∈ E, with n ≠ m, denotes a potential association between the bounding boxes encoded by nodes n and m. Additionally, we introduce the association variable a_{n→m} ∈ {0, 1}, where a_{n→m} = 1 denotes the existence of the edge (n, m), or equivalently that the bounding boxes associated with nodes n and m refer to the same object, while a_{n→m} = 0 indicates that no edge exists between nodes n and m. The goal of the MPNN is to infer the association variables a_{n→m} by considering all the potential bounding box combinations across all vehicles. To avoid associating two (or more) bounding boxes estimated by the same vehicle, we impose a_{n→m} = 0 whenever the mappings Φ_t(n) and Φ_t(m) refer to the same vehicle.
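A minimal sketch of this graph construction: nodes index the per-vehicle boxes, and directed edges connect only boxes from different vehicles, mirroring the constraint that same-vehicle pairs are forced to a_{n→m} = 0. The function name and the detection container are illustrative.

```python
import itertools

def build_association_graph(detections):
    """Directed graph over all boxes; edges only between boxes of different vehicles.

    `detections` maps vehicle id -> list of box descriptors;
    each node id is a (vehicle, box index) pair, i.e., the mapping Phi_t.
    """
    nodes = [(v, k) for v, boxes in detections.items() for k in range(len(boxes))]
    # Same-vehicle pairs are never candidate associations, so those edges are skipped.
    edges = [(n, m) for n, m in itertools.permutations(nodes, 2) if n[0] != m[0]]
    return nodes, edges
```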
Besides the measurement-to-measurement association variables a_{n→m}, we also define the binary vector α_t = [α_{v,o,t}]_{v∈V, o∈O_t} collecting all the measurement-to-object association variables of all vehicles at time t, where O_t = ∪_{v=1}^{N_v} O_{v,t} is the set of indexes of all detected objects at time t. Given the output of the MPNN, another step is required to pair the (associated) measurements with the objects and obtain an estimate of the measurement-to-object association variables α_t. This step will be detailed in Sec. IV-C.

III. DEEP LEARNING-BASED LIDAR SENSING
In this section, we detail the strategies used for enabling each vehicle to perceive the driving environment. In particular, Deep Learning (DL)-based 3D object detectors are used to efficiently process the LiDAR point clouds and obtain the positions of the static objects in the vehicles' surroundings. Given the ego-vehicle LiDAR data, the 3D object detectors provide at the output a set B = {1, . . . , N_B} of N_B candidate bounding boxes physically encapsulating the detected objects. Each bounding box j ∈ B is compactly described as b_j = [c_j^T, d_j^T, γ_j]^T, where c_j is the position of the box center, d_j contains the object's dimensions, namely the length, width, and height, while γ_j is the yaw angle. The outputs provided by the detectors can then be suitably converted to obtain the measurements defined in (3)-(4). Besides, each bounding box is associated with a score probability, ranging from 0 to 1, that indicates the (soft) confidence of the detector. Herein, we assume that all bounding boxes provided by the 3D object detectors refer only to poles present in the driving environment, namely that all FPs are filtered out. Nevertheless, the score probability can be used to suppress the FPs, as analyzed in Sec. VII-A.
To design the cooperative localization framework, we employ two state-of-the-art 3D object detection methods of increasing complexity, namely PointPillars and Part-A², which are based on the single-stage and double-stage paradigms, respectively [61]. In the following, we initially describe the inner workings of PointPillars (Sec. III-A) and then present the Part-A² method (Sec. III-B).

A. PointPillars
PointPillars is a single-stage object detector that processes the input point cloud and extracts important features organized in pillars using PointNet [62] modules. The encoded features are then processed by a standard Region Proposal Network (RPN) to obtain the final predictions. In the following, we explain its inner working principles.
Starting from a single-vehicle point cloud P of N_{lp} points, PointPillars converts it into an image-based representation. Specifically, P is discretized into an evenly-spaced x-y grid of P non-empty pillars, each containing up to N_p points. The output is then fed to a simplified PointNet architecture, as presented in [63], to obtain a 2D pseudo-image of size C × H × W, with C, H, and W denoting the number of channels, height, and width of the canvas, respectively. The pseudo-image is used by an RPN to estimate the 3D bounding boxes. Specifically, the RPN is composed of a backbone, whose task is to compute important features, and a detection head that uses the backbone's features to generate the final predictions. In PointPillars, the Single Shot Detector (SSD) head with pre-defined anchor (or prior) boxes is used. The matching strategy [64] used by PointPillars assigns a positive match to the predicted bounding box having the highest 2D Intersection over Union (IoU) with the ground-truth box or a 2D IoU > 0.5. On the other hand, predicted boxes with 2D IoU < 0.35 are deemed negative samples. All other anchors are ignored for the loss computation. During inference, an axis-aligned Non-Maximum Suppression (NMS) stage [64] with an IoU threshold of 0.5 is applied to filter out overlapping boxes that refer to the same object. Finally, the NN parameters of PointPillars are updated according to the losses presented in [63] using the same configuration parameters.
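The pillarization step can be sketched as follows. The grid size and the per-pillar point cap are illustrative placeholders, not the configuration values used in the paper.

```python
import numpy as np

def pillarize(points, grid_size=0.16, max_points=32):
    """Group points into non-empty x-y pillars (a simplified PointPillars-style step).

    Returns a dict mapping each (ix, iy) grid cell to its list of points.
    """
    xy_idx = np.floor(points[:, :2] / grid_size).astype(int)
    pillars = {}
    for p, key in zip(points, map(tuple, xy_idx)):
        bucket = pillars.setdefault(key, [])
        if len(bucket) < max_points:  # points beyond the per-pillar cap are dropped
            bucket.append(p)
    return pillars
```

Each non-empty pillar would then be encoded by a PointNet-style module into the C × H × W pseudo-image consumed by the RPN.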

B. Part-A 2
Part-A² is a double-stage 3D object detector that introduces the concept of intra-object part estimation and aggregation. It extends PointRCNN [65] to concurrently learn segmentation masks for discerning foreground points, classify their position inside the bounding box, and generate 3D proposals, i.e., rough bounding box estimates. The proposals are then fed into a part-aggregation network that generates the final predictions, also exploiting the intra-object part information produced in the preceding stage. Note that in our experiments we use Part-A²-Anchor, which uses prior boxes for the proposal generation process. In the following, we briefly describe the overall architecture and how the bounding boxes are estimated.
In contrast to PointPillars, Part-A² processes the single-vehicle point cloud P using voxelization. In particular, a regular 3D grid is computed starting from P, and the points associated with each grid cell are grouped and averaged, leading to a new point cloud representation. Then, a UNet-based encoder-decoder [66] segments the foreground points, i.e., the points belonging to the ground-truth bounding boxes, from all the other points. Concurrently, an additional module is responsible for learning the intra-object part information. Then, an RPN is appended for generating the 3D proposals (with pre-defined prior boxes). The matching strategy uses bounding boxes with 3D IoU > 0.5 as positive anchors, while boxes having 3D IoU < 0.35 are deemed negative. All other proposals are neglected during the loss computation. After this step, only 64 positive and 64 negative anchor boxes are retained for further refinement.
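The voxelization-with-averaging input preparation described above can be sketched as below; the voxel size is an arbitrary placeholder.

```python
import numpy as np

def voxelize_mean(points, voxel=0.1):
    """Average the points that fall into each occupied 3D voxel.

    A simplified version of the Part-A2-style input preparation: each occupied
    voxel contributes one averaged point to the new cloud representation.
    """
    keys = np.floor(points / voxel).astype(int)
    cells = {}
    for p, k in zip(points, map(tuple, keys)):
        cells.setdefault(k, []).append(p)
    return np.array([np.mean(v, axis=0) for v in cells.values()])
```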
Once the proposals have been generated, a second network is responsible for improving the detection accuracy. This module takes as input the intra-object part locations estimated before, together with the proposals and the associated learned features, and refines the estimated bounding boxes. At first, the points falling within each proposal are transformed into a canonical coordinate system where the x and y axes are approximately parallel to the ground plane, while the z axis is the same as that of the input point cloud. This reduces the impact of the different rotation angles and locations of the proposals, leading to a better bounding box regression. Then, a Region of Interest (RoI)-aware pooling strategy is used to aggregate the features produced during the previous stage, as in [67]. Finally, the pooled features are fed to sparse convolutional layers whose outputs are used by two separate branches for bounding box regression and classification. The confidence scores are assigned by a 3D IoU-guided scoring strategy [67]. A confidence of 0 is assigned to boxes having 3D IoU < 0.25, while for boxes with 3D IoU > 0.75 the confidence is set to 1. In all other cases, the confidence is computed as 2·IoU − 0.5, i.e., by linear interpolation between the two thresholds. During inference, a rotational NMS strategy with an IoU threshold of 0.7 is applied to filter out overlapping boxes that refer to the same object. Part-A² is trained in an end-to-end manner by exploiting the losses presented in [67].
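Numerically, the IoU-guided scoring rule maps IoU values below 0.25 to 0, values above 0.75 to 1, and interpolates linearly in between; it reduces to a one-line clipped linear map:

```python
def iou_guided_confidence(iou):
    """IoU-guided score: 0 below IoU 0.25, 1 above 0.75, linear in between."""
    return min(1.0, max(0.0, 2.0 * iou - 0.5))
```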
Both DNN detectors are trained in a centralized manner and then deployed at the vehicles. In this way, each vehicle v can estimate the bounding boxes from the point cloud P_{v,t} and obtain the corresponding sets ρ_{v,t}^{(O)} and z_{v,t}^{(O)}.

IV. COOPERATIVE LIDAR POSITIONING
In this section, we present the CLS-MPNN localization framework employed for cooperatively localizing both objects and vehicles and tracking their states across time. Sec. IV-A introduces the Bayesian tracking solution; then we describe how the DA problem is addressed. In particular, we solve the DA problem in two steps: the first one estimates the measurement-to-measurement association variables via the MPNN (Sec. IV-B), while the second one estimates the measurement-to-target associations (Sec. IV-C). The overall block scheme of the CLS-MPNN positioning framework is summarized in Fig. 2.

A. Bayesian tracking solution
We assume that at each time t all vehicles v ∈ V send to the road infrastructure, via V2I links, their position measurements ρ_{v,t}^{(V)} together with the sets z_{v,t}^{(O)} and ρ_{v,t}^{(O)} of detected poles. The MEC is then in charge of cooperatively tracking the state evolution of the vehicles and the objects over consecutive time instants. To do so, we resort to a Bayesian approach where we are interested in computing the joint posterior probability density function (pdf) of vehicles and objects, namely p(θ_t | ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}), where θ_t = [x_t^{(V)T}, x_t^{(O)T}]^T collects all the vehicle states x_t^{(V)} and object states x_t^{(O)}, while ρ_{1:t}^{(V)} and ρ_{1:t}^{(O)} aggregate all the measurements ρ_{v,t}^{(V)} and ρ_{v,t}^{(O)}, respectively, of all vehicles up to time t. The overall state can then be estimated using the Minimum Mean Square Error (MMSE) criterion, leading to

θ̂_t = ∫ θ_t p(θ_t | ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}) dθ_t.

Such an estimation process entails computing p(θ_t | ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}), which needs to be evaluated after the DA is resolved. By conditioning with respect to the association variable α_t, the posterior is rewritten as

p(θ_t | ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}) = Σ_{α_t} p(θ_t | α_t, ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}) p(α_t | ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}),

where p(θ_t | α_t, ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}) is evaluated according to the mathematical formulation in [18], while p(α_t | ρ_{1:t}^{(V)}, ρ_{1:t}^{(O)}) is obtained from the DA procedure detailed hereafter.

B. Measurement-to-measurement association

The MPNN first maps each node and edge of the graph G into initial feature embeddings h_n^{(0)} = g_n^{enc}(·) and h_{(n,m)}^{(0)} = g_e^{enc}(·), where g_n^{enc}(·) and g_e^{enc}(·) are two Multi Layer Perceptrons (MLPs) at the node and edge, respectively, whose goal is to extract the feature embeddings from the bounding boxes. The MPNN then further encodes the node and edge embeddings using three additional MLPs at each node, namely g_n(·), g_n^{out}(·), and g_n^{in}(·), while an additional MLP g_e(·) is used at each edge. The role of each MLP is detailed in [38]. At iteration i, the MLP at each edge is used to update the edge embeddings as

h_{(n,m)}^{(i)} = g_e([h_n^{(i-1)}, h_m^{(i-1)}, h_{(n,m)}^{(i-1)}]),

where N_n is the set of neighbors of node n ∈ N, while the node embeddings are updated by aggregating, through g_n(·), the incoming and outgoing edge embeddings produced by g_n^{in}(·) and g_n^{out}(·) over the neighbors N_n. This procedure is repeated for I message passing iterations, and the association variable a_{n→m} is obtained by applying a thresholding operation (with threshold Γ) to the output of an edge classifier, denoted g_class(·). The MPNN is trained using a weighted binary cross-entropy loss computed after each iteration i, where a weight w is given to the positive class to compensate for the class imbalance; w is computed as the ratio between the number of 0-class and 1-class edges.
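A sketch of the weighted binary cross-entropy used to train the edge classifier, with the positive-class weight computed as the 0-class to 1-class edge ratio described above (function names are illustrative):

```python
import numpy as np

def weighted_bce(pred, target, w):
    """Weighted binary cross-entropy over edges; w up-weights the rare positive class."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)  # avoid log(0)
    per_edge = -(w * target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return per_edge.mean()

def positive_weight(targets):
    """Ratio of 0-class to 1-class edges, used to balance the loss."""
    targets = np.asarray(targets)
    return (targets == 0).sum() / max((targets == 1).sum(), 1)
```

In training, this loss would be evaluated on the classifier output after each of the I message passing iterations and summed.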

C. Measurement-to-target association
Once the association variables a_{n→m} are estimated, we can distinguish which sets of bounding boxes from different vehicles have been generated by the same object. This is done by extracting the connected components of the graph G considering only the positive-classified edges, i.e., those whose classifier output exceeds Γ. We denote by S_o the set of vehicle-measurement tuples that relate to the same target with unknown identification o. The goal now is to associate each o with the object indexes in O_{t-1}.
From the associated bounding boxes, we estimate the object position by averaging, over the tuples in S_o, the centroids computed over the eight corners:

x̂_{o,t}^{(O)} = (1/|S_o|) Σ_{(v,k)∈S_o} (1/8) Σ_{c=1}^{8} z̄_{v,k,c,t}^{(O)}.

Motivated by the fact that x̂_{o,t}^{(O)} holds a good approximation of the real object position, we solve this second association step by setting α_{v,o,t} = 1 according to the maximum likelihood criterion over the distance to the previously tracked object positions, provided that such distance is below a given threshold ∆. Otherwise, α_{v,o,t} = 0 indicates the detection of a newly-observed object.
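The two operations above, grouping positive-classified edges into connected components and gating each group centroid against the known object positions, can be sketched as follows (union-find is used as one possible implementation; names are illustrative):

```python
import numpy as np

def connected_components(nodes, positive_edges):
    """Group detections connected by positive-classified edges (union-find)."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for n, m in positive_edges:
        parent[find(n)] = find(m)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

def associate(group_corners, known_objects, delta):
    """Centroid of a group (mean over its boxes' 8 corners), gated to known objects."""
    centroid = np.mean([np.mean(c, axis=0) for c in group_corners], axis=0)
    if known_objects:
        dists = [np.linalg.norm(centroid - o) for o in known_objects]
        j = int(np.argmin(dists))
        if dists[j] < delta:
            return j, centroid  # matched to existing object j
    return None, centroid  # newly-observed object
```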

V. IMPLEMENTATION ASPECTS
This section highlights potential challenges that can arise in the implementation of the proposed cooperative approach in real systems. Specifically, we discuss the limitations of LiDAR sensors in Sec. V-A, while in Sec. V-B we detail strategies to limit the impact of inaccurate synchronization, unreliable networks, and signal interference. Privacy and security issues are covered in Sec. V-C.

A. LiDAR sensing technology limitations
Despite providing highly-accurate environmental sensing, LiDAR sensors are impaired by noise, occlusion, and adverse weather conditions, which require specific countermeasures. For example, ML-based methodologies could be designed to reduce the point cloud noise (see, e.g., [68]). Similarly, data-driven strategies could be utilized to predict objects that are partially occluded (see, e.g., [69]), reducing the degradation due to blockage. Under rain, fog, or snow conditions, the environmental sensing obtained by LiDAR may not be sufficiently accurate, and the integration of multiple imaging sensors (e.g., camera, radar, and LiDAR) is suggested to provide a reliable localization system under all weather conditions (for example, by drawing inspiration from [70]).

B. Synchronization, network reliability and signal interference
The performance of our cooperative localization system can be degraded by communication channel impairments such as poor synchronization, communication delays, unreliable networks, and signal interference. All these aspects need to be handled by dedicated protocols or processing strategies. In particular, synchronization between vehicles and the road infrastructure could be guaranteed by dedicated protocols, such as the Precision Time Protocol version 2 (PTPv2) [71], or by using synchronization signals extracted from the onboard GNSS receivers of the vehicles. Communication delays introduced by the network must be handled by temporally realigning the measurements received by the infrastructure, as discussed in [56]. Alternatively, the asynchronicity of measurements can be statistically accounted for within the tracking algorithm [72]. Concerning unreliable networks, vehicles could implement retransmission schemes, or the infrastructure could modify the fusion strategy to take packet drops into account. Signal interference among connected vehicles could be reduced by introducing ad-hoc scheduling strategies [73] or resource allocation policies [74].

C. Privacy and security considerations
From a privacy and security perspective, the proposed approach could leak information about the area where vehicles are currently located, which may be inferred from the information sent by the vehicles to the MEC. Additionally, unauthorized entities could tamper with the communications to modify the exchanged vehicle and/or object positions, jeopardizing the CLS-MPNN positioning performances. All these situations must be avoided by introducing ad-hoc security schemes. For example, the communication can be encrypted by adopting blockchain technologies, as discussed in [75]. On the other hand, the impact of malicious vehicles entering the cooperative process can be limited by employing authentication protocols [76].

VI. NEURAL NETWORKS TRAINING AND EVALUATION
This section details the strategies adopted for training the LiDAR-based 3D object detectors and the MPNN, together with their respective performances. In particular, Sec. VI-A provides the results obtained by the 3D object detectors, while Sec. VI-B focuses on the MPNN DA performances.

A. LiDAR-based object detection
This section focuses on the evaluation of the detection performances of PointPillars and Part-A². We are interested in characterizing their ability to correctly estimate the bounding boxes of the poles from the LiDAR point clouds, so as to evaluate the overall error introduced by the detection process while assuming the vehicle position to be exactly known.
For training the 3D object detectors, we generated a synthetic dataset from CARLA [60], an advanced driving simulator that allows flexible configuration of realistic vehicular layouts and sensor specifications. The dataset is collected in the urban map Town02 of CARLA, which spans roughly a 200 × 200 m² area, where a single LiDAR-equipped vehicle moves in the environment for 5000 timestamps sampled at T_s = 0.2 s.
The LiDAR sensor is placed on the roof of the vehicle and is configured to have 64 planes covering an elevation angle between −25° and 15°, a sensing range of 70 m, and an output of 1 million points per second, on average, with a ±2 cm accuracy. For a more realistic simulation, we randomly remove 20% of the points for each captured frame and apply an additional dropping strategy where points having a reflectance value r_j < 0.8 are randomly dropped. The ground truth data consists of 3D bounding boxes physically enclosing the poles for each captured LiDAR frame. Note that we consider only ground-truth boxes with at least 10 points inside. The performances of the object detector are evaluated over a held-out portion of the dataset composed of 1000 examples, while the remaining 4000 are used for training.
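The point-thinning procedure above can be sketched as follows. The text does not state the probability with which low-reflectance points are dropped, so `low_refl_drop` is an assumed parameter; `thin_point_cloud` is a hypothetical name.

```python
import numpy as np

def thin_point_cloud(points, reflectance, rng, base_drop=0.2,
                     refl_thresh=0.8, low_refl_drop=0.5):
    """Randomly thin a LiDAR frame: drop `base_drop` of all points, then
    additionally drop low-reflectance points (r_j < refl_thresh) with
    probability `low_refl_drop` (assumed value; the paper only states
    that low-reflectance points are randomly dropped)."""
    n = points.shape[0]
    keep = rng.random(n) >= base_drop                  # uniform 20% drop
    low = reflectance < refl_thresh
    keep &= ~(low & (rng.random(n) < low_refl_drop))   # extra drop for dim returns
    return points[keep], reflectance[keep]
```

Applying this per frame before training exposes the detectors to sparser, more realistic returns.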
Both detectors are implemented relying on the open-source repository in [77]. PointPillars is configured to have P = 16000 pillars of N_p = 32 points each, while the output image dimensions obtained after scattering back the learned features are set to C = 64, H = 432, and W = 496. On the other hand, Part-A² utilizes 16000 voxels. The prior box width, length, and height are selected as 1.1 m, 1.1 m, and 7.67 m, respectively, for both detectors. PointPillars is trained over 160 epochs with batches of 6 examples using an Adam optimizer [78] with learning rate η = 0.01, momentum parameters β_1 = 0.95 and β_2 = 0.85, and a weight decay factor of 0.01. A cosine annealing strategy [79] without warm restarts is used to schedule the learning rate. Similarly, Part-A² is trained using the same optimizer and learning rate scheduler but considering a maximum number of epochs of 80 and a batch size of 2 examples.
Data augmentation strategies are used to improve the performances of both detectors. During training, each point cloud is augmented by applying a random rotation in [-90°, +90°], a random scaling in [0.95, 1.05], and a random horizontal flip. Additionally, ground-truth sampling [80] is used, where at most 10 bounding boxes are added to each point cloud example. For Part-A², an additional augmentation strategy is adopted, whereby ground-truth sampling is applied by adding noise to the bounding boxes. Specifically, two types of noise are considered: a translational noise, which moves the bounding box on the horizontal plane, and a rotational noise, which adds a random rotation in the interval [-45°, +45°].
As performance metric we use the Average Precision (AP) as presented in [81]. For object detection tasks, the AP requires the computation of the IoU to discern positive and negative predictions. Given a pair {b_j, b̂_j} of ground-truth and predicted bounding boxes, the IoU is the ratio between the volume of their intersection and the volume of their union, i.e., IoU(b_j, b̂_j) = vol(b_j ∩ b̂_j) / vol(b_j ∪ b̂_j), and b̂_j is deemed positive if the IoU is greater than a threshold ε. The AP is evaluated considering two recall levels, L = 11 and L = 40, as recently suggested in [82].
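The IoU computation can be sketched for the simplified case of axis-aligned 3D boxes; a full evaluation, as used for the AP in the paper, would also account for the box yaw angle.

```python
import numpy as np

def iou_3d_axis_aligned(b, b_hat):
    """IoU between two axis-aligned 3D boxes given as (cx, cy, cz, w, l, h).
    Simplified sketch: box rotation (yaw) is ignored."""
    def bounds(box):
        c, d = np.asarray(box[:3]), np.asarray(box[3:])
        return c - d / 2, c + d / 2
    lo1, hi1 = bounds(b)
    lo2, hi2 = bounds(b_hat)
    # Per-axis overlap, clipped at zero for disjoint boxes.
    overlap = np.clip(np.minimum(hi1, hi2) - np.maximum(lo1, lo2), 0, None)
    inter = overlap.prod()
    vol = lambda box: box[3] * box[4] * box[5]
    return inter / (vol(b) + vol(b_hat) - inter)
```

A prediction with IoU above the chosen threshold ε against some ground-truth box counts as a true positive in the AP computation.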
As an additional performance metric, we evaluate the localization error of the predicted bounding boxes with respect to the ground truths, neglecting any FPs. We consider the centroid of the bounding box to assess how accurate and precise the predictions are with respect to the ground truths. For this purpose, we evaluate the Circular Error Probable (CEP) at 95% confidence, i.e., the 95th percentile of the location error cumulative distribution function (CDF), over each axis separately. Both performance metrics are evaluated on the validation dataset.
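The empirical CEP95 can be sketched as a percentile of the sorted per-sample error magnitudes; `cep` is a hypothetical helper name.

```python
import math

def cep(errors, confidence=0.95):
    """Empirical circular error probable: the smallest radius within which
    `confidence` of the location errors fall. `errors` is a list of
    per-sample error magnitudes (e.g. distances in metres)."""
    s = sorted(errors)
    k = math.ceil(confidence * len(s)) - 1  # index of the 95th percentile
    return s[max(k, 0)]
```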
Table II reports the AP for PointPillars and Part-A² considering IoU thresholds ε = {0.25, 0.50, 0.70} and recall levels L = {11, 40}. The AP is evaluated over the Bird's Eye View (BEV) projection, denoted as AP_BEV, which projects the predictions onto the x-y plane, and in the 3D space, referred to as AP_3D. Comparing the results, the AP_BEV and AP_3D metrics show that PointPillars provides more accurate detections compared to Part-A² for all values of L and IoU thresholds ε. Indeed, the AP of PointPillars is, on average, 5% higher than the one obtained by Part-A² for ε ≤ 0.50, while for more stringent thresholds, i.e., ε = 0.7, the gap between the two detectors becomes even larger. Focusing on the results under different recall levels L, one can notice that L = 11 provides lower values compared to L = 40, since relying on only 11 recall points generally leads to an underestimation of the actual performances. To further characterize the performance of the 3D object detectors, in Fig. 3 we report the CEP95 of the location error obtained by PointPillars and Part-A² over the three axes. The results confirm the findings of the previous analysis: PointPillars provides more accurate detections when compared with Part-A². More specifically, the two methods have similar CEP95 over the x and y axes, while for the z axis PointPillars achieves a location error lower than 13 cm in 95% of the cases, whereas Part-A² attains a much larger value of 29 cm.

B. MPNN data association
The MPNN is trained over the same CARLA map used for training the 3D object detectors. In particular, a new dataset is generated in Town02 where 20 LiDAR-equipped vehicles, with LiDAR sensors configured as in Sec. VI-A, move in the environment for 1000 timestamps, again with sampling time T_s = 0.2 s. The database is constructed by first running the PointPillars model on the point cloud data acquired by each vehicle individually to obtain the set of estimated bounding boxes. We specifically select PointPillars in this stage due to its more accurate detection performances compared to Part-A², as highlighted in the previous analysis. To make the MPNN-based DA strategy more resilient to the errors discussed in Sec. II-B, we add Gaussian noise with a standard deviation of 1 m to the bounding boxes provided by PointPillars. Then, the input graphs containing the true associations among the poles detected by the vehicles are used as ground-truth information for training the MPNN. We use 700 examples for training and the remaining 300 for validation. The MPNN is trained with a batch size of 20 examples for 50 epochs using the Adam optimizer with learning rate η = 0.001 and momentum parameters β_1 = 0.9 and β_2 = 0.999. To tune the hyper-parameter I, we analyze the accuracy, precision, and recall performance metrics on the validation dataset by varying the number of message passing iterations; the results are reported in Fig. 4. The error bars are computed using the mean values and standard deviations as uncertainties. Results show that the optimal number of iterations I is between 5 and 7. Indeed, I < 5 prevents effective feature extraction and elaboration, while I > 7 may lead to overly complex models and potential overfitting; that is, when I is too large, the model might start learning noise or outliers in the training data rather than the underlying patterns. Therefore, for the following analyses we set I = 5.
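The iterative message passing over the measurement graph can be sketched as follows. This is a minimal numpy sketch under stated assumptions: the single-layer tanh maps stand in for the encoder MLPs (the paper's Table I architecture, with dropout and ReLU, is not reproduced), the embedding size D = 16 and the threshold Γ = 0.5 are assumed, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                            # embedding size (assumed)
W_edge = rng.standard_normal((3 * D, D)) * 0.1    # stand-in edge update map
W_node = rng.standard_normal((2 * D, D)) * 0.1    # stand-in node update map
w_cls = rng.standard_normal(D) * 0.1              # stand-in edge classifier

def message_passing(h, edges, m, iters=5):
    """One association pass: h[n] are node embeddings (one per pole
    measurement), m[e] are edge embeddings for the candidate pairs in
    `edges`. Each iteration refreshes edges from their endpoint nodes,
    then nodes from the sum of their incident edges, mirroring the
    I message passing iterations described in the text."""
    for _ in range(iters):
        for e, (n, k) in enumerate(edges):
            m[e] = np.tanh(np.concatenate([h[n], h[k], m[e]]) @ W_edge)
        agg = np.zeros_like(h)
        for e, (n, k) in enumerate(edges):
            agg[n] += m[e]
            agg[k] += m[e]
        h = np.tanh(np.concatenate([h, agg], axis=1) @ W_node)
    return 1 / (1 + np.exp(-(m @ w_cls)))         # per-edge association score

h0 = rng.standard_normal((4, D))                  # 4 pole measurements
edges = [(0, 1), (0, 2), (2, 3)]                  # candidate same-object pairs
scores = message_passing(h0, edges, rng.standard_normal((3, D)))
associated = scores >= 0.5                        # threshold Gamma (assumed 0.5)
```

After training, edges whose final score exceeds Γ would be declared as measurements of the same pole, feeding the measurement-to-target association step.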

VII. COOPERATIVE LOCALIZATION RESULTS
This section analyzes the performances of the proposed cooperative localization approach in two realistic driving scenarios. The goal is to comprehensively assess the DA approach and the overall cooperative system as compared to ideal and state-of-the-art solutions. The first scenario is again based on the map Town02 of CARLA but with new trajectories and vehicle interactions (Sec. VII-A). The second one, instead, is a brand new environment on which the NNs have not been trained (Sec. VII-B). Finally, Sec. VII-C discusses the communication overhead of CLS-MPNN.

A. Town02 results
We initially assess the proposed method considering the Town02 map of CARLA, where twenty vehicles move in the area following different trajectories compared to the ones of Sec. VI. Unless stated otherwise, we assume that the onboard GNSS receiver provides a position estimate with accuracy σ_P^(V) = 5 m all over the map. The same parameters used in Sec. VI are reused to configure the LiDAR sensors onboard the vehicles. The motion model in (1) is calibrated according to the vehicles' motion, with σ_Q = 0.5 m/s², while the accuracy of the object detectors is as in Fig. 3. The overall simulation lasts 300 seconds and comprises 1500 snapshots, each sampled at T_s = 0.2 s.
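Since the exact form of the motion model (1) is not reproduced in this excerpt, a plausible sketch of the GNSS-KF baseline is a planar nearly-constant-velocity Kalman filter with white-acceleration process noise of standard deviation σ_Q, updated with GNSS position fixes; the state layout and function name are assumptions.

```python
import numpy as np

def gnss_kf_step(x, P, z, Ts=0.2, sigma_q=0.5, sigma_gnss=5.0):
    """One predict/update cycle of a planar nearly-constant-velocity
    Kalman filter fed by GNSS fixes. State x = [px, py, vx, vy],
    z = 2D GNSS position measurement."""
    F = np.eye(4)
    F[0, 2] = F[1, 3] = Ts                        # constant-velocity transition
    G = np.array([[Ts**2 / 2, 0], [0, Ts**2 / 2], [Ts, 0], [0, Ts]])
    Q = sigma_q**2 * G @ G.T                      # white-acceleration noise
    H = np.eye(2, 4)                              # GNSS observes position only
    R = sigma_gnss**2 * np.eye(2)
    x, P = F @ x, F @ P @ F.T + Q                 # predict
    S = H @ P @ H.T + R                           # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    x = x + K @ (z - H @ x)                       # update with the GNSS fix
    P = (np.eye(4) - K @ H) @ P
    return x, P
```

Running this at T_s = 0.2 s per vehicle reproduces the kind of non-cooperative baseline that CLS-MPNN is compared against.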
In the following, we assess the performances provided by the following positioning solutions:
• The proposed CLS-MPNN tool, integrating the detections provided by the 3D object detectors and the MPNN-based DA strategy. We will show results considering separately the detections obtained by PointPillars and Part-A².
• An oracle version of the proposed method, referred to as O-CLS-MPNN, with perfect DA and ideal LiDAR sensing where all poles present in the field of view of the LiDAR sensor are detected.
• A state-of-the-art cooperative SLAM approach [40] that requires vehicles to exchange their positions and the raw LiDAR point clouds, denoted as C-SLAM.
1) Cooperative localization performances: Fig. 5 shows the spatial Root Mean Square Error (RMSE) of the vehicle location estimate computed all over the map for GNSS-KF (Fig. 5a), O-CLS-MPNN (Fig. 5b), the proposed CLS-MPNN method considering the detections provided by PointPillars (Fig. 5c) and Part-A² (Fig. 5d), and C-SLAM (Fig. 5e). Additionally, Fig. 5f reports the CEP95 of the vehicle location error for all the compared methodologies. Analyzing the results, CLS-MPNN substantially outperforms GNSS-KF on all map positions, regardless of the detector employed, while approaching the performances attained by O-CLS-MPNN. The improvement brought by CLS-MPNN is nearly constant all over the map, apart from a few spots in the leftmost road where a decreased level of cooperation is observed due to a lower number of vehicles driving in that area. The developed method also provides more accurate results compared to the ones achieved by C-SLAM, as its location RMSE is largely reduced both for PointPillars and Part-A². The CEP95 values of the vehicle location error in Fig. 5f further confirm the superiority of the proposed cooperative localization framework. Indeed, the location error of CLS-MPNN integrating the detections provided by PointPillars and Part-A² is lower than 0.99 m and 1.17 m in 95% of the cases, respectively, while GNSS-KF and C-SLAM attain 6.36 m and 1.36 m, respectively. Therefore, CLS-MPNN enhances the localization accuracy in the considered scenario compared to GNSS-KF and C-SLAM. Besides obtaining more accurate position estimates, CLS-MPNN is more communication-efficient compared with C-SLAM, as only the estimated positions of vehicles and object bounding boxes are exchanged with the MEC, while C-SLAM requires vehicles to also share the raw LiDAR point cloud. Focusing on the comparison between O-CLS-MPNN and CLS-MPNN, some performance loss is experienced. This should not be surprising, as O-CLS-MPNN detects all possible poles recognizable by the LiDAR sensors while also carrying out perfect DA. Nevertheless, the gap between CLS-MPNN implemented using the detections provided by PointPillars and O-CLS-MPNN is only 6 cm.
2) Impact of GNSS accuracy: To complement the analysis shown before, we here evaluate the impact of varying the GNSS positioning uncertainty σ_P^(V). Fig. 6 reports the vehicle location error CEP95 attained by CLS-MPNN and GNSS-KF when σ_P^(V) ranges from 0.1 m up to 5 m. The results show that CLS-MPNN is particularly useful for enhancing the positioning performances when the GNSS is highly inaccurate. Indeed, when σ_P^(V) < 1 m the average improvement brought by CLS-MPNN with respect to GNSS-KF is 57%, while for σ_P^(V) ≥ 1 m the same improvement is roughly 87%. Besides, the analysis highlights that CLS-MPNN is advantageous even when the GNSS accuracy is high.
3) Impact of number of vehicles: In Fig. 7, we analyze how the CEP95 of the vehicle location error is influenced by the number of vehicles in the scenario. We compare the results for CLS-MPNN using PointPillars and Part-A² when N_v ranges from 2 up to 20. The achieved results show that the positioning accuracy degrades as the number of connected vehicles decreases: the localization error more than doubles when moving from the largest to the smallest number of vehicles. Nevertheless, even when N_v = 2, the proposed approach still largely outperforms GNSS-KF, regardless of the detector employed.

4) Impact of LiDAR resolution: We here analyze how the resolution of the LiDAR (i.e., the number of channels) affects the localization performances of CLS-MPNN. In the left panel of Fig. 8 we report the CEP95 values of the vehicle location accuracy when using LiDARs with 16, 32, and 64 channels, while the right panel highlights the percentage of correct detections obtained by PointPillars and Part-A². Note that the 3D object detectors are not retrained to account for the lower LiDAR resolution. We observe that as the number of channels decreases, the positioning accuracy decreases as well, regardless of the detector employed. This should be expected, since a lower resolution makes the detection process more difficult, leading to fewer objects detected and a consequent increase of the CEP95 value. To improve the object detection performances, super-resolution methods could be adopted [83]. We remark that CLS-MPNN still outperforms GNSS-KF even when using low-resolution LiDAR sensors.
5) Impact of false positives: Even though the proposed method is not formulated to deal with FPs, we here highlight how the detections can be filtered to minimize the false alarms without heavily impacting the performances of CLS-MPNN. To reduce the number of FPs, we employ three filtering strategies. The first one (Strategy 1) retains only bounding boxes detected by multiple vehicles at the same time instant, whereas the other two take as input the output of the first one and remove boxes having a score probability less than 0.15 (Strategy 2) and 0.25 (Strategy 3). Table IV and Fig. 9 report the percentage of FPs and the vehicle location error CEP95 of CLS-MPNN before and after applying the three filtering strategies for PointPillars and Part-A², respectively. Analyzing the results, the first strategy heavily limits the number of FPs without reducing the final accuracy. On the other hand, the other two policies introduce marginal accuracy drops but almost completely cancel out the false alarms, i.e., the FPs amount to less than 1% of all the detections. Indeed, removing bounding boxes that have a score lower than 0.25 reduces the percentage of FPs to 0.18% and 0.38% for PointPillars and Part-A², respectively. This in turn makes CLS-MPNN slightly less accurate, increasing its CEP95 to 1.19 m and 1.46 m for PointPillars and Part-A², respectively.
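The filtering cascade above can be sketched as follows. The gating distance used to decide that two vehicles observed the same box is not stated in the text, so `match_radius` is an assumed parameter; the detection tuple format is illustrative.

```python
import math

def filter_detections(dets, match_radius=2.0, score_thresh=0.25):
    """Sketch of the FP filtering cascade. `dets` is a list of
    (vehicle_id, (x, y), score) tuples from one time instant.
    Strategy 1 keeps boxes confirmed by at least one other vehicle
    within `match_radius` metres (assumed gating distance); the score
    threshold then implements Strategy 2/3."""
    confirmed = [
        d for d in dets
        if any(v != d[0] and math.dist(p, d[1]) <= match_radius
               for v, p, _ in dets)
    ]
    return [d for d in confirmed if d[2] >= score_thresh]
```

With `score_thresh=0.15` the function plays the role of Strategy 2, and with `score_thresh=0.25` that of Strategy 3.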
6) Computational requirements: We analyze the computational complexity of executing the proposed method by reporting in Fig. 10a the computational time and in Fig. 10b the associated memory utilization, with a breakdown of the individual contributions from the detector, the DA algorithm, and the cooperative tracker. Results show that the most time-consuming stages for running the developed approach are the DA step and the 3D object detection process. Concerning the memory utilization instead, a large memory portion should be devoted to storing the detectors. The figures also show that the time/memory required for running the DA and the cooperative tracking steps increases as the number of connected vehicles grows. This is reasonable since, contrary to the object detection process, which is performed onboard each vehicle, a larger number of vehicles leads to a more complex cooperation among vehicles in the DA and tracking stages.
7) Impact of network reliability: We here evaluate the robustness of CLS-MPNN against unreliable networks by introducing packet drop events in the cooperative framework. They are assumed to be independent across vehicles, and we consider that if a packet drop occurs between the v-th vehicle and the MEC, all the transmitted information (i.e., the position of the v-th vehicle and the pole bounding boxes detected by it) is lost. Herein, we do not rely on V2X networking simulators, as we want to benchmark CLS-MPNN under extreme conditions. Such conditions are unlikely to be encountered when utilizing V2X simulation frameworks but are useful to highlight the potential limits of the proposed approach. Fig. 11 shows the location error CEP95 achieved by CLS-MPNN when the packet drop probability ranges from 0% (i.e., no packet drops) to 40%. The results show that CLS-MPNN is negligibly affected if the packet drop probability is lower than 10%, regardless of the detector considered. For larger packet drop occurrences, a strong reduction in accuracy is experienced.
Comparing the results for different values of R_s, the performances of all methods do not deteriorate too much for R_s ≥ 40 m, while for R_s < 40 m a large performance drop occurs. Therefore, the proposed approach could be employed with lower-grade LiDAR sensors, provided that R_s ≥ 40 m.
9) Object position accuracy: As a final analysis, we evaluate the positioning accuracy of the poles, comparing the proposed approach with an ego-sensing solution. In the latter, vehicles obtain their positions via the GNSS-KF approach and use this information jointly with the LiDAR detections to localize the poles in a global reference frame. The results of this comparison are highlighted in Fig. 13, considering the detections provided by the PointPillars and Part-A² detectors. The CEP95 location error of CLS-MPNN is largely reduced compared to the one attained by the ego-sensing solution for both detectors. This highlights the benefits of employing CLS-MPNN to also improve the localization performances of the poles with respect to non-cooperative sensing schemes.

B. Town10 results
In this section, we evaluate CLS-MPNN in a new scenario on which the NNs have not been trained. This new evaluation is instrumental for characterizing the developed method under unseen conditions. For this scope, we employ the Town10 map of CARLA, a more complex environment comprising an overpass and roads with multiple lanes, where 20 LiDAR-equipped vehicles move in the environment. The same calibration and LiDAR parameters employed in Sec. VII-A and in Sec. VI are also used here. The simulation lasts 300 seconds and comprises 1500 snapshots, each sampled at T_s = 0.2 s. As done before, we report in Table V the percentages of correct detections, missed detections, and FPs attained by PointPillars and Part-A². For this map, Part-A² is observed to generalize better, as it provides fewer missed detections compared to PointPillars, despite detecting a larger number of false positives. This is in line with the results provided in [84], which show that double-stage detectors (e.g., Part-A²) outperform single-stage ones (e.g., PointPillars) when the testing conditions change. Compared with the values provided in Table III, FPs and missed detections are higher, indicating that fine-tuning the detectors on this new map may help improve the performances. Fig. 14 shows the spatial RMSE computed all over the map for GNSS-KF (Fig. 14a), O-CLS-MPNN (Fig. 14b), and the proposed CLS-MPNN method considering the detections provided by PointPillars (Fig. 14c) and by Part-A² (Fig. 14d), while Fig. 14e shows the CEP95 of the location error obtained by all methods. The results of this analysis are in line with the findings of the previous section: CLS-MPNN largely outperforms GNSS-KF regardless of the detector considered, while approaching the performances of O-CLS-MPNN. Again, some spots can be noticed where the improvement brought by CLS-MPNN is only marginal due to a low cooperation level between vehicles. With respect to the previous case, the CEP95 of the location error is generally lower for all methods. This is related to the fact that a larger number of poles, which can be exploited for cooperative vehicle positioning refinement, is present on this map with respect to Town02. Another difference with respect to the previous results is that the performance gap between CLS-MPNN and O-CLS-MPNN is higher, possibly due to the larger number of missed detections. Nevertheless, CLS-MPNN is shown to generalize to this new scenario, as it provides accurate vehicle positioning. Contrary to before, the detections provided by Part-A² should now be chosen for maximizing the vehicle position accuracy.
To conclude the analysis, we report in Fig. 15 the CEP95 of the location error obtained by CLS-MPNN implemented with the detections provided by PointPillars and Part-A², together with the one achieved by O-CLS-MPNN, for varying LiDAR sensing ranges. On this map, the predictions provided by Part-A² lead to a lower positioning error compared to the ones provided by PointPillars for nearly all sensing ranges. Therefore, despite Part-A² providing less accurate performances in Town02, it should be preferred to PointPillars as it generalizes better to unseen scenarios and attains higher localization accuracy when integrated with CLS-MPNN. Notwithstanding this aspect, it is advisable to partially retrain or fine-tune the models to enhance the positioning accuracy even more, as the performance gap between O-CLS-MPNN and CLS-MPNN is higher when compared to the same gap shown in Fig. 12.

C. Considerations on the communication overhead
We here compare the communication overhead of the proposed CLS-MPNN with the one required by C-SLAM. An upper bound on the communication overhead (in bits) required by each vehicle to run CLS-MPNN is N_vd E_v + N_od N_c N_o E_o. The first term encodes the number of bits used for exchanging the 3D vehicle position, where N_vd = 3 and E_v denotes the number of bits encoding each coordinate, while the second refers to the exchange of the N_o detected objects, where N_c = 8 is the number of 3D corners of the bounding box, N_od = 3, and E_o is the number of bits used for encoding each corner coordinate. A typical value for E_v and E_o is 32 bits. On the other hand, the communication overhead of C-SLAM is upper bounded by N_vd E_v + N_pd N_ℓp E_ℓp. The first term is the same as before, while the second accounts for the bits required for sharing the raw point cloud, with N_pd = 3 and E_ℓp denoting the number of bits representing each coordinate of a single point of the cloud. Clearly, CLS-MPNN is advantageous over C-SLAM when N_pd N_ℓp E_ℓp ≫ N_od N_c N_o E_o. This is typically verified considering LiDAR sensors with at least 16 channels. Indeed, on average, such sensors output N_ℓp = 300000 points for each frame [85]; thus, the term N_pd N_ℓp E_ℓp corresponds to 0.28 Mbit (assuming that point clouds are compressed using G-PCC). To get the same communication overhead for CLS-MPNN, N_o should be at least 375, which is far beyond the maximum number of output objects from 3D detectors (limited to 50). In this condition, the communication overhead of CLS-MPNN is reduced by 7.5 times compared to C-SLAM.
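A back-of-envelope check of the overheads discussed above, using the parameter values stated in the text; the 0.28 Mbit figure for the G-PCC-compressed point cloud is taken from the text as given.

```python
# Per-vehicle overhead of CLS-MPNN: vehicle position + object box corners.
N_vd, E_v = 3, 32            # position dims, bits per coordinate
N_od, N_c, E_o = 3, 8, 32    # corner dims, corners per box, bits per value
N_o = 50                     # max objects output by the 3D detector
cls_mpnn_bits = N_vd * E_v + N_od * N_c * N_o * E_o

# C-SLAM additionally ships the (G-PCC compressed) raw point cloud,
# quantified in the text at roughly 0.28 Mbit per frame.
cslam_bits = N_vd * E_v + 0.28e6

ratio = cslam_bits / cls_mpnn_bits   # roughly 7x in favour of CLS-MPNN
```

The exact ratio depends on the achieved compression rate, but even the bound with the maximum of 50 detected objects keeps CLS-MPNN several times cheaper than sharing point clouds.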

VIII. CONCLUSION
In this paper, we proposed a novel data-driven cooperative localization and sensing framework, referred to as CLS-MPNN, for GNSS augmentation in complex urban environments. The framework leverages an NN, i.e., a 3D object detector, at the vehicles for recognizing and localizing static objects in the surroundings. The detections are then aggregated at a centralized unit, namely a MEC, that performs an MPNN-based data association procedure to coherently combine the measurements provided by multiple vehicles and referring to the same object. Once the data association is solved, the MEC employs a cooperative Bayesian tracking algorithm to improve the objects' localization accuracy and implicitly refine the vehicle positioning as well.
We employed two synthetically-generated, yet realistic, urban scenarios to assess the proposed cooperative method. The first one is used to assess the performances of the employed single-stage/double-stage object detectors and the MPNN data association accuracy, and, finally, to evaluate the performances of the overall cooperative localization framework. The second one, instead, is utilized to characterize the ability of the developed technique to generalize to new, unseen conditions. The results show that CLS-MPNN is able to substantially outperform a conventional non-cooperative solution based on stand-alone GNSS and a state-of-the-art cooperative SLAM approach. Furthermore, when the scenario is modified with respect to the one used during training, a marginal performance loss is experienced by CLS-MPNN due to more missed detections. Nevertheless, CLS-MPNN is still able to substantially reduce the positioning errors with respect to the ego positioning solution, while approaching the performances of an oracle system that perfectly associates and detects all objects present in the surroundings of the vehicles. The analysis also shows that double-stage 3D object detectors are to be preferred to single-stage ones, as they show better generalization performances to new scenarios.
Future research activities will target the assessment in real-world driving scenarios as well as the development of end-to-end cooperative learning strategies where 3D object detection and data association tasks are carried out by a single NN, possibly designed to also support fully distributed implementations. Besides, the MPNN-based data association strategy could be extended to deal with false positives.

Fig. 1. Representation of the vehicular sensing scenario with N_v = 3 vehicles and N_o = 9 non-cooperative objects (poles). (a) Top view of a scenario extracted from CARLA. (b) LiDAR point clouds captured by the vehicles.

Fig. 2. Block scheme of the proposed CLS-MPNN localization framework. Each vehicle processes its local LiDAR point cloud via 3D object detectors to localize passive objects. The GNSS and the bounding box measurements are aggregated and shared with the MEC, which performs DA and cooperatively tracks both objects and vehicle states.
…the pdf p(α_t | ρ^(V)_{1:t}, ρ^(O)_{1:t}) is obtained by solving the DA. Specifically, we split the computation of p(α_t | ρ^(V)_{1:t}, ρ^(O)_{1:t}) into two steps. The first one uses the MPNN to pair bounding boxes hypothesized to refer to the same object, while the second evaluates the measurement-to-target association, providing as output the pdf p(α_t | ρ^(V)_{1:t}, ρ^(O)_{1:t}).
B. Measurement-to-measurement association
The MPNN takes as input the pole measurements z^(O)_{v,t} and ρ^(O)_{v,t}, and constructs the input graph G as detailed in Sec. II-B. The aim of the MPNN is to learn the correct association among the pole measurements by operating on the graph G with an iterative message passing scheme. The overall process is composed of I message passing iterations, and at each iteration i = 1, …, I the node and edge embeddings, denoted as h^(i)_n and m^(i)_{n→m}, are used to spread the information throughout G. More specifically, at iteration i = 0, the node and edge embeddings are initialized as …, and an edge is deemed associated if a^(I)_{n→m} ≥ Γ. Other strategies can be used for obtaining a^(I)_{n→m}, such as the Hungarian algorithm. Table I summarizes the layers used by each encoder in the MPNN architecture. Note that after each linear layer, a dropout operation and a ReLU activation function are added.

Fig. 4. DA task validation performances reached after 50 epochs of training as a function of the number of message passing iterations I.
• A non-cooperative GNSS-based tracking tool where a Kalman Filter (KF) integrates the positions provided by the GNSS units, referred to as GNSS-KF.
The main difference between O-CLS-MPNN and CLS-MPNN is that CLS-MPNN is subject to missed detections, while O-CLS-MPNN is not, as all detectable poles are always correctly recognized provided they fall inside the field of view of the LiDAR sensor. The detection performances of PointPillars and Part-A² are reported in Table III, which highlights the percentages of correct detections, missed detections, and false positives obtained by the two detectors. The results indicate that PointPillars provides more robust results, as its missed detection and false positive percentages are lower than those attained by Part-A².

Fig. 7. CEP95 values of the vehicle location error vs the number of vehicles.

Fig. 8. CEP95 values of the vehicle location error and percentage of correct detections vs LiDAR resolution.

Fig. 11. Impact of packet drops on the vehicle location error CEP95.

8) Impact of LiDAR sensing range: Fig. 12 compares the vehicle positioning performances obtained by CLS-MPNN integrating the PointPillars and Part-A² detections with the ones achieved by O-CLS-MPNN, considering different LiDAR sensing ranges R_s from 20 m up to 70 m, i.e., the case considered previously. Comparing the results, PointPillars achieves a lower CEP95 compared with Part-A² in nearly all cases. This should be expected since, in this scenario, PointPillars detects more poles compared to Part-A² (as highlighted in Table III).

Fig. 12. CEP95 values of the vehicle location error in Town02 for O-CLS-MPNN and CLS-MPNN with PointPillars and Part-A² for different LiDAR sensing ranges.

Fig. 14. Spatial RMSE of vehicle positions over the Town10 simulation map: (a) GNSS-KF, (b) O-CLS-MPNN (perfect DA and LiDAR sensing), (c) CLS-MPNN with PointPillars, (d) CLS-MPNN with Part-A², (e) comparison of the CEP95 of the location error for all methods.
[Table I: layer type and output size for each encoder in the MPNN architecture; table structure not recoverable from the extraction.]