Cooperative Lidar Sensing for Pedestrian Detection: Data Association Based on Message Passing Neural Networks

This paper considers the problem of cooperative lidar sensing in vehicular networks. We focus on the task of associating the vehicle-generated measurements by lidars to enable a cooperative detection of vulnerable road users. The considered measurements are the three-dimensional bounding boxes extracted from the lidar point cloud. Focusing on a centralized architecture which aggregates and processes all the sensing information, we design a graph formulation of the association problem and we propose a novel solution based on Message Passing Neural Networks (MPNNs). The method has the advantage of accurately learning the associations and the measurement statistics directly from data. We validate the proposed approach on a cooperative sensing scenario simulated by CARLA, an open-source high-fidelity simulator for automated driving scenarios. For the generation of bounding boxes related to pedestrian detections, we consider both artificially-generated and realistic measurements obtained by employing the PointPillars model. We validate the performance by comparing the proposed MPNN model with the Sum-Product Algorithm for Data Association (SPADA), a common approach for data association in multisensor systems. The proposed data-driven MPNN model achieves an association accuracy above 99% and outperforms SPADA in case of moderate sensing errors, as foreseen by automated driving scenarios. We also assess the efficacy of data association in case of mis-modeling between training and testing datasets, observing good generalization capabilities when dealing with untrained conditions.


I. INTRODUCTION
A. Contextualization and Background
In the last two decades, driving automation functionalities have advanced at a remarkable pace, enabling an accurate perception of the environment that enhances vehicle safety [1], [2]. At the same time, the development of cellular communications for the automotive vertical (e.g., 5G and beyond) is driving a new connectivity paradigm for mobility [3], [4], [5]. Vehicle-to-Everything (V2X) communications enable seamless information sharing among vehicles, road infrastructure and any other road entity over Vehicle-to-Vehicle (V2V) or Vehicle-to-Infrastructure (V2I) links. Examples of information exchanged in V2X networks include sensor data, driving intents and planned trajectories, or safety-related messages [6]. Moreover, V2X communications extend the ego-sensing capabilities beyond the immediate field of view of on-board sensors, enabling cooperation across the sensing systems of different vehicles. The aggregation of data from spatially distributed sensors (both on vehicles and road infrastructure) through V2X links fosters the deployment of the so-called Cooperative Localization (CL) systems [7], [8], [9], [10], [11], [12]. A relevant use case for CL in V2X networks is Vulnerable Road User (VRU) detection [13], [14], where cooperation can significantly improve the detection capability.
Regardless of whether these sensors are located in the same vehicle [15], [16] or across different units [17], [18], [19], CL heavily relies on the correct association of sensor measurements, i.e., data association [20], [21], [22]. While data association may appear to be a simple task, numerous studies have emphasized the importance of addressing this problem due to the limitations of naive solutions that simply associate closely detected objects [23], [24], [25], [26]. These solutions yield meaningful results only if all vehicles detect an identical number of objects and experience no false alarms due to clutter, which is an unrealistic assumption given the varying sensor hardware and fields of view. Consequently, it is essential to associate multiple sets of measurements related to distinct detected objects that are only partially in common among vehicles.
In this paper, we focus on a centralized solution in V2X networks where a central processing unit is in charge of combining the raw data (or derived characteristics) from all connected vehicles. Specifically, the aggregated measurements are bounding boxes extracted from the lidar sensors at the vehicles. Centralized solutions in the literature mainly rely on probability-based methods such as Belief Propagation (BP), also known as the Sum-Product Algorithm (SPA), which gives a systematic approximation of optimal Bayesian inference with an appealing performance-complexity trade-off [45]. BP uses an iterative message passing exchange of information over a suitable graph characterizing the specific problem. BP-based techniques are optimal in case of linear and Gaussian models, but provide only an approximation in case of loopy graphs or statistical distributions arising from real systems [21], [22]. Differently from BP-based solutions, we here propose to use Message Passing Neural Networks (MPNNs), which can improve upon SPA by directly learning the correct associations and noise distributions from data.

B. Related Works
The first works on Multiple Object Tracking (MOT) were developed in the domain of radar and sonar tracking [30], [31], [32], [33], [34]. Traditional MOT methods, such as Joint Probabilistic Data Association (JPDA) [46], Linear Joint Integrated PDA (LJIPDA) [47] and Multiple Hypothesis Tracker (MHT) [48], assume that the number of targets is known and jointly estimate the target states and association variables. These approaches were later extended to multi-sensor scenarios [49], [50], as in Linear Multitarget IPDA (LMIPDA) [51]. Recent studies, including probability hypothesis density (PHD) filters [52], [53], adopt finite set statistics to predict the number of targets and the target states without directly estimating the association variables. Other studies addressing probabilistic data association can be found in [54], [55], [56], [57], [58], [59]. However, most MOT approaches have limited scalability as the number of sensors and targets grows. Improvements in this respect have been introduced by BP techniques, which are able to achieve high scalability [45]. BP approaches have been investigated for both centralized [27], [28], [29], [36] and distributed [37], [38] solutions.
As far as data association is concerned, one of the most prevalent approaches is to use a graph formulation, which facilitates the description of relationships among multiple measurements of the same set of detected targets. Many solutions that use graphs to solve data association take into account all feasible assignments at the same time, leading to an NP-hard combinatorial problem [60], [61]. To reduce the complexity, sub-optimal (greedy) methods have been proposed, casting the problem as a linear one and addressing it through minimum-cost [62] and maximum-flow algorithms [63]. These methods, however, do not guarantee satisfactory performance, especially in cluttered and occluded environments [64], [65].
Another possibility to successfully manage data association is to build Machine Learning (ML) models that learn directly from data. ML, and in particular Deep Neural Networks (DNNs), have been embedded inside graphs thanks to the rise in popularity of Graph Neural Networks (GNNs) [66], [67]. GNNs, and more specifically MPNNs, inherit the message passing structure of SPA to produce the desired output from a set of inputs. Indeed, they have been used jointly with the SPA to improve the overall performance by correcting errors created by cycles and model mismatch [68], [69]. Compared to DNNs, MPNNs have fewer parameters, yet they can still capture the linear and non-linear relationships between input data and output while remaining scalable [70]. Moreover, MPNNs have been shown to outperform BP on loopy graphs, provided that enough training data is available [71].

C. Contribution
To the best of our knowledge, GNNs have never been explored for cooperative sensing in vehicular networks. Drawing inspiration from [72], where the detections obtained by a single camera system were associated over consecutive time frames, we here modify and extend the approach to a cooperative (i.e., multi-vehicle) scenario. We consider a centralized network of vehicles, each with a single lidar sensor, with overlapping Fields of View (FOVs) allowing a cooperative detection (at the same time instant) of pedestrians through the association of multiple bounding boxes extracted from the lidar point clouds. We selected pedestrians as they are passive elements of the environment, are extremely relevant for safety-related applications (e.g., vulnerable road user protection) and are commonly present in urban areas. Alternatively, vehicles could also be used, but they are typically equipped with active devices in Cooperative Intelligent Transport Systems (C-ITS) (following V2X paradigms), and thus notify their presence to the near surroundings. On the other hand, passive targets, such as pedestrians, traffic signals or poles [73], are not uniquely identified and, therefore, data association is needed for their recognition and cooperative detection by multiple sensors.
We assume that the lidar detection system does not incur false detections (i.e., incorrect bounding boxes), which would require tracking over time to resolve the ambiguity; here we focus on snapshot-based data association. This assumption may not always hold in real-world scenarios, especially when objects are partially occluded. However, we employ a filtering strategy that is widely adopted in deep learning object tracking and discards bounding boxes with low detection confidence (see, e.g., [74] for a more complete discussion). This helps limit the false positives, as very unlikely bounding boxes are automatically removed by the detector. We also assume that the noise statistics are invariant across all sensors, a condition which in practice might not be fulfilled due to different hardware and lidar processing techniques. Including the variation of uncertainties over time and designing an MPNN-based tracker with data association are non-trivial issues that require deeper research, for which this first activity serves as a starting point.
We propose an ML approach based on GNNs that exploits the availability of training data, today largely accessible in most applications. In particular, we address the data association problem through MPNNs in a V2X network where vehicles share the detections of lidar sensors in a common infrastructure (e.g., cloud-based). We choose to focus on data association using MPNNs for two main reasons. First, data association is a crucial component in cooperative sensing algorithms, as an accurate assignment of detections to tracks significantly impacts the tracking performance: our goal is to provide a solid foundation for more robust tracking solutions. Second, using MPNNs for data association is an innovative approach with great potential for learning complex relationships in graph-structured data, providing valuable insights and future research opportunities.
A preliminary version of the proposed method was presented in [75], where we developed a unique graph representation of the data association problem, handled by an MPNN model that captures the measurements' characteristics and produces a compact and effective feature representation. While in [75] we focused on a simple proof-of-concept implementation and stand-alone validation in a vehicular scenario, in this paper we extend the work with the following main contributions:
• proposal of MPNN models for the cooperative association of 3D bounding boxes from lidar sensing in vehicular networks;
• analysis of the generalization capabilities of the MPNN association model over a number of different and realistic measurement statistics;
• validation of the suggested approach in a realistic cooperative vehicular environment simulated with CARLA [76], where a central unit fuses the bounding boxes obtained by multiple vehicles from on-board lidar data using the PointPillars [77] model;
• comparison with the conventional Sum-Product Algorithm for Data Association (SPADA) [21], [22], with particular focus on association performance and generalization properties.
Note that the assessment on a synthetic cooperative dataset considers the use of an efficient 3D object detector, which has been demonstrated to provide accurate performance on challenging real-world datasets [77]. Since the primary focus of this work is on the fusion over the V2X network of bounding boxes from multiple vehicles, and not on the processing of raw lidar point clouds, signal losses or adverse weather conditions do not affect the proposed MPNN, as they only reduce the performance of the 3D object detector operating over the lidar point cloud.
Numerical results show that the proposed method is able to efficiently address the data association issue in cooperative connected multi-vehicle systems, and to correctly learn extremely complex (e.g., multi-modal) distributions, such as the realistic PointPillars outputs. Moreover, with respect to SPADA, the proposed MPNN model can achieve higher performances across different noise statistics and intensities in several circumstances.

D. Paper Organization
This paper is organized as follows: Section II introduces the system model of the cooperative sensing scenario and its graph representation. Section III firstly provides an introduction on the working principle of GNNs, and then defines the proposed MPNN solution. Section IV is devoted to performance analysis in a cooperative vehicular scenario with lidar-based pedestrian detection and to the comparison with SPADA. Finally, Section V draws the conclusion.

II. SYSTEM MODEL
Let us denote with $\mathcal{S}_n = \{1, \dots, S_n\}$ the set of connected vehicles at time $n$. A vehicle $s \in \mathcal{S}_n$ is described by the state vector $x_{s,n}$, which can include kinematic (e.g., position, velocity, etc.) and non-kinematic (e.g., identification number, category, dimension, etc.) parameters. All vehicles are connected to a central processing unit (e.g., a road side unit or a mobile edge cloud) in charge of aggregating the vehicle-generated information and providing a cooperative detection system. We assume always-available connectivity: the model and effects of the communication protocol are outside the scope of this paper. Each vehicle has a lidar sensing system embedding an ML algorithm for detecting non-cooperative vulnerable road users, here pedestrians, referred to as targets. The $k$-th target is described by the state vector $y_{k,n}$, while the set $\mathcal{Y}_{s,n}$ includes all the pedestrians detectable by vehicle $s$ (i.e., within its FOV) at time $n$. By processing the lidar point clouds gathered at the vehicles via 3D object detection methods, such as [78], [79], each target falling within the lidar sensing range can be recognized and represented by a bounding box encoding its location, extension and rotation. Each target is assumed to generate at most one bounding box at a vehicle at each time step. This assumption, known as the "data association assumption" [80], is common in object detection models for lidar point clouds, and more generally in MOT algorithms, as it helps to simplify the detection and tracking process, reduce ambiguities, and improve the overall tracking performance. The $m$-th bounding box at vehicle $s$ at time $n$ is $z^{s}_{m,n}$, and the associated target is unknown. As such, at time $n$, a sensor has a set of unpaired (to the originating target) bounding boxes $z^{s}_{n} = \{z^{s}_{1,n}, \dots, z^{s}_{M,n}\}$. Note that the set $z^{s}_{n}$ could even be empty. The union set of all bounding boxes of all vehicles at time $n$ is $\mathcal{Z}_n = \bigcup_{s=1}^{S_n} z^{s}_{n}$.
To visualize the considered vehicular scenario, in Fig. 1(a) we report the case of two vehicles, $x_{1,n}$ and $x_{2,n}$, jointly detecting two pedestrians, $y_{1,n}$ and $y_{2,n}$, through the bounding boxes $z^{1}_{n} = \{z^{1}_{1,n}, z^{1}_{2,n}\}$ and $z^{2}_{n} = \{z^{2}_{1,n}, z^{2}_{2,n}\}$ for vehicles $s = 1$ and $s = 2$, respectively. The measurement $z^{s}_{m,n}$ is described by the 3D coordinates of its eight corners, i.e., $z^{s}_{m,n} = [z^{s}_{i,m,n}]_{i=1}^{8}$, as shown in Fig. 1(b), which take into account the overall footprint and orientation of the target. To correctly associate the bounding boxes, an absolute (fixed) Cartesian spatial reference system has to be used for the identification of the corners. In this case, we choose to label as $z^{s}_{1,m,n}$ the bottom-north-east corner and as $z^{s}_{8,m,n}$ the top-south-west one.
In the proposed GNN solution, the union set $\mathcal{Z}_n$ is modeled as a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each node $i \in \mathcal{V}$ corresponds to a single measurement, while the edge $(i, j)$, with $i \neq j$, indicates a candidate association. To uniquely map the node $i \in \mathcal{V}$ to the measurement $m$ of vehicle $s$ at time $n$, we define the mapping function $\Phi_n : \mathcal{V} \to \mathcal{Z}_n \times \mathcal{S}_n$. The function $\Phi_n(i) = \{m, s\}$ cannot inherently prevent the association of two distinct measurements of the same vehicle. For this reason, we also introduce the association variable $y_{i \to j} \in \{0, 1\}$, which denotes the presence/absence of the edge $(i, j)$, i.e., whether the two bounding boxes embodied in nodes $i$ and $j$ refer to the same target. The goal of the data association algorithm (here addressed with an MPNN) is to estimate the association variable $y_{i \to j} \in \{0, 1\}$ by considering all possible pairings of bounding boxes, with the constraint $y_{i \to j} = 0$ if the mappings $\Phi_n(i)$ and $\Phi_n(j)$ refer to the same vehicle $s$. As an example of graph construction, we refer to the vehicular scenario shown in Fig. 2(a), where vehicle $x_{2,n}$ detects only $y_{1,n}$ through measurement $z^{2}_{1,n}$, while the other two vehicles $x_{1,n}$ and $x_{3,n}$ can detect both targets $y_{1,n}$ and $y_{2,n}$. It follows that the graph with the true measurement associations for such a scenario is the one indicated in Fig. 2(b), which has to be reconstructed from the fully-connected graph in Fig. 2(c) that includes all possible pairings. In the next section, we detail the proposed algorithm for estimating the connections from all possible associations, i.e., how to get the graph in Fig. 2(b) from the one in Fig. 2(c).
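The construction of the candidate graph can be illustrated with a minimal sketch. It assumes that the same-vehicle constraint is enforced up front by never creating edges between nodes of the same vehicle; this is only one possible way to apply the constraint, and the function name `build_candidate_edges` and the list `vehicle_of_node` are hypothetical names introduced for illustration.

```python
from itertools import permutations


def build_candidate_edges(vehicle_of_node):
    """Return directed candidate edges (i, j), i != j, between nodes
    whose measurements were produced by different vehicles."""
    n_nodes = len(vehicle_of_node)
    return [
        (i, j)
        for i, j in permutations(range(n_nodes), 2)
        if vehicle_of_node[i] != vehicle_of_node[j]
    ]


# Scenario similar to Fig. 2: vehicles 1 and 3 report two boxes each,
# vehicle 2 reports a single box (assumed node ordering).
vehicle_of_node = [1, 1, 2, 3, 3]
edges = build_candidate_edges(vehicle_of_node)
```

With five nodes there are 20 ordered pairs, of which the four same-vehicle pairs are excluded, so the association variables only need to be estimated on the remaining candidate edges.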
A summary of the main notation variables introduced in this section and their description is provided in Table I.

III. ADDRESSING DATA ASSOCIATION WITH MPNN
In this section, we first introduce the general concept of GNN and more specifically MPNN (Section III-A), which is the base for the proposed model. Then, we define the proposed MPNN model with an insight on possible classification strategies. Finally, we describe the loss function used to train the model, as well as the performance metrics.

A. Introduction to GNNs
Neural networks acting on graphs have been investigated for more than a decade, being originally referred to as GNNs [66], [67] and successively extended to many variants such as MPNNs [81]. A complete generalization of GNNs is formulated in [82] under the name of Graph Networks (GNs). Models in this ML family have been studied in supervised, semi-supervised, unsupervised, and reinforcement learning contexts across a wide range of problem domains. They have been used to learn the dynamics of physical systems [83], predict the chemical properties of molecules [84], optimize the communication in multi-agent networks [85], and even perform machine translation [86]. A further application domain includes vehicular environments, where GNNs are used to predict road traffic [87], [88] or to classify and segment 3D meshes and point clouds [89].
We here consider an MPNN that iteratively performs a message passing procedure over a graph $\mathcal{G}$. Iterations are indexed by $t$, the maximum number of message passing steps (a design parameter) is denoted by $T$, and $\mathcal{N}_i = \{j \in \mathcal{V} \mid (i, j) \in \mathcal{E}\}$ is the set of neighbors of node $i \in \mathcal{V}$. We also identify the so-called embeddings, i.e., attributes, of node $i$ and of edge $i \to j$. The purpose of the MPNN is to train a function that propagates information from node and edge embeddings/attributes throughout $\mathcal{G}$. The more message passing steps are performed, the more elaborated the information contained in the node and edge embeddings, similarly to how the receptive field of a Convolutional Neural Network (CNN) grows with depth. To this end, a Neural Network (NN) is present at each node and edge of the graph. The NN at each node is indicated with $g_n(\cdot)$, while the one at each edge is indicated with $g_e(\cdot)$. Since $g_n(\cdot)$ and $g_e(\cdot)$ share the same parameters across all nodes and all edges, respectively, they may be trained on small-scale graphs before being applied to large-scale problems.
For each iteration $t = 1, \dots, T$, each node $i \in \mathcal{V}$ sends the following message to its neighbors $\mathcal{N}_i$
The function $\Phi(\cdot)$ is called the aggregation function and it is invariant to permutations of its inputs (e.g., element-wise summation, mean, maximum). Concisely, and referring to Fig. 3, the message m
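To make the message passing procedure concrete, the following sketch shows one generic MPNN iteration. It is not the exact update of this paper: the inputs fed to each network, the use of single-layer networks for $g_e(\cdot)$ and $g_n(\cdot)$, the summation as aggregation function, and the equal dimension of node and edge embeddings are all assumptions made for illustration.

```python
import numpy as np


def mlp(params, x):
    """Single dense layer with ReLU, standing in for g_e or g_n."""
    weight, bias = params
    return np.maximum(0.0, weight @ x + bias)


def message_passing_step(h, m, edges, edge_params, node_params):
    """One iteration: update all edge embeddings, then all node embeddings.

    h: dict node -> embedding, m: dict (i, j) -> embedding (same dimension).
    """
    dim = len(next(iter(h.values())))
    # Edge update from the two endpoint embeddings and the previous edge embedding.
    new_m = {
        (i, j): mlp(edge_params, np.concatenate([h[i], h[j], m[(i, j)]]))
        for (i, j) in edges
    }
    new_h = {}
    for i in h:
        incoming = [new_m[(j, k)] for (j, k) in edges if k == i]
        # Permutation-invariant aggregation (element-wise sum).
        agg = np.sum(incoming, axis=0) if incoming else np.zeros(dim)
        new_h[i] = mlp(node_params, np.concatenate([h[i], agg]))
    return new_h, new_m


rng = np.random.default_rng(0)
d = 4
nodes = [0, 1, 2]
edges = [(i, j) for i in nodes for j in nodes if i != j]
h0 = {i: rng.normal(size=d) for i in nodes}
m0 = {e: rng.normal(size=d) for e in edges}
edge_params = (rng.normal(size=(d, 3 * d)), np.zeros(d))
node_params = (rng.normal(size=(d, 2 * d)), np.zeros(d))

h, m = h0, m0
for t in range(4):  # T = 4 message passing steps
    h, m = message_passing_step(h, m, edges, edge_params, node_params)
```

Because the same `edge_params` and `node_params` are reused at every edge and node, the number of trainable parameters does not depend on the graph size, which is what allows training on small graphs and deployment on larger ones.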

B. MPNN Model for Data Association
The proposed model consists of two parts: an MPNN and an edge classifier. The role of the MPNN is to process the input graph $\mathcal{G}$ derived from the measurements of all vehicles at a given time $n$, i.e., $\mathcal{Z}_n$. On the other hand, the edge classifier is a binary classifier with the role of determining the pairings of all the measurements referring to the same target, i.e., finding the association variable $y_{i \to j}$ based on association probabilities.
As a consequence, at the output of the classifier we have a set of multiple disjoint subgraphs (as in Fig. 2(b)), each of them grouping all the measurements that are hypothesized to be originated from the same target.
The MPNN model is composed of four multi-layer perceptrons (MLPs): $g_e(\cdot)$ at each edge and $g_n^{\mathrm{in}}(\cdot)$, $g_n^{\mathrm{out}}(\cdot)$, $g_n(\cdot)$ at each node. The role of the MLPs $g_e(\cdot)$ and $g_n(\cdot)$ is to update the edge and node embeddings, respectively, similarly to the conventional MPNN in (1) and (2). In contrast, $g_n^{\mathrm{in}}(\cdot)$ and $g_n^{\mathrm{out}}(\cdot)$ are introduced to better encode the structure of incoming and outgoing edges. In this way, we can split the problem into two parts and individually manage the incoming and outgoing edges at each node.
The message passing over the graph works as follows. First, we update the edge embeddings as in (1). We remark that the incoming and outgoing edges are handled separately to divide the problem into two parts, as the constructed graph for solving the data association is bidirected (i.e., undirected), which is common in most graphs used by MPNNs. However, our approach also needs to enforce the specific constraint of our data association problem, i.e., that the association edge between two measurements is conceptually the same in both directions.
After $T$ message passing steps, the edge embeddings are used by the edge classifier to obtain the association probabilities. The association variables $y_{i \to j}$ are then obtained with a thresholding operation, with threshold $\Gamma$, to pair nodes $i$ and $j$. Two nodes are associated (i.e., two bounding boxes at distinct vehicles refer to the same pedestrian) if the association probability passes the threshold as in (7), which implies $y_{i \to j} = 1$. However, it may happen that one measurement of a vehicle is associated with multiple measurements of another vehicle. To avoid this issue, a constraint is enforced such that a bounding box of a vehicle can be associated with at most one bounding box of another vehicle.
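The thresholding and the at-most-one constraint can be sketched as follows. The text does not specify how the constraint is enforced, so the greedy rule below (accept the most probable pairs first) is an assumption, as are the function name `associate`, the use of one symmetric probability per undirected pair, and the choice of treating a probability equal to $\Gamma$ as an association.

```python
def associate(probs, vehicle_of_node, gamma=0.5):
    """Greedy pairing under the at-most-one constraint.

    probs: dict {(i, j): association probability} for pairs with i < j.
    Returns the list of accepted pairs, most probable first.
    """
    taken = set()  # (node, other_vehicle) slots that are already used
    accepted = []
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    for (i, j), p in ranked:
        if p < gamma:
            break  # all remaining pairs are below the threshold
        s_i, s_j = vehicle_of_node[i], vehicle_of_node[j]
        if s_i == s_j or (i, s_j) in taken or (j, s_i) in taken:
            continue  # same vehicle, or a slot is already occupied
        taken.add((i, s_j))
        taken.add((j, s_i))
        accepted.append((i, j))
    return accepted


# Nodes 0 and 1 belong to vehicle 1, node 2 to vehicle 2.
vehicle_of_node = [1, 1, 2]
pairs = associate({(0, 2): 0.9, (1, 2): 0.7}, vehicle_of_node)
```

In this example both candidate pairs pass the threshold, but node 2 may be paired with only one box of vehicle 1, so only the more probable pair is kept.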

C. Loss and Performance Metrics
For computing the training loss and performing backpropagation, we employ the weighted binary cross-entropy in (8), which is estimated at the end of each message passing iteration $t$ after the edge classifier's prediction. Here, $w$ is a weight given to the positive class in order to compensate for the class imbalance, and it is computed using the indicator function $\mathbb{1}(\cdot)$, which returns 1 if the condition is true and 0 otherwise. Concerning the performance metrics, we adopt the accuracy measure.
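As an illustration of the loss and metrics, the sketch below uses the standard textbook forms: a positive-class weight equal to the ratio of negative to positive edges, a binary cross-entropy weighted on the positive term, and accuracy, precision and recall computed from the thresholded predictions. These forms are assumptions and may differ in detail from the definitions used in this paper; the loss is shown for a single message passing iteration.

```python
import numpy as np


def positive_class_weight(y_true):
    """Ratio of negative to positive edges (assumed weighting rule)."""
    y_true = np.asarray(y_true)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    return n_neg / max(n_pos, 1)


def weighted_bce(y_true, y_prob, w, eps=1e-12):
    """Binary cross-entropy with weight w on the positive class."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_edge = -(w * y_true * np.log(y_prob)
                 + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_edge.mean()


def accuracy_precision_recall(y_true, y_prob, gamma=0.5):
    """Metrics computed after thresholding the probabilities at gamma."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= gamma).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return accuracy, precision, recall


y_true = [1, 0, 0, 0]          # one true association among four edges
y_prob = [0.9, 0.2, 0.6, 0.1]  # classifier outputs
w = positive_class_weight(y_true)
loss = weighted_bce(y_true, y_prob, w)
acc, prec, rec = accuracy_precision_recall(y_true, y_prob)
```

In this toy case the single false positive lowers the precision to 0.5 while the recall stays at 1, which shows why accuracy alone can hide errors when most edges are labeled zero.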

IV. SIMULATION EXPERIMENTS
To evaluate the proposed MPNN model for data association we consider a network of vehicles localizing pedestrians through lidar sensing. We dedicate Section IV-A to the simulation scenario and dataset, while Section IV-B reports the results of performed simulations.

A. Simulation Scenario
Due to the unavailability of real-world cooperative perception datasets, i.e., collected by multiple and synchronous lidar-equipped vehicles, we here employ a simulator of automated driving systems that allows us to generate lidar readings at multiple vehicles moving in a synthetic, yet realistic, mobility environment. Similarly to [90], we use the CARLA simulator [76], an advanced software that integrates trajectory planning and sensing. The considered scenario is referred to as Town02 in the simulator and spans an area of roughly 200 m × 200 m. Twenty vehicles with lidar and fifty pedestrians populate the scene, unless otherwise specified. The state $x_{s,n}$ of each vehicle refers to its 3D position. A snapshot of the simulator restricted to seven vehicles with the associated point clouds is shown in Fig. 4, where we represent the effect of cooperative sensing by merging seven lidar point clouds. Specifically, for visualization purposes we group the vehicles into three subgroups and we show the partial point clouds in Figs. 4(a), 4(b) and 4(c), respectively, while the cooperative perception obtained by merging all seven point clouds is in Fig. 4(d).
The duration of the simulation is 300 s, with a sampling time of 0.2 s. This results in 1500 snapshots of the scene, each one described by the vehicles' and pedestrians' positions and the lidar detections. A top-view image of the simulation at a fixed time instant is shown in Fig. 5, where we include both vehicles (red squares) and detected pedestrians (blue triangles).
The ground truth information provided by the simulator includes the true positions of vehicles, i.e., $x_{s,n}$, $\forall s \in \mathcal{S}_n$, and the true bounding boxes around the pedestrians, each defined by its eight corners, i.e., $y_{k,n} = [y_{i,k,n}]_{i=1}^{8}$. Artificially generated measurements are obtained as in (11) by perturbing the true corners with an additive noise $w^{s}_{i,m,n}$. Note that the measurement error distribution is the same for all corners, for all time instants and across all vehicles. Unless otherwise specified, the artificial noise $w^{s}_{i,m,n}$ follows an isotropic Gaussian distribution with standard deviation $\sigma = 10$ cm. Note that the additive noise is absent when an ML model is used for the automatic extraction of bounding boxes, as in the case of PointPillars [77], since the error is embedded in the model itself.
In the simulations, we validate the systems for a variety of noise intensities ranging from extremely accurate detections up to inefficient systems (errors in the order of meters) for the considered vehicular context targeting the automation of mobility. The former case can be considered as a condition in which the vehicle position is assumed to be perfectly known and the only source of error is attributed to lidar sensing and bounding boxes extraction algorithm. The latter case, instead, embeds both vehicle uncertainty and the errors in the generation of bounding boxes. We do not consider separate effects as we aim to assess the aggregated model robustness.
We divide the overall dataset into training (700 samples) and validation (800 samples) parts, with dimensions optimized as discussed in Section IV-B2. Moreover, in order to assess the generalization of the method, we increase the number of validation samples by applying a random flip along the x and y axes of the bounding box positions, thus obtaining a total of 1600 validation samples. We remark that a sample is a snapshot of the scene at a given time instant $n$ and it is fully represented by the graph of unknown measurement pairings (Fig. 2(c)). To avoid the computational burden of dealing with too many edges, we introduce a gating which a priori discards unlikely associations, i.e., it ignores edges between bounding boxes whose centroids are more than 10 m apart. We use the Adam optimizer with a tuned learning rate of $10^{-3}$ and hyper-parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$ [91]. The performance metrics are computed using the thresholding in (7) with $\Gamma = 0.5$.
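The centroid-based gating can be sketched as below, assuming that each bounding box is stored as an array of its eight corners and that the centroid is the mean of the corners; the function name `gate_edges` and the array layout are hypothetical choices for illustration.

```python
import itertools

import numpy as np


def gate_edges(boxes, edges, max_dist=10.0):
    """Keep only edges whose box centroids are at most max_dist metres apart.

    boxes: array of shape (N, 8, 3) with the corner coordinates in metres.
    """
    centroids = np.asarray(boxes, dtype=float).mean(axis=1)
    return [
        (i, j)
        for (i, j) in edges
        if np.linalg.norm(centroids[i] - centroids[j]) <= max_dist
    ]


# Three unit cubes: two close to each other, one 50 m away.
unit_cube = np.array(list(itertools.product([0.0, 1.0], repeat=3)))
boxes = np.stack([
    unit_cube,
    unit_cube + [3.0, 0.0, 0.0],
    unit_cube + [50.0, 0.0, 0.0],
])
edges = list(itertools.permutations(range(3), 2))
kept = gate_edges(boxes, edges)
```

Only the two directed edges between the nearby boxes survive, so the MPNN never has to process pairings that are geometrically implausible.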

1) Initialization of Node and Edge Embeddings:
To initialize the node and edge embeddings, we adopt a strategy that learns how to extract feature embeddings directly from the measurements. This is done by using an MLP at each node and edge, called $g_n^{\mathrm{enc}}(\cdot)$ and $g_e^{\mathrm{enc}}(\cdot)$, respectively. For the considered cooperative lidar sensing scenario, we use the geometric characteristics of the bounding boxes as input to the two neural networks to obtain the initial node and edge embeddings. This allows the MPNN to discriminate not only the position of the detected object, but also its dimension and rotation. We find this approach to be highly effective and efficient, as it uses a minimal amount of information for data association, limiting the data exchange among vehicles. Incorporating additional features, such as individual point cloud positions, into the current feature encoding could be beneficial and would require only modifying the encoding neural networks $g_n^{\mathrm{enc}}(\cdot)$ and $g_e^{\mathrm{enc}}(\cdot)$. While this could potentially enhance performance, there are two primary drawbacks to consider. First, the volume of information that would need to be exchanged with the central entity responsible for data association via MPNN could become unmanageable, given that a lidar sensor typically outputs more than 1 million points per second. Second, increasing the number of input features might inadvertently introduce unrelated or redundant features that may not be beneficial, or could even negatively impact the inference process due to the curse of dimensionality in machine learning.
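As a hedged illustration of what geometric inputs to the encoders might look like, the sketch below derives simple features from the eight corners. The specific choices (centroid and axis-aligned extent per node, corner-wise differences per edge) and the function names are assumptions, not the feature definitions used in this work.

```python
import itertools

import numpy as np


def node_features(box):
    """Centroid and axis-aligned extent of a box given as an (8, 3) array."""
    box = np.asarray(box, dtype=float)
    centroid = box.mean(axis=0)
    extent = box.max(axis=0) - box.min(axis=0)
    return np.concatenate([centroid, extent])


def edge_features(box_i, box_j):
    """Corner-wise coordinate differences between two boxes (24 values).

    With consistently labeled corners, these differences reflect
    relative position, size and rotation of the two boxes.
    """
    diff = np.asarray(box_j, dtype=float) - np.asarray(box_i, dtype=float)
    return diff.ravel()


unit_cube = np.array(list(itertools.product([0.0, 1.0], repeat=3)))
shifted = unit_cube + [2.0, 0.0, 0.0]
f_node = node_features(unit_cube)
f_edge = edge_features(unit_cube, shifted)
```

Either feature vector contains at most 24 numbers per box or pair, which is orders of magnitude less than the raw point cloud and keeps the exchange with the central unit light.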
2) Impact of Training Dataset Dimension:
We first analyze the impact of the training dataset size on the model's performance. This is crucial in determining whether the model exhibits high or low bias. In essence, expanding the dataset size decreases the model's variance, meaning that the residual error is predominantly due to bias. In Table II, we present the validation accuracy, precision, and recall after 100 epochs for varying training dataset sizes. We note that, by increasing the number of samples, the model improves the performance metrics and the accuracy saturates at 700 samples, which represents the best accuracy reachable by the model and reflects its bias. It is noteworthy that the recall typically exceeds the precision, implying a larger number of false positives than false negatives. This is because the ground-truth graph contains a predominant number of zero-labeled edges, so the model is more prone to errors on edges labeled as zeros, despite the loss function (8) employed for unbalanced classes.
3) Impact of MPNN Iterations:
This assessment aims to verify the role of the number of message passing iterations $T$, a fundamental parameter to tune the amount of information extracted and elaborated from the data. In Fig. 6, we show the accuracy (and associated confidence) on the validation dataset over the number of epochs for $T \in \{1, 2, 4, 8, 12\}$. We notice that increasing $T$ leads to a higher accuracy and a faster convergence, at the cost of increasing computational complexity. However, the accuracy saturates for $T > 4$, leading us to select $T = 4$ as a good trade-off between accuracy, convergence and complexity. This value is used for the following analyses.

4) Impact of the Measurement Statistics:
This assessment has the goal of verifying how the MPNN model handles unobserved noise statistics, i.e., those on which it has never been trained. This is extremely useful when the model is trained in a simulated and controlled environment and then deployed in real systems, which are typically characterized by different measurement statistics. Since there is almost no literature detailing the error characteristics of real ML-based 3D object detectors, we investigate the types of noise that are currently considered in point cloud denoising algorithms. As suggested in [92], we explore five different noise statistics: the already introduced isotropic Gaussian, the non-isotropic Gaussian, the Laplace, the uniform, and the discrete one, which are defined by the distributions $f_w(w^{s}_{i,m,n})$ reported in Table III.
We also implement the ML model PointPillars [77] to process the lidar point clouds and derive the associated bounding boxes. This method allows us to assess the performance of the MPNN without resorting to the artificially generated measurements in (11), leading to a detection system that closely resembles practical scenarios. By using the detections produced by PointPillars as inputs to our data association system, we maintain a noise distribution that mirrors realistic conditions, which is essential for evaluating the effectiveness of the proposed MPNN-based data association strategy in various real-world situations. The statistics of the 3D localization error (computed as the difference between the true and estimated bounding boxes over the x, y and z axes) of PointPillars are reported in Fig. 7, showing that they can be well approximated by a zero-mean Gaussian distribution over the horizontal (x, y) space. On the other hand, for the vertical dimension z, the model is more likely to predict boxes in higher positions (i.e., above the road) than the opposite, so the error distribution is slightly biased.
Furthermore, we notice that the error statistics do not vary significantly over the three axes, suggesting that they (almost) follow an isotropic Gaussian distribution with standard deviation σ = 10 cm.
In Fig. 8 we analyze the combined impact of training and testing with different noise statistics. The value of σ for each noise distribution has been set to 10 cm, closely matching the standard deviation used for fitting the error statistics of PointPillars. The results show that training and validating the MPNN model on the same noise distribution leads to the best performance. This indicates that the model is able to obtain good accuracy regardless of the noise type, provided that the same noise is experienced in both the training and validation phases. Focusing now on the different combinations of training/validation noises, the results show that training under isotropic Gaussian or Laplace noise allows the model to generalize well over all noise types, suggesting that these distributions may be employed in real-life applications where noise statistics are not known beforehand. On the other hand, training with discrete and/or uniform noise leads to poor generalization during validation, most probably because these distributions are simplistic compared to the other noise types. Finally, training on realistic data, i.e., over the noise generated by PointPillars, does not allow the MPNN to generalize well over other distributions, particularly for Laplace and isotropic/non-isotropic Gaussian noises.
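
As an illustration of how the five noise families can be matched to a common standard deviation, the sketch below draws 3D error vectors for each of them with NumPy. The parametrizations are assumptions made for this example and are not taken from Table III: the Laplace scale is σ/√2, the uniform support is [−√3σ, √3σ], the discrete noise takes the values ±σ with equal probability, and the non-isotropic Gaussian uses hypothetical per-axis variances whose average is σ². The function name `sample_noise` is likewise hypothetical.

```python
import numpy as np


def sample_noise(kind: str, sigma: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Draw `size` zero-mean 3D error vectors with overall standard deviation `sigma`."""
    shape = (size, 3)
    if kind == "gaussian_iso":
        return rng.normal(0.0, sigma, shape)
    if kind == "gaussian_noniso":
        # Hypothetical per-axis variances (1.5, 1.0, 0.5) * sigma^2, averaging to sigma^2.
        axis_std = sigma * np.sqrt(np.array([1.5, 1.0, 0.5]))
        return rng.normal(0.0, 1.0, shape) * axis_std
    if kind == "laplace":
        # A Laplace variable with scale b has variance 2 * b^2.
        return rng.laplace(0.0, sigma / np.sqrt(2.0), shape)
    if kind == "uniform":
        # A uniform variable on [-a, a] has variance a^2 / 3.
        half_width = sigma * np.sqrt(3.0)
        return rng.uniform(-half_width, half_width, shape)
    if kind == "discrete":
        # Two equiprobable values -sigma and +sigma have variance sigma^2.
        return rng.choice([-sigma, sigma], size=shape)
    raise ValueError(f"unknown noise kind: {kind}")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sigma = 0.10  # 10 cm, expressed in metres
    for kind in ("gaussian_iso", "gaussian_noniso", "laplace", "uniform", "discrete"):
        w = sample_noise(kind, sigma, 100_000, rng)
        print(f"{kind:16s} empirical std = {w.std():.4f} m")
```

Matching the second moment in this way isolates the effect of the distribution shape from that of the noise intensity when comparing training/validation combinations.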

5) Impact of the Different Scenarios and False Positives:
This experiment aims at verifying the validation performances of the proposed model in a brand new scenario where false positives, i.e., false alarms, are present. This allows us to assess the robustness and adaptability of the model in more realistic conditions, demonstrating its potential for practical implementation.
To this purpose, in Fig. 10, we report the validation performance in the Town10 scenario of the CARLA simulator, where we vary the number of cooperating vehicles from 5 to 20. The proposed MPNN association strategy is evaluated considering both the absence (Fig. 10(a)) and presence (Fig. 10(b)) of false alarms, which are obtained from the PointPillars detector. We highlight that the model has been trained in the map Town2 illustrated in Fig. 5 neglecting any false positive, thus Town10 and the presence of false alarms are unseen conditions. Starting from the scenario without false alarms in Fig. 10(a), we note that increasing the number of vehicles leads to better accuracy, up to a plateau around 97%, which is just 2% below the results in the original scenario with PointPillars (see Fig. 8). Accounting for the false positives, we notice in Fig. 10(b) a decrease of the accuracy to 93%. Even more relevant is the precision, which falls to 78%: each false positive introduces new nodes and edges in the graph, which can be associated with real detections, leading to lower performance.
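
The different sensitivity of accuracy and precision to false alarms can be illustrated with a toy computation. The sketch below assumes that the association is scored as a binary classification over graph edges (1 for a true association, 0 otherwise); the edge labels are made-up values chosen for illustration and are not results from Fig. 10.

```python
def accuracy_precision(y_true, y_pred):
    """Return (accuracy, precision) for binary edge labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # Precision is undefined without positive predictions; 0.0 is returned by convention.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return accuracy, precision


# Toy graph without false alarms: 4 true associations, 6 non-associations,
# all classified correctly.
labels_clean = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_clean = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Same graph after a false alarm adds 6 spurious edges (true label 0),
# two of which are wrongly classified as associations.
labels_fa = labels_clean + [0, 0, 0, 0, 0, 0]
pred_fa = pred_clean + [1, 1, 0, 0, 0, 0]

print("no false alarms  :", accuracy_precision(labels_clean, pred_clean))
print("with false alarms:", accuracy_precision(labels_fa, pred_fa))
```

Because the spurious edges mostly enlarge the pool of correctly rejected edges, accuracy degrades slowly, whereas every wrongly accepted edge directly lowers the precision.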

6) MPNN vs SPADA - Generalization Capabilities:
This experiment compares the performance of the proposed MPNN association model against a conventional SPADA over different combinations of Gaussian noise intensities used in the training and validation datasets. For the SPADA, a training phase is not needed, but we can embed prior knowledge on the noise intensity by calibrating the standard deviation used for computing the measurement likelihood function. To do so, we process the training dataset and extract a single standard deviation value that characterizes the considered noise intensity.
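
One plausible way to extract such a value, sketched below, is the root-mean-square of the per-axis localization errors pooled over the training set, under the assumption of zero-mean errors. The estimator and the names `calibrate_sigma`, `true_centers` and `measured_centers` are assumptions made for illustration; the exact procedure used for SPADA is not detailed here.

```python
import numpy as np


def calibrate_sigma(true_centers: np.ndarray, measured_centers: np.ndarray) -> float:
    """Estimate one standard deviation from (N, 3) arrays of box centres.

    Assumes zero-mean errors, so the pooled RMS error over all axes is used.
    """
    errors = np.asarray(measured_centers, dtype=float) - np.asarray(true_centers, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-in for a training set: true centres plus Gaussian noise.
    true_centers = rng.uniform(-50.0, 50.0, size=(10_000, 3))
    measured_centers = true_centers + rng.normal(0.0, 0.10, size=true_centers.shape)
    sigma_hat = calibrate_sigma(true_centers, measured_centers)
    print(f"calibrated sigma = {sigma_hat:.3f} m")
```

If the errors were noticeably biased, as observed for the vertical axis of PointPillars, subtracting the mean error before computing the spread would be a reasonable alternative.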
The comparison is reported in Fig. 9. Regarding the MPNN, we show, for different training and validation datasets, the validation accuracy reached after 25 epochs, while for the SPADA we report the validation accuracy after convergence using different a-priori noise statistics (in terms of standard deviation). First, we can clearly observe that, for both algorithms, the bottom-left part of the matrix has higher accuracy than the top-right part. This is because, in general, overestimating the noise (i.e., bottom-left part) leads to a more robust model that can handle noises with lower intensity. On the contrary, underestimating the noise (i.e., top-right part) can be problematic, especially in the case of the MPNN (Fig. 9(a)). From this point of view, SPADA (Fig. 9(b)) is more robust and can better handle different noise values. Under overestimating conditions, on the other hand, the MPNN achieves superior performance compared to SPADA, reaching an accuracy of 99% against 97%.

7) MPNN vs SPADA - Performances on Different Noise Statistics:
This experiment compares the peak (absolute) performance of MPNN and SPADA when the training dataset has the same statistics as the validation dataset. Understanding the maximum performance is fundamental to obtain an upper bound for a real deployment and to characterize the learning capabilities of the algorithm/model. In Fig. 11 we report the validation accuracy reached by MPNN and SPADA for the different datasets and for different standard deviations of the noise. For the MPNN we use training and validation datasets with the same value of σ, while the standard deviation of the likelihood in the SPADA is the same as in the validation dataset. We notice that the MPNN outperforms the classic SPADA for both σ = 0.1 m and σ = 0.5 m and for all datasets. Therefore, the proposed method is able to solve the association problem and to learn both synthetic and realistic noise representations. Clearly, for the dataset obtained with PointPillars, we cannot tune the quantity of noise introduced by the ML model and consequently the red and blue circles for the dataset PointPillars coincide. Lastly, we can observe that the performance degradation when passing from σ = 0.1 m to σ = 0.5 m is worse for the MPNN. This behaviour is further investigated in the next experiment.

8) MPNN vs SPADA - Performances on Different Noise Intensity:
In this last assessment, we study how the MPNN and SPADA perform over different levels of detection accuracy. This is useful to understand whether there are conditions in which one method outperforms the other.
To this aim, we consider different standard deviations of the Gaussian measurement error. The results of this analysis are in Fig. 12, where we report the validation accuracy of the MPNN and SPADA in a scenario with 100 (Fig. 12(a)) or 50 (Fig. 12(b)) pedestrians to be detected. First, we observe that the performance for the scenario in Fig. 12(b) is generally higher than for the scenario in Fig. 12(a). This is because, with a higher number of pedestrians, the uncertainty on the data association increases and the data association becomes more challenging. Second, comparing the two methods across different noise intensities, we note that the MPNN is preferable when the standard deviation of the noise is below 1.8 m in both scenarios. We believe that this behaviour is caused by the fact that the MPNN has more difficulty learning a high-variance noise than a low-power noise. On the contrary, the SPADA depends on the standard deviation of the likelihood, which in this case is known a-priori and equal to the standard deviation of the validation dataset. Therefore, with high noise intensities, it is preferable to use SPADA, as too many samples would be needed to learn the noise directly from data.

Fig. 12. Comparison of the impact of measurement error in terms of accuracy between MPNN and SPADA for a scenario with (a) 100 pedestrians or (b) 50 pedestrians. An isotropic Gaussian distribution with standard deviation σ is considered for the additive noise statistics.

V. CONCLUSION
This paper addressed the problem of data association in a cooperative vehicular sensing scenario with multiple vehicles detecting pedestrians through lidar sensors. To solve the problem, we proposed an MPNN model based on a novel graph representation encoding node and edge feature attributes to express the detection knowledge. The validation was carried out in a vehicular environment simulated by the CARLA software, which allows reproducing realistic cooperative lidar sensing scenarios. We considered the PointPillars model for the extraction of bounding boxes from the lidar point cloud, obtaining realistic statistics of bounding box measurements. Furthermore, we compared the proposed method with the conventional SPADA to investigate the generalization capabilities and peak performances.
Results showed that the proposed MPNN model is able to learn the correct associations under several realistic measurement statistics and exhibits good generalization capabilities when dealing with untrained conditions, such as different measurement error statistics, noise intensities, numbers of vehicles and new scenarios. The lidar detection error introduced by PointPillars has been found to be well approximated by a Gaussian distribution with standard deviation equal to 10 cm. Under this condition, very high accuracy can be reached by training the model on artificial noises, e.g., Laplace or Gaussian, and then validating it on the realistic noise distribution produced by PointPillars. Concerning the comparison with the classic SPADA, we found that, under overestimation of the noise intensity, the proposed method achieves higher performance. Moreover, regarding peak performance, the MPNN outperforms SPADA up to a noise standard deviation of 1.8 m.
In the coming years, the relevance of cooperative perception is expected to grow rapidly, particularly in the context of automated and connected mobility, where new-generation V2X communication technologies bring opportunities for the development of new services. It follows that an efficient management of data association is a fundamental step for enabling cooperative sensing. As a result, we expect our work to be extended and applied to different contexts. By enhancing the data association performance, our method provides a solid foundation for more accurate and robust object tracking when combined with existing tracking algorithms that exploit the information shared by the vehicles to perform cooperative positioning or sensing of the surrounding environment.
A natural extension of the work would be to manage and account for possible false and/or missed detections through intra-temporal association and non-fully-connected vehicular networks. Future developments could also embrace the area of distributed sensor networks, in which the flood of information over sensors demands fast interactions of locally available data but guarantees higher resilience compared to a centralized architecture, overcoming the problem of a single point of failure. On the other hand, hop-by-hop transport might introduce a non-negligible time delay before the same full information is available at all nodes. In addition, we plan to evaluate our method on real-world cooperative data, which would help further validate and refine our approach, ensuring its effectiveness in addressing real-world object detection and tracking challenges.