Cooperative Deep-Learning Positioning in mmWave 5G-Advanced Networks

In application verticals that rely on mission-critical control, such as cooperative intelligent transport systems (C-ITS), 5G-Advanced networks must be able to provide dynamic positioning with accuracy down to the centimeter level. To achieve this level of precision, technology enablers such as massive multiple-input multiple-output (mMIMO), millimeter waves (mmWave), machine learning and cooperation are of paramount importance. In this paper, we propose a cooperative deep learning (DL)-based positioning methodology that combines these key technologies into a new promising solution for precise 5G positioning. Sparse channel impulse response (CIR) data are used by the positioning infrastructure to extract position-dependent features. We model the problem as a joint task composed of non-line-of-sight (NLOS) identification and position estimation, which makes it possible to suitably handle both geometrical location measurements and channel fingerprints. The network of base stations (BSs) automatically switches between egocentric (in case of NLOS) and cooperative (in case of line-of-sight, LOS) positioning mode. We perform extensive standard-compliant simulations in a 5G urban micro (UMi) vehicular scenario obtained by ray-tracing and simulation of urban mobility (SUMO) software. Results show that the proposed cooperative DL architecture is able to outperform conventional geometrical positioning algorithms operating in LOS by 47%, achieving a median error of 71 cm on unseen trajectories.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSAC.2023.3322795.
Digital Object Identifier 10.1109/JSAC.2023.3322795

[...] measurements in accordance with 3GPP standards [6], [7]. However, future releases of 5G and beyond, such as 5G-Advanced from Release 18, will face challenges in achieving the centimeter-level absolute accuracy required by the most demanding 5G use cases due to higher path loss and frequent blockages [8], [9]. Legacy solutions such as least squares (LS) multilateration/angulation may struggle to effectively handle situations with low signal-to-noise ratio (SNR), multipath ambiguities, and particularly non-line-of-sight (NLOS) conditions [10], [11]. A potential solution is already foreseen by a novel paradigm called integrated sensing and communication (ISAC) [12], [13], [14]. Specifically, 5G base stations (BSs) (also known as gNodeBs) can natively support ISAC through the use of a joint signal processing framework, allowing for a more efficient utilization of spectrum resources. The integration of communication and sensing features on the same hardware platform has not yet been commercialized. Nevertheless, synthetic datasets are emerging [15], [16], [17] with the clear objective of permitting the design of novel artificial intelligence (AI) and machine learning (ML) algorithms, which will play a fundamental role in next-generation networks [18], [19], [20], [21], [22].
In the context of cooperative intelligent transport systems (C-ITS), connected automated vehicles (CAVs) rely on ML, and more specifically deep learning (DL), for various functions, including identifying and segmenting objects within images, controlling the vehicle and avoiding collisions, and determining the most efficient route [23], [24]. Because of the high complexity of such tasks (including precise positioning), urban areas may install computing units, namely roadside units (RSUs), on busy roads, which CAVs will be able to use to offload part of the computing activities [25], [26]. Cooperation between nearby RSUs, here referred to as BSs, is of paramount importance for enabling network-based precise localization [27], [28]. A key aspect is that, in 5G industrial applications such as automated driving, BSs often have access to a large amount of historical channel state information (CSI) data that is continuously received from geolocalized vehicles [29], [30].
While perfect knowledge of the overall CSI (e.g., accurate delays, angles of arrival (AoAs) and power gains) is unfeasible, estimates of the channel impulse response (CIR) can be used by ML approaches to learn relevant information about the environment and its propagation characteristics [31]. It is important to observe that not only LOS but also NLOS CIRs embed significant location information, though with distinct distributions. Therefore, CIRs can be exploited in both cases for positioning goals. Distinguishing between LOS and NLOS conditions is extremely important since the derived measurements are fundamentally different. In LOS conditions, the extracted features can be combined by a cooperative set of BSs, as in conventional geometric methods, in order to enhance satellite positioning. On the contrary, NLOS features represent a real challenge for the system since they are related to a specific environment, acting as area-specific fingerprints. A complete solution for both situations has yet to be found. Therefore, in this paper, we propose a cooperative DL solution for the joint problem of NLOS identification and 3D position estimation that takes as input the high-dimensional sparse CIRs. We model the problem of NLOS position estimation as an egocentric system at each BS, while a cooperative architecture is proposed for LOS conditions.
Tables I and II list the main abbreviations and notations used throughout the paper, respectively.

A. Paper Organization
The remainder of this article is structured as follows. Sec. II presents an overview of the state of the art on wireless localization with ML and DL. Sec. III describes the system model, the MIMO-orthogonal frequency division multiplexing (OFDM) channel and the related angle-delay channel power matrix (ADCPM) adopted as input for ML-based positioning. In Sec. IV, we discuss the proposed supervised ML setting and the DL model stored in each BS for the joint NLOS detection-positioning task prediction. Sec. V extends the model to be compliant with a cooperative architecture based on a set of collaborative BSs for positioning purposes. Sec. VI provides information on the simulated 5G scenario and compares the proposed method with conventional positioning algorithms. Finally, in Sec. VII, we draw the conclusions.

II. RELATED WORKS AND CONTRIBUTIONS

A. Early Works on NLOS Detection and Positioning
The first works on NLOS identification and localization with ML were applied to ultra wide-band (UWB) systems. They typically used hand-crafted features of the channel, such as energy, maximum power, rise time, mean excess delay, root-mean-square delay spread, and kurtosis, as inputs [32], [33], [34], [35], [36]. The most commonly used ML models in these studies were Gaussian processes (GPs), support vector machines (SVMs), and relevance vector machines (RVMs). While achieving good results, these methods could not fully exploit the ML potential as they heavily relied on a pre-defined and limited set of features that could not express the whole location information enclosed in the CSI.
Other methods, mainly based on received signal strength (RSS) fingerprinting, were proposed for precise positioning using Wi-Fi technology [37], [38]. With the advent of MIMO-OFDM systems in the IEEE 802.11a/n protocol, allowing for the extraction of CSI from commercial Wi-Fi devices, there has been significant research on wireless positioning, behavioral awareness, and target tracking using CSI. The access to channel information over multiple carriers and antennas made it possible to extract detailed information about the propagation of the radio signal, and to learn not only the position of the user [39], [40], [41] but also information about the environment that shaped such propagation [42]. DL approaches were employed to directly learn the optimal non-linear combination of features and produce the desired NLOS classification or position estimation as output. Studies in this field can be found in [43], [44], [45], and [46], adopting convolutional neural networks (CNNs) for feature extraction.

B. DL for High-Precision Positioning
In outdoor conditions, DL-based positioning is a relatively new concept, but one with great potential for achieving high levels of accuracy. Authors in [16] achieve a mean positioning error of 1.5 m by using a cell-specific neural network (NN) in 5G Rel. 15 networks, but considering LOS environments only. NLOS prediction is handled through statistical tests [47] or directly included into the model's prediction [48], [49]. A recent study [50] adopted a variational autoencoder (VAE) to extract features and impose a Gaussian distribution on the latent features for the purpose of binary classification of samples as LOS or NLOS. While the use of an autoencoder (AE) to obtain a compact representation of the channel can yield good results, the reliance on sampling-based methods for prediction makes this approach unsuitable for real-time applications.
The employment of full CIR data, especially stacked into image-like shapes, has emerged recently as a promising approach. Authors in [51] adopted both CIR (i.e., path power gain, phase and time of flight (ToF)) and geographical information (i.e., AoA and angle of departure (AoD)) to predict the user equipment (UE) location. However, they assumed perfect CIR knowledge by ray-tracing, which is hard to achieve under practical conditions. Authors in [52] employed as input the channel frequency response (CFR) matrix computed by practical channel estimation and augmented with additive noise at training time. Despite achieving good results, the CFR matrix does not explicitly express the AoA or the ToF of each path and thus may add complexity in feature extraction. A recent work [53] adopted a 3D CNN with inception modules to directly predict the position of a UE from an ADCPM. This approach, however, relies heavily on the fingerprint sampling distance and is not able to distinguish between LOS and NLOS conditions, treating each position as equal. In other words, there is no distinction between geometrical features, useful in LOS environments, and merely position-dependent NLOS fingerprints. Furthermore, no works are available in the literature on DL-based location estimation using data collected by multiple cells, i.e., by the cooperation of BSs. Works mainly rely on centralized processing [54] or vehicle-to-vehicle (V2V) communications [55], [56].

C. Contributions
In this paper, we address the problem of precise cooperative positioning in urban environments covered by 5G mmWave networks and we propose a new DL-based approach that exploits the full potential of wideband space-time CIR for localization. The main contributions are as follows.
• We propose a new method for the extraction of location-related features from the CIR of 5G mmWave [...] [57] and provides realistic outdoor conditions through the use of MATLAB ray-tracing software. We simulate multiple trajectories of vehicles, i.e., UEs, created with simulation of urban mobility (SUMO) software [58]. Simulations are used to assess the performance of the DL-based method, showing significant gains over conventional techniques.

Notation
We denote with $j = \sqrt{-1}$ the imaginary unit. Column vectors and matrices are denoted by lower- and upper-case characters, respectively. Matrix conjugation, transposition and conjugate transposition are indicated as $\mathbf{A}^*$, $\mathbf{A}^T$ and $\mathbf{A}^H$, respectively. We indicate the Hadamard and Kronecker product between two matrices with $\odot$ and $\otimes$, respectively. The symbol $\mathbb{E}[\cdot]$ is used for the expectation of a random variable, whereas $\mathbb{C}$ and $\mathbb{R}$ are the sets of complex and real numbers, respectively. We denote with $\mathcal{N}(x; \mu, \sigma^2)$ the distribution of a Gaussian random variable $x$ with mean $\mu$ and standard deviation $\sigma$. $\delta(\cdot)$ and $\delta[\cdot]$ indicate the Dirac delta and Kronecker delta functions, respectively, while $\lfloor x \rfloor$ represents the largest integer not greater than $x$.

III. SYSTEM MODEL
In this section, we define the channel model and the location-related channel fingerprints that will be used by the proposed DL positioning algorithms in Sec. IV. For simplicity of notation, we here report the multi-user single-BS case, leaving the extension to a set of cooperative BSs to Sec. V.

A. Channel Model
Let us consider a wide-bandwidth multi-user MIMO-OFDM system operating at carrier wavelength $\lambda_c$, where $K$ UEs communicate with a BS in the uplink direction. Each UE is equipped with an omnidirectional antenna, whereas a uniform planar array (UPA) with $N \times M$ isotropic antenna elements (i.e., $N$ and $M$ elements in the vertical and horizontal directions, respectively) is installed in the cell panel of the BS. The antenna elements are separated by a distance $d^{(h)}$ and $d^{(v)}$ in the horizontal and vertical direction, respectively. A multipath channel with $N_k$ paths is present between the BS and UE $k$. The overall scenario is represented in Fig. 1, where the generic $p$-th path of the $k$-th UE channel is shown, jointly with the direction of arrival (DoA) of the signal impinging on the antenna panel, composed of a zenith angle $0 \leq \theta_{k,p} \leq \pi$ and an azimuth angle $0 \leq \phi_{k,p} \leq \pi$.
We consider an OFDM modulation with sampling interval $T_s$, $N_c$ sub-carriers and symbol duration $T_c = N_c T_s$. The $\ell$-th sub-carrier has frequency $f_\ell = \ell / T_c$, $\ell = 0, \ldots, N_c - 1$, and we assume that the cyclic-prefix duration $T_g = N_g T_s$ is greater than the maximum channel delay among all UEs, $\tau_{\mathrm{MAX}}$, where $N_g$ is the number of sampling intervals that compose a guard interval. The temporally resolvable propagation delay related to the $p$-th path and $k$-th UE is indicated with $r_{k,p} = \lfloor \tau_{k,p} / T_s \rfloor$. According to this notation, we model the baseband CIR of user $k$ as [59]:

$$\mathbf{h}_k(\tau) = \sum_{p=1}^{N_k} \alpha_{k,p}\, \mathbf{e}(\theta_{k,p}, \phi_{k,p})\, \delta(\tau - \tau_{k,p}), \tag{1}$$

where $\alpha_{k,p} = a_{k,p}\, e^{-j 2\pi \left( d_{k,p}/\lambda_c - \nu_{k,p} \tau_{k,p} \right)}$ is the complex gain of the $p$-th path, which also includes the frequency shift due to the Doppler $\nu_{k,p}$ and has average power $\mathbb{E}[|\alpha_{k,p}|^2]$, $d_{k,p} = c\, \tau_{k,p}$ is the traveled distance (with $c$ being the speed of light in air), $\delta(\tau - \tau_{k,p})$ is the Dirac delta function and $\mathbf{e}(\theta_{k,p}, \phi_{k,p}) \in \mathbb{C}^{MN}$ is the array response vector [60]. We assume that over the time interval $\tau_{\mathrm{MAX}}$, the rotation due to the Doppler is almost constant.
Considering a sampling rate $1/T_s$ and assuming that the different paths are independent and wide-sense stationary [59], we can write the CFR at the $\ell$-th sub-carrier as [61]:

$$\mathbf{h}_{k,\ell} = \sum_{p=1}^{N_k} \bar{\alpha}_{k,p}\, \mathbf{e}(\theta_{k,p}, \phi_{k,p}), \tag{2}$$

where $\bar{\alpha}_{k,p} = \alpha_{k,p}\, e^{-j 2\pi \tau_{k,p} f_\ell}$ are the equivalent channel gains in the frequency domain. Concatenating the different CFRs at each sub-carrier, we obtain the space-frequency channel response matrix (SFCRM):

$$\mathbf{H}_k = \left[ \mathbf{h}_{k,0}, \mathbf{h}_{k,1}, \ldots, \mathbf{h}_{k,N_c-1} \right] \in \mathbb{C}^{MN \times N_c}, \tag{3}$$

which will be used in the next section for ADCPM extraction.
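As a small numerical illustration of the quantities defined above, the following Python sketch evaluates the sub-carrier frequencies $f_\ell = \ell / T_c$ and the resolvable delay index $r_{k,p} = \lfloor \tau_{k,p} / T_s \rfloor$. The function names and the values of $T_s$, $N_c$ and the path delay are hypothetical and chosen only for illustration; they are not the simulation parameters of this paper.

```python
import math


def subcarrier_frequencies(n_c, t_s):
    """Return f_l = l / T_c for l = 0..N_c-1, with T_c = N_c * T_s."""
    t_c = n_c * t_s
    return [l / t_c for l in range(n_c)]


def delay_index(tau, t_s):
    """Return r = floor(tau / T_s), the temporally resolvable delay bin."""
    return math.floor(tau / t_s)


# Hypothetical values (assumptions, not taken from the paper).
T_S = 2.5e-9   # sampling interval [s]
N_C = 8        # number of sub-carriers
TAU = 101e-9   # propagation delay of one path [s]

freqs = subcarrier_frequencies(N_C, T_S)
r = delay_index(TAU, T_S)
```

Under these assumed values, two paths whose delays differ by less than $T_s$ may fall into the same delay bin, which is why the delay resolution of the fingerprint is tied to the sampling interval.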

B. ADCPM Location Fingerprints
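For location estimation, it is convenient to convert the channel response to the angle-delay domain, where the identification of the LOS component (if present) and of the secondary NLOS macro-paths is facilitated. In fact, in LOS conditions, AoA and ToF can be effectively used to localize a UE thanks to their geometric relationship with the location. At the same time, in NLOS circumstances, different surrounding environments hold different channel parameters, acting as location-dependent features (or fingerprints). Therefore, we transform the SFCRM in (3) into an angle-delay channel response matrix (ADCRM) by introducing the phase-shifted discrete Fourier transform (DFT) matrices $\mathbf{V}_M \in \mathbb{C}^{M \times M}$ and $\mathbf{V}_N \in \mathbb{C}^{N \times N}$, with $[\mathbf{V}_M]_{i,j} = \frac{1}{\sqrt{M}} e^{-j 2\pi i (j - M/2)/M}$ and $[\mathbf{V}_N]_{i,j} = \frac{1}{\sqrt{N}} e^{-j 2\pi i (j - N/2)/N}$. We denote with $\mathbf{F} \in \mathbb{C}^{N_c \times N_g}$ the matrix formed by the first $N_g$ columns of the $N_c$-dimensional unitary DFT matrix, where $[\mathbf{F}]_{i,j} = \frac{1}{\sqrt{N_c}} e^{-j 2\pi i j / N_c}$. Finally, we compute the ADCRM as [53]:

$$\mathbf{G}_k = \frac{1}{\sqrt{M N N_c}} \left( \mathbf{V}_M \otimes \mathbf{V}_N \right)^H \mathbf{H}_k \mathbf{F}^*, \tag{4}$$

where $\mathbf{F}$ and $\mathbf{V}_M \otimes \mathbf{V}_N$ project the SFCRM into the delay and angle domain, respectively.
For positioning purposes, we propose to use the power-angle-delay profile, represented by the ADCPM:

$$\mathbf{P}_k = \mathbb{E}\left[ \mathbf{G}_k \odot \mathbf{G}_k^* \right], \tag{5}$$

where $\mathbf{P}_k \in \mathbb{R}^{MN \times N_g}$. It can be shown indeed that for $N$, $M$ and $N_g \to \infty$, the ADCPM approximates a sparse matrix with elements $[\mathbf{P}_k]_{i,j}$ matching the $i$-th AoA and the $j$-th ToF [53]:

$$[\mathbf{P}_k]_{i,j} \approx \sum_{p=1}^{N_k} \mathbb{E}[|\alpha_{k,p}|^2]\, \delta\left[ i - (m_{k,p} N + n_{k,p}) \right] \delta\left[ j - r_{k,p} \right], \tag{6}$$

where $r_{k,p}$ indicates the index of the $j$-th ToF and $m_{k,p} N + n_{k,p}$ refers to the index of the $i$-th AoA. Therefore, the statistical information of the ADCPM enables the DL model to learn the location-dependent characteristics, delivering steady and trustworthy fingerprints for positioning.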
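The sparsity of the ADCPM along the delay axis can be checked numerically. The sketch below is a simplified illustration under explicit assumptions: a single receive antenna (so the angular projection reduces to a scalar), path delays that fall exactly on the sampling grid, a unitary DFT normalization $[\mathbf{F}]_{\ell,j} = e^{-j 2\pi \ell j / N_c} / \sqrt{N_c}$, and a single deterministic channel realization in place of the expectation. All function names and numerical values are hypothetical.

```python
import cmath
import math


def cfr_single_antenna(paths, n_c):
    """CFR over N_c sub-carriers for one receive antenna.

    paths: list of (alpha, r) with complex gain alpha and integer delay
    bin r, so that tau * f_l = r * l / N_c.
    """
    return [
        sum(a * cmath.exp(-2j * math.pi * r * l / n_c) for a, r in paths)
        for l in range(n_c)
    ]


def delay_power_profile(h, n_g):
    """Project the CFR onto the first N_g delay bins and return the power."""
    n_c = len(h)
    profile = []
    for j in range(n_g):
        # Entry j of h^T F^* / sqrt(N_c), with F the unitary DFT matrix:
        # the two 1/sqrt(N_c) factors combine into a division by N_c.
        g = sum(h[l] * cmath.exp(2j * math.pi * l * j / n_c)
                for l in range(n_c)) / n_c
        profile.append(abs(g) ** 2)
    return profile


# Two hypothetical paths: unit gain at bin 3, gain 0.5j at bin 10.
example_paths = [(1.0 + 0.0j, 3), (0.5j, 10)]
h = cfr_single_antenna(example_paths, n_c=64)
profile = delay_power_profile(h, n_g=16)
```

In this idealized setting the profile is non-zero only at the delay bins of the two paths, with values equal to the path powers. With off-grid delays and a finite number of sub-carriers, leakage into neighbouring bins is expected.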

C. DL Model Input
We propose to employ the ADCPM as a set of measurements for location estimation. This sparse matrix provides indeed a visual image of the multipath configuration in the power-angle-delay domain, from which a DL model such as a CNN can extract the most representative location-related features. Additionally, since the first layers of a CNN are often sparse and collect the highly discriminating features, the CNN eases feature extraction from the sparse ADCPM channel matrix [62]. Moreover, the ADCPM embodies all the relevant information (i.e., ToF, AoA and RSS for every path) with small storage and low complexity thanks to the channel sparsity. To highlight this aspect, in Fig. 2 we show an example of ADCPM $\mathbf{P}_k$ composed of $N_g = 352$ delay samples and $MN = 64$ angle directions. The actual model input can be seen in Fig. 2a, while the polar representation of the ADCPM with physical AoAs and ToFs is in Fig. 2b. Even in the absence of a large number of antennas or a high sample resolution, the matrix sparsity is clearly visible. Therefore, we propose to use the ADCPM as input to the model and we denote the $i$-th sample with $\mathbf{x}_i = \mathbf{P}_i$, dropping the index $k$ for ease of notation.

IV. METHODOLOGY: SINGLE-BS LOCALIZATION
In this section, we first tackle the localization problem in a supervised setting in which a single BS has to locate the UE. We propose a DL model (Sec. IV-A) and a loss function for the joint task of position estimation and LOS identification (Sec. IV-B). The approach will then be extended to the multi-BS, i.e., cooperative, case in Sec. V.

A. Deep Learning Model
We assume a supervised ML setting in which both a regression and a classification problem have to be solved by a single BS. The regression problem refers to the estimation of the UE's position, while the classification problem concerns the LOS/NLOS identification. We define the training dataset as $\mathcal{S}_{\mathrm{train}} = \{(\mathbf{x}_i, \mathbf{u}_i, s_i)\}_{i=1}^{N_{\mathrm{train}}}$, where $\mathbf{x}_i = \mathbf{P}_i$ denotes the $i$-th input sample (i.e., ADCPM channel response), $\mathbf{u}_i$ represents the 3D position and $s_i \in \{0, 1\}$ is the Boolean identifier of the sight condition. To validate the performance, we also hold a similar test dataset $\mathcal{S}_{\mathrm{test}}$ composed of $N_{\mathrm{test}}$ samples.
We note that the regression and classification tasks are two interrelated problems that must be addressed accordingly. In fact, if the UE lies in a LOS condition, its position can be directly computed from the geometrical features of the direct path (i.e., ToF, AoA and mean power), while NLOS typically requires a finer training based on more complex multipath fingerprints.
To extract such key features from the ADCPM samples $\mathbf{x}_i$, we propose to employ an AE, whose structure is represented in Fig. 3. The encoder $E(\cdot)$ is used to produce the hidden (or latent) features $\mathbf{z}_i$ (including the location-related information embedded in the channel), while the decoder $D(\cdot)$ tries to reconstruct the input samples, obtaining $\hat{\mathbf{x}}_i$. The AE is designed so as to minimize a metric of the reconstruction error $\|\mathbf{x}_i - \hat{\mathbf{x}}_i\|_2^2$ [63], making the model able to reconstruct the input $\mathbf{x}_i$ from the low-dimensional data $\mathbf{z}_i$. This guarantees that $\mathbf{z}_i$ contains all the necessary and sufficient information to accomplish the specific task.
The tasks herein considered are position regression and LOS identification. Therefore, in principle, two NNs with $\mathbf{z}_i$ as input should be sufficient. However, due to the major difference between position estimation in LOS and NLOS conditions, we propose to use three separate NNs: one for LOS identification, one for position regression in LOS conditions and one for position regression in NLOS conditions. Given the overall model with parameters defined as $\mathbf{W}$, the outputs of the NNs are respectively $\hat{p}_{s,i} \in [0, 1]$, $\hat{\mathbf{u}}_{\mathrm{LOS},i} \in \mathbb{R}^3$ and $\hat{\mathbf{u}}_{\mathrm{NLOS},i} \in \mathbb{R}^3$. Specifically, $\hat{p}_{s,i} = p(s_i)$ predicts the probability that the sample $\mathbf{x}_i$ relates to a UE in LOS condition, while $\hat{\mathbf{u}}_{\mathrm{LOS},i}$ and $\hat{\mathbf{u}}_{\mathrm{NLOS},i}$ predict the 3D position of the UE in LOS and NLOS settings, respectively. The overall position estimate, indicated with $\hat{\mathbf{u}}_i$, is obtained by applying a threshold $\Gamma$ to $\hat{p}_{s,i}$ and considering only $\hat{\mathbf{u}}_{\mathrm{LOS},i}$ or $\hat{\mathbf{u}}_{\mathrm{NLOS},i}$ according to the result:

$$\hat{\mathbf{u}}_i = \begin{cases} \hat{\mathbf{u}}_{\mathrm{LOS},i}, & \text{if } \hat{p}_{s,i} \geq \Gamma, \\ \hat{\mathbf{u}}_{\mathrm{NLOS},i}, & \text{otherwise}. \end{cases} \tag{7}$$

The position estimates in LOS and NLOS are then adopted in the loss function for the model training, described in the next section.
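The thresholding step that selects between the two regression heads can be sketched in a few lines of Python. The function name, the default value of the threshold $\Gamma$ and the choice of a non-strict inequality are assumptions made for illustration only.

```python
def select_position(p_los, u_los, u_nlos, gamma=0.5):
    """Return the LOS estimate if p_los >= gamma, else the NLOS estimate.

    p_los:  predicted probability of the LOS condition, in [0, 1]
    u_los:  3D position predicted by the LOS regression head
    u_nlos: 3D position predicted by the NLOS regression head
    gamma:  decision threshold (0.5 is an assumed default)
    """
    return u_los if p_los >= gamma else u_nlos


# Hypothetical outputs of the three heads for two samples.
estimate_a = select_position(0.92, (10.0, 4.0, 1.5), (12.0, 7.0, 1.5))
estimate_b = select_position(0.08, (10.0, 4.0, 1.5), (12.0, 7.0, 1.5))
```

A hard threshold keeps inference simple, since only one regression head contributes to each estimate. The threshold can be tuned to trade missed LOS detections against false ones.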

B. Loss Function for Joint Sight Detection and Localization
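In order to train the model, we have to define a loss function whose objective is to enable learning the correct representation of the latent features and, at the same time, jointly estimating the sight condition and the location. To this aim, we first treat the classification and regression tasks separately and then combine them into an overall objective function. For the classification task, we propose a discriminative probabilistic approach, namely the maximum likelihood [64], where we directly define the posterior conditional probability $p(s_i | \mathbf{x}_i)$ using a parametric model $\mathbf{W}$, i.e., $p(s_i | \mathbf{x}_i) = p(s_i | \mathbf{x}_i, \mathbf{W})$. Subsequently, we maximize the likelihood of the model $p(s_i | \mathbf{x}_i, \mathbf{W})$ by optimizing the parameters $\mathbf{W}$ over the training set. For the specific binary classification problem, with $s_i = 1$ denoting the LOS condition, the likelihood function is:

$$p(s_i | \mathbf{x}_i, \mathbf{W}) = \hat{p}_{s,i}^{\,s_i} \left( 1 - \hat{p}_{s,i} \right)^{1 - s_i}. \tag{8}$$

For the regression task, we reiterate the discriminative approach with the constraint of belonging to a specific condition, either LOS or NLOS. Again, this is done considering that the two circumstances hold different statistics. We assume that in the LOS case, the target variable $\mathbf{u}_i$ (i.e., the location) is Gaussian distributed with a deterministic mean $\hat{\mathbf{u}}_{\mathrm{LOS},i}(\mathbf{x}_i, \mathbf{W})$:

$$\mathbf{u}_i = \hat{\mathbf{u}}_{\mathrm{LOS},i}(\mathbf{x}_i, \mathbf{W}) + \boldsymbol{\epsilon}_{\mathrm{LOS}}, \tag{9}$$

where $\boldsymbol{\epsilon}_{\mathrm{LOS}}$ is a zero-mean Gaussian random variable with covariance $\mathbf{C}_{\mathrm{LOS}} = \mathbf{I}_3 \sigma^2_{\mathrm{LOS}}$, with $\mathbf{I}_3$ denoting the $3 \times 3$ identity matrix. The likelihood function is thus given by:

$$p(\mathbf{u}_i | \mathbf{x}_i, s_i = 1, \mathbf{W}) = \mathcal{N}\left( \mathbf{u}_i; \hat{\mathbf{u}}_{\mathrm{LOS},i}, \mathbf{C}_{\mathrm{LOS}} \right). \tag{10}$$

Similarly, the likelihood function of the regression problem in NLOS conditions can be written as:

$$p(\mathbf{u}_i | \mathbf{x}_i, s_i = 0, \mathbf{W}) = \mathcal{N}\left( \mathbf{u}_i; \hat{\mathbf{u}}_{\mathrm{NLOS},i}, \mathbf{C}_{\mathrm{NLOS}} \right), \tag{11}$$

with $\mathbf{C}_{\mathrm{NLOS}} = \mathbf{I}_3 \sigma^2_{\mathrm{NLOS}}$. Note that the requirement on the mono-modality of $\mathbf{u}_i$, both in LOS and NLOS, can be easily relaxed by applying, for example, a mixture of experts model [65].
Combining (8), (10) and (11), we define the joint likelihood for the variables $(\mathbf{u}_i, s_i)$ as:

$$p(\mathbf{u}_i, s_i | \mathbf{x}_i, \mathbf{W}) = \left[ \hat{p}_{s,i}\, \mathcal{N}\left( \mathbf{u}_i; \hat{\mathbf{u}}_{\mathrm{LOS},i}, \mathbf{C}_{\mathrm{LOS}} \right) \right]^{s_i} \left[ \left( 1 - \hat{p}_{s,i} \right) \mathcal{N}\left( \mathbf{u}_i; \hat{\mathbf{u}}_{\mathrm{NLOS},i}, \mathbf{C}_{\mathrm{NLOS}} \right) \right]^{1 - s_i}. \tag{12}$$

A representation of the distribution can be found in Fig. 4.
In order to maximize the likelihood, we insert the negative log-likelihood in the loss function. It can be shown that the negative log-likelihood of a batch of independent samples can be written as (see Appendix A):

$$-\ln \prod_{i=1}^{N_b} p(\mathbf{u}_i, s_i | \mathbf{x}_i, \mathbf{W}) = -\sum_{i=1}^{N_b} s_i \ln \left[ \hat{p}_{s,i}\, \mathcal{N}\left( \mathbf{u}_i; \hat{\mathbf{u}}_{\mathrm{LOS},i}, \mathbf{C}_{\mathrm{LOS}} \right) \right] - \sum_{i=1}^{N_b} (1 - s_i) \ln \left[ \left( 1 - \hat{p}_{s,i} \right) \mathcal{N}\left( \mathbf{u}_i; \hat{\mathbf{u}}_{\mathrm{NLOS},i}, \mathbf{C}_{\mathrm{NLOS}} \right) \right], \tag{13}$$

where $N_b$ is the batch size. Whenever the LOS condition occurs, the second term on the right-hand side of (13) does not contribute to the likelihood, whereas if NLOS holds, only the second term is considered. Finally, the complete adopted loss function (14) combines the reconstruction error of the AE with the classification and regression terms of (13) through three weights: $w_{\mathrm{rec}}$ regulates the sample reconstruction, $w_s$ compensates for the imbalance between LOS and NLOS samples, and $w_u$ controls the uncertainty of the model on the position estimation and includes both $\sigma^2_{\mathrm{LOS}}$ and $\sigma^2_{\mathrm{NLOS}}$.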
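To make the structure of the batch negative log-likelihood concrete, the following Python sketch evaluates it for a list of samples, assuming that $s_i = 1$ denotes LOS and that both regression errors are isotropic Gaussians. It covers only the likelihood part, not the reconstruction term or the weights $w_{\mathrm{rec}}$, $w_s$ and $w_u$ of the complete loss. The function names, the dictionary keys and the default variances are hypothetical.

```python
import math


def gaussian_nll(u, u_hat, sigma2):
    """Negative log of an isotropic Gaussian N(u; u_hat, sigma2 * I)."""
    dim = len(u)
    sq_err = sum((a - b) ** 2 for a, b in zip(u, u_hat))
    return 0.5 * dim * math.log(2.0 * math.pi * sigma2) + sq_err / (2.0 * sigma2)


def joint_nll(batch, sigma2_los=1.0, sigma2_nlos=1.0, eps=1e-12):
    """Negative log-likelihood of (u_i, s_i) summed over a batch.

    batch: iterable of dicts with keys
      "s" (1 = LOS, 0 = NLOS), "p_los", "u", "u_los", "u_nlos".
    eps guards the logarithm against probabilities equal to 0.
    """
    total = 0.0
    for b in batch:
        if b["s"] == 1:
            # LOS sample: only the LOS classifier/regressor term is active.
            total += -math.log(b["p_los"] + eps)
            total += gaussian_nll(b["u"], b["u_los"], sigma2_los)
        else:
            # NLOS sample: only the NLOS term is active.
            total += -math.log(1.0 - b["p_los"] + eps)
            total += gaussian_nll(b["u"], b["u_nlos"], sigma2_nlos)
    return total
```

For a LOS sample, the value returned does not depend on the NLOS position estimate, and vice versa, which is the gating behaviour described above.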

V. METHODOLOGY: COOPERATIVE LOCALIZATION
In this section, we extend the approach for UE positioning to a multi-BS scenario. We first present the proposed cooperative AE architecture and then the cooperative training procedure that can be used in practice to deploy the set of BSs composing the localization infrastructure.

A. Cooperative Architecture
We propose a cooperative architecture where each BS adopts a cell-specific DL model. The main assumptions behind the proposed architecture are the following. First, each BS is able to evaluate whether a UE is in LOS or NLOS condition based on the observed ADCPM and the procedure described in Sec. IV. Second, in case of NLOS, a position estimate is obtained by the BS by means of a previous fingerprint training procedure. Third, the latent geometrical features adopted by the LOS position estimation are combined in order to get a more accurate inference. This is somehow intuitive if we consider the latent features as a sort of non-linear combination of ToF and AoA measurements that each BS can share with the neighboring cells.
Let us denote by $\mathcal{S}_{\mathrm{BS},i} = \{1, \ldots, S_{\mathrm{BS},i}\}$ the set of BSs which detect a UE at timestep $i$ and gather a batch of samples $\mathcal{X}_i = \{\mathbf{x}_i^{(j)}\}_{j \in \mathcal{S}_{\mathrm{BS},i}}$. As summarized in Algorithm 1, the main idea is that, for position inference, each BS $j$ computes its latent features $\mathbf{z}_i^{(j)}$. At this stage, there is no difference between $\mathbf{z}_{\mathrm{LOS},i}^{(j)}$ and $\mathbf{z}_{\mathrm{NLOS},i}^{(j)}$, and the inference proceeds as in the single-BS method. Then, if the BS predicts a NLOS condition, the latent features extracted by that BS are not combined with those of other cells (e.g., as in a multilateration) and position estimation continues according to the $\hat{\mathbf{u}}_{\mathrm{NLOS},i}^{(j)}$ prediction. On the contrary, if we are in LOS condition, then the latent LOS features $\mathbf{z}_{s,i}^{(j)}$ are exchanged with the neighboring cells, defined as $\mathcal{N}_i^{(j)} = \mathcal{S}_{\mathrm{BS},i} \setminus \{j\}$.
Algorithm 1 (cooperative inference at BS $j$): extract the latent features through encoder $E$; assign $\mathbf{z}_{\mathrm{LOS},i}^{(j)} \leftarrow \mathbf{z}_i^{(j)}$ and $\mathbf{z}_{\mathrm{NLOS},i}^{(j)} \leftarrow \mathbf{z}_i^{(j)}$; for each neighbor $j'$, send the local LOS quantities to $j'$ and receive those of $j'$; assign the combined features $\hat{\mathbf{z}}_i^{(j)}$.
After averaging the latent LOS features $\mathbf{z}_{s,i}^{(j)}$, the position is estimated with $\hat{\mathbf{u}}_{\mathrm{LOS},i}^{(j)}$. As an example of cooperative inference, we refer to the scenario shown in Fig. 5, where three BSs detect a UE, here represented by a vehicle. NLOS and LOS links are indicated with red and green dashed arrows, respectively. Since BSs $j$ and $j''$ are in LOS condition with respect to the vehicle, the contribution of $\mathbf{z}_{s,i}^{(j)}$ and $\mathbf{z}_{s,i}^{(j'')}$ to the position estimation will be higher than that of $\mathbf{z}_{s,i}^{(j')}$. The negligible contribution of NLOS BSs is highlighted with red solid arrows, while significant latent features are colored in green. This is due to the fact that the probability of LOS condition of BS $j'$ will be low, i.e., $\hat{p}_{s,i}^{(j')} < \hat{p}_{s,i}^{(j)}$ and $\hat{p}_{s,i}^{(j')} < \hat{p}_{s,i}^{(j'')}$. The outcome is an increased accuracy of the position estimate, which is little or not at all affected by NLOS BSs.
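One plausible way to realize the averaging step, in which BSs with a low LOS probability contribute little, is a weighted mean of the latent vectors with the LOS probabilities as weights. The Python sketch below shows this option. The weighting rule, the function name and the numerical values are assumptions for illustration and are not necessarily the exact combination rule of the proposed architecture.

```python
def combine_los_features(features, p_los):
    """LOS-probability-weighted average of latent vectors from several BSs.

    features: list of latent vectors (one per BS), all of the same length
    p_los:    list of LOS probabilities predicted by the same BSs
    """
    total = sum(p_los)
    if total == 0:
        raise ValueError("at least one BS must report a non-zero LOS probability")
    dim = len(features[0])
    return [
        sum(p * z[d] for p, z in zip(p_los, features)) / total
        for d in range(dim)
    ]


# Hypothetical example with three BSs (j, j', j''): the second one is in NLOS
# and reports a LOS probability of zero, so its features are ignored.
z_hat = combine_los_features(
    features=[[1.0, 2.0], [9.0, 9.0], [3.0, 4.0]],
    p_los=[0.8, 0.0, 0.8],
)
```

With this rule, a BS that is almost certainly in NLOS has a vanishing influence on the combined features, which matches the qualitative behaviour described for BS $j'$.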

B. Cooperative Training
For the training of the cooperative set of BSs, we propose the following procedure. During the data gathering phase, a UE (e.g., a vehicle) moves along specified trajectories (a priori divided into LOS and NLOS segments), sending uplink signals, i.e., sounding reference signals (SRSs), to every BS in its range. The time instant of the transmission, the coarse position obtained from global navigation satellite systems (GNSS) and the LOS identification are either sent to the BSs as auxiliary data through the communication link, or stored inside the UE for post-processing. As for the inference, the BSs evaluate the ADCPM samples and store the data. In order to perform back-propagation at BS $j$, the loss function is computed as in Sec. IV-B, where the LOS position estimate is evaluated on the combined latent features $\hat{\mathbf{z}}_i^{(j)}$ and the NLOS position estimate on the local latent features $\mathbf{z}_i^{(j)}$. Note that, to speed up the training phase, a centralized loss computation can be performed in a batch manner, i.e., in parallel, using $S_{\mathrm{BS},i}$ as batch size. The rationale behind this approach is that while the models for NLOS position estimation are trained to be BS-specific, the LOS networks can be shared as they have consistent parameters among all the BSs.

VI. PERFORMANCE ASSESSMENT IN A 5G NETWORK
To assess the performance of the proposed cooperative DL positioning system, we employ a data generator based on the 5G new radio (NR) clustered delay line (CDL) channel model [66], which is defined over a bandwidth of 2 GHz in the frequency range from 0.5 GHz to 100 GHz. The radio wave propagation is simulated using a ray-tracing method [67], [68], [69] from the MATLAB package, which computes the propagation paths from the UE to the BSs based on the surface geometry from a map file. The ray-tracing method employs the shooting and bouncing rays (SBR) method with up to 10 path reflections [70]. The channel model is then generated by combining all the paths and taking into account the small-scale fading caused by multipath and the UE's movement. With this method, adjacent positions will have similar channel characteristics due to the similar scattering environment, ensuring spatial consistency.

A. Simulated Scenario
For the experiments, we simulate a 3GPP urban micro (UMi) scenario in a 1000 × 1000 m area near the Leonardo campus of Politecnico di Milano, in Milano, Italy, using the parameters described in [57]. The specific values are summarized in Table III. As shown in Fig. 6, the scenario consists of 19 sites with an inter-site distance (ISD) of 200 m, placed in a hexagonal layout, each equipped with 3 cells separated by 120 deg in azimuth. Each cell antenna uses a UPA configuration with $N = M = 8$ elements and a downtilt of 15 deg. A macro image of the antenna array pattern can be found in Fig. 7. Each antenna element was defined using the specifications in [71], providing a front-to-back ratio of about 30 dB and a maximum gain of 8 dBi.
The UEs move around the region following various routes generated by the SUMO simulator, which reproduces the vehicular traffic flow over the considered road network. The simulation runs for 170 seconds and generates up to 50 trajectories, which are sampled every second. Each UE uses a carrier frequency of $f_c = 30$ GHz and a transmission bandwidth of $B = 400$ MHz to transmit 5G standard-compliant SRSs to all nearby BSs. Finally, each BS demodulates the signals and obtains a channel estimate, i.e., the SFCRM, from the received pilot signals through a LS estimator. The ADCPM is then obtained according to (4) and (5).
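The site layout described above (19 sites, ISD of 200 m, 3 cells per site) can be reproduced with a short Python sketch based on axial hexagonal coordinates. The origin of the coordinate system, the orientation of the grid and the 0 deg boresight offset of the first sector are assumptions, since they are not specified in the text. The function name is hypothetical.

```python
import math


def hexagonal_sites(isd=200.0, rings=2):
    """Site coordinates of a hexagonal layout centred at the origin.

    Uses axial hex coordinates (q, r) with max(|q|, |r|, |q + r|) <= rings,
    which gives 1 + 6 + 12 = 19 sites for rings = 2.
    """
    sites = []
    for q in range(-rings, rings + 1):
        for r in range(-rings, rings + 1):
            if abs(q + r) <= rings:
                x = isd * (q + r / 2.0)
                y = isd * math.sqrt(3.0) / 2.0 * r
                sites.append((x, y))
    return sites


sites = hexagonal_sites()
# Three sectors per site, 120 deg apart; the 0 deg offset is an assumption.
cells = [(x, y, azimuth) for (x, y) in sites for azimuth in (0.0, 120.0, 240.0)]
```

With two rings around the central site, the sketch yields 19 sites and 57 cells, and neighbouring sites are exactly one ISD apart.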

B. Positioning Tests
We divided the experiments into offline and online phases. In the offline (training) phase, the UE moves along the trajectories and each BS gathers both LOS and NLOS channel realizations according to the specific UE's position. We considered the training positions as perfectly known (i.e., no error was introduced to the ground truth positions) as we aimed at assessing the lower-bound performance of the method. In total, we gathered about $5.9 \cdot 10^4$ samples in 1659 positions for the training phase alone. In the online (test) phase, the NLOS positioning capabilities were verified on the same trajectories but in random positions not adopted in the training. Specifically, we adopted about $2.5 \cdot 10^4$ samples in 711 positions for NLOS testing. In this way, we assess whether or not each BS can learn the environment, i.e., the channel characteristics, around it. On the other hand, for LOS positioning, we validated on completely different trajectories to analyze the capability of estimating the position from learned geometrical features. Here, the total number of LOS tested positions was 867. A representation of the adopted trajectories can be found in Fig. 8. Note that in the training trajectories, i.e., red markers, the number of detected BSs can vary significantly: we measured from 1 up to 13 BSs in the collected samples. The test samples for LOS evaluation, highlighted with purple placeholders, are located in the top-left corner of the map. To avoid biases and improve model convergence, before model training we standardized all the samples to zero mean and unit standard deviation, and we shuffled the dataset at each epoch.
The Antenna Toolbox of MATLAB 2022b is used to generate the channel fingerprints for the data points, while the model for training and testing is implemented using PyTorch [72] (v1.12 with Python 3.7.11). The simulations are run on a workstation with an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40 GHz, 96 GB of RAM, and a Quadro RTX 6000 24 GB GPU. The training and testing times refer only to the runtime on PyTorch 1.12. Unless otherwise noted, the model is trained for a total of $E = 60$ epochs with a batch size $N_b = 64$. The Adam optimization algorithm [73] is used with an initial learning rate $lr = 10^{-4}$ and momentum values $\beta_1 = 0.9$, [...] Regarding the choice of the hyper-parameters $w_{\mathrm{rec}}$, $w_s$ and $w_u$, we clarify that their values depend on the specific dataset at hand, and tuning has therefore been performed as follows. Starting from $w_s$, since this parameter regulates how much weight is given to the LOS sample class in order to compensate for the class imbalance, it is computed as in the weighted cross-entropy loss: $w_s = N_{\mathrm{LOS}} / N_{\mathrm{NLOS}}$, where $N_{\mathrm{LOS}}$ and $N_{\mathrm{NLOS}}$ are the number of LOS and NLOS samples in the training batch, respectively. On the contrary, $w_{\mathrm{rec}}$ and $w_u$ have been chosen empirically using a grid search in $[0.1, 1]$ with step size 0.1 and assigned as $w_{\mathrm{rec}} = 0.1$, $w_u = 0.9$. This can be intuitively explained by two reasons. First, the AE model is much more complex than the MLPs for positioning, leading to a fast drop of the reconstruction error. Second, the number of features in $\mathbf{x}_i$ is larger than the dimension of $\mathbf{u}_i$, which automatically increases the weight of the reconstruction error with respect to the positioning error. In order to balance these two quantities, we suggest values that satisfy $w_{\mathrm{rec}} < w_u$.
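The two tuning steps described above, i.e., the batch-wise class weight $w_s = N_{\mathrm{LOS}} / N_{\mathrm{NLOS}}$ and the grid of candidate values for $w_{\mathrm{rec}}$ and $w_u$, can be sketched as follows. The function names and the label convention (1 for LOS, 0 for NLOS) are assumptions. The selection criterion applied on the grid is not shown because it depends on the validation metric at hand.

```python
def los_class_weight(labels):
    """w_s = N_LOS / N_NLOS for a batch of labels (1 = LOS, 0 = NLOS)."""
    n_los = sum(1 for s in labels if s == 1)
    n_nlos = len(labels) - n_los
    if n_nlos == 0:
        raise ValueError("batch contains no NLOS samples")
    return n_los / n_nlos


def weight_grid(low=0.1, high=1.0, step=0.1):
    """All (w_rec, w_u) pairs on the grid [low, high] with the given step."""
    n = int(round((high - low) / step)) + 1
    # Rounding removes floating-point drift such as 0.30000000000000004.
    values = [round(low + i * step, 10) for i in range(n)]
    return [(w_rec, w_u) for w_rec in values for w_u in values]


# Hypothetical batch with three LOS samples and one NLOS sample.
w_s = los_class_weight([1, 1, 1, 0])
grid = weight_grid()
```

The grid contains 100 candidate pairs, including the pair $w_{\mathrm{rec}} = 0.1$, $w_u = 0.9$ reported above.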

C. DL Model Design
The adopted DL model architecture is as follows. The AE is the most critical component, as good feature extraction is essential to enable precise positioning. Driven by the necessity of handling sparse data, i.e., the ADCPM input, we select the SegNet architecture [74], where the upsampling layers employ the encoder pooling indices to create an ad-hoc sparse feature mapping. At testing time, we can completely discard the decoder part, as the input reconstruction is only adopted in the training phase to learn the latent feature representation.
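To illustrate why reusing pooling indices suits sparse inputs, the sketch below implements max pooling with stored argmax positions and the corresponding unpooling step. It is a 1D, pure-Python simplification with hypothetical helper names; the actual SegNet layers operate on 2D feature maps and are interleaved with convolutions.

```python
def max_pool_with_indices(signal, window=2):
    """Non-overlapping 1D max pooling that also returns the argmax positions."""
    pooled, indices = [], []
    for start in range(0, len(signal) - window + 1, window):
        block = signal[start:start + window]
        offset = max(range(window), key=lambda k: block[k])
        pooled.append(block[offset])
        indices.append(start + offset)
    return pooled, indices

def max_unpool(pooled, indices, length):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    output = [0.0] * length
    for value, index in zip(pooled, indices):
        output[index] = value
    return output

# Sparse toy profile with two non-zero taps (hypothetical values).
x = [0.0, 0.9, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0]
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx, len(x))
```

In this toy case every window holds at most one non-zero tap, so the unpooled output restores the taps at their original positions instead of smearing them over the window, which is the property that motivates the index-based upsampling.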
The choice of the NNs for NLOS classification and position estimation is dictated by the specific task to accomplish. A BS must be able to localize a UE regardless of whether it is in ego mode, i.e., only a single BS detects the UE, or in cooperative mode, i.e., the UE is detected by multiple BSs. In the former case, the latent features should be a non-linear composition of the ToF and AoA of each multipath component. In the latter case, multiple ToFs must be exploited to localize the UE. To assess these concepts, we experiment with different multilayer perceptron (MLP) architectures, trying to localize a UE with only one ToF and AoA, or with 3 or more ToFs, as inputs.
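For reference, the geometric mapping that the ego-mode network has to learn in LOS can be written explicitly. The sketch below converts one ToF and one AoA (zenith and azimuth) into a 3D position; the angle convention (zenith measured from the vertical axis, azimuth in the horizontal plane, both in the same global frame as the BS coordinates) and the toy coordinates are assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def position_from_tof_aoa(bs_position, tof_s, zenith_rad, azimuth_rad):
    """Project the range c * ToF along the arrival direction seen from the BS."""
    distance = SPEED_OF_LIGHT * tof_s
    direction = (
        math.sin(zenith_rad) * math.cos(azimuth_rad),
        math.sin(zenith_rad) * math.sin(azimuth_rad),
        math.cos(zenith_rad),
    )
    return tuple(b + distance * d for b, d in zip(bs_position, direction))

def los_measurements(bs_position, ue_position):
    """Noise-free ToF, zenith and azimuth of the direct path (toy forward model)."""
    delta = [u - b for u, b in zip(ue_position, bs_position)]
    distance = math.sqrt(sum(d * d for d in delta))
    zenith = math.acos(delta[2] / distance)
    azimuth = math.atan2(delta[1], delta[0])
    return distance / SPEED_OF_LIGHT, zenith, azimuth

bs = (0.0, 0.0, 10.0)   # hypothetical BS position [m]
ue = (30.0, 40.0, 1.5)  # hypothetical UE position [m]
estimate = position_from_tof_aoa(bs, *los_measurements(bs, ue))
```

With noise-free inputs the mapping is exact; in NLOS the direct path is missing, which is why a learned non-linear combination of all multipath components is needed instead.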
In Fig. 9, we show the test results of the MLP described in Table IV, which was trained on a synthetic dataset using as input either ToF and AoA only (Fig. 9a), 3 ToF measurements (Fig. 9b), or 10 ToF measurements (Fig. 9c). This was done in order to verify the positioning capabilities of the models, i.e., to assess the bias of the MLPs for LOS and NLOS position estimation. Since the testing positions (red circles) are never seen in the training phase, we can conclude that the model is able to learn the geometrical meaning of the input features and perform multi-angulation and circular lateration. It is worth noticing that, in the case of the cooperative architecture, the BS positions (black squares) are never used as input features but are automatically learned during training. This can be an advantage when the coordinates of the BSs are unknown or only partially known.
Given these results, we employed the MLP in Table IV for the LOS and NLOS position prediction NNs, while the MLP in Table V is adopted for the NLOS detection NN. The key difference between these two types of NNs is the size and layer composition. First, we assumed that single supervised NLOS classification is a simpler task than 3D position regression, thus resulting in a different number of layers and neurons. Second, we employed Tanh (instead of ReLU or GELU) activation functions in the NLOS detection NN, since blockage detection should be performed in the fastest and most efficient way possible, as the whole cooperative prediction highly depends on it. In contrast, GELU is more computationally expensive but can capture more complex patterns (as needed for regression) due to its smooth and differentiable nature [75]. Finally, both adopt dropout to avoid overfitting.
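For readers unfamiliar with the two activations, the following sketch evaluates both on scalars using their standard definitions, with GELU in its exact form $x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. It only illustrates the functions themselves and makes no claim about their relative runtime on a given platform.

```python
import math

def tanh(x):
    """Hyperbolic tangent activation, bounded in (-1, 1)."""
    return math.tanh(x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Evaluate both activations on a few scalar inputs.
samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
tanh_values = [tanh(x) for x in samples]
gelu_values = [gelu(x) for x in samples]
```

Unlike ReLU, GELU is smooth around zero and returns small negative values for negative inputs, while Tanh saturates symmetrically on both sides.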

D. Baselines
For performance assessment, we compare the proposed method with the following algorithms/models:
• Ego DL model: the proposed DL model (Fig. 3) trained with the loss (14) for single-BS positioning. The model does not communicate with any other BS and has to rely only on its own prediction.
• Cooperative DL model: the proposed DL-based cooperative architecture described in Sec. V.
• Single-BS ToF-AoA: conventional positioning obtained by a single BS using the ToF estimate obtained from the cross-correlation with the SRS according to 5G NR Rel. 16 and the AoA estimated through the multiple signal classification (MUSIC) algorithm [76].
• Multi-BS time difference of flight (TDoF): conventional hyperbolic multilateration obtained using the TDoFs estimated by all the BSs receiving the SRS.

1) Training Convergence:
This first assessment aims at verifying the training convergence of the proposed ego DL model, i.e., the correct behavior of the loss function and of the root mean square error (RMSE) performance metric. In Fig. 10, we report the testing results separately for LOS and NLOS samples in the test set, for a varying number of training epochs. We notice that the loss function decreases quite rapidly in just 60 epochs and does not show signs of overfitting. This may be due to the fact that the dataset is very large (about 50 GB) and thus it is difficult to memorize every sample. Furthermore, the model regularization, i.e., dropout, helps in this respect.
Referring to the performance metric, we can see that the NLOS case exhibits much larger oscillations than the LOS one. While learning geometrical patterns and associated positions is a more standard task, associating NLOS samples with the UE location is much more difficult. The DL model has to understand not only how the environment is configured but also how it could change between the training positions. Finally, we note that the LOS RMSE is slightly worse than the NLOS one. This is due to the LOS performance bounds imposed by the physical layer parameters, i.e., the resolution of the ToF (limited by the signal bandwidth) and of the AoA (limited by the number of antennas). On the other hand, the NLOS RMSE highly depends on the training position resolution, i.e., the training spatial sampling: the denser the training points, the better the performance.
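The RMSE metric reported separately for LOS and NLOS samples can be sketched as below. The definition over the Euclidean position error, the helper names, and the toy values are assumptions, since the exact formula is not restated in this section.

```python
import math

def rmse(estimates, truths):
    """Root mean square of the Euclidean position error over a set of samples."""
    squared = [
        sum((e - t) ** 2 for e, t in zip(est, tru))
        for est, tru in zip(estimates, truths)
    ]
    return math.sqrt(sum(squared) / len(squared))

def rmse_by_condition(estimates, truths, is_los):
    """Compute the RMSE separately for the LOS and NLOS subsets of the test set."""
    result = {}
    for label, wanted in (("LOS", True), ("NLOS", False)):
        picked = [(e, t) for e, t, flag in zip(estimates, truths, is_los) if flag == wanted]
        if picked:
            result[label] = rmse([e for e, _ in picked], [t for _, t in picked])
    return result

# Toy test set: two LOS samples and one NLOS sample (hypothetical values, in meters).
est = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
tru = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
per_condition = rmse_by_condition(est, tru, [True, True, False])
```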
2) Blockage Detection: This experiment has the objective of assessing the capability of the steering model to discern LOS and NLOS conditions. The steering is taken into account both in the smart weighting of the geometrical latent features and in the final position estimate, i.e., lines 8 and 11 of Algorithm 1, respectively. To this aim, in Fig. 11 we show the testing accuracy, precision and recall for a varying number of training epochs. The numerical results show that the model reaches an accuracy of approximately 85%, validating the capability of the proposed approach to learn the multi-task problem within such a realistic environment. Regarding the parameter $\Gamma$, we implemented a conservative yet effective approach. We selected $\Gamma = 0.6$ not with the aim of matching the values of precision and recall, but rather to obtain fewer false positives than false negatives. The rationale is that, if the model is uncertain about the visibility condition, it slightly leans towards the NLOS case. As a result,
3) Online Computational Complexity: In this section, we analyze the computational complexity of the proposed localization method by assessing the number of floating point operations (FLOPs) and the inference time required for performing positioning. Starting from the computational complexity, we can divide the computational load into measurement extraction and position estimation. For measurement extraction, assuming $N_g$ signal samples and $MN$ antenna elements, the ADCPM can be efficiently computed by adopting a 2D inverse fast Fourier transform (IFFT), with a complexity of $\mathcal{O}(MN N_g \log(N_g MN))$. On the other hand, time-based measurements obtained with cross-correlation have a complexity in the order of $\mathcal{O}(N_g^2)$ (or $\mathcal{O}(N_g \log N_g)$ with efficient methods such as the FFT). Finally, angle-based measurements obtained by the MUSIC algorithm using $N$ and $M$ scanning directions in the azimuth and zenith domains, respectively, involve an overall complexity of $\mathcal{O}((MN)^3)$ [77], considering the eigenvalue decomposition (EVD) of the signal covariance matrix as dominant over the scanning search for source directions. In order to have a reference measure of the time required by our system (described in Sec. VI-B) to compute the measurements, we clarify that, on average, the ADCPM, ToF and AoA computations took about 0.4, 0.3 and 0.4 ms, respectively. As expected, the complexity of the ADCPM is higher than that of single time-based or angle-based measurements, as it embodies all the location-related information provided by the propagation channel, i.e., including the ToFs, AoAs and received power of all multipaths.
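To give a feeling for the orders of magnitude involved, the sketch below evaluates the asymptotic expressions above for $N_g = 352$ temporal samples and $MN = 64$ antenna elements, the sizes quoted for the example ADCPM in Fig. 2. These are operation-count proxies only: constants are ignored, the base-2 logarithm is an assumption, and the numbers are not measured FLOPs.

```python
import math

def adcpm_ifft_ops(num_antennas, num_samples):
    """O(MN * N_g * log(N_g * MN)) proxy for the 2D IFFT producing the ADCPM."""
    size = num_antennas * num_samples
    return size * math.log2(size)

def xcorr_direct_ops(num_samples):
    """O(N_g^2) proxy for direct cross-correlation."""
    return num_samples ** 2

def xcorr_fft_ops(num_samples):
    """O(N_g * log(N_g)) proxy for FFT-based cross-correlation."""
    return num_samples * math.log2(num_samples)

def music_evd_ops(num_antennas):
    """O((MN)^3) proxy for the EVD that dominates MUSIC."""
    return num_antennas ** 3

N_G, MN = 352, 64
counts = {
    "ADCPM (2D IFFT)": adcpm_ifft_ops(MN, N_G),
    "ToF (direct xcorr)": xcorr_direct_ops(N_G),
    "ToF (FFT xcorr)": xcorr_fft_ops(N_G),
    "AoA (MUSIC EVD)": music_evd_ops(MN),
}
for name, value in counts.items():
    print(f"{name}: {value:.3g}")
```

Under these assumptions the ADCPM proxy is the largest of the four, in line with the observation that the ADCPM is more costly than a single time-based or angle-based measurement.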
Regarding the algorithms for position estimation, we compare non-linear least squares (NLS) methods, adopted in Single-BS ToF-AoA and Multi-BS TDoF, and the proposed DL model. Denoting with $N_u$ the number of unknown location coordinates to be estimated, $N_{meas}$ the number of measurements and $N_{iter}$ the number of iterations for NLS convergence, the Gauss-Newton method has an overall complexity of $\mathcal{O}(N_{iter}(N_{meas} N_u + N_{meas} N_u^2 + N_u^3))$ (assuming that the cost of computing the Jacobian is roughly proportional to $\mathcal{O}(N_{meas} N_u)$, the cost of the Jacobian matrix multiplication to $\mathcal{O}(N_{meas} N_u^2)$, and the matrix inversion to $\mathcal{O}(N_u^3)$). However, estimating the complexity of the DL models is not straightforward, as it would involve a detailed breakdown of the operations performed. To this aim, in Fig. 12, we empirically assess the inference times using the system hardware described in Sec. VI-B and we compare them with NLS solutions, for a varying number of measurements. For the NLS method, we empirically choose $N_{iter} = 50$ and $N_u = 3$ (3D position). Results show that the proposed DL model is able to perform the position inference in about 1.5-2 ms, which corresponds to an NLS with about 10 measurements. We point out that these results highly depend on the specific hardware and implementation of both the DL and NLS algorithms. Nevertheless, the overall conclusion is that, despite being slightly more complex, the proposed method has the main advantages of achieving greater accuracy than classical geometrical algorithms and retaining the ability to localize in NLOS conditions (as described in Sec. VI-E.4), while at the same time being compliant with the strict 5 ms requirement of cooperative, connected, and automated mobility (CCAM) [78].
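The three cost terms of one Gauss-Newton iteration (Jacobian evaluation, normal-equation assembly, and matrix inversion) can be seen in the following minimal sketch. It is a simplified 2D, range-only (ToF-like) example written for illustration, not the NLS implementation used in the experiments; the anchor layout, the initial guess and the stopping tolerance are assumptions.

```python
import math

def gauss_newton_ranges(anchors, ranges, initial, num_iter=50):
    """Estimate a 2D position from range measurements with Gauss-Newton."""
    x, y = initial
    for _ in range(num_iter):
        # Accumulate the normal equations (J^T J) delta = J^T r.
        a11 = a12 = a22 = g1 = g2 = 0.0
        for (ax, ay), measured in zip(anchors, ranges):
            dx, dy = x - ax, y - ay
            predicted = math.hypot(dx, dy)
            if predicted == 0.0:
                continue  # Jacobian is undefined exactly at an anchor
            jx, jy = dx / predicted, dy / predicted  # one Jacobian row
            residual = predicted - measured
            a11 += jx * jx
            a12 += jx * jy
            a22 += jy * jy
            g1 += jx * residual
            g2 += jy * residual
        det = a11 * a22 - a12 * a12
        if abs(det) < 1e-12:
            break  # degenerate geometry, stop iterating
        # Closed-form inverse of the 2x2 matrix J^T J applied to J^T r.
        step_x = (a22 * g1 - a12 * g2) / det
        step_y = (a11 * g2 - a12 * g1) / det
        x -= step_x
        y -= step_y
        if math.hypot(step_x, step_y) < 1e-9:
            break  # converged
    return x, y

# Toy scenario: four anchors at the corners of a 100 m square, noise-free ranges.
anchors = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
true_position = (30.0, 40.0)
ranges = [math.hypot(true_position[0] - ax, true_position[1] - ay) for ax, ay in anchors]
estimate = gauss_newton_ranges(anchors, ranges, initial=(50.0, 50.0))
```

Each iteration touches every measurement once, so the runtime grows with $N_{meas}$, whereas the inference cost of a fixed-size network does not depend on the number of geometric measurements in the same way.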
4) Positioning Accuracy: In this assessment, we test the positioning accuracy of the methods described in Sec. VI-D. To this aim, in Figs. 13 and 14, we show the cumulative distribution functions (CDFs) of the location error in the LOS and complete test trajectories, respectively. For the cooperative methods, we consider a position as LOS if all the BSs detecting the UE are in LOS. To better compare the results, in both figures we report both the 2D and 3D error for each method.
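As a minimal sketch of how such curves can be obtained, the snippet below computes the 2D and 3D Euclidean errors and evaluates the empirical CDF at a given threshold. The helper names and the toy values are hypothetical, and taking the first two coordinates as the horizontal plane is an assumption.

```python
import math

def position_errors(estimates, truths, dims=3):
    """Euclidean error per sample using the first `dims` coordinates."""
    return [math.dist(e[:dims], t[:dims]) for e, t in zip(estimates, truths)]

def empirical_cdf(errors, threshold):
    """Fraction of samples whose error is at most `threshold`."""
    return sum(1 for err in errors if err <= threshold) / len(errors)

# Toy estimates against a ground truth at the origin (values in meters).
est = [(1.0, 0.0, 0.0), (0.0, 2.0, 2.0), (3.0, 4.0, 0.0)]
tru = [(0.0, 0.0, 0.0)] * 3

err_2d = position_errors(est, tru, dims=2)  # horizontal error only
err_3d = position_errors(est, tru, dims=3)  # includes the vertical component
cdf_2d_at_2m = empirical_cdf(err_2d, 2.0)
cdf_3d_at_2m = empirical_cdf(err_3d, 2.0)
```

Since the 3D error can never be smaller than the 2D one for the same sample, the 3D CDF always lies on or below the 2D CDF.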
Starting from the single-BS positioning methods in the LOS positions, we notice a large improvement from the Single-BS ToF-AoA to the ego DL model, with the mean error dropping from 26.38 m to 5.99 m in 3 dimensions. With the 2D error metric, the performance is slightly better, obtaining 24.76 m and 5.35 m, respectively. The ego DL model automatically extracts a compact non-linear representation of the overall multipath profile (i.e., ToF, AoA and power of all paths in LOS conditions), which is much more informative than the direct-path information alone, thus outperforming traditional single-BS positioning. Moreover, in 3D positioning, it outperforms even the Multi-BS TDoF. This is because, in classical TDoF approaches, the vertical geometrical dilution of precision (GDoP) is poor owing to the geometrical arrangement of the BSs over the vertical dimension, as BSs are usually located at similar heights. In contrast, the ML approach is able to learn the usual altitude of the user and exploit this information as a prior in the position computation.
Moving to the cooperative methods in the LOS testing trajectory, we observe that the proposed cooperative DL model outperforms the Multi-BS TDoF even in the 2D positioning case, i.e., with 66% of the points having an error of less than 0.81 m and 90% an error of less than 1.3 m. The mean error decreases from 3.21 m with Multi-BS TDoF to 1.68 m with the cooperative DL solution, an improvement of above 47%. The cooperative DL model, in fact, holds a common-sense notion of where the UE could be, i.e., it discards or does not consider unfeasible solutions that could be obtained by geometrical algorithms. This is confirmed by the results on the complete testing trajectory in Fig. 14, composed of both LOS and NLOS positions. Whenever one or more BSs are in NLOS, we suffer a severe degradation of performance. In these conditions, the ego DL model reaches a median 2D error of 3.01 m, compared with 3.74 m for the Multi-BS TDoF. Comparing the cooperative and ego DL models, we notice that the two CDFs approach each other, mainly due to the higher NLOS performance of ego-positioning. This is because, even when LOS positions are inaccurately classified, the positioning is corrected through fingerprint training without being impacted by errors in the geometrical features. Essentially, the position estimation relies solely on either the LOS MLP or the NLOS MLP, but never on a combination of the two. On the other hand, the cooperative DL model suffers slightly higher errors since, in NLOS positions, the BSs cannot cooperate.
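The exclusive selection between the LOS and NLOS position heads, together with the threshold $\Gamma = 0.6$ from the blockage detection experiment, can be sketched as follows. Treating LOS as the positive class, comparing a predicted LOS probability strictly against $\Gamma$, and all function names and toy values are assumptions made for illustration.

```python
GAMMA = 0.6  # decision threshold reported for blockage detection

def steer(p_los, los_estimate, nlos_estimate, gamma=GAMMA):
    """Return exactly one branch: the LOS head if p_los > gamma, else the NLOS head."""
    return los_estimate if p_los > gamma else nlos_estimate

def precision_recall(p_los_values, is_los_truth, gamma=GAMMA):
    """Precision and recall of the LOS decision (LOS assumed to be the positive class)."""
    tp = fp = fn = 0
    for p_los, truth in zip(p_los_values, is_los_truth):
        predicted_los = p_los > gamma
        if predicted_los and truth:
            tp += 1
        elif predicted_los and not truth:
            fp += 1
        elif not predicted_los and truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy classifier outputs and ground-truth visibility (hypothetical values).
p_los = [0.9, 0.7, 0.55, 0.2, 0.65]
truth = [True, True, True, False, False]
at_gamma = precision_recall(p_los, truth)            # threshold 0.6
at_half = precision_recall(p_los, truth, gamma=0.5)  # threshold 0.5

# An uncertain sample (p_los = 0.55) is routed to the NLOS head.
chosen = steer(0.55, los_estimate=(1.0, 2.0, 1.5), nlos_estimate=(1.2, 2.1, 1.5))
```

Raising the threshold above 0.5 sends borderline samples such as the third one to the NLOS head, which reflects the stated preference for leaning towards NLOS when the visibility condition is uncertain.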
If the achieved performance does not satisfy the target accuracy for the specific location-sensitive service, several strategies can be implemented to improve the proposed method without changing the structure of the model. First, starting from the physical layer, increasing the bandwidth and the number of antennas at the BSs would enhance the space-time system resolution, and thereby the ability of the model to resolve the multipath and extract location information from the ADCPM. Second, from a design point of view, we can increase the DL model dimension (especially the AE part), which reduces the model bias, and simultaneously increase the amount of collected data, thus reducing the variance. However, we need to keep in mind that this comes at the price of higher training and inference times, as well as higher costs of dataset creation, resulting in a performance-complexity trade-off.
5) Tracking Performance: This experiment compares the performance of the positioning methods on the testing trajectory, where a UE, i.e., a CAV, moves along a road at variable speed. In Fig. 15, the covered ground-truth trajectory is shown in pink, both in 3D (a) and in 2D (b). The CAV moves faster in the northern part of the trajectory and then slows down in the southern part of the road. Since the objective is to assess the point-position estimation of each method, we do not implement any tracking filter and we rely exclusively on the channel realization at specific samples of the trajectory. Moreover, passing from the 3D to the 2D representation, we discard unlikely estimated positions purely to ease the visualization of the 2D trajectory. The Single-BS ToF-AoA and the ego DL model results are obtained from BS number 48, while the cooperative methods consider only the positions where the CAV detects BS number 48. Observing Fig. 15a, we notice that the Single-BS ToF-AoA method has the worst performance, since it locates the CAV outside the road for most of the trajectory. This is mainly due to the AoA estimates, which worsen with increasing distance from the BS, and to the ToF estimates. In LOS positions, i.e., the northern part of the trajectory, the ToF error is only due to the autocorrelation of the SRSs and peak estimation. In contrast, in NLOS positions, i.e., the southern part of the trajectory, the major source of error is represented by reflections. The ego DL model improves this aspect by halving the error (see Fig. 15b), especially in NLOS sections. The Multi-BS TDoF struggles in high-speed conditions, i.e., the east part of the LOS trajectory, and in NLOS positions where the number of cooperative BSs is limited or the range biases are severe, i.e., the top-right corner of Fig. 15a. Finally, the proposed cooperative DL model (green markers) achieves the highest positioning accuracy in almost all conditions, combining cooperative LOS measurements with egocentric NLOS predictions.

VII. CONCLUSION AND FUTURE WORKS
Given the paramount importance of providing enhanced solutions for high-precision positioning in the next 3GPP Releases, in this paper we propose a cooperative positioning network architecture based on DL. Each BS composing the localization infrastructure runs the same proposed DL model, which solves the joint task of NLOS identification and position estimation. Depending on the condition, LOS or NLOS, the task is solved in cooperative or egocentric (ego) mode, respectively. In order to cooperate, the architecture internally exchanges only the compact latent feature representation of the channel obtained with an AE structure, which allows combining the complete non-linear measurements and enhancing the positioning accuracy.
The proposed cooperative architecture is suitable for and fully compliant with 5G massive-MIMO OFDM systems, where sparse space-time channel responses, i.e., ADCPMs, are adopted as input images to the DL model. The ADCPM embodies position-dependent features, such as the ToF, AoA and RSS of each propagation path, which can be automatically extracted by the proposed DL model. With the use of MATLAB ray-tracing and SUMO software, we simulate a complex and realistic C-ITS scenario where some CAVs create multiple trajectories and communicate with a set of BSs, i.e., a 3GPP UMi scenario. Results show that the proposed cooperative architecture is able to improve upon classical geometrical algorithms, e.g., TDoF multilateration, both in LOS and NLOS conditions, increasing the accuracy by 47%. Moreover, the cooperation overcomes the limitations of single-BS prediction based on DL by automatically switching between egocentric and cooperative mode.
ML, and DL techniques in particular, are foreseen to have a huge impact on next-generation cellular networks. This work is thereby a first attempt to implement a cooperative high-precision positioning system in that direction. Further developments could be the integration of different DL models into the architecture or the tracking of many simultaneous targets with automatic data association.

APPENDIX A JOINT LOG-LIKELIHOOD
To prove (13), we start by rewriting the likelihood distribution (12) as:
By discarding the terms that do not depend on $u_i$ or $s_i$, we obtain: concluding the derivation.

Manuscript received 15 February 2023; revised 13 August 2023; accepted 13 August 2023. Date of publication 9 October 2023; date of current version 22 November 2023. This work was supported in part by a Ph.D. Grant from the Ministry of the Italian Government Ministero dell'Istruzione, dell'Università e della Ricerca (MIUR), and in part by the project Centro Nazionale per la Mobilità Sostenibile (MOST), funded by the Italian Ministry of University and Research under the PNRR funding program. (Corresponding author: Bernardo Camajori Tedeschini.)

Fig. 1. BS receiving the uplink signal from the $k$-th UE through a UPA. The DoA of the $p$-th path is composed of the zenith angle $\theta_{k,p}$ and the azimuth angle $\phi_{k,p}$.

Fig. 2. Example of sparse channel power-delay-angle profile encoding the UE location, represented by an ADCPM with $N_g = 352$ temporal samples and $NM = 64$ spatial samples (resulting from $N = M = 8$ antennas) (a) and in the transformed angle-delay domain (b).

Fig. 3. Overview of the proposed model composed of an autoencoder (AE) for feature extraction and 3 neural networks (NNs) for NLOS classification and position estimation in both LOS and NLOS conditions.

Fig. 4. Joint likelihood distribution of $(u_i, s_i)$. Here, for the sake of simplicity, the multi-dimensional Gaussian distributions are represented as univariate.

j=1. Note that the number of detected BSs can vary at every timestep, without being constrained to detect a minimum number of BSs, as opposed to classical geometrical approaches. As illustrated in the pseudo-code in
Algorithm 1 Cooperative Position Inference
Input: sample $x_i^{(j)}$, neighbors' BS $\mathcal{N}_i^{(j)}$ ▷ Run on BS $j$ at timestep $i$
Output: estimated position $\hat{u}_i^{(j)}$
1: Encode sample $x_i^{(j)}$

Fig. 6. UMi scenario simulated for performing the analysis, composed of 19 sites in the area of Politecnico di Milano, Leonardo campus, Milano, Italy.

Fig. 9. Testing positioning results of an MLP with input (a) AoA and ToF, (b) 3 ToFs, and (c) 10 ToFs. The BSs are located at the black squares.

Fig. 10. Testing results, i.e., loss (top) and RMSE (bottom), of LOS and NLOS samples for a varying number of training epochs.

Fig. 11. Testing results of classification accuracy, precision and recall for a varying number of training epochs.

Fig. 12. Boxplot of the distribution of the inference time per sample [ms].

Fig. 13. Positioning performance in terms of CDF of the distance error in the LOS testing trajectory. The solid and dotted lines are the 2D and 3D errors, respectively.

Fig. 14. Positioning performance in terms of CDF of the location error over the whole testing trajectory. The solid and dotted lines are the 2D and 3D errors, respectively.

Fig. 15. Positioning performance in a 5G urban scenario: (a) 3D testing trajectory and related estimates obtained by the positioning methods (represented by different colors), with location errors represented as solid lines. (b) Bird's-eye view of the testing and estimated trajectories.

TABLE IV LOS AND NLOS POSITIONING NETWORK