FedBEVT: Federated Learning Bird's Eye View Perception Transformer in Road Traffic Systems

Bird's eye view (BEV) perception is becoming increasingly important in the field of autonomous driving. It uses multi-view camera data to learn a transformer model that directly projects the perception of the road environment onto the BEV perspective. However, training a transformer model often requires a large amount of data, and as camera data for road traffic are often private, they are typically not shared. Federated learning offers a solution that enables clients to collaborate and train models by exchanging model parameters rather than data. In this paper, we introduce FedBEVT, a federated transformer learning approach for BEV perception. To address two common data heterogeneity issues in FedBEVT, (i) diverse sensor poses and (ii) varying sensor numbers in perception systems, we propose two approaches: Federated Learning with Camera-Attentive Personalization (FedCaP) and Adaptive Multi-Camera Masking (AMCM), respectively. To evaluate our method in real-world settings, we create a dataset consisting of four typical federated use cases. Our findings suggest that FedBEVT outperforms the baseline approaches in all four use cases, demonstrating the potential of our approach for improving BEV perception in autonomous driving.


I. INTRODUCTION
Recently, there has been a significant surge of interest in bird's-eye-view (BEV) perception for autonomous driving. BEV representations of traffic scenarios are particularly appealing for several reasons. Firstly, as all autonomous vehicles (AVs) are located at ground level [1], [2], the BEV representation can omit the z-axis to make the perception results more efficient [3]. Secondly, it provides rich context and geometry information that can be directly utilized for downstream tasks such as planning [4], [5]. In particular, in the first comprehensive survey addressing control systems for autonomous vehicles and connected and automated vehicles [5], the authors thoroughly expound on the pivotal role of perception systems in enabling low-level control. Finally, the BEV representation can be a unified space for data from different sensor modalities (e.g., camera, LiDAR) and timestamps to fuse without much extra effort [6]. Traditionally, achieving temporal alignment in multi-source sensor fusion requires additional robust algorithms, such as the popular estimation-prediction integrated framework proposed in [7] and the computationally light multi-sensor fusion localization algorithm for GPS-challenging scenarios [8].

Fig. 1. FedBEVT with camera-attentive personalization. The positional embeddings are considered as private parameters for each client. Other parts of BEVT (in gray) are shared with the server for aggregation.
BEV perception research has begun exploring the use of multi-view camera data for predicting BEV maps due to their low cost [9]-[13]. Nevertheless, inferring 3D information from 2D data is challenging, as mono cameras only provide 2D information. Recent works have thus utilized vision transformers, known for their ability to reason about correlations between different data, to solve the 2D-to-3D problem [3], [9], [11], [14]-[16]. Despite promising results, these transformer-based methods have been trained on limited public datasets, such as NuScenes [17], which may not generalize well. Companies such as automotive original equipment manufacturers (OEMs), sensor suppliers, software solution providers, and research institutes need to install multi-sensor systems on several vehicles and gather large amounts of data during extended driving sessions that can be used as the training dataset. Nevertheless, such data is often private and can be prohibitively expensive, and some of it is never shared. In this context, federated learning provides a solution that enables collaborative training without data exchange and addresses the issue of data privacy.

arXiv:2304.01534v2 [cs.CV] 8 Sep 2023
The federated learning approach offers many benefits but can suffer from data heterogeneity when training across clients. In the context of camera-based BEV perception, this data heterogeneity is caused by the wide range of sensor configurations, with variations in the number of cameras and in the poses of sensors installed on vehicles or transportation infrastructure [18]. These variations can result in significant differences in data characteristics across clients, which vanilla federated learning methods may not be able to overcome. In this paper, we aim to address the practical question of how to leverage the advantages of external data while minimizing the impact of data heterogeneity in a federated learning framework, so as to achieve the highest possible performance improvement for local models.
To address the above data heterogeneity challenge, we propose a new federated learning framework for BEV perception transformers (BEVT), called FedBEVT (as shown in Fig. 1), that enables selective sharing of specific parts of the BEV transformer with the server. Following the approach of most transformer-based BEV perception works [3], [11], [12], we split a BEV vision model into five components: an encoder for image feature extraction, a positional embedding that contains camera geometry information, a cross-attention module to project front views to the bird's-eye view, convolution layers to refine the features in the transformers, and a decoder to transfer the BEV representation to the final prediction.
Given the variability of camera poses across different clients, sharing the entire model for global aggregation results in unsatisfactory performance. To address this, we adopt camera-attentive personalization in FedBEVT by privatizing the positional embedding that contains camera intrinsic and extrinsic information. Additionally, to address the varying number of cameras in each client, we introduce adaptive multi-camera masking (AMCM) to train multi-view data with a consistent BEV size by overlapping masks, which are built based on the combined fields of view of all cameras. This ensures a consistent BEV embedding size during federated learning and enables clients with varying camera systems to learn together. To evaluate the effectiveness of our methods, we create a dataset with these variations and distribute it among various clients in federated settings, simulating real-world federated learning scenarios in intelligent traffic systems. Our experiments demonstrate that our method achieves better test accuracy with reduced communication costs for most clients. Our contributions can be summarized as follows:
• We present a federated learning framework for the BEV transformer in road traffic perception applications. To the best of our knowledge, this is the first federated transformer training framework specifically designed for the BEV perception task.
• We provide a benchmark multi-view camera dataset for BEV perception in road traffic scenarios under federated settings. To address the challenge of data heterogeneity, we consider two popular data variations in federated learning application scenarios in the C-ITS domain: (i) diverse sensor poses and (ii) varying sensor numbers in perception systems.

II. RELATED WORK

A. BEV Perception in Road Traffic Systems
BEV perception involves transforming input image sequences from a perspective view to a BEV, enabling perception tasks such as 3D bounding box detection or semantic map segmentation. There are two main approaches to BEV perception: geometric-based and transformer-based methods. Geometric-based methods leverage the natural geometric projection relationship between camera extrinsic and intrinsic parameters to project the perspective view to a BEV. For example, LSS [18] uses a categorical distribution over depth and a context vector to lift 2D images to a frustum-shaped point cloud, which is then splatted onto the BEV plane using the camera extrinsics and intrinsics. BEVDet [10] follows a similar framework, with the addition of a BEV encoder to further refine the projected BEV representations. Transformer-based methods, on the other hand, can implicitly utilize the camera geometry information with learnable embeddings. BEVFormer [9] uses the geometry information to obtain the initialized sampling offset and applies deformable attention to query the image features into the learnable BEV embedding. Although it achieves high accuracy, it is slow. CVT [11] develops positional embeddings for each individual camera depending on their intrinsic and extrinsic calibrations. These embeddings are added to the image features and learnable BEV embeddings, and a vanilla cross-view attention transformer is applied to fetch the image features to the BEV embedding. Building upon CVT, CoBEVT [3] replaces the vanilla attention with a novel fused axial attention to largely reduce the computation cost and constructs a hierarchical structure to explore multi-scale features. As CoBEVT achieves a good trade-off between inference speed and accuracy, this paper selects it as the deployed model for the federated learning study.

B. Federated Learning for Intelligent Traffic Systems
Federated learning brings great opportunities and potential to intelligent traffic systems. It allows the use of more data to train better-performing models while respecting privacy regulations [19]-[22]. Previous research has begun to explore federated learning for training models pertinent to applications in intelligent transportation systems, such as 2D object recognition and detection [23]-[25], behavior prediction [26], localization [27], [28], and blockchain-based autonomous vehicles [29]-[32]. Nevertheless, the application of federated learning for BEV perception remains unexplored. Moreover, existing studies [33]-[35] primarily address challenges of heterogeneity in federated learning in the context of system diversity and label shift, while the issue of heterogeneous sensor configurations, a primary cause of data heterogeneity, has not been investigated. To bridge these gaps, our work concentrates on the application of BEV perception and strives to address the heterogeneity resulting from sensor configurations.

C. Personalized Federated Learning
Federated learning is a collaborative approach to machine learning that involves a large number of clients working together within a network [36]. Typically, a server initiates the learning process and clients download the global model to train it on their local data. The trained local models are then uploaded back to the server, and the global model is updated through aggregation. After several communication rounds, a model that performs well on the global data is obtained. However, such models may not work well with local data, as the local data distribution in most federated learning applications is often different from the global data distribution.
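The aggregation step described above can be illustrated with a minimal sketch. It assumes the common FedAvg-style weighting by local dataset size; the function name `fedavg_aggregate` and the toy parameter dictionaries are hypothetical and only serve the illustration.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def fedavg_aggregate(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Weighted average of client parameters (weights proportional to dataset size)."""
    total = sum(client_sizes)
    aggregated: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        aggregated[name] = [
            sum(p[name][j] * n / total for p, n in zip(client_params, client_sizes))
            for j in range(length)
        ]
    return aggregated


# Two toy clients, each holding one 2-element parameter tensor (hypothetical values).
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
sizes = [1, 3]  # the second client holds three times as much data
global_params = fedavg_aggregate(clients, sizes)
print(global_params)  # {'w': [2.5, 5.0]}
```

The resulting global model is pulled towards the client with more data, which is one reason why it may fit small or atypical local datasets poorly.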
To address the challenge of local data distribution discrepancies in federated learning, personalized federated learning has been proposed as a solution [37]-[39]. This approach customizes the model in each client to account for the unique characteristics of its local data. One of the most popular and effective methods for achieving personalized federated learning is the architecture-based approach [40]. This method decouples the model's parameters, allowing only a subset of parameters to be shared and aggregated among clients, while the private parameters are learned solely on local data. Previous research has attempted to select these private parameters based on model architecture [41]-[44] or data similarities [45]-[47]. However, none of these methods has been proposed specifically for training personalized transformer models, which have a significantly different structure compared to other machine learning models. Therefore, we study personalized federated learning methods that are tailored to the unique characteristics of transformer models.

D. Federated Learning for Transformer
The Transformer is a neural network architecture that uses self-attention mechanisms and was originally designed for natural language processing tasks [48]. It has also been effective in computer vision tasks such as object recognition (ViT) [49]-[51]. As demonstrated in [52], the Transformer architecture achieves higher performance than traditional convolutional neural networks when trained on extensive datasets. Moreover, this architecture shows promising potential for detecting small objects, as exemplified in UAV imagery. Federated learning offers the potential to further enhance the training process by incorporating data from multiple clients, making it a promising approach for training Transformer models. Recent experimental results in [53] have shown that the Transformer architecture is more robust to distribution shifts and improves federated learning for object detection tasks.
However, its application to the BEV semantic segmentation task, which is crucial in autonomous driving and intelligent traffic systems, remains an area for further research. We observe that variations in camera setup are the primary source of data heterogeneity in federated learning for BEV semantic segmentation tasks. To tackle this challenge, our federated learning framework incorporates personalized positional embedding and integrates cross-attention and CNN layers to boost the transformer's performance and leverage larger amounts of data.

A. Problem Formulation
In a federated learning training scenario, we consider $K$ clients, each with its own local dataset consisting of $N_k$ data points paired with the corresponding BEV semantic masks $Y_k$. Each data point $X_{k,i}$ contains $L_k$ images from a multi-view camera system, resulting in the format $(X_k, Y_k)$. Here, $X_{k,i} \in \mathbb{R}^{L_k \times H \times W \times 3}$ and $Y_{k,i} \in \{0, 1\}^{h \times w}$, where $H$ and $W$ represent the height and width of one image, and $h$ and $w$ represent the height and width of the BEV semantic mask. The semantic mask has two classes, background and vehicles. These clients may come from a variety of industries, including original equipment manufacturers (OEMs), logistics and taxi companies, car rental companies, or infrastructure owners such as traffic management departments.
In general federated learning, each client receives a global model from the server and trains locally for several epochs before returning the trained model to the server for fusion into a new global model. This process continues for a certain number of communication rounds until a trained model is obtained. The goal of federated learning is to minimize the following:

$$\min_{w} F(w) = \sum_{k=1}^{K} \frac{N_k}{N} F_k(w), \qquad F_k(w) = \frac{1}{N_k} \sum_{i=1}^{N_k} f_i(w), \tag{1}$$

where $f_i(w) = l(\mathrm{BevT}(X_i), Y_i; w)$, $N$ is the number of data points in all clients, $l$ is the loss function, and BevT is a transformer-based BEV model with three major components: image encoder, BEV transformer, and BEV decoder. The image feature encoder utilizes a CNN to encode multi-camera images. The core of the BEV transformer is usually a multi-layer attention structure [48]. In this paper, we employ the Fax-Attention schema [3], which involves both sparse cross-view attention and self-attention between image features and the BEV query.

Additionally, the BEV query and positional embeddings are required as inputs to the BEVT. The BEV query is a learnable embedding that represents the grid world, and the ultimate goal is to locate the position of other traffic objects in this query. The positional embeddings contain position information of each transformer image feature, including information related to the coordinate system transformation for converting the camera perspective to the BEV perspective. This improves the intuitiveness of the input for generating BEV features. We adopt the approach from [11] and represent the output of the positional embeddings as follows, assuming the extrinsic parameters of each camera $j$ are $R_{i,j,k} \in \mathbb{R}^{3 \times 3}$ for rotation and $t_{i,j,k} \in \mathbb{R}^{3}$ for translation:

$$z_{i,k} = \Phi^{(v_k)}\big(T(p;\, R_{i,j,k},\, t_{i,j,k})\big). \tag{2}$$

Here, the image coordinate-system position features in $p$ are first transformed into the 3D vehicle coordinate system using $T$. Next, the resulting positions are mapped by $\Phi^{(v_k)}$, producing a positional embedding $z_{i,k}$. We denote the parameters in $\Phi$ as $v_k$ to distinguish them from the other parameters $u$ of the transformer model used in Sec. III-B.
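To make the role of the positional embedding more concrete, the following sketch shows one possible instance of the two steps described above: a transformation from pixel coordinates into the vehicle frame, followed by a learned mapping. It assumes a pinhole camera model and a single linear layer in place of the MLP; the helper names `unproject` and `phi`, the intrinsics, and the weights are all hypothetical, and the paper's exact formulation may differ.

```python
import math
from typing import List, Sequence

Vec = List[float]
Mat = List[List[float]]


def matvec(m: Mat, v: Sequence[float]) -> Vec:
    """Matrix-vector product for small dense matrices."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def unproject(pixel, fx, fy, cx, cy, cam_to_vehicle: Mat) -> Vec:
    """Assumed form of T: pixel -> unit viewing direction in the vehicle frame."""
    x, y = pixel
    d_cam = [(x - cx) / fx, (y - cy) / fy, 1.0]  # inverse pinhole intrinsics
    d_veh = matvec(cam_to_vehicle, d_cam)        # rotate into the vehicle frame
    norm = math.sqrt(sum(c * c for c in d_veh))
    return [c / norm for c in d_veh]


def phi(direction: Vec, weight: Mat, bias: Vec) -> Vec:
    """Stand-in for the private mapping: one linear layer with parameters v_k."""
    return [w + b for w, b in zip(matvec(weight, direction), bias)]


# Hypothetical private parameters v_k of one client (4-dimensional embedding).
weight = [[1, 0, 0], [0, 1, 0], [0, 0, 2], [0, 0, -1]]
bias = [0.0, 0.0, 0.0, 0.5]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rotated = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]  # same camera, turned by 90 degrees

center = (320.0, 240.0)  # pixel at the principal point
z_a = phi(unproject(center, 500.0, 500.0, 320.0, 240.0, identity), weight, bias)
z_b = phi(unproject(center, 500.0, 500.0, 320.0, 240.0, rotated), weight, bias)
print(z_a, z_b)
```

The same pixel yields different embeddings under different camera poses, which illustrates why the parameters of this mapping are tied to a client's sensor configuration.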
The BEV features are obtained by passing through multiple Fax-Attention layers and are processed by a decoder network to produce semantic segmentation outcomes in the BEV perspective, which enables direct determination of the position of surrounding cars on the motion plane. Our approach addresses two key challenges: (i) developing tailored models for unique sensor systems, and (ii) effectively training models using data from varying numbers of cameras.

B. Camera-Attentive Personalization
The variation in camera positions across clients can cause the traditional federated learning results to deviate from the optimal values of local data. This is due to the encoding of camera position differences into positional embeddings during transformer model construction, which directly affects the parameter training.
To tackle this challenge, we introduce federated learning with camera-attentive personalization (FedCaP), which involves separating the locally trained parameters $v_k$ of the MLP for positional embedding, as shown in Eq. 2, from the transformer and treating them as private parameters for a given client $k$. Apart from $v_k$, we share the remaining parameters of $w_k$ in the BEVT and refer to these public parameters as $u_k$ for client $k$. As a result, the global update in each communication round is represented by the aggregated public parameters from all clients, which we denote as $u$. The goal of local training is then to minimize

$$\min_{u_k, v_k} F_k(u_k, v_k) = \frac{1}{N_k} \sum_{i=1}^{N_k} f_i(u_k, v_k),$$

and the global goal is rewritten from Eq. 1 as

$$\min_{u, V} F(u, V) = \sum_{k=1}^{K} \frac{N_k}{N} F_k(u, v_k), \qquad V = \{v_1, \dots, v_K\}.$$

Each client uploads the encoder, decoder, and other transformer parts in BEVT. After the server aggregates these parameters, the global parameters are coupled with the private parameters for the next round of training.
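In particular, the local update for each client $k$ consists of two stages:

$$v_k^{t+1} = v_k^{t} - \eta_v \nabla_{v} F_k(u^{t}, v_k^{t}),$$
$$u_k^{t+1} = u^{t} - \eta_u \nabla_{u} F_k(u^{t}, v_k^{t+1}),$$

where $\eta_v$ and $\eta_u$ are the learning rates for $v_k$ and $u_k$, respectively. If only a subset $S_t$ (where $M = |S_t|$) of all clients is selected for aggregation, Eq. 6 is reformulated with the aggregation running over the $M$ clients in $S_t$. The loss function for client $k$ is expressed as:

$$F_k(u, v_k) = \frac{1}{N_k} \sum_{i=1}^{N_k} l\big(\mathrm{BevT}(X_{i,k}), Y_{i,k};\, u, v_k\big).$$

The pseudocode of FedCaP is outlined in Algorithm 1.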
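The two-stage local update and the aggregation of public parameters can be sketched on a toy problem. This is not the authors' implementation: the model is a scalar regression y = u * x + v_k in which the public parameter u plays the role of the shared BEVT weights and the private offset v_k plays the role of the positional-embedding parameters. The function names, learning rates, step counts, and data are assumptions made for illustration.

```python
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (x, y) pairs


def grads(u: float, v: float, data: Data) -> Tuple[float, float]:
    """Gradients of the mean of 0.5 * (u * x + v - y)^2 with respect to u and v."""
    n = len(data)
    g_u = sum((u * x + v - y) * x for x, y in data) / n
    g_v = sum((u * x + v - y) for x, y in data) / n
    return g_u, g_v


def local_update(u_global, v_k, data, eta_u, eta_v, steps):
    # Stage 1: update the private parameter while the public one stays fixed.
    for _ in range(steps):
        _, g_v = grads(u_global, v_k, data)
        v_k -= eta_v * g_v
    # Stage 2: update the public parameter using the new private parameter.
    u_k = u_global
    for _ in range(steps):
        g_u, _ = grads(u_k, v_k, data)
        u_k -= eta_u * g_u
    return u_k, v_k


def fedcap_round(u_global, privates, datasets, eta_u=0.1, eta_v=0.1, steps=5):
    """One communication round: only u is aggregated, each v_k stays on its client."""
    sizes = [len(d) for d in datasets]
    total = sum(sizes)
    new_u = 0.0
    for k, data in enumerate(datasets):
        u_k, privates[k] = local_update(u_global, privates[k], data, eta_u, eta_v, steps)
        new_u += sizes[k] / total * u_k
    return new_u


datasets = [
    [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)],    # client 0: y = 2x + 1
    [(-1.0, -5.0), (0.0, -3.0), (1.0, -1.0)],  # client 1: y = 2x - 3
]
privates = [0.0, 0.0]
u = 0.0
for _ in range(50):
    u = fedcap_round(u, privates, datasets)
print(round(u, 3), [round(v, 3) for v in privates])
```

In this toy setting the shared slope is learned jointly while each client keeps its own offset, mirroring how the shared BEVT parameters are aggregated while client-specific camera geometry stays local.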

C. Adaptive Multi-Camera Masking
For each client, multi-view data points may come from a different number of cameras. To facilitate federated learning across these clients, we introduce adaptive multi-camera masking (AMCM), which adaptively overlays a mask on the BEV query based on the comprehensive field of view (FoV). In essence, all clients initialize the BEV query with a uniform size and spatial dimensions.
Fig. 2 demonstrates this process using two clients: the first employs four cameras, while the second uses only two, a front view and a rear view. Based on the total FoV applicable to each client type, the BEV query is masked, thus enabling activation strictly within the FoV area. The relationship between each camera and the BEV query leads to the projection of each client's data points onto a BEV query of consistent size. This facilitates a more effective aggregation process in federated learning.
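The masking idea can be sketched as follows. The sketch assumes that all cameras sit at the ego position and that each is described only by a yaw angle and a horizontal FoV; the function `fov_mask`, the 4x4 grid, and the 100-degree FoV are hypothetical simplifications rather than the configuration used in the paper.

```python
import math
from typing import List, Tuple

Camera = Tuple[float, float]  # (yaw in degrees, horizontal FoV in degrees)


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def fov_mask(grid_size: int, cameras: List[Camera]) -> List[List[int]]:
    """Mark each BEV cell as active (1) if any camera's FoV covers its bearing."""
    half = (grid_size - 1) / 2.0
    mask = []
    for row in range(grid_size):
        line = []
        for col in range(grid_size):
            # Row 0 is in front of the vehicle, column 0 is to its left.
            x, y = half - row, half - col
            bearing = math.degrees(math.atan2(y, x))
            visible = any(angle_diff(bearing, yaw) <= fov / 2.0 for yaw, fov in cameras)
            line.append(1 if visible else 0)
        mask.append(line)
    return mask


quad = [(0.0, 100.0), (90.0, 100.0), (180.0, 100.0), (270.0, 100.0)]
front_rear = [(0.0, 100.0), (180.0, 100.0)]
mask_quad = fov_mask(4, quad)
mask_two = fov_mask(4, front_rear)

# Both clients keep a BEV query of the same size; only the active region differs.
query = [[1.0] * 4 for _ in range(4)]
masked_query = [[q * m for q, m in zip(q_row, m_row)] for q_row, m_row in zip(query, mask_two)]
for m_row in mask_two:
    print(m_row)
```

Because every client keeps the same query shape, the shared parameters that operate on the BEV query have identical dimensions across clients and can be aggregated directly.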
Algorithm 1. FedCaP. In each round, every selected client first updates its private model weights locally and then updates its public model weights locally.

D. System Framework
The system overview of our federated learning framework is shown in Fig. 3. The framework considers both public traffic and private data resources as clients. Since connections in wireless networks for CAVs can be unstable, we select clients to avoid the straggler effect during the federated learning process. On the other hand, clients from private data resources are typically customers of the results of federated learning, and their locally trained models are always aggregated in all communication rounds due to the high network quality.
To decrease the volume of communication in each round, clients can implement compression techniques, such as sparsification and quantization, prior to communication [54], [55]. To limit the exposure of model parameters in the network, we incorporate a secure aggregation protocol featuring secret sharing in federated learning [56].
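As an illustration of the compression step, the sketch below combines top-k sparsification with uniform 8-bit quantization. These are generic textbook variants chosen for the example and not necessarily the schemes of [54], [55]; the function names and the toy update vector are hypothetical.

```python
from typing import Dict, List, Tuple


def top_k_sparsify(update: List[float], k: int) -> Dict[int, float]:
    """Keep the k entries with the largest magnitude, stored as index -> value."""
    keep = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)[:k]
    return {i: update[i] for i in sorted(keep)}


def quantize(values: List[float], bits: int = 8) -> Tuple[List[int], float, float]:
    """Uniform quantization to integer codes; returns (codes, offset, step)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes: List[int], lo: float, step: float) -> List[float]:
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * step for c in codes]


update = [0.1, -2.0, 0.05, 1.5, -0.3]  # toy model update
sparse = top_k_sparsify(update, k=2)
codes, lo, step = quantize(list(sparse.values()))
restored = dequantize(codes, lo, step)
print(sparse, codes, restored)
```

Only the retained indices, the integer codes, and two floating-point constants would need to be transmitted, at the cost of a quantization error of at most half a step per retained value.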
Our personalized federated learning approach requires clients to share only a portion of the model while keeping the positional embeddings (camera and image embedding) local. After the secure aggregation, the customized BEVT is rebuilt in each client by concatenating the global and local model partitions. The concatenated model is then trained on the local dataset and decoupled again for further model aggregation.

IV. DATASET
Our federated learning approach requires a diverse dataset that comprises various vehicles equipped with different sensor installation positions and numbers, paired with BEV ground truth in different cities. However, since there is no real-world dataset that meets our requirements, we resort to using the high-fidelity simulator CARLA [57] and the full-stack autonomous driving simulation framework OpenCDA [58] to gather the necessary data. To simulate different camera sensor installation positions, we employ three distinct types of collection vehicles: a compact car, a pickup truck, and a bus. Each collection vehicle drives through various cities with consistent weather conditions and is equipped with four cameras to provide a 360-degree surrounding view, similar to [3]. We employ post-processing techniques to control the number of cameras in each frame, ensuring that the collected data meets our quality standards. However, the installation poses of these cameras vary significantly, depending on the different vehicle models. In total, the car, bus, and truck datasets comprise 8352, 1796, and 1800 frames, respectively, containing 52, 14, and 9 unique scenarios. Each scenario contains diverse traffic situations and road types, following a collection protocol similar to that of OPV2V [59], to enrich the complexity of the data.
In Tab. I, we present the sensor configuration parameters employed for data collection across a variety of vehicular clients, including cars, buses, and trucks. A qualitative illustration of the disparities in multi-camera input data from bus, truck, and car clients is provided in Fig. 4.

V. EXPERIMENT

A. Experimental Setup
Use cases. In road traffic system scenarios, the federated learning clients typically include industrial companies, such as automotive OEMs, and individual connected vehicles in public traffic. Generally, the datasets stored in these connected vehicles are relatively modest in size, while the quantity of these vehicles is substantial. In contrast, automotive corporations usually have significantly larger datasets, despite being considerably fewer in number compared to vehicles. We consider these two factors in our experimental design. Simultaneously, each client's data originate from its respective sensor system, resulting in data heterogeneity among clients. Our experimental design takes this source of data variance into account by adopting differing sensor system configurations across various types of vehicles (car, bus, and truck). The specific sensor system configurations used in our experiments are listed in Tab. I.
To comprehensively evaluate the performance of each method in potential federated learning application scenarios in road traffic systems, we define four use cases (UCs) as follows:
• UC 1 considers two industrial companies training BEVT together. The local dataset in each is collected from trucks and buses. To further enlarge the training dataset, an open dataset is employed in federated learning and considered as a virtual client. The dataset in this virtual client is stored in the server and set as the third data silo.
• UC 2 involves four industrial companies. The aim is to upgrade their models using the datasets from other clients, but without accessing others' raw data directly.
• UC 3 represents a typical way of using public traffic data collected in vehicular networks for training, which often involves a larger number of clients (24 clients in total for federated learning). Each client owns a small dataset collected from one or two specific driving scenarios, e.g., an urban trip with crowded road traffic on a sunny day.
• UC 4 addresses federated learning for clients with different numbers of cameras, which can happen when datasets are collected from different sensor systems among manufacturers.
We summarize the motivation and the description of the four UCs in Tab. II, and the distribution of data among clients in Fig. 5.
BEVT architecture. We begin by feeding our input images $X_{i,k} \in \mathbb{R}^{L_k \times H \times W \times 3}$ through a 3-layer ResNet34 encoder. To ensure consistency across inputs, we use the AMCM to resize all inputs to have $L'_k = 4$. The image features are then encoded at different spatial resolutions.
Next, we perform FAX cross-attention-based transformer operations between the BEV embeddings in $\mathbb{R}^{128 \times 128 \times 128}$ (query) and the encoded image features (key and value). This step yields BEV features in $\mathbb{R}^{32 \times 32 \times 128}$. To convert these BEV features into our final BEV results in $\{0, 1\}^{256 \times 256}$, we use a decoder with a 3-layer bilinear upsample module.
[...] rounds where the learning rate remains constant. Afterwards, we employ a Cosine Annealing learning rate scheduler [60]. Each client is locally trained for one epoch with a batch size of 4 using the AdamW optimizer [61].
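A learning rate schedule of this kind, an initial phase with a constant rate followed by cosine annealing, can be written as a small function. The number of constant rounds, the total number of rounds, and the base learning rate used below are hypothetical placeholders, since the corresponding values are not stated in this section.

```python
import math


def learning_rate(round_idx: int, total_rounds: int, constant_rounds: int,
                  base_lr: float, min_lr: float = 0.0) -> float:
    """Constant learning rate first, then cosine annealing down to min_lr."""
    if round_idx < constant_rounds:
        return base_lr
    progress = (round_idx - constant_rounds) / max(1, total_rounds - constant_rounds)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to show the shape of the schedule.
schedule = [learning_rate(r, total_rounds=100, constant_rounds=20, base_lr=2e-4)
            for r in range(101)]
print(schedule[0], schedule[60], schedule[100])
```

Halfway through the annealing phase the rate has dropped to half of its base value, and it approaches `min_lr` in the final round.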
Baselines. Given the limited research on federated learning for BEVT, we conduct the first trial of training FedAvg on this platform. Furthermore, to validate the effectiveness of FedCaP, we also incorporate recent research findings on federated transformer learning as baselines on BEVT. In addition to showcasing the results of local training on each client, we compare FedCaP with the following baselines:
• FedAvg [36] is the original algorithm for federated learning, which aims to train a common global model for all clients.
• FedRep [45] shares the data representation across clients and learns unique local heads for each client. Note that we only allow clients to personalize their image encoder layers as local heads.
• FedTP [62] uses personalized attention for each client while aggregating the other parameters among the clients.

B. Performance
To compare the performance of FedCaP with other baselines, we compare the average Intersection over Union (IoU) achieved by the trained models in the four UCs.
UC 1. As shown in Tab. III, FedBEVT achieves an IoU improvement of over 50% compared to local training, due to its indirect utilization of other clients' local training data. Additionally, FedCaP outperforms the basic FedAvg and the other two personalized federated learning approaches in overall performance. Although one of the baselines trains a slightly better model than FedCaP on the bus client, it requires significantly more communication rounds than FedCaP.
UC 2. Tab. IV presents the results for UC 2, where the data volume for each client is more balanced but the outcomes remain similar to UC 1. In car client A, FedRep training shows results comparable to FedCaP, while FedCaP outperforms the other methods in the other clients.
UC 3. We conduct a targeted comparison of the performance of FedCaP and FedAvg on 24 clients, each holding data from only one or two scenarios. Since UC 3 is aimed at scenarios with poor network environments, we restrict each method to only 100 communication rounds of training. Fig. 6 illustrates that more than 80% (20 out of 24) of clients achieve superior personalized models with FedCaP.
UC 4. Although AMCM allows clients with different numbers of cameras to train a model jointly and enriches the data resources, it can also result in data heterogeneity across clients. Therefore, if the data is sufficient for training, AMCM may lead to worse training results. As Tab. V shows, the client with mono-camera data can achieve better results without AMCM because the other two clients both have front cameras and can train their model based solely on that data. However, when training a model for clients with tri-cameras, AMCM can enable federated learning with more clients and improve the training performance. The performance can be further enhanced using FedCaP, which reduces the effects of AMCM. For the client with quad-cameras, data heterogeneity becomes a bigger issue due to the significantly different data in the mono-camera client. Nonetheless, FedCaP can alleviate such effects and achieve the best model among the three approaches.
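For reference, the IoU metric used in this comparison can be computed for binary BEV masks as in the following sketch. The handling of the case in which both masks are empty (returning 1.0) is a convention chosen here, not something specified in the text.

```python
from typing import List

Mask = List[List[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over Union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))  # 0.5
```

Averaging this value over all test frames of a client gives the per-client score that the tables in this section report.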

C. Effectiveness
To validate that the personalized FedBEVT is effective in addressing the variations across local datasets in clients, we conduct ablation experiments using UC 1 as a representative example to train models for different clients.
We first train models using only local data and test them on various clients. In particular, we train a model using the virtual car client with OPV2V data. Although it achieves an IoU of 30.48% on car client data, its performance decreases significantly on the bus and truck test sets (7.39% and 2.01%), and it even performs worse than models trained using local truck data on the truck test set. This motivates us to use FedBEVT.
Subsequently, we employ different personalized FedBEVT variants, namely FedTP, FedRep, and FedCaP, to train models for different clients and evaluate them on their own test sets and the test sets of other clients. The results shown in Tab. VI, Tab. VII, and Tab. VIII are consistent and show that locally personalized models are the most suitable for local data. Since the camera pose and quantity of a car typically do not change once it is produced, the locally collected training dataset and the data used for future inference have strong similarities. This aligns with our experimental design. Therefore, these results further emphasize the significance of personalized FedBEVT for road traffic perception in BEV.

D. Visual Analysis
In Fig. 7, we visually show that the BEV maps generated by FedCaP are more accurate and holistic compared to other methods in test datasets from all three vehicle types, namely bus, truck, and car. We present the front camera view of each multi-view camera system in the first column. It is evident that the camera systems vary significantly in height across different clients. Though a higher camera may offer a better BEV view, it also reduces the number of pixels occupied by each object, thereby making object detection more challenging. However, the ground truth for the BEV view (see the second column) remains the same, as the height information in the z-axis is no longer relevant.
FedRep. Although FedRep achieves a relatively good object recognition ability, it may miss some object vehicles. It trains the initial layers of the encoder to some extent to account for differences in image pixels. However, FedRep does not explicitly consider differences in the sensor perspectives. When the data heterogeneity caused by various sensor heights matters, it leads to a decrease in the object recognition ability.
FedTP. We observe that the model trained by FedTP is more likely to recognize roadside trees and buildings as object vehicles. FedTP directly decouples the parameters of cross-attention, which considers data differences, but this may cause a deviation in the optimization direction of attention and the original head when the model is coupled again.
FedCaP. By privatizing the embeddings related to extrinsic and intrinsic parameters, FedCaP straightforwardly considers the differences in camera system configurations among clients. It results in promising overall accuracy and object recognition ability. As shown in Fig. 7, it is the only method that recognizes all objects in the data from buses. For the data from trucks, it is the only method that recognizes objects on the left side of a crossroad. For the data from cars, it recognizes all objects and accurately estimates their size.

VI. CONCLUSION
This work investigates the efficiency of federated learning in training a transformer-based model for BEV perception in road traffic. Our analysis identifies two potential data heterogeneity issues that can impede the performance of federated learning approaches. To address these challenges, we propose two novel techniques, i.e., FedCaP and AMCM. We evaluate the effectiveness of our proposed approaches by collecting a new dataset and distributing it to clients such that typical use cases in federated settings are created. Our experimental results demonstrate that the proposed methods significantly enhance the overall performance of federated learning by personalizing the positional embeddings and increasing the data resources available for training. In conclusion, our work highlights the potential of using federated learning for BEV perception models and presents effective solutions to overcome challenges of data heterogeneity.

APPENDIX A CONVERGENCE ANALYSIS
Based on the descriptions of FedCaP in Sec. III-B, we first introduce the following assumptions that align with previous works in [37]-[39], [43]: and for some X,
Assumption 2. (Bounded gradient). The variance of each client's local stochastic gradient is bounded, i.e.,
Assumption 3. (Bounded noise). There exist ζ ≥ 0 and P ≥ 0 applicable to all u and V, such that,
In cases where F does not contain a regularization term, P equals 0.
Theorem 1. (Convergence). FedCaP converges over the course of training.
Proof. Assuming Assumption 1 is valid, we can establish bounds for the updates of variables u and v_k for the k-th client, as delineated below:
where
By integrating inequalities (12), (13), and (14), we can establish a bound for the complete update after a single round of communication in FedCaP. This can be expressed as:
For inequality (15), we can explore the expectation of each term:
By integrating (17), (18), (19), and (20) into (16), we obtain the following inequality for the overall update of a communication round in FedCaP:
Next, we evaluate the bounds of the last two terms in inequality (20). Assuming that Assumptions 2 and 3 are valid, and after simplifying the constants within the terms, we obtain the following:
with learning rates η_u ≤ (12 E L_u (1 + X^2)(1 + P^2))^{-1} and η_v ≤ (6 E L_v (1 + X^2))^{-1}, where C > 0 represents some positive constant. Finally, the convergence rate of FedCaP can be expressed as follows:
F(u_0, V_0) represents the function F with initial values of u_0 and V_0, while F^* encompasses all other instances of F. It is evident that the convergence rate of FedCaP approximates 1/√T, aligning with vanilla Stochastic Gradient Descent (SGD), but this holds true only under the following conditions:
In the scenario where all clients are selected in each round and there are no regularization terms in the local loss functions, we can reformulate condition (23) as follows:

Fig. 2. Illustration of AMCM used for two vehicles equipped with different numbers of cameras. By adapting the mask to the total field of view in each perception system, the size of BEV embeddings can be maintained and adjusted for use in FedBEVT.

Fig. 3. The system overview of FedBEVT illustrates vehicle clients from public traffic and clients from private data resources. The federated learning server manages the training process with protocols for model compression, client selection, and secure aggregation similar to those in a typical federated learning framework for deep learning models. Note that data heterogeneity exists due to the variety of clients in intelligent road traffic systems.

Fig. 4. Visualization of four-view camera data points and BEV ground truth (GT) for car, bus and truck clients in the FedBEVT dataset.

Fig. 5. Distribution of the federated dataset for evaluation of FedBEVT in UC 1-4. The specific camera configurations for each vehicle type are documented in Table I. This diversity in configuration contributes to data heterogeneity.

Fig. 6. Comparison between FedCaP and FedAvg in UC 3, where the federated learning is organized with 24 clients including buses, trucks and cars. Each client has local data from only one or two scenarios.

Fig. 7. Visual comparison of BEV results from different federated learning approaches.

Algorithm 1 FedCaP: Federated Learning with Camera-attentive Personalization
Server Input: number of communication rounds T
Client Input: learning rates η_v, η_u
Client Input: number of local training epochs E_k
Output: BevT with public parameter u and private parameter v_k for each client
1: Server initializes u_0 to all clients
2: for communication round t = 1, 2, ..., T do
ClientUpdate(u)
15: u_k ← u
16: BevT ← Rebuild(u_k, v_k)
17: for epoch e = 1, 2, ..., E do
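To make the control flow of Algorithm 1 concrete, the following toy sketch runs the visible steps (initialize u, send it to clients, update locally for E epochs, aggregate) on scalar parameters. It is an illustration under strong assumptions rather than the authors' procedure: the quadratic local loss, the per-client targets, full client participation, and the unweighted average on the server are all hypothetical choices, and the steps not shown in the listing above are not reproduced.

```python
from typing import List, Tuple


def client_update(u: float, v_k: float, target_u: float, target_v: float,
                  eta_u: float, eta_v: float, epochs: int) -> Tuple[float, float]:
    """Local training on a hypothetical loss
    F_k(u, v_k) = 0.5 * (u - target_u)**2 + 0.5 * (v_k - target_v)**2."""
    u_k = u  # u_k <- u: start from the shared parameters sent by the server
    for _ in range(epochs):
        u_k -= eta_u * (u_k - target_u)  # gradient step on the public part
        v_k -= eta_v * (v_k - target_v)  # gradient step on the private part
    return u_k, v_k


def fedcap_toy(targets: List[Tuple[float, float]], rounds: int, epochs: int,
               eta_u: float = 0.1, eta_v: float = 0.1) -> Tuple[float, List[float]]:
    """Run the toy loop; returns the final shared u and each client's private v_k."""
    u = 0.0                       # server initializes the shared parameter
    v = [0.0] * len(targets)      # private parameters never leave the clients
    for _ in range(rounds):
        local_us = []
        for k, (t_u, t_v) in enumerate(targets):
            u_k, v[k] = client_update(u, v[k], t_u, t_v, eta_u, eta_v, epochs)
            local_us.append(u_k)
        u = sum(local_us) / len(local_us)  # assumed unweighted aggregation of u only
    return u, v


# Two clients that disagree on the shared optimum and on their private optimum.
u_final, v_final = fedcap_toy([(1.0, 5.0), (3.0, -5.0)], rounds=200, epochs=2)
```

In this toy setting the shared parameter settles near the mean of the clients' shared targets, while each private parameter approaches its own client's target, which is the behavior the split into u and v_k is meant to allow.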

TABLE II
OVERVIEW OF MOTIVATIONS AND DESCRIPTIONS FOR THE FOUR USE CASES.

TABLE III
COMPARISON BETWEEN FEDCAP AND OTHER BASELINES IN UC 1.

TABLE IV
COMPARISON BETWEEN FEDCAP AND OTHER BASELINES IN UC 2.
1 The numbers within the brackets represent the indices of the cameras in the perception system. Specifically, the front camera is denoted by 1, the left camera by 2, the right camera by 3, and the rear camera by 4.
2 FedAvg without AMCM: Only the cameras with the same setup in each client are used for federated learning. For instance, to train a model for the Mono-cam client, all data from the front cameras of the other clients are used for federated learning. One significant limitation of using FedAvg without AMCM is that it can only train a model for clients with a particular camera system.