Vertical Federated Learning: Concepts, Advances and Challenges

Vertical Federated Learning (VFL) is a federated learning setting where multiple parties with different features about the same set of users jointly train machine learning models without exposing their raw data or model parameters. Motivated by the rapid growth in VFL research and real-world applications, we provide a comprehensive review of the concept and algorithms of VFL, as well as current advances and challenges in various aspects, including effectiveness, efficiency, and privacy. We provide an exhaustive categorization for VFL settings and privacy-preserving protocols and comprehensively analyze the privacy attacks and defense strategies for each protocol. In the end, we propose a unified framework, termed VFLow, which considers the VFL problem under communication, computation, privacy, as well as effectiveness and fairness constraints. Finally, we review the most recent advances in industrial applications, highlighting open challenges and future directions for VFL.


Introduction
Federated Learning (FL) [1] is a novel machine learning paradigm where multiple parties collaboratively build machine learning models without centralizing their data. The concept of FL was first proposed by Google in 2016 [2] to describe a cross-device scenario where millions of mobile devices are coordinated by a central server while local data are not transferred. This concept was soon extended to a cross-silo collaboration scenario among organizations [3], where a small number of reliable organizations join a federation to train a machine learning model. In [3], FL is, for the first time, divided into three categories based on how data is partitioned in the sample and feature space: Horizontal Federated Learning (HFL), Vertical Federated Learning (VFL) and Federated Transfer Learning (FTL) (see Figure 1).
• HFL refers to the FL setting where participants share the same feature space while holding different samples. For example, Google uses HFL to allow mobile phone users to use their datasets to collaboratively train a next-word prediction model [2].
• VFL refers to the FL setting where datasets share the same samples/users while holding different features. For example, Webank uses VFL to collaborate with an invoice agency to build financial risk models for their enterprise customers [4].
• FTL refers to the FL setting where datasets differ in both feature and sample spaces with limited overlaps. For example, EEG data from multiple subjects with heterogeneous distributions are used to collaboratively build BCI models using FTL [5].
Due to their differences in data partitioning, HFL and VFL adopt very different training protocols. Each party in HFL trains a local model and exchanges model updates (i.e., parameters or gradients) with a server, which aggregates the updates and sends the aggregated result back to each party. In VFL, by contrast, each party keeps both its data and model local but exchanges intermediate computed results. The output of the HFL training procedure is a global model shared among all parties, while each party in VFL owns a separate local model after training. At inference time, each party in HFL uses the global model separately, while parties in VFL need to collaborate to make inferences.
FL can also be categorized into "cross-device" and "cross-silo" settings [6]. Cross-device FL may involve a vast number of mobile or edge devices as the participating parties. In contrast, the participating parties in cross-silo FL are typically a limited number of organizations. HFL can be either cross-device or cross-silo FL, while VFL typically belongs to cross-silo FL. We compare the main differences between HFL, VFL, and FTL in Table 1. Note that Table 1 compares the conventional cases of HFL, VFL, and FTL. As this research area experiences explosive growth, some special cases may deviate from Table 1.
The need for VFL has arisen and grown strongly in industry in recent years. Companies and institutions owning only small and fragmented data have constantly been looking for complementary data partners to collaboratively develop artificial intelligence (AI) technology for maximizing data utilization [7,8]. At the same time, data privacy and security regulations have been strengthened worldwide due to growing public concerns over data leakage and privacy breaches. Accordingly, many privacy-preserving projects and platforms supporting VFL have been developed in the past two years [9,10,11,12,13], and the number of commercialized projects as well as the economic value of VFL have grown significantly. Since data parties in VFL hold different attributes of the same people and typically come from different industry segments (for example, a local bank and a local retailer), they are more inclined to collaborate than to compete.
While the applications and research on VFL have grown dramatically in recent years, a comprehensive survey on the advances, challenges, and potential research directions of VFL is still lacking. Existing FL surveys focus either on HFL [6,14,15] or on a limited perspective of VFL [16,17].
Therefore, we provide a comprehensive overview of current progress in VFL. We propose an exhaustive categorization for VFL settings and privacy-preserving protocols and discuss possible routes for improving effectiveness, efficiency, and privacy. In the end, we propose a unified framework, termed VFLow, which is extended from the original VFL definition and takes into account communication, computation, effectiveness, privacy, and fairness constraints.
This paper is organized as follows: Sec. 2 overviews VFL's concepts and training procedures. Building on Sec. 2, Sec. 3, Sec. 4, and Sec. 5 discuss the efficiency, effectiveness, and privacy and security aspects of VFL algorithms, respectively. Sec. 6 discusses the challenges of data valuation, explainability, and fairness towards building a VFL ecosystem. Sec. 7 introduces VFLow, a VFL optimization framework guiding the design and optimization of VFL algorithms, and Sec. 8 discusses application-oriented algorithms built on VFL. Finally, Sec. 9 discusses open challenges and future directions. Figure 2 depicts the relationships between sections in this work.

Vertical Federated Learning framework
In this section, we provide an overview of VFL formulation, algorithm, and variants.

Problem Definition
A VFL system aims to collaboratively train a joint machine learning (ML) model using a dataset $D \triangleq \{(x_i, y_i)\}_{i=1}^{N}$ with $N$ samples while preserving the privacy and safety of local data and models. We formulate the loss of VFL as follows:
$$\min_{\Theta} \ell(\Theta; D) \triangleq \frac{1}{N} \sum_{i=1}^{N} f(\Theta; x_i, y_i) + \lambda \gamma(\Theta) \qquad (1)$$
where $\Theta$ denotes the joint ML model, $f(\cdot)$ and $\gamma(\cdot)$ denote the loss function and the regularizer, and $\lambda$ is the hyperparameter that controls the strength of $\gamma$. VFL assumes that data are partitioned by feature space. Following [3,18], each feature vector $x_i$ is distributed among the $K$ parties as $\{x_{i,k} \in \mathbb{R}^{1 \times d_k}\}_{k=1}^{K}$, where $d_k$ is the feature dimension of party $k$. Parties $k \in [K-1]$ hold features only, while the $K$-th party also has the label information $y_i = y_{i,K}$. We refer to the $K$-th party, who owns the labels, as the active party and to the rest of the parties as passive parties. Each passive party $k$ has dataset $D_k \triangleq \{x_{i,k}\}_{i=1}^{N}$, while the active party has dataset $D_K \triangleq \{x_{i,K}, y_{i,K}\}_{i=1}^{N}$. Without loss of generality, we decompose $\Theta$ into local models $G_k$ parameterized by $\theta_k$, $k \in \{1, \dots, K\}$, which operate only on local data, and a global module $F_K$ parameterized by $\psi_K$, which is accessible only to the active party $K$. We rewrite the loss $f(\Theta; x_i, y_i)$ as:
$$f(\Theta; x_i, y_i) = \mathcal{L}\big(F_K(\psi_K; G_1(x_{i,1}, \theta_1), \dots, G_K(x_{i,K}, \theta_K)), y_{i,K}\big) \qquad (2)$$
where $\mathcal{L}$ denotes the task loss (e.g., mean squared error loss, cross-entropy loss, or hinge loss). Figure 3 pictorially overviews the architecture and core components of a VFL system. Each party's local data are not exchanged during the collaboration. The local model $G_k$ can take various forms, including trees [19], linear and logistic regression (LR) [3,18,20,21,22,23], support vector machines [24,25], neural networks (NN) [26,27,28], as well as K-means [29] and the EM algorithm [30]. Although most existing VFL works consider linearly separable local models, recent work [31] has also proposed kernel methods for incorporating non-linear learning over distributed features.
The global module $F_K$ can be either trainable [28,32,33] or non-trainable [28,34]. If a trainable global module is in place, this VFL scenario coincides with the vertical splitNN [35], where the whole model is split across different parties; we therefore term it splitVFL (see Figure 4(a)). If the global module is non-trainable, it serves as an aggregation function, such as Sigmoid (for NN) or an optimal split finding function (for tree), that aggregates parties' intermediate results. We term this scenario aggVFL (see Figure 4(b)). Another variant of VFL arises when the active party has no features and thus provides no local model. In this variant, the active party plays the role of a central server. We refer to the variants of splitVFL and aggVFL in which the active party provides no features as splitVFL_c and aggVFL_c, respectively. We illustrate these VFL variants in Figure 4 and summarize their architectural differences in Table 2. In a typical VFL system, passive parties communicate only with the active party, which serves as the coordinator that orchestrates the training and inference procedures. In some scenarios, a third party is involved and responsible for encryption and decryption [18].

VFL Training Protocol
In this section, we describe a general training protocol for VFL, which consists of two steps: 1) entity alignment; 2) privacy-preserving training. See Figure 3.
Privacy-Preserving Entity Alignment. The very first step for a VFL system to start a collaborative training process is to align the data used for the training. This process is referred to as entity alignment, which adopts private set intersection techniques to find the common sample IDs without revealing the unaligned data. We discuss these techniques in Sec. 5. Whereas conventional VFL frameworks mostly consider entity alignment with exact IDs, a recent study [36] also demonstrated a coupled design for fuzzy identifiers to enable one-to-many alignment, which could be an interesting future direction of VFL.
Privacy-Preserving Training by Exchanging Intermediate Results. After the alignment, participating parties can start training the VFL model using the aligned samples. The most common training protocol uses gradient descent [37], which requires parties to transmit local model outputs and corresponding gradients, together termed intermediate results, instead of local data. Algorithm 1 describes a general VFL training procedure based on neural networks using stochastic gradient descent (SGD). Specifically, each party $k$ computes its local model output $H_k = G_k(x_k, \theta_k)$ on a mini-batch of samples $x$ and sends $H_k$ to the active party. With all the $\{H_k\}_{k=1}^{K}$, the active party computes the training loss following Eq. (1). Then, the active party computes the gradients $\partial \ell / \partial \psi_K$ of its global module and uses them to update the global module. Next, the active party computes the gradients $\partial \ell / \partial H_k$ for each party and transmits them back. Finally, each party $k$ computes the gradient of its local model $\theta_k$ as follows:
$$\nabla_{\theta_k} \ell = \frac{\partial \ell}{\partial \theta_k} = \sum_{i} \frac{\partial \ell}{\partial H_{i,k}} \frac{\partial H_{i,k}}{\partial \theta_k} \qquad (3)$$
where $H_{i,k}$ is the output of party $k$ for the $i$-th sample of the mini-batch, and updates its local model. This procedure iterates until convergence.
To prevent leakage from the exchanged intermediate results, they can be encrypted before transmission, for example with Homomorphic Encryption (HE); in that case, a third-party collaborator is often responsible for encryption and decryption. Other privacy-preserving techniques, such as Differential Privacy (DP) and Gradient Discretization (GD), can also be applied to enhance the privacy and security of the VFL system. We provide detailed comparisons of these techniques in Sec. 5.
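The exchange of intermediate results described above can be illustrated with a minimal plaintext sketch. It assumes two parties, linear local models, and a non-trainable sigmoid aggregation as the global module (the aggVFL case); the toy data, the learning rate, and all variable names are illustrative assumptions rather than part of any cited algorithm, and no encryption is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy aligned dataset (assumption): 200 samples, 6 features, split vertically.
# Party 0 is passive (4 features); party 1 is the active party (2 features + labels).
N = 200
X = rng.normal(size=(N, 6))
y = (X @ rng.normal(size=6) > 0).astype(float)
X_parts = [X[:, :4], X[:, 4:]]

# Local models G_k are linear maps with parameters theta_k (hypothetical choice).
thetas = [np.zeros(4), np.zeros(2)]
lr, batch_size = 0.5, 32

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_loss(params):
    """Cross-entropy of the joint model on the full aligned dataset."""
    z = sum(Xk @ th for Xk, th in zip(X_parts, params))
    p = np.clip(sigmoid(z), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

initial_loss = joint_loss(thetas)

for step in range(300):
    idx = rng.choice(N, size=batch_size, replace=False)  # shared mini-batch of aligned IDs
    # Each party computes its local output H_k and sends it to the active party.
    H = [Xk[idx] @ th for Xk, th in zip(X_parts, thetas)]
    # Active party: non-trainable global module, sigmoid of the summed outputs.
    p = sigmoid(sum(H))
    # Gradient of the mean cross-entropy w.r.t. each H_k (identical for all k here).
    dL_dH = (p - y[idx]) / batch_size
    # The active party sends dL_dH back; each party applies the chain rule locally.
    for k in range(len(thetas)):
        grad_theta_k = X_parts[k][idx].T @ dL_dH
        thetas[k] -= lr * grad_theta_k

final_loss = joint_loss(thetas)
```

Note that only `H` and `dL_dH` cross party boundaries in this sketch; the raw features in `X_parts` and the parameters `thetas[k]` stay with their owners.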

Tree-based VFL
Tree-based VFL complies with the architecture depicted in Figure 3 and follows the general loss defined in Eq. (2) for conducting VFL training, but it differs from NN-based VFL in the local models $G_k$, $k \in \{1, \dots, K\}$, the global module $F_K$, and the specific training process at each party.
In tree-based VFL, the local model $G_k$ at each party $k$ consists of multiple partial tree models, each of which, together with its counterparts from other parties, forms a complete tree model. The global module $F_K$ is an aggregation function that identifies the optimal feature split based on feature splitting information received from all parties.

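Algorithm 1: A general VFL training procedure
for each training iteration do
    Randomly sample a mini-batch of samples x ⊂ D;
    for each party k = 1, 2, ..., K in parallel do
        Party k computes H_k = G_k(x_k, θ_k) and sends H_k to the active party K;
    end for
    Active party K computes the loss ℓ and updates ψ_K using ∂ℓ/∂ψ_K;
    Active party K computes and sends ∂ℓ/∂H_k to all other parties;
    for each party k = 1, 2, ..., K in parallel do
        Party k computes ∇_{θ_k} ℓ with Equation (3);
        Party k updates θ_k;
    end for
end for

[...] HE to encrypt transmitted information. FederBoost [41] and OpBoost [42] adopt differential privacy to protect individual data, aiming for a better balance between privacy and efficiency.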
Random Forest (RF) [45] is another popular tree-based ensemble algorithm that has been integrated into VFL. RF-based VFL algorithms [46,47,48] typically leverage bagging and optimized parallelism to enhance training and inference efficiency. Federated Forest [46] introduces a third party and applies RSA encryption to protect data privacy. VFRF [47] adopts a randomized iterative affine cipher (RIAC) [49] to encrypt transmitted information. VPRF [48], a verifiable privacy-preserving random forest scheme, is proposed to verify data integrity and preserve data privacy.

Improving Communication Efficiency
In production VFL, network heterogeneity, long geographical distances, and the large size of encrypted data make coordination a communication bottleneck. Thus, methods proposed to mitigate communication overhead typically involve reducing the cost of coordination and compressing the data transmitted between parties. We summarize these methods in Table 3 and discuss them in this section.

Multiple Client Updates
One straightforward way to save communication cost is to allow participating parties to perform multiple local updates during each iteration. Liu et al. [18] proposed a federated stochastic block coordinate descent algorithm, called FedBCD, that allows each party to conduct multiple client updates before each communication to reduce the number of synchronizations, thereby mitigating the communication overhead. Castiglia et al. [50] proposed a flexible local update strategy for VFL, named Flex-VFL, that allows each party to conduct a different number of local updates constrained by a specified timeout for each communication round. Zhang et al. [51] proposed an adaptive local update strategy for VFL, named AdaVFL, that optimizes the number of local updates for each party in each round by minimizing the total

Asynchronous Coordination
The core idea of asynchronous coordination is that each party can upload and download intermediate training results asynchronously. However, asynchronous coordination may result in stale information, which may harm the overall model performance and jeopardize communication efficiency if it is not dealt with properly. Li et al. [54] proposed GP-AVFL, which allows parties to update local models asynchronously by leveraging a gradient prediction technique to dynamically adjust local model gradients. Cai et al. [55] proposed AVFL, which accelerates VFL training by omitting the updates from slow parties with poor network conditions. Zhang et al. [56] proposed a truncated VFL algorithm, called T-VFL, to discard parties with channel gains lower than a threshold. Chen et al. [57] proposed a vertical asynchronous federated learning algorithm called VAFL, which utilizes a query-response strategy that decouples the coordination between the server and clients. Hu et al. [27] proposed FDML, which allows each party to update its local model asynchronously but based on the same sequence of randomly sampled training data.
AsySQN [59], VFB² [60], and FDSKL [61] all utilize a tree-structured communication scheme [75] to enhance communication efficiency. AsySQN [59] additionally exploits approximated Hessian information to obtain a better descent direction. VFB² [60] supports multiple active parties. FDSKL [61] integrates a non-linear kernel method into vertical federated learning. It leverages random features to approximate the kernel mapping function, aiming to achieve efficient computation parallelism, and adopts doubly stochastic gradients to update the kernel function for scalability. Han et al. [62] employ random forest (RF) [45] as the base learner for learning GBDT in order to enhance parallelism and save communication rounds. To reduce long periods of idle time and accelerate the aggregation process under cryptography, VF²Boost [63] adopts a concurrent training protocol to take full advantage of computational resources and leverages a re-ordered accumulation technique and a histogram packing method to accelerate histogram construction and communication.
Asynchronous coordination may incur additional computation overhead for handling inconsistencies between the asynchronous updates. Thus, trade-offs between coordination and computation overhead should be carefully considered when applying asynchronous coordination methods.

One-shot Communication
One-shot communication alleviates communication overhead by coordinating only once during the entire training procedure. All proposed one-shot communication approaches follow a two-step training procedure: (1) all parties extract latent representations from their original data using unsupervised learning; (2) the active party trains the global model using these latent representations.
Wu et al. [64] proposed FedOnce, in which each party leverages an unsupervised learning method, called NAT (Noise As Targets) [76], to extract latent representations from its local data. Then the active party trains the global model using its local features combined with latent representations passed from passive parties. AE-VFL [65] leverages an autoencoder to extract latent representations from each party's local data, while CE-VFL [66] utilizes both Principal Component Analysis (PCA) and an autoencoder to conduct the latent representation extraction.
A trade-off for one-shot methods is that sample-wise representations of original data are permanently passed on to another party. Therefore, the privacy risks of revealing these representations need to be carefully evaluated, e.g., through inversion attacks or information theory studies. Besides, one-shot methods typically involve computationally expensive unsupervised learning of effective representations. Therefore, the trade-off between communication and computation is worth investigating.
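The two-step one-shot procedure can be sketched as follows. This is a simplified illustration, not the FedOnce, AE-VFL, or CE-VFL algorithm: it assumes PCA as the unsupervised extractor and a least-squares regressor as the global model, and the data and names (`pca_representation`, `fit_least_squares`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 300
X_passive = rng.normal(size=(N, 8))   # features held by the passive party
X_active = rng.normal(size=(N, 3))    # features held by the active party
w = rng.normal(size=11)
y = np.hstack([X_passive, X_active]) @ w + 0.1 * rng.normal(size=N)  # labels at the active party

def pca_representation(X, dim):
    """Local unsupervised step: project centred data onto the top principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dim].T

def fit_least_squares(features, targets):
    """Fit a linear model with intercept and return its training mean squared error."""
    A = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return float(np.mean((A @ coef - targets) ** 2))

# Step 1: the passive party extracts representations and transmits them exactly once.
Z_passive = pca_representation(X_passive, dim=4)

# Step 2: the active party trains on its own features plus the received representations.
mse_active_only = fit_least_squares(X_active, y)
mse_one_shot = fit_least_squares(np.hstack([X_active, Z_passive]), y)
```

The single transmission of `Z_passive` is exactly the sample-wise representation whose privacy risk is discussed above.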

Compression
Compression is a commonly used approach in VFL to alleviate communication overhead by reducing the amount of data transmitted among parties. It can alleviate both communication and computation overheads, especially when expensive encryption operations (e.g., HE and MPC) are applied.
Neural network-based VFL algorithms naturally map high-dimensional input vectors to low-dimensional representations. Some works adopt specialized dimension-reduction techniques to compress data. AVFL [55] leverages Principal Component Analysis (PCA) to compress transmitted data, while CE-VFL [66] utilizes both PCA and autoencoders to learn latent representations from raw data. Two follow-up works of SecureBoost, SecureBoost+ [38] and eHE-SecureBoost [67], encode encrypted first-order and second-order gradients into a single message to reduce the encryption operations and the size of data transmitted between parties, thereby saving communication bandwidth and computational costs. C-VFL [68] allows an arbitrary compression scheme to be applied to VFL to enhance communication efficiency and provides theoretical analysis on the impact of compressor parameters. GP-AVFL [54] employs a double-end sparse compression (DESC) technique to reduce communication traffic by exploiting the sparsity in the forward outputs of local models and in the backward gradients transmitted from the active party to passive parties. Adaptive quantization techniques [77,78,79] may also be considered in future VFL research.
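As a rough illustration of what a compressor applied to transmitted intermediate results might look like, the sketch below combines row-wise top-k sparsification with 8-bit uniform quantization. It is a generic example under assumed parameters (k = 8, 8-bit codes), not the scheme of C-VFL, GP-AVFL, or any other cited work, and all function names are hypothetical.

```python
import numpy as np

def top_k_sparsify(H, k):
    """Keep the k largest-magnitude entries of each row; return (indices, values)."""
    idx = np.argpartition(np.abs(H), -k, axis=1)[:, -k:]
    vals = np.take_along_axis(H, idx, axis=1)
    return idx, vals

def densify(idx, vals, dim):
    """Receiver side: scatter the kept values back into a zero matrix."""
    out = np.zeros((idx.shape[0], dim), dtype=vals.dtype)
    np.put_along_axis(out, idx, vals, axis=1)
    return out

def quantize_uint8(v):
    """Uniformly quantize values to 8-bit codes plus an offset and a scale."""
    lo, hi = float(v.min()), float(v.max())
    scale = (hi - lo) / 255 or 1.0   # avoid a zero scale for constant inputs
    q = np.round((v - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float64) * scale + lo

# Sender: compress a batch of local outputs H_k (32 samples, 64 dimensions).
rng = np.random.default_rng(0)
H = rng.normal(size=(32, 64))
idx, vals = top_k_sparsify(H, k=8)
q, lo, scale = quantize_uint8(vals)

# Receiver: reconstruct an approximation of H_k from (idx, q, lo, scale).
H_hat = densify(idx, dequantize(q, lo, scale), dim=64)
```

Sparsification and quantization both introduce approximation error, so the compressor parameters trade communication volume against model quality.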

Sample and Feature Selection
Another approach to improve communication efficiency is to reduce the amount of data used for training and inference. For example, Coreset-VFL [69] constructs a coreset of samples to alleviate the communication burden, while FedSDG-FS [70], SFS-VFL [71], LESS-VFL [72], FEAST [73] and VFLFS [74] filter out unimportant features to save communication costs.

Improving Effectiveness
Conventional VFL is only able to utilize aligned labeled samples. However, real-world applications often have limited aligned samples, especially as the number of parties grows. Labeled samples are also scarce in many cases, resulting in unsatisfactory performance. Moreover, collaborative inference is required since each party only has a sub-model after training.
To address these limitations, the literature has proposed various directions toward better utilizing available data to build a joint VFL model or helping participating parties build local predictors.
For brevity, we discuss existing works through a two-party VFL setting involving an active party A and a passive party B. We summarize these works in Table 4 and discuss them in the rest of this section. To better explain these works, we depict a general virtual dataset formed by the two parties (see Figure 5). We dissect this virtual dataset into several sub-datasets to illustrate which portions of the virtual dataset are utilized by a VFL algorithm to train models, as reported in Table 4. Specifically, $D$ denotes the labeled and aligned samples, which are used by the conventional VFL formulated in Eq. (1), whereas $D^{au}$ denotes aligned but unlabeled samples. $D^{uu}_A$ and $D^{uu}_B$ denote unaligned and unlabeled samples of party A and party B, respectively. $D^{ul}_A$ denotes unaligned and labeled samples of party A.
Table 4: Summary of existing works that aim to improve the effectiveness of VFL. Semi-SL, Self-SL, KD, and TL represent semi-supervised learning, self-supervised learning, knowledge distillation, and transfer learning, respectively. √ indicates that the corresponding portion of data (see Figure 5) is utilized by a specific VFL algorithm. Note that VFed-SSD has two objectives: one is to build a local predictor for the active party, and the other is to build a joint predictor.

Self-supervised Approaches
Recently, self-supervised learning (Self-SL) has been introduced to VFL to improve the performance of the VFL model by exploiting unlabeled samples, which are not used in conventional VFL. For illustrative purposes, we consider a two-party VFL scenario and rewrite Eq. (1) as follows:
$$\min_{\psi_A, \theta_A, \theta_B} \ell(\psi_A, \theta_A, \theta_B; D) \qquad (4)$$
where $\psi_A$ parameterizes the global module of the active party A, and $\theta_A$ and $\theta_B$ parameterize the local models of parties A and B. Self-SL-based VFL approaches proposed in the literature typically train the participating parties' models $\psi_A$, $\theta_A$, and $\theta_B$ by minimizing a Self-SL loss based on unlabeled samples in addition to the main task loss defined in Eq. (4). We formulate a general Self-SL objective in VFL as follows:
$$\min_{\psi_A, \theta_A, \theta_B} \ell_{\text{Self-SL}}(\psi_A, \theta_A, \theta_B; D^{au}, D^{uu}_A, D^{uu}_B) \qquad (5)$$
where $\ell_{\text{Self-SL}}$ is the self-supervised learning loss that optimizes $\psi_A$, $\theta_A$ and $\theta_B$ for learning good representations using unlabeled data.
Li et al. [80] proposed VFed-SSD, which pretrains the models $\psi_A$, $\theta_A$ and $\theta_B$ through Eq. (5) based on positive and negative sample pairs formed from aligned data $D^{au}$ using a matched pair detection (MPD) technique. Then, VFed-SSD finetunes the pretrained models through Eq. (4) based on labeled and aligned samples $D$. He et al. [81] proposed FedHSSL, a federated hybrid self-supervised learning framework that pretrains $\theta_A$ and $\theta_B$ through Eq. (5) based on cross-party views of aligned samples $D^{au}$ and local views (via data augmentations) of unlabeled local samples $D^{uu}_A$ and $D^{uu}_B$. Then, FedHSSL finetunes $\psi_A$ and the pretrained models $\theta_A$ and $\theta_B$ through Eq. (4) based on $D$. Feng [74] proposed the VFLFS algorithm, which optimizes Eq. (4) and Eq. (5) in an end-to-end manner. It trains local models $\theta_A$ and $\theta_B$ using autoencoders based on unaligned data $D^{uu}_A$ and $D^{uu}_B$, and simultaneously finetunes these local models and the global module $\psi_A$ based on labeled aligned samples $D$.

Semi-supervised Approaches
Rather than boosting representation learning capability through self-supervised learning, Kang et al. [83] and Yitao et al. [84] proposed semi-supervised learning approaches that augment the labeled and aligned samples $D$ to boost the performance of the VFL model. We formulate a general Semi-SL-based VFL objective in Eq. (6), where $\ell_{\text{Semi-SL}}$ is the semi-supervised learning loss that aims to expand $D$ by pseudo-labeling unlabeled samples or adding newly labeled samples while achieving maximal stability and precision in labeling the newly added samples.
Kang et al. [83] proposed a Semi-SL algorithm named FedCVT to implement Eq. (6). More specifically, FedCVT estimates representations for missing features and predicts pseudo-labels for unlabeled samples to obtain an expanded training set, denoted as $\tilde{D}$. To improve the quality of $\tilde{D}$, FedCVT cherry-picks the pseudo-labeled samples added to $\tilde{D}$ through an ensemble approach. Then, FedCVT trains the VFL model based on $\tilde{D}$. Yitao et al. [84] proposed FedMC, which integrates data collaboration [93] into VFL to implement Eq. (6). FedMC first forms a latent feature space using $D$. In this latent feature space, it measures the distance between each pair of unaligned samples, one from the active party and one from the passive party. Then, FedMC aligns the two samples in a pair and adds the aligned sample to $D$ if their distance is less than a threshold, forming an expanded training set $\tilde{D}$. Next, FedMC trains the VFL model based on $\tilde{D}$.

Knowledge Distillation-based Approaches
In conventional VFL, the active party A cannot make inferences alone, which limits the availability of the active party's prediction service. Some studies [85,86,80,87] proposed methods to help party A build a local predictor instead of a VFL model while still benefiting from VFL training. To this end, they typically leverage Knowledge Distillation (KD) techniques to transfer the knowledge of teacher models obtained through VFL to party A's local models to enhance their performance. We formulate a general knowledge distillation-based VFL objective as follows:
$$\min_{\psi^s_A, \theta^s_A} \ell_A(\psi^s_A, \theta^s_A; D^{ul}_A) + \gamma \, \ell_{\text{KD}}(\psi^s_A, \theta^s_A; \psi^t_A, \theta^t_A, \theta^t_B) \qquad (7)$$
where $\ell_{\text{KD}}$ is the knowledge distillation loss that transfers knowledge from the teacher models $\psi^t_A$, $\theta^t_A$ and $\theta^t_B$ to party A's local models $\psi^s_A$ and $\theta^s_A$, $\ell_A$ is party A's task loss that optimizes $\psi^s_A$ and $\theta^s_A$ based on labeled samples $D^{ul}_A$, and $\gamma$ is the hyperparameter that controls the strength of KD. $\psi^t_A$, $\theta^t_A$ and $\theta^t_B$ can be pretrained through Eq. (4) or Eq. (5).
Wang et al. [85] proposed a vertical federated knowledge transfer approach (VFedTrans) via representation distillation that enables the active party A to make inferences on unaligned local data. To this end, VFedTrans first learns federated representations through FedSVD [94] based on aligned samples $D^{au}$, and then utilizes autoencoders as teacher models to transfer the knowledge encoded in the federated representations to the active party A's local models $\psi^s_A$ and $\theta^s_A$ as students. Ren et al. [86] proposed VFL-Infer, a VFL framework that pretrains teacher models $\psi^t_A$, $\theta^t_A$ and $\theta^t_B$ through Eq. (4), and then leverages these teacher models to help party A train its local models $\psi^s_A$ and $\theta^s_A$ through Eq. (7). Li et al. [80] proposed VFed-SSD, which trains teacher models through Eq. (5) using cross-party contrastive learning based on aligned data $D^{au}$ and distills knowledge from the teacher models to help the active party A train its local models $\psi^s_A$ and $\theta^s_A$. In another work along this line of research, Li et al. [87] proposed joint privileged learning in the VFL setting (VFL-JPL) to train local models for the active party A. By employing feature imitation and a ranking consistency restriction, VFL-JPL can effectively train the active party A's local models through Eq. (7) based on both aligned and unaligned samples as well as knowledge distilled from teacher models pretrained through Eq. (4).

Transfer Learning-based Approaches
Transfer learning (TL)-based VFL approaches [88,90,91,92,89] treat the active party A as the source domain with a large corpus of labeled samples and the passive party B as the target domain with only unlabeled samples or a limited amount of labeled samples. These approaches leverage VFL as the bridge to transfer knowledge from party A to party B. We formulate a general TL-based VFL objective in Eq. (8), where $\ell_{\text{TL}}$ is the transfer learning loss that aims to reduce the domain discrepancy between the source and target domains, and $\ell_A$ is the source party A's task loss that trains models using labeled samples of the source domain. $\ell_{\text{TL}}$ and $\ell_A$ together transfer the knowledge from the source domain to the target domain. The target party B utilizes its task loss $\ell_B$ to further adapt the transferred knowledge to its local task using labeled samples $D_B$ of the target domain, if $D_B$ is available. $\phi_B$ is the target party B's local predictor. The target party B may or may not need the help of party A for inference, depending on the specific application of Eq. (8).
Liu et al. [90] proposed a secure federated transfer learning framework (SFTL), the pioneering work exploring transfer learning in VFL. SFTL first trains feature extractors $\theta_A$ and $\theta_B$ to map two heterogeneous feature spaces into a common latent subspace through aligned samples $D^{au}$. In this latent subspace, the passive party B's local models $\phi_B$ and $\theta_B$ are trained using data $D_B$. As a follow-up work, Sharma et al. [91] leverage a more efficient secure computation framework named SPDZ [95] to further enhance the efficiency of SFTL.
SFTL can only transfer knowledge from one source party to one target party. To support multi-party knowledge transfer, Feng et al. [88] proposed Multi-Participant Multi-Class VFL (MMVFL), which leverages consistency regularization to transfer label information from the active party to all passive parties such that each passive party can learn a local predictor with its pseudo-labeled samples. Feng et al. [89] further proposed semi-supervised federated heterogeneous transfer learning (SFHTL), which utilizes unaligned samples of all parties and aligned samples to build a local predictor for each party. Specifically, SFHTL utilizes an autoencoder to learn local representations from each party and then aggregates local representations to form global representations, through which labels of the active party are propagated to each passive party. With labeled local samples, each party can train its local predictor independently.
Kang et al. [92] proposed PrADA to address the label deficiency of VFL through domain adaptation (DA). PrADA involves a label-rich source party A, a label-deficient target party B, and a third party that provides rich features for both parties A and B. PrADA treats the third party as a bridge to transfer the knowledge from the source party to the target party and leverages adversarial domain adaptation to minimize the domain discrepancy between the source and target domains.

Preserving Data Privacy and Defending Against Attacks
In a VFL system, privacy threats may emerge from the inside or the outside of the system, or both. If the attacker attempts to learn information about the private data of other parties without deviating from the VFL protocol, it is regarded as honest-but-curious. The attacker is regarded as malicious if it fails to adhere to the VFL protocol. In this section, we first review privacy-preserving protocols involved in the typical VFL framework (Sec. 5.1 and Sec. 5.2), followed by discussions on emerging research on attacks and defense strategies (Sec. 5.3 and Sec. 5.4).

Private Entity Alignment
Private Set Intersection (PSI) is the most common method for privacy-preserving entity alignment in VFL. In a PSI protocol, all parties cooperatively find the common ID intersection without revealing any other information. PSI protocols can be realized using various techniques, such as encryption and signature strategies [96] and oblivious transfer [97,98]. The standard PSI protocol is typically applied to a two-party VFL system. The works in [99,100] proposed methods for entity matching and PSI protocols that can be applied to multiple parties. PSI still reveals the common ID information. Several attempts have been made to enhance the privacy of the intersection ID set. The work in [101] proposed an adapted PSI protocol for asymmetrical ID alignment that uses the Pohlig-Hellman encryption scheme and an obfuscated set to help protect the entity information of a weaker party, one with far fewer samples than the other party, from being exposed. The work in [102] proposed a method called FLORIST that safeguards the entity membership information of all parties by using a union ID set and generating synthetic data for missing IDs in the union set. However, this method is limited to unbalanced binary classification tasks and incurs additional computational costs for generating and training on the synthetic data.
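To make the idea of PSI concrete, the following toy sketch uses commutative exponentiation over a prime field, one well-known way to build a two-party PSI. It is not the protocol of [96], [101], or any other cited work; the modulus, the hash-to-group mapping, the sample IDs, and all function names are illustrative assumptions, and the parameters are far too weak for real use.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # toy prime modulus (a Mersenne prime); not a secure choice

def hash_to_group(identifier: str) -> int:
    """Map an ID to a non-zero element modulo P."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P or 1

def new_secret_exponent() -> int:
    """Pick an exponent coprime to P - 1 so that blinding is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def blind(ids, key):
    return [pow(hash_to_group(i), key, P) for i in ids]

def reblind(values, key):
    return [pow(v, key, P) for v in values]

ids_a = ["u101", "u102", "u103", "u105"]   # party A's sample IDs (hypothetical)
ids_b = ["u102", "u104", "u105", "u106"]   # party B's sample IDs (hypothetical)
key_a, key_b = new_secret_exponent(), new_secret_exponent()

# Round 1: each party sends its singly blinded IDs to the other party.
a_to_b = blind(ids_a, key_a)
b_to_a = blind(ids_b, key_b)

# Round 2: B re-blinds A's values and returns them in order; A re-blinds B's values.
a_double = reblind(a_to_b, key_b)          # computed by B, sent back to A
b_double = set(reblind(b_to_a, key_a))     # computed by A

# A learns which of its own IDs are shared, because H(id)^(ab) = H(id)^(ba) mod P.
common_for_a = [i for i, v in zip(ids_a, a_double) if v in b_double]
```

As noted above, the result still reveals the intersection itself to the party that computes it.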

Privacy-Preserving Training Protocols
VFL approaches proposed in the literature adopt various security definitions and privacy-preserving protocols. In this section, we summarize these protocols based on what is protected and exposed during VFL training and inference. We first provide the basic protocol of VFL. We then discuss other protocols that adopt either relaxed or enhanced privacy constraints. Figure 6 illustrates these protocols.
Basic Protocol (P-1): [...] $\partial \ell / \partial H_k$ instead of raw data are transmitted, preventing private data from being revealed. Liu et al. [18] provided a security proof that the private features $x_k$ cannot be exactly recovered under the P-1 protocol when no prior knowledge about the data is available.
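The message flow of the basic protocol can be illustrated with a small sketch. It assumes a two-party vertical logistic regression in which the global module simply sums the local outputs; the function names, the toy data, and the learning rate are hypothetical. Only the local outputs $H_B$ and the sample-level gradients $\partial \ell / \partial H_B$ cross the party boundary, while features and model parameters stay local.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_gradient(grad_h, features):
    """Batch-level gradient of a linear local model via the chain rule."""
    return [sum(g * row[j] for g, row in zip(grad_h, features))
            for j in range(len(features[0]))]

def train_p1(x_a, x_b, y, steps=200, lr=0.5):
    theta_a = [0.0] * len(x_a[0])   # active party A (holds labels y)
    theta_b = [0.0] * len(x_b[0])   # passive party B
    n = len(y)
    losses = []
    for _ in range(steps):
        # B computes its local output H_B and sends it to A.
        h_b = [dot(theta_b, row) for row in x_b]
        # A computes H_A, aggregates by summation, and evaluates the loss.
        h_a = [dot(theta_a, row) for row in x_a]
        probs = [sigmoid(a + b) for a, b in zip(h_a, h_b)]
        losses.append(-sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                           for t, p in zip(y, probs)) / n)
        # A sends the sample-level gradients dl/dH_B back to B.
        grad_h = [(p - t) / n for p, t in zip(probs, y)]
        # Each party updates its own parameters locally.
        theta_a = [w - lr * g for w, g in zip(theta_a, local_gradient(grad_h, x_a))]
        theta_b = [w - lr * g for w, g in zip(theta_b, local_gradient(grad_h, x_b))]
    return theta_a, theta_b, losses

# Hypothetical vertically partitioned data: same four users, different features.
X_A = [[1.0], [2.0], [-1.0], [-2.0]]
X_B = [[0.5, 1.0], [1.0, -0.5], [-0.5, -1.0], [-1.0, 0.5]]
Y = [1, 1, 0, 0]

_, _, history = train_p1(X_A, X_B, Y)
print(history[0], history[-1])
```

With a summation aggregator, the gradient with respect to each party's output is the same vector, which is why a single `grad_h` serves both updates in this sketch.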
Relaxed Protocol (P-0): Nonprivate label or model. In the literature and in applications, there are also cases where the security assumption of P-1 is relaxed, resulting in a few protocol variants, including:
• Nonprivate labels. These are cases where labels can be accessed by all parties for training and the security model protects features only [27,68,50].
• Nonprivate global module or local models.These are cases where the global module [103] or local models [104,105,106,32] are considered white-boxed to adversaries.
Since these variants relax the basic security requirement of VFL, we assign a lower level to them (P-0), and we use P-0(y) and P-0(g) to denote the nonprivate label and nonprivate model scenarios, respectively.
Building on the basic protocol P-1, privacy-preserving techniques have been adopted to further protect the training procedure, resulting in protocols with enhanced privacy. Below we describe the most representative protocols based on what is exposed, in ascending order of privacy level.
Standard Protocol (P-2): Protecting transmitted intermediate results. In this protocol, P-1 is satisfied. In addition, the intermediate results transmitted between parties are protected by cryptographic protocols, while other training information processed within each party is left in plaintext to balance privacy and efficiency. For example, HE [3,107] can be adopted to encrypt the sample-level outputs $H_k$ and gradients $\partial \ell / \partial H_k$ transmitted between each passive party $k$ and the active party to thwart privacy attacks. The batch-level gradients $\nabla_{\theta_k} \ell$ computed within party $k$ are in plaintext for efficient training. SecureBoost [19] is another example where HE is used to protect transmitted intermediate results, but the aggregated gradients are exposed to the active party.
Enhanced Protocol (P-3): Protecting entire training protocol. In this protocol, P-2 is satisfied. In addition, no training information is revealed to any party except for the resulting trained models. For example, batch-level information such as the local model gradients $\nabla_{\theta_k} \ell$ and parameters $\theta_k$ can be protected by adopting Secure Multi-Party Computation (MPC) [23]. Most existing works focus on the honest-but-curious assumption, which assumes that the adversary follows the VFL protocol. To further handle malicious settings, more advanced privacy-preserving techniques such as SPDZ [91] have also been integrated with VFL [91,108].
Strict Protocol (P-4): Protecting training protocol and learned models. This protocol further enhances P-3 to protect the final learned models using privacy-preserving techniques such as secret sharing (SS) [44] and hybrid schemes that combine HE and SS [109,110]. It reveals only the final inference results and nothing else. This protocol addresses the emerging privacy challenge that a local model may be exploited by its owner to infer private information about other parties [28,19,44]. However, it requires complex computations, which limits its efficiency and scalability.
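Secret sharing, one of the building blocks named for P-3 and P-4, can be sketched in a few lines. The example below assumes additive sharing over the ring of integers modulo $2^{64}$ with a fixed-point encoding; the constants `MODULUS` and `SCALE` and all function names are illustrative choices, not taken from [44] or [109,110].

```python
import secrets

MODULUS = 2**64   # ring size (assumed for illustration)
SCALE = 2**16     # fixed-point scaling factor (assumed)

def encode(x):
    """Encode a real number as a ring element."""
    return round(x * SCALE) % MODULUS

def decode(v):
    """Decode a ring element; the upper half of the ring is negative."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / SCALE

def share(value, n_parties):
    """Split a value into n additive shares; any n-1 shares look random."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((encode(value) - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return decode(sum(shares) % MODULUS)

def add_shares(a, b):
    """Each party adds its two shares locally; no communication is needed."""
    return [(x + y) % MODULUS for x, y in zip(a, b)]

grad_share = share(1.5, 3)     # e.g. a model parameter or gradient entry
other_share = share(-0.25, 3)
print(reconstruct(add_shares(grad_share, other_share)))  # 1.25
```

Addition is free in this scheme, whereas multiplication of shared values requires interaction between parties, which is one source of the efficiency cost mentioned above.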

Defending against Data Inference Attacks
In a typical VFL system, both features and labels are considered private, whereas most data attacks on HFL scenarios consider features as the target. Therefore, both feature and label protection are critical research subjects for VFL. Figure 7 illustrates data inference attacks in VFL.

Label Inference Attacks
In real-world scenarios, labels such as patients' diagnostic results and individual loan default records are considered sensitive information that only authorized institutions can access. A passive party B (i.e., the attacker) may try to infer the valuable labels owned by the active party A using the information it accumulates during training or inference. It may follow the protocol passively under the honest-but-curious security assumption or actively tamper with the protocol under the malicious assumption. The literature has proposed various label inference attacks under various security protocols, as summarized in Table 5.
Label inference attacks using sample-level gradients. When the VFL applies the P-1 protocol, a passive party B (i.e., the attacker) has access to the sample-level gradients $\partial \ell / \partial H_B$ sent backward from the active party A. The attacker can exploit this information to conduct Direct Label Inference (DLI) [111,28]. DLI can achieve accuracy up to 100% if the active party adopts a nontrainable global module $F_A$ such as a softmax function, because the gradient vector for each sample has only one element whose sign is opposite to that of all the others, thereby disclosing the label [111]. For special scenarios like binary classification, the attacker can deduce labels from sample-level gradients by mounting a Norm Scoring (NS) or Direction Scoring (DS) attack [111] even when the global module $F_A$ is a trainable model (e.g., a neural network).
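The sign argument behind DLI can be checked with a short sketch. It assumes the global module is a plain softmax followed by the cross-entropy loss, so that the gradient with respect to the logits equals the predicted probabilities minus the one-hot label; the function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_wrt_logits(logits, label):
    """Cross-entropy gradient w.r.t. logits: softmax(logits) - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)]

def direct_label_inference(grad):
    """The true class is the only coordinate with a negative gradient."""
    negatives = [j for j, g in enumerate(grad) if g < 0]
    return negatives[0] if len(negatives) == 1 else None

# The attack reads the label even when the model's prediction is wrong.
batch_logits = [[0.1, 0.2, 2.0], [1.5, -0.3, 0.0], [0.0, 3.0, -1.0]]
true_labels = [0, 1, 2]
received = [grad_wrt_logits(z, y) for z, y in zip(batch_logits, true_labels)]
print([direct_label_inference(g) for g in received])  # [0, 1, 2]
```

The example deliberately uses labels that differ from the highest-scoring class to show that the leak comes from the gradient sign pattern and not from the quality of the model.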
Label inference attacks using batch-level gradients. When the VFL applies the P-2 protocol, no intermediate result exchanged among parties is revealed to any party (e.g., it is encrypted by HE [107]). Thus, the passive party B (i.e., the attacker) cannot obtain the sample-level gradients $\partial \ell / \partial H_B$, but it may have access to the batch-level (i.e., local model) gradients $\nabla_{\theta_B} \ell$. Studies have shown that it is still possible to infer the true labels with high accuracy through the gradient inversion attack (GI) [34,113] or the residue reconstruction attack (RR) [112] using only the local model gradient. Following the same philosophy as the deep leakage from gradients method [119], passive party B leverages GI to reconstruct the active party's labels by minimizing the distance between the predicted local model gradients $\nabla_{\theta_B} \hat{\ell}$ and the ground-truth ones $\nabla_{\theta_B} \ell$. We formulate a general GI attack for inferring labels as follows: [...] where [...] by solving an optimization problem as follows [120]: [...] where $\xi$ is the variable representing the plaintext value of $[[\partial \ell / \partial H_B]]$ and $\xi^*$ is the reconstructed value of $\partial \ell / \partial H_B$, based on which DLI, NS or DS can be applied to infer labels.
Label inference attacks using trained models. When the VFL applies the P-3 protocol, no training information is revealed to any party except the final trained local model. The P-3 protocol can be achieved through MPC-based VFL approaches [23,110]. A possible label inference strategy is for a passive party to fine-tune its trained local model with an inference head using auxiliary labeled data, and then predict labels using the complete model (i.e., the fine-tuned local model with the inference head). This attack is called Passive Model Completion (PMC) [28], in which the passive party is semi-honest. An active version of model completion (AMC) is also proposed in [28]. It leverages a malicious local optimizer instead of a normal one (e.g., Adam) to trick the trained federated model into relying more on the local model of the attacker than on those of other parties, such that the attacker obtains a local model with better performance. Model completion (MC) relies heavily on the adequacy of the auxiliary data owned by the passive party acting as the attacker. Sun et al. [114] proposed a spectral attack (SA) that enables a passive party to predict labels by clustering the outputs of the trained local model, thereby eliminating the dependency on auxiliary data. Qiu et al. [103] proposed a Label-related Relation Inference (LRI) attack targeting label-related relations in the graph owned by the active party, assuming the attacker has access to the global module and can obtain prediction results. LRI first recovers the active party's local outputs using an optimization-based method. It then recovers relations by forming an adjacency matrix based on the outputs of the attacker's and the active party's local models and the prediction results.
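The intuition behind recovering sample-level gradients from a batch-level gradient can be sketched under strong simplifying assumptions: a linear local model $H_B = X_B \theta_B$, a binary logistic loss, and a batch size equal to the number of features, so that $\nabla_{\theta_B} \ell = X_B^\top \xi$ is a square linear system in the residue vector $\xi$. This is an illustration of the idea only, not the optimization problem of [112] or [120]; all names and numbers are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def batch_gradient(x_b, residues):
    """Plaintext local-model gradient: X_B^T r, with r_i = p_i - y_i."""
    return [sum(row[j] * r for row, r in zip(x_b, residues))
            for j in range(len(x_b[0]))]

def reconstruct_residues(x_b, grad_theta):
    """Attacker side: recover xi from X_B^T xi = grad_theta."""
    x_transposed = [list(col) for col in zip(*x_b)]
    return solve_linear(x_transposed, grad_theta)

# Hypothetical batch: the attacker knows X_B and the plaintext batch gradient.
X_B = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
hidden_labels = [1, 0, 1]
hidden_probs = [0.7, 0.2, 0.6]
residues = [p - y for p, y in zip(hidden_probs, hidden_labels)]
observed = batch_gradient(X_B, residues)

xi = reconstruct_residues(X_B, observed)
print([1 if v < 0 else 0 for v in xi])  # sign rule as in DLI: [1, 0, 1]
```

When the batch is larger than the feature dimension the system is underdetermined, which is where an optimization formulation and additional constraints become necessary.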

Feature Inference Attacks
An individual's original features are at the heart of privacy protection because they contain sensitive information that must not be shared. Various attack methods have been proposed to infer features from shallow models (e.g., logistic regression and decision trees) [116,104,115] and complex models (e.g., neural networks and random forests) [104,105,106,32]. We summarize existing feature inference attacks in Table 5. These attacks typically assume a setting in which the active party A (with labels) is the attacker who attempts to recover the features of a passive party B. The attacker in a proposed feature inference algorithm may or may not know the passive party's model parameters $\theta_B$; these cases are referred to as the white-box and black-box settings, respectively.
Feature inference attacks under white-box setting. Under the white-box setting, the attacker (i.e., the active party or the server) has access to its own model $G_A$, the passive party's local model $G_B$, the aligned data indices and possibly the labels. In the literature, there are mainly two ways to conduct white-box feature inference attacks: model inversion [105,106] during the inference phase and gradient inversion during the training phase [32].
The core idea of model inversion (MI) is to optimize a variable $\hat{x}_B$ to approximate the passive party's real input data $x_B$ such that the predicted output $\hat{v}$ of the VFL model is close enough to the real output $v$ computed based on $x_B$. We formulate a general MI attack as follows: [...] where $L_{MI}$ is the loss function that measures the distance between $\hat{v}$ and $v$ and is minimized to optimize $\hat{x}_B$, and $R_{MI}$ regularizes the variable $\hat{x}_B$ based on prior knowledge. $\hat{v}$ is computed by: [...] where $x_A$ and $y_A$ are the features and labels belonging to the active party; the local models $G_A$ and $G_B$ can be linear models, tree models or neural network models, and their model parameters $\theta_A$ and $\theta_B$ are fixed during the optimization; $F_A$ is the global module that aggregates the outputs of the local models and generates $\hat{v}$. He et al. [105] and Jiang et al. [106] proposed similar white-box model inversion attacks under the splitNN and aggVFL settings, respectively. Luo et al. [104] proposed three white-box feature inference attacks to learn $\hat{x}_B$ for three different models. These attacks can be seen as specialized MI. More specifically, they designed an Equality Solving Attack (ESA) for logistic regression, a Path Restriction Attack (PRA) for decision trees, and a Generative Regression Network (GRN) for attacking neural networks and random forests. The three attacks generally follow the optimization problem defined in Eq. (11).
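A white-box MI attack of this kind can be sketched as plain gradient descent on $\hat{x}_B$. The sketch assumes an objective of the form $L_{MI}(\hat{v}, v) + R_{MI}(\hat{x}_B)$ with a squared-error loss and an optional L2 regularizer, a linear local model $G_B(x) = W_B x$, and a global module that sums the local outputs. These choices, the function names, and the numbers are illustrative assumptions rather than the formulation of [105] or [106].

```python
def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def model_inversion(w_b, h_a, v, steps=200, lr=0.1, reg=0.0):
    """Optimize x_hat so that the predicted output matches the observed v."""
    x_hat = [0.0] * len(w_b[0])
    for _ in range(steps):
        # Predicted output: v_hat = H_A + W_B x_hat (sum aggregation assumed).
        v_hat = [a + b for a, b in zip(h_a, matvec(w_b, x_hat))]
        residual = [p - q for p, q in zip(v_hat, v)]
        # Gradient of ||v_hat - v||^2 + reg * ||x_hat||^2 w.r.t. x_hat.
        grad = [2.0 * sum(w_b[i][j] * residual[i] for i in range(len(w_b)))
                + 2.0 * reg * x_hat[j]
                for j in range(len(x_hat))]
        x_hat = [x - lr * g for x, g in zip(x_hat, grad)]
    return x_hat

# White-box knowledge of the attacker: W_B, its own output H_A, and v.
W_B = [[1.0, 0.5], [-0.5, 1.0]]
H_A = [0.2, -0.1]
x_b_private = [0.8, -0.3]                    # unknown to the attacker
v_observed = [a + b for a, b in zip(H_A, matvec(W_B, x_b_private))]

print(model_inversion(W_B, H_A, v_observed))  # close to [0.8, -0.3]
```

Recovery is exact here only because `W_B` is invertible. When the local model maps many inputs to the same output, the regularizer and any prior knowledge decide which candidate the attack returns.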
The gradient inversion (GI) attack was initially proposed in [119] under the HFL setting. CAFE [32] extended GI to a white-box VFL setting, where the attacker has access to the passive party's model parameters and gradients as well as the aligned data indices. With this knowledge, CAFE can achieve state-of-the-art data recovery quality even with large batch sizes.
Feature inference attacks under black-box setting. Attackers under the black-box setting typically need some prior knowledge about the model or data of the passive party in order to conduct feature inference successfully.
Peng et al. [115] proposed a Binary Feature Inference attack (BFI) to reconstruct binary features from the passive party's local model output $H_B$ (P-1 protocol), assuming the local model has only one fully-connected layer. In addition, BFI adopts the Leverage Score Sampling technique [126] to boost the attack efficiency. Weng et al. [116] and Hu et al. [117] proposed a Reverse Multiplication Attack (RMA) and a Protocol-aware Active Attack (PAA), respectively, to infer the private features $x_B$ of the passive party B in the vertical logistic regression setting that applies the P-2 protocol. In RMA, the attacker infers the features $x_B$ of the passive party B by solving linear equations in which $x_B$ is the only unknown variable, assuming the coordinator helps decrypt ciphertexts. In PAA, the attacker first obtains the passive party B's outputs by solving a linear system and then utilizes these outputs to infer the features of the passive party B. Weng et al. [116] also proposed a Reverse Sum Attack (RSA) targeting SecureBoost. RSA aims to infer the partial order of the passive party's input features by encoding magic numbers into the least significant bits of the encrypted first- and second-order gradients. He et al. [105] proposed a black-box model inversion (MI) attack to learn $x_B^*$ under the splitNN setting. More specifically, the attacker first trains a shadow model $\hat{G}_B$ that mimics the behavior of the local model $G_B$ using some auxiliary data, and then learns $x_B^*$ according to Eq. (11) and Eq. (12) with $\hat{G}_B$ in place of $G_B$. Jiang et al. [106] proposed a similar MI method under the aggVFL setting.
Attribute Inference Attacks. Aside from original features, privacy-sensitive attributes not represented in the training data may also be inferred through an overlearned model [118].
In the rest of this subsection, we discuss defense strategies that alleviate the threat posed by these attacks.

Cryptographic Defense Strategies
Cryptographic Defense Strategies (CDS) use secure computation to evaluate functions over multiple parties in such a way that only the necessary information is exposed to the intended participants, while private data cannot be inferred by possible adversaries. Today, large-scale deployment of CDS for machine learning models, especially deep learning models, is still challenging. Existing works in this direction focus on improving the privacy-efficiency trade-off through careful design of privacy-preserving protocols. We adopt the protocols defined in Sec. 5.2 as a vehicle to compare representative CDS, as listed in Table 6. We consider a defense to follow a particular protocol only when it satisfies all requirements of that protocol.
A line of research works [107,21,121,23,3,122,110] focuses on designing CDS to protect the data privacy of vertical linear and logistic regressions. Gascon et al. [21] proposed a hybrid MPC protocol that combines Yao's garbled circuits with tailored protocols for securely solving vertical linear regression (GasconLR). Hardy et al. [107] proposed a HE-based scheme for training vertical logistic regression (HardyLR). The follow-up works BaiduLR [121] and SecureLR [122] remove the coordinator from the training and inference procedure by relaxing either the efficiency or the privacy constraint. HardyLR, BaiduLR and SecureLR are vulnerable to privacy attacks targeting batch-level gradients (Sec. 5.3.1). To address this limitation, Chen et al. [23] proposed a hybrid defense, named CAESAR, that combines HE and MPC to encrypt all intermediate results during the training and inference phases except the resulting trained models. The HeteroLR module of FATE [110] extends CAESAR further to encrypt the passive party's local model after training.
Designing CDS for vertical neural networks (VNN) is more challenging in terms of both computation and communication. Therefore, current CDS for VNN either target shallow neural networks [90,91,109] or are tailored to protect specific intermediate results exposed to the adversary [92,124] in order to balance privacy and efficiency. SFTL [90] designed a HE-based protocol and an SS-based protocol to encrypt information shared between two parties that adopt neural networks with one or two layers. The follow-up work [91] leverages SPDZ [95] to further enhance the efficiency of SFTL. BlindFL [109] builds privacy-preserving VNN models through a federated source layer (FSL), which leverages a hybrid scheme mixing HE and MPC to guarantee the privacy of the original data. ACML [124] builds privacy-preserving splitVFL and introduces a HE-equipped interactive layer between the active party and the passive party to protect the passive party's local model output. PrADA [92] extends the interactive layer of ACML to the splitVFL setting in which the global module is a linear model and the local models are neural networks. FedSGC [123] utilizes HE to protect transmitted graph structural information.
For tree-based VFL, SecureBoost [19], SecureBoost+ [38], SecureXGB [40], and MP-FedXGB [43] integrate XGBoost into VFL. SecureBoost and SecureBoost+ exploit additive homomorphic encryption (HE) to encrypt the information transmitted between parties to protect private data. SecureXGB protects all intermediate results through a hybrid scheme combining additive HE and secret sharing (SS), thereby enhancing the privacy level. MP-FedXGB proposed an SS scheme with distributed optimization to support scenarios with more than two parties. SecureGBM [39] is a LightGBM-based VFL approach that uses additive HE to protect transmitted information. Pivot [44] utilizes SS mixed with additive HE to guarantee that no intermediate information is disclosed. It additionally proposed an enhanced protocol to conceal the values of leaf labels and split thresholds from all participating parties, as well as protocols to handle malicious parties. Targeting SecureBoost, Chamani et al. [125] introduced a feature inference attack that leverages the approximate distribution of feature values and proposed two countermeasures based on a Trusted Execution Environment (TEE) to mitigate feature leakage risks.
CDS are typically applied to utility-critical applications, such as finance and healthcare, to achieve lossless model utility (i.e., performance) while maintaining an acceptable balance between privacy and efficiency. For applications in which efficiency is a major concern or CDS are not feasible, non-cryptographic defense strategies are preferred.

Non-cryptographic Defense Strategies
Non-cryptographic Defense Strategies preserve privacy essentially by reducing the dependence between private data and the information exposed to the attacker. There are several representative ways to reduce such dependency, including adding noise, gradient discretization [131], gradient sparsification [132,133] and their hybrids [134]. These methods typically exhibit a trade-off between utility and privacy.
Adding Noise (DP) [119,28,111,135] is a basic defense method for reducing leakage in FL. Noise following a Laplace or Gaussian distribution is commonly used. In VFL settings, noise is typically added to the gradients or intermediate results shared with other parties to defend against feature or label leakage [28,82]. [136] introduced a hybrid differentially private VFL method that adds Gaussian noise to all parties' intermediate results to achieve both local and joint differential privacy. [41,42] applied differentially private noise to federated gradient-based decision trees in customized ways to achieve a good privacy-utility trade-off. Chen et al. [137] integrated GNN into the splitVFL setting and leveraged DP-enhanced additive secret sharing to protect data privacy.
Gradient Discretization (GD) [131] encodes originally continuous gradients into discrete ones, aiming to reduce the private information disclosed to the attacker so that the attacker cannot precisely infer private data from the discrete gradients. [34,28] leveraged a specialized version of GD, named DiscreteSGD, to defend against label inference attacks in VFL.
Gradient Sparsification (GS) [132] removes the portion of the original gradients with small absolute values by setting them to 0 while preserving the convergence of the original VFL task. Similar to GD, GS leverages information reduction to mitigate privacy leakage. GS is readily applied to distributed learning and HFL scenarios [132,133]. It is also effective in defending against various label inference attacks in VFL [34,28].
A feasible direction to achieve better trade-offs between privacy and utility is to design hybrid defense schemes that combine multiple defense strategies [134]. Another direction is to design specialized defense strategies tailored to specific data inference attacks.
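The three information-reduction operations can be sketched as transformations applied to a gradient vector before it is shared. The sketch below assumes uniform quantization levels for GD and a top-k magnitude rule for GS; these are common simplifications and not the exact procedures of DiscreteSGD [34,28] or [132]. The function names and parameters are hypothetical, and the noise routine by itself provides no formal DP guarantee, which would additionally require clipping and a calibrated noise scale.

```python
import random

def add_gaussian_noise(grad, sigma, rng):
    """Perturb each gradient entry with zero-mean Gaussian noise."""
    return [g + rng.gauss(0.0, sigma) for g in grad]

def discretize(grad, n_levels, bound):
    """Clip to [-bound, bound] and snap to n_levels evenly spaced values."""
    step = 2.0 * bound / (n_levels - 1)
    out = []
    for g in grad:
        clipped = max(-bound, min(bound, g))
        level = round((clipped + bound) / step)
        out.append(-bound + level * step)
    return out

def sparsify(grad, keep_ratio):
    """Keep the largest-magnitude entries and set the rest to zero."""
    k = max(1, int(len(grad) * keep_ratio))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]

rng = random.Random(0)
print(add_gaussian_noise([0.1, -0.5], sigma=0.01, rng=rng))
print(discretize([0.26, -0.9, 2.0], n_levels=5, bound=1.0))   # [0.5, -1.0, 1.0]
print(sparsify([0.1, -0.5, 0.03, 0.4], keep_ratio=0.5))       # [0.0, -0.5, 0.0, 0.4]
```

A hybrid defense in the sense of [134] would compose such steps, for example by sparsifying first and adding noise to the surviving entries. The order and parameters govern the privacy-utility trade-off.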

Emerging Specialized Defense Strategies
Emerging specialized defense strategies are designed to thwart attacks that are difficult to defend against with traditional defense strategies. We compare representative emerging defense strategies in Table 7.
Defenses against label inference attacks. Li et al. proposed MARVELL [111], which is tailored to defend against Norm Scoring (NS) and Direction Scoring (DS) attacks by adding optimized noise to the sample-level gradients. They also proposed a heuristic Max-Norm defense against the two attacks. Liu et al. [34] proposed label disguising methods, called Confusional AutoEncoder (CAE) and DiscreteSGD-enhanced Confusional AutoEncoder (DCAE), which directly protect label information by encoding the original real labels into soft fake labels with maximum confusion. PEloss [127] and dCorr [114] are two auxiliary losses proposed to defend against the Model Completion (MC) attack and the Spectral Attack (SA), respectively.
Both methods try to train the attacker's local model so that it has a large generalization error. Tan et al. [128] proposed a Random Masking (RM) defense against the Residue Reconstruction attack (RR) that injects zeros into randomly selected positions of the HE-encrypted sample-level gradients to prevent RR from reconstructing these gradients correctly. FedPass [130] leverages passport techniques to thwart both label and feature inference attacks.
Defenses against feature inference attacks. Fake Gradients (FG) [32] is proposed to defend against Catastrophic Data Leakage in VFL (CAFE) by replacing the true gradients with randomly generated ones while keeping their corresponding positions. Sun et al. [129] proposed DRAVL to defend against Model Inversion (MI) through adversarial training. In [115], a Masquerade Defense (MD) is proposed to thwart the Binary Feature Inference attack (BFI) by misleading the attacker into focusing on randomly generated binary features, thereby protecting the true binary features. Hu et al. proposed DP-Paillier-MGD [117] to thwart the Protocol-aware Active Attack (PAA) by masking encrypted sensitive information, which prevents the attacker from learning the precise value of the passive party's output and thereby its private features.
Adversarial training [138,118] and mutual information regularization [118] were proposed to safeguard sensitive attributes of training samples.

Defending against Backdoor Attacks
Unlike data leakage attacks, whose goal is to invade privacy and steal data, malicious backdoor attacks aim to mislead the VFL model or harm its overall performance on the original task. Typically, passive parties are the backdoor attackers, while the active party is the victim, since only the active party has labels. In the rest of this section, we summarize existing backdoor attacks and defenses.

Backdoor Attacks
Existing research on backdoor attacks can be divided into two main categories, targeted and non-targeted, depending on whether or not the attacker has a specific backdoor target. Figure 8 illustrates backdoor attacks, and Table 8 summarizes the settings and methods of existing backdoor attacks.
Targeted backdoor attacks secretly train a model that achieves high performance on both the original and the targeted backdoor tasks. The objective function of targeted backdoor attacks can be written as follows: [...] where $\tilde{y}_i$ is the prediction for sample $x_i$, the subscripts cln and poi are short for "clean" and "poisoned", respectively, and $\tau$ denotes the target label chosen by the attacker. Liu et al. [139] proposed a Label Replacement Backdoor attack (LRB), in which the attacker replaces the gradients of a triggered sample with those of a clean sample of the targeted class to achieve a high backdoor accuracy while keeping the main task accuracy at a high level. Pang et al. [140] introduced the Adversarial Dominating Input (ADI), which is an input sample with features that override all other features and lead to a certain model output, and proposed gradient-based methods in both white-box and black-box settings.

Table 8 (partial): Settings and methods of existing backdoor attacks.
Attack | VFL setting | Protocol | Phase | [...]
Adversarial attack [141,33] | splitVFL/aggVFL | P-1 | Training | -
Missing attack [33] | splitVFL/aggVFL | P-3 | Training | -
Graph-Fraudster [142] | splitVFL | P-2 | Inference | -

Table 9 (partial): Defenses against backdoor attacks.
Defense | VFL setting | Strategy | Attack type
[...] [34] | aggVFL | Add Noise | Targeted
GS [34] | aggVFL | Sparsify Gradient | Targeted
CAE [34] | aggVFL | HE+Disguise Label | Targeted
DCAE [34] | aggVFL | HE+Disguise Label+DG | Targeted
RVFR [33] | splitVFL | Robust Feature Sub-space Recovery | Targeted/Non-targeted
Non-targeted backdoor attacks, similar to the Byzantine attacks [143] that are typically studied in HFL, aim to hurt the convergence or the performance of the original task by using adversarial samples [141,33], noisy samples or missing features [33]. An adversarial sample is generated using the Fast Gradient Sign Method (FGSM), in which a perturbation $\Delta x_i = \epsilon \, \mathrm{sign}(\partial \ell / \partial x_i)$ is added to the original sample $x_i$, where $\epsilon$ is the magnitude of the perturbation [141]. Multiple research works [141,33] demonstrate that this kind of attack is effective at degrading model performance. If $\Delta x_i$ is simply a randomly generated perturbation, then the attack is referred to as the noisy-sample attack.
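The FGSM perturbation can be illustrated on a single logistic model, for which $\partial \ell / \partial x = (p - y)\,w$ with $p = \sigma(w^\top x)$. The model, the sample, and $\epsilon$ below are hypothetical. In a VFL attack the passive party would perturb only its own feature slice, using the gradient information available to it.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def logistic_loss(w, x, y):
    p = sigmoid(dot(w, x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(w, x, y, eps):
    """Return x + eps * sign(dl/dx) for a logistic model with weights w."""
    p = sigmoid(dot(w, x))
    grad_x = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

w = [1.0, -2.0, 0.5]
x = [0.5, -0.5, 1.0]
y = 1
x_adv = fgsm(w, x, y, eps=0.1)
print(logistic_loss(w, x, y), logistic_loss(w, x_adv, y))  # the loss increases
```

Replacing `sign(g)` with a random sign or random noise gives the noisy-sample variant, which does not use gradient information.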
The missing-feature attack simulates real-world VFL scenarios with an unstable network [33] in which, for example, the local model output of a passive party may fail to reach the active party for collaboration.

Defense Strategies
Traditional defense strategies such as adding noise and GS are effective in defending against targeted and non-targeted backdoor attacks [33,34]. However, these defenses suffer from trade-offs between main task accuracy and backdoor task accuracy. On the other hand, cryptographic defense strategies are generally ineffective against backdoor attacks because they preserve the computed outputs and thus do not affect the backdoor training objectives. In [139], the authors show that gradient-replacement backdoor attacks can still survive in HE-protected VFL protocols.
Therefore, emerging defense strategies have been proposed to further improve the effectiveness of defenses. For example, CAE and DCAE both show promising effectiveness in defending against targeted backdoor attacks [34]. RVFR [33] is put forward to defend against both targeted and non-targeted backdoor attacks in VFL scenarios through robust feature subspace recovery. We compare these defenses in Table 9.
In summary, research on defending against backdoor attacks in VFL is still at an early stage. It is worth exploring new effective defense strategies that also maintain good model utility.

Data Valuation and Fairness
VFL opens up new opportunities for cross-institution and cross-industry collaborations. As industrial use cases grow, a critical challenge for establishing a stable and sustainable federation among parties is the lack of fair data valuation and incentive design to allocate profits. In addition, a responsible VFL framework should also address various bias problems affecting certain groups of people. In this section, we discuss the research progress on data valuation, explainability, and fairness for VFL.

Data Valuation
Currently, most research works on data valuation for FL still focus on HFL scenarios [144,145,146,147], while data valuation for VFL is much less studied. [148,149] are among the earliest works that proposed contribution evaluation frameworks for VFL using Shapley valuations on features. Shapley-based approaches typically adopt model performance gain as a key metric to measure data value. [150] proposed a model-free approach that uses conditional mutual information within the Shapley framework to evaluate feature importance and data values in VFL. [151] proposed an embedding-based Shapley evaluation method for VFL and applied this method to both asynchronous and synchronous settings. [152] focused on party-level evaluation from a mutual information (MI) perspective and adopted such evaluations to select important participants to improve the scalability of VFL. However, Shapley-based and MI-based evaluations are computationally challenging, which makes them difficult to apply to real-world cases. Improving the efficiency of Shapley calculations is an important future research direction.
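The cost of Shapley-based valuation is easy to see in an exact computation, which averages each party's marginal performance gain over all join orders. The sketch below is a generic Shapley routine, not the method of [148,149]; the three parties and the coalition utilities (standing in for validation accuracy) are hypothetical numbers.

```python
from itertools import permutations

def shapley_values(parties, utility):
    """Exact Shapley values: average marginal gain over all join orders."""
    totals = {p: 0.0 for p in parties}
    orderings = list(permutations(parties))
    for order in orderings:
        coalition = frozenset()
        for party in order:
            joined = coalition | {party}
            totals[party] += utility(joined) - utility(coalition)
            coalition = joined
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical model accuracy obtained when training with each coalition.
ACCURACY = {
    frozenset(): 0.50,
    frozenset("A"): 0.70, frozenset("B"): 0.60, frozenset("C"): 0.60,
    frozenset("AB"): 0.80, frozenset("AC"): 0.80, frozenset("BC"): 0.65,
    frozenset("ABC"): 0.90,
}

phi = shapley_values(["A", "B", "C"], ACCURACY.__getitem__)
print(phi)
```

Every utility lookup corresponds to training and evaluating a VFL model on a coalition, and the number of coalitions grows exponentially with the number of parties (or features), which is why sampling-based or model-free approximations are needed in practice.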

Explainability
In highly regulated fields, such as finance and medicine, making the trained VFL model explainable to authorities and for compliance purposes is of paramount importance. Currently, only a limited number of works address the explainability of VFL. For example, [153] proposed an explainable VFL framework that uses credibility assessment and counterfactual analysis to control data quality and explain counterfactual instances. [154] designed a VFL scheme based on logistic regression with bounded constraints for interpretable scorecards in credit scoring. [92] proposed a feature grouping method that converts original features with low explainability into explainable feature groups to enhance the explainability of VFL prediction models. While designing VFL with explainability is an important research topic, how to reconcile privacy preservation and explainability in VFL is also a crucial research direction because the two objectives may contradict each other.

Fairness
Machine learning models trained in a collaborative setting may inherit bias towards certain user groups. Addressing the fairness problem in VFL is an emerging research topic. FairVFL [155] is a framework that uses adversarial learning to remove bias from fairness-sensitive features in a privacy-preserving VFL setting. [156] provided a fairness objective in VFL and developed an asynchronous gradient coordinate-descent ascent algorithm to solve it. The core challenge in addressing fairness in VFL is to identify fairness-sensitive features and perform collaborative debiasing training while preserving data privacy and protocol efficiency.

Datasets
We list datasets commonly used in current VFL works in Table 10. Most of the datasets used in VFL research are tabular datasets from finance, healthcare, and advertising. On the one hand, this indicates that VFL has a broad range of applications in these three fields. On the other hand, tabular datasets dominate VFL research because they are convenient for forming multi-party scenarios, indicating that research datasets of other types (e.g., image, text, or video) are in short supply. In addition, only the NUSWIDE and Vehicle datasets consist of multi-modal features that can naturally simulate the two-party VFL scenario. The other datasets listed in Table 10 are adopted from existing machine learning research works, and there is no established way for VFL researchers to partition these datasets for VFL research. Therefore, facilitating industrial applications and academic research in the VFL area calls for practical datasets and high-quality benchmarks.
Table 10: Commonly used datasets in VFL.In the Size column, the number represents the total amount of samples of each dataset.For the three graph datasets, the number on the left of / represents the number of nodes, while the number on the right represents the number of edges.In VFLow, we take into account major constraints, including privacy, efficiency, and fairness, to guide the design of a VFL algorithm from aspects of the model architecture and partition settings, effectiveness and efficiency improving strategies, privacy defense strategies, as well as fairness improving strategies covered in this work.In addition, VFLow consists of a separate risk evaluation module that comprehensively evaluates data attacks and defense strategies.Finally, for model usage, party contributions, accountability, and verifiability tools are necessary for a sustainable and trustworthy federation (also see Sec. 9).We further extend the objective function formulated in Eq. 1 to a more general meta-objective, in which we want to minimize the main task loss (i.e., maximize utility) constrained by privacy, efficiency (i.e., communication and computation), and fairness: where Θ and S denote specific models and a VFL setting, respectively; A denotes an effectiveness improving strategy, P denotes a privacy defense strategy, K denotes the collection of attack algorithms, E denotes an efficiency improving strategy, and R denotes a fairness improving strategy.M p denotes a measurement for measuring privacy leakage imposed by attacks K against the defense strategy P. M e is the efficiency measure, typically with respect to communication load and computation resources.M b measures the system bias.ϵ p , ϵ e , and ϵ b are constraints for privacy leakage, efficiency cost, and bias, respectively.
This optimization problem can be considered as a constrained multi-objective federated learning problem [178]. Such a formulation yields a set of solutions, each of which is an optimal trade-off between multiple objectives, and thus provides stakeholders with flexible decision options.
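The idea of a set of trade-off solutions can be illustrated with a small sketch that filters candidate VFL configurations by the three constraints and keeps only the non-dominated (Pareto-optimal) ones. The class, the function names, and all numbers below are hypothetical; they are not measurements from any system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    loss: float              # main task loss (lower is better)
    privacy_leakage: float   # M_p (lower is better)
    efficiency_cost: float   # M_e (lower is better)
    bias: float              # M_b (lower is better)

def objectives(c):
    return (c.loss, c.privacy_leakage, c.efficiency_cost, c.bias)

def feasible(c, eps_p, eps_e, eps_b):
    """Check the privacy, efficiency, and fairness constraints."""
    return (c.privacy_leakage <= eps_p
            and c.efficiency_cost <= eps_e
            and c.bias <= eps_b)

def dominates(a, b):
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    pairs = list(zip(objectives(a), objectives(b)))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)

def pareto_front(candidates, eps_p, eps_e, eps_b):
    ok = [c for c in candidates if feasible(c, eps_p, eps_e, eps_b)]
    return [c for c in ok if not any(dominates(o, c) for o in ok)]

# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("no_defense", 0.30, 0.90, 1.0, 0.10),
    Candidate("dp_noise", 0.40, 0.30, 1.1, 0.10),
    Candidate("he", 0.31, 0.05, 5.0, 0.10),
    Candidate("dp_noise_heavy", 0.45, 0.35, 1.2, 0.12),
]
front = pareto_front(candidates, eps_p=0.5, eps_e=6.0, eps_b=0.2)
print([c.name for c in front])  # → ['dp_noise', 'he']
```

Here `no_defense` violates the privacy constraint and `dp_noise_heavy` is dominated by `dp_noise`, so a stakeholder is left to choose between a cheaper option with more leakage and a costlier option with less.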

Applications
Due to its practical merits in enabling data collaboration between multiple institutions across industries, VFL has attracted increasing attention from both academia and industry. In this section, we provide an overview of VFL applications.
Recommendation systems are typically adopted in VFL to support advertising applications. Federated bandits have been explored as a promising technique for FL [179,180,181]. Shmueli et al. [182] proposed a privacy-preserving collaborative filtering protocol. Atarashi et al. [183] proposed a higher-order factorization machine in the VFL setting. Recommendation systems can be built between two platforms holding different rating data. Cui et al. [184] proposed a secure cross-platform recommendation based on secure computation protocols. Zhang et al. [185] proposed a VFL recommendation based on clustering and a latent factor model to reduce the dimension of the matrix and improve the recommendation accuracy. To achieve privacy-preserving recommendations based on the personal data cloud, Yuan et al. [186] proposed a hybrid federated learning recommendation algorithm named HyFL, which exploits the advantages of both HFL and VFL. Cai et al. [187] proposed a DP-based VFL recommendation framework between a social recommender system and a user social graph.
Many internet companies have adopted VFL to support advertising. For example, ByteDance developed a tree-based VFL algorithm based on the Fedlearner framework, which significantly improves its advertising efficiency [188]. Based on the VFL module in its 9N-FL framework, JD has established a joint model for advertising, which has increased the cumulative income of all participating parties [189]. Tencent applied its Angel PowerFL platform to establish a VFL federation between advertisers and advertising platforms to boost model accuracy [190]. Based on the trusted intelligent computing service framework (TICS), Huawei applied VFL to advertising [191] to leverage user profile and behavior data dispersed across different platforms.
Finance is another major application area in which new VFL approaches have been rapidly developed. For example, a gradient-based method for traditional scorecard model training is proposed in [154]. In [23], a secure large-scale sparse logistic regression algorithm is designed and applied to financial risk control. Kang et al. [92] developed a fine-grained adversarial domain adaptation algorithm to address the label deficiency issue in the financial field. Long et al. [192] discussed the applications and open challenges for FL in open banking. Wang et al. [149] provided an overview of the use cases of FL in the insurance industry. WeBank uses customers' credit data and invoice information from partner companies to jointly build a risk control VFL model [4].
Healthcare has been very active in applied research in VFL. A privacy-preserving logistic regression is proposed in [117] and applied to clinical diagnosis. Chen et al. [57] proposed an asynchronous VFL framework and verified its effectiveness on the public health care dataset MIMIC-III. In [193], the authors applied VFL to cancer survival analysis to predict the likelihood of patients surviving over time after diagnosis and to analyze which features might be associated with the chance of survival. The work in [65] proposed an efficient VFL method using autoencoders to predict hearing impairment after surgery based on a vestibular schwannoma dataset. Song et al. [194] applied VFL to joint modeling between mobile network operators (MNOs) and health care providers (HPs).
Emerging applications have also been explored in recent years to discover novel uses of data in fields such as electric vehicles and wireless communications. Teimoori et al. [195] proposed a VFL algorithm to locate charging stations for electric vehicles while protecting user privacy. The work in [196] discussed the opportunities for VFL to be utilized in 5G wireless networks, [197] proposed a VFL-based cooperative sensing scheme for cognitive radio networks, [198] developed a VFL framework for optical network disaggregation, [199] applied VFL to collaborative power consumption prediction in smart grid applications, and [200] proposed VFL models for predicting failures in intelligent manufacturing.
MultiModal Tasks are performed when participants in VFL hold data from multiple modalities, such as vision, language, and sensing. Liu et al. [201] proposed aimNet, which helps the FL model learn better representations from textual and visual features through multi-task learning. Liang et al. [82] proposed a self-supervised vertical federated neural architecture search approach that automatically optimizes each party's local model for the best performance of the VFL model, given that participating parties hold heterogeneous image data. Vertical federated graph learning (VFGL) algorithms are proposed to leverage features, relations, and labels that belong to the same group of people but are dispersed among different organizations. VFGNN [137] and FedVGCN [202] perform node classification in the scenario where all parties share the same set of nodes, but each party owns only partial features and relations of these nodes. FedSGC [123] performs node classification in another scenario, where one party has only graph structural information while the other parties have only node features.
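The second VFGL scenario (one party holds the graph structure, the others hold node features) can be illustrated with a plaintext sketch. It assumes a simple graph convolution (SGC) style propagation, i.e., repeated multiplication by a symmetrically normalized adjacency matrix. This is not the FedSGC protocol: all privacy protection is omitted, and the function names and toy data are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, the usual symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])       # add self-loops, so no degree is zero
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def sgc_propagate(A, X, k=2):
    """Apply k rounds of linear neighborhood averaging to the features X."""
    S = normalized_adjacency(A)
    H = X
    for _ in range(k):
        H = S @ H
    return H

# Party holding only structure: a 4-node path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Two parties holding disjoint feature blocks of the same 4 nodes.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(4, 2))
X2 = rng.normal(size=(4, 3))

# Propagation is linear and acts column by column, so propagating each
# party's block separately matches propagating the concatenated features.
H1 = sgc_propagate(A, X1)
H2 = sgc_propagate(A, X2)
H_joint = sgc_propagate(A, np.hstack([X1, X2]))
```

The column-wise property is what makes this scenario natural for feature-partitioned data: each feature block can be smoothed over the graph independently before a downstream classifier combines them.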

Open Challenges and Future Direction
In this section, we discuss some of the major open challenges facing the development of VFL frameworks and propose possible paths in the future.
Interoperability. Thanks to the rapid development of efficient privacy-preserving technologies in recent years, more and more VFL projects and open-sourced platforms have been developed and applied in real-world scenarios, connecting data silos in various industries. However, the lack of interoperability among existing frameworks has become a new pain point for their industrial growth. Different platforms adopt different sets of secure computation and privacy-preserving training protocols, making cross-platform collaboration difficult and turning data silos into platform silos. One possible route to address this challenge is to enforce the interoperability of platforms by developing algorithm and architecture standards so that platforms can connect with each other more readily. Another route is to develop seed projects that support fundamental functionalities and modules for interoperability as plug-in tools for diverse platforms.
Trustworthy VFL. To be trustworthy, VFL frameworks must appropriately reflect characteristics such as privacy and security, effectiveness, efficiency, fairness, explainability, robustness, and verifiability. Data needs to be protected in transit and at rest with clear security and privacy definitions and scopes. Despite recent research efforts on this subject, there is still a lack of universally effective defense strategies that are both lossless and highly efficient. The trade-off among utility, privacy, and efficiency [203] remains a focus of future studies. Applying multi-objective optimization techniques [178] in VFLow is a promising research direction towards trustworthy VFL [204]. In addition, a trustworthy FL framework requires the trained models to be verifiable and auditable. One possible route is to protect the trained models released in VFL with verifiable intellectual property (IP) protection methods [205] in an efficient manner, preventing malicious IP attacks while fulfilling privacy requirements.
Automated and Blockchained VFL. Automated machine learning (AutoML) is of great interest for alleviating human effort and achieving satisfactory model performance [206]. Liang et al. [82] proposed a Vertical Federated Neural Architecture Search that learns an individual model architecture for each client. The work in [207] discussed challenges in applying NAS to VFL under encryption. In VFL, participants without labels cannot perform individual training or evaluation locally. Thus, their hyperparameters are nested in the collaborative training. This unique setting makes AutoML in VFL more challenging. Blockchain is leveraged to address the issue that a vanilla FL framework heavily relies on a central server, which may lead to a single point of failure or privacy vulnerabilities and leaves the system vulnerable to that server's misbehavior. By utilizing Blockchain, participating parties can exchange their model updates in a decentralized and verifiable manner. How to integrate Blockchain into VFL frameworks to improve the overall security and robustness is an interesting future direction.
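As a rough illustration of what "verifiable" exchange of model updates can mean, the sketch below keeps a hash-chained log of update digests, so that altering any past record invalidates the chain. It is a toy under stated assumptions, not a blockchain: there is no consensus, networking, or signature scheme, and the record layout and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first record

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def record_hash(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes identically.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))

def append_update(chain: list, party: str, update_bytes: bytes) -> None:
    """Append a record that commits to an update and to the previous record."""
    body = {
        "index": len(chain),
        "party": party,
        "update_digest": sha256_hex(update_bytes),  # only a digest is logged
        "prev_hash": chain[-1]["hash"] if chain else GENESIS,
    }
    chain.append({**body, "hash": record_hash(body)})

def verify_chain(chain: list) -> bool:
    """Recompute every hash and check that the records link up in order."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        body = {k: rec[k] for k in ("index", "party", "update_digest", "prev_hash")}
        if rec["index"] != i or rec["prev_hash"] != prev:
            return False
        if record_hash(body) != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

chain = []
append_update(chain, "passive_party_1", b"serialized update, round 0")
append_update(chain, "active_party", b"serialized gradients, round 0")
print(verify_chain(chain))  # → True
```

Logging only digests keeps the updates themselves off the shared record, which matters in VFL because intermediate results can leak private features or labels.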

Concluding Remarks
Vertical federated learning enables collaborative learning of feature-partitioned data distributed across multiple institutions. It has become an attractive solution for solving industrial data silo problems caused by the enforcement of strict data regulations. Despite its practical usefulness, as evidenced by a growing number of VFL projects and use cases, the breadth and depth of the research advances still lag behind those of HFL. We present an extensive categorization of research efforts and new challenges in VFL and propose a novel framework towards comprehensively formulating relevant aspects of VFL. We hope this work will encourage future research efforts to address these challenges in this area.
Figure 1: Three categories of Federated Learning: (a) Horizontal Federated Learning, (b) Vertical Federated Learning, (c) Federated Transfer Learning.

Figure 2: Relationships between sections in this work.

Figure 3: Illustration of the VFL system with three parties (two passive parties and one active party). G_1, G_2, and G_3 denote the local models of the three parties, respectively, and F_3 denotes the global module owned by the active party. The VFL training protocol typically involves two steps: 1) the three parties align their samples via private entity alignment; 2) the three parties collaboratively train G_1, G_2, G_3, and F_3 in a privacy-preserving manner (see Section 2.2 for details).

Figure 4: Four major variants of VFL illustrated with one active party and two passive parties.

Figure 5: The virtual dataset of a two-party VFL. D denotes the labeled and aligned samples used by the conventional VFL formulated in Eq. (1), whereas D^au denotes aligned but unlabeled samples. D^uu_A and D^uu_B denote unaligned and unlabeled samples of party A and party B, respectively. D^ul_A denotes unaligned and labeled samples of party A.

Figure 6: A conceptual view on the information flowing within and between an active party A and a passive party B during training to illustrate security protocols P-1, P-2, P-3 and P-4.

Figure 7: Illustration of data inference attacks in a VFL system. The active party A typically infers features or attributes of the passive party B, while the passive party B typically infers labels of the active party A.

Figure 8: Illustration of backdoor attacks in VFL. Passive parties are the backdoor attackers who aim to impact the task performance of the active party.

Figure 9: VFLow: A Framework for setting up, designing and optimizing VFL algorithms.

Table 1: Comparison of main characteristics between conventional HFL, VFL and FTL.

Table 2: Comparison of splitVFL and aggVFL.
To prevent privacy leakage from the intermediate results H_k and gradients ∂ℓ/∂H_k, crypto-based privacy-preserving techniques such as Homomorphic Encryption (HE) (denoted as [[·]]), Secure Multi-Party Computation (MPC), and Trusted Execution Environment (TEE) can be introduced into the VFL protocol to protect the crucial information from inside and outside attackers. For example, instead of sending H_k, each party k sends [[H_k]] to the active party, who in turn sends [[∂ℓ/∂H_k]] ...

Table 3: Summary of existing works that aim to improve the efficiency of VFL. In the Model column, LR denotes logistic regression, NN denotes neural network, XGB denotes XGBoost, and GBDT denotes gradient boosting decision tree. In the Convergence Rate column, T represents the total number of local iterations and ∆ represents stochastic variance.

Table 5: Summary of existing data inference attacks in VFL. A.P. represents the Attacking Phase. In the A.P. column, TRG denotes Training Phase and INF denotes Inference Phase.
... in which ŷ is the label variable that needs to be optimized, while ψ_A and Ĥ_A are active party A's global module parameters and local model output, respectively, which are estimated by party B in order to mount the GI attack because party B has no access to them; ∇_{θ_B} ℓ = ∇_{θ_B} L(F_A(ψ_A; H_A, H_B), y) denotes the ground-truth local model gradients; R_GI regularizes the label variable ŷ based on the label prior, aiming to enhance the quality of ŷ [113]. The RR attack is tailored to linear models and aims to infer the plaintext value of the encrypted gradient [[∂ℓ/∂H_B]] ...

Table 6: Summary of existing cryptographic defense strategies in VFL. In the Defense Scheme column, GC denotes Garbled Circuits, SS denotes Secret Sharing, HE denotes Homomorphic Encryption, FE denotes Functional Encryption, and TEE denotes Trusted Execution Environment. In the Adversarial Assumption column, SH denotes Semi-Honest and MA denotes Malicious. In the Protocol column, we assign each defense the protocols (see Sec. 5.2) it satisfies; "a" and "p" denote active and passive parties, respectively.

Table 7: Summary of emerging specialized defense strategies for defending against data leakage attacks (see Table 5).

Table 8: Summary of existing backdoor attacks in literature.

Table 9: Summary of defense strategies for defending against backdoor attacks.