Mix2SFL: Two-Way Mixup for Scalable, Accurate, and Communication-Efficient Split Federated Learning

In recent years, split learning (SL) has emerged as a promising distributed learning framework that can exploit large-scale data in parallel without privacy leakage while reducing client-side computing load. In the initial implementation of SL, however, the server serves multiple clients sequentially, incurring high latency. Parallel implementations of SL can alleviate this latency problem, but existing Parallel SL algorithms compromise scalability due to a fundamental structural problem. To this end, our previous works have proposed two scalable Parallel SL algorithms, dubbed SGLR and LocFedMix-SL, which address this fundamental problem of the Parallel SL structure. In this article, we propose a novel Parallel SL framework, coined Mix2SFL, that improves both accuracy and communication efficiency while still ensuring scalability. Mix2SFL first supplies more samples to the server through a manifold mixup between the smashed data uploaded to the server, as in the SmashMix of LocFedMix-SL, and then averages the split-layer gradient, as in the GradMix of SGLR, followed by local model aggregation as in SFL. Numerical evaluation corroborates that Mix2SFL improves both accuracy and latency compared to state-of-the-art SL algorithms while guaranteeing scalability. Moreover, its convergence speed as well as its privacy guarantee are validated through experimental results.


I. INTRODUCTION
UTILIZING large amounts of data through large-scale parallel computing power is instrumental in high-quality deep learning [1], [2], [3], [4]. In this respect, federated learning (FL) was the first to explore distributed learning for harnessing the data and computing resources scattered across multiple clients [5], [6]. In FL, each client trains a local model using its own data and uploads it to a parameter server. The server averages the uploaded client models and constructs a global model that each client then downloads. By iterating this process, FL allows each client to benefit from other clients' data without the raw data exchanges that may induce privacy leakage. However, FL is ill-suited for large models, particularly when clients cannot store and transmit them given small memory as well as limited computing and communication energy.
On the other hand, split learning (SL), first proposed in [3], [4], is an alternative to FL for coping with large models in a resource-efficient way [7], [8], [9]. A typical SL architecture splits an entire deep neural network (DNN) model into two partitions, such that the upper model segment above the split-layer is stored at the server while the lower model segment is stored at each client. In contrast to FL, which exchanges model parameters, each client in SL uploads the split-layer output, also known as smashed data, and downloads the gradient from the server to update the entire model. As its first instantiation, Vanilla SL achieves accuracy on par with FL while guaranteeing scalability, in the sense that accuracy increases with the number of clients. Notwithstanding, Vanilla SL operates from client to client sequentially, suffering from large latency particularly with many clients. Such a limitation calls for Parallel SL, which can utilize the clients' parallel computing power to reduce latency.
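For concreteness, the following is a minimal sketch of this split in PyTorch; the toy architecture and layer sizes here are our own illustrative assumptions, not the experimental model of Section IV.

```python
import torch
import torch.nn as nn

# Toy split of a network into a client-side lower segment and a server-side
# upper segment; only the split-layer output ("smashed data") crosses the wire.
client_segment = nn.Sequential(                 # lower model segment (client)
    nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2))
server_segment = nn.Sequential(                 # upper model segment (server)
    nn.Flatten(), nn.Linear(6 * 14 * 14, 10))

x = torch.randn(64, 3, 32, 32)                  # a client's mini-batch
smashed = client_segment(x)                     # uploaded instead of raw data
logits = server_segment(smashed)                # server-side FP to the output
```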
In Parallel SL, the smashed data and gradients of multiple clients are exchanged in parallel, reducing latency. However, Parallel SL fails to increase accuracy with the number of clients, questioning its scalability [10]. Indeed, due to the simultaneous connections of multiple clients to the server, Parallel SL has an innate server-client update imbalance in the forward propagation (FP) of smashed data and the backward propagation (BP) of gradients, and an inter-client update imbalance in the clients' model parameters (i.e., model weights). To be precise, in the FP, Fig. 1 shows that while a single data sample propagates through the lower model segment of each client, multiple smashed data propagate through the upper model segment at the server. Likewise, for every single lower-model-segment update in the BP, the upper model segment is updated as many times as there are clients. Lastly, the clients' weights are updated differently even under the same gradient from the server, since the gradient backpropagates through the non-identical smashed data of different clients according to the BP chain rule.
In fact, split federated learning (SFL), one of the representative Parallel SL frameworks, already addresses the inter-client update imbalance by applying FL across the clients' lower model segments [11], [12], which however fails to achieve scalability. Meanwhile, our prior works [10] and [13] focus primarily on the server-client update imbalance, by averaging smashed data across clients and by splitting the learning rates of the upper and lower model segments, respectively, thereby reinstating scalability. Nevertheless, these two works lack a deep understanding of the two imbalance problems above, so the methods in [10] and [13] compromise communication efficiency and accuracy, respectively.
Motivated by these preceding works, in this article we aim to achieve the scalability, high accuracy, and communication efficiency of Parallel SL by proposing a unified framework, coined split federated learning with two-way mixup (Mix2SFL). Mix2SFL resolves the two imbalance problems, of which the inter-client imbalance can be solved trivially by applying FL as in SFL. In the server-client imbalance, however, the FP and BP problems are intertwined, and jointly addressing them is non-trivial. For instance, a naïve solution to the FP and BP problems is averaging all the smashed data uploaded by all clients, which however degenerates into white noise, particularly with many clients. Instead, inspired by [10], Mix2SFL in the FP averages a small number of smashed data in a combinatorial way, hereafter referred to as Smashed Mixup (SmashMix). Meanwhile, following the method in [13], Mix2SFL in the BP averages the gradients at the split-layer, henceforth referred to as GradMix. In doing so, both the FP and BP flows as well as the weight updates become aligned, ensuring scalability with high accuracy. Furthermore, GradMix in the BP yields a common gradient that can be broadcast to all clients over the same downlink (DL) bandwidth, improving communication efficiency and latency. Numerical simulations demonstrate that Mix2SFL outperforms other SL baselines, including Vanilla SL and SFL, in terms of scalability and accuracy. The results also show that Mix2SFL excels in communication efficiency, convergence speed, and privacy guarantees.
Contributions: The major contributions of this article are summarized as follows:
• We point to the mismatch between the FP and BP flows and the lack of lower-model-segment integration as the causes of Parallel SL's unscalability.
• To solve this problem while improving both accuracy and communication efficiency, we design Mix2SFL by carefully combining SmashMix, GradMix, and SFL, and describe its detailed operation.
• Simulation results validate that Mix2SFL outperforms state-of-the-art SL algorithms in convergence speed and privacy guarantee, on top of the tri-fold goal of scalability, accuracy, and communication efficiency.

Related Works: FL and SL are two promising distributed learning frameworks, each with its own advantages and disadvantages [14]. With a large number of clients, FL achieves scalable accuracy [2], [15]. Due to client-side limitations in computation, memory, and communication resources, however, it can only handle small models. SL, on the other hand, can run big models by splitting them [4], [16], [17], obtaining even quicker convergence with less communication overhead than FL [18], [19], [20]. However, Vanilla SL, the first of its kind, suffers large latency as the number of clients grows, due to its sequential operation.
SFL [11], [12] inherits the advantages of FL and SL by combining both techniques, and marks the beginning of the parallel implementation of SL. In SFL, FL is applied to the lower model segment after SL's BP is finished. The scalability of SFL, however, is debatable, which is especially important in Internet-of-Things (IoT) or Web-of-Things (WoT) scenarios where global data and computing capacity are scattered among a large number of clients.
To resolve this problem, one of our prior works [10] proposes LocFedMix-SL, which jointly exploits local parallelism techniques [21], [22], data augmentation techniques such as mixup [23] and CutMix [24], and federated averaging. By doing so, LocFedMix-SL successfully improves both scalability and convergence speed, at the cost of latency.
Another of our prior works [13] proposes SplitLr, which separates and adjusts the server-side and client-side learning rates to ensure the scalability of Parallel SL, and GradMix, which obtains a bandwidth gain by averaging the gradient at the split-layer. SGLR, composed of SplitLr and GradMix, guarantees scalability and low latency, but is less accurate.

II. SL ARCHITECTURE FUNDAMENTALS
Our main framework, SL, can be largely divided into Vanilla SL and Parallel SL according to the server's weight update process. All these SL methods aim to train the same neural network. Its layers are divided into two chunks: the lower chunk (lower model segment) is replicated across the clients, while the server stores the remaining upper layers (upper model segment), which generally occupy more of the entire network than the lower model segment.
In this architecture, the clients, whose set is denoted as $C$, participate in the training process and store the network up to the $(k-1)$th layer, such that the $i$th client has the $i$th lower model segment $w_{c,i} = [w^1_{c,i}, \ldots, w^{k-1}_{c,i}]^T$ for all $i$. Accordingly, a pair of a client and the connected server composes an entire network, $w = [w_{c,i}, w_s]^T$, where the split-layer weight $w^{k-1}_{c,i}$ connects the clients and the server. We denote by $F$ the operation of running the forward path with the owned weight parameters and data. The training data set $D$ is distributed to the clients, so that each client $i$ holds $|D_i|$ samples with $\cup_{i \in C} D_i = D$, where $D_i$ is composed of raw data $x_i$ and the corresponding ground truth $y_i$, randomly selected in batches of size $|B_i|$. The detailed training process of each method is described below.

A. Vanilla SL
Vanilla SL is the original form of SL, also known as sequential SL [3]. In this architecture, only one client is active at a time, training the entire network in connection with the server.
FP: Each $i$th client is selected sequentially from the set of all clients. At the $b$th iteration, it generates a mini-batch $B_{i,b}$ from its own data and produces the smashed data $s_{i,b} = F(x_{i,b}; w_{c,i})$. The client then sends the tuple of the smashed data and the one-hot encoded labels, $(s_{i,b}, y_{i,b})$, to the server, and the server runs the FP through the upper model segment and produces the output via activation as $\hat{y}_{i,b} = F(s_{i,b}; w_s)$; in classification tasks, the softmax function is typically used as the activation. Finally, the server calculates the cross-entropy loss of the final prediction against the corresponding label, where $\mathrm{CEloss}(p, q) = -\sum p \log q$.
BP & Model Transition: To minimize the loss $L_i$, the server backpropagates through the upper model segment, computing the gradient of each server-side layer via the chain rule down to the split-layer gradient $g^k_s$ (1). The server sends the split-layer gradient $g^k_s$ to the $i$th client while updating its upper model segment as $w_s \leftarrow w_s - \eta G^{VSL}_s$, where $\eta$ denotes the learning rate and $G^{VSL}_s$ collects the server-side layer gradients. Since the clients act one by one in order, the server is also updated at every sequential step jointly with each client.
After receiving the split-layer gradient from the server, the $i$th client backpropagates through its lower model segment, computing the layer gradients starting from $g^{k-1}_{c,i}$ (2). The client then updates its lower model segment with its learning rate, $w_{c,i} \leftarrow w_{c,i} - \eta G^{VSL}_{c,i}$. After the $i$th client finishes its update, its model weight $w_{c,i}$ is sent to the $(i+1)$th client in order, i.e., $w_{c,i+1} \leftarrow w_{c,i}$ (3). The $(i+1)$th client then updates the received lower model segment by iterating training with its own data set $D_{i+1}$.
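Reusing client_segment, server_segment, and x from the earlier sketch, one Vanilla SL step can be emulated as below; the detach/requires_grad pattern stands in for the actual client-server transport, and the optimizer settings are illustrative.

```python
import torch
import torch.nn.functional as F

opt_c = torch.optim.SGD(client_segment.parameters(), lr=0.004)
opt_s = torch.optim.SGD(server_segment.parameters(), lr=0.004)

y = torch.randint(0, 10, (64,))                       # ground-truth labels
smashed = client_segment(x)                           # client-side FP
server_in = smashed.detach().requires_grad_(True)     # "uploaded" smashed data
loss = F.cross_entropy(server_segment(server_in), y)  # server-side FP and loss

opt_s.zero_grad(); loss.backward(); opt_s.step()      # server-side BP and update
opt_c.zero_grad()
smashed.backward(server_in.grad)                      # "downloaded" split-layer
opt_c.step()                                          # gradient; client update
```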

Limitations of Vanilla SL:
In general, Vanilla SL achieves high accuracy and a fast convergence rate, since the algorithm works as if one entire neural network were trained on all the distributed data, with the help of its model transition. However, the sequential architecture incurs considerable latency, since each client must wait for its turn until the preceding clients finish their updates; this delay grows proportionally with the number of clients. Moreover, additional communication overhead occurs, as each client sends the lower model segment to the next client after every training iteration, whether the clients are directly connected or relayed through the server or a secondary client. These structural drawbacks motivate a parallel implementation.

B. Parallel SL
In this structure, the clients connect to the server at the same time. At the $b$th iteration, all clients run the forward pass with their own mini-batches and simultaneously upload the smashed data along with the ground-truth labels, $\cup_{i \in C}(s_{i,b}, y_{i,b})$, to the server.
BP With Global Gradient: After the server receives the smashed data from all clients in parallel, it runs the forward pass and computes the cross-entropy loss $L_i$ for all $i \in C$. Then the server generates the $i$th gradient using each loss and sends the split-layer gradient $g^k_{s,i}$ to the corresponding client. The steps up to this point are the same as in Vanilla SL. In Parallel SL, however, the server updates the upper model segment with the global gradient obtained by weighted-averaging the $i$th gradients for all $i$ (5). Next, all clients backpropagate through their own lower model segments using the received gradients, which is the same operation as in Vanilla SL (7).

Update Imbalance Problem of Parallel SL: Compared to Vanilla SL, Parallel SL can reduce latency and energy via efficient pipelining of computing and communication resources. However, the scalability of Parallel SL is questionable in the sense that accuracy does not always increase with the number of participating clients. This limited scalability is due to the update imbalance problem, which is divided into server-client imbalance and inter-client imbalance, as elaborated next. To illustrate the server-client update imbalance, we first recall Vanilla SL, where a single gradient flow is backpropagated through the server to each client, as seen in (1) and (2), respectively. In stark contrast, in Parallel SL, while each client experiences a single gradient flow in (7), the server encounters $|C|$ gradients in (5), one per client. In other words, because the parallel structure shares a single server among multiple clients, the model parameter update rates of the server and of each client differ under Parallel SL. For the same reason as in the BP, the number of FP flows at the server differs from that of each client. Next, the inter-client update imbalance is incurred by the non-identical model parameters of the clients in Parallel SL, as opposed to the synchronized model parameters across clients in Vanilla SL shown in (3). Given different model parameters, even when the same split-layer gradient is backpropagated from the server to multiple clients, non-identical gradients are backpropagated through different clients. In short, Parallel SL has great potential for latency and energy reduction, yet is not scalable, warranting a redesign of both server-side and client-side operations to address the update imbalance problem inherent in Parallel SL.
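A sketch of one Parallel SL server step follows; the uniform $1/|C|$ weighting of the global gradient and the list-of-uploads interface are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def parallel_sl_server_step(server_segment, opt_s, uploads):
    """uploads: list of (smashed, labels) pairs, one per client."""
    split_grads = []
    opt_s.zero_grad()
    for smashed, y in uploads:
        s = smashed.detach().requires_grad_(True)
        loss = F.cross_entropy(server_segment(s), y)
        loss.backward()                        # per-client gradients accumulate
        split_grads.append(s.grad)             # g^k_{s,i}, returned to client i
    for p in server_segment.parameters():
        p.grad /= len(uploads)                 # global gradient = uniform average
    opt_s.step()                               # single upper-segment update
    return split_grads
```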

III. MIX2SFL: A SOLUTION TO THE UPDATE IMBALANCE PROBLEM OF PARALLEL SL
To solve the update imbalance problem of Parallel SL, in this section we propose split federated learning with two-way mixup (Mix2SFL), which consists of SmashMix and GradMix for server-client imbalance and SFL for inter-client imbalance as shown in Fig. 2.

A. Balancing Server-Client Updates Via Mixing Smashed Data and Gradients
After each client uploads the smashed data and its label as in Parallel SL, Mix2SFL feeds the averaged smashed data to the server through SmashMix elaborated as follows.
1) SmashMix: At the $b$th iteration, the tuple of the $i$th client is mixed up with that of the $j$th client ($j \in C - \{i\}$) as shown below:

$(s_{i,j,b},\ l_{i,j,b}) = \lambda\,(s_{i,b},\ y_{i,b}) + (1 - \lambda)\,(s_{j,b},\ y_{j,b}), \quad (8)$

where $\lambda$ denotes the mixing ratio following a uniform distribution, $\lambda \sim U(0, 1)$.
For each $i$th client, this process can be repeated up to $|C| - 1$ times, which implies that a total of $|C|$ gradients can be generated via server-side BP for SmashMix: one, denoted $G^{SMU}_{s,i}$ ($= G^{PSL}_{s,i}$), obtained through the client's own activation $s_{i,b}$, and $|C| - 1$ others, denoted $G^{SMU}_{s,i,j}$, obtained via the 1-to-1 manifold mixup of (8) with the $j$th client. Hereafter, the sample $s_{i,j,b}$ obtained through SmashMix is coined mixed-up smashed data, its corresponding label is $l_{i,j,b}$, and the number of such samples generated per client is denoted by $n_s$. Denoting by $C^{n_s}_i$ the set acquired by sampling $n_s$ elements without replacement from $C - \{i\}$, the server-side weight update of Mix2SFL through SmashMix averages $G^{SMU}_{s,i}$ and the mixup gradients $G^{SMU}_{s,i,j}$ over $j \in C^{n_s}_i$ (9). Note that the additional gradients through SmashMix, denoted by $G^{SMU}_{s,i,j}$ in (9), are detached from the lower model segment and thereby affect neither the split-layer gradient nor the client-side weight update.
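A hedged sketch of the SmashMix step described above; the partner-sampling policy and the dict-based interface are illustrative assumptions, and the labels are assumed one-hot so they can be mixed.

```python
import torch

def smashmix(smashed, labels, i, n_s):
    """smashed/labels: dicts keyed by client id; returns mixed tuples for client i."""
    partners = [j for j in smashed if j != i]
    chosen = torch.randperm(len(partners))[:n_s].tolist()   # C_i^{n_s}
    mixed = []
    for t in chosen:
        j = partners[t]
        lam = torch.rand(1).item()                          # λ ~ U(0, 1)
        s_mix = lam * smashed[i] + (1 - lam) * smashed[j]   # mixed-up smashed data
        l_mix = lam * labels[i] + (1 - lam) * labels[j]     # mixed-up label
        mixed.append((s_mix, l_mix))
    return mixed
```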
2) GradMix: For a client set $C' \subseteq C$, Mix2SFL provides an averaged split-layer gradient through GradMix. Before sending the gradients, the server averages the split-layer gradients of the clients belonging to $C'$, enabling them to download and backpropagate the same averaged split-layer gradient

$\bar{g}^{k-1}_{c,i} = \frac{1}{|C'|} \sum_{j \in C'} g^{k-1}_{c,j}. \quad (10)$

Clients in the remaining set $C \setminus C'$ download their own gradients as in Parallel SL (7).
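GradMix under the uniform-averaging reading of (10) can be sketched as follows; the dict interface mirrors the SmashMix sketch and is likewise an assumption.

```python
import torch

def gradmix(split_grads, c_prime):
    """split_grads: dict client id -> split-layer gradient; c_prime: ids in C'."""
    g_bar = torch.stack([split_grads[j] for j in c_prime]).mean(dim=0)   # (10)
    # Clients in C' share one broadcast gradient; the rest keep their own (7).
    return {i: (g_bar if i in c_prime else g) for i, g in split_grads.items()}
```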
By doing so, Mix2SFL supplies the averaged smashed data and the averaged split-layer gradient through SmashMix and GradMix to the server and the clients, respectively, solving the server-client update imbalance rooted in the imbalanced FP and BP flows of Parallel SL. The effect of the SmashMix and GradMix hyperparameters on accuracy and communication efficiency is examined next.
Impact of $n_s$: Fig. 3 shows the top-1 accuracy of SmashMix according to $n_s$ as well as $|C|$. The first thing to note is that accuracy is not always increasing in $n_s$: while increasing $n_s$ has an effect similar to increasing the batch size, accuracy can be a concave function of batch size, as in [25]. Next, we can see the benefit of increasing $|C|$: since the upper bound of $n_s$ is $|C| - 1$, a larger $|C|$ allows searching for $n_s$ over a wider range and thus a potential accuracy improvement. Finally, the optimal value of $n_s$ that maximizes accuracy varies with $|C|$, which could be an interesting topic but is left for future work.
Impact of $\phi$: Let $\phi$ denote the fraction of clients to which GradMix is applied, $\phi = |C'|/|C|$. Fig. 4 demonstrates how GradMix performs in terms of top-1 accuracy and average downlink (DL) rate according to the fraction $\phi$. Here, the average DL rate is the per-client average data rate of the DL transmission in which the server sends the gradient to the client. For simplicity, we ideally assume that the DL data rate $R_{DL}$ equals its theoretical upper bound, the capacity $C_{DL}$, given by the Shannon formula

$C_{DL} = W \log_2(1 + SNR),$

where $W$ and $SNR$ represent the channel bandwidth and the signal-to-noise ratio (SNR) in DL, respectively. First, in Fig. 4, the accuracy hardly fluctuates as $\phi$ changes from 0 to 1. On the other hand, the average DL rate increases sharply as $\phi$ increases. Since the clients in $C'$ share the same averaged split-layer gradient, GradMix enables broadcasting to the clients within $C'$ in the DL transmission. Consequently, the number of clients broadcasting through the shared bandwidth increases while the number of clients unicasting through allocated orthogonal bandwidth decreases, raising the average DL rate. Combining these observations, a near-optimal strategy is to fix $\phi$ to 1, considering both accuracy and DL transmission rate.
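The bandwidth argument can be illustrated with a toy calculation; the allocation model below (one shared broadcast channel for $C'$ plus one orthogonal channel per remaining unicast client) and the numbers are our assumptions, not the paper's simulation setup.

```python
import math

def avg_dl_rate(n_clients, phi, W=10e6, snr=10.0):
    n_bcast = round(phi * n_clients)             # clients in C' (broadcast)
    n_uni = n_clients - n_bcast                  # remaining unicast clients
    n_channels = n_uni + (1 if n_bcast else 0)   # equal orthogonal slices of W
    return (W / n_channels) * math.log2(1 + snr) # per-client Shannon rate

for phi in (0.0, 0.5, 1.0):
    print(phi, avg_dl_rate(100, phi))            # the rate grows sharply with phi
```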

B. Balancing Inter-Client Updates Via Averaging Client Models
After FP and BP, Mix2SFL unifies the lower model segment of clients through SFL.
3) SFL: After the clients perform training for $T_s$ iterations, the server receives all lower model segments from the clients and computes their weighted average to generate a global model $w_c$ for the clients:

$w_c = \sum_{i \in C} \frac{|D_i|}{|D|}\, w_{c,i}. \quad (11)$

Then the server broadcasts the global model to the clients, completing the aggregation process of Mix2SFL. Recalling and re-expressing $\bar{g}^{k-1}_{c,i}$ in (10) via the BP chain rule (12), the last term is the summation of the client-side gradients.
Assuming that $w^{k-1}_{c,j}$ is the same for all $j \in C$ (i.e., SFL) in (12), GradMix has the effect of mixing up the entire gradient that is backpropagated to the client, whereas GradMix alone mixes up only the gradient at the split-layer. In other words, GradMix combined with SFL unifies the gradient that is backpropagated to the client, and as a result successfully solves the inter-client update imbalance.

Impact of $T_s$: While this aggregation step of SFL allows clients to share the information embedded in their model weights with one another, it also places a communication burden on the client side depending on the model size and the location of the split-layer. As a result, in Fig. 5, SFL with $T_s = 1$ achieves the highest accuracy while compromising communication efficiency due to the increase in the uplink (UL) communication cost $B_{UL}$, measured as the accumulated number of bits in the UL transmission.
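A hedged sketch of the aggregation step in (11) follows; the data-size weights $|D_i|/|D|$ are taken from the notation of Section II, and the dict interface is an assumption.

```python
import copy

def aggregate_lower_segments(client_segments, data_sizes):
    """client_segments: dict id -> nn.Module; data_sizes: dict id -> |D_i|."""
    total = sum(data_sizes.values())
    ref = next(iter(client_segments.values()))
    global_state = copy.deepcopy(ref.state_dict())
    for key in global_state:                      # weighted average per tensor
        global_state[key] = sum(
            (data_sizes[i] / total) * client_segments[i].state_dict()[key]
            for i in client_segments)
    return global_state                           # broadcast back to all clients
```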
By combining SmashMix, GradMix, and SFL on the Parallel SL framework, we design Mix2SFL, whose detailed operation is given in Algorithm 1. To resolve the update imbalance problem inherent in Parallel SL, SmashMix in the FP supplies an averaged FP flow to the server in a combinatorial manner, and GradMix in the BP provides an averaged split-layer BP flow to the clients, alleviating the imbalance between the FP and BP flows. Furthermore, SFL averages the client-side BP flow together with GradMix as in (12), thereby synchronizing the FP and BP flows throughout the entire model.
In addition, to improve the communication efficiency of Mix2SFL, in the DL, GradMix allows full-band utilization, resulting in a high data rate $R_{DL}$, and in the UL, SFL reduces the communication cost $B_{UL}$ by increasing the model uploading period. The high communication efficiency of Mix2SFL is further verified experimentally through latency measurements under limited resources in Section IV.

IV. NUMERICAL EVALUATION
Starting from 1-by-1 combinations of the component technologies, this section evaluates the performance of Mix2SFL step by step against existing SL algorithms including Vanilla SL and SFL. Additionally, for comparison, we use SplitLr from [13], whose core operation is to scale the server's learning rate separately from the client's. This behavior of SplitLr is similar to that of SmashMix in that it propagates the same FP flow to the server multiple times instead of a mixed FP flow, and it is detailed in Appendix A, available online. As performance metrics, we utilize top-1 accuracy, total communication rounds, latency, and information leakage.
There are 20 to 100 clients in our evaluation environment [11], [15], [26], [27], [28], and they store the distributed CIFAR-10 [29] and fashion-MNIST [30] data sets. We assume an IID data environment, where each client contains 10% of the total data set: 5,000 samples for CIFAR-10 and 6,000 for fashion-MNIST. The base neural network is the LeNet-5 model [31]. The split-layer is located after two convolutional layers, each followed by a max-pooling layer. The clients thus act as feature extractors, each containing 2,872 parameters, while the server stores three fully connected layers with ReLU activation, totaling 59,134 parameters. We train the network with the SGD optimizer with learning rate $\eta = 0.004$, momentum 0.9, and weight decay $5 \times 10^{-4}$. We iterate training for 10,000 communication rounds with batch size $|B_i| = 64$ for all $i$. The hyperparameters in Table I and Fig. 6 are as follows: $n_s = |C|/5$, $\phi = 0.5$, $T_s = 1$, and $\alpha = 0.5$. In addition, a list of notations is provided in Appendix B, available online.
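The described split can be reproduced as below; the convolutional and fully connected dimensions follow the classic LeNet-5 on CIFAR-10 and match the stated parameter counts (2,872 and 59,134), while the exact activation placement is our assumption.

```python
import torch.nn as nn

client_segment = nn.Sequential(                  # feature extractor (client)
    nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2))
server_segment = nn.Sequential(                  # three FC layers (server)
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10))

print(sum(p.numel() for p in client_segment.parameters()))  # 2872
print(sum(p.numel() for p in server_segment.parameters()))  # 59134
```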

A. Scalability, Accuracy, and Convergence Speed
As the comparison group for Parallel SL in Tables I and II as well as Fig. 6, we use SFL as the baseline and compose one-to-one combinations with the other component techniques (SmashMix, SplitLr, GradMix). This is because SFL is the only existing Parallel SL algorithm that solves the inter-client imbalance of the two imbalance problems by default. Table I reports top-1 accuracy as well as the total communication rounds. Here, the total communication rounds indicate the number of iterations until the corresponding top-1 accuracy is achieved, which can be interpreted as convergence speed.
First, in terms of scalability, aside from Vanilla SL, only the two Parallel SL combinations involving SmashMix and SplitLr satisfy scalability in the sense that top-1 accuracy increases with the number of clients $|C|$. Notably, the combination with SplitLr performs better than the one with SmashMix when the number of clients is small (less than or equal to 60), and vice versa when the number of clients exceeds 60. Since the performance gain of SmashMix depends heavily on the number of mixups between the smashed data uploaded by the clients, the gain may be marginal with a small number of clients, as the upper bound of $n_s$ is determined by $|C|$ ($n_s \leq |C| - 1$). Therefore, the accuracy gain of SmashMix overtakes that of SplitLr as the number of clients increases. Regarding convergence speed, the combination SmashMix + SFL is superior to the other SL algorithms. While the lower-model-segment aggregation of SFL improves the convergence speed of the lower model segment, SmashMix improves that of the upper model segment by providing more samples to the server side, which together accelerates the convergence of the entire model. Furthermore, the larger the number of clients, the faster the model reaches its top performance.
For a brief validation of SmashMix + SFL on different data sets, Fig. 6(b) displays the learning curve for the fashion-MNIST data set when $|C| = 100$, while Fig. 6(a) shows it for CIFAR-10. Moreover, to examine the impact of the data set's non-IIDness, we set the data distribution in Fig. 6(c) and (d) as a Dirichlet distribution [32] with concentration parameters $\alpha_D$ of 2 and 6, respectively. As shown in Fig. 6, the superiority of SmashMix + SFL in terms of accuracy and convergence speed holds across data set types as well as data distributions. In addition, it is noteworthy that the small accuracy drop of Parallel SL compared to Vanilla SL shows the robustness of its parallelization effect against non-IIDness.
Going one step further, to assess accuracy when communication efficiency is improved, Table II measures the accuracy of the two scalable, high-accuracy combinations in Table I (SmashMix + SFL and SplitLr + SFL) by either increasing $T_s$ ($\phi = 0$ by default) or applying GradMix with $\phi > 0$ while maintaining $T_s = 1$. First, both solutions show a drastic accuracy drop as $T_s$ increases, revealing an accuracy-communication efficiency tradeoff. When instead combining GradMix, accuracy improves as $\phi$ increases, while the communication-efficiency improvement with growing $\phi$ is already shown in Fig. 4. This improved accuracy-communication efficiency, together with scalability, is particularly evident in SmashMix + GradMix + SFL (Mix2SFL), which achieves the tri-fold goal with fast convergence in the Parallel SL architecture by successfully matching the FP-BP flows under model synchronization.

B. Latency Analysis
In this subsection, we theoretically analyze the resulting latency of Mix2SFL by comparing it with various SL algorithms. Note that among the hyperparameters, $n_s$ and $\alpha$ have little or no effect on latency. This means the latency of Mix2SFL is determined by $\phi$ or $T_s$, among which $T_s$ is fixed to 1 considering the accuracy drop shown in Fig. 5. Moreover, we add LocSFL [33] as a comparator for the latency analysis. In brief, LocSFL replaces the global gradient with a local gradient obtained via an auxiliary network with weight $w_{a,i}$ attached to each client, so that DL communication can be omitted.
Then, Table III derives the computation-communication latency of Mix2SFL as well as Vanilla SL, SFL, and LocSFL. For simplicity, the computation time $T_{comp}$ considers only the FP or BP computing time on the client side. Moreover, when measuring the communication time, bandwidth is assumed to be equally distributed to all clients and channel fluctuations are neglected. Further, $T_{UL}$ and $T_{DL}$ are the unit times of UL and DL communication under full-band utilization, respectively; these parameters are used to measure the latency in Fig. 7.

In Fig. 7, minimum and maximum values exist only for Mix2SFL and SFL, whose latency varies with the hyperparameters. First, overall latency is especially large when $T_{comp} : T_{UL}$ is 1:10. With the help of parallelized computing resources, computational latency does not vary significantly even as the number of clients increases.
However, communication latency can become the bottleneck, since all clients share limited bandwidth. This highlights the strength of Mix2SFL: it can maintain low latency in a poor communication environment thanks to its averaged split-layer gradient, which enables bandwidth sharing among clients, leading to a low mean and variance in its latency. On the other hand, SFL has strengths in terms of minimum latency: adjusting the weight aggregation period $T_s$ can greatly reduce the model aggregation latency, which occupies a large portion of the communication latency, while compromising accuracy as in Fig. 5. Taken together, Mix2SFL is the best SL technique in terms of latency when its mean and variance are considered simultaneously.

C. Information Leakage
Another important aspect of SL is its ability to prevent the privacy leakage of client-side input data that results from sending smashed data. In this subsection, we experimentally measure the privacy guarantee against the reconstruction attack for Mix2SFL's component technologies: SmashMix, GradMix, and SFL. For the measurement, we use as a decoder the upper model segment of a ResNet autoencoder pre-trained on the CIFAR-10 data set from PyTorch Lightning Bolts [34]. Mean squared error (MSE) measures the reconstruction loss between the input data and the decoder output, referred to as the reconstructed data. A large reconstruction loss in Table IV thus means that privacy is well protected, and vice versa.
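The measurement can be sketched as below; the decoder is assumed to be available pre-trained (e.g., from the PyTorch Lightning Bolts autoencoder mentioned above), and the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(decoder, client_segment, x):
    with torch.no_grad():
        smashed = client_segment(x)       # what an eavesdropper could observe
        x_hat = decoder(smashed)          # attempted input reconstruction
    return F.mse_loss(x_hat, x).item()    # larger loss = better privacy
```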
In Table IV, the reconstruction loss of GradMix increases as the fraction $\phi$ increases. That is, the decoder has more difficulty restoring the input data from the clients' smashed data as more participants share the averaged split-layer gradient. As the number of participants increases, the proportion of each client's own gradient reflected in the averaged split-layer gradient shrinks, and thus the privacy of that gradient is naturally protected. SmashMix exhibits a similar tendency to GradMix: although SmashMix operates in the FP while GradMix operates in the BP, both techniques center on exchanging information (i.e., smashed data or split-layer gradients) between clients. Similar to the "Hiding in the Crowd" effect [35] by which GradMix's privacy improves as $\phi$ increases, enlarging $n_s$ strengthens SmashMix's privacy guarantee. In addition, Table IV demonstrates that indirectly influencing the distribution of smashed data by averaging weights has a weaker impact than directly averaging smashed data or gradients, leading to inconsistent results for $T_s$. Comparing all the algorithms above, GradMix with $\phi = 1$ performs best in terms of privacy guarantee.
Furthermore, Fig. 8 shows examples of input data and the corresponding reconstructed data with respect to $\phi$ and $T_s$, visually confirming the aforementioned tendency of the reconstruction loss. Mix2SFL's privacy guarantee under joint changes of the hyperparameters has not been explored yet and is left as future work.
Impacts of Mix2SFL and Its Hyperparameters: Mix2SFL's SmashMix and GradMix correct the mismatch of the FP and BP flows between server and clients, respectively, and SFL solves the inter-client update imbalance. With respect to the hyperparameters, except for $n_s$, whose optimization is deferred to future work, fixing $T_s = 1$ and increasing $\phi$ in Mix2SFL is the optimal choice in all respects.

V. CONCLUSION
We investigated the parallel implementation of SL, which can simultaneously solve the excessive computation-communication load of FL and the large latency of Vanilla SL. Existing Parallel SL algorithms, however, do not guarantee scalability, and we pointed out that the unscalability of Parallel SL stems from its own structural problem, dubbed the update imbalance problem. We divided this problem into two parts: the imbalance of the FP-BP flows between client and server, and the absence of model integration. Combinatorially utilizing the existing component technologies of Parallel SL, we proposed a novel SL framework coined Mix2SFL. In Mix2SFL, SmashMix and GradMix supply averaged FP and BP flows to the server and clients, respectively, and SFL unifies the gradient that is backpropagated to the clients. Numerical evaluation corroborated that Mix2SFL ensures scalability by resolving the update imbalance problem. Moreover, simulation results showed that the proposed algorithm outperforms existing SL techniques in terms of accuracy, convergence speed, latency, and privacy guarantee. As future work, regarding communication efficiency, we consider a study of the Parallel SL structure that adaptively controls the batch size in an environment with fluctuating channel gains. Optimization of the various hyperparameters and a convergence analysis of Mix2SFL could also be interesting topics. Lastly, as in [36], robustness measurement against privacy attacks other than the reconstruction attack is deferred to future work.

Fig. 2. Demonstration of the three component technologies of Mix2SFL.

Fig. 3. Top-1 accuracy of SmashMix w.r.t. the number $n_s$ of mixed-up smashed data and the number $|C|$ of participating clients.

Fig. 5. Top-1 accuracy and cumulative UL cost (in Mbit) of SFL w.r.t. the lower model aggregation interval $T_s$ (in log scale).

Fig. 8. Visualization of input data and corresponding reconstructed data of GradMix (top) and SFL (bottom) depending on the hyperparameters $\phi$ and $T_s$, respectively.

TABLE I. Top-1 accuracy and total communication rounds of multiple SL algorithms.

TABLE II. Top-1 accuracy of combinations derived from the component technologies w.r.t. $|C|$ and hyperparameters.

TABLE III. Latency comparison during T communication rounds of SL algorithms with n clients.

TABLE IV. Reconstruction loss of the component technologies of Mix2SFL w.r.t. hyperparameters when $|C| = 10$.