Introduction
In recent years, the increasing availability of advanced hardware support and large-scale datasets has led to the widespread adoption of massive deep neural network (DNN) models as the primary machine learning approach. However, training massive DNN models on large-scale datasets takes a considerable amount of time, leading to an increasing demand for efficient model training. To efficiently train massive DNN models, for example, VGG [16] and ResNet [31] in image processing tasks, or large language models [3], [4], distributed training has been extensively studied [5], [6], [7]. In general, distributed training is divided into two types: data parallelism and model parallelism.
Data parallelism partitions and distributes the training datasets across multiple computing nodes. In the most common data parallelism architecture, each computing node has a local copy of the master DNN model, and the synchronization between the local models and the master model takes place in the parameter server where the master model resides. Depending on the synchronization method, data parallelism is divided into two types: synchronous data parallelism (SDP) [8], [9] and asynchronous data parallelism (ADP) [10], [11]. In SDP, the parameter server waits for all computing nodes to complete their work and then synchronizes the master model with the local models in the computing nodes. Because all computing nodes wait for the synchronization to be completed, the training time of SDP is determined by the slowest computing node. In ADP, however, the parameter server does not wait for all computing nodes to finish their work; it synchronizes with whichever computing node completes its computation first. Therefore, ADP can accelerate training more than SDP. However, since the parameter server accepts gradients from the computing nodes in the order of their arrival and applies them to the master model, a late-arriving gradient cannot be applied to the exact version of the master model parameters that was used to compute it. Instead, it is applied to master model parameters that have already been updated by gradients that arrived earlier. These late-arriving gradients are called delayed gradients (or stale gradients), and they are a major cause of degraded training performance in ADP [12].
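As a minimal illustration of how delayed gradients arise in ADP, consider the following Python sketch. It is our own toy example, not an implementation from the paper or any particular framework; the names ParameterServer, pull, and push are hypothetical.

import random

# Toy parameter server that applies gradients in arrival order (ADP-style),
# illustrating gradient staleness; scalar parameters keep the example short.
class ParameterServer:
    def __init__(self, w0):
        self.w = w0          # master model parameter
        self.version = 0     # number of updates applied so far

    def pull(self):
        # A worker copies the current master parameter and its version.
        return self.w, self.version

    def push(self, grad, based_on_version, lr=0.1):
        # The gradient is applied immediately, even if the master has already
        # been updated since `based_on_version` was pulled (delayed gradient).
        staleness = self.version - based_on_version
        self.w -= lr * grad
        self.version += 1
        return staleness

server = ParameterServer(w0=1.0)
pulled = [server.pull() for _ in range(4)]     # four workers pull the same version
random.shuffle(pulled)                         # they finish in an arbitrary order
for w_local, version in pulled:
    grad = 2.0 * w_local                       # e.g., gradient of f(w) = w^2
    staleness = server.push(grad, version)
    print(f"gradient computed on version {version}, staleness = {staleness}")

Only the first gradient to arrive is applied to the parameters it was computed from; the later ones are applied to a master model that has already moved on, which is exactly the staleness that degrades ADP training.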
Model parallelism partitions the model parameters and distributes them across multiple computing nodes [11]. In naive model parallelism, hardware utilization is low since only one computing node may be active at a time [13]. To mitigate this problem, pipeline parallelism has been proposed [13], [14], [15]. Pipeline parallelism can process multiple mini-batches simultaneously on all computing nodes, resulting in a significant acceleration in computation. Although pipeline parallelism can train DNN models efficiently in terms of training time, it suffers from two problems: the weight inconsistency problem and the delayed gradient problem. In naive pipeline parallelism, the model parameters used in the forward pass and the backward pass of a mini-batch differ, which is called the weight inconsistency problem. In addition, since the model parameters used during the forward pass are not maintained until the gradients are applied, the delayed gradient problem arises as in ADP. These problems deteriorate model convergence, causing longer training time and higher error rates.
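To make the weight-version mismatch concrete, the toy Python sketch below (ours, not from the paper) only tracks the weight version seen on a single stage under naive pipeline scheduling: every in-flight mini-batch runs its forward pass on an old version, while its gradient is eventually applied to a newer one.

# Toy bookkeeping of weight versions on one pipeline stage (illustrative only).
in_flight = 4                 # mini-batches active in the pipeline at once
version = 0                   # current weight version on this stage
forward_version = {}          # mini-batch id -> version used in its forward pass

for m in range(in_flight):    # warm-up: forward passes only
    forward_version[m] = version

for m in range(in_flight):    # backward passes, each followed by a weight update
    print(f"mini-batch {m}: forward used v{forward_version[m]}, "
          f"gradient applied to v{version}")
    version += 1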
Fig. 1(a) illustrates an example of a pipeline parallelism scheduling timeline [15] that exhibits the weight inconsistency and delayed gradient problems. Let
Fig. 1. (a) An example of a pipeline parallelism scheduling timeline [15] exhibiting the weight inconsistency and delayed gradient problems. (b) An example of a pipeline parallelism scheduling timeline [15] utilizing the weight stashing and vertical sync methods.
Existing studies on pipeline parallelism have proposed various methods to address these problems by synchronously updating model parameters [13], [16], predicting correct model parameters [17], [18], or generating multiple model replicas [15]. However, these methods have drawbacks, such as becoming less accurate as the number of computing nodes increases or underutilizing the hardware. For example, generating multiple model replicas mitigates the weight inconsistency problem, but the delayed gradient problem may still arise depending on the synchronization method (e.g., applying ADP to synchronize multiple model replicas, as in [15], may result in the delayed gradient problem, which is explained further in Section II).
In this paper, we propose a novel pipeline parallelism method, called EA-Pipe, that addresses the delayed gradient problem that arises when multiple model replicas are generated to mitigate the weight inconsistency problem. To this end, we view the delayed gradient problem from a data parallelism perspective and apply an optimization method previously studied in the area of data parallelism. In particular, EA-Pipe utilizes a moving-average-based elastic force called elastic averaging [19]. The elastic averaging algorithm showed performance comparable to prior ADP synchronization methods while alleviating the delayed gradient problem and achieving more stable learning behavior. Therefore, similar effects can be expected when applying the algorithm to pipeline parallelism. The contributions of this paper are as follows:
Unlike existing methods, we approach the problems in pipeline parallelism from a data parallelism perspective. Thereby, we can leverage the advantages of existing ADP optimization methods, such as the elastic averaging algorithm, for pipeline parallelism.
We propose a novel round-robin parallel training algorithm that combines pipeline parallelism and asynchronous data parallelism to solve the weight inconsistency and delayed gradient problems.
We analyze the convergence property of the proposed method and show that its error bound is the same as in the synchronous elastic averaging case. To the best of the authors' knowledge, this is the first theoretical convergence analysis of round-robin asynchronous data parallelism through elastic averaging pipeline parallelism.
The experimental results indicate that EA-Pipe not only enhances training speed but also trains DNN models as stably as SGD. In particular, in the experiments using the CIFAR-100 and ImageNet datasets, EA-Pipe achieved error rates that were 2.58% and 2.19% lower, respectively, than those of the baseline pipeline parallelization method, PipeDream.
The rest of the paper is organized as follows. In Section II, we review related work concerning EA-Pipe. In Section III, the work scheduling, algorithm, and convergence property of EA-Pipe are explained. In Section IV, we verify the proposed method through three image classification experiments using three different models and datasets. The conclusion of this study and suggestions for future research are discussed in Section V, followed by the Appendix, which includes the detailed proof of the convergence analysis.
Related Work
A. PipeDream
PipeDream [15] follows a scheduling strategy in which each computing node executes the forward and backward passes alternately for different mini-batches, ensuring high hardware utilization. Fig. 1(a) shows the pipeline scheduling of PipeDream. However, this pipeline scheduling leads to the weight inconsistency and delayed gradient problems, since the versions of the model parameters used in the forward and backward passes are inconsistent. To mitigate these problems, PipeDream proposed the weight stashing and vertical sync methods. The weight stashing method generates as many model replicas as the number of active mini-batches for each computing node, so that each mini-batch uses the same model parameters in both the forward and backward passes. The vertical sync method generates as many model replicas as the maximum number of active mini-batches in the pipeline, ensuring that time-synchronized weight versions are available throughout all computing nodes. The maximum number of active mini-batches is always equal to the number of computing nodes. Fig. 1(b) shows PipeDream utilizing the weight stashing and vertical sync methods.
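The following Python sketch (our illustration, not PipeDream's implementation) shows the idea behind weight stashing on a single stage: the weight version used in a mini-batch's forward pass is stashed and reused for that mini-batch's backward pass, so both passes see consistent parameters. The class Stage and its methods are hypothetical names.

# Illustrative weight stashing on a single pipeline stage with a toy linear layer.
class Stage:
    def __init__(self, weights):
        self.weights = weights          # latest weights on this stage
        self.stash = {}                 # mini-batch id -> stashed weight copy

    def forward(self, mb_id, x):
        self.stash[mb_id] = dict(self.weights)   # stash the version used here
        return self.stash[mb_id]["w"] * x

    def backward(self, mb_id, x, grad_out, lr=0.01):
        w_used = self.stash.pop(mb_id)["w"]      # same version as the forward pass
        grad_w = grad_out * x                    # dL/dw for y = w * x
        grad_x = grad_out * w_used               # gradient passed upstream
        self.weights["w"] -= lr * grad_w         # update the latest weights
        return grad_x

stage = Stage({"w": 0.5})
stage.forward(0, x=2.0)
stage.forward(1, x=3.0)                  # a second mini-batch enters the pipeline
stage.backward(0, x=2.0, grad_out=1.0)   # consistent with mini-batch 0's forward
stage.backward(1, x=3.0, grad_out=1.0)   # consistent with mini-batch 1's forward

Note that the delayed gradient problem is still visible here: the gradient of mini-batch 1 is computed with stashed (older) weights but applied to weights already updated by mini-batch 0, which is exactly the remaining issue discussed next.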
With these two methods, PipeDream effectively solves the weight inconsistency problem. However, during the backward pass, the computed gradients are applied to model parameters that have already been updated by earlier gradients. Therefore, the delayed gradient problem still remains. For example,
EA-Pipe addresses the delayed gradient problem that remains even after the weight inconsistency problem is resolved by generating multiple model replicas, as in PipeDream. To tackle this problem, EA-Pipe approaches the delayed gradient problem from the perspective of data parallelism and applies the elastic averaging algorithm previously studied in that context.
B. Elastic Averaging Algorithm
The elastic averaging algorithm [19] was proposed to reduce the communication cost in data parallelism by enabling each computing node to conduct more local training computations and explore more extensively away from the master model before synchronization. Each local model is maintained as if it were connected to the master model by an elastic force, which constrains the distance between the local and master model parameters. The stronger the elastic force, the more the distance between the local and master model parameters is constrained. While training, each local model fluctuates around the master model, so the risk of the master model falling into local optima may be reduced. The synchronization equation of the elastic averaging algorithm in data parallelism is as follows:\begin{align*} \mathbf {x} &\leftarrow \mathbf {x}_{n} \\ \mathbf {x}_{n} &\leftarrow \mathbf {x}_{n} - \alpha (\mathbf {x} - \bar {\mathbf {x}}) \\ \bar {\mathbf {x}} &\leftarrow \bar {\mathbf {x}} + \alpha (\mathbf {x} - \bar {\mathbf {x}}){,} \tag{1}\end{align*}
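As a concrete reading of Eq. (1), the Python sketch below (ours; the function and variable names are illustrative) performs one elastic averaging synchronization step between a local model and the master model.

import numpy as np

def elastic_averaging_sync(x_local, x_master, alpha):
    """One synchronization step following Eq. (1).

    x_local  : local model parameters (x_n)
    x_master : master model parameters (x-bar)
    alpha    : elastic force
    """
    diff = x_local - x_master                # x - x_bar, with x a snapshot of x_n
    return x_local - alpha * diff, x_master + alpha * diff

x_n = np.array([1.0, 2.0])
x_bar = np.array([0.0, 0.0])
x_n, x_bar = elastic_averaging_sync(x_n, x_bar, alpha=0.3)
print(x_n, x_bar)   # [0.7 1.4] [0.3 0.6]: both models move toward each other

The update moves the local and master models toward each other by the same amount, which is the elastic-force interpretation given above.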
EA-Pipe incorporates the elastic averaging algorithm into pipeline parallelism. As a result, the synchronization of model parameters occurs partially, unlike the elastic averaging algorithm in data parallelism. Further details on parameter synchronization in EA-Pipe will be discussed in Section III.
Proposed Method
In this section, we propose a novel pipeline parallelism method, called EA-Pipe, which mitigates the weight inconsistency and delayed gradient problems at the same time. This section is divided into four subsections. First, we introduce the pipeline scheduling of EA-Pipe. Second, we present the EA-Pipe algorithm in pseudo-code. Third, we analyze the convergence property of EA-Pipe. Lastly, we compare EA-Pipe with a recent work that uses a similar approach.
A. Pipeline Scheduling of EA-Pipe
EA-Pipe follows the same structure as PipeDream in Fig. 1(b). However, unlike PipeDream, the local models in EA-Pipe update their parameters locally, without synchronizing with the master model parameters, until a specific synchronization period is reached. Fig. 2 shows an example workflow of EA-Pipe with two computing nodes when the synchronization period is set to 2. In Fig. 2(a), the local model parameters are updated locally without synchronization with the master model parameters. For example, at the backward pass
Fig. 2. An example workflow of EA-Pipe with two computing nodes when the synchronization period is set to 2. In (a), the highlighted boxes represent the backward passes and synchronization with the master model. In (b),
Fig. 2(b) shows how the parameters are updated in the second computing node by EA-Pipe in a virtual parameter space. For example, the second computing node updates the local model parameters
B. Algorithm
A pseudo-code of EA-Pipe is shown in Algorithm 1, which is executed by all computing nodes simultaneously.
Algorithm 1 EA-Pipe: Executed by Computing Node
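Since the pseudo-code is easiest to read next to a concrete realization, we also give a simplified Python sketch of the per-node loop below. It is our own reconstruction based solely on the description in this section (local SGD updates followed by an elastic averaging synchronization with the master every tau steps); the names compute_gradient, master_lock, and the lock-based access to the master parameters are illustrative assumptions, not part of Algorithm 1.

import numpy as np
from threading import Lock

master_lock = Lock()   # serializes master-model access, one node at a time

def ea_pipe_worker(x_local, x_master, data_iter, compute_gradient,
                   lr=0.01, alpha=0.3, tau=1, num_steps=100):
    """Simplified per-node EA-Pipe loop (illustrative reconstruction).

    x_local, x_master : NumPy parameter vectors (x_master is shared, updated in place)
    compute_gradient  : function(params, batch) -> stochastic gradient
    tau               : synchronization (communication) period
    alpha             : elastic force
    """
    for k in range(num_steps):
        batch = next(data_iter)
        x_local -= lr * compute_gradient(x_local, batch)   # local SGD update

        if (k + 1) % tau == 0:          # synchronization period reached
            with master_lock:           # only one node touches the master at a time
                diff = x_local - x_master
                x_local -= alpha * diff          # elastic averaging, Eq. (1)
                x_master += alpha * diff
    return x_local

Here tau corresponds to the communication period and alpha to the elastic force used in the experiments of Section IV.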
C. Convergence Analysis
In this section, we discuss the convergence property of EA-Pipe. Previous studies provide only limited analyses of the elastic averaging algorithm. For example, the analysis in [19] covered only one local update for quadratic objective functions, and the convergence analysis presented in [20] applies only to a synchronous elastic averaging algorithm. Therefore, the convergence analysis of an asynchronous elastic averaging algorithm remains an open question. In EA-Pipe, the local models are synchronized with the master model sequentially, which can be considered a form of round-robin data parallelism [21]. As shown in Fig. 2(b), we can observe that the synchronization order between the master model
The objective function to be minimized is defined as\begin{equation*} F(\mathbf {x}):= \frac {1}{N}\sum ^{N}_{i=1}\mathbb {E}_{\mathbf {s}\sim {\mathcal {D}}_{i}}[\mathcal {L}(\mathbf {x}, \mathbf {s})]{,} \tag{2}\end{equation*}
Based on the analyses in [20] and [22], which prove the convergence rate of distributed SGD algorithms with local updates on non-convex objectives, we can derive the following theorem, which guarantees that EA-Pipe converges to stationary points of non-convex objective functions with the same error bound as a synchronous elastic averaging algorithm.
Theorem 1:
Let \begin{align*} &\hspace {-1pc}\frac {1}{K}\sum _{k=0}^{K-1}\mathbb {E}[\|{\nabla }F(\mathbf {y}_{k})\|^{2}] \\ &\leq \frac {2(F(\mathbf {y}_{0})-F_{\mathrm {inf}})}{\eta _{\mathrm {eff}}K} +\frac {\eta _{\mathrm {eff}}L\sigma ^{2}}{N} \\ &\quad + \eta ^{2}_{\mathrm {eff}}L^{2}\sigma ^{2}\left ({\frac {1+\zeta ^{2}}{1-\zeta ^{2}}\tau - 1}\right)\left ({1+\frac {1}{N}}\right)^{2}, \tag{3}\end{align*}
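To make the roles of the constants in the bound (3) concrete, the small Python helper below (ours, purely illustrative) evaluates its right-hand side for given values of the constants; it is only a numerical reading of the theorem, not part of the proof, and the example values are arbitrary.

def ea_pipe_error_bound(F0_minus_Finf, eta_eff, L, sigma2, N, K, tau, zeta):
    """Right-hand side of the bound in Eq. (3)."""
    term1 = 2.0 * F0_minus_Finf / (eta_eff * K)
    term2 = eta_eff * L * sigma2 / N
    term3 = (eta_eff ** 2) * (L ** 2) * sigma2 \
            * ((1 + zeta ** 2) / (1 - zeta ** 2) * tau - 1) \
            * (1 + 1.0 / N) ** 2
    return term1 + term2 + term3

# The bound tightens as K grows and loosens as tau or zeta grows.
print(ea_pipe_error_bound(F0_minus_Finf=10.0, eta_eff=0.01, L=1.0,
                          sigma2=1.0, N=8, K=10000, tau=1, zeta=0.5))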
Theorem 1 demonstrates that if the learning rate
D. Comparison to Similar Works
Recently, AvgPipe was proposed to enhance the throughput of pipeline parallelism by incorporating the elastic averaging algorithm into pipeline parallelism [24]. Although AvgPipe and EA-Pipe share similarities, we highlight some distinctions between their approach and ours. First, while AvgPipe is designed to operate at the micro-batch level, which imposes a constraint on the mini-batch size, EA-Pipe is devised to operate at the mini-batch level without any size constraint. Second, while AvgPipe introduced the elastic averaging algorithm to increase the throughput of pipeline parallelism, our approach integrates the elastic averaging algorithm to address the delayed gradient problem. Last but not least, in contrast to [24], we conduct an analysis of the convergence property of pipeline parallelism with the elastic averaging algorithm.
Experiments
In this section, we describe three experiments conducted to verify the performance of the proposed method: small-scale, mid-scale, and large-scale experiments. The small-scale experiment evaluates EA-Pipe on an image classification task using the CIFAR-10 dataset with VGG-16 [16] as the DNN model. The CIFAR-10 dataset has 60,000 images in 10 classes.
We compared four training methods: the conventional sequential SGD (denoted as SGD in the tables and figures), PipeDream [15], SpecTrain [17], and the proposed EA-Pipe. Since SGD with momentum is commonly employed in computer vision tasks, we chose it as the baseline sequential training method [25]. We included SpecTrain because it addresses the weight inconsistency and delayed gradient problems through model parameter prediction. All methods were implemented in CUDA/C++. The performance of each training method was measured on a server with an Intel Xeon Silver 4110 CPU and eight NVIDIA GeForce RTX 2080 Ti GPUs.
The performance of the models trained with each algorithm was measured by the ratio of misclassified images in the test data. We ran three repetitions of each experiment, each with different random initial weights, and report the average performance with a 95% confidence interval for each method. The best-case results of the three repetitions can be found in Appendices B, C, and D. The same evaluation protocol was maintained throughout the subsequent experiments.
In the second experiment (the mid-scale experiment), we scaled up from the first experiment, assessing EA-Pipe on an image classification task using the CIFAR-100 dataset with ResNet-34 [31] as the DNN model. The CIFAR-100 dataset closely resembles CIFAR-10, except that it has 100 classes containing 600 images each. ResNet-34 is a variant of the residual network architecture with 34 layers, utilizing skip connections to address training challenges in deep neural networks. We compared the same four training methods as in the previous experiment.
The third experiment compares EA-Pipe and PipeDream to further examine the effect of delayed gradients at a larger scale of parallelism. We conducted the experiment using the ResNet-50 model and the ImageNet dataset, simulating PipeDream and EA-Pipe in a virtual 50-GPU environment. The ImageNet dataset contains 1,281,167 training images, 50,000 validation images, and 100,000 test images in 1,000 classes. The main objective of the third experiment is to evaluate the effectiveness of EA-Pipe in addressing the delayed gradient problem in large-scale environments. The simulation was conducted on a server with an Intel i5-10600K CPU and an NVIDIA RTX 2080 Ti GPU.
A. Experimental Results on CIFAR-10 With VGG-16 (Small-Scale Experiment)
We conducted small-scale experiments using the CIFAR-10 dataset and the VGG-16 model with varying numbers of GPUs (one, two, four, and eight). For all four methods, we tried batch sizes of 128, 64, 32, and 16, and learning rates of 0.1 and 0.01. The number of training epochs was set to 100, and learning rates were reduced to one tenth every 30 epochs. SpecTrain, however, did not converge with the learning rate candidates above, so we conducted additional SpecTrain runs using learning rates of 0.001, 0.0005, 0.0003, and 0.0001, paired with batch sizes of 128, 64, 32, and 16, respectively. For EA-Pipe, we set the communication period to 1 and measured the best performance by varying the elastic force (0.1, 0.3, and 0.5).
Table 1 summarizes the results of the small-scale experiment. A batch size of 16 and a learning rate of 0.01 were the optimal hyperparameters for SGD, PipeDream, and EA-Pipe, while a batch size of 32 and a learning rate of 0.0003 were optimal for SpecTrain. EA-Pipe showed the best performance at an elastic force of 0.3. SGD showed the lowest error rate of 7.36%. SpecTrain succeeded in training only when the learning rate was set to a small value (0.0005), while still reaching higher error rates than PipeDream and EA-Pipe. PipeDream and the proposed EA-Pipe did not exhibit significant differences. Therefore, additional validation of the proposed method through larger-scale experiments was necessary, which is discussed in the following subsections.
Table 2 shows the computational overhead for the four training methods in the small-scale experiment. The computational overhead was calculated by dividing the total training time (in seconds) by the total number of epochs. SpecTrain showed a lower computational overhead compared to PipeDream and EA-Pipe since it used a batch size of 32, while PipeDream and EA-Pipe used a batch size of 16. These were the optimal values for each training method. Though SpecTrain was the fastest, it showed the worst image classification accuracy. Both PipeDream and EA-Pipe showed a reduction in computational overhead as the number of GPUs increased. EA-Pipe exhibited a slightly higher computational overhead compared to PipeDream.
Table 3 shows the statistical efficiency of the four training methods, measured by the total training time required to reach the lowest error rate. As can be seen in Tables 2 and 3, EA-Pipe does not appear to offer a clear advantage for small-scale parallelism.
B. Experimental Results on CIFAR-100 With ResNet-34 (Mid-Scale Experiment)
We conducted a mid-scale experiment using the CIFAR-100 dataset and the ResNet-34 model on 8 GPUs. For all four methods, we employed batch sizes of 64, 32, and 16, and learning rates of 0.1 and 0.01. The number of training epochs was set to 100, and learning rates were reduced to one tenth every 30 epochs. SpecTrain, however, did not converge with the learning rate candidates above, so we conducted additional SpecTrain runs using learning rates of 0.001, 0.0005, and 0.0003, paired with batch sizes of 64, 32, and 16, respectively. For EA-Pipe, we set the communication period to 1 and measured the best performance by varying the elastic force (0.1, 0.3, and 0.5).
Table 4 presents the results of the mid-scale experiment. Both PipeDream and EA-Pipe reached their lowest error rates at a batch size of 16 and a learning rate of 0.01. Except for the case with a batch size of 16 and a learning rate of 0.1, EA-Pipe achieved lower error rates than PipeDream in all cases. SpecTrain failed to train at batch sizes of 64, 32, and 16 with learning rates of 0.1 and 0.01. Similar to the first experiment (Section IV-A), SpecTrain succeeded in training only when the learning rate was set to a smaller value (0.0003), while reaching higher error rates than PipeDream and EA-Pipe. This suggests that the weight prediction used by SpecTrain to address the weight inconsistency and delayed gradient problems may not be effective.
Fig. 3 depicts the error rate curves for three methods (SGD, PipeDream, and EA-Pipe). We did not include SpecTrain, since it showed the worst error rates. The robustness of EA-Pipe can be observed in the error rate curves. At high learning rates (e.g., between epochs 0 and 30), PipeDream exhibits slow convergence and unstable training trends. On the other hand, EA-Pipe shows stable training trends similar to SGD.
Fig. 3. Training curves for the three training methods using the CIFAR-100 dataset with the ResNet-34 model on 8 GPUs.
In Table 5, we compare the computational overheads of the four training methods in the mid-scale experiment. Similar to the small-scale experiment, EA-Pipe exhibited a slightly higher computational overhead than PipeDream. However, as can be seen in Table 6, in terms of the total training time required to reach the lowest error rate, EA-Pipe is the most efficient of the four training methods. We suspect that the efficiency of EA-Pipe becomes more evident as the degree of parallelism increases, which is discussed in the next section.
C. Experimental Results of Large-Scale Parallelism
In the previous section, we observed the statistical efficiency and training stability of EA-Pipe. However, this was not enough to confirm the adverse effect of the delayed gradient problem. Therefore, a large-scale experiment was conducted. Due to the lack of available computing accelerators, we ran the large-scale parallelism experiment comparing EA-Pipe and PipeDream in a 50-GPU simulated environment. For the evaluation, we chose an image classification task on the ImageNet dataset using the ResNet-50 model.
Since an ImageNet experiment requires a significant amount of time to complete the whole training process, we selected a batch size of 64, the maximum size that the GPU memory can accommodate. To determine the optimal learning rate, we first trained on 10% of the ImageNet dataset and found that 0.06 was the best learning rate among the candidates 0.1, 0.06, 0.03, and 0.01. Then, we evaluated the entire ImageNet dataset using the learning rate of 0.06 as well as two adjacent values (0.08 and 0.04). As a result, we identified 0.04 as the optimal learning rate. For fairness, PipeDream and EA-Pipe use the same batch size and learning rate. The number of training epochs was set to 100, and learning rates were reduced to one tenth every 30 epochs. For EA-Pipe, we set the communication period to 1 and the elastic force to 0.1.
Table 7 shows the results of the three training methods using the ImageNet dataset and the ResNet-50 model on 50 GPUs. In this large-scale setting, EA-Pipe achieved a lower error rate than PipeDream. As we suspected earlier, delayed gradients become more problematic in large-scale parallelism, validating the need for a method that addresses the delayed gradient problem efficiently, such as EA-Pipe.
Fig. 4 depicts the error rate curves for these methods. We can observe that EA-Pipe shows more stable learning behavior than PipeDream at large learning rates, as observed in the previous section. This is because the effective weight change (i.e., the gradients multiplied by the learning rate) is relatively large compared to the small learning rate cases, so the delayed gradients become more harmful in PipeDream.
Fig. 4. Training curves of the three training methods using the ImageNet dataset and the ResNet-50 model on 50 GPUs.
Conclusion and Future Work
In this study, a novel pipelined parallel SGD algorithm, EA-Pipe, has been proposed to mitigate the delayed gradient problem that occurs in pipeline parallelism. It maintains multiple model replicas and synchronizes them based on an elastic averaging scheme. Some conventional approaches reduce the batch size and learning rate to alleviate the delayed gradient problem to some extent, which can reduce GPU utilization and/or increase the training time. The proposed method does not need to adjust the batch size or learning rate, thereby reducing hyperparameter optimization time. The experimental results confirmed that the proposed method achieves error rates comparable to SGD and demonstrates the efficacy of parallel training in large-scale environments. In addition, we analyzed the convergence property of EA-Pipe and confirmed that its error bound is the same as that of the synchronous elastic averaging algorithm. However, the proposed method can be at a disadvantage in terms of memory utilization because it creates multiple model replicas. This disadvantage could constrain the training of very large models, which we leave to future work.
Appendix A: Proof of Theorem 1
In EA-Pipe, the local models are synchronized with the master model in a round-robin manner, which makes the parallelism of EA-Pipe a form of round-robin data parallelism. At each time step
Algorithm 2 Round-Robin Elastic Averaging Algorithm for Local Model Parameter \mathbf{x}^{i}
Initialize
repeat
if
Wait until
else
end if
until
Algorithm 3 Round-Robin Elastic Averaging Algorithm for Master Model Parameter \bar{\mathbf{x}}
Initialize
repeat
if
for
Wait until
end for
end if
until
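Because the statement bodies above are easier to follow with a concrete realization, we add a sequential Python simulation of the round-robin elastic averaging scheme below. It is our own sketch of the procedure described in this appendix (tau local SGD steps per node, followed by synchronizations with the master performed one node at a time); the names grad_fn and round_robin_elastic_averaging are illustrative, and the sequential loop only preserves the order in which the synchronizations touch the master model, not the actual parallel execution.

import numpy as np

def round_robin_elastic_averaging(x_locals, x_master, grad_fn,
                                  lr=0.01, alpha=0.3, tau=1, num_rounds=10):
    """Sequential simulation of round-robin elastic averaging (illustrative).

    x_locals : list of N local parameter vectors (one per computing node)
    x_master : master parameter vector (updated in place)
    grad_fn  : function(params, node_index) -> stochastic gradient
    """
    N = len(x_locals)
    for _ in range(num_rounds):
        for i in range(N):
            for _ in range(tau):                       # local updates on node i
                x_locals[i] -= lr * grad_fn(x_locals[i], i)
        for i in range(N):                             # round-robin synchronization
            diff = x_locals[i] - x_master
            x_locals[i] -= alpha * diff
            x_master += alpha * diff
    return x_locals, x_master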
These algorithms optimize the following objective function:\begin{equation*} F(\mathbf {x}^{1},\ldots,\mathbf {x}^{N},\bar {\mathbf {x}}):= \frac {1}{N}\sum ^{N}_{i=1}\left ({\mathbb {E}_{\mathbf {s}\sim {\mathcal {D}}_{i}}[\mathcal {L}(\mathbf {x}^{i}, \mathbf {s})] + {\rho }\|\mathbf {x}^{i} - \bar {\mathbf {x}}\|^{2}}\right) \tag{4}\end{equation*}
In the convergence analysis, however, we consider the objective function without the penalty term, which is identical to Eq. (2):\begin{equation*} F(\mathbf {x}):= \frac {1}{N}\sum ^{N}_{i=1}\mathbb {E}_{\mathbf {s}\sim {\mathcal {D}}_{i}}[\mathcal {L}(\mathbf {x}, \mathbf {s})] \tag{5}\end{equation*}
In order to analyze the convergence property of the round-robin elastic averaging algorithms, we utilize Theorem 1 in [20], which is based on the following assumptions.
Assumption 1 (L-Smoothness):
We assume that each local objective function F_{i} is L-smooth; that is,\begin{equation*} \|{\nabla }F_{i}(\mathbf {x}) - {\nabla }F_{i}(\mathbf {y})\| \leq L\|\mathbf {x} - \mathbf {y}\|, \tag{6}\end{equation*}
Assumption 2 (Lower Bound):
We assume that the objective function is bounded below by F_{\mathrm {inf}}, i.e.,\begin{equation*} F(\mathbf {x}) \geq F_{\mathrm {inf}}. \tag{7}\end{equation*}
Assumption 3 (Unbiased Gradients):
We assume that the stochastic gradients are unbiased estimators of the local objective gradients, such that \begin{equation*} \mathbb {E}_{\mathbf {s}{\sim }{\mathcal {D}}_{i}}[g(\mathbf {x}, \mathbf {s})] = {\nabla }F(\mathbf {x}),\quad {g(\mathbf {x}, \mathbf {s}) = \nabla \mathcal {L}(\mathbf {x},\mathbf {s})}. \tag{8}\end{equation*}
Assumption 4 (Bounded Variance):
We assume that the variance of the stochastic gradients is bounded by constants \beta and \sigma ^{2} such that \begin{equation*} \mathbb {E}_{\mathbf {s}{\sim }{\mathcal {D}}_{i}}[\|{\nabla }F(\mathbf {x}) - g(\mathbf {x}, \mathbf {s})\|^{2}] \leq \beta \|\nabla F(\mathbf {x})\|^{2} + \sigma ^{2}. \tag{9}\end{equation*}
Assumption 5 (Mixing Matrix):
We assume that the eigenvalues of the mixing matrix \mathbf {W} satisfy \begin{equation*} \max \{|\lambda _{2}(\mathbf {W})|,\cdots,|\lambda _{N+1}(\mathbf {W})|\} < \lambda _{1}(\mathbf {W}) = 1. \tag{10}\end{equation*}
In order to utilize the proof techniques in [20] and [22], we build a matrix-form update rule. Let the matrices \mathbf {X}_{k} and \mathbf {G}_{k} be defined as\begin{align*} \mathbf {X}_{k} &= [\mathbf {x}^{1,k}, \cdots, \mathbf {x}^{N,k}, \bar {\mathbf {x}}^{k}], \tag{11}\\ \mathbf {G}_{k} &= [g(\mathbf {x}^{1,k},\mathbf {s}^{1,k}), \cdots, g(\mathbf {x}^{N,k},\mathbf {s}^{N,k}), \mathbf {0}]. \tag{12}\end{align*}
Then the update rule can be written as\begin{equation*} \mathbf {X}_{k+1} = (\mathbf {X}_{k} - \eta \cdot \mathbf {G}_{k}) \cdot \mathbf {S}_{k}{,} \tag{13}\end{equation*}
where\begin{align*} \mathbf {S}_{k} = \begin{cases} \displaystyle \mathbf {W} & k \bmod \tau = 0 \\ \displaystyle \mathbf {I} & \text {otherwise} \end{cases}. \tag{14}\end{align*}
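The matrix-form rule (13)-(14) can be made concrete with a few lines of NumPy (our illustration; the one-node example and the function name matrix_form_step are not from the paper).

import numpy as np

def matrix_form_step(X, G, k, W, eta, tau):
    """One step of the matrix-form update rule (13)-(14).

    X : d x (N+1) matrix [x^1, ..., x^N, x_bar]
    G : d x (N+1) matrix [g(x^1), ..., g(x^N), 0]
    W : (N+1) x (N+1) mixing matrix
    """
    S = W if k % tau == 0 else np.eye(W.shape[0])   # Eq. (14)
    return (X - eta * G) @ S                        # Eq. (13)

# Toy usage with one local model (N = 1), where W implements Eq. (1).
alpha = 0.3
W = np.array([[1 - alpha, alpha],
              [alpha, 1 - alpha]])
X = np.array([[1.0, 0.0]])          # d = 1: local parameter 1.0, master 0.0
G = np.array([[0.2, 0.0]])          # gradient for the local model, 0 for the master
print(matrix_form_step(X, G, k=0, W=W, eta=0.1, tau=1))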
Remark 1:
Instead of Eq. (13), one can use an alternative rule:
The mixing matrix \mathbf {W} of EA-Pipe satisfies the following three conditions:
1) \mathbf {W} is a doubly-stochastic matrix.
2) \mathbf {W} is a primitive and irreducible matrix.
3) \mathbf {W}^{\top }\mathbf {W} is a positive matrix.
Proof of Condition 1
Lemma 1:
Let \begin{align*} \mathbf {M}^{1} = \begin{pmatrix} 1-\alpha & 0 & \alpha \\ 0 & 1 & 0 \\ \alpha & 0 & 1-\alpha \end{pmatrix}, \: \mathbf {M}^{2} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1-\alpha & \alpha \\ 0 & \alpha & 1-\alpha \end{pmatrix},\end{align*}
Proof:
Since each row and each column of \mathbf {M}^{l} sums to one, i.e.,\begin{equation*} \sum _{i=1}^{N+1} \mathbf {M}^{l}_{ij} = \sum _{j=1}^{N+1} \mathbf {M}^{l}_{ij} = 1,\end{equation*} each \mathbf {M}^{l} is a doubly stochastic matrix.
Lemma 2:
Let
Proof:
Let \mathbf {C} = \mathbf {A}\mathbf {B}, where \mathbf {A} = (a_{ij}) and \mathbf {B} = (b_{ij}) are doubly stochastic matrices, and let c_{ij} denote the entries of \mathbf {C}. Then\begin{align*} \sum _{i=1}^{n} c_{ij} &= \sum _{i=1}^{n} \left({\sum _{k=1}^{n} a_{ik}b_{kj}}\right) \\ &= \sum _{k=1}^{n} \left({b_{kj} \sum _{i=1}^{n} a_{ik}}\right) \\ &= \sum _{k=1}^{n} b_{kj} \qquad \qquad \because \sum _{i=1}^{n} a_{ik} = 1 \\ &= 1\end{align*} The row sums of \mathbf {C} equal one by the same argument, so \mathbf {C} is also doubly stochastic.
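Lemmas 1 and 2 can be checked numerically for a small example. The NumPy sketch below (ours) builds M^1 and M^2 for N = 2 as given in Lemma 1, forms their product as one possible ordering of a full synchronization round, and verifies that every row and column still sums to one.

import numpy as np

alpha = 0.3
M1 = np.array([[1 - alpha, 0.0, alpha],
               [0.0,       1.0, 0.0  ],
               [alpha,     0.0, 1 - alpha]])
M2 = np.array([[1.0, 0.0,       0.0  ],
               [0.0, 1 - alpha, alpha],
               [0.0, alpha,     1 - alpha]])
W = M2 @ M1          # one full round of round-robin synchronization (N = 2)

for name, mat in [("M1", M1), ("M2", M2), ("W", W)]:
    rows_ok = np.allclose(mat.sum(axis=1), 1.0)
    cols_ok = np.allclose(mat.sum(axis=0), 1.0)
    print(name, "doubly stochastic:", rows_ok and cols_ok)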
Proof of Condition 2
According to Lemma 3, we can see that
Lemma 3:
The mixing matrix \mathbf {W} of EA-Pipe can be written in the form\begin{align*} \mathbf {W} = \left ({\begin{array}{cccc} & & & w_{1(N+1)} \\ & \mathbf {A} & & \vdots \\ & & & w_{N(N+1)} \\ w_{(N+1)1} & \cdots & w_{(N+1)N} & w_{(N+1)(N+1)} \end{array} }\right), \tag{15}\end{align*}
Since all the local models are synchronized with the master model, the elements in the last row and the last column of
Proof of Condition 3
According to Eq. (15), the entries in the last row and the last column of
Proof of Theorem 1
Now that we have confirmed that the mixing matrix \mathbf {W} satisfies the three conditions above, we revisit the update rule\begin{align*} \mathbf {X}_{k+1} &= (\mathbf {X}_{k} - \eta \cdot \mathbf {G}_{k}) \cdot \mathbf {S}_{k} \\ \mathbf {S}_{k} &= \begin{cases} \displaystyle \mathbf {W} & k \bmod \tau = 0 \\ \displaystyle \mathbf {I} & \text {otherwise} \end{cases} \end{align*}
Multiplying both sides by the vector \mathbf {v} = \frac {1}{N+1}\mathbf {1} and using \mathbf {S}_{k}\mathbf {v} = \mathbf {v}, we obtain\begin{align*} \mathbf {X}_{k+1}\mathbf {v} &= \mathbf {X}_{k}\mathbf {v} - \eta \cdot \mathbf {G}_{k}\mathbf {v}\tag{16}\\ &= \mathbf {X}_{k}\mathbf {v} - \frac {\eta }{N+1}{\sum _{i=1}^{N} g(\mathbf {x}^{i,k},\mathbf {s}^{i,k})} {.} \tag{17}\end{align*}
To simplify the equation, we define an averaged variable \mathbf {y}_{k} := \mathbf {X}_{k}\mathbf {v} and an effective learning rate \eta _{\text {eff}} := \frac {N}{N+1}\eta, so that\begin{equation*} \mathbf {y}_{k+1} = \mathbf {y}_{k} - \frac {\eta _{\text {eff}}}{N}\sum _{i=1}^{N} g(\mathbf {x}^{i,k}, \mathbf {s}^{i,k}){.} \tag{18}\end{equation*}
By utilizing the intermediate result from the proof of Lemma 2 in [20] (specifically, Eq. (61) in [20]), we get the following equation (when \begin{align*} &\hspace {-.5pc}\frac {1}{K}\sum _{k=0}^{K-1}\mathbb {E}[\|\nabla F(\mathbf {y}_{k})\|^{2}] \\ &\leq \frac {2[F(\mathbf {y}_{0}) - F_{\text {inf}}]}{\eta _{\text {eff}}K} +\frac {\eta _{\text {eff}}L\sigma ^{2}}{N} \\ &\quad + \frac {L^{2}}{KN}\sum _{k=0}^{K-1}\sum _{i=1}^{N}\mathbb {E}[\|\mathbf {y}_{k} - \mathbf {x}^{i,k}\|^{2}] \\ &\quad - \left [{1 - \eta _{\text {eff}}{L}\left ({\frac {\beta }{N} + 1}\right)}\right]\frac {1}{KN}\sum _{k=0}^{K-1}\sum _{i=1}^{N}\mathbb {E}[\|\nabla F(\mathbf {x}^{i,k})\|^{2}]{.} \\{}\tag{19}\end{align*}
\begin{align*} \sum _{i=1}^{N} \|\mathbf {y}_{k} - \mathbf {x}^{i,k}\|^{2} &\leq \sum _{i=1}^{N} \|\mathbf {y}_{k} - \mathbf {x}^{i,k}\|^{2} + \|\mathbf {y}_{k} - \bar {\mathbf {x}}^{k}\|^{2} \tag{20}\\ &=\|\mathbf {X}_{k}(\mathbf {I} - \mathbf {v}\mathbf {1}^{\top})\|^{2}_{\text {F}}{,} \tag{21}\end{align*}
According to the update rule (13) and repeatedly using the fact \begin{align*} \mathbf {X}_{k}(\mathbf {I}-\mathbf {v}\mathbf {1}^{\top}) &= (\mathbf {X}_{k-1}-\eta \mathbf {G}_{k-1})\mathbf {S}_{k-1}(\mathbf {I}-\mathbf {v}\mathbf {1}^{\top}) \tag{22}\\ &= -\eta \sum _{j=0}^{k-1}\mathbf {G}_{j}\left ({\prod _{s=j}^{k-1}\mathbf {S}_{s} - \mathbf {v}\mathbf {1}^{\top} }\right){.} \tag{23}\end{align*}
\begin{equation*} \sum _{i=1}^{N} \|\mathbf {y}_{k} - \mathbf {x}^{i,k}\|^{2} \leq \eta ^{2}\left \lVert{ \sum _{j=0}^{k-1}\mathbf {G}_{j}\left ({\prod _{s=j}^{k-1}\mathbf {S}_{s} - \mathbf {v}\mathbf {1}^{\top} }\right)}\right \rVert _{\text {F}}^{2}{,} \tag{24}\end{equation*}
\begin{equation*} \prod _{k}\mathbf {S}_{k} = \prod _{k}\mathbf {W}_{k}. \tag{25}\end{equation*}
In order to keep utilizing the proof sequence of [20], we need to ensure that
Lemma 4:
Let \mathbf {J} = \frac {1}{N+1}\mathbf {1}\mathbf {1}^{\top }. Then\begin{equation*} \|\mathbf {W} - \mathbf {J}\|_{\mathrm {op}} = \zeta < 1 \tag{26}\end{equation*}
Proof:
The operator norm of \mathbf {W} - \mathbf {J} is the square root of the largest eigenvalue of (\mathbf {W} - \mathbf {J})^{\top }(\mathbf {W} - \mathbf {J}). We have\begin{align*} (\mathbf {W}-\mathbf {J})^{\top }(\mathbf {W}-\mathbf {J}) &= (\mathbf {W}^{\top }-\mathbf {J}^{\top })(\mathbf {W}-\mathbf {J}) \\ &= \mathbf {W}^{\top }\mathbf {W} - \mathbf {J}^{\top }\mathbf {W} - \mathbf {W}^{\top }\mathbf {J} + \mathbf {J}^{\top }\mathbf {J} \\ &= \mathbf {W}^{\top }\mathbf {W} - \mathbf {J} -\mathbf {J} + \mathbf {J} \\ &= \mathbf {W}^{\top }\mathbf {W} - \mathbf {J}{.} \tag{27}\end{align*}
Both \mathbf {W} and \mathbf {W}^{\top } are doubly stochastic matrices.
\mathbf {W}^{\top }\mathbf {W} is a symmetric real matrix, which is diagonalizable [30]. Also, according to Lemma 3, it is a positive matrix.
According to Lemma 2, \mathbf {W}^{\top }\mathbf {W} is a doubly stochastic matrix.
Let \mathbf {W}^{\top }\mathbf {W} = \mathbf {Z}; then \mathbf {ZJ} = \mathbf {JZ}.
Since all elements of \mathbf {W} are non-negative, all elements of \mathbf {Z} are also non-negative.
If all of \lambda _{2}(\mathbf {Z}),\cdots,\lambda _{n}(\mathbf {Z}) are negative, the maximum eigenvalue of \mathbf {Z}-\mathbf {J} is 0.
If one of \lambda _{2}(\mathbf {Z}),\cdots,\lambda _{n}(\mathbf {Z}) is positive, the maximum eigenvalue of \mathbf {Z}-\mathbf {J} is positive. Furthermore, according to the Perron-Frobenius theorem [31], [32], since \mathbf {Z} is a positive primitive stochastic matrix, there is only one eigenvalue equal to 1, and all other eigenvalues are smaller than 1.
Lemma 5:
Let n be a positive integer. Then\begin{equation*} \|\mathbf {W}^{n}-\mathbf {J}\|_{\mathrm {op}} \leq {\zeta }^{n}, \quad \mathit {where}\quad \|\mathbf {W} - \mathbf {J}\|_{\mathrm {op}} = \zeta \tag{30}\end{equation*}
Proof:
We can prove it by induction on n.
Base case (n = 1): the statement holds with equality, since \|\mathbf {W} - \mathbf {J}\|_{\mathrm {op}} = \zeta.
Inductive hypothesis: Assume that the inequality holds for n = k.
Inductive step: For n = k + 1, we have\begin{align*} \|\mathbf {W}^{k+1} - \mathbf {J}\|_{\mathrm {op}} &= \|(\mathbf {W}-\mathbf {J})(\mathbf {W}^{k}-\mathbf {J})\|_{\mathrm {op}} \\ &\leq {\|\mathbf {W}-\mathbf {J}\|_{\mathrm {op}}} {\|\mathbf {W}^{k} - \mathbf {J}\|_{\mathrm {op}}} \\ &\leq {\zeta }\cdot {\zeta ^{k}} \because \text {Inductive hypothesis} \\ &= {\zeta ^{k+1}}{.}\end{align*}
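Lemmas 4 and 5 can likewise be checked numerically for the same small example; the sketch below (ours) computes zeta = ||W - J||_op and verifies that ||W^n - J||_op <= zeta^n for a few powers n.

import numpy as np

alpha = 0.3
M1 = np.array([[1 - alpha, 0.0, alpha],
               [0.0,       1.0, 0.0  ],
               [alpha,     0.0, 1 - alpha]])
M2 = np.array([[1.0, 0.0,       0.0  ],
               [0.0, 1 - alpha, alpha],
               [0.0, alpha,     1 - alpha]])
W = M2 @ M1
J = np.full_like(W, 1.0 / W.shape[0])        # J = (1/(N+1)) * ones

zeta = np.linalg.norm(W - J, ord=2)          # operator (spectral) norm
print("zeta =", zeta, "< 1:", zeta < 1)      # Lemma 4

for n in range(1, 6):                        # Lemma 5
    lhs = np.linalg.norm(np.linalg.matrix_power(W, n) - J, ord=2)
    print(n, lhs <= zeta ** n + 1e-12)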
By Lemma 5, we do not need Assumption 5 in [20], and we can apply the same procedure of Appendix D.2 in [20] to get the following final result:\begin{align*} &\hspace {-1pc}\frac {1}{K}\sum _{k=0}^{K-1}\mathbb {E}[\|\nabla F(\mathbf {y}_{k})\|^{2}] \\ &\leq \frac {2[F(\mathbf {y}_{0}) - F_{\text {inf}}]}{\eta _{\text {eff}}K} + \frac {\eta _{\text {eff}}L\sigma ^{2}}{N} \\ &\quad + \eta _{\text {eff}}^{2}{L^{2}}\sigma ^{2}\left ({\frac {1 + \zeta ^{2}}{1-\zeta ^{2}}\tau - 1}\right)\left ({1+\frac {1}{N}}\right)^{2}. \tag{31}\end{align*}
Appendix B: Best Case Results of the Small-Scale Experiment
Table 8 shows the best-case results of the four training methods on the CIFAR-10 dataset using the VGG-16 model with varying numbers of GPUs. We observed trends similar to those in Table 1.
Appendix C: Best Case Results of the Mid-Scale Experiment
Table 9 shows the best-case results of the four training methods using the CIFAR-100 dataset and the ResNet-34 model on 8 GPUs. We observed trends similar to those in Table 4.