
Pipeline Parallelism With Elastic Averaging



Abstract:

To accelerate the training speed of massive DNN models on large-scale datasets, distributed training techniques, including data parallelism and model parallelism, have been extensively studied. In particular, pipeline parallelism, which is derived from model parallelism, has been attracting attention. It splits the model parameters across multiple computing nodes and executes multiple mini-batches simultaneously. However, naive pipeline parallelism suffers from the issues of weight inconsistency and delayed gradients, as the model parameters used in the forward and backward passes do not match, causing unstable training and low performance. In this study, we propose a novel pipeline parallelism technique called EA-Pipe to address the weight inconsistency and delayed gradient problems. EA-Pipe applies an elastic averaging method, which has been studied in the context of data parallelism, to pipeline parallelism. The proposed method maintains multiple model replicas to solve the weight inconsistency problem, and synchronizes the model replicas using an elasticity-based moving average method to mitigate the delayed gradient problem. To verify the efficacy of the proposed method, we conducted three image classification experiments on the CIFAR-10/100 and ImageNet datasets. The experimental results show that EA-Pipe not only accelerates training speed but also demonstrates a more stable learning property compared to existing pipeline parallelism techniques. In particular, in the experiments using the CIFAR-100 and ImageNet datasets, EA-Pipe recorded error rates that were 2.58% and 2.19% lower, respectively, than the baseline pipeline parallelization method.
Published in: IEEE Access ( Volume: 12)
Page(s): 5477 - 5489
Date of Publication: 05 January 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

In recent years, the increasing availability of advanced hardware support and large-scale datasets has led to the widespread adoption of massive deep neural network (DNN) models as the primary machine learning approach. However, training massive DNN models on large-scale datasets takes a considerable amount of time, leading to an increasing demand for efficient model training. To efficiently train massive DNN models, for example, VGG [16] and ResNet [31] in image processing tasks, or large language models [3], [4], distributed training has been extensively studied [5], [6], [7]. In general, distributed training is divided into two types: data parallelism and model parallelism.

Data parallelism partitions and distributes the training datasets across multiple computing nodes. In the most common data parallelism architecture, each computing node has a local copy of the master DNN model, and the synchronization between the local models and the master model takes place in the parameter server where the master model resides. Depending on the synchronization method, data parallelism is divided into two types: synchronous data parallelism (SDP) [8], [9] and asynchronous data parallelism (ADP) [10], [11]. In SDP, the parameter server waits for all computing nodes to complete their work and then synchronizes the master model with the local models in the computing nodes. Because all computing nodes wait for the synchronization to be completed, the training time of SDP is determined by the slowest computing node. In ADP, however, the parameter server does not wait for all computing nodes to finish their work; it synchronizes with the computing node that completes its computation first. Therefore, ADP can accelerate training more than SDP. However, since the parameter server accepts gradients from the computing nodes in the order of their arrival and applies them to the master model, late-arriving gradients cannot be applied to the exact version of the master model parameters that was used to compute them. Instead, they are applied to master model parameters that have already been updated by earlier-arriving gradients. These late-arriving gradients are called delayed gradients (or stale gradients), and they are the major cause of the negative effect on training performance in ADP [12].

Model parallelism partitions the model parameters and distributes them across multiple computing nodes [11]. In naive model parallelism, hardware utilization is low since only one computing node may be active at a time [13]. To mitigate this problem, pipeline parallelism has been proposed [13], [14], [15]. Pipeline parallelism can process multiple mini-batches simultaneously in all computing nodes, resulting in a significant acceleration of computation. Though pipeline parallelism can train DNN models efficiently in terms of training time, it suffers from two problems: the weight inconsistency and delayed gradient problems. In naive pipeline parallelism, the model parameters used in the forward pass and the backward pass are different, which is called the weight inconsistency problem. In addition, since the model parameters used during the forward pass are not preserved until the gradients are applied, the delayed gradient problem arises, as in ADP. These problems deteriorate model convergence, causing longer training times and higher error rates.

Fig. 1(a) illustrates an example of a pipeline parallelism scheduling timeline [15] exhibiting the weight inconsistency and delayed gradient problems. Let $\mathbf{x}_{2}$ denote the model parameters used in the forward pass of the 2nd mini-batch in the 2nd computing node (denoted as $F_{2}^{2}$). $\mathbf{x}_{2}$ is updated first in the backward pass of the 1st mini-batch in the 2nd computing node (denoted as $B_{2}^{1}$). Let $\mathbf{x}_{2}^{\prime}$ denote the updated model parameters after $B_{2}^{1}$. Then, $\mathbf{x}_{2}^{\prime}$ is updated to $\mathbf{x}_{2}^{\prime\prime}$ in the backward pass of the 2nd mini-batch in the 2nd computing node (denoted as $B_{2}^{2}$). Since the model parameters used during $B_{2}^{2}$, that is, $\mathbf{x}_{2}^{\prime}$, differ from those used in $F_{2}^{2}$, that is, $\mathbf{x}_{2}$, the weight inconsistency problem occurs. Furthermore, the model parameters used for gradient computation are $\mathbf{x}_{2}$, whereas the parameters to which the gradients are applied are $\mathbf{x}_{2}^{\prime}$. This causes the delayed gradient problem because the model parameters $\mathbf{x}_{2}$, which are used during the forward pass $F_{2}^{2}$, are not preserved until the gradients are applied.

FIGURE 1. (a) An example of a pipeline parallelism scheduling timeline [15] having the weight inconsistency and delayed gradient problems. (b) An example of a pipeline parallelism scheduling timeline [15] utilizing the weight stashing and vertical sync methods. $F_{n}^{t_{\text{f}}}$ and $B_{n}^{t_{\text{b}}}$ represent the forward and backward passes, respectively, at computing node $n$ for mini-batch indices $t_{\text{f}}$ and $t_{\text{b}}$, respectively. In (a), $\mathbf{x}_{n}$ represents the partitioned model parameters at computing node $n$. In (b), $\mathbf{x}_{n}^{i}$ and $\overline{\mathbf{x}}_{n}$ represent the $i$-th replica of the partitioned local model parameters and the partitioned master model parameters, respectively, at computing node $n$. The index $i$ of the local model used in $F_{n}^{t_{\text{f}}}$ and $B_{n}^{t_{\text{b}}}$ is given by $i = \text{mod}(t_{\text{f}} - 1, N) + 1$ and $i = \text{mod}(t_{\text{b}} - 1, N) + 1$, respectively, where $N$ is the number of computing nodes. The blue arrows represent the value propagated through the forward pass, while the red arrows represent the error propagated through the backward pass. $g(\mathbf{x})$ represents the gradient computed using model parameters $\mathbf{x}$. Though the weight inconsistency problem is resolved by using the weight stashing and vertical sync methods, the delayed gradient problem still remains in updating the master model.

Existing studies on pipeline parallelism have proposed various methods to address these problems by synchronously updating model parameters [13], [16], predicting correct model parameters [17], [18], or generating multiple model replicas [15]. However, these methods have drawbacks, such as becoming less accurate as the number of computing nodes increases or underutilizing the hardware. For example, generating multiple model replicas is an approach to mitigate the weight inconsistency problem, but the delayed gradient problem may still arise depending on the synchronization method (e.g., applying ADP to synchronize multiple model replicas, as in [15], may result in the delayed gradient problem, which will be explained further in Section II).

In this paper, we propose a novel pipeline parallelism method, called EA-Pipe, that aims to address the delayed gradient problem when generating multiple model replicas to mitigate the weight inconsistency problem. To this end, we view the delayed gradient problem from a data parallelism perspective and apply an optimization method previously studied in the area of data parallelism. In particular, EA-Pipe utilizes a moving-average-based elastic force method called elastic averaging [19]. The elastic averaging algorithm showed performance comparable to prior ADP synchronization methods, while alleviating the delayed gradient problem and achieving a more stable learning property. Therefore, similar effects can be expected when applying the algorithm to pipeline parallelism. The contributions of this paper are as follows:

  • Unlike existing methods, we approach the problems in pipeline parallelism through a data parallelism perspective. Thereby, we can utilize the advantages of the existing ADP optimization methods such as the elastic averaging algorithm for pipeline parallelism.

  • We propose a novel round-robin parallel training algorithm that combines pipeline parallelism and asynchronous data parallelism to solve the weight inconsistency and delayed gradient problems.

  • We analyze the convergence property of the proposed method and show that its error bound is the same as that of the synchronous elastic averaging case. To the best of the authors' knowledge, this is the first theoretical convergence analysis of round-robin asynchronous data parallelism realized through elastic averaging pipeline parallelism.

  • The experimental results indicate that EA-Pipe not only enhances training speed but also trains DNN models as stably as SGD. In particular, in the experiments using the CIFAR-100 and ImageNet datasets, EA-Pipe demonstrated error rates that were 2.58% and 2.19% lower, respectively, compared to the baseline pipeline parallelization method, PipeDream.

The rest of the paper is organized as follows. In Section II, we review related work concerning EA-Pipe. In Section III, the work scheduling, algorithm, and convergence property of EA-Pipe are explained. In Section IV, we verify the proposed method through three image classification experiments using three different models and datasets. The conclusion of this study and suggestions for future research are discussed in Section V, followed by the Appendix, which includes the detailed proof of the convergence analysis.

SECTION II.

Related Work

A. PipeDream

PipeDream [15] follows a scheduling strategy in which each computing node executes the forward and backward passes alternately for different mini-batches, ensuring high hardware utilization. Fig. 1(a) shows the pipeline scheduling of PipeDream. However, this pipeline scheduling leads to the weight inconsistency and delayed gradient problems, since the versions of the model parameters used in the forward and backward passes are inconsistent. To mitigate these problems, PipeDream proposed the weight stashing and vertical sync methods. The weight stashing method generates as many model replicas as the number of active mini-batches for each computing node, so that each mini-batch uses the same model parameters in both the forward and backward passes. The vertical sync method generates as many model replicas as the maximum number of active mini-batches in the pipeline, ensuring that time-synchronized weight versions are available throughout all computing nodes. The maximum number of active mini-batches is always equal to the number of computing nodes. Fig. 1(b) shows PipeDream utilizing the weight stashing and vertical sync methods.
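As a concrete illustration of the bookkeeping implied by weight stashing and vertical sync, the following sketch (Python used only for exposition, with hypothetical helper names) computes which of the $N$ stashed replicas a given mini-batch maps to, following the index formula quoted in the caption of Fig. 1(b).

```python
def replica_index(t: int, num_nodes: int) -> int:
    """Replica index i = mod(t - 1, N) + 1 for mini-batch t (1-based),
    as given in the caption of Fig. 1(b)."""
    return (t - 1) % num_nodes + 1

if __name__ == "__main__":
    N = 4  # number of computing nodes = number of stashed replicas per node
    # Mini-batches 1..8 cycle through replicas 1..4 twice, so the forward and
    # backward passes of the same mini-batch always use the same replica.
    print([replica_index(t, N) for t in range(1, 9)])  # [1, 2, 3, 4, 1, 2, 3, 4]
```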

With these two methods, PipeDream can solve the weight inconsistency problem effectively. However, during the backward pass, the computed gradients are applied to model parameters that have already been updated by earlier gradients, so the delayed gradient problem still remains. For example, $F_{2}^{2}$ uses the local model parameters $\mathbf{x}_{2}^{2}$, which are initialized to $\bar{\mathbf{x}}_{2}$. However, $\bar{\mathbf{x}}_{2}$ is updated at $B_{2}^{1}$. Therefore, the gradients calculated at $B_{2}^{2}$ are applied to master model parameters that have already been updated at $B_{2}^{1}$, and the delayed gradient problem remains unresolved.

EA-Pipe addresses the delayed gradient problem that persists even after the weight inconsistency problem is resolved by generating multiple model replicas, as in PipeDream. To tackle this problem, EA-Pipe approaches the delayed gradient problem from the perspective of data parallelism and applies the elastic averaging algorithm previously studied in that context.

B. Elastic Averaging Algorithm

The elastic averaging algorithm [19] was proposed to reduce the communication cost in data parallelism by enabling each computing node to conduct more local training computations and explore more extensively away from the master model before synchronization. Each local model is maintained as if it were connected to the master model by an elastic force, which constrains the distance between the local and master model parameters. The stronger the elastic force, the more the distance between the local and master model parameters is constrained. During training, each local model fluctuates around the master model; therefore, the risk of the master model falling into local optima may be reduced. The synchronization equations of the elastic averaging algorithm in data parallelism are as follows:
\begin{align*} \mathbf{x} &\leftarrow \mathbf{x}_{n} \\ \mathbf{x}_{n} &\leftarrow \mathbf{x}_{n} - \alpha (\mathbf{x} - \bar{\mathbf{x}}) \\ \bar{\mathbf{x}} &\leftarrow \bar{\mathbf{x}} + \alpha (\mathbf{x} - \bar{\mathbf{x}}), \tag{1}\end{align*}
where $\mathbf{x}_{n}$ and $\bar{\mathbf{x}}$ denote the $n$-th local and master model parameters, respectively, and $\alpha$ indicates the strength of the elastic force between the local and master models.
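A minimal sketch of the synchronization step in Eq. (1), assuming NumPy arrays for the parameters; the variable names (local, master, alpha) are ours, not taken from the paper's implementation.

```python
import numpy as np

def elastic_average_sync(local: np.ndarray, master: np.ndarray, alpha: float):
    """One elastic averaging synchronization step, following Eq. (1).

    The local and master parameters pull each other along their current
    difference, scaled by the elastic force alpha. Both arrays are updated
    in place and returned for convenience.
    """
    diff = local - master   # x - x_bar, using a snapshot of the local model
    local -= alpha * diff   # x_n   <- x_n   - alpha * (x - x_bar)
    master += alpha * diff  # x_bar <- x_bar + alpha * (x - x_bar)
    return local, master
```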

EA-Pipe incorporates the elastic averaging algorithm into pipeline parallelism. As a result, the synchronization of model parameters occurs partially (per partitioned model), unlike the elastic averaging algorithm in data parallelism. Further details on parameter synchronization in EA-Pipe will be discussed in Section III.

SECTION III.

Proposed Method

In this section, we propose a novel pipeline parallelism method, called EA-Pipe, which mitigates the weight inconsistency and delayed gradient problems at the same time. This section is divided into four subsections. First, we introduce the pipeline scheduling of EA-Pipe. Second, we present the algorithm of EA-Pipe in pseudo-code. Third, we analyze the convergence property of EA-Pipe. Lastly, we compare EA-Pipe with a recent work that uses a similar approach.

A. Pipeline Scheduling of EA-Pipe

EA-Pipe follows the same structure as PipeDream in Fig. 1(b). However, unlike PipeDream, the local models in EA-Pipe update their parameters locally, without synchronizing with the master model parameters, until a specific synchronization period is reached. Fig. 2 shows an example workflow of EA-Pipe with two computing nodes when the synchronization period is set to 2. In Fig. 2(a), the local model parameters are updated locally without synchronization with the master model parameters. For example, at the backward pass $B_{2}^{1}$, the computed gradients cause the local model parameters to be updated from $\mathbf{x}^{1}_{2}$ to ${\mathbf{x}^{1}_{2}}^{\prime}$. However, once the synchronization period is reached, the backward pass (e.g., $B_{2}^{3}$) not only updates the local model parameters but also synchronizes them with the master model parameters based on Eq. (1).

FIGURE 2. An example workflow of EA-Pipe with two computing nodes when the synchronization period is set to 2. In (a), the highlighted boxes represent the backward passes and synchronization with the master model. In (b), ${\langle}F^{t_{\text{f}}},B_{2}^{t_{\text{b}}}{\rangle}={\langle}F_{1}^{t_{\text{f}}}, F_{2}^{t_{\text{f}}}, B_{2}^{t_{\text{b}}}{\rangle}$.

Fig. 2(b) shows how the parameters are updated in the second computing node by EA-Pipe in a virtual parameter space. For example, the second computing node updates the local model parameters $\mathbf{x}_{2}^{2}$ during $B_{2}^{2}$. As the synchronization period of 2 has not been reached yet, synchronization with the master model does not occur. Then, the second computing node updates $\mathbf{x}_{2}^{2}$ again during $B_{2}^{4}$. At this point, as the synchronization period of 2 has been reached, the second computing node synchronizes $\mathbf{x}_{2}^{2}$ with the master model parameters $\bar{\mathbf{x}}_{2}$. The highlighted arrow lines indicate the backward passes for each local model when it reaches the synchronization period, while the dashed arrow lines represent the synchronization process in which the local model and the master model pull each other.

B. Algorithm

A pseudo-code of EA-Pipe is shown in Algorithm 1, which is executed by all computing nodes simultaneously. $\tau$ and $T$ indicate the synchronization period and the total number of mini-batches, respectively. $N$ represents the total number of computing nodes and hence the number of partitioned master models. The size of each partitioned model is therefore $1/N$ of the original model size, if the model is divided evenly. The term initial phase refers to the beginning of the pipeline scheduling timeline, when only forward passes are executed to fill the pipeline. Similarly, the term final phase refers to the end of the pipeline scheduling timeline, when only backward passes are left to be executed to flush the pipeline. The forward pass $F_{n}^{t_{\text{f}}}$ is executed followed by the backward pass $B_{n}^{t_{\text{b}}}$ at each iteration. The model parameters $\bar{\mathbf{x}}_{n}$ and $\mathbf{x}_{n}^{i}$ are synchronized at the end of the backward pass whenever $\lceil t_{\text{b}}/N \rceil$ is divisible by $\tau$. $g(\mathbf{x})$ represents the gradient computed using model parameters $\mathbf{x}$.

Algorithm 1 EA-Pipe: Executed by Computing Node $n$ in Parallel With All Other Computing Nodes
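Since the listing above is rendered as a figure, the following is a simplified single-node sketch of the control flow described in the text, with hypothetical forward/backward stubs; it is not the authors' CUDA/C++ implementation, and it ignores the initial and final pipeline phases for brevity.

```python
import math

def ea_pipe_node(n, N, tau, T, replicas, master, forward, backward, eta, alpha):
    """Sketch of the EA-Pipe loop on computing node n (1-based).

    replicas: list of N local parameter vectors x_n^1..x_n^N (NumPy arrays)
    master:   partitioned master parameters x_bar_n (NumPy array)
    forward/backward: user-supplied stubs returning activations and gradients
    """
    for t in range(1, T + 1):
        i = (t - 1) % N                        # 0-based form of mod(t - 1, N) + 1
        acts = forward(replicas[i], t)         # F_n^t on replica i
        grad = backward(replicas[i], acts, t)  # B_n^t on the same replica
        replicas[i] -= eta * grad              # local SGD update
        if math.ceil(t / N) % tau == 0:        # synchronization period reached
            diff = replicas[i] - master
            replicas[i] -= alpha * diff        # Eq. (1): local pulled toward master
            master += alpha * diff             # Eq. (1): master pulled toward local
```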

C. Convergence Analysis

In this section, we discuss the convergence property of EA-Pipe. Previous studies provide only limited analyses of the elastic averaging algorithm. For example, the analysis in [19] covered only one local update for quadratic objective functions, and the convergence analysis presented in [20] applies only to a synchronous elastic averaging algorithm. Therefore, the convergence analysis of an asynchronous elastic averaging algorithm remains an open question. In EA-Pipe, the local models are synchronized with the master model sequentially, which can be considered as round-robin data parallelism [21]. As shown in Fig. 2(b), the synchronization order between the master model $\bar{\mathbf{x}}$ and the set of local models $\{\mathbf{x}^{1},\mathbf{x}^{2}\}$ is always round-robin (i.e., $\mathbf{x}^{1}{\rightarrow}\mathbf{x}^{2}{\rightarrow}\mathbf{x}^{1}{\rightarrow}\cdots$). Based on this observation, we discuss the convergence property of EA-Pipe from the perspective of an asynchronous elastic averaging algorithm in round-robin data parallelism.

The objective function $F(\mathbf{x})$ that we are interested in is defined as follows:
\begin{equation*} F(\mathbf{x}) := \frac{1}{N}\sum^{N}_{i=1}\mathbb{E}_{\mathbf{s}\sim{\mathcal{D}}_{i}}[\mathcal{L}(\mathbf{x}, \mathbf{s})], \tag{2}\end{equation*}
which is commonly used for convergence analysis in synchronous or asynchronous data parallelism [20], [22], [23]. Here $\mathbf{x}{\in}\mathbb{R}^{d}$, $N$, and ${\mathcal{D}}_{i}$ denote the model parameters, the number of local models, and the local data distribution for the $i$-th local model, respectively. $\mathcal{L}$ denotes the loss function.

Based on the analyses in [20] and [22], which prove the convergence rate of distributed SGD algorithms with local updates on non-convex objectives, we can derive the following theorem, which guarantees that EA-Pipe converges to stationary points of non-convex objective functions with the same error bound as a synchronous elastic averaging algorithm.

Theorem 1:

Let $L$, $\eta$, $\tau$, $\sigma^{2}$, $\zeta$, and $K$ be the Lipschitz constant, learning rate, synchronization period, bound of the gradient variance, magnitude of the second largest eigenvalue (see Appendix A), and the total number of iterations, respectively. Then, the convergence rate of EA-Pipe is as follows:
\begin{align*} \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|{\nabla}F(\mathbf{y}_{k})\|^{2}] &\leq \frac{2(F(\mathbf{y}_{0})-F_{\mathrm{inf}})}{\eta_{\mathrm{eff}}K} +\frac{\eta_{\mathrm{eff}}L\sigma^{2}}{N} \\ &\quad + \eta^{2}_{\mathrm{eff}}L^{2}\sigma^{2}\left({\frac{1+\zeta^{2}}{1-\zeta^{2}}\tau - 1}\right)\left({1+\frac{1}{N}}\right)^{2}, \tag{3}\end{align*}
where $\mathbf{y}_{k}$, $F_{\mathrm{inf}}$, and $\eta_{\mathrm{eff}}$ denote the average of the local and master model parameters at the $k$-th time step, the lower bound of the objective function, and the effective learning rate $\frac{N}{N+1}\eta$, respectively.

Theorem 1 demonstrates that if the learning rate $\eta$ is chosen properly and the total number of iterations $K$ is large enough, the error bound of the EA-Pipe algorithm is equal to that of the synchronous elastic averaging algorithm in [20]. The detailed proof of the theorem is provided in Appendix A.
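To make the dependence of the bound in Eq. (3) on $\tau$, $\zeta$, and $N$ concrete, the following sketch plugs illustrative (made-up) constants into the right-hand side; it is only a numerical reading of the theorem, not an empirical result.

```python
def ea_pipe_bound(F0, F_inf, L, eta, sigma2, zeta, tau, N, K):
    """Right-hand side of Eq. (3) for given constants (all values illustrative)."""
    eta_eff = N / (N + 1) * eta
    term1 = 2 * (F0 - F_inf) / (eta_eff * K)
    term2 = eta_eff * L * sigma2 / N
    term3 = (eta_eff ** 2) * (L ** 2) * sigma2 \
        * ((1 + zeta ** 2) / (1 - zeta ** 2) * tau - 1) * (1 + 1 / N) ** 2
    return term1 + term2 + term3

# The last term grows with tau and as zeta approaches 1, matching the intuition
# that infrequent synchronization and a poorly mixing W slow convergence.
for tau in (1, 2, 8):
    print(tau, ea_pipe_bound(F0=1.0, F_inf=0.0, L=1.0, eta=0.1,
                             sigma2=1.0, zeta=0.5, tau=tau, N=8, K=10_000))
```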

D. Comparison to Similar Works

Recently, AvgPipe was proposed to enhance the throughput of pipeline parallelism by incorporating the elastic averaging algorithm into pipeline parallelism [24]. Although AvgPipe and EA-Pipe share similarities, we highlight some distinctions between their approach and ours. First, while AvgPipe is designed to operate at the micro-batch level, which imposes a constraint on the mini-batch size, EA-Pipe is devised to operate at the mini-batch level without any size constraint. Second, while AvgPipe introduced the elastic averaging algorithm to increase the throughput of pipeline parallelism, our approach integrates the elastic averaging algorithm to address the delayed gradient problem. Last but not least, in contrast to [24], we conduct an analysis of the convergence property of pipeline parallelism with the elastic averaging algorithm.

SECTION IV.

Experiments

In this section, we explain the three types of experiments conducted to verify the performance of the proposed method: small-scale, mid-scale, and large-scale experiments. The small-scale experiment evaluates EA-Pipe on an image classification task using the CIFAR-10 dataset with VGG-16 [16] as the DNN model. The CIFAR-10 dataset has 60,000 $32\times32$ color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. To prevent overfitting during model training, we employed an image augmentation technique. VGG-16 is a convolutional neural network architecture with 16 layers, including 13 convolutional layers and 3 fully connected layers.

We compared four training methods: the conventional sequential SGD (denoted as SGD in the Tables and Figures), PipeDream [15], SpecTrain [17], and the proposed EA-Pipe. Since SGD with momentum is commonly employed in computer vision tasks, we chose SGD as the baseline of the sequential training method [25]. We included SpecTrain, which addresses the weight inconsistency and delayed gradient problems through model parameter prediction. All methods were implemented in CUDA/C++. The performance of each training method was measured on a server with an Intel Xeon Silver 4110 CPU and eight NVIDIA GeForce RTX 2080 Ti GPUs.

The performance of the models trained with each algorithm was measured by the ratio of misclassified images in the test data. We ran three repetitions of each experiment, each with different random initial weights, and report the average performance with a 95% confidence interval for each method. The best-case results of the three repetitions can be found in Appendices B, C, and D. The same evaluation protocol was maintained throughout the subsequent experiments.
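For reference, a 95% confidence interval over three repetitions can be computed as below; this is a generic sketch of the protocol described above, and the error-rate values are placeholders, not the paper's results.

```python
import statistics

def mean_and_ci95(values):
    """Mean and 95% confidence half-width for a small sample,
    using the Student t critical value (4.303 for 3 runs, i.e. 2 degrees of freedom)."""
    t_crit = 4.303  # t_{0.975, df=2}
    m = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / len(values) ** 0.5
    return m, half_width

# Placeholder error rates (%) from three runs with different random seeds.
print(mean_and_ci95([7.41, 7.29, 7.38]))
```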

In the second experiment (the mid-scale experiment), we scaled up from the first experiment, assessing EA-Pipe on an image classification task using the CIFAR-100 dataset with ResNet-34 [31] as the DNN model. The CIFAR-100 dataset closely resembles CIFAR-10, except that it has 100 classes containing 600 images each. ResNet-34 is a variant of the residual network architecture with 34 layers, utilizing skip connections to address training challenges in deep neural networks. We compared the same four training methods as in the previous experiment.

The third experiment compares EA-Pipe and PipeDream to further verify the effect of the delayed gradients on a larger scale of parallelism. We conducted the experiment using the ResNet-50 model and the ImageNet dataset, simulating PipeDream and EA-Pipe on a virtual 50-GPU environment. The ImageNet dataset contains 1,281,167 training images, 50,000 validation images and 100,000 test images in 1,000 classes. The main objective of the third experiment is to evaluate the effectiveness of EA-Pipe in addressing the delayed gradient problem for large-scale environments. The simulation experiment was conducted on a server with an Intel i5-10600K CPU and an NVIDIA RTX 2080 Ti GPU.

A. Experimental Results on CIFAR-10 With VGG-16 (Small-Scale Experiment)

We conducted small-scale experiments using the CIFAR-10 dataset and the VGG-16 model with varying numbers of GPUs (one, two, four, and eight). For all four methods, we tried batch sizes of 128, 64, 32, and 16, and learning rates of 0.1 and 0.01. The number of training epochs was set to 100. Learning rates were reduced to one tenth after every 30 epochs. SpecTrain, however, did not converge with the previously chosen learning rate candidates. Thus, we conducted additional SpecTrain training using learning rates of 0.001, 0.0005, 0.0003, and 0.0001, paired with batch sizes of 128, 64, 32, and 16, respectively. For EA-Pipe, we set the communication period to 1 and measured the best performance by varying the elastic force (0.1, 0.3, and 0.5).
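The learning-rate schedule used throughout the experiments (a tenfold decay every 30 epochs) can be expressed as a simple step function; this is a generic sketch, not code from the authors' implementation.

```python
def stepped_lr(base_lr: float, epoch: int, decay_every: int = 30, factor: float = 0.1) -> float:
    """Learning rate reduced to one tenth after every 30 epochs (0-based epoch index)."""
    return base_lr * (factor ** (epoch // decay_every))

# e.g. base_lr = 0.01: epochs 0-29 -> 0.01, 30-59 -> 0.001, 60-89 -> 0.0001, ...
assert stepped_lr(0.01, 29) == 0.01 and abs(stepped_lr(0.01, 30) - 0.001) < 1e-12
```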

Table 1 summarizes the results of the small-scale experiment. It was found that a batch size of 16 and a learning rate of 0.01 were the optimal hyperparameters for SGD, PipeDream, and EA-Pipe, while a batch size of 32 and a learning rate of 0.0003 were the optimal hyperparameters for SpecTrain. EA-Pipe showed the best performance at an elastic force of 0.3. SGD showed the lowest error rate of 7.36%. SpecTrain succeeded in training only when the learning rate was set to a small value (0.0005), while reaching higher error rates compared to PipeDream and EA-Pipe. PipeDream and the proposed method, EA-Pipe, did not exhibit significant differences. Therefore, additional validation of the proposed method was necessary through larger scale experiments, which will be discussed shortly.

TABLE 1 Image Classification Error Rates (%) and 95% Confidence Interval of the Four Training Methods Using the CIFAR-10 Dataset and the VGG-16 Model With Varying Number of GPUs (Small-Scale Experiment)

Table 2 shows the computational overhead for the four training methods in the small-scale experiment. The computational overhead was calculated by dividing the total training time (in seconds) by the total number of epochs. SpecTrain showed a lower computational overhead compared to PipeDream and EA-Pipe since it used a batch size of 32, while PipeDream and EA-Pipe used a batch size of 16. These were the optimal values for each training method. Though SpecTrain was the fastest, it showed the worst image classification accuracy. Both PipeDream and EA-Pipe showed a reduction in computational overhead as the number of GPUs increased. EA-Pipe exhibited a slightly higher computational overhead compared to PipeDream.

TABLE 2 Computational Overhead of the Four Training Methods in the Small-Scale Experiment, Measured by the Average Training Time (In Seconds) Per Epoch

Table 3 shows the statistical efficiency of the four training methods, measured by the total training time to reach the lowest error rates. As can be seen in Tables 2 and 3, EA-Pipe appears less suitable for small-scale parallelism.

TABLE 3 Statistical Efficiency of the Four Training Methods in the Small-Scale Experiment, Measured by the Total Training Time (In Hours) to Reach the Lowest Error Rates

B. Experimental Results on CIFAR-100 With ResNet-34 (Mid-Scale Experiment)

We conducted a mid-scale experiment using the CIFAR-100 dataset and the ResNet-34 model on 8 GPUs. For all four methods, we employed batch sizes of 64, 32, and 16, and learning rates of 0.1 and 0.01. The number of training epochs was set to 100. Learning rates were reduced to one tenth after every 30 epochs. SpecTrain, however, did not converge with the previously chosen learning rate candidates. Thus, we conducted additional SpecTrain training using learning rates of 0.001, 0.0005, and 0.0003, paired with batch sizes of 64, 32, and 16, respectively. For EA-Pipe, we set the communication period to 1 and measured the best performance by varying the elastic force (0.1, 0.3, and 0.5).

Table 4 presents the results of the mid-scale experiment. Both PipeDream and EA-Pipe reached their lowest error rates at a batch size of 16 and a learning rate of 0.01. Except for the case of a batch size of 16 and a learning rate of 0.1, EA-Pipe achieved lower error rates than PipeDream in all cases. SpecTrain failed to train at batch sizes of 64, 32, and 16 with learning rates of 0.1 and 0.01. Similar to the first experiment (Section IV-A), SpecTrain only succeeded in training when the learning rate was set to a smaller value (0.0003), while reaching higher error rates compared to PipeDream and EA-Pipe. This suggests that the weight prediction of SpecTrain might not be effective in addressing the weight inconsistency and delayed gradient problems.

TABLE 4 Image Classification Performance (Average Error Rates in % and 95% Confidence Interval) of the Four Training Methods Using the CIFAR-100 Dataset and the ResNet-34 model on 8 GPUs (Mid-Scale Experiment)

Fig. 3 depicts the error rate curves for the three methods (SGD, PipeDream, and EA-Pipe). We did not include SpecTrain, since it showed the worst error rates. The robustness of EA-Pipe can be observed in the error rate graphs. At high learning rates (e.g., between 0 and 30 epochs), PipeDream exhibits slow convergence speed and unstable model training trends. On the other hand, EA-Pipe shows stable model training trends similar to SGD.

FIGURE 3. Training curves for the three training methods using the CIFAR-100 dataset with the ResNet-34 model on 8 GPUs.

In Table 5, we compare the computational overhead of the four training methods in the mid-scale experiment. Similar to the small-scale experiment, EA-Pipe exhibited a slightly higher computational overhead compared to PipeDream. However, as can be seen in Table 6, in terms of the total training time taken to reach the lowest error rate, EA-Pipe is the most efficient among the four training methods. We suspect that the efficiency of EA-Pipe will become more evident as the size of parallelism grows, which is discussed in the next section.

TABLE 5 Computational Overhead of the Four Training Methods in the Mid-Scale Experiment, Measured by the Average Training Time (In Seconds) Per Epoch
TABLE 6 Statistical Efficiency of the Four Training Methods in the Mid-Scale Experiment, Measured by the Total Training Time (In Hours) to Reach the Lowest Error Rates

C. Experimental Results of Large-Scale Parallelism

In the previous section, we observed statistical efficiency and stability in training for EA-Pipe. However, it was not enough to confirm the adverse effect of the delayed gradient problem. Therefore, a large-scale experiment was conducted. Due to the lack of available computing accelerators, we conducted a large-scale parallelism experiment of EA-Pipe and PipeDream running in a 50-GPU simulated environment. To carry out the evaluation, we chose an image classification task on the ImageNet dataset using the ResNet-50 model.

Since an ImageNet experiment requires a significant amount of time for a training method to complete the whole training process, we selected a batch size of 64, which is the maximum size that the GPU memory can accommodate. To determine the optimal learning rate, we first evaluated 10% of the ImageNet dataset and found that 0.06 was the best learning rate among the candidates of 0.1, 0.06, 0.03, and 0.01. Then, we evaluated the entire ImageNet dataset using the learning rate of 0.06 as well as two adjacent values (0.08 and 0.04). As a result, we identified 0.04 as the optimal learning rate. For the sake of fairness, PipeDream and EA-Pipe utilize the same batch size and learning rate values. The number of training epochs was set to 100. Learning rates were reduced to one tenth after every 30 epochs. For EA-Pipe, we set the communication period to 1 and the elastic force to 0.1.

Table 7 shows the results of the three training methods using the ImageNet dataset and the ResNet-50 model on 50 GPUs. In the large-scale setting, EA-Pipe achieved a lower error rate than PipeDream. As we suspected earlier, the delayed gradients become more problematic in large-scale parallelism, validating the need for a method that solves the delayed gradient problem efficiently, such as EA-Pipe.

TABLE 7 Image Classification Error Rates (%) and 95% Confidence Intervals of the Three Training Methods Using the ImageNet Dataset and the ResNet-50 Model on 50 GPUs (Large-Scale Experiment)

Fig. 4 depicts the error rate curves of these methods. We can observe that EA-Pipe shows a more stable learning property compared to PipeDream at large learning rates, as observed in the previous section. This is because the effective weight change (i.e., the gradients multiplied by the learning rate) is relatively large compared to the small-learning-rate case, so the delayed gradients become more problematic in PipeDream.

FIGURE 4. Training curves of the three training methods using the ImageNet dataset and the ResNet-50 model on 50 GPUs.

SECTION V.

Conclusion and Future Work

In this study, a novel pipelined parallel SGD algorithm, EA-Pipe, has been proposed to mitigate the delayed gradient problem that occurs in pipeline parallelism. It utilizes multiple model replicas and synchronizes them based on an elastic averaging scheme. Some conventional approaches reduce the batch size and learning rate to alleviate the delayed gradient problem to some extent, which can reduce GPU hardware utilization and/or increase the training time. Our proposed method does not need to adjust the batch size and learning rate, thereby reducing the hyperparameter optimization time. The experimental results confirmed that the proposed method can achieve error rates comparable to SGD and showed the efficacy of parallel training in large-scale environments. In addition, we analyzed the convergence property of EA-Pipe and confirmed that its error bound is the same as that of the synchronous elastic averaging algorithm. However, the proposed method could have a disadvantage in terms of memory utilization because it creates multiple model replicas. This disadvantage could constrain the training of very large models, which we leave to future work.

Appendix A

Proof of Theorem 1

In EA-Pipe, the local models are synchronized with the master model in a round-robin manner, which makes the model parallelism of EA-Pipe equivalent to round-robin data parallelism. At each time step $k$, all local computing nodes update their model parameters, and at every synchronization period, they are synchronized with the master model in a round-robin manner. The round-robin elastic averaging algorithms for the local models and for the master model are given in Algorithm 2 and Algorithm 3, respectively.

Algorithm 2 Round-Robin Elastic Averaging Algorithm for Local Model Parameter $\mathbf{x}^{i}$

1: Initialize $\mathbf{x}^{i}$
2: $k = 0$
3: repeat
4:   if $k \bmod \tau = 0$ then
5:     Wait until $\bar{\mathbf{x}}^{k}$ is synchronized with $\mathbf{x}^{i-1,k}$.
6:     $\mathbf{x}^{i,k+1} \leftarrow \mathbf{x}^{i,k} - \eta g(\mathbf{x}^{i,k}, \mathbf{s}^{i,k}) - \alpha (\mathbf{x}^{i,k} - \bar{\mathbf{x}}^{k})$
7:   else
8:     $\mathbf{x}^{i,k+1} \leftarrow \mathbf{x}^{i,k} - \eta g(\mathbf{x}^{i,k}, \mathbf{s}^{i,k})$
9:   end if
10:  $k \leftarrow k + 1$
11: until $k$ equals $K$

Algorithm 3 Round-Robin Elastic Averaging Algorithm for Master Model Parameter $\bar{\mathbf{x}}$

1: Initialize $\bar{\mathbf{x}}$
2: $k = 0$
3: repeat
4:   if $k \bmod \tau = 0$ then  // round-robin elastic averaging
5:     for $i = 1$ to $N$ do
6:       Wait until $\mathbf{x}^{i,k}$ is ready.
7:       $\bar{\mathbf{x}}^{k} \leftarrow \bar{\mathbf{x}}^{k} + \alpha (\mathbf{x}^{i,k} - \bar{\mathbf{x}}^{k})$
8:     end for
9:     $\bar{\mathbf{x}}^{k+1} \leftarrow \bar{\mathbf{x}}^{k}$
10:  end if
11:  $k \leftarrow k + 1$
12: until $k$ equals $K$

$K$, $\eta$, $\alpha$, and $\tau$ denote the total number of time steps (iterations), the learning rate, the elastic force value, and the synchronization period, respectively. $g(\mathbf{x}^{i,k}, \mathbf{s}^{i,k})$ indicates the stochastic gradient computed on a randomly sampled mini-batch $\mathbf{s}^{i,k} \sim {\mathcal{D}}_{i}$, where $i$ and $k$ indicate the indices of the local model and the time step, respectively. Note that EA-Pipe in Algorithm 1 is a special case of these algorithms, which effectively implements round-robin elastic averaging through pipeline parallelism.
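A compact, sequential simulation of Algorithms 2 and 3 on a toy quadratic objective is sketched below (our own illustrative setup, not the paper's experimental code); it serializes the round-robin synchronization that EA-Pipe realizes through the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 10                         # number of local models, parameter dimension
eta, alpha, tau, K = 0.05, 0.3, 2, 400
target = rng.normal(size=d)          # minimizer of the toy quadratic objective

def grad(x):
    """Stochastic gradient of 0.5 * ||x - target||^2 with additive noise."""
    return (x - target) + 0.1 * rng.normal(size=d)

x = [rng.normal(size=d) for _ in range(N)]   # local models x^1..x^N
x_bar = np.zeros(d)                          # master model

for k in range(K):
    for i in range(N):                       # each node takes one local SGD step
        step = eta * grad(x[i])
        if k % tau == 0:                     # synchronization period reached:
            elastic = alpha * (x[i] - x_bar) # round-robin elastic averaging
            x[i] = x[i] - step - elastic     # Algorithm 2, line 6
            x_bar = x_bar + elastic          # Algorithm 3, line 7
        else:
            x[i] = x[i] - step               # Algorithm 2, line 8

print("distance of master model to optimum:", np.linalg.norm(x_bar - target))
```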

These algorithms optimize the following objective function:
\begin{equation*} F(\mathbf{x}) := \frac{1}{N}\sum^{N}_{i=1}\left(\mathbb{E}_{\mathbf{s}\sim{\mathcal{D}}_{i}}[\mathcal{L}(\mathbf{x}^{i}, \mathbf{s})] + \rho\|\mathbf{x}^{i} - \bar{\mathbf{x}}\|^{2}\right). \tag{4}\end{equation*}
The equivalence of Eq. (4) and the objective function of interest, Eq. (5) below, is studied in the literature and is known as the global variable consensus problem [26]:
\begin{equation*} F(\mathbf{x}) := \frac{1}{N}\sum^{N}_{i=1}\mathbb{E}_{\mathbf{s}\sim{\mathcal{D}}_{i}}[\mathcal{L}(\mathbf{x}, \mathbf{s})]. \tag{5}\end{equation*}
Therefore, we focus on the convergence analysis of Eq. (5).

In order to analyze the convergence property of the round-robin elastic averaging algorithms, we utilize Theorem 1 in [20], which is based on the following assumptions.

Assumption 1 ($L$-Smoothness):

We assume that each local objective function $F_{i}(\mathbf{x}) := \mathbb{E}_{\mathbf{s}\sim{\mathcal{D}}_{i}}[\mathcal{L}(\mathbf{x}, \mathbf{s})]$ is $L$-smooth, such that
\begin{equation*} \|{\nabla}F_{i}(\mathbf{x}) - {\nabla}F_{i}(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|, \tag{6}\end{equation*}
where $i\in\{1,2,\cdots,N\}$ and $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$.

Assumption 2 (Lower Bound):

We assume that $F(\mathbf{x})$ has a lower bound $F_{\mathrm{inf}}$ such that
\begin{equation*} F(\mathbf{x}) \geq F_{\mathrm{inf}}. \tag{7}\end{equation*}

Assumption 3 (Unbiased Gradients):

We assume that the stochastic gradients are unbiased estimators of the local objectives' gradients, such that
\begin{equation*} \mathbb{E}_{\mathbf{s}{\sim}{\mathcal{D}}_{i}}[g(\mathbf{x}, \mathbf{s})] = {\nabla}F(\mathbf{x}),\quad g(\mathbf{x}, \mathbf{s}) = \nabla\mathcal{L}(\mathbf{x},\mathbf{s}). \tag{8}\end{equation*}

Assumption 4 (Bounded Variance):

We assume that the variance of the stochastic gradients is bounded by some constants $\beta,\sigma \geq 0$ such that
\begin{equation*} \mathbb{E}_{\mathbf{s}{\sim}{\mathcal{D}}_{i}}[\|{\nabla}F(\mathbf{x}) - g(\mathbf{x}, \mathbf{s})\|^{2}] \leq \beta\|\nabla F(\mathbf{x})\|^{2} + \sigma^{2}. \tag{9}\end{equation*}

Assumption 5 (Mixing Matrix):

We assume that the mixing matrix $\mathbf{W}$ satisfies $\mathbf{W}\mathbf{1}=\mathbf{1}$ and $\mathbf{W}^{\top}=\mathbf{W}$. Besides, the magnitudes of all eigenvalues except the largest one are strictly less than 1, such that
\begin{equation*} \max\{|\lambda_{2}(\mathbf{W})|,\cdots\} < \lambda_{1}(\mathbf{W}) = 1. \tag{10}\end{equation*}

In order to utilize the proof techniques in [20] and [22], we build a matrix-form update rule. Let the matrices $\mathbf{X}_{k}$ and $\mathbf{G}_{k} \in \mathbb{R}^{d \times (N+1)}$ be the stacks of all model parameters and stochastic gradients, respectively:
\begin{align*} \mathbf{X}_{k} &= [\mathbf{x}^{1,k}, \cdots, \mathbf{x}^{N,k}, \bar{\mathbf{x}}^{k}], \tag{11}\\ \mathbf{G}_{k} &= [g(\mathbf{x}^{1,k},\mathbf{s}^{1,k}), \cdots, g(\mathbf{x}^{N,k},\mathbf{s}^{N,k}), \mathbf{0}]. \tag{12}\end{align*}
Then, we can write the update rule of EA-Pipe as
\begin{equation*} \mathbf{X}_{k+1} = (\mathbf{X}_{k} - \eta \cdot \mathbf{G}_{k}) \cdot \mathbf{S}_{k}, \tag{13}\end{equation*}
where $\mathbf{S}_{k} \in \mathbb{R}^{(N+1) \times (N+1)}$ is the synchronization matrix, which represents the mixing pattern between the local models and the master model. $\mathbf{S}_{k}$ is defined as
\begin{align*} \mathbf{S}_{k} = \begin{cases} \mathbf{W} & k \bmod \tau = 0 \\ \mathbf{I} & \text{otherwise} \end{cases}. \tag{14}\end{align*}
The identity matrix $\mathbf{I}$ means that synchronization does not occur between the local models and the master model. On the other hand, $\mathbf{W}$ is called the mixing matrix, which includes the synchronization operation between the local models and the master model based on the round-robin elastic averaging algorithm.
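The matrix form of the update rule can be simulated directly; the sketch below builds $\mathbf{W}$ as the product of per-node elastic maps $\mathbf{M}^{l}$ (as formalized in Lemma 1 below) and applies Eq. (13) on toy data. This is an illustrative check of the notation, with our own helper names, not code from the paper.

```python
import numpy as np

def elastic_map(l, N, alpha):
    """M^l: elastic averaging between local model l (0-based) and the master (last column)."""
    M = np.eye(N + 1)
    M[l, l] = M[N, N] = 1 - alpha
    M[l, N] = M[N, l] = alpha
    return M

def mixing_matrix(N, alpha):
    """W = M^1 @ ... @ M^N (round-robin order)."""
    W = np.eye(N + 1)
    for l in range(N):
        W = W @ elastic_map(l, N, alpha)
    return W

N, d, eta, tau, alpha = 3, 5, 0.1, 2, 0.3
W = mixing_matrix(N, alpha)
X = np.random.default_rng(1).normal(size=(d, N + 1))   # columns: [x^1, ..., x^N, x_bar]

for k in range(10):
    G = np.zeros((d, N + 1))
    G[:, :N] = X[:, :N] - 1.0        # toy gradients for the local columns; master column is 0
    S = W if k % tau == 0 else np.eye(N + 1)
    X = (X - eta * G) @ S            # Eq. (13)

print(np.round(X, 3))                # local columns and master column drift toward 1.0
```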

Remark 1:

Instead of Eq. (13), one can use an alternative rule: $\mathbf{X}_{k+1} = \mathbf{X}_{k} \cdot \mathbf{S}_{k} - \eta \cdot \mathbf{G}_{k}$. However, according to [20], the convergence analysis of Eq. (13) can be extended to this alternative rule. Therefore, we choose the update rule in Eq. (13) to prove Theorem 1.

The mixing matrix $\mathbf{W}$ of the EA-Pipe algorithm does not satisfy Assumption 5. Nonetheless, in order to apply the proof in [20] without Assumption 5, we first verify that the mixing matrix $\mathbf{W}$ in the round-robin elastic averaging algorithm satisfies the following conditions:

  1. $\mathbf{W}$ is a doubly-stochastic matrix.

  2. $\mathbf{W}$ is a primitive and irreducible matrix.

  3. $\mathbf{W}^{\top}\mathbf{W}$ is a positive matrix.

SECTION A.

Proof of Condition 1

Lemma 1:

Let $\mathbf{M}^{l}{\in}\mathbb{R}^{(N+1)\times(N+1)}$ denote the linear map between the $l$-th local model and the master model. For instance, when $N=2$, we have $\mathbf{M}^{1}$ and $\mathbf{M}^{2}$ given by
\begin{align*} \mathbf{M}^{1} = \begin{pmatrix} 1-\alpha & 0 & \alpha \\ 0 & 1 & 0 \\ \alpha & 0 & 1-\alpha \end{pmatrix}, \quad \mathbf{M}^{2} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1-\alpha & \alpha \\ 0 & \alpha & 1-\alpha \end{pmatrix},\end{align*}
where $0 < \alpha < 1$. Then, the mixing matrix $\mathbf{W} = \mathbf{M}^{1} \times \cdots \times \mathbf{M}^{N}$ is a doubly-stochastic matrix.

Proof:

Since $\mathbf{M}^{l}$ satisfies the following conditions, $\mathbf{M}^{l}$ is a doubly-stochastic matrix:
\begin{equation*} \sum_{i=1}^{N+1} \mathbf{M}^{l}_{ij} = \sum_{j=1}^{N+1} \mathbf{M}^{l}_{ij} = 1,\end{equation*}
where $l\in\{1,2,\cdots,N\}$, $\mathbf{M}^{l}_{ij} \geq 0$, and $i,j\in\{1,2,\cdots,N+1\}$. According to Lemma 2, the product of two doubly-stochastic matrices is also a doubly-stochastic matrix. Therefore, the mixing matrix $\mathbf{W} = \mathbf{M}^{1} \times \cdots \times \mathbf{M}^{N}$ is a doubly-stochastic matrix.

Lemma 2:

Let $\mathbf{A},\mathbf{B} \in \mathbb{R}^{n \times n}$ be doubly-stochastic matrices. Then $\mathbf{C}=\mathbf{AB}$ is also a doubly-stochastic matrix.

Proof:

Let $a_{ij}$, $b_{ij}$, and $c_{ij}$ denote the elements in the $i$-th row and $j$-th column of the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, respectively. We can see that the sum of the elements in each column of $\mathbf{C}$ is 1:
\begin{align*} \sum_{i=1}^{n} c_{ij} &= \sum_{i=1}^{n} \left({\sum_{k=1}^{n} a_{ik}b_{kj}}\right) \\ &= \sum_{k=1}^{n} \left({b_{kj} \sum_{i=1}^{n} a_{ik}}\right) \\ &= \sum_{k=1}^{n} b_{kj} \qquad \because \sum_{i=1}^{n} a_{ik} = 1 \\ &= 1.\end{align*}
Similarly, we can see that the sum of the elements in each row of $\mathbf{C}$ is 1. Therefore, $\mathbf{C}=\mathbf{AB}$ is also a doubly-stochastic matrix. Moreover, because $a_{ij}, b_{ij} \geq 0$ for all $1 \leq i,j \leq n$, we have $c_{ij} \geq 0$.
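Lemmas 1 and 2 can be checked numerically for a small configuration; the sketch below constructs the per-node maps $\mathbf{M}^{l}$ for $N = 3$ and verifies that their product has unit row and column sums (an illustrative check, with our own helper names).

```python
import numpy as np

def elastic_map(l, N, alpha):
    """M^l: mixes local model l (0-based) with the master model stored in the last slot."""
    M = np.eye(N + 1)
    M[l, l] = M[N, N] = 1 - alpha
    M[l, N] = M[N, l] = alpha
    return M

N, alpha = 3, 0.3
W = np.linalg.multi_dot([elastic_map(l, N, alpha) for l in range(N)])

# Doubly stochastic: every row and every column sums to 1, all entries non-negative.
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
assert (W >= 0).all()
```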

SECTION B.

Proof of Condition 2

According to Lemma 3, we can see that $\mathbf{W}$ is a primitive and irreducible matrix.

Lemma 3:

The mixing matrix $\mathbf{W} \in \mathbb{R}^{(N+1)\times(N+1)}$ in the round-robin elastic averaging algorithm can be written as
\begin{align*} \mathbf{W} = \left({\begin{array}{cccc} & & & w_{1(N+1)} \\ & \mathbf{A} & & \vdots \\ & & & w_{N(N+1)} \\ w_{(N+1)1} & \cdots & w_{(N+1)N} & w_{(N+1)(N+1)} \end{array}}\right), \tag{15}\end{align*}
where $\mathbf{A}$ denotes a non-negative matrix of size $N{\times}N$ and $w_{ij}$ is the entry of $\mathbf{W}$ in the $i$-th row and $j$-th column.

Since all the local models are synchronized with the master model, the elements in the last row and the last column of $\mathbf{W}$ are always positive. Therefore, no matter what $\mathbf{A}$ is, $\mathbf{W}^{2} = \mathbf{W}\mathbf{W}$ is always a positive matrix. Consequently, $\mathbf{W}$ is a primitive and irreducible matrix.

SECTION C.

Proof of Condition 3

According to Eq. (15), the entries in the last row and the last column of $\mathbf{W}$ are always positive, which means the entries in the last row and the last column of $\mathbf{W}^{\top}$ are also always positive. Therefore, $\mathbf{W}^{\top}\mathbf{W}$ is a positive matrix.

SECTION D.

Proof of Theorem 1

Now that we have confirmed that the mixing matrix $\mathbf{W}$ satisfies the required conditions, we can apply the convergence proof technique used in [20] and [22] without Assumption 5. Recall the update rule of the round-robin elastic averaging algorithm:
\begin{align*} \mathbf{X}_{k+1} &= (\mathbf{X}_{k} - \eta \cdot \mathbf{G}_{k}) \cdot \mathbf{S}_{k}, \\ \mathbf{S}_{k} &= \begin{cases} \mathbf{W} & k \bmod \tau = 0 \\ \mathbf{I} & \text{otherwise} \end{cases}. \end{align*}
Let $\mathbf{v}=\left[\frac{1}{N+1}, \cdots, \frac{1}{N+1}\right] \in \mathbb{R}^{N+1}$. Then, since $\mathbf{W}$ is a doubly-stochastic matrix, $\mathbf{W}\mathbf{v}=\mathbf{v}$ and hence $\mathbf{S}_{k}\mathbf{v}=\mathbf{v}$. By multiplying $\mathbf{v}$ on both sides of the update rule, we have
\begin{align*} \mathbf{X}_{k+1}\mathbf{v} &= \mathbf{X}_{k}\mathbf{v} - \eta \cdot \mathbf{G}_{k}\mathbf{v} \tag{16}\\ &= \mathbf{X}_{k}\mathbf{v} - \frac{\eta}{N+1}\sum_{i=1}^{N} g(\mathbf{x}^{i,k},\mathbf{s}^{i,k}). \tag{17}\end{align*}

To simplify the equation, we define an averaged variable $\mathbf{y}_{k} := \mathbf{X}_{k}\mathbf{v} = \frac{1}{N+1}\sum_{i=1}^{N} \mathbf{x}^{i,k} + \frac{1}{N+1}\bar{\mathbf{x}}^{k}$ and an effective learning rate $\eta_{\text{eff}} := \frac{N}{N+1}\eta$. Using these definitions, the update rule (17) becomes
\begin{equation*} \mathbf{y}_{k+1} = \mathbf{y}_{k} - \frac{\eta_{\text{eff}}}{N}\sum_{i=1}^{N} g(\mathbf{x}^{i,k}, \mathbf{s}^{i,k}). \tag{18}\end{equation*}
Following [27], [28], and [29], we analyze the convergence property of the round-robin elastic averaging algorithm with respect to $\mathbf{y}_{k}$.

By utilizing the intermediate result from the proof of Lemma 2 in [20] (specifically, Eq. (61) in [20]), we get the following inequality (when $\eta_{\text{eff}}L\left(1+\frac{\beta}{N}\right) \leq 1$):
\begin{align*} \frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla F(\mathbf{y}_{k})\|^{2}] &\leq \frac{2[F(\mathbf{y}_{0}) - F_{\text{inf}}]}{\eta_{\text{eff}}K} +\frac{\eta_{\text{eff}}L\sigma^{2}}{N} \\ &\quad + \frac{L^{2}}{KN}\sum_{k=0}^{K-1}\sum_{i=1}^{N}\mathbb{E}[\|\mathbf{y}_{k} - \mathbf{x}^{i,k}\|^{2}] \\ &\quad - \left[1 - \eta_{\text{eff}}L\left(\frac{\beta}{N} + 1\right)\right]\frac{1}{KN}\sum_{k=0}^{K-1}\sum_{i=1}^{N}\mathbb{E}[\|\nabla F(\mathbf{x}^{i,k})\|^{2}]. \tag{19}\end{align*}
We can derive an upper bound for the third term on the right-hand side of Eq. (19) as follows:
\begin{align*} \sum_{i=1}^{N} \|\mathbf{y}_{k} - \mathbf{x}^{i,k}\|^{2} &\leq \sum_{i=1}^{N} \|\mathbf{y}_{k} - \mathbf{x}^{i,k}\|^{2} + \|\mathbf{y}_{k} - \bar{\mathbf{x}}^{k}\|^{2} \tag{20}\\ &=\|\mathbf{X}_{k}(\mathbf{I} - \mathbf{v}\mathbf{1}^{\top})\|^{2}_{\text{F}}, \tag{21}\end{align*}
where $\|{\cdot}\|_{\text{F}}$ is the Frobenius norm.

According to the update rule (13), and repeatedly using the facts $\mathbf{W}\mathbf{v}=\mathbf{v}$, $\mathbf{1}^{\top}\mathbf{W}=\mathbf{1}^{\top}$, and $\mathbf{v}^{\top}\mathbf{1}=1$, we have
\begin{align*} \mathbf{X}_{k}(\mathbf{I}-\mathbf{v}\mathbf{1}^{\top}) &= (\mathbf{X}_{k-1}-\eta\mathbf{G}_{k-1})\mathbf{S}_{k-1}(\mathbf{I}-\mathbf{v}\mathbf{1}^{\top}) \tag{22}\\ &= -\eta\sum_{j=0}^{k-1}\mathbf{G}_{j}\left({\prod_{s=j}^{k-1}\mathbf{S}_{s} - \mathbf{v}\mathbf{1}^{\top}}\right). \tag{23}\end{align*}
Therefore,
\begin{equation*} \sum_{i=1}^{N} \|\mathbf{y}_{k} - \mathbf{x}^{i,k}\|^{2} \leq \eta^{2}\left\lVert \sum_{j=0}^{k-1}\mathbf{G}_{j}\left({\prod_{s=j}^{k-1}\mathbf{S}_{s} - \mathbf{v}\mathbf{1}^{\top}}\right)\right\rVert_{\text{F}}^{2}, \tag{24}\end{equation*}
where
\begin{equation*} \prod_{k}\mathbf{S}_{k} = \prod_{k}\mathbf{W}_{k}. \tag{25}\end{equation*}

In order to keep utilizing the proof sequence of [20], we need to ensure that $\|\mathbf{W}^{n} - \mathbf{v}\mathbf{1}^{\top}\|_{\mathrm{op}}$ is strictly less than 1, where $\|{\cdot}\|_{\mathrm{op}}$ is the operator norm. The following lemma guarantees that $\|\mathbf{W}^{n} - \mathbf{v}\mathbf{1}^{\top}\|_{\mathrm{op}} < 1$.

Lemma 4:

Let \mathbf {W} and \mathbf {J} \in \mathbb {R}^{n \times n} be an asymmetric doubly stochastic matrix with non-negative elements and \mathbf {1}\mathbf {1}^{\top} / \mathbf {1}^{\top} \mathbf {1} , respectively. Then, the operator norm of \mathbf {W} - \mathbf {J} is always strictly less than 1. \begin{equation*} \|\mathbf {W} - \mathbf {J}\|_{\mathrm {op}} = \zeta < 1 \tag{26}\end{equation*}

Proof:

The operator norm of \mathbf {A} is defined as \sqrt {\lambda _{\text {max}}(\mathbf {A}^{\top }\mathbf {A})} , where \lambda _{\text {max}}(\mathbf {A}^{\top }\mathbf {A}) is the maximum eigenvalue of \mathbf {A}^{\top }\mathbf {A} . That is, \|\mathbf {W}-\mathbf {J}\|_{\mathrm {op}} is the square root of the maximum eigenvalue of (\mathbf {W}-\mathbf {J})^{\top }(\mathbf {W}-\mathbf {J}) , which can be expanded as follows: \begin{align*} (\mathbf {W}-\mathbf {J})^{\top }(\mathbf {W}-\mathbf {J}) &= (\mathbf {W}^{\top }-\mathbf {J}^{\top })(\mathbf {W}-\mathbf {J}) \\ &= \mathbf {W}^{\top }\mathbf {W} - \mathbf {J}^{\top }\mathbf {W} - \mathbf {W}^{\top }\mathbf {J} + \mathbf {J}^{\top }\mathbf {J} \\ &= \mathbf {W}^{\top }\mathbf {W} - \mathbf {J} -\mathbf {J} + \mathbf {J} \\ &= \mathbf {W}^{\top }\mathbf {W} - \mathbf {J}. \tag{27}\end{align*} Here, we can make the following observations:

  1. Both \mathbf {W} and \mathbf {W}^{\top} are doubly stochastic matrices.

  2. \mathbf {W}^{\top }\mathbf {W} is a symmetric real matrix, which is diagonalizable [30]. Also, according to Lemma 3, it is a positive matrix.

  3. According to Lemma 2, \mathbf {W}^{\top }\mathbf {W} is a doubly stochastic matrix.

  4. Let \mathbf {W}^{\top }\mathbf {W} = \mathbf {Z} , then \mathbf {ZJ} = \mathbf {JZ} .

  5. Since all elements of \mathbf {W} are non-negative, all elements of \mathbf {Z} are also non-negative.

Based on the above observations, we can diagonalize \mathbf {Z} and \mathbf {J} simultaneously. We decompose \mathbf {Z} as follows: \begin{align*} \mathbf {Z} = \mathbf {Q}\Lambda \mathbf {Q}^{\top }, \text {where } \Lambda =\text {diag}\{\lambda _{1}(\mathbf {Z}), \lambda _{2}(\mathbf {Z}), \cdots, \lambda _{n}(\mathbf {Z})\}. \tag{28}\end{align*} Since \mathbf {Z} is a doubly stochastic matrix, the maximum eigenvalue of \mathbf {Z} , \lambda _{1}(\mathbf {Z}) , is 1. Similarly, the matrix \mathbf {J} can be decomposed as \mathbf {Q{\Lambda _{0}}Q^{\top }} where \mathbf {\Lambda _{0}}=\text {diag}\{1,0,\cdots,0\} . Then, we have: \begin{equation*} \mathbf {Z} - \mathbf {J} = \mathbf {Q}(\mathbf {\Lambda - \Lambda _{0}})\mathbf {Q}^{\top }. \tag{29}\end{equation*} Now, the maximum eigenvalue of \mathbf {Z-J} is \text {max}\{0, \lambda _{2}(\mathbf {Z}), \cdots, \lambda _{n}(\mathbf {Z})\} . Here, we consider two cases:
  • If all of \lambda _{2}(\mathbf {Z}),\cdots,\lambda _{n}(\mathbf {Z}) are non-positive, the maximum eigenvalue of \mathbf {Z-J} is 0.

  • If any of \lambda _{2}(\mathbf {Z}),\cdots,\lambda _{n}(\mathbf {Z}) is positive, the maximum eigenvalue of \mathbf {Z-J} is positive. Furthermore, according to the Perron-Frobenius theorem [31], [32], since \mathbf {Z} is a positive primitive stochastic matrix, its eigenvalue 1 is simple, and all other eigenvalues are strictly smaller than 1 in absolute value.

As a result, \|\mathbf {W-J}\|_{\mathrm {op}} = \zeta is non-negative and strictly less than 1, which completes the proof of Lemma 4.
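Both the expansion (27) and the conclusion of Lemma 4 can be spot-checked numerically. The sketch below uses a randomly generated, strictly positive doubly stochastic \mathbf {W} (an illustrative stand-in, not the actual round-robin mixing matrix of EA-Pipe):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5
    J = np.ones((n, n)) / n

    # Illustrative doubly stochastic W: mix J with random permutation matrices
    # so that W is strictly positive (hence W^T W is a positive matrix).
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(3)]
    weights = rng.dirichlet(np.ones(3))
    W = 0.2 * J + 0.8 * sum(wi * Pi for wi, Pi in zip(weights, perms))

    # Eq. (27): (W - J)^T (W - J) = W^T W - J.
    assert np.allclose((W - J).T @ (W - J), W.T @ W - J)

    # Lemma 4: the operator (spectral) norm of W - J is strictly less than 1.
    zeta = np.linalg.norm(W - J, 2)
    assert zeta < 1
    print(zeta)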

Lemma 5:

Let \mathbf {W} and \mathbf {J} \in \mathbb {R}^{n \times n} be an asymmetric doubly stochastic matrix with non-negative elements and \mathbf {1}\mathbf {1}^{\top} / \mathbf {1}^{\top} \mathbf {1} , respectively. Then, the operator norm of \mathbf {W}^{n} - \mathbf {J} is always less than or equal to {\zeta }^{n} , where \zeta is the operator norm of \mathbf {W} - \mathbf {J} . \begin{equation*} \|\mathbf {W}^{n}-\mathbf {J}\|_{\mathrm {op}} \leq {\zeta }^{n}, \quad \mathit {where}\quad \|\mathbf {W} - \mathbf {J}\|_{\mathrm {op}} = \zeta \tag{30}\end{equation*}

Proof:

We can prove it by induction on n .

Base case (n=1 ): We have \|\mathbf {W}^{1}-\mathbf {J}\|_{\mathrm {op}} = \zeta = \zeta ^{1} , so the claim holds.

Inductive hypothesis: Assume that the inequality holds for n=k .

Inductive step: Let n=k+1 . Since \mathbf {W}\mathbf {J} = \mathbf {J}\mathbf {W}^{k} = \mathbf {J}^{2} = \mathbf {J} , we have \mathbf {W}^{k+1} - \mathbf {J} = (\mathbf {W}-\mathbf {J})(\mathbf {W}^{k}-\mathbf {J}) , and therefore \begin{align*} \|\mathbf {W}^{k+1} - \mathbf {J}\|_{\mathrm {op}} &= \|(\mathbf {W}-\mathbf {J})(\mathbf {W}^{k}-\mathbf {J})\|_{\mathrm {op}} \\ &\leq {\|\mathbf {W}-\mathbf {J}\|_{\mathrm {op}}} {\|\mathbf {W}^{k} - \mathbf {J}\|_{\mathrm {op}}} \\ &\leq {\zeta }\cdot {\zeta ^{k}} \quad \because \text {inductive hypothesis} \\ &= {\zeta ^{k+1}}.\end{align*} This completes the proof of Lemma 5.
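Using the same illustrative \mathbf {W} and \mathbf {J} as in the sketch after Lemma 4 (reconstructed here with the same assumed seed so the block stands alone), the bound of Lemma 5 can be checked directly for a few matrix powers:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 5
    J = np.ones((n, n)) / n
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(3)]
    weights = rng.dirichlet(np.ones(3))
    W = 0.2 * J + 0.8 * sum(wi * Pi for wi, Pi in zip(weights, perms))
    zeta = np.linalg.norm(W - J, 2)

    # Lemma 5: ||W^n - J||_op <= zeta^n for every n >= 1.
    for power in range(1, 8):
        Wn = np.linalg.matrix_power(W, power)
        assert np.linalg.norm(Wn - J, 2) <= zeta ** power + 1e-12
    print("Lemma 5 bound holds for n = 1..7")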

By Lemma 5, we do not need Assumption 5 in [20], and we can apply the same procedure as in Appendix D.2 of [20] to get the following final result:\begin{align*} &\hspace {-1pc}\frac {1}{K}\sum _{k=0}^{K-1}\mathbb {E}[\|\nabla F(\mathbf {y}_{k})\|^{2}] \\ &\leq \frac {2[F(\mathbf {y}_{0}) - F_{\text {inf}}]}{\eta _{\text {eff}}K} + \frac {\eta _{\text {eff}}L\sigma ^{2}}{N} \\ &\quad + \eta _{\text {eff}}^{2}{L^{2}}\sigma ^{2}\left ({\frac {1 + \zeta ^{2}}{1-\zeta ^{2}}\tau - 1}\right)\left ({1+\frac {1}{N}}\right)^{2}. \tag{31}\end{align*}
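To illustrate how the three terms of (31) trade off, the snippet below evaluates the right-hand side for purely hypothetical constants (the values of L , \sigma ^{2} , \zeta , \tau , N , \eta , and the initial optimality gap are assumptions, not quantities measured in this work). The first term vanishes as K grows, while the remaining two terms, controlled by \eta _{\text {eff}} and \zeta , set the floor of the bound:

    # Hypothetical constants, for illustration only.
    L, sigma2, zeta, tau, N = 10.0, 1.0, 0.8, 4, 8
    eta = 0.01
    eta_eff = N / (N + 1) * eta
    gap0 = 5.0                               # assumed F(y_0) - F_inf

    def rhs_of_eq31(K):
        term1 = 2 * gap0 / (eta_eff * K)
        term2 = eta_eff * L * sigma2 / N
        term3 = (eta_eff ** 2) * (L ** 2) * sigma2 \
            * ((1 + zeta ** 2) / (1 - zeta ** 2) * tau - 1) * (1 + 1 / N) ** 2
        return term1 + term2 + term3

    for K in (10 ** 2, 10 ** 4, 10 ** 6):
        print(K, rhs_of_eq31(K))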

Appendix B

Best Case Results of the Small-Scale Experiment

Table 8 shows the best case results of the four training methods on the CIFAR-10 dataset using the VGG-16 model with varying numbers of GPUs. We observed trends similar to those in Table 1.

TABLE 8 Image Classification Error Rates (%) of the Four Training Methods Using the CIFAR-10 Dataset and the VGG-16 Model With Varying Numbers of GPUs

Appendix C

Best Case Results of the Mid-Scale Experiment

Table 9 shows the best case results of the four training methods using the CIFAR-100 dataset and the ResNet-34 model on 8 GPUs. We observed trends similar to those in Table 4.

TABLE 9 Image Classification Error Rates (%) of the Four Training Methods Using the CIFAR-100 Dataset and the ResNet-34 Model on 8 GPUs

Appendix D

Best Case Results of the Large-Scale Experiment

Table 10 shows the best case results of the three training methods using the ImageNet dataset and the ResNet-50 model on 50 GPUs. We observed trends similar to those in Table 7.

TABLE 10 Image Classification Error Rates (%) of the Three Training Methods Using the ImageNet Dataset and the ResNet-50 Model on 50 GPUs
